Class: Ronin::Web::Spider::Agent
- Inherits: Spidr::Agent
- Inheritance chain: Object → Spidr::Agent → Ronin::Web::Spider::Agent
- Defined in: lib/ronin/web/spider/agent.rb
Overview
Extends Spidr::Agent.
Constant Summary
- JAVASCRIPT_INLINE_REGEX_REGEX =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Regex to match and skip JavaScript inline regexes.
%r{ (?# match before the regex to avoid matching division operators ) (?:[\{\[\(;:,]\s*|=\s*|return\s*) / (?# inline regex contents ) (?: \[ (?:\\. | [^\]]) \] (?# [...] ) | \\. (?# backslash escaped characters ) | [^/] (?# everything else ) )+ /[dgimsuvy]* (?# also match any regex flags ) }mx
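To illustrate why the leading context group matters, the sketch below copies the pattern above into a standalone constant and tries it on an assignment and on a division expression. The constant name `INLINE_REGEX`, the helper `inline_regex_in`, and the sample JavaScript are made up for this example and are not part of the library.

```ruby
# Standalone copy of the pattern shown above (illustrative name).
INLINE_REGEX = %r{
  (?# match before the regex to avoid matching division operators )
  (?:[\{\[\(;:,]\s*|=\s*|return\s*)
  /
  (?# inline regex contents )
  (?:
    \[ (?:\\. | [^\]]) \] (?# [...] )
    |
    \\. (?# backslash escaped characters )
    |
    [^/] (?# everything else )
  )+
  /[dgimsuvy]* (?# also match any regex flags )
}mx

# Returns the first substring that looks like an inline regex, or nil.
def inline_regex_in(js)
  js[INLINE_REGEX]
end

p inline_regex_in("var re = /ab+c/gi;")  # => "= /ab+c/gi"
p inline_regex_in("var x = a / b / c;")  # => nil
```

Note that the match includes the preceding `= `, because the context group is part of the pattern rather than a lookbehind.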
- JAVASCRIPT_TEMPLATE_LITERAL_REGEX =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Note: This regex will not properly match nested template literals:
`foo ${`bar ${1+1}`}`
Regex to match and skip JavaScript template literals.
/`(?:\\`|[^`])+`/m
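The nested-literal limitation noted above can be seen with a small experiment. This is only a sketch: `TEMPLATE_LITERAL` and `template_literal_in` are illustrative names, and the sample JavaScript is invented.

```ruby
# Standalone copy of the pattern shown above (illustrative name).
TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

# Returns the first substring that looks like a template literal, or nil.
def template_literal_in(js)
  js[TEMPLATE_LITERAL]
end

# A flat template literal is matched in full.
p template_literal_in('var s = `hello ${name}`;')
# => "`hello ${name}`"

# A nested template literal is cut off at the first inner backtick.
p template_literal_in('var s = `foo ${`bar ${1+1}`}`;')
# => "`foo ${`"
```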
- JAVASCRIPT_RELATIVE_PATH_REGEX =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Note: This matches foo/bar, foo/bar.ext, ../foo, and foo.ext, but does not match /foo, foo, or foo. (with a trailing dot).
Regular expression that matches relative paths within JavaScript.
%r{ \A (?: [^/\\. ]+\.[a-z0-9]+ (?# filename.ext) | [^/\\ ]+(?:/[^/\\ ]+)+ (?# dir/filename or dir/filename.ext) ) \z }x
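The cases listed in the note can be checked directly. In the sketch below the pattern is copied into a standalone constant; the names `RELATIVE_PATH` and `relative_path?` are illustrative only.

```ruby
# Standalone copy of the pattern shown above (illustrative name).
RELATIVE_PATH = %r{
  \A
  (?:
    [^/\\. ]+\.[a-z0-9]+   (?# filename.ext)
    |
    [^/\\ ]+(?:/[^/\\ ]+)+ (?# dir/filename or dir/filename.ext)
  )
  \z
}x

def relative_path?(string)
  RELATIVE_PATH.match?(string)
end

%w[foo/bar foo/bar.ext ../foo foo.ext].each do |s|
  puts "#{s}: #{relative_path?(s)}"   # each prints true
end

%w[/foo foo foo.].each do |s|
  puts "#{s}: #{relative_path?(s)}"   # each prints false
end
```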
- JAVASCRIPT_ABSOLUTE_PATH_REGEX =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Regular expression that matches absolute paths within JavaScript.
%r{\A(?:/[^/\\ ]+)+\z}
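A few probes show what this pattern accepts and rejects. This is a sketch only; `ABSOLUTE_PATH` and `absolute_path?` are illustrative names and the sample paths are invented.

```ruby
# Standalone copy of the pattern shown above (illustrative name).
ABSOLUTE_PATH = %r{\A(?:/[^/\\ ]+)+\z}

def absolute_path?(string)
  ABSOLUTE_PATH.match?(string)
end

p absolute_path?('/api/v1/users.json')   # => true
p absolute_path?('api/v1')               # => false (no leading slash)
p absolute_path?('/api/')                # => false (empty final segment)
p absolute_path?('//cdn.example.com/x')  # => false (empty first segment)
p absolute_path?('/not a path')          # => false (contains spaces)
```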
- URL_REGEX =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
Regular expression for identifying URLs.
/\A#{Support::Text::Patterns::URL}\z/
Instance Attribute Summary
-
#collected_certs ⇒ Array<Ronin::Support::Crypto::Cert>
readonly
All certificates encountered while spidering.
-
#visited_hosts ⇒ Set<String>?
readonly
The visited host names.
Instance Method Summary
-
#every_cert {|cert| ... } ⇒ Object
Passes every unique TLS certificate to the given block and populates #collected_certs.
-
#every_comment {|comment| ... } ⇒ Object
Passes every HTML and JavaScript comment to the given block.
-
#every_favicon {|favicon| ... } ⇒ Object
Passes every favicon from every page to the given block.
-
#every_host {|host| ... } ⇒ Object
Passes every unique host name that the agent visits to the given block and populates #visited_hosts.
-
#every_html_comment {|comment| ... } ⇒ Object
Passes every non-empty HTML comment to the given block.
-
#every_javascript {|js| ... } ⇒ Object
(also: #every_js)
Passes every piece of JavaScript to the given block.
-
#every_javascript_absolute_path_string {|string| ... } ⇒ Object
(also: #every_js_absolute_path_string)
Passes every JavaScript absolute path string to the given block.
-
#every_javascript_comment {|comment| ... } ⇒ Object
(also: #every_js_comment)
Passes every JavaScript comment to the given block.
-
#every_javascript_path_string {|string| ... } ⇒ Object
(also: #every_js_path_string)
Passes every JavaScript path string to the given block.
-
#every_javascript_relative_path_string {|string| ... } ⇒ Object
(also: #every_js_relative_path_string)
Passes every JavaScript relative path string to the given block.
-
#every_javascript_string {|string| ... } ⇒ Object
(also: #every_js_string)
Passes every JavaScript string value to the given block.
-
#every_javascript_url_string {|string| ... } ⇒ Object
(also: #every_js_url_string)
Passes every JavaScript URL string to the given block.
-
#initialize(proxy: Support::Network::HTTP.proxy, user_agent: Support::Network::HTTP.user_agent, **kwargs) {|agent| ... } ⇒ Agent
constructor
Creates a new spider agent.
Constructor Details
#initialize(proxy: Support::Network::HTTP.proxy, user_agent: Support::Network::HTTP.user_agent, **kwargs) {|agent| ... } ⇒ Agent
Creates a new spider agent.
# File 'lib/ronin/web/spider/agent.rb', line 98
def initialize(proxy:      Support::Network::HTTP.proxy,
               user_agent: Support::Network::HTTP.user_agent,
               **kwargs,
               &block)
  proxy = case proxy
          when Addressable::URI
            Spidr::Proxy.new(
              host:     proxy.host,
              port:     proxy.port,
              user:     proxy.user,
              password: proxy.password
            )
          else
            proxy
          end

  user_agent = case user_agent
               when Symbol
                 Support::Network::HTTP::UserAgents[user_agent]
               else
                 user_agent
               end

  super(proxy: proxy, user_agent: user_agent, **kwargs,&block)
end
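The constructor normalizes its `user_agent:` argument with a `case`/`when Symbol` lookup before delegating to `super`. The sketch below isolates that idea. `EXAMPLE_USER_AGENTS` and `resolve_user_agent` are hypothetical stand-ins invented for this example; the real lookup table is `Support::Network::HTTP::UserAgents`, whose contents are not shown here.

```ruby
# Hypothetical alias table standing in for Support::Network::HTTP::UserAgents.
EXAMPLE_USER_AGENTS = {
  example_bot: 'ExampleBot/1.0 (+https://example.com/bot)'
}.freeze

# Symbols are treated as aliases and looked up; anything else (a String or
# nil) is passed through unchanged.
def resolve_user_agent(user_agent)
  case user_agent
  when Symbol then EXAMPLE_USER_AGENTS.fetch(user_agent)
  else             user_agent
  end
end

p resolve_user_agent(:example_bot)  # => "ExampleBot/1.0 (+https://example.com/bot)"
p resolve_user_agent('Custom/2.0')  # => "Custom/2.0"
p resolve_user_agent(nil)           # => nil
```

Using `fetch` here is a choice made for the sketch so that an unknown alias raises `KeyError` instead of silently becoming `nil`.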
Instance Attribute Details
#collected_certs ⇒ Array<Ronin::Support::Crypto::Cert> (readonly)
All certificates encountered while spidering.
# File 'lib/ronin/web/spider/agent.rb', line 163
def collected_certs
  @collected_certs
end
#visited_hosts ⇒ Set<String>? (readonly)
The visited host names.
# File 'lib/ronin/web/spider/agent.rb', line 129
def visited_hosts
  @visited_hosts
end
Instance Method Details
#every_cert {|cert| ... } ⇒ Object
Passes every unique TLS certificate to the given block and populates #collected_certs.
# File 'lib/ronin/web/spider/agent.rb', line 180
def every_cert
  @collected_certs ||= []

  serials = Set.new

  every_page do |page|
    if page.url.scheme == 'https'
      cert = sessions[page.url].peer_cert

      if serials.add?(cert.serial)
        cert = Support::Crypto::Cert(cert)
        @collected_certs << cert
        yield cert
      end
    end
  end
end
#every_comment {|comment| ... } ⇒ Object
Passes every HTML and JavaScript comment to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 665
def every_comment(&block)
  every_html_comment(&block)
  every_javascript_comment(&block)
end
#every_favicon {|favicon| ... } ⇒ Object
Passes every favicon from every page to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 217
def every_favicon
  every_page do |page|
    yield page if page.icon?
  end
end
#every_host {|host| ... } ⇒ Object
Passes every unique host name that the agent visits to the given block and populates #visited_hosts.
# File 'lib/ronin/web/spider/agent.rb', line 146
def every_host
  @visited_hosts ||= Set.new

  every_page do |page|
    host = page.url.host

    if @visited_hosts.add?(host)
      yield host
    end
  end
end
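The uniqueness check relies on `Set#add?`, which returns `nil` when the element is already present. A minimal standalone sketch of the same pattern follows, with a plain array standing in for the hosts of visited pages; `each_unique_host` and the host names are hypothetical.

```ruby
require 'set'

# Yields each host only the first time it appears and returns the set of
# hosts seen. `hosts` stands in for the hosts of the pages being visited.
def each_unique_host(hosts)
  seen = Set.new

  hosts.each do |host|
    # Set#add? returns nil if the host was already in the set.
    yield host if seen.add?(host)
  end

  seen
end

visited = each_unique_host(%w[a.example b.example a.example]) do |host|
  puts host   # prints a.example, then b.example
end

p visited.size  # => 2
```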
#every_html_comment {|comment| ... } ⇒ Object
Passes every non-empty HTML comment to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 247
def every_html_comment(&block)
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//comment()').each do |comment|
      comment_text = comment.inner_text.strip

      unless comment_text.empty?
        if block.arity == 2
          yield comment_text, page
        else
          yield comment_text
        end
      end
    end
  end
end
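The `block.arity == 2` check means a caller can opt in to receiving the page as a second block argument. The sketch below reproduces only that dispatch, using plain values in place of comments and pages; `each_with_optional_page` is a hypothetical helper, not a method of this class.

```ruby
# Yields each item, and also the page when the block declares two parameters.
def each_with_optional_page(items, page, &block)
  items.each do |item|
    if block.arity == 2
      yield item, page
    else
      yield item
    end
  end
end

# One-parameter block: receives only the item.
each_with_optional_page(['TODO: remove'], :page) do |comment|
  puts comment
end

# Two-parameter block: receives the item and the page.
each_with_optional_page(['TODO: remove'], :page) do |comment, page|
  puts "#{comment} (from #{page})"
end
```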
#every_javascript {|js| ... } ⇒ Object
Also known as: every_js
Passes every piece of JavaScript to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 289
def every_javascript(&block)
  # yield inner text of every `<script type="text/javascript">` tag
  # and every `.js` URL.
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//script[@type="text/javascript"]').each do |script|
      source = script.inner_text
      source.force_encoding(Encoding::UTF_8)

      unless source.empty?
        if block.arity == 2
          yield source, page
        else
          yield source
        end
      end
    end
  end

  every_javascript_page do |page|
    source = page.body
    source.force_encoding(Encoding::UTF_8)

    if block.arity == 2
      yield source, page
    else
      yield source
    end
  end
end
#every_javascript_absolute_path_string {|string| ... } ⇒ Object
Also known as: every_js_absolute_path_string
Passes every JavaScript absolute path string to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 504
def every_javascript_absolute_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_ABSOLUTE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end
#every_javascript_comment {|comment| ... } ⇒ Object
Also known as: every_js_comment
Passes every JavaScript comment to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 624
def every_javascript_comment(&block)
  every_javascript do |js,page|
    js.scan(Support::Text::Patterns::JAVASCRIPT_COMMENT) do |comment|
      if block.arity == 2
        yield comment, page
      else
        yield comment
      end
    end
  end
end
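The method above is essentially `String#scan` with a comment pattern. The sketch below shows the same approach with a deliberately simplified pattern. `SIMPLE_JS_COMMENT` is an assumption made for this example and is not the definition of `Support::Text::Patterns::JAVASCRIPT_COMMENT`; `javascript_comments` and the sample source are likewise invented.

```ruby
# Simplified stand-in pattern (an assumption, not the ronin-support one):
# `// ...` up to the end of the line, or `/* ... */` across lines.
SIMPLE_JS_COMMENT = %r{//[^\n]*|/\*.*?\*/}m

# Collects every comment found by scanning the JavaScript source.
def javascript_comments(js)
  comments = []
  js.scan(SIMPLE_JS_COMMENT) { |comment| comments << comment }
  comments
end

COMMENT_SAMPLE_JS = "// one\nvar x = 1; /* two\n three */\n"

p javascript_comments(COMMENT_SAMPLE_JS)
# => ["// one", "/* two\n three */"]
```

A pattern this simple also matches `//` inside string literals such as URLs, so it should be read as an illustration of the scanning approach only.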
#every_javascript_path_string {|string| ... } ⇒ Object
Also known as: every_js_path_string
Passes every JavaScript path string to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 545
def every_javascript_path_string(&block)
  every_javascript_relative_path_string(&block)
  every_javascript_absolute_path_string(&block)
end
#every_javascript_relative_path_string {|string| ... } ⇒ Object
Also known as: every_js_relative_path_string
Passes every JavaScript relative path string to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 455
def every_javascript_relative_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_RELATIVE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end
#every_javascript_string {|string| ... } ⇒ Object
Also known as: every_js_string
Passes every JavaScript string value to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 380
def every_javascript_string(&block)
  every_javascript do |js,page|
    scanner = StringScanner.new(js)

    until scanner.eos?
      # NOTE: this is a naive JavaScript string scanner and should
      # eventually be replaced with a real JavaScript lexer or parser.
      case scanner.peek(1)
      when '"', "'" # beginning of a quoted string
        js_string = scanner.scan(Support::Text::Patterns::STRING)
        string    = Support::Encoding::JS.unquote(js_string)

        if block.arity == 2
          yield string, page
        else
          yield string
        end
      else
        scanner.skip(JAVASCRIPT_INLINE_REGEX_REGEX) ||
          scanner.skip(JAVASCRIPT_TEMPLATE_LITERAL_REGEX) ||
          scanner.getch
      end
    end
  end
end
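The scanning loop above can be tried outside the spider with only the standard library. The sketch below is a rough approximation under stated assumptions: `SKETCH_QUOTED_STRING` is a simplified stand-in for `Support::Text::Patterns::STRING`, quotes are stripped by slicing instead of calling `Support::Encoding::JS.unquote`, and an unterminated quote is skipped, which is an addition made for this example. All names and the sample JavaScript are hypothetical.

```ruby
require 'strscan'

# Simplified stand-ins (assumptions, not the library's definitions).
SKETCH_QUOTED_STRING    = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m
SKETCH_INLINE_REGEX     = %r{(?:[\{\[\(;:,]\s*|=\s*|return\s*)/(?:\[(?:\\.|[^\]])\]|\\.|[^/])+/[dgimsuvy]*}m
SKETCH_TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

# Returns the contents of every quoted string literal in the JavaScript,
# skipping inline regexes and template literals along the way.
def javascript_strings(js)
  strings = []
  scanner = StringScanner.new(js)

  until scanner.eos?
    case scanner.peek(1)
    when '"', "'"
      if (literal = scanner.scan(SKETCH_QUOTED_STRING))
        strings << literal[1..-2]  # drop the quotes; escapes are left as-is
      else
        scanner.getch              # unterminated quote: skip one character
      end
    else
      scanner.skip(SKETCH_INLINE_REGEX) ||
        scanner.skip(SKETCH_TEMPLATE_LITERAL) ||
        scanner.getch
    end
  end

  strings
end

STRING_SAMPLE_JS = <<~'JS'
  var a = "foo";
  var re = /"x"/g;
  var t = `'nope'`;
  f('/api/v1');
JS

p javascript_strings(STRING_SAMPLE_JS)  # => ["foo", "/api/v1"]
```

The quoted text inside the regex literal and inside the template literal is not reported, because both are skipped as a unit before the scanner can see their quote characters.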
#every_javascript_url_string {|string| ... } ⇒ Object
Also known as: every_js_url_string
Passes every JavaScript URL string to the given block.
# File 'lib/ronin/web/spider/agent.rb', line 586
def every_javascript_url_string(&block)
  every_javascript_string do |string,page|
    if string =~ URL_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end