Class: Ronin::Web::Spider::Agent

Inherits:
Spidr::Agent
  • Object
Defined in:
lib/ronin/web/spider/agent.rb

Overview

Extends Spidr::Agent.

Constant Summary

JAVASCRIPT_INLINE_REGEX_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regex to match and skip JavaScript inline regexes.

Since:

  • 0.1.1

%r{
  (?# match before the regex to avoid matching division operators )
  (?:[\{\[\(;:,]\s*|=\s*|return\s*)
  /
    (?# inline regex contents )
    (?:
      \[ (?:\\. | [^\]]) \] (?# [...] ) |
      \\.                   (?# backslash escaped characters ) |
      [^/]                  (?# everything else )
    )+
  /[dgimsuvy]* (?# also match any regex flags )
}mx
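
As a rough illustration of what this pattern accepts, the sketch below copies the regex above into a local constant and tries it on a few JavaScript snippets. INLINE_REGEX and the sample strings are names and inputs chosen for this example only; they are not part of the library.

```ruby
# Local copy of the pattern documented above. INLINE_REGEX is an
# illustrative name, not part of the library's API.
INLINE_REGEX = %r{
  (?:[\{\[\(;:,]\s*|=\s*|return\s*)
  /
    (?:
      \[ (?:\\. | [^\]]) \] |
      \\.                   |
      [^/]
    )+
  /[dgimsuvy]*
}mx

# An assignment followed by a regex literal is matched, flags included.
assignment_match = 'var re = /ab+c/gi;'[INLINE_REGEX]

# An escaped slash inside the regex literal does not end the match early.
escaped_match = 'return /a\/b/;'[INLINE_REGEX]

# Plain division has no regex-introducing token directly before the
# slash, so nothing matches.
division_match = 'total = a / b / c;'[INLINE_REGEX]

puts assignment_match.inspect
puts escaped_match.inspect
puts division_match.inspect
```

Note that the match includes the leading token (such as `=` or `return`), which is why the scanner in #every_javascript_string can only skip a regex literal when it is positioned on that token.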
JAVASCRIPT_TEMPLATE_LITERAL_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Note:

This regex will not properly match nested template literals:

`foo ${`bar ${1+1}`}`

Regex to match and skip JavaScript template literals.

Since:

  • 0.1.1

/`(?:\\`|[^`])+`/m
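
The nested-literal limitation noted above can be seen with a short experiment. This is a minimal sketch that copies the regex into a local constant; TEMPLATE_LITERAL and the sample strings are illustrative names and inputs only.

```ruby
# Local copy of the pattern documented above (illustrative name).
TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

# A flat template literal is matched in full, backticks included.
simple_match = 'const s = `hello ${name}`;'[TEMPLATE_LITERAL]

# With a nested literal the match ends at the second backtick, so the
# inner literal is cut off instead of being consumed as one unit.
nested_match = '`foo ${`bar ${1+1}`}`'[TEMPLATE_LITERAL]

puts simple_match
puts nested_match
```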
JAVASCRIPT_RELATIVE_PATH_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Note:

This matches "foo/bar", "foo/bar.ext", "../foo", and "foo.ext", but not "/foo", "foo", or "foo.".

Regular expression that matches relative paths within JavaScript.

Since:

  • 0.2.0

%r{
  \A
    (?:
       [^/\\. ]+\.[a-z0-9]+ (?# filename.ext)
       |
       [^/\\ ]+(?:/[^/\\ ]+)+ (?# dir/filename or dir/filename.ext)
    )
  \z
}x
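
The examples listed in the note above can be checked directly. The sketch below copies the regex into a local constant (RELATIVE_PATH is a name chosen for this example) and partitions the sample strings into matches and non-matches.

```ruby
# Local copy of the pattern documented above (illustrative name).
RELATIVE_PATH = %r{
  \A
    (?:
       [^/\\. ]+\.[a-z0-9]+
       |
       [^/\\ ]+(?:/[^/\\ ]+)+
    )
  \z
}x

candidates = ['foo/bar', 'foo/bar.ext', '../foo', 'foo.ext', '/foo', 'foo', 'foo.']

# Split the candidates into strings the pattern accepts and rejects.
matched, rejected = candidates.partition { |s| RELATIVE_PATH.match?(s) }

puts "matched:  #{matched.inspect}"
puts "rejected: #{rejected.inspect}"
```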
JAVASCRIPT_ABSOLUTE_PATH_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regular expression that matches absolute paths within JavaScript.

Since:

  • 0.2.0

%r{\A(?:/[^/\\ ]+)+\z}
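
A quick check of which strings this pattern treats as absolute paths. This is only a sketch: ABSOLUTE_PATH is a local copy of the regex above, and the sample inputs are made up for illustration.

```ruby
# Local copy of the pattern documented above (illustrative name).
ABSOLUTE_PATH = %r{\A(?:/[^/\\ ]+)+\z}

samples = ['/foo', '/foo/bar.js', 'foo/bar', '/', '/foo/', '//foo']

# Map each sample to whether the pattern accepts it. Every path segment
# must be non-empty, so a bare "/", a trailing slash, and a doubled slash
# are all rejected.
results = samples.to_h { |s| [s, ABSOLUTE_PATH.match?(s)] }

results.each { |path, ok| puts "#{path.ljust(12)} #{ok}" }
```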
URL_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regular expression for identifying URLs.

Since:

  • 0.2.0

/\A#{Support::Text::Patterns::URL}\z/

Constructor Details

#initialize(proxy: Support::Network::HTTP.proxy, user_agent: Support::Network::HTTP.user_agent, **kwargs) {|agent| ... } ⇒ Agent

Creates a new Spider object.

Parameters:

  • proxy (Spidr::Proxy, Addressable::URI, URI::HTTP, Hash, String, nil) (defaults to: Support::Network::HTTP.proxy)

    The proxy to use while spidering.

  • user_agent (String, Symbol, nil) (defaults to: Support::Network::HTTP.user_agent)

    The User-Agent string to send. A Symbol is looked up in Support::Network::HTTP::UserAgents.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments for Spidr::Agent#initialize.

Options Hash (**kwargs):

  • :referer (String, nil)

    The referer URL to send.

  • :delay (Integer) — default: 0

    Duration in seconds to pause between spidering each link.

  • :schemes (Array) — default: ['http', 'https']

    The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.

  • :host (String, nil)

    The host-name to visit.

  • :hosts (Array<String, Regexp, Proc>)

    The patterns which match the host-names to visit.

  • :ignore_hosts (Array<String, Regexp, Proc>)

    The patterns which match the host-names to not visit.

  • :ports (Array<Integer, Regexp, Proc>)

    The patterns which match the ports to visit.

  • :ignore_ports (Array<Integer, Regexp, Proc>)

    The patterns which match the ports to not visit.

  • :links (Array<String, Regexp, Proc>)

    The patterns which match the links to visit.

  • :ignore_links (Array<String, Regexp, Proc>)

    The patterns which match the links to not visit.

  • :exts (Array<String, Regexp, Proc>)

    The patterns which match the URI path extensions to visit.

  • :ignore_exts (Array<String, Regexp, Proc>)

    The patterns which match the URI path extensions to not visit.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created web spider agent.

Yield Parameters:

  • agent (Agent)

    The newly created web spider agent.

# File 'lib/ronin/web/spider/agent.rb', line 98

def initialize(proxy:      Support::Network::HTTP.proxy,
               user_agent: Support::Network::HTTP.user_agent,
               **kwargs,
               &block)
  proxy = case proxy
          when Addressable::URI
            Spidr::Proxy.new(
              host:     proxy.host,
              port:     proxy.port,
              user:     proxy.user,
              password: proxy.password
            )
          else
            proxy
          end

  user_agent = case user_agent
               when Symbol
                 Support::Network::HTTP::UserAgents[user_agent]
               else
                 user_agent
               end

  super(proxy: proxy, user_agent: user_agent, **kwargs, &block)
end

Instance Attribute Details

#collected_certs ⇒ Array<Ronin::Support::Crypto::Cert> (readonly)

All certificates encountered while spidering.

Returns:

  • (Array<Ronin::Support::Crypto::Cert>)


# File 'lib/ronin/web/spider/agent.rb', line 163

def collected_certs
  @collected_certs
end

#visited_hosts ⇒ Set<String>? (readonly)

The visited host names.

Returns:

  • (Set<String>, nil)


# File 'lib/ronin/web/spider/agent.rb', line 129

def visited_hosts
  @visited_hosts
end

Instance Method Details

#every_cert {|cert| ... } ⇒ Object

Passes every unique TLS certificate to the given block and populates #collected_certs.

Examples:

spider.every_cert do |cert|
  puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
end

Yields:

  • (cert)

Yield Parameters:

  • cert (Ronin::Support::Crypto::Cert)


# File 'lib/ronin/web/spider/agent.rb', line 180

def every_cert
  @collected_certs ||= []

  serials = Set.new

  every_page do |page|
    if page.url.scheme == 'https'
      cert = sessions[page.url].peer_cert

      if serials.add?(cert.serial)
        cert = Support::Crypto::Cert(cert)

        @collected_certs << cert
        yield cert
      end
    end
  end
end

#every_comment {|comment| ... } ⇒ Object

Passes every HTML and JavaScript comment to the given block.

Examples:

spider.every_comment do |comment|
  puts comment
end

Yields:

  • (comment)

    The given block will be passed each HTML or JavaScript comment.

  • (comment, page)

    If the block accepts two arguments, the HTML or JavaScript comment and the page that the HTML/JavaScript comment was found on will be passed to the given block.

Yield Parameters:

  • comment (String)

    The contents of an HTML or JavaScript comment.

  • page (Spidr::Page)

    The page that the HTML or JavaScript comment was found in or on.

# File 'lib/ronin/web/spider/agent.rb', line 665

def every_comment(&block)
  every_html_comment(&block)
  every_javascript_comment(&block)
end

#every_favicon {|favicon| ... } ⇒ Object

Passes every favicon from every page to the given block.

Examples:

spider.every_favicon do |favicon|
  # ...
end

Yields:

  • (favicon)

    The given block will be passed every encountered .ico file.

Yield Parameters:

  • favicon (Spidr::Page)

    An encountered .ico file.

# File 'lib/ronin/web/spider/agent.rb', line 217

def every_favicon
  every_page do |page|
    yield page if page.icon?
  end
end

#every_host {|host| ... } ⇒ Object

Passes every unique host name that the agent visits to the given block and populates #visited_hosts.

Examples:

spider.every_host do |host|
  puts "Spidering #{host} ..."
end

Yields:

  • (host)

Yield Parameters:

  • host (String)


# File 'lib/ronin/web/spider/agent.rb', line 146

def every_host
  @visited_hosts ||= Set.new

  every_page do |page|
    host = page.url.host

    if @visited_hosts.add?(host)
      yield host
    end
  end
end
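
The de-duplication above relies on Set#add?, which returns nil when the element is already in the set. The sketch below reproduces that pattern outside the spider using only the standard library. The helper each_unique_host and the example URLs are hypothetical stand-ins for the pages an agent would visit.

```ruby
require 'set'
require 'uri'

# Hypothetical helper (not part of the library): yields each host name
# the first time it is seen and returns the set of hosts collected.
def each_unique_host(urls)
  visited = Set.new

  urls.each do |url|
    host = URI(url).host

    # Set#add? returns nil when the host was already present, so the
    # block only runs once per host.
    yield host if visited.add?(host)
  end

  visited
end

urls = %w[
  https://example.com/
  https://example.com/about
  http://www.example.com/
  https://example.com/contact
]

seen    = []
visited = each_unique_host(urls) { |host| seen << host }

puts seen.inspect
```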

#every_html_comment {|comment| ... } ⇒ Object

Passes every non-empty HTML comment to the given block.

Examples:

spider.every_html_comment do |comment|
  puts comment
end

Yields:

  • (comment)

    The given block will be passed every HTML comment.

  • (comment, page)

    If the block accepts two arguments, the HTML comment and the page that the comment was found on will be passed to the given block.

Yield Parameters:

  • comment (String)

    The HTML comment inner text, with leading and trailing whitespace stripped.

  • page (Spidr::Page)

    The page that the HTML comment exists on.



# File 'lib/ronin/web/spider/agent.rb', line 247

def every_html_comment(&block)
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//comment()').each do |comment|
      comment_text = comment.inner_text.strip

      unless comment_text.empty?
        if block.arity == 2
          yield comment_text, page
        else
          yield comment_text
        end
      end
    end
  end
end
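
Several methods on this page pass the page as a second argument only when the block declares two parameters. The sketch below isolates that block.arity check. The helper each_with_optional_page and its sample data are hypothetical and exist only to show the dispatch.

```ruby
# Hypothetical helper (not part of the library): yields each item, plus
# the page when the block declares exactly two parameters.
def each_with_optional_page(items, page, &block)
  items.each do |item|
    if block.arity == 2
      yield item, page
    else
      yield item
    end
  end
end

# A one-parameter block only receives the item.
one_arg = []
each_with_optional_page(%w[a b], :page1) { |item| one_arg << item }

# A two-parameter block receives the item and the page.
two_arg = []
each_with_optional_page(%w[a b], :page1) { |item, page| two_arg << [item, page] }

puts one_arg.inspect
puts two_arg.inspect
```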

#every_javascript {|js| ... } ⇒ Object Also known as: every_js

Passes every piece of JavaScript to the given block.

Examples:

spider.every_javascript do |js|
  puts js
end

Yields:

  • (js)

    The given block will be passed every piece of JavaScript source.

  • (js, page)

    If the block accepts two arguments, the JavaScript source and the page that the JavaScript source was found on will be passed to the given block.

Yield Parameters:

  • js (String)

    The JavaScript source code.

  • page (Spidr::Page)

    The page that the JavaScript source was found in or on.



# File 'lib/ronin/web/spider/agent.rb', line 289

def every_javascript(&block)
  # yield inner text of every `<script type="text/javascript">` tag
  # and every `.js` URL.
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//script[@type="text/javascript"]').each do |script|
      source = script.inner_text
      source.force_encoding(Encoding::UTF_8)

      unless source.empty?
        if block.arity == 2
          yield source, page
        else
          yield source
        end
      end
    end
  end

  every_javascript_page do |page|
    source = page.body
    source.force_encoding(Encoding::UTF_8)

    if block.arity == 2
      yield source, page
    else
      yield source
    end
  end
end

#every_javascript_absolute_path_string {|string| ... } ⇒ Object Also known as: every_js_absolute_path_string

Passes every JavaScript absolute path string to the given block.

Examples:

spider.every_javascript_absolute_path_string do |absolute_path|
  puts absolute_path
end

Yields:

  • (string)

    The given block will be passed each JavaScript absolute path string with the quote marks removed.

  • (string, page)

    If the block accepts two arguments, the JavaScript absolute path string and the page that the JavaScript absolute path string was found on will be passed to the given block.

Yield Parameters:

  • string (String)

    The parsed contents of a literal JavaScript absolute path string.

  • page (Spidr::Page)

    The page that the JavaScript absolute path string was found in or on.

Since:

  • 0.2.0



# File 'lib/ronin/web/spider/agent.rb', line 504

def every_javascript_absolute_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_ABSOLUTE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end

#every_javascript_comment {|comment| ... } ⇒ Object Also known as: every_js_comment

Passes every JavaScript comment to the given block.

Examples:

spider.every_javascript_comment do |comment|
  puts comment
end

Yields:

  • (comment)

    The given block will be passed each JavaScript comment.

  • (comment, page)

    If the block accepts two arguments, the JavaScript comment and the page that the JavaScript comment was found on will be passed to the given block.

Yield Parameters:

  • comment (String)

    The contents of a JavaScript comment.

  • page (Spidr::Page)

    The page that the JavaScript comment was found in or on.



# File 'lib/ronin/web/spider/agent.rb', line 624

def every_javascript_comment(&block)
  every_javascript do |js,page|
    js.scan(Support::Text::Patterns::JAVASCRIPT_COMMENT) do |comment|
      if block.arity == 2
        yield comment, page
      else
        yield comment
      end
    end
  end
end

#every_javascript_path_string {|string| ... } ⇒ Object Also known as: every_js_path_string

Passes every JavaScript path string to the given block.

Examples:

spider.every_javascript_path_string do |path|
  puts path
end

Yields:

  • (string)

    The given block will be passed each JavaScript path string with the quote marks removed.

  • (string, page)

    If the block accepts two arguments, the JavaScript path string and the page that the JavaScript path string was found on will be passed to the given block.

Yield Parameters:

  • string (String)

    The parsed contents of a literal JavaScript path string.

  • page (Spidr::Page)

    The page that the JavaScript path string was found in or on.

Since:

  • 0.2.0



# File 'lib/ronin/web/spider/agent.rb', line 545

def every_javascript_path_string(&block)
  every_javascript_relative_path_string(&block)
  every_javascript_absolute_path_string(&block)
end

#every_javascript_relative_path_string {|string| ... } ⇒ Object Also known as: every_js_relative_path_string

Passes every JavaScript relative path string to the given block.

Examples:

spider.every_javascript_relative_path_string do |relative_path|
  puts relative_path
end

Yields:

  • (string)

    The given block will be passed each JavaScript relative path string with the quote marks removed.

  • (string, page)

    If the block accepts two arguments, the JavaScript relative path string and the page that the JavaScript relative path string was found on will be passed to the given block.

Yield Parameters:

  • string (String)

    The parsed contents of a literal JavaScript relative path string.

  • page (Spidr::Page)

    The page that the JavaScript relative path string was found in or on.

Since:

  • 0.2.0



# File 'lib/ronin/web/spider/agent.rb', line 455

def every_javascript_relative_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_RELATIVE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end

#every_javascript_string {|string| ... } ⇒ Object Also known as: every_js_string

Passes every JavaScript string value to the given block.

Examples:

spider.every_javascript_string do |str|
  puts str
end

Yields:

  • (string)

    The given block will be passed each JavaScript string with the quote marks removed.

  • (string, page)

    If the block accepts two arguments, the JavaScript string and the page that the JavaScript string was found on will be passed to the given block.

Yield Parameters:

  • string (String)

    The parsed contents of a JavaScript string.

  • page (Spidr::Page)

    The page that the JavaScript string was found in or on.



# File 'lib/ronin/web/spider/agent.rb', line 380

def every_javascript_string(&block)
  every_javascript do |js,page|
    scanner = StringScanner.new(js)

    until scanner.eos?
      # NOTE: this is a naive JavaScript string scanner and should
      # eventually be replaced with a real JavaScript lexer or parser.
      case scanner.peek(1)
      when '"', "'" # beginning of a quoted string
        js_string = scanner.scan(Support::Text::Patterns::STRING)
        string    = Support::Encoding::JS.unquote(js_string)

        if block.arity == 2
          yield string, page
        else
          yield string
        end
      else
        scanner.skip(JAVASCRIPT_INLINE_REGEX_REGEX) ||
          scanner.skip(JAVASCRIPT_TEMPLATE_LITERAL_REGEX) ||
          scanner.getch
      end
    end
  end
end
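
The scanning loop above can be tried in isolation with Ruby's strscan library. The following is a simplified sketch under stated assumptions: QUOTED_STRING is a stand-in for Support::Text::Patterns::STRING, the quotes are removed by slicing rather than by Support::Encoding::JS.unquote (so escape sequences are not decoded), and a guard for unterminated quotes is added that the original does not have. All constant and method names here are illustrative.

```ruby
require 'strscan'

# Assumed stand-in for Support::Text::Patterns::STRING (not the real one).
QUOTED_STRING = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/m

# Compact local copies of the two skip patterns documented on this page.
INLINE_REGEX     = %r{(?:[\{\[\(;:,]\s*|=\s*|return\s*)/(?:\[(?:\\.|[^\]])\]|\\.|[^/])+/[dgimsuvy]*}m
TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

# Hypothetical helper: yields the contents of each quoted string in `js`.
def each_js_string(js)
  scanner = StringScanner.new(js)

  until scanner.eos?
    case scanner.peek(1)
    when '"', "'"
      if (quoted = scanner.scan(QUOTED_STRING))
        # Strips the surrounding quotes only; escapes are left as-is.
        yield quoted[1..-2]
      else
        # Unterminated quote: advance one character to avoid looping forever.
        scanner.getch
      end
    else
      # Skip regex literals and template literals so that quotes inside
      # them are not mistaken for strings; otherwise advance one character.
      scanner.skip(INLINE_REGEX) ||
        scanner.skip(TEMPLATE_LITERAL) ||
        scanner.getch
    end
  end
end

js = <<~'JS'
  var a = "one"; var re = /"not a string"/;
  var t = `tpl 'x'`; f('two');
JS

found = []
each_js_string(js) { |string| found << string }

puts found.inspect
```

The quoted text inside the regex literal and inside the template literal is skipped, so only the two real string literals are yielded.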

#every_javascript_url_string {|string| ... } ⇒ Object Also known as: every_js_url_string

Passes every JavaScript URL string to the given block.

Examples:

spider.every_javascript_url_string do |url|
  puts url
end

Yields:

  • (string)

    The given block will be passed each JavaScript URL string with the quote marks removed.

  • (string, page)

    If the block accepts two arguments, the JavaScript URL string and the page that the JavaScript URL string was found on will be passed to the given block.

Yield Parameters:

  • string (String)

    The parsed contents of a literal JavaScript URL string.

  • page (Spidr::Page)

    The page that the JavaScript URL string was found in or on.

Since:

  • 0.2.0



# File 'lib/ronin/web/spider/agent.rb', line 586

def every_javascript_url_string(&block)
  every_javascript_string do |string,page|
    if string =~ URL_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end