Class: Ronin::Web::Spider::Agent

Inherits:

Spidr::Agent

Object
Spidr::Agent
Ronin::Web::Spider::Agent

Defined in: lib/ronin/web/spider/agent.rb

Overview

Constant Summary

JAVASCRIPT_INLINE_REGEX_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regex to match and skip JavaScript inline regexes.

Since:

0.1.1

%r{
  (?# match before the regex to avoid matching division operators )
  (?:[\{\[\(;:,]\s*|=\s*|return\s*)
  /
    (?# inline regex contents )
    (?:
      \[ (?:\\. | [^\]]) \] (?# [...] ) |
      \\.                   (?# backslash escaped characters ) |
      [^/]                  (?# everything else )
    )+
  /[dgimsuvy]* (?# also match any regex flags )
}mx
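
As a rough illustration of why the leading prefix group matters, the sketch below copies the pattern above into a local constant and checks it against a regex literal and against plain division. The constant name `INLINE_REGEX` and the sample strings are illustrative assumptions, not part of the library.

```ruby
# Local copy of the pattern documented above; INLINE_REGEX is an illustrative
# name used only in this sketch.
INLINE_REGEX = %r{
  (?# match before the regex to avoid matching division operators )
  (?:[\{\[\(;:,]\s*|=\s*|return\s*)
  /
    (?# inline regex contents )
    (?:
      \[ (?:\\. | [^\]]) \] (?# [...] ) |
      \\.                   (?# backslash escaped characters ) |
      [^/]                  (?# everything else )
    )+
  /[dgimsuvy]* (?# also match any regex flags )
}mx

regex_literal = 'x = /ab+c/gi;'
division      = 'x = a / b / c;'

# The match includes the leading "= " because the prefix group is part of
# the pattern.
p regex_literal[INLINE_REGEX]                     # => "= /ab+c/gi"

# A slash that follows an operand is not preceded by one of the prefix
# tokens, so division is left alone.
p INLINE_REGEX.match?(division)                   # => false

# "return" also counts as a prefix token.
p INLINE_REGEX.match?('return /^\d+$/.test(s)')   # => true
```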

JAVASCRIPT_TEMPLATE_LITERAL_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Note:
This regex will not properly match nested template literals:

`foo ${`bar ${1+1}`}`

Regex to match and skip JavaScript template literals.

Since:

0.1.1

/`(?:\\`|[^`])+`/m
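
The nesting limitation in the note can be seen by running the pattern against a simple and a nested template literal. This is only a sketch: `TEMPLATE_LITERAL` is a local copy of the pattern above, and the sample strings are made up for illustration.

```ruby
# Local copy of the pattern documented above; the constant name is illustrative.
TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

simple = 'var s = `hello ${name}`;'
nested = '`foo ${`bar ${1+1}`}`'

# A flat template literal is matched from backtick to backtick.
p simple[TEMPLATE_LITERAL] # => "`hello ${name}`"

# With nesting, the match stops at the first inner backtick, so only a
# fragment of the outer literal is consumed.
p nested[TEMPLATE_LITERAL] # => "`foo ${`"
```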

JAVASCRIPT_RELATIVE_PATH_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Note:
This matches `foo/bar`, `foo/bar.ext`, `../foo`, and `foo.ext`, but not `/foo`, `foo`, or `foo.` (with a trailing dot).

Regular expression that matches relative paths within JavaScript.

Since:

0.2.0

%r{
  \A
    (?:
       [^/\\. ]+\.[a-z0-9]+ (?# filename.ext)
       |
       [^/\\ ]+(?:/[^/\\ ]+)+ (?# dir/filename or dir/filename.ext)
    )
  \z
}x
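
The cases listed in the note can be checked directly. The sketch below uses a local copy of the pattern, named `RELATIVE_PATH` here purely for illustration, and prints whether each sample string matches.

```ruby
# Local copy of the pattern documented above; the constant name is illustrative.
RELATIVE_PATH = %r{
  \A
    (?:
       [^/\\. ]+\.[a-z0-9]+ (?# filename.ext)
       |
       [^/\\ ]+(?:/[^/\\ ]+)+ (?# dir/filename or dir/filename.ext)
    )
  \z
}x

matching     = %w[foo/bar foo/bar.ext ../foo foo.ext]
non_matching = %w[/foo foo foo.]

# Print each sample next to the result of the match.
(matching + non_matching).each do |string|
  puts format('%-12s %s', string, RELATIVE_PATH.match?(string))
end
```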

JAVASCRIPT_ABSOLUTE_PATH_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regular expression that matches absolute paths within JavaScript.

Since:

0.2.0

%r{\A(?:/[^/\\ ]+)+\z}
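
A quick way to see what this pattern accepts is to try it on a few strings. In this sketch `ABSOLUTE_PATH` is a local copy of the pattern above and the sample paths are illustrative.

```ruby
# Local copy of the pattern documented above; the constant name is illustrative.
ABSOLUTE_PATH = %r{\A(?:/[^/\\ ]+)+\z}

# Every segment must start with "/" and contain at least one character that
# is not a slash, backslash, or space.
p ABSOLUTE_PATH.match?('/foo')         # => true
p ABSOLUTE_PATH.match?('/foo/bar.js')  # => true

# Relative paths, a bare "/", trailing slashes, and empty segments are rejected.
p ABSOLUTE_PATH.match?('foo/bar')      # => false
p ABSOLUTE_PATH.match?('/')            # => false
p ABSOLUTE_PATH.match?('/foo/')        # => false
p ABSOLUTE_PATH.match?('//foo')        # => false
```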

URL_REGEX =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

Regular expression for identifying URLs.

Since:

0.2.0

/\A#{Support::Text::Patterns::URL}\z/

Instance Attribute Summary

#collected_certs ⇒ Array<Ronin::Support::Crypto::Cert> readonly
All certificates encountered while spidering.
#visited_hosts ⇒ Set<String>? readonly
The visited host names.

Instance Method Summary

#every_cert {|cert| ... } ⇒ Object
Passes every unique TLS certificate to the given block and populates #collected_certs.
#every_comment {|comment| ... } ⇒ Object
Passes every HTML and JavaScript comment to the given block.
#every_favicon {|favicon| ... } ⇒ Object
Passes every favicon from every page to the given block.
#every_host {|host| ... } ⇒ Object
Passes every unique host name that the agent visits to the given block and populates #visited_hosts.
#every_html_comment {|comment| ... } ⇒ Object
Passes every non-empty HTML comment to the given block.
#every_javascript {|js| ... } ⇒ Object (also: #every_js)
Passes every piece of JavaScript to the given block.
#every_javascript_absolute_path_string {|string| ... } ⇒ Object (also: #every_js_absolute_path_string)
Passes every JavaScript absolute path string to the given block.
#every_javascript_comment {|comment| ... } ⇒ Object (also: #every_js_comment)
Passes every JavaScript comment to the given block.
#every_javascript_path_string {|string| ... } ⇒ Object (also: #every_js_path_string)
Passes every JavaScript path string to the given block.
#every_javascript_relative_path_string {|string| ... } ⇒ Object (also: #every_js_relative_path_string)
Passes every JavaScript relative path string to the given block.
#every_javascript_string {|string| ... } ⇒ Object (also: #every_js_string)
Passes every JavaScript string value to the given block.
#every_javascript_url_string {|string| ... } ⇒ Object (also: #every_js_url_string)
Passes every JavaScript URL string to the given block.
#initialize(proxy: Support::Network::HTTP.proxy, user_agent: Support::Network::HTTP.user_agent, **kwargs) {|agent| ... } ⇒ Agent constructor
Creates a new web spider agent.

Constructor Details

#initialize(proxy: Support::Network::HTTP.proxy, user_agent: Support::Network::HTTP.user_agent, **kwargs) {|agent| ... } ⇒ `Agent`

Creates a new web spider agent.

Parameters:

proxy (Spidr::Proxy, Addressable::URI, URI::HTTP, Hash, String, nil) (defaults to: Support::Network::HTTP.proxy) —
The proxy to use while spidering.
user_agent (String, nil) (defaults to: Support::Network::HTTP.user_agent) —
The User-Agent string to send.
kwargs (Hash{Symbol => Object}) —
Additional keyword arguments for Spidr::Agent#initialize.

Options Hash (**kwargs):

:referer (String, nil) —
The referer URL to send.
:delay (Integer) — default: 0 —
Duration in seconds to pause between spidering each link.
:schemes (Array) — default: ['http', 'https'] —
The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.
:host (String, nil) —
The host-name to visit.
:hosts (Array<String, Regexp, Proc>) —
The patterns which match the host-names to visit.
:ignore_hosts (Array<String, Regexp, Proc>) —
The patterns which match the host-names to not visit.
:ports (Array<Integer, Regexp, Proc>) —
The patterns which match the ports to visit.
:ignore_ports (Array<Integer, Regexp, Proc>) —
The patterns which match the ports to not visit.
:links (Array<String, Regexp, Proc>) —
The patterns which match the links to visit.
:ignore_links (Array<String, Regexp, Proc>) —
The patterns which match the links to not visit.
:exts (Array<String, Regexp, Proc>) —
The patterns which match the URI path extensions to visit.
:ignore_exts (Array<String, Regexp, Proc>) —
The patterns which match the URI path extensions to not visit.
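
Several of the options above accept a mix of String, Regexp, and Proc patterns. How Spidr evaluates them is not shown on this page, so the following is only a mental model under stated assumptions: `accepted?` is a hypothetical helper, not part of Spidr or Ronin, and it assumes Strings are compared for equality, Regexps are matched, and Procs are called with the value.

```ruby
# Hypothetical helper, not part of Spidr or Ronin. Returns true if the value
# is accepted by at least one pattern. Assumed semantics: Procs are called
# with the value, Regexps are matched against it, and anything else is
# compared with ==.
def accepted?(patterns, value)
  patterns.any? do |pattern|
    case pattern
    when Proc   then pattern.call(value)
    when Regexp then pattern.match?(value.to_s)
    else             pattern == value
    end
  end
end

# One pattern of each kind, as the :hosts option allows.
hosts = [
  'example.com',
  /\.example\.com\z/,
  ->(host) { host.end_with?('.test') }
]

p accepted?(hosts, 'example.com')     # => true  (String equality)
p accepted?(hosts, 'www.example.com') # => true  (Regexp match)
p accepted?(hosts, 'dev.test')        # => true  (Proc call)
p accepted?(hosts, 'example.org')     # => false (no pattern accepts it)
```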

Yields:

(agent) —
If a block is given, it will be passed the newly created web spider agent.

Yield Parameters:

agent (Agent) —
The newly created web spider agent.

Instance Attribute Details

#collected_certs ⇒ `Array<Ronin::Support::Crypto::Cert>` (readonly)

All certificates encountered while spidering.

Returns:

(Array<Ronin::Support::Crypto::Cert>)



# File 'lib/ronin/web/spider/agent.rb', line 163

def collected_certs
  @collected_certs
end

#visited_hosts ⇒ `Set<String>`? (readonly)

The visited host names.

Returns:

(Set<String>, nil)



# File 'lib/ronin/web/spider/agent.rb', line 129

def visited_hosts
  @visited_hosts
end

Instance Method Details

#every_cert {|cert| ... } ⇒ `Object`

Passes every unique TLS certificate to the given block and populates #collected_certs.

Examples:

spider.every_cert do |cert|
  puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
end

Yields:

(cert)

Yield Parameters:

cert (Ronin::Support::Crypto::Cert)

# File 'lib/ronin/web/spider/agent.rb', line 180

def every_cert
  @collected_certs ||= []

  serials = Set.new

  every_page do |page|
    if page.url.scheme == 'https'
      cert = sessions[page.url].peer_cert

      if serials.add?(cert.serial)
        cert = Support::Crypto::Cert(cert)

        @collected_certs << cert
        yield cert
      end
    end
  end
end

#every_comment {|comment| ... } ⇒ `Object`

Passes every HTML and JavaScript comment to the given block.

Examples:

spider.every_comment do |comment|
  puts comment
end

Yields:

(comment) —
The given block will be passed each HTML or JavaScript comment.
(comment, page) —
If the block accepts two arguments, the HTML or JavaScript comment and the page that the HTML/JavaScript comment was found on will be passed to the given block.

Yield Parameters:

comment (String) —
The contents of an HTML or JavaScript comment.
page (Spidr::Page) —
The page that the HTML or JavaScript comment was found in or on.

#every_favicon {|favicon| ... } ⇒ `Object`

Passes every favicon from every page to the given block.

Examples:

spider.every_favicon do |favicon|
  # ...
end

Yields:

(favicon) —
The given block will be passed every encountered .ico file.

Yield Parameters:

favicon (Spidr::Page) —
An encountered .ico file.

#every_host {|host| ... } ⇒ `Object`

Passes every unique host name that the agent visits to the given block and populates #visited_hosts.

Examples:

spider.every_host do |host|
  puts "Spidering #{host} ..."
end

Yields:

(host)

Yield Parameters:

host (String)

# File 'lib/ronin/web/spider/agent.rb', line 146

def every_host
  @visited_hosts ||= Set.new

  every_page do |page|
    host = page.url.host

    if @visited_hosts.add?(host)
      yield host
    end
  end
end
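
The de-duplication above relies on `Set#add?`, which returns `nil` when the element is already present. The standalone sketch below shows the same idea without a spider; the URL list and the `first_seen` array are illustrative stand-ins for the pages the agent would visit and the values it would yield.

```ruby
require 'set'
require 'uri'

# Illustrative stand-in for the URLs of the pages an agent would visit.
urls = %w[
  https://example.com/
  https://example.com/about
  https://www.example.com/
  https://example.com/contact
]

visited_hosts = Set.new
first_seen    = []

urls.each do |url|
  host = URI(url).host

  # Set#add? returns the set when the host is new and nil when it was already
  # recorded, so each host is reported only once.
  first_seen << host if visited_hosts.add?(host)
end

p first_seen # => ["example.com", "www.example.com"]
```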

#every_html_comment {|comment| ... } ⇒ `Object`

Passes every non-empty HTML comment to the given block.

Examples:

spider.every_html_comment do |comment|
  puts comment
end

Yields:

(comment) —
The given block will be passed every HTML comment.
(comment, page) —
If the block accepts two arguments, the HTML comment and the page that the comment was found on will be passed to the given block.

Yield Parameters:

comment (String) —
The HTML comment inner text, with leading and trailing whitespace stripped.
page (Spidr::Page) —
The page that the HTML comment exists on.

# File 'lib/ronin/web/spider/agent.rb', line 247

def every_html_comment(&block)
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//comment()').each do |comment|
      comment_text = comment.inner_text.strip

      unless comment_text.empty?
        if block.arity == 2
          yield comment_text, page
        else
          yield comment_text
        end
      end
    end
  end
end
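
The `block.arity == 2` check is what lets the same method serve both one-argument and two-argument blocks. A minimal standalone sketch of that dispatch follows; `each_with_optional_source` is a hypothetical helper written for this example, not a library method.

```ruby
# Hypothetical helper illustrating the arity check; not part of the library.
# Yields (item, source) to two-argument blocks and (item) to all others.
def each_with_optional_source(items, source, &block)
  items.each do |item|
    if block.arity == 2
      yield item, source
    else
      yield item
    end
  end
end

one_arg = []
two_arg = []

# A one-argument block only receives the item.
each_with_optional_source(%w[a b], :page) { |item| one_arg << item }

# A two-argument block also receives the source.
each_with_optional_source(%w[a b], :page) { |item, src| two_arg << [item, src] }

p one_arg # => ["a", "b"]
p two_arg # => [["a", :page], ["b", :page]]
```

Because the check calls `arity` on the captured block, a method written this way expects a block to be given; called without one, `block` is `nil` and the check raises `NoMethodError`.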

#every_javascript {|js| ... } ⇒ `Object` Also known as: every_js

Passes every piece of JavaScript to the given block.

Examples:

spider.every_javascript do |js|
  puts js
end

Yields:

(js) —
The given block will be passed every piece of JavaScript source.
(js, page) —
If the block accepts two arguments, the JavaScript source and the page that the JavaScript source was found on will be passed to the given block.

Yield Parameters:

js (String) —
The JavaScript source code.
page (Spidr::Page) —
The page that the JavaScript source was found in or on.

# File 'lib/ronin/web/spider/agent.rb', line 289

def every_javascript(&block)
  # yield inner text of every `<script type="text/javascript">` tag
  # and every `.js` URL.
  every_html_page do |page|
    next unless page.doc

    page.doc.xpath('//script[@type="text/javascript"]').each do |script|
      source = script.inner_text
      source.force_encoding(Encoding::UTF_8)

      unless source.empty?
        if block.arity == 2
          yield source, page
        else
          yield source
        end
      end
    end
  end

  every_javascript_page do |page|
    source = page.body
    source.force_encoding(Encoding::UTF_8)

    if block.arity == 2
      yield source, page
    else
      yield source
    end
  end
end

#every_javascript_absolute_path_string {|string| ... } ⇒ `Object` Also known as: every_js_absolute_path_string

Passes every JavaScript absolute path string to the given block.

Examples:

spider.every_javascript_absolute_path_string do |absolute_path|
  puts absolute_path
end

Yields:

(string) —
The given block will be passed each JavaScript absolute path string with the quote marks removed.
(string, page) —
If the block accepts two arguments, the JavaScript absolute path string and the page that the JavaScript absolute path string was found on will be passed to the given block.

Yield Parameters:

string (String) —
The parsed contents of a literal JavaScript absolute path string.
page (Spidr::Page) —
The page that the JavaScript absolute path string was found in or on.

Since:

0.2.0

# File 'lib/ronin/web/spider/agent.rb', line 504

def every_javascript_absolute_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_ABSOLUTE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end

#every_javascript_comment {|comment| ... } ⇒ `Object` Also known as: every_js_comment

Passes every JavaScript comment to the given block.

Examples:

spider.every_javascript_comment do |comment|
  puts comment
end

Yields:

(comment) —
The given block will be passed each JavaScript comment.
(comment, page) —
If the block accepts two arguments, the JavaScript comment and the page that the JavaScript comment was found on will be passed to the given block.

Yield Parameters:

comment (String) —
The contents of a JavaScript comment.
page (Spidr::Page) —
The page that the JavaScript comment was found in or on.

# File 'lib/ronin/web/spider/agent.rb', line 624

def every_javascript_comment(&block)
  every_javascript do |js,page|
    js.scan(Support::Text::Patterns::JAVASCRIPT_COMMENT) do |comment|
      if block.arity == 2
        yield comment, page
      else
        yield comment
      end
    end
  end
end
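
The method above relies on `String#scan` with a comment pattern from ronin-support, which is not reproduced on this page. To show the scanning step in isolation, the sketch below substitutes a deliberately simplified pattern; `SIMPLE_JS_COMMENT` is an assumption made for this example and may behave differently from the real `JAVASCRIPT_COMMENT` (for instance, it does not account for `//` inside string literals).

```ruby
# Simplified stand-in for Support::Text::Patterns::JAVASCRIPT_COMMENT.
# The real pattern lives in ronin-support and may differ.
SIMPLE_JS_COMMENT = %r{//[^\n]*|/\*.*?\*/}m

js = <<~JS
  // single-line comment
  var x = 1; /* multi-line
  comment */
  var y = 2;
JS

# With no capture groups, String#scan returns each full match as a String.
comments = js.scan(SIMPLE_JS_COMMENT)

comments.each { |comment| puts comment }
```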

#every_javascript_path_string {|string| ... } ⇒ `Object` Also known as: every_js_path_string

Passes every JavaScript path string to the given block.

Examples:

spider.every_javascript_path_string do |path|
  puts path
end

Yields:

(string) —
The given block will be passed each JavaScript path string with the quote marks removed.
(string, page) —
If the block accepts two arguments, the JavaScript path string and the page that the JavaScript path string was found on will be passed to the given block.

Yield Parameters:

string (String) —
The parsed contents of a literal JavaScript path string.
page (Spidr::Page) —
The page that the JavaScript path string was found in or on.

Since:

0.2.0

# File 'lib/ronin/web/spider/agent.rb', line 545

def every_javascript_path_string(&block)
  every_javascript_relative_path_string(&block)
  every_javascript_absolute_path_string(&block)
end

#every_javascript_relative_path_string {|string| ... } ⇒ `Object` Also known as: every_js_relative_path_string

Passes every JavaScript relative path string to the given block.

Examples:

spider.every_javascript_relative_path_string do |relative_path|
  puts relative_path
end

Yields:

(string) —
The given block will be passed each JavaScript relative path string with the quote marks removed.
(string, page) —
If the block accepts two arguments, the JavaScript relative path string and the page that the JavaScript relative path string was found on will be passed to the given block.

Yield Parameters:

string (String) —
The parsed contents of a literal JavaScript relative path string.
page (Spidr::Page) —
The page that the JavaScript relative path string was found in or on.

Since:

0.2.0

# File 'lib/ronin/web/spider/agent.rb', line 455

def every_javascript_relative_path_string(&block)
  every_javascript_string do |string,page|
    if string =~ JAVASCRIPT_RELATIVE_PATH_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end

#every_javascript_string {|string| ... } ⇒ `Object` Also known as: every_js_string

Passes every JavaScript string value to the given block.

Examples:

spider.every_javascript_string do |str|
  puts str
end

Yields:

(string) —
The given block will be passed each JavaScript string with the quote marks removed.
(string, page) —
If the block accepts two arguments, the JavaScript string and the page that the JavaScript string was found on will be passed to the given block.

Yield Parameters:

string (String) —
The parsed contents of a JavaScript string.
page (Spidr::Page) —
The page that the JavaScript string was found in or on.

# File 'lib/ronin/web/spider/agent.rb', line 380

def every_javascript_string(&block)
  every_javascript do |js,page|
    scanner = StringScanner.new(js)

    until scanner.eos?
      # NOTE: this is a naive JavaScript string scanner and should
      # eventually be replaced with a real JavaScript lexer or parser.
      case scanner.peek(1)
      when '"', "'" # beginning of a quoted string
        js_string = scanner.scan(Support::Text::Patterns::STRING)
        string    = Support::Encoding::JS.unquote(js_string)

        if block.arity == 2
          yield string, page
        else
          yield string
        end
      else
        scanner.skip(JAVASCRIPT_INLINE_REGEX_REGEX) ||
          scanner.skip(JAVASCRIPT_TEMPLATE_LITERAL_REGEX) ||
          scanner.getch
      end
    end
  end
end
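
The scanning loop above can be tried without ronin-support by swapping in simplified pieces. In the sketch below, `QUOTED_STRING` is an assumed stand-in for `Support::Text::Patterns::STRING`, quotes are stripped with a plain slice instead of `Support::Encoding::JS.unquote` (so escape sequences are not decoded), and `each_js_string` is a hypothetical helper name. Inline regex skipping is left out to keep the example short.

```ruby
require 'strscan'

# Simplified stand-in for Support::Text::Patterns::STRING (an assumption).
QUOTED_STRING = /"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/

# Copy of the template literal pattern documented on this page.
SKIP_TEMPLATE_LITERAL = /`(?:\\`|[^`])+`/m

# Hypothetical helper: yields the contents of each quoted string literal.
def each_js_string(js)
  return enum_for(__method__, js) unless block_given?

  scanner = StringScanner.new(js)

  until scanner.eos?
    case scanner.peek(1)
    when '"', "'" # beginning of a quoted string
      if (literal = scanner.scan(QUOTED_STRING))
        # Drop the surrounding quotes; escapes are left undecoded here.
        yield literal[1..-2]
      else
        scanner.getch # unterminated quote: step over it
      end
    else
      # Skip whole template literals, otherwise advance one character.
      scanner.skip(SKIP_TEMPLATE_LITERAL) || scanner.getch
    end
  end
end

js = %q{var a = "foo"; var b = 'bar/baz'; var t = `skip "me"`;}

p each_js_string(js).to_a # => ["foo", "bar/baz"]
```

The quoted text inside the template literal is not reported because the whole literal is skipped before the scanner ever looks at its inner quote characters.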

#every_javascript_url_string {|string| ... } ⇒ `Object` Also known as: every_js_url_string

Passes every JavaScript URL string to the given block.

Examples:

spider.every_javascript_url_string do |url|
  puts url
end

Yields:

(string) —
The given block will be passed each JavaScript URL string with the quote marks removed.
(string, page) —
If the block accepts two arguments, the JavaScript URL string and the page that the JavaScript URL string was found on will be passed to the given block.

Yield Parameters:

string (String) —
The parsed contents of a literal JavaScript URL string.
page (Spidr::Page) —
The page that the JavaScript URL string was found in or on.

Since:

0.2.0

# File 'lib/ronin/web/spider/agent.rb', line 586

def every_javascript_url_string(&block)
  every_javascript_string do |string,page|
    if string =~ URL_REGEX
      if block.arity == 2
        yield string, page
      else
        yield string
      end
    end
  end
end
