Class: Ronin::Web::CLI::Commands::Wordlist Private

Inherits:

Ronin::Web::CLI::Command

Object
Core::CLI::Command
Ronin::Web::CLI::Command
Ronin::Web::CLI::Commands::Wordlist

Includes: Core::CLI::Logging, SpiderOptions

Defined in: lib/ronin/web/cli/commands/wordlist.rb

Overview

This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.

Builds a wordlist by spidering a website.

Usage

ronin-web wordlist [options] {--host HOST | --domain DOMAIN | --site URL}

Options

    --open-timeout SECS          Sets the connection open timeout
    --read-timeout SECS          Sets the read timeout
    --ssl-timeout SECS           Sets the SSL connection timeout
    --continue-timeout SECS      Sets the continue timeout
    --keep-alive-timeout SECS    Sets the connection keep alive timeout
-P, --proxy PROXY                Sets the proxy to use
-H, --header NAME: VALUE         Sets a default header
    --host-header NAME=VALUE     Sets a default header
-u chrome-linux|chrome-macos|chrome-windows|chrome-iphone|chrome-ipad|chrome-android|firefox-linux|firefox-macos|firefox-windows|firefox-iphone|firefox-ipad|firefox-android|safari-macos|safari-iphone|safari-ipad|edge,
    --user-agent                 The User-Agent to use
-U, --user-agent-string STRING   The User-Agent string to use
-R, --referer URL                Sets the Referer URL
    --delay SECS                 Sets the delay in seconds between each request
-l, --limit COUNT                Only spiders up to COUNT pages
-d, --max-depth DEPTH            Only spiders up to max depth
    --enqueue URL                Adds the URL to the queue
    --visited URL                Marks the URL as previously visited
    --strip-fragments            Enables/disables stripping the fragment component of every URL
    --strip-query                Enables/disables stripping the query component of every URL
    --visit-host HOST            Visit URLs with the matching host name
    --visit-hosts-like /REGEX/   Visit URLs with hostnames that match the REGEX
    --ignore-host HOST           Ignore the host name
    --ignore-hosts-like /REGEX/  Ignore the host names matching the REGEX
    --visit-port PORT            Visit URLs with the matching port number
    --visit-ports-like /REGEX/   Visit URLs with port numbers that match the REGEX
    --ignore-port PORT           Ignore the port number
    --ignore-ports-like /REGEX/  Ignore the port numbers matching the REGEX
    --visit-link URL             Visit the URL
    --visit-links-like /REGEX/   Visit URLs that match the REGEX
    --ignore-link URL            Ignore the URL
    --ignore-links-like /REGEX/  Ignore URLs matching the REGEX
    --visit-ext FILE_EXT         Visit URLs with the matching file ext
    --visit-exts-like /REGEX/    Visit URLs with file exts that match the REGEX
    --ignore-ext FILE_EXT        Ignore the URLs with the file ext
    --ignore-exts-like /REGEX/   Ignore URLs with file exts matching the REGEX
-r, --robots                     Specifies whether to honor robots.txt
    --host HOST                  Spiders the specific HOST
    --domain DOMAIN              Spiders the whole domain
    --site URL                   Spiders the website, starting at the URL
-o, --output PATH                The wordlist to write to
-X, --content-xpath XPATH        The XPath for the content (Default: //body)
-C, --content-css-path XPATH     The CSS-path for the content
    --meta-tags                  Parse certain meta-tags (Default: enabled)
    --no-meta-tags               Ignore meta-tags
    --alt-tags                   Parse alt-tags on images (Default: enabled)
    --no-alt-tags                Ignore alt-tags on images
    --paths                      Also parse URL paths
    --query-param-names          Also parse URL query param names
    --query-param-values         Also parse URL query param values
    --only-paths                 Only build a wordlist based on the paths
    --only-query-param-names     Only build a wordlist based on the query param names
    --only-query-param-values    Only build a wordlist based on the query param values
-f, --format txt|gz|bzip2|xz     Specifies the format of the wordlist file
-A, --append                     Append new words to the wordlist file instead of overwriting the file
-L, --lang LANG                  The language of the text to parse
    --stop-word WORD             A stop-word to ignore
    --ignore-word WORD           Ignores the word
    --digits                     Accepts words containing digits (Default: enabled)
    --no-digits                  Ignores words containing digits
    --special-char CHAR          Allows a special character within a word (Default: _, -, ')
    --numbers                    Accepts numbers as words (Default: disabled)
    --no-numbers                 Ignores numbers
    --acronyms                   Treats acronyms as words (Default: enabled)
    --no-acronyms                Ignores acronyms
    --normalize-case             Converts all words to lowercase
    --no-normalize-case          Preserve the case of words and letters (Default: enabled)
    --normalize-apostrophes      Removes apostrophes from words
    --no-normalize-apostrophes   Preserve apostrophes in words (Default: enabled)
    --normalize-acronyms         Removes '.' characters from acronyms
    --no-normalize-acronyms      Preserve '.' characters in acronyms (Default: enabled)
-h, --help                       Print help information

Since:

1.0.0

Constant Summary

META_TAGS_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find description and keywords meta-tags.

Since:

1.0.0

'/head/meta[@name="description" or @name="keywords"]/@content'

TEXT_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find all text elements.

Since:

1.0.0

'//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'

COMMENT_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find all HTML comments.

Since:

1.0.0

'//comment()'

ALT_TAGS_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath which finds all alt attributes on img, area, and input elements, and all title attributes on a elements.

Since:

1.0.0

'//img/@alt|//area/@alt|//input/@alt|//a/@title'

WORDLIST_BUILDER_OPTIONS =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

List of command options that directly map to the keyword arguments of Wordlist::Builder.new.

Since:

1.0.0

[
  :format,
  :append,
  :lang,
  :digits,
  :numbers,
  :acronyms,
  :normalize_case,
  :normalize_apostrophes,
  :normalize_acronyms
]

Instance Attribute Summary

#content_xpath ⇒ String readonly private
The XPath or CSS-path for the page's content.
#ignore_words ⇒ Array<String> readonly private
List of words to ignore.
#special_chars ⇒ Array<String> readonly private
The list of special characters to allow in words.
#stop_words ⇒ Array<String> readonly private
List of stop-words to ignore.

Attributes included from SpiderOptions

#agent_kwargs

Instance Method Summary

#infer_wordlist_path ⇒ String private
Generates the wordlist output path based on the --host, --domain, or --site options.
#initialize(**kwargs) ⇒ Wordlist constructor private
Initializes the ronin-web wordlist command.
#parse_html(page) ⇒ Object private
Parses the spidered page's HTML and adds the words to the wordlist.
#parse_page(page) ⇒ Object private
Parses the spidered page's content and adds the words to the wordlist.
#parse_url_path(url) ⇒ Object private
Parses the URL's directory names of a spidered page and adds them to the wordlist.
#parse_url_query_param_names(url) ⇒ Object private
Parses the URL's query param names of a spidered page and adds them to the wordlist.
#parse_url_query_param_values(url) ⇒ Object private
Parses the URL's query param values of a spidered page and adds them to the wordlist.
#run ⇒ Object private
Runs the ronin-web wordlist command.
#wordlist_builder_kwargs ⇒ Object private
Creates a keyword arguments Hash of all command options that will be directly passed to Wordlist::Builder.new.
#wordlist_path ⇒ String private
The wordlist output path.

Methods included from SpiderOptions

#continue_timeout, #continue_timeout=, #default_headers, #delay, #delay=, #history, #host_headers, #ignore_exts, #ignore_hosts, #ignore_links, #ignore_ports, #ignore_schemes, included, #keep_alive_timeout, #keep_alive_timeout=, #limit, #limit=, #max_depth, #max_depth=, #new_agent, #open_timeout, #open_timeout=, #proxy, #proxy=, #queue, #read_timeout, #read_timeout=, #referer, #referer=, #robots, #robots=, #ssl_timeout, #ssl_timeout=, #strip_fragments, #strip_fragments=, #strip_query, #strip_query=, #user_agent, #user_agent=, #visit_exts, #visit_hosts, #visit_links, #visit_ports, #visit_schemes

Constructor Details

#initialize(**kwargs) ⇒ `Wordlist`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Initializes the ronin-web wordlist command.

Parameters:

kwargs (Hash{Symbol => Object}) —
Additional keyword arguments for the command.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 279

def initialize(**kwargs)
  super(**kwargs)

  @content_xpath = nil

  @parse_meta_tags = true
  @parse_comments  = true
  @parse_alt_tags  = true

  @stop_words    = []
  @ignore_words  = []
  @special_chars = []
end

Instance Attribute Details

#content_xpath ⇒ `String` (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The XPath or CSS-path for the page's content.

Returns:

(String)

Since:

1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 256

def content_xpath
  @content_xpath
end

#ignore_words ⇒ `Array<String>` (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

List of words to ignore.

Returns:

(Array<String>)

Since:

1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 266

def ignore_words
  @ignore_words
end

#special_chars ⇒ `Array<String>` (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The list of special characters to allow in words.

Returns:

(Array<String>)

Since:

1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 271

def special_chars
  @special_chars
end

#stop_words ⇒ `Array<String>` (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

List of stop-words to ignore.

Returns:

(Array<String>)

Since:

1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 261

def stop_words
  @stop_words
end

Instance Method Details

#infer_wordlist_path ⇒ `String`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Generates the wordlist output path based on the --host, --domain, or --site options.

Returns:

(String) —
The generated wordlist output path.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 355

def infer_wordlist_path
  if    options[:host]   then "#{options[:host]}.txt"
  elsif options[:domain] then "#{options[:domain]}.txt"
  elsif options[:site]
    uri = URI.parse(options[:site])

    unless uri.port == uri.default_port
      "#{uri.host}:#{uri.port}.txt"
    else
      "#{uri.host}.txt"
    end
  else
    print_error "must specify --host, --domain, or --site"
    exit(1)
  end
end
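
The branching above can be tried outside the command with a small standalone sketch. The `infer_path` helper and its keyword arguments are hypothetical stand-ins for the command's `options` Hash; only `URI.parse`, `#port`, and `#default_port` come from Ruby's standard library. Unlike the real method, this sketch returns nil instead of printing an error and exiting when no option is given.

```ruby
require 'uri'

# Hypothetical standalone version of the path inference logic.
# The keyword arguments stand in for the command's options Hash.
def infer_path(host: nil, domain: nil, site: nil)
  if    host   then "#{host}.txt"
  elsif domain then "#{domain}.txt"
  elsif site
    uri = URI.parse(site)

    # Only include the port when it differs from the scheme's default.
    if uri.port == uri.default_port
      "#{uri.host}.txt"
    else
      "#{uri.host}:#{uri.port}.txt"
    end
  end
end

puts infer_path(host: 'www.example.com')          # www.example.com.txt
puts infer_path(site: 'https://example.com/')     # example.com.txt
puts infer_path(site: 'http://example.com:8080/') # example.com:8080.txt
```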

#parse_html(page) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the spidered page's HTML and adds the words to the wordlist.

Parameters:

page (Spidr::Page) —
A spidered page.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 471

def parse_html(page)
  page.search(@xpath).each do |node|
    text = node.inner_text
    text.strip!

    @wordlist.parse(text) unless text.empty?
  end
end

#parse_page(page) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the spidered page's content and adds the words to the wordlist.

Parameters:

page (Spidr::Page) —
A spidered page.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 457

def parse_page(page)
  if page.html?
    log_info "Parsing HTML on #{page.url} ..."
    parse_html(page)
  end
end

#parse_url_path(url) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's directory names of a spidered page and adds them to the wordlist.

Parameters:

url (URI::HTTP) —
A spidered URL.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 411

def parse_url_path(url)
  log_info "Parsing #{url} ..."

  url.path.split('/').each do |dirname|
    @wordlist.add(dirname) unless dirname.empty?
  end
end
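
To see which strings this splitting yields, here is a minimal sketch using only the standard library. The `path_segments` helper is hypothetical and collects the segments into an Array instead of adding them to a wordlist.

```ruby
require 'uri'

# Hypothetical helper: returns the non-empty path segments of a URL,
# mirroring the split('/') plus empty-check used in #parse_url_path.
def path_segments(url)
  URI.parse(url).path.split('/').reject(&:empty?)
end

p path_segments('https://example.com/blog/2023/post.html')
# ["blog", "2023", "post.html"]
```

Because the path is split on every `/`, the final segment (here a file name) is included alongside the directory names.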

#parse_url_query_param_names(url) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's query param names of a spidered page and adds them to the wordlist.

Parameters:

url (URI::HTTP) —
A spidered URL.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 426

def parse_url_query_param_names(url)
  unless url.query_params.empty?
    log_info "Parsing query param for #{url} ..."
    @wordlist.append(url.query_params.keys)
  end
end

#parse_url_query_param_values(url) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's query param values of a spidered page and adds them to the wordlist.

Parameters:

url (URI::HTTP) —
A spidered URL.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 440

def parse_url_query_param_values(url)
  unless url.query_params.empty?
    log_info "Parsing query param values for #{url} ..."

    url.query_params.each_value do |value|
      @wordlist.add(value)
    end
  end
end
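
The two query param methods rely on `url.query_params`, which is not part of the standard library's `URI::HTTP` and is presumably supplied by an extension loaded by the command. As a rough illustration of which names and values would be collected, the following sketch uses the standard library's `URI.decode_www_form` instead; the `query_pairs` helper is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: decodes a URL's query string into [name, value]
# pairs using only the standard library (an assumption made for this
# sketch; the command itself calls url.query_params).
def query_pairs(url)
  query = URI.parse(url).query
  query ? URI.decode_www_form(query) : []
end

pairs  = query_pairs('https://example.com/search?q=ruby&page=2')
names  = pairs.map(&:first)  # what --query-param-names would collect
values = pairs.map(&:last)   # what --query-param-values would collect

p names   # ["q", "page"]
p values  # ["ruby", "2"]
```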

#run ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Runs the ronin-web wordlist command.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 309

def run
  @wordlist = ::Wordlist::Builder.new(wordlist_path,**wordlist_builder_kwargs)

  @xpath = "#{@content_xpath}#{TEXT_XPATH}"
  @xpath << "|#{META_TAGS_XPATH}"                 if @parse_meta_tags
  @xpath << "|#{@content_xpath}#{COMMENT_XPATH}"  if @parse_comments
  @xpath << "|#{@content_xpath}#{ALT_TAGS_XPATH}" if @parse_alt_tags

  begin
    new_agent do |agent|
      if options[:only_paths]
        agent.every_url(&method(:parse_url_path))
      elsif options[:only_query_param_names]
        agent.every_url(&method(:parse_url_query_param_names))
      elsif options[:only_query_param_values]
        agent.every_url(&method(:parse_url_query_param_values))
      else
        agent.every_url(&method(:parse_url_path)) if options[:paths]

        agent.every_url(&method(:parse_url_query_param_names)) if options[:query_param_names]
        agent.every_url(&method(:parse_url_query_param_values)) if options[:query_param_values]

        agent.every_ok_page(&method(:parse_page))
      end
    end
  ensure
    @wordlist.close
  end
end
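
The XPath assembled at the top of `#run` is plain string concatenation, which can be reproduced in isolation. In the sketch below the `build_xpath` helper and its keyword arguments are hypothetical; the four XPath strings are copied from the constants documented above.

```ruby
# Hypothetical helper mirroring the string concatenation in #run.
def build_xpath(content_xpath, meta_tags: true, comments: true, alt_tags: true)
  # Values copied from TEXT_XPATH, META_TAGS_XPATH, COMMENT_XPATH,
  # and ALT_TAGS_XPATH.
  text_xpath      = '//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'
  meta_tags_xpath = '/head/meta[@name="description" or @name="keywords"]/@content'
  comment_xpath   = '//comment()'
  alt_tags_xpath  = '//img/@alt|//area/@alt|//input/@alt|//a/@title'

  # A nil content_xpath interpolates to an empty string.
  xpath = +"#{content_xpath}#{text_xpath}"
  xpath << "|#{meta_tags_xpath}"                if meta_tags
  xpath << "|#{content_xpath}#{comment_xpath}"  if comments
  xpath << "|#{content_xpath}#{alt_tags_xpath}" if alt_tags
  xpath
end

puts build_xpath('//body')
puts build_xpath(nil, meta_tags: false, comments: false, alt_tags: false)
```

Because `ALT_TAGS_XPATH` is itself a union, the concatenation prefixes only its first branch (`//img/@alt`) with the content XPath; the remaining branches stay document-wide.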

#wordlist_builder_kwargs ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Creates a keyword arguments Hash of all command options that will be directly passed to Wordlist::Builder.new.

Since:

1.0.0

# File 'lib/ronin/web/cli/commands/wordlist.rb', line 390

def wordlist_builder_kwargs
  kwargs = {}

  WORDLIST_BUILDER_OPTIONS.each do |key|
    kwargs[key] = options[key] if options.has_key?(key)
  end

  kwargs[:stop_words]    = @stop_words    unless @stop_words.empty?
  kwargs[:ignore_words]  = @ignore_words  unless @ignore_words.empty?
  kwargs[:special_chars] = @special_chars unless @special_chars.empty?

  return kwargs
end
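
The filtering can be illustrated with a plain Hash. The `builder_kwargs` helper below is a hypothetical sketch: its `options` argument stands in for the command's parsed options, and the key list is copied from `WORDLIST_BUILDER_OPTIONS`.

```ruby
# Hypothetical helper; `options` is a plain Hash of parsed option values.
def builder_kwargs(options, stop_words: [], ignore_words: [], special_chars: [])
  # Option names copied from WORDLIST_BUILDER_OPTIONS.
  keys = %i[format append lang digits numbers acronyms
            normalize_case normalize_apostrophes normalize_acronyms]

  # Keep only the options that were actually set, including false values.
  kwargs = options.slice(*keys)

  # The list-valued settings are only passed along when non-empty.
  kwargs[:stop_words]    = stop_words    unless stop_words.empty?
  kwargs[:ignore_words]  = ignore_words  unless ignore_words.empty?
  kwargs[:special_chars] = special_chars unless special_chars.empty?

  kwargs
end

opts = { lang: :en, digits: false, output: 'words.txt' }
p builder_kwargs(opts, stop_words: %w[the a])
```

Options outside the list, such as `:output` here, are dropped, so only arguments the builder is expected to understand are forwarded.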

#wordlist_path ⇒ `String`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The wordlist output path.

Returns:

(String)

Since:

1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 344

def wordlist_path
  options.fetch(:output) { infer_wordlist_path }
end
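
`Hash#fetch` with a block only evaluates the block when the key is missing, so the path is inferred only when `--output` was not given. A minimal sketch, in which `fallback_path` is a hypothetical stand-in for `#infer_wordlist_path`:

```ruby
# Hypothetical stand-in for #infer_wordlist_path.
def fallback_path(options)
  "#{options[:host]}.txt"
end

def output_path(options)
  # The block runs only when :output is absent from the Hash.
  options.fetch(:output) { fallback_path(options) }
end

puts output_path({ output: 'custom.txt', host: 'example.com' }) # custom.txt
puts output_path({ host: 'example.com' })                       # example.com.txt
```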
