Class: Ronin::Web::CLI::Commands::Wordlist Private

Inherits:
Ronin::Web::CLI::Command
Includes:
Core::CLI::Logging, SpiderOptions
Defined in:
lib/ronin/web/cli/commands/wordlist.rb

Overview

This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.

Builds a wordlist by spidering a website.

Usage

ronin-web wordlist [options] {--host HOST | --domain DOMAIN | --site URL}

Options

    --open-timeout SECS          Sets the connection open timeout
    --read-timeout SECS          Sets the read timeout
    --ssl-timeout SECS           Sets the SSL connection timeout
    --continue-timeout SECS      Sets the continue timeout
    --keep-alive-timeout SECS    Sets the connection keep alive timeout
-P, --proxy PROXY                Sets the proxy to use
-H, --header NAME: VALUE         Sets a default header
    --host-header NAME=VALUE     Sets a default header
-u chrome-linux|chrome-macos|chrome-windows|chrome-iphone|chrome-ipad|chrome-android|firefox-linux|firefox-macos|firefox-windows|firefox-iphone|firefox-ipad|firefox-android|safari-macos|safari-iphone|safari-ipad|edge,
    --user-agent                 The User-Agent to use
-U, --user-agent-string STRING   The User-Agent string to use
-R, --referer URL                Sets the Referer URL
    --delay SECS                 Sets the delay in seconds between each request
-l, --limit COUNT                Only spiders up to COUNT pages
-d, --max-depth DEPTH            Only spiders up to max depth
    --enqueue URL                Adds the URL to the queue
    --visited URL                Marks the URL as previously visited
    --strip-fragments            Enables/disables stripping the fragment component of every URL
    --strip-query                Enables/disables stripping the query component of every URL
    --visit-host HOST            Visit URLs with the matching host name
    --visit-hosts-like /REGEX/   Visit URLs with hostnames that match the REGEX
    --ignore-host HOST           Ignore the host name
    --ignore-hosts-like /REGEX/  Ignore the host names matching the REGEX
    --visit-port PORT            Visit URLs with the matching port number
    --visit-ports-like /REGEX/   Visit URLs with port numbers that match the REGEX
    --ignore-port PORT           Ignore the port number
    --ignore-ports-like /REGEX/  Ignore the port numbers matching the REGEX
    --visit-link URL             Visit the URL
    --visit-links-like /REGEX/   Visit URLs that match the REGEX
    --ignore-link URL            Ignore the URL
    --ignore-links-like /REGEX/  Ignore URLs matching the REGEX
    --visit-ext FILE_EXT         Visit URLs with the matching file ext
    --visit-exts-like /REGEX/    Visit URLs with file exts that match the REGEX
    --ignore-ext FILE_EXT        Ignore the URLs with the file ext
    --ignore-exts-like /REGEX/   Ignore URLs with file exts matching the REGEX
-r, --robots                     Specifies whether to honor robots.txt
    --host HOST                  Spiders the specific HOST
    --domain DOMAIN              Spiders the whole domain
    --site URL                   Spiders the website, starting at the URL
-o, --output PATH                The wordlist to write to
-X, --content-xpath XPATH        The XPath for the content (Default: //body)
-C, --content-css-path XPATH     The CSS-path for the content
    --meta-tags                  Parse certain meta-tags (Default: enabled)
    --no-meta-tags               Ignore meta-tags
    --alt-tags                   Parse alt-tags on images (Default: enabled)
    --no-alt-tags                Ignore alt-tags on images
    --paths                      Also parse URL paths
    --query-param-names          Also parse URL query param names
    --query-param-values         Also parse URL query param values
    --only-paths                 Only build a wordlist based on the paths
    --only-query-param-names     Only build a wordlist based on the query param names
    --only-query-param-values    Only build a wordlist based on the query param values
-f, --format txt|gz|bzip2|xz     Specifies the format of the wordlist file
-A, --append                     Append new words to the wordlist file instead of overwriting the file
-L, --lang LANG                  The language of the text to parse
    --stop-word WORD             A stop-word to ignore
    --ignore-word WORD           Ignores the word
    --digits                     Accepts words containing digits (Default: enabled)
    --no-digits                  Ignores words containing digits
    --special-char CHAR          Allows a special character within a word (Default: _, -, ')
    --numbers                    Accepts numbers as words (Default: disabled)
    --no-numbers                 Ignores numbers
    --acronyms                   Treats acronyms as words (Default: enabled)
    --no-acronyms                Ignores acronyms
    --normalize-case             Converts all words to lowercase
    --no-normalize-case          Preserve the case of words and letters (Default: enabled)
    --normalize-apostrophes      Removes apostrophes from words
    --no-normalize-apostrophes   Preserve apostrophes from words (Default: enabled)
    --normalize-acronyms         Removes '.' characters from acronyms
    --no-normalize-acronyms      Preserve '.' characters in acronyms (Default: enabled)
-h, --help                       Print help information

Since:

  • 1.0.0

Constant Summary

META_TAGS_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find description and keywords meta-tags.

Since:

  • 1.0.0

'/head/meta[@name="description" or @name="keywords"]/@content'
TEXT_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find all text elements.

Since:

  • 1.0.0

'//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'
COMMENT_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath to find all HTML comments.

Since:

  • 1.0.0

'//comment()'
ALT_TAGS_XPATH =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

XPath which finds all alt attributes on img, area, and input elements, and title attributes on links (a elements).

Since:

  • 1.0.0

'//img/@alt|//area/@alt|//input/@alt|//a/@title'
WORDLIST_BUILDER_OPTIONS =

This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.

List of command options that directly map to the keyword arguments of Wordlist::Builder.new.

Since:

  • 1.0.0

[
  :format,
  :append,
  :lang,
  :digits,
  :numbers,
  :acronyms,
  :normalize_case,
  :normalize_apostrophes,
  :normalize_acronyms
]

Instance Attribute Summary

Attributes included from SpiderOptions

#agent_kwargs

Instance Method Summary

Methods included from SpiderOptions

#continue_timeout, #continue_timeout=, #default_headers, #delay, #delay=, #history, #host_headers, #ignore_exts, #ignore_hosts, #ignore_links, #ignore_ports, #ignore_schemes, included, #keep_alive_timeout, #keep_alive_timeout=, #limit, #limit=, #max_depth, #max_depth=, #new_agent, #open_timeout, #open_timeout=, #proxy, #proxy=, #queue, #read_timeout, #read_timeout=, #referer, #referer=, #robots, #robots=, #ssl_timeout, #ssl_timeout=, #strip_fragments, #strip_fragments=, #strip_query, #strip_query=, #user_agent, #user_agent=, #visit_exts, #visit_hosts, #visit_links, #visit_ports, #visit_schemes

Constructor Details

#initialize(**kwargs) ⇒ Wordlist

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Initializes the ronin-web wordlist command.

Parameters:

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments for the command.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 279

def initialize(**kwargs)
  super(**kwargs)

  @content_xpath = nil

  @parse_meta_tags = true
  @parse_comments  = true
  @parse_alt_tags  = true

  @stop_words    = []
  @ignore_words  = []
  @special_chars = []
end

Instance Attribute Details

#content_xpath ⇒ String (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The XPath or CSS-path for the page's content.

Returns:

  • (String)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 256

def content_xpath
  @content_xpath
end

#ignore_words ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

List of words to ignore.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 266

def ignore_words
  @ignore_words
end

#special_chars ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The list of special characters to allow in words.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 271

def special_chars
  @special_chars
end

#stop_words ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

List of stop-words to ignore.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 261

def stop_words
  @stop_words
end

Instance Method Details

#infer_wordlist_path ⇒ String

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Generates the wordlist output path based on the --host, --domain, or --site options.

Returns:

  • (String)

    The generated wordlist output path.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 355

def infer_wordlist_path
  if    options[:host]   then "#{options[:host]}.txt"
  elsif options[:domain] then "#{options[:domain]}.txt"
  elsif options[:site]
    uri = URI.parse(options[:site])

    unless uri.port == uri.default_port
      "#{uri.host}:#{uri.port}.txt"
    else
      "#{uri.host}.txt"
    end
  else
    print_error "must specify --host, --domain, or --site"
    exit(1)
  end
end
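
The fallback order above can be tried out in isolation. The following is a minimal sketch rather than the command's actual code: `infer_path` is a hypothetical helper that takes a plain options Hash, and it returns `nil` when no target is given instead of printing an error and exiting.

```ruby
require 'uri'

# Hypothetical stand-alone version of the path inference shown above.
# Returns nil (rather than exiting) when no target option is present.
def infer_path(options)
  if    options[:host]   then "#{options[:host]}.txt"
  elsif options[:domain] then "#{options[:domain]}.txt"
  elsif options[:site]
    uri = URI.parse(options[:site])

    # Only a non-default port is included in the file name.
    if uri.port == uri.default_port
      "#{uri.host}.txt"
    else
      "#{uri.host}:#{uri.port}.txt"
    end
  end
end

puts infer_path({ host: 'www.example.com' })
puts infer_path({ site: 'https://example.com/' })
puts infer_path({ site: 'http://example.com:8080/app' })
```

The host names and URLs are placeholders. Note that `--host` takes precedence over `--domain`, which in turn takes precedence over `--site`.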

#parse_html(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the spidered page's HTML and adds the words to the wordlist.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 471

def parse_html(page)
  page.search(@xpath).each do |node|
    text = node.inner_text
    text.strip!

    @wordlist.parse(text) unless text.empty?
  end
end

#parse_page(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the spidered page's content and adds the words to the wordlist.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 457

def parse_page(page)
  if page.html?
    log_info "Parsing HTML on #{page.url} ..."
    parse_html(page)
  end
end

#parse_url_path(url) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's directory names of a spidered page and adds them to the wordlist.

Parameters:

  • url (URI::HTTP)

    A spidered URL.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 411

def parse_url_path(url)
  log_info "Parsing #{url} ..."

  url.path.split('/').each do |dirname|
    @wordlist.add(dirname) unless dirname.empty?
  end
end
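
To see which words the path splitting above would yield for a given URL, the same `split('/')` and `empty?` check can be run on its own. This is a small sketch using only the standard library; `path_segments` is a hypothetical helper name and the URL is a placeholder.

```ruby
require 'uri'

# Splits a URL's path into its non-empty segments, mirroring the
# split('/') and empty? check in the method above.
def path_segments(url)
  URI.parse(url).path.split('/').reject(&:empty?)
end

p path_segments('https://example.com/blog/2023/index.html')
```

Every non-empty segment is kept, so the final segment (often a file name such as `index.html`) ends up in the list alongside the directory names.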

#parse_url_query_param_names(url) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's query param names of a spidered page and adds them to the wordlist.

Parameters:

  • url (URI::HTTP)

    A spidered URL.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 426

def parse_url_query_param_names(url)
  unless url.query_params.empty?
    log_info "Parsing query param for #{url} ..."
    @wordlist.append(url.query_params.keys)
  end
end

#parse_url_query_param_values(url) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Parses the URL's query param values of a spidered page and adds them to the wordlist.

Parameters:

  • url (URI::HTTP)

    A spidered URL.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 440

def parse_url_query_param_values(url)
  unless url.query_params.empty?
    log_info "Parsing query param values for #{url} ..."

    url.query_params.each_value do |value|
      @wordlist.add(value)
    end
  end
end
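
The two query-param methods above rely on `url.query_params`, which is not part of Ruby's standard `uri` library. As a rough illustration of what they collect, the sketch below swaps in the standard library's `URI.decode_www_form`; the `query_params` helper defined here is a hypothetical stand-in and may not behave identically to the one the command uses.

```ruby
require 'uri'

# Returns the query params of a URL as a Hash, using only the standard
# library. This is an approximation of url.query_params, not the same call.
def query_params(url)
  query = URI.parse(url).query
  return {} if query.nil?

  URI.decode_www_form(query).to_h
end

params = query_params('https://example.com/search?q=ruby&page=2')

p params.keys    # what the param-names method would add
p params.values  # what the param-values method would add
```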

#run ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Runs the ronin-web wordlist command.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 309

def run
  @wordlist = ::Wordlist::Builder.new(wordlist_path,**wordlist_builder_kwargs)

  @xpath = "#{@content_xpath}#{TEXT_XPATH}"
  @xpath << "|#{META_TAGS_XPATH}"                 if @parse_meta_tags
  @xpath << "|#{@content_xpath}#{COMMENT_XPATH}"  if @parse_comments
  @xpath << "|#{@content_xpath}#{ALT_TAGS_XPATH}" if @parse_alt_tags

  begin
    new_agent do |agent|
      if options[:only_paths]
        agent.every_url(&method(:parse_url_path))
      elsif options[:only_query_param_names]
        agent.every_url(&method(:parse_url_query_param_names))
      elsif options[:only_query_param_values]
        agent.every_url(&method(:parse_url_query_param_values))
      else
        agent.every_url(&method(:parse_url_path)) if options[:paths]

        agent.every_url(&method(:parse_url_query_param_names)) if options[:query_param_names]
        agent.every_url(&method(:parse_url_query_param_values)) if options[:query_param_values]

        agent.every_ok_page(&method(:parse_page))
      end
    end
  ensure
    @wordlist.close
  end
end
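
The XPath that `#run` assembles is a plain string union, so it can be reproduced without spidering anything. The sketch below copies the four constants from the Constant Summary and rebuilds the expression; `build_xpath` and its keyword arguments are hypothetical names standing in for the `@parse_*` instance variables.

```ruby
# Constants copied from the Constant Summary above.
TEXT_XPATH      = '//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'
META_TAGS_XPATH = '/head/meta[@name="description" or @name="keywords"]/@content'
COMMENT_XPATH   = '//comment()'
ALT_TAGS_XPATH  = '//img/@alt|//area/@alt|//input/@alt|//a/@title'

# Hypothetical helper that builds the same union expression as #run.
def build_xpath(content_xpath, meta_tags: true, comments: true, alt_tags: true)
  parts = ["#{content_xpath}#{TEXT_XPATH}"]
  parts << META_TAGS_XPATH                     if meta_tags
  parts << "#{content_xpath}#{COMMENT_XPATH}"  if comments
  parts << "#{content_xpath}#{ALT_TAGS_XPATH}" if alt_tags

  parts.join('|')
end

puts build_xpath('//main', meta_tags: false, comments: false)
```

One thing the printed result makes visible: because `ALT_TAGS_XPATH` is itself a union, prepending the content XPath by string concatenation scopes only its first branch (`//img/@alt`), while the remaining branches still match anywhere in the document.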

#wordlist_builder_kwargs ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Creates a keyword arguments Hash of all command options that will be directly passed to Wordlist::Builder.new.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 390

def wordlist_builder_kwargs
  kwargs = {}

  WORDLIST_BUILDER_OPTIONS.each do |key|
    kwargs[key] = options[key] if options.has_key?(key)
  end

  kwargs[:stop_words]    = @stop_words    unless @stop_words.empty?
  kwargs[:ignore_words]  = @ignore_words  unless @ignore_words.empty?
  kwargs[:special_chars] = @special_chars unless @special_chars.empty?

  return kwargs
end
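
The filtering above can be sketched against a plain Hash. This version is an illustration under assumptions, not the command's code: it uses `Hash#slice` in place of the explicit `each` loop, and `builder_kwargs` with its keyword arguments is a hypothetical stand-in for the instance variables.

```ruby
# Option names copied from WORDLIST_BUILDER_OPTIONS above.
BUILDER_OPTIONS = %i[
  format append lang digits numbers acronyms
  normalize_case normalize_apostrophes normalize_acronyms
].freeze

# Hypothetical stand-alone version: copies only the recognised keys
# that are present, and adds the list options only when non-empty.
def builder_kwargs(options, stop_words: [], ignore_words: [], special_chars: [])
  kwargs = options.slice(*BUILDER_OPTIONS)

  kwargs[:stop_words]    = stop_words    unless stop_words.empty?
  kwargs[:ignore_words]  = ignore_words  unless ignore_words.empty?
  kwargs[:special_chars] = special_chars unless special_chars.empty?

  kwargs
end

p builder_kwargs({ lang: :en, digits: false, output: 'words.txt' }, stop_words: ['the'])
```

Keys that are present with a `false` value (such as `digits: false` here) are still passed through, while unrelated options such as `:output` and empty word lists are left out.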

#wordlist_path ⇒ String

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The wordlist output path.

Returns:

  • (String)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/wordlist.rb', line 344

def wordlist_path
  options.fetch(:output) { infer_wordlist_path }
end
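
The method above leans on the block form of `Hash#fetch`, which evaluates its block only when the key is missing. A minimal sketch of that pattern, with a hypothetical `output_path` helper and a simplified fallback in place of `infer_wordlist_path`:

```ruby
# Hash#fetch runs the block only when :output is absent, so the
# fallback is never computed for an explicit output path.
def output_path(options)
  options.fetch(:output) { "#{options.fetch(:host)}.txt" }
end

p output_path({ output: 'custom.txt', host: 'example.com' })
p output_path({ host: 'example.com' })
```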