Class: Ronin::Web::CLI::Commands::Spider Private

Inherits:
Ronin::Web::CLI::Command
Includes:
CommandKit::Colors, CommandKit::Options::Verbose, CommandKit::Printing::Indent, SpiderOptions
Defined in:
lib/ronin/web/cli/commands/spider.rb

Overview

This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.

Spiders a website.

Usage

ronin-web spider [options] {--host HOST | --domain DOMAIN | --site URL}

Options

    --host HOST                  Spiders the specific HOST
    --domain DOMAIN              Spiders the whole domain
    --site URL                   Spiders the website, starting at the URL
    --open-timeout SECS          Sets the connection open timeout
    --read-timeout SECS          Sets the read timeout
    --ssl-timeout SECS           Sets the SSL connection timeout
    --continue-timeout SECS      Sets the continue timeout
    --keep-alive-timeout SECS    Sets the connection keep alive timeout
-P, --proxy PROXY                Sets the proxy to use
-H, --header NAME: VALUE         Sets a default header
    --host-header NAME=VALUE     Sets a default header
-U, --user-agent-string STRING   The User-Agent string to use
-u chrome-linux|chrome-macos|chrome-windows|chrome-iphone|chrome-ipad|chrome-android|firefox-linux|firefox-macos|firefox-windows|firefox-iphone|firefox-ipad|firefox-android|safari-macos|safari-iphone|safari-ipad|edge,
    --user-agent                 The User-Agent to use
-R, --referer URL                Sets the Referer URL
    --delay SECS                 Sets the delay in seconds between each request
-l, --limit COUNT                Only spiders up to COUNT pages
-d, --max-depth DEPTH            Only spiders up to max depth
    --enqueue URL                Adds the URL to the queue
    --visited URL                Marks the URL as previously visited
    --strip-fragments            Enables/disables stripping the fragment component of every URL
    --strip-query                Enables/disables stripping the query component of every URL
    --visit-scheme SCHEME        Visit URLs with the URI scheme
    --visit-schemes-like /REGEX/ Visit URLs with URI schemes that match the REGEX
    --ignore-scheme SCHEME       Ignore the URLs with the URI scheme
    --ignore-schemes-like /REGEX/
                                 Ignore the URLs with URI schemes matching the REGEX
    --visit-host HOST            Visit URLs with the matching host name
    --visit-hosts-like /REGEX/   Visit URLs with hostnames that match the REGEX
    --ignore-host HOST           Ignore the host name
    --ignore-hosts-like /REGEX/  Ignore the host names matching the REGEX
    --visit-port PORT            Visit URLs with the matching port number
    --visit-ports-like /REGEX/   Visit URLs with port numbers that match the REGEX
    --ignore-port PORT           Ignore the port number
    --ignore-ports-like /REGEX/  Ignore the port numbers matching the REGEX
    --visit-link URL             Visit the URL
    --visit-links-like /REGEX/   Visit URLs that match the REGEX
    --ignore-link URL            Ignore the URL
    --ignore-links-like /REGEX/  Ignore URLs matching the REGEX
    --visit-ext FILE_EXT         Visit URLs with the matching file ext
    --visit-exts-like /REGEX/    Visit URLs with file exts that match the REGEX
    --ignore-ext FILE_EXT        Ignore the URLs with the file ext
    --ignore-exts-like /REGEX/   Ignore URLs with file exts matching the REGEX
-r, --robots                     Specifies whether to honor robots.txt
-v, --verbose                    Enables verbose output
    --print-status               Print the status codes for each URL
    --print-headers              Print response headers for each URL
    --print-header NAME          Prints a specific header
    --history FILE               The history file
    --archive DIR                Archive every visited page to the DIR
    --git-archive DIR            Archive every visited page to the git repository
-X, --xpath XPATH                Evaluates the XPath on each HTML page
-C, --css-path CSS_PATH          Evaluates the CSS-path on each HTML page
    --print-hosts                Print all discovered hostnames
    --print-certs                Print all encountered SSL/TLS certificates
    --save-certs                 Saves all encountered SSL/TLS certificates
    --print-js-strings           Print all JavaScript strings
    --print-js-url-strings       Print URL strings found in JavaScript
    --print-js-path-strings      Print path strings found in JavaScript
    --print-js-absolute-path-strings
                                 Only print absolute path strings found in JavaScript
    --print-js-relative-path-strings
                                 Only print relative path strings found in JavaScript
    --print-html-comments        Print HTML comments
    --print-js-comments          Print JavaScript comments
    --print-comments             Print all HTML and JavaScript comments
-h, --help                       Print help information

Examples

ronin-web spider --host scanme.nmap.org
ronin-web spider --domain nmap.org
ronin-web spider --site https://scanme.nmap.org/
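
The flags listed under Options can be combined in a single invocation. As a minimal sketch (not part of ronin-web itself), the Ruby snippet below assembles such a command line with Shellwords from the standard library. The helper name `spider_command` and the option values are illustrative assumptions; only flags documented above are used.

```ruby
require 'shellwords'

# Hypothetical helper (not part of ronin-web): builds a shell-safe
# `ronin-web spider` command line from a target flag and extra options.
def spider_command(target_flag, target, extra = [])
  Shellwords.join(['ronin-web', 'spider', target_flag, target, *extra])
end

# Spider one host, stop after 10 pages, wait 1 second between requests.
puts spider_command('--host', 'scanme.nmap.org', ['--limit', '10', '--delay', '1'])

# A regex argument contains shell metacharacters, so it gets escaped.
puts spider_command('--domain', 'nmap.org', ['--ignore-exts-like', '/jpe?g|png/'])
```

Escaping matters mostly for the `*-like` options, whose /REGEX/ arguments often contain characters such as `?` or `|` that a shell would otherwise interpret.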

Since:

  • 1.0.0

Instance Attribute Summary

Attributes included from SpiderOptions

#agent_kwargs

Instance Method Summary

Methods included from SpiderOptions

#continue_timeout, #continue_timeout=, #default_headers, #delay, #delay=, #history, #host_headers, #ignore_exts, #ignore_hosts, #ignore_links, #ignore_ports, #ignore_schemes, included, #initialize, #keep_alive_timeout, #keep_alive_timeout=, #limit, #limit=, #max_depth, #max_depth=, #new_agent, #open_timeout, #open_timeout=, #proxy, #proxy=, #queue, #read_timeout, #read_timeout=, #referer, #referer=, #robots, #robots=, #ssl_timeout, #ssl_timeout=, #strip_fragments, #strip_fragments=, #strip_query, #strip_query=, #user_agent, #user_agent=, #visit_exts, #visit_hosts, #visit_links, #visit_ports, #visit_schemes

Instance Method Details

#define_printing_callbacks(agent) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Defines callbacks that print information.

Parameters:

  • agent (Ronin::Web::Spider::Agent)

    The newly created agent.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 280

def define_printing_callbacks(agent)
  if options[:print_hosts]
    agent.every_host do |host|
      print_verbose "spidering new host #{host}"
    end
  end

  if options[:print_certs]
    agent.every_cert do |cert|
      print_verbose "encountered new certificate for #{cert.subject.common_name}"
    end
  end

  if options[:print_js_strings]
    agent.every_js_string do |string|
      print_content string
    end
  end

  if options[:print_js_url_strings]
    agent.every_js_url_string do |url|
      print_content url
    end
  end

  if options[:print_js_path_strings]
    agent.every_js_path_string do |path|
      print_content path
    end
  end

  if options[:print_js_absolute_path_strings]
    agent.every_js_absolute_path_string do |path|
      print_content path
    end
  end

  if options[:print_js_relative_path_strings]
    agent.every_js_relative_path_string do |path|
      print_content path
    end
  end

  if options[:print_html_comments]
    agent.every_html_comment do |comment|
      print_content comment
    end
  end

  if options[:print_js_comments]
    agent.every_js_comment do |comment|
      print_content comment
    end
  end

  if options[:print_comments]
    agent.every_comment do |comment|
      print_content comment
    end
  end
end
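
The method above registers a callback only when the matching option is set. That pattern can be sketched in isolation as follows. `StubAgent` and `define_callbacks` are hypothetical stand-ins written for this example; they are not the real `Ronin::Web::Spider::Agent` API and only mimic two of its `every_*` methods.

```ruby
# Hypothetical stand-in for the spider agent: stores blocks per event name.
class StubAgent
  attr_reader :callbacks

  def initialize
    @callbacks = Hash.new { |hash, key| hash[key] = [] }
  end

  def every_host(&block)
    @callbacks[:host] << block
  end

  def every_js_string(&block)
    @callbacks[:js_string] << block
  end
end

# Registers a callback only for the options that were enabled.
def define_callbacks(agent, options, output)
  agent.every_host { |host| output << "host: #{host}" } if options[:print_hosts]
  agent.every_js_string { |str| output << str } if options[:print_js_strings]
  agent
end

output = []
agent  = define_callbacks(StubAgent.new, { print_hosts: true }, output)

# Simulate the agent discovering a host; only the :host callback exists.
agent.callbacks[:host].each { |callback| callback.call('example.com') }
puts output
```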

#print_content(content) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Print content from a page.

Parameters:

  • content (#to_s)

    The content to print.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 444

def print_content(content)
  content.to_s.each_line do |line|
    puts "    #{line}"
  end
end
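
To make the indentation behavior concrete, here is a small sketch that returns the indented lines instead of printing them. The helper name `indent_content` is an assumption made for this example and does not exist in the class.

```ruby
# Sketch of the indentation performed by print_content: every line of the
# content (converted with #to_s) gets a four-space prefix.
def indent_content(content)
  content.to_s.each_line.map { |line| "    #{line}" }
end

puts indent_content("<!-- first -->\n<!-- second -->\n")
```

Because the content is converted with `#to_s` first, non-String values are accepted, and an empty string produces no lines at all.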

#print_headers(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the headers of a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 405

def print_headers(page)
  page.response.each_capitalized do |name,value|
    print_content "#{name}: #{value}"
  end
end
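
The sketch below illustrates what `each_capitalized` yields, assuming that `page.response` is a `Net::HTTPResponse` from Ruby's standard library (an assumption, since the page above does not state the type). The helper `header_lines` is hypothetical, and the response is built by hand, so no request is sent.

```ruby
require 'net/http'

# Hypothetical helper: collects "Name: value" strings the same way
# print_headers iterates, using Net::HTTPHeader#each_capitalized.
def header_lines(response)
  lines = []
  response.each_capitalized { |name, value| lines << "#{name}: #{value}" }
  lines
end

# Build a response object by hand; no network request is made.
response = Net::HTTPResponse.new('1.1', '200', 'OK')
response['content-type'] = 'text/html'
response['x-frame-options'] = 'DENY'

puts header_lines(response)
```

Header names are stored in lower case internally, and `each_capitalized` capitalizes each dash-separated part when iterating.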

#print_page(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 384

def print_page(page)
  print_status(page) if options[:print_status]
  print_url(page)

  if options[:print_headers]
    print_headers(page)
  elsif options[:print_header]
    if (header = page.response[options[:print_header]])
      print_content header
    end
  end

  print_query(page) if (options[:xpath] || options[:css_path])
end

#print_query(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the XPath or CSS-path query result for the page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 417

def print_query(page)
  if page.html?
    if options[:xpath]
      print_content page.doc.xpath(options[:xpath])
    elsif options[:css_path]
      print_content page.doc.css(options[:css_path])
    end
  end
end

#print_status(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the status of a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 348

def print_status(page)
  if page.code < 300
    print "#{colors.bright_green(page.code)} "
  elsif page.code < 400
    print "#{colors.bright_yellow(page.code)} "
  elsif page.code < 500
    print "#{colors.bright_red(page.code)} "
  else
    print "#{colors.bold(colors.bright_red(page.code))} "
  end
end
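
The branch structure above groups status codes into four color buckets. A minimal sketch of that mapping, with the color names returned as symbols so it runs without CommandKit, might look like this. `status_styles` is a hypothetical helper and is not defined by the class.

```ruby
# Hypothetical helper mirroring the branches of print_status: returns the
# names of the color methods applied to a status code, outermost first.
def status_styles(code)
  if    code < 300 then [:bright_green]
  elsif code < 400 then [:bright_yellow]
  elsif code < 500 then [:bright_red]
  else                  [:bold, :bright_red]
  end
end

[200, 301, 404, 503].each do |code|
  puts "#{code}: #{status_styles(code).join(' + ')}"
end
```

Note that the first branch only checks `code < 300`, so 1xx responses fall into the same bucket as 2xx responses.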

#print_url(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the URL for a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 366

def print_url(page)
  if page.code < 300
    puts "#{colors.green(page.url)} "
  elsif page.code < 400
    puts "#{colors.yellow(page.url)} "
  elsif page.code < 500
    puts "#{colors.red(page.url)} "
  else
    puts "#{colors.bold(colors.red(page.url))} "
  end
end

#print_verbose(message) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints an information message.

Parameters:

  • message (String)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 432

def print_verbose(message)
  if verbose?
    puts colors.yellow("* #{message}")
  end
end

#run ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Runs the ronin-web spider command.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 204

def run
  archive = if options[:archive]
              Web::Spider::Archive.open(options[:archive])
            elsif options[:git_archive]
              Web::Spider::GitArchive.open(options[:git_archive])
            end

  history_file = if options[:history]
                   File.open(options[:history],'w')
                 end

  agent = new_agent do |agent|
    agent.every_page do |page|
      print_page(page)
    end

    agent.every_failed_url do |url|
      print_verbose "failed to request #{url}"
    end

    define_printing_callbacks(agent)

    if history_file
      agent.every_page do |page|
        history_file.puts(page.url)
        history_file.flush
      end
    end

    if archive
      agent.every_ok_page do |page|
        archive.write(page.url,page.body)
      end
    end
  end

  # post-spidering tasks

  if options[:git_archive]
    archive.commit "Updated #{Time.now}"
  end

  if options[:print_hosts]
    puts
    puts "Spidered the following hosts:"
    puts

    indent do
      agent.visited_hosts.each do |host|
        puts host
      end
    end
  end

  if options[:print_certs]
    puts
    puts "Discovered the following certs:"
    puts

    agent.collected_certs.each do |cert|
      puts cert
      puts
    end
  end
ensure
  # history_file is still nil if an earlier step raised before it was opened
  history_file.close if history_file
end
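
The history-file handling in `#run` follows a write, flush, and close-in-`ensure` pattern. The sketch below isolates it using only the standard library. `record_history` and the example URLs are hypothetical, and a plain array stands in for the pages the agent would visit.

```ruby
require 'tempfile'

# Writes each URL to the history file as soon as it is "visited",
# flushing after every line so the file stays current if interrupted.
def record_history(path, urls)
  history_file = File.open(path, 'w')

  urls.each do |url|
    history_file.puts(url)
    history_file.flush
  end
ensure
  # Guard against File.open having raised before the assignment.
  history_file.close if history_file
end

Tempfile.create('history') do |tmp|
  record_history(tmp.path, ['https://example.com/', 'https://example.com/a'])
  puts File.read(tmp.path)
end
```

Opening the file in `'w'` mode truncates it, so each run starts with an empty history file rather than appending to the previous one.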