Class: Ronin::Web::CLI::Commands::Spider Private

Inherits:
Ronin::Web::CLI::Command
Includes:
CommandKit::Colors, CommandKit::Options::Verbose, CommandKit::Printing::Indent
Defined in:
lib/ronin/web/cli/commands/spider.rb

Overview

This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.

Spiders a website.

Usage

ronin-web spider [options] {--host HOST | --domain DOMAIN | --site URL}

Options

-v, --verbose                    Enables verbose output
    --open-timeout SECS          Sets the connection open timeout
    --read-timeout SECS          Sets the read timeout
    --ssl-timeout SECS           Sets the SSL connection timeout
    --continue-timeout SECS      Sets the continue timeout
    --keep-alive-timeout SECS    Sets the connection keep alive timeout
-P, --proxy PROXY                Sets the proxy to use.
-H, --header NAME: VALUE         Sets a default header
    --host-header NAME=VALUE     Sets a custom Host header
-u chrome-linux|chrome-macos|chrome-windows|chrome-iphone|chrome-ipad|chrome-android|firefox-linux|firefox-macos|firefox-windows|firefox-iphone|firefox-ipad|firefox-android|safari-macos|safari-iphone|safari-ipad|edge,
    --user-agent                 The User-Agent to use
-U, --user-agent-string STRING   The User-Agent string to use
-R, --referer URL                Sets the Referer URL
    --delay SECS                 Sets the delay in seconds between each request
-l, --limit COUNT                Only spiders up to COUNT pages
-d, --max-depth DEPTH            Only spiders up to max depth
    --enqueue URL                Adds the URL to the queue
    --visited URL                Marks the URL as previously visited
    --strip-fragments            Enables/disables stripping the fragment component of every URL
    --strip-query                Enables/disables stripping the query component of every URL
    --visit-host HOST            Visit URLs with the matching host name
    --visit-hosts-like /REGEX/   Visit URLs with hostnames that match the REGEX
    --ignore-host HOST           Ignore the host name
    --ignore-hosts-like /REGEX/  Ignore the host names matching the REGEX
    --visit-port PORT            Visit URLs with the matching port number
    --visit-ports-like /REGEX/   Visit URLs with port numbers that match the REGEX
    --ignore-port PORT           Ignore the port number
    --ignore-ports-like /REGEX/  Ignore the port numbers matching the REGEX
    --visit-link URL             Visit the URL
    --visit-links-like /REGEX/   Visit URLs that match the REGEX
    --ignore-link URL            Ignore the URL
    --ignore-links-like /REGEX/  Ignore URLs matching the REGEX
    --visit-ext FILE_EXT         Visit URLs with the matching file ext
    --visit-exts-like /REGEX/    Visit URLs with file exts that match the REGEX
    --ignore-ext FILE_EXT        Ignore the URLs with the file ext
    --ignore-exts-like /REGEX/   Ignore URLs with file exts matching the REGEX
-r, --robots                     Specifies whether to honor robots.txt
    --host HOST                  Spiders the specific HOST
    --domain DOMAIN              Spiders the whole domain
    --site URL                   Spiders the website, starting at the URL
    --print-status               Print the status codes for each URL
    --print-headers              Print response headers for each URL
    --print-header NAME          Prints a specific header
    --history FILE               The history file
    --archive DIR                Archive every visited page to the DIR
    --git-archive DIR            Archive every visited page to the git repository
-X, --xpath XPATH                Evaluates the XPath on each HTML page
-C, --css-path CSSPath           Evaluates the CSS-path on each HTML page
-h, --help                       Print help information

Examples

ronin-web spider --host scanme.nmap.org
ronin-web spider --domain nmap.org
ronin-web spider --site https://scanme.nmap.org/

Since:

  • 1.0.0

Constructor Details

#initialize(**kwargs) ⇒ Spider

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Initializes the spider command.

Parameters:

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 530

def initialize(**kwargs)
  super(**kwargs)

  @default_headers = {}
  @host_headers    = {}

  @queue   = []
  @history = []

  @visit_schemes = []
  @visit_hosts   = []
  @visit_ports   = []
  @visit_links   = []
  @visit_exts    = []

  @ignore_hosts = []
  @ignore_ports = []
  @ignore_links = []
  @ignore_exts  = []
end

Instance Attribute Details

#default_headers ⇒ Hash{String => String} (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The default HTTP headers to send with every request.

Returns:

  • (Hash{String => String})

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 462

def default_headers
  @default_headers
end

#history ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The pre-existing history of previously visited URLs to start spidering with.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 477

def history
  @history
end

#host_headers ⇒ Hash{String => String} (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The mapping of custom Host headers.

Returns:

  • (Hash{String => String})

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 467

def host_headers
  @host_headers
end

#ignore_exts ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The URL file extensions to ignore.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 522

def ignore_exts
  @ignore_exts
end

#ignore_hosts ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The hosts to ignore.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 507

def ignore_hosts
  @ignore_hosts
end

#ignore_links ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The links to ignore.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 517

def ignore_links
  @ignore_links
end

#ignore_ports ⇒ Array<Integer, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The port numbers to ignore.

Returns:

  • (Array<Integer, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 512

def ignore_ports
  @ignore_ports
end

#queue ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The pre-existing queue of URLs to start spidering with.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 472

def queue
  @queue
end

#visit_exts ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The URL file extensions to visit.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 502

def visit_exts
  @visit_exts
end

#visit_hosts ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The hosts to visit.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 487

def visit_hosts
  @visit_hosts
end

#visit_links ⇒ Array<String, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The links to visit.

Returns:

  • (Array<String, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 497

def visit_links
  @visit_links
end

#visit_ports ⇒ Array<Integer, Regexp> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The port numbers to visit.

Returns:

  • (Array<Integer, Regexp>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 492

def visit_ports
  @visit_ports
end

#visit_schemes ⇒ Array<String> (readonly)

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

The schemes to visit.

Returns:

  • (Array<String>)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 482

def visit_schemes
  @visit_schemes
end

Instance Method Details

#agent_kwargs ⇒ Hash{Symbol => Object}

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Builds keyword arguments for Ronin::Web::Spider::Agent#initialize.

Returns:

  • (Hash{Symbol => Object})

    The keyword arguments for Ronin::Web::Spider::Agent#initialize.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 701

def agent_kwargs
  kwargs = {}

  kwargs[:proxy] = options[:proxy] if options[:proxy]

  unless @default_headers.empty?
    kwargs[:default_headers] = @default_headers
  end

  unless @host_headers.empty?
    kwargs[:host_headers] = @host_headers
  end

  kwargs[:user_agent] = @user_agent       if @user_agent
  kwargs[:referer]    = options[:referer] if options[:referer]

  kwargs[:delay]     = options[:delay]     if options[:delay]
  kwargs[:limit]     = options[:limit]     if options[:limit]
  kwargs[:max_depth] = options[:max_depth] if options[:max_depth]

  kwargs[:queue]   = @queue   unless @queue.empty?
  kwargs[:history] = @history unless @history.empty?

  if options.has_key?(:strip_fragments)
    kwargs[:strip_fragments] = options[:strip_fragments]
  end

  if options.has_key?(:strip_query)
    kwargs[:strip_query] = options[:strip_query]
  end

  kwargs[:schemes] = @visit_schemes unless @visit_schemes.empty?
  kwargs[:hosts]   = @visit_hosts   unless @visit_hosts.empty?
  kwargs[:ports]   = @visit_ports   unless @visit_ports.empty?
  kwargs[:links]   = @visit_links   unless @visit_links.empty?
  kwargs[:exts]    = @visit_exts    unless @visit_exts.empty?

  kwargs[:ignore_hosts] = @ignore_hosts unless @ignore_hosts.empty?
  kwargs[:ignore_ports] = @ignore_ports unless @ignore_ports.empty?
  kwargs[:ignore_links] = @ignore_links unless @ignore_links.empty?
  kwargs[:ignore_exts]  = @ignore_exts  unless @ignore_exts.empty?

  kwargs[:robots] = options[:robots] if options.has_key?(:robots)

  return kwargs
end
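
The method above only sets a key when the corresponding option or list was actually given, presumably so that the agent's own defaults apply otherwise. The following standalone sketch illustrates that pattern; `build_kwargs` and the sample values are hypothetical and not part of ronin-web.

```ruby
# Hypothetical helper illustrating the "only pass what was set" pattern.
# `options` stands in for the parsed command-line options Hash.
def build_kwargs(options, visit_hosts: [])
  kwargs = {}

  # Truthy check: the key is skipped when the option is absent (nil).
  kwargs[:proxy] = options[:proxy] if options[:proxy]
  kwargs[:limit] = options[:limit] if options[:limit]

  # Key-presence check: an explicit `false` is still passed along.
  if options.has_key?(:strip_query)
    kwargs[:strip_query] = options[:strip_query]
  end

  # Collections are only passed when they are non-empty.
  kwargs[:hosts] = visit_hosts unless visit_hosts.empty?

  kwargs
end

p build_kwargs({limit: 10, strip_query: false})
p build_kwargs({}, visit_hosts: ['example.com', /\.example\.com\z/])
```

Note the difference between the two checks: a truthy test would silently drop `strip_query: false`, which is why boolean options are tested with `has_key?`.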

#define_printing_callbacks(agent) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Defines callbacks that print information.

Parameters:

  • agent (Ronin::Web::Spider::Agent)

    The newly created agent.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 630

def define_printing_callbacks(agent)
  if options[:print_hosts]
    agent.every_host do |host|
      print_verbose "spidering new host #{host}"
    end
  end

  if options[:print_certs]
    agent.every_cert do |cert|
      print_verbose "encountered new certificate for #{cert.subject.common_name}"
    end
  end

  if options[:print_js_strings]
    agent.every_js_string do |string|
      print_content string
    end
  end

  if options[:print_html_comments]
    agent.every_html_comment do |comment|
      print_content comment
    end
  end

  if options[:print_js_comments]
    agent.every_js_comment do |comment|
      print_content comment
    end
  end

  if options[:print_comments]
    agent.every_comment do |comment|
      print_content comment
    end
  end
end

#new_agent {|agent| ... } ⇒ Ronin::Web::Spider::Agent

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Creates a new web spider agent.

Yields:

  • (agent)

    The given block will be given the newly created and configured web spider agent.

Yield Parameters:

  • agent (Ronin::Web::Spider::Agent)

    The newly created web spider agent.

Returns:

  • (Ronin::Web::Spider::Agent)

    The newly created web spider agent, after the agent has completed its spidering.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 682

def new_agent(&block)
  if options[:host]
    Web::Spider.host(options[:host],**agent_kwargs,&block)
  elsif options[:domain]
    Web::Spider.domain(options[:domain],**agent_kwargs,&block)
  elsif options[:site]
    Web::Spider.site(options[:site],**agent_kwargs,&block)
  else
    print_error "must specify --host, --domain, or --site"
    exit(-1)
  end
end
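
The branch order above means that `--host` takes precedence over `--domain`, which in turn takes precedence over `--site`, when more than one is given. The sketch below mirrors only that selection logic; `entry_point_for` is a hypothetical helper, not part of ronin-web, and it returns a description of the choice instead of starting a spider.

```ruby
# Hypothetical helper mirroring the branch order of new_agent.
# Returns the name of the chosen entry point and its argument,
# or nil when none of the three options was given.
def entry_point_for(options)
  if options[:host]
    [:host, options[:host]]
  elsif options[:domain]
    [:domain, options[:domain]]
  elsif options[:site]
    [:site, options[:site]]
  end
end

p entry_point_for(site: 'https://example.com/')
p entry_point_for(host: 'example.com', site: 'https://example.com/')
p entry_point_for({})
```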

#print_content(content) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Print content from a page.

Parameters:

  • content (#to_s)

    The content to print.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 850

def print_content(content)
  content.to_s.each_line do |line|
    puts "    #{line}"
  end
end
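
Because the method calls `to_s` first, any object can be passed, and multi-line content is indented line by line. A standalone sketch of the same logic follows; `indent_lines` is a hypothetical name, and it returns the lines instead of printing them so the result is easy to inspect.

```ruby
# Hypothetical standalone version of the indentation logic.
# Converts the content to a String and prefixes every line with the indent.
def indent_lines(content, indent = '    ')
  content.to_s.each_line.map { |line| "#{indent}#{line}" }
end

puts indent_lines("first line\nsecond line\n")
puts indent_lines(42)
```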

#print_headers(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the headers of a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 811

def print_headers(page)
  page.response.each_capitalized do |name,value|
    print_content "#{name}: #{value}"
  end
end
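
`each_capitalized` comes from Ruby's `Net::HTTPHeader` module, so the sketch below assumes that `page.response` is a `Net::HTTPResponse`. The response is built by hand purely for illustration; the header values are made up.

```ruby
require 'net/http'

# Build a response object by hand; in the command the response would
# come from the spidered page instead.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['content-type']    = 'text/html'
response['x-frame-options'] = 'DENY'

# each_capitalized yields each header name with every dash-separated
# word capitalized, together with its value.
headers = []
response.each_capitalized do |name, value|
  headers << "#{name}: #{value}"
end

puts headers
```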

#print_page(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 790

def print_page(page)
  print_status(page) if options[:print_status]
  print_url(page)

  if options[:print_headers]
    print_headers(page)
  elsif options[:print_header]
    if (header = page.response[options[:print_header]])
      print_content header
    end
  end

  print_query(page) if (options[:xpath] || options[:css_path])
end

#print_query(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the XPath or CSS-path query result for the page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 823

def print_query(page)
  if page.html?
    if options[:xpath]
      print_content page.doc.xpath(options[:xpath])
    elsif options[:css_path]
      print_content page.doc.css(options[:css_path])
    end
  end
end

#print_status(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the status of a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 754

def print_status(page)
  if page.code < 300
    print "#{colors.bright_green(page.code)} "
  elsif page.code < 400
    print "#{colors.bright_yellow(page.code)} "
  elsif page.code < 500
    print "#{colors.bright_red(page.code)} "
  else
    print "#{colors.bold(colors.bright_red(page.code))} "
  end
end
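
The thresholds above group status codes into four color buckets. The sketch below reproduces only that bucketing, assuming an Integer status code; `status_color` and the returned symbols are hypothetical names used for illustration, not CommandKit API.

```ruby
# Hypothetical helper: returns a name for the color that the branches of
# print_status select for a given HTTP status code.
def status_color(code)
  if code < 300
    :bright_green      # 1xx and 2xx
  elsif code < 400
    :bright_yellow     # 3xx
  elsif code < 500
    :bright_red        # 4xx
  else
    :bold_bright_red   # 5xx and above
  end
end

[200, 301, 404, 503].each do |code|
  puts "#{code} #{status_color(code)}"
end
```

Since the first branch is simply `code < 300`, informational 1xx responses land in the same bucket as 2xx successes.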

#print_url(page) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints the URL for a page.

Parameters:

  • page (Spidr::Page)

    A spidered page.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 772

def print_url(page)
  if page.code < 300
    puts "#{colors.green(page.url)} "
  elsif page.code < 400
    puts "#{colors.yellow(page.url)} "
  elsif page.code < 500
    puts "#{colors.red(page.url)} "
  else
    puts "#{colors.bold(colors.red(page.url))} "
  end
end

#print_verbose(message) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Prints an information message.

Parameters:

  • message (String)

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 838

def print_verbose(message)
  if verbose?
    puts colors.yellow("* #{message}")
  end
end

#run ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Runs the ronin-web spider command.

Since:

  • 1.0.0



# File 'lib/ronin/web/cli/commands/spider.rb', line 554

def run
  archive = if options[:archive]
              Web::Spider::Archive.open(options[:archive])
            elsif options[:git_archive]
              Web::Spider::GitArchive.open(options[:git_archive])
            end

  history_file = if options[:history]
                   File.open(options[:history],'w')
                 end

  agent = new_agent do |agent|
    agent.every_page do |page|
      print_page(page)
    end

    agent.every_failed_url do |url|
      print_verbose "failed to request #{url}"
    end

    define_printing_callbacks(agent)

    if history_file
      agent.every_page do |page|
        history_file.puts(page.url)
        history_file.flush
      end
    end

    if archive
      agent.every_ok_page do |page|
        archive.write(page.url,page.body)
      end
    end
  end

  # post-spidering tasks

  if options[:git_archive]
    archive.commit "Updated #{Time.now}"
  end

  if options[:print_hosts]
    puts
    puts "Spidered the following hosts:"
    puts

    indent do
      agent.visited_hosts.each do |host|
        puts host
      end
    end
  end

  if options[:print_certs]
    puts
    puts "Discovered the following certs:"
    puts

    agent.collected_certs.each do |cert|
      puts cert
      puts
    end
  end
ensure
  if options[:history]
    history_file.close
  end
end
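
The history file handling in `run` follows an open, write-and-flush, close-in-`ensure` pattern. A minimal standalone sketch of that pattern is shown below; `write_history` and the sample URLs are hypothetical, and the sketch uses `&.` in the `ensure` clause as a defensive variation so that closing is skipped if the file was never opened.

```ruby
require 'tmpdir'

# Hypothetical helper illustrating the history file pattern.
def write_history(path, urls)
  # Mode 'w' truncates any existing file at this path.
  history_file = File.open(path, 'w')

  urls.each do |url|
    history_file.puts(url)
    # Flush after every URL so the file stays current if the run is interrupted.
    history_file.flush
  end
ensure
  # Safe navigation: history_file is nil if File.open raised.
  history_file&.close
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'history.txt')
  write_history(path, %w[https://example.com/ https://example.com/about])
  puts File.read(path)
end
```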