Module: Ronin::Web::Spider
- Defined in:
- lib/ronin/web/spider.rb,
lib/ronin/web/spider/agent.rb,
lib/ronin/web/spider/archive.rb,
lib/ronin/web/spider/version.rb,
lib/ronin/web/spider/exceptions.rb,
lib/ronin/web/spider/git_archive.rb
Overview
A collection of common web spidering routines using the spidr gem.
Examples
Start spidering at a given URL:
require 'ronin/web/spider'
Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
# ...
end
Spider a host:
Ronin::Web::Spider.host('solnic.eu') do |agent|
# ...
end
Spider a domain (and any sub-domains):
Ronin::Web::Spider.domain('ruby-lang.org') do |agent|
# ...
end
Spider a site:
Ronin::Web::Spider.site('http://www.rubyflow.com/') do |agent|
# ...
end
Spider multiple hosts:
Ronin::Web::Spider.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
# ...
end
Do not spider certain links:
Ronin::Web::Spider.site('http://company.com/', ignore_links: [%{^/blog/}]) do |agent|
# ...
end
Do not spider links on certain ports:
Ronin::Web::Spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
# ...
end
Do not spider links blacklisted in robots.txt:
Ronin::Web::Spider.site('http://company.com/', robots: true) do |agent|
# ...
end
Print out visited URLs:
Ronin::Web::Spider.site('http://www.rubyinside.com/') do |spider|
spider.every_url { |url| puts url }
end
Build a URL map of a site:
url_map = Hash.new { |hash,key| hash[key] = [] }
Ronin::Web::Spider.site('http://intranet.com/') do |spider|
spider.every_link do |origin,dest|
url_map[dest] << origin
end
end
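The URL map relies on Hash's block-constructor form: the block runs on first access to a missing key and stores a fresh Array, so `<<` always has a target. A stdlib-only sketch of that behavior (the URLs are placeholders):

```ruby
# Auto-vivifying Hash: unseen destination URLs map to a new empty
# Array, so origins can be appended without a nil check.
url_map = Hash.new { |hash,key| hash[key] = [] }

url_map['http://intranet.com/about'] << 'http://intranet.com/'
url_map['http://intranet.com/about'] << 'http://intranet.com/team'

puts url_map['http://intranet.com/about'].inspect
# ["http://intranet.com/", "http://intranet.com/team"]
```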
Print out the URLs that could not be requested:
Ronin::Web::Spider.site('http://company.com/') do |spider|
spider.every_failed_url { |url| puts url }
end
Find all pages which have broken links:
url_map = Hash.new { |hash,key| hash[key] = [] }
spider = Ronin::Web::Spider.site('http://intranet.com/') do |spider|
spider.every_link do |origin,dest|
url_map[dest] << origin
end
end
spider.failures.each do |url|
puts "Broken link #{url} found in:"
url_map[url].each { |page| puts " #{page}" }
end
Search HTML and XML pages:
Ronin::Web::Spider.site('http://company.com/') do |spider|
spider.every_page do |page|
puts ">>> #{page.url}"
page.search('//meta').each do |meta|
name = (meta.attributes['name'] || meta.attributes['http-equiv'])
value = meta.attributes['content']
puts " #{name} = #{value}"
end
end
end
Print out the titles from every page:
Ronin::Web::Spider.site('https://www.ruby-lang.org/') do |spider|
spider.every_html_page do |page|
puts page.title
end
end
Print out every HTTP redirect:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_redirect_page do |page|
puts "#{page.url} -> #{page.headers['Location']}"
end
end
Find what kinds of web servers a host is using by inspecting the response headers:
servers = Set[]
Ronin::Web::Spider.host('company.com') do |spider|
spider.all_headers do |headers|
servers << headers['server']
end
end
Pause the spider on a forbidden page:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_forbidden_page do |page|
spider.pause!
end
end
Skip the processing of a page:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_missing_page do |page|
spider.skip_page!
end
end
Skip the processing of links:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_url do |url|
if url.path.split('/').find { |dir| dir.to_i > 1000 }
spider.skip_link!
end
end
end
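The directory check above can be extracted into a plain predicate and exercised without a live spider. The helper name below is ours for illustration, not part of the Ronin::Web::Spider API:

```ruby
# Hypothetical helper (not part of the gem): true when any path
# segment parses as an integer greater than 1000. Note String#to_i
# returns 0 for non-numeric segments, so they never match.
def deep_numeric_path?(path)
  path.split('/').any? { |dir| dir.to_i > 1000 }
end

puts deep_numeric_path?('/posts/1234/comments')  # true
puts deep_numeric_path?('/posts/42/comments')    # false
```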
Detect when a new host name is spidered:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_host do |host|
puts "Spidering #{host} ..."
end
end
Detect when a new SSL/TLS certificate is encountered:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_cert do |cert|
puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
end
end
Print the MD5 checksum of every favicon.ico file:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_favicon do |page|
puts "#{page.url}: #{page.body.md5}"
end
end
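`String#md5` here comes from ronin-support's core extensions; with plain Ruby the same digest is available from the `digest` standard library:

```ruby
require 'digest'

# Stdlib equivalent of page.body.md5. The body below is a stand-in
# string, not real favicon bytes.
body = "\x00\x00\x01\x00"
puts Digest::MD5.hexdigest(body)
```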
Print every HTML comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_html_comment do |comment|
puts comment
end
end
Print all JavaScript source code:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript do |js|
puts js
end
end
Print every JavaScript string literal:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript_string do |str|
puts str
end
end
Print every JavaScript comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript_comment do |comment|
puts comment
end
end
Print every HTML and JavaScript comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_comment do |comment|
puts comment
end
end
Defined Under Namespace
Classes: Agent, Archive, GitArchive, GitError, GitNotInstalled
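Archive and GitArchive save spidered pages to disk. A minimal stdlib-only sketch of the kind of URL-to-file mapping such an archive performs; `archive_write` is a hypothetical helper for illustration, not the gem's API:

```ruby
require 'fileutils'
require 'uri'
require 'tmpdir'

# Hypothetical sketch: the host becomes a directory, the URL path
# becomes the file path, and directory-style URLs fall back to
# index.html.
def archive_write(root, url, body)
  uri  = URI.parse(url)
  path = uri.path.end_with?('/') ? "#{uri.path}index.html" : uri.path
  file = File.join(root, uri.host, path)
  FileUtils.mkdir_p(File.dirname(file))
  File.write(file, body)
  file
end

root = Dir.mktmpdir
file = archive_write(root, 'http://example.com/about/', '<html>...</html>')
puts file.delete_prefix(root)  # /example.com/about/index.html
```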
Constant Summary
- VERSION = '0.2.0'
ronin-web-spider version
Class Method Summary
- .domain(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the entire domain.
- .host(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the given host.
- .site(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the web-site located at the given URL.
- .start_at(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and begins spidering at the given URL.
Class Method Details
.domain(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the entire domain.
# File 'lib/ronin/web/spider.rb', line 399

def self.domain(name,**kwargs,&block)
  Agent.domain(name,**kwargs,&block)
end
.host(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the given host.
# File 'lib/ronin/web/spider.rb', line 351

def self.host(name,**kwargs,&block)
  Agent.host(name,**kwargs,&block)
end
.site(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the web-site located at the given URL.
# File 'lib/ronin/web/spider.rb', line 375

def self.site(url,**kwargs,&block)
  Agent.site(url,**kwargs,&block)
end
.start_at(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and begins spidering at the given URL.
# File 'lib/ronin/web/spider.rb', line 327

def self.start_at(url,**kwargs,&block)
  Agent.start_at(url,**kwargs,&block)
end