Module: Ronin::Web::Spider
- Defined in:
- lib/ronin/web/spider.rb,
lib/ronin/web/spider/agent.rb,
lib/ronin/web/spider/archive.rb,
lib/ronin/web/spider/version.rb,
lib/ronin/web/spider/exceptions.rb,
lib/ronin/web/spider/git_archive.rb
Overview
A collection of common web spidering routines using the spidr gem.
Examples
Start spidering at a given URL:
require 'ronin/web/spider'
Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
# ...
end
Spider a host:
Ronin::Web::Spider.host('solnic.eu') do |agent|
# ...
end
Spider a domain (and any sub-domains):
Ronin::Web::Spider.domain('ruby-lang.org') do |agent|
# ...
end
Spider a site:
Ronin::Web::Spider.site('http://www.rubyflow.com/') do |agent|
# ...
end
Spider multiple hosts:
Ronin::Web::Spider.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
# ...
end
Do not spider certain links:
Ronin::Web::Spider.site('http://company.com/', ignore_links: [%{^/blog/}]) do |agent|
# ...
end
Do not spider links on certain ports:
Ronin::Web::Spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
# ...
end
Do not spider links blacklisted in robots.txt:
Ronin::Web::Spider.site('http://company.com/', robots: true) do |agent|
# ...
end
Print out visited URLs:
Ronin::Web::Spider.site('http://www.rubyinside.com/') do |spider|
spider.every_url { |url| puts url }
end
Build a URL map of a site:
url_map = Hash.new { |hash,key| hash[key] = [] }
Ronin::Web::Spider.site('http://intranet.com/') do |spider|
spider.every_link do |origin,dest|
url_map[dest] << origin
end
end
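The URL map relies on Hash's block-constructor form: the block runs on first access to a missing key and stores a fresh Array, so `<<` always has a target. A stdlib-only sketch of that behavior (the URLs are placeholders):

```ruby
# Auto-vivifying Hash: unseen destination URLs map to a new empty
# Array, so origins can be appended without a nil check.
url_map = Hash.new { |hash,key| hash[key] = [] }

url_map['http://intranet.com/about'] << 'http://intranet.com/'
url_map['http://intranet.com/about'] << 'http://intranet.com/team'

puts url_map['http://intranet.com/about'].inspect
# ["http://intranet.com/", "http://intranet.com/team"]
```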
Print out the URLs that could not be requested:
Ronin::Web::Spider.site('http://company.com/') do |spider|
spider.every_failed_url { |url| puts url }
end
Find all pages which have broken links:
url_map = Hash.new { |hash,key| hash[key] = [] }
spider = Ronin::Web::Spider.site('http://intranet.com/') do |spider|
spider.every_link do |origin,dest|
url_map[dest] << origin
end
end
spider.failures.each do |url|
puts "Broken link #{url} found in:"
url_map[url].each { |page| puts " #{page}" }
end
Search HTML and XML pages:
Ronin::Web::Spider.site('http://company.com/') do |spider|
spider.every_page do |page|
puts ">>> #{page.url}"
page.search('//meta').each do |meta|
name = (meta.attributes['name'] || meta.attributes['http-equiv'])
value = meta.attributes['content']
puts " #{name} = #{value}"
end
end
end
Print out the titles from every page:
Ronin::Web::Spider.site('https://www.ruby-lang.org/') do |spider|
spider.every_html_page do |page|
puts page.title
end
end
Print out every HTTP redirect:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_redirect_page do |page|
puts "#{page.url} -> #{page.headers['Location']}"
end
end
Find what kinds of web servers a host is using by inspecting the response headers:
servers = Set[]
Ronin::Web::Spider.host('company.com') do |spider|
spider.all_headers do |headers|
servers << headers['server']
end
end
Pause the spider on a forbidden page:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_forbidden_page do |page|
spider.pause!
end
end
Skip the processing of a page:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_missing_page do |page|
spider.skip_page!
end
end
Skip the processing of links:
Ronin::Web::Spider.host('company.com') do |spider|
spider.every_url do |url|
if url.path.split('/').find { |dir| dir.to_i > 1000 }
spider.skip_link!
end
end
end
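The directory check above can be extracted into a plain predicate and exercised without a live spider. The helper name below is ours for illustration, not part of the Ronin::Web::Spider API:

```ruby
# Hypothetical helper (not part of the gem): true when any path
# segment parses as an integer greater than 1000. Note String#to_i
# returns 0 for non-numeric segments, so they never match.
def deep_numeric_path?(path)
  path.split('/').any? { |dir| dir.to_i > 1000 }
end

puts deep_numeric_path?('/posts/1234/comments')  # true
puts deep_numeric_path?('/posts/42/comments')    # false
```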
Detect when a new host name is spidered:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_host do |host|
puts "Spidering #{host} ..."
end
end
Detect when a new SSL/TLS certificate is encountered:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_cert do |cert|
puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
end
end
Print the MD5 checksum of every favicon.ico file:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_favicon do |page|
puts "#{page.url}: #{page.body.md5}"
end
end
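`String#md5` here comes from ronin-support's core extensions; with plain Ruby the same digest is available from the `digest` standard library:

```ruby
require 'digest'

# Stdlib equivalent of page.body.md5. The body below is a stand-in
# string, not real favicon bytes.
body = "\x00\x00\x01\x00"
puts Digest::MD5.hexdigest(body)
```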
Print every HTML comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_html_comment do |comment|
puts comment
end
end
Print all JavaScript source code:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript do |js|
puts js
end
end
Print every JavaScript string literal:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript_string do |str|
puts str
end
end
Print every JavaScript comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_javascript_comment do |comment|
puts comment
end
end
Print every HTML and JavaScript comment:
Ronin::Web::Spider.domain('example.com') do |spider|
spider.every_comment do |comment|
puts comment
end
end
Defined Under Namespace
Classes: Agent, Archive, GitArchive, GitError, GitNotInstalled
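Archive and GitArchive save spidered pages to disk. A minimal stdlib-only sketch of the kind of URL-to-file mapping such an archive performs; `archive_write` is a hypothetical helper for illustration, not the gem's API:

```ruby
require 'fileutils'
require 'uri'
require 'tmpdir'

# Hypothetical sketch: the host becomes a directory, the URL path
# becomes the file path, and directory-style URLs fall back to
# index.html.
def archive_write(root, url, body)
  uri  = URI.parse(url)
  path = uri.path.end_with?('/') ? "#{uri.path}index.html" : uri.path
  file = File.join(root, uri.host, path)
  FileUtils.mkdir_p(File.dirname(file))
  File.write(file, body)
  file
end

root = Dir.mktmpdir
file = archive_write(root, 'http://example.com/about/', '<html>...</html>')
puts file.delete_prefix(root)  # /example.com/about/index.html
```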
Constant Summary
- VERSION = '0.2.0'
ronin-web-spider version
Class Method Summary
- .domain(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the entire domain.
- .host(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the given host.
- .site(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the web-site located at the given URL.
- .start_at(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and begins spidering at the given URL.
Class Method Details
.domain(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the entire domain.
# File 'lib/ronin/web/spider.rb', line 399

def self.domain(name,**kwargs,&block)
  Agent.domain(name,**kwargs,&block)
end
.host(name, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the given host.
# File 'lib/ronin/web/spider.rb', line 351

def self.host(name,**kwargs,&block)
  Agent.host(name,**kwargs,&block)
end
.site(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and spiders the web-site located at the given URL.
# File 'lib/ronin/web/spider.rb', line 375

def self.site(url,**kwargs,&block)
  Agent.site(url,**kwargs,&block)
end
.start_at(url, **kwargs) {|agent| ... } ⇒ Object
Creates a new agent and begins spidering at the given URL.
# File 'lib/ronin/web/spider.rb', line 327

def self.start_at(url,**kwargs,&block)
  Agent.start_at(url,**kwargs,&block)
end