Robots.txt Discovery

Display a site's robot exclusion policy.
Instructions
Enter URLs, or text containing URLs, into the harvester and choose the depth of the search (example.com/depth1/depth2/depth3).
After you click submit, every unique host among the harvested URLs is checked for a robots.txt file (e.g. http://www.bla.com/bla/bla/index.html is checked for http://www.bla.com/robots.txt), and each unique URL is checked for a <meta name="robots" content="bla"> tag. The links found on those pages are then fetched, and the process repeats up to the specified depth.
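The robots.txt lookup keeps only the scheme and host of a page URL, since robots.txt always sits at the root of the host. A minimal sketch in Python (the function name resolve_robots_url is illustrative, not part of the tool):

    from urllib.parse import urlsplit

    def resolve_robots_url(page_url):
        # Keep only the scheme and host; robots.txt always sits at the root.
        parts = urlsplit(page_url)
        return "%s://%s/robots.txt" % (parts.scheme, parts.netloc)

    # resolve_robots_url("http://www.bla.com/bla/bla/index.html")
    # -> "http://www.bla.com/robots.txt"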
In pseudocode:
harvest the URLs
i = 0
while i < depth
    for every URL
        get the host
        if host/robots.txt exists    // only checked when the URL's host differs from the previous one
            display robots.txt
        else
            say it did not exist
        end if
        if a robots meta tag exists  // checked for every URL
            display the meta tag
        else
            say it did not exist
        end if
        get all the links of the URL
    end for
    i++
end while
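The loop above can be sketched as a runnable Python script using only the standard library. Everything below (the function names, the timeout, and de-duplicating hosts with a global set rather than by comparison with the previous URL's host) is an illustrative assumption, not the tool's actual implementation:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlsplit

    class PageParser(HTMLParser):
        """Collects the robots meta tag and the href of every link on a page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.robots_meta = None
            self.links = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                self.robots_meta = attrs.get("content")
            elif tag == "a" and attrs.get("href"):
                self.links.append(urljoin(self.base_url, attrs["href"]))

    def fetch(url):
        """Return the body of url as text, or None if the request fails."""
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    def discover(urls, depth):
        checked_hosts = set()
        for _ in range(depth):
            next_urls = []
            for url in urls:
                host = urlsplit(url).netloc
                if host not in checked_hosts:  # each host checked only once
                    checked_hosts.add(host)
                    robots = fetch("http://%s/robots.txt" % host)
                    print(host, "robots.txt:", robots or "did not exist")
                page = fetch(url)
                if page is None:
                    continue
                parser = PageParser(url)
                parser.feed(page)
                print(url, "meta robots:", parser.robots_meta or "did not exist")
                next_urls.extend(parser.links)  # links feed the next round
            urls = next_urls

    # discover(["http://www.example.com/"], depth=2)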
Note: Not all frames are supported. For more information about robot exclusion protocols, please see http://en.wikipedia.org/wiki/Robots.txt