Enter URLs or text into the harvester and choose the depth of the search (example.com/depth1/depth2/depth3).
In the box you can enter URLs. After you click Submit, every unique host among the URLs is checked for a robots.txt file (e.g.
http://www.bla.com/bla/bla/index.html is checked for
http://www.bla.com/robots.txt), and every unique URL is checked for a <meta name="robots" content="bla"> tag. The links found in those URLs are then fetched and the process repeats until the specified depth is reached.
In pseudocode:
harvest the URLs
while i < depth
    for every URL
        get the host
        if host/robots.txt exists    // only checked when the URL's host differs from the previous one
            display robots.txt
        else
            say it didn't exist
        end if
        if a robots meta tag exists  // checked for every URL
            display the meta tag
        else
            say it didn't exist
        end if
        get all the links of the URL
    end for
    i++
end while
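For illustration, below is a minimal Python sketch of the same loop, using only the standard library. The names (harvest, fetch, LinkAndMetaParser), the timeout, and the error handling are assumptions made for this example and are not part of the harvester itself.

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndMetaParser(HTMLParser):
    """Collects <a href="..."> links and the first <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots_meta = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_meta = attrs.get("content") or ""


def fetch(url):
    """Return the body of a URL as text, or None if it cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError):
        return None


def harvest(start_urls, depth):
    urls = list(start_urls)
    seen_hosts = set()
    for _ in range(depth):
        next_urls = []
        for url in urls:
            host = "{0.scheme}://{0.netloc}".format(urlparse(url))
            # robots.txt is only fetched once per unique host
            if host not in seen_hosts:
                seen_hosts.add(host)
                robots = fetch(host + "/robots.txt")
                print(host + "/robots.txt:",
                      "found" if robots is not None else "does not exist")
            page = fetch(url)
            if page is None:
                continue
            parser = LinkAndMetaParser()
            parser.feed(page)
            # the robots meta tag is checked for every URL
            if parser.robots_meta is not None:
                print(url, "-> meta robots:", parser.robots_meta)
            else:
                print(url, "-> no robots meta tag")
            next_urls.extend(urljoin(url, link) for link in parser.links)
        urls = next_urls


harvest(["http://www.example.com/"], depth=2)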
Note: Not all frames are supported. For more information about the Robots Exclusion Protocol, please see
http://en.wikipedia.org/wiki/Robots.txt