This page gives webmasters information about the IssueCrawler.
What is the IssueCrawler?
The IssueCrawler is a spider that locates issue networks on the web. It is used mainly for academic research and by NGOs.
What is the IssueCrawler (in detail)?
The IssueCrawler is web network location software. It consists of a crawler, a co-link analysis engine and three visualisation modules. It is server-side software that crawls the specified sites, captures their outlinks, performs co-link analysis on the outlinks, returns densely interlinked networks, and visualises them in circle and cluster maps, as well as on a geographical map. For user tips, see also the scenarios of use, available at http://www.govcom.org/scenarios_use.htm. For a list of articles resulting from the use of the IssueCrawler, see http://www.govcom.org/publications.html.
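To give a rough idea of what the co-link step does, here is a minimal Python sketch, assuming the usual sense of a co-link (a page that receives links from at least two of the crawled starting points). The function and variable names are illustrative; this is not the actual IssueCrawler code.

    # Illustrative co-link analysis sketch (not the actual IssueCrawler code).
    # A page is kept in the network if it receives links from at least two
    # distinct starting points.
    from collections import defaultdict

    def colink_analysis(outlinks_by_seed, threshold=2):
        # outlinks_by_seed maps each starting-point URL to the set of URLs it links to.
        seeds_linking_to = defaultdict(set)
        for seed, outlinks in outlinks_by_seed.items():
            for target in outlinks:
                seeds_linking_to[target].add(seed)
        # Keep only targets linked from at least `threshold` distinct starting points.
        return {url for url, seeds in seeds_linking_to.items() if len(seeds) >= threshold}

    # Example: only http://issue.example is linked from two starting points.
    network = colink_analysis({
        "http://ngo-a.example": {"http://issue.example", "http://other.example"},
        "http://ngo-b.example": {"http://issue.example"},
    })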
How often will the IssueCrawler access a web page from my web server?
The IssueCrawler crawls your site according to the starting points a user enters (see the
FAQ for more information on this). The IssueCrawler should not access your site more often than once every 7 seconds. If you find that the IssueCrawler places too high a load on your web site, please send an email to newmedia[at]sonologic.nl and we can increase the interval between requests for your site.
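As a rough illustration of this interval, the following Python sketch enforces a minimum delay of 7 seconds between requests to the same host. It is hypothetical and not the actual IssueCrawler code.

    # Hypothetical sketch of a per-host request interval (not the actual IssueCrawler code).
    import time
    from urllib.parse import urlparse

    MIN_INTERVAL = 7.0   # minimum number of seconds between requests to one host
    last_request = {}    # host -> time of the previous request to that host

    def wait_before_fetch(url):
        # Sleep just long enough so that requests to the same host are at
        # least MIN_INTERVAL seconds apart.
        host = urlparse(url).netloc
        if host in last_request:
            elapsed = time.monotonic() - last_request[host]
            if elapsed < MIN_INTERVAL:
                time.sleep(MIN_INTERVAL - elapsed)
        last_request[host] = time.monotonic()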
Why is the IssueCrawler trying to access a robots.txt file that is not on my server?
The standard method for excluding robots from a server is to create a file on the server that specifies an access policy for robots. This file must be accessible via HTTP at the local URL "/robots.txt". For information on how to create a robots.txt file, see
The Robot Exclusion Standard. If you want to prevent "File not found" error messages from appearing in your server log, create an empty file named robots.txt.
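For example, a robots.txt file containing the following record allows every robot to crawl the whole site, which has the same effect as an empty file:

    User-agent: *
    Disallow: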
How do I prevent the IssueCrawler from crawling some or all of my website?
The robots.txt file is used to tell web crawlers which parts of a web site they may not access. The format of the robots.txt file is specified in
The Robot Exclusion Standard. The IssueCrawler obeys the rules in any robots.txt record whose User-agent is given as either "IssueCrawler" or "*", and it crawls only the pages that are not disallowed by those rules.
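For example, the following record in your robots.txt would keep the IssueCrawler out of a /private/ directory (the directory name is only an example); using "Disallow: /" instead would exclude it from the entire site:

    User-agent: IssueCrawler
    Disallow: /private/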
How do I keep the IssueCrawler from following links from a particular web page?
The IssueCrawler obeys the noindex and nofollow meta tags. The Robots META tag allows HTML authors to indicate to visiting robots whether a document may be indexed or used to harvest more links.
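For example, placing the following tag in a page's <head> tells visiting robots not to index that page and not to follow its links:

    <meta name="robots" content="noindex, nofollow">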
Does the IssueCrawler store data from my site?
The IssueCrawler does store the pages retrieved from your site, but only for a short period of time. To do methodologically correct research, we need to be sure that each iteration uses the same page. Once a crawl completes, the pages are removed from the system.
How do I report problems or questions regarding the IssueCrawler?
If you have questions about our IssueCrawler technology, please e-mail them to info[at]digitalmethods.net. Please provide as much detail as possible. Also have a look at the scenarios of use at http://www.govcom.org/scenarios_use.htm.