This page is meant to give webmasters information about the issuecrawler.

What is the issuecrawler

The issuecrawler is a spider which locates issue networks on the web. It is mainly used for academic research and by NGOs.

What is the issuecrawler (in detail)

The Issuecrawler is web network location software. It consists of a crawler, a co-link analysis engine and three visualisation modules. It is server-side software that crawls specified sites, captures the outlinks from the specified sites, performs co-link analysis on the outlinks, returns densely interlinked networks, and visualises them in circle and cluster maps, as well as on a geographical map. For user tips, see also scenarios of use, available at http://www.govcom.org/scenarios_use.htm. For a list of articles resulting from the use of the Issue Crawler, see http://www.govcom.org/publications.html.
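
To give a rough idea of what co-link analysis means, here is a minimal sketch in Python. It is not the Issuecrawler's actual code. It assumes that a site belongs to the network when at least two different starting points link to it; the function name, the threshold parameter and the example.* hosts are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def colinked_targets(outlinks, min_inlinks=2):
    """Return the targets that receive links from at least `min_inlinks` distinct sources.

    `outlinks` maps each crawled site to the list of sites it links to.
    The threshold of 2 is an assumption made for this sketch.
    """
    sources_by_target = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            if target != source:  # a site linking to itself does not count
                sources_by_target[target].add(source)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) >= min_inlinks
    }


# Hypothetical outlink data, for illustration only.
outlinks = {
    "a.example": ["c.example", "d.example"],
    "b.example": ["c.example", "e.example"],
}
network = colinked_targets(outlinks)
```

In this example only c.example is kept, because it is the only site that both starting points link to.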

How often will the issuecrawler access a web page from my web server?

The issuecrawler crawls your site based on the starting points that a user enters (see the FAQ for more information on this). It should not try to access your site more often than once every 7 seconds. If you find that the issuecrawler places too high a load on your web site, please send an email to newmedia[at]sonologic.nl, and we can increase the interval between requests to your site.
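
The 7-second interval can be pictured as a per-host throttle. The following Python sketch shows one possible way to enforce such an interval. It is an illustration under assumptions, not the issuecrawler's own implementation; the class name HostThrottle and its parameters are hypothetical.

```python
import time


class HostThrottle:
    """Keep at least `min_interval` seconds between two requests to the same host."""

    def __init__(self, min_interval=7.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable, so the class can be tested without waiting
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until `host` may be requested again; return the seconds waited."""
        now = self.clock()
        last = self.last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self.sleep(delay)
        self.last_request[host] = now + delay
        return delay
```

A crawler built this way would call wait(host) before every request. Raising min_interval for a single host corresponds to the longer interval you can ask for by email.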

Why is the issuecrawler trying to access a robots.txt file that is not on my server?

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The issuecrawler requests this file to check your policy, which is why the request appears in your log even when the file does not exist. For information on how to create a robots.txt file, see The Robot Exclusion Standard. If you want to prevent the "File not found" error messages from appearing in your server log, create an empty file named robots.txt.

How do I prevent the issuecrawler from crawling some or all of my website?

The robots.txt file is used to keep web crawlers away from all or part of a web site. Its format is specified in The Robot Exclusion Standard. The issuecrawler obeys the records in which the User-Agent is specified as either "IssueCrawler" or "*", and skips the pages those records disallow. In other words, the issuecrawler crawls only the web pages that your robots.txt does not disallow.
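
As an illustration, the Python sketch below holds a small robots.txt that keeps the issuecrawler out of one directory and checks it with the robots.txt parser from Python's standard library. The /private/ path and the www.example.org URLs are hypothetical. The parser shows how a standards-following robot reads the file; it is not the issuecrawler's own code.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: the issuecrawler may fetch everything except /private/.
# Use "User-agent: *" instead to apply the same rule to all robots.
ROBOTS_TXT = """\
User-agent: IssueCrawler
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

allowed_public = robots.can_fetch("IssueCrawler", "http://www.example.org/index.html")
allowed_private = robots.can_fetch("IssueCrawler", "http://www.example.org/private/report.html")
```

Here allowed_public is True and allowed_private is False. To keep the issuecrawler away from the whole site, the rule would be "Disallow: /".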

The issuecrawler obeys the noindex and nofollow meta-tags. The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.
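
The Robots META tag goes in the head of an HTML page. The Python sketch below contains an example page with such a tag and shows how a robot could read it with the standard library's HTML parser. The class name RobotsMetaParser and the sample page are hypothetical, and the code only illustrates the idea; it is not taken from the issuecrawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Example page that asks robots not to index it and not to follow its links.
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Example</title>
</head><body><a href="/next.html">next</a></body></html>"""

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)
may_index = "noindex" not in meta_parser.directives
may_follow = "nofollow" not in meta_parser.directives
```

For this page both may_index and may_follow are False, so a robot that obeys the tag would neither index the page nor harvest the link to /next.html.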

Does the issuecrawler store data from my site?

The issuecrawler stores the pages it retrieves from your site, but only for a short period of time. To do methodologically sound research, we need to be sure that each iteration of a crawl uses the same version of a page. After a crawl completes, the pages are removed from the system.

How do I report problems or ask questions about the issuecrawler?

If you have questions about our issuecrawler technology, please e-mail them to newmedia[at]sonologic.nl. Please provide as much detail as possible.

Where can I find more information about the issuecrawler?

Have a look at http://www.govcom.org/scenarios_use.htm
Topic revision: r3 - 06 Jul 2006, RichardRogers