This page is a draft.

The 'raw data' format file is a database dump of a particular network, with some uninteresting fields left out. All fields are written comma-separated to a text file. This page describes every heading and field in that file.

im_network

Provides a description of the network

Field Description
id The network id
schedule_id Id of the schedule which generated this map.
schedule_index Chronological position of the network within the series (0 = the initial map, not actually produced by the scheduler)
crawl_queued Time at which the request to crawl this network was sent
crawl_start Time at which the crawl of this network started
crawl_end Time at which the crawl of this network finished
crawl_timeouts Number of timeouts during the crawl
page_downloads Number of pages downloaded during the crawl
excluded_pages Number of pages excluded during the crawl
num_starting_points Total number of starting points
starting_point_privilege Currently expected values are 0 (do not privilege starting points) or 2 (privilege starting points).
iterations Number of iterations of the algorithm. Expected values are 1, 2 or 3.
depth Depth to which each site is crawled.
co_link_analysis Type of co-link analysis: 1 = by site; 2 = by page
exclusion_list List of sites to exclude. XML.
title Title of the network
minimum_diversity Minimum number of domain categories the network must contain
required_authority Number of inward links a node must receive to be included in the network
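
The expected values listed above can be checked mechanically. The following is a minimal sketch, assuming one im_network record has already been read into a dictionary of strings keyed by field name. The function name `unexpected_fields` and the sample row are hypothetical and not part of the Issuecrawler export.

```python
# Expected values as documented in the field descriptions above.
EXPECTED_VALUES = {
    "starting_point_privilege": {0, 2},
    "iterations": {1, 2, 3},
    "co_link_analysis": {1, 2},
}

def unexpected_fields(network_row):
    """Return the names of fields whose value falls outside the documented set."""
    problems = []
    for field, allowed in EXPECTED_VALUES.items():
        if int(network_row[field]) not in allowed:
            problems.append(field)
    return problems

# Hypothetical record; only the three checked fields are shown.
row = {"starting_point_privilege": "2", "iterations": "4", "co_link_analysis": "1"}
print(unexpected_fields(row))
```

Since the page describes these as "currently expected" values, a non-empty result is better treated as a prompt to look closer than as proof of a corrupt file.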

im_site

Provides a description of the host.

Column Description
id site_id
url URL to be linked to when the map is rendered, usually the homepage.
host the host of the url
name Name of the website or organisation
category e.g. gov/com/org, international/national
authority Number of inward links the site receives from the network
knowledge Number of links from this site to other sites in the network
in_network 1 = in the network; 0 = an “External Site” (not in the network, but part of the set of nodes which generates the network)
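
As an illustration of reading these columns, here is a small sketch that parses im_site rows and keeps only the sites that are in the network. It assumes the rows come with a header line naming the columns in the order listed above. The sample data and the function names `load_sites` and `network_sites` are made up for the example.

```python
import csv
import io

# Hypothetical im_site excerpt; the header row and column order are assumptions.
SAMPLE_IM_SITE = """id,url,host,name,category,authority,knowledge,in_network
1,http://example.org/,example.org,Example Org,org,3,2,1
2,http://example.com/,example.com,Example Com,com,0,1,0
"""

def load_sites(text):
    """Parse comma-separated im_site rows into dicts with integer counts."""
    sites = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in ("id", "authority", "knowledge", "in_network"):
            row[key] = int(row[key])
        sites.append(row)
    return sites

def network_sites(sites):
    """Keep only sites flagged in_network = 1 (drop "External Sites")."""
    return [s for s in sites if s["in_network"] == 1]

sites = load_sites(SAMPLE_IM_SITE)
core = network_sites(sites)
```

Using the `csv` module rather than splitting on commas keeps quoted values intact, for instance a site name that itself contains a comma.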

im_page

Provides a description of a deep-link (aka page).

Column Description
url the full url
site_id the id of the site/host this page belongs to; refers to #im_site
id page_id
date_stamp the date of retrieval of this link

Provides a description of the links between pages.

Column Description
page_id the id of the page - refers to #im_page
target_page_id the id of the page it links to - refers to #im_page
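
The page-level links can be rolled up to site level by joining each page id to its site_id from #im_page. The sketch below counts, per site, how many distinct other sites link to it and how many distinct other sites it links to. Whether Issuecrawler derives the authority and knowledge columns in exactly this way (distinct sites rather than individual links, same-site links ignored) is an assumption. The sample ids and the function name `site_link_counts` are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: page id -> site_id (from im_page),
# and (page_id, target_page_id) pairs from the link table.
page_to_site = {10: 1, 11: 1, 20: 2, 30: 3}
page_links = [(10, 20), (11, 20), (30, 20), (20, 10), (10, 11)]

def site_link_counts(page_to_site, page_links):
    """Map each site to (distinct inward sites, distinct outward sites)."""
    inward = defaultdict(set)
    outward = defaultdict(set)
    for source_page, target_page in page_links:
        src = page_to_site[source_page]
        dst = page_to_site[target_page]
        if src == dst:
            continue  # ignore links within one site
        inward[dst].add(src)
        outward[src].add(dst)
    sites = set(page_to_site.values())
    return {s: (len(inward[s]), len(outward[s])) for s in sites}

counts = site_link_counts(page_to_site, page_links)
```

With the sample data, site 2 receives links from sites 1 and 3 and links out to site 1, so its entry is `(2, 1)`.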

Topic revision: 08 Feb 2005, ErikBorra