Tutorial: How to create a scraper using the Solvent and Piggy Bank Firefox plugins.
Step 1. Installation
You need to install the necessary plugins for Firefox; these are:
- Piggybank: http://simile.mit.edu/wiki/Piggy_Bank
- Solvent: http://simile.mit.edu/wiki/Solvent
Piggy Bank also requires the Java Plugin for Java 1.5 or later to be installed in your browser. Get it from http://www.java.com/ (on Mac OS X, you can skip this step, since Java is already installed). More about installing Piggy Bank can be found at the Piggy Bank Installation Wiki.
We came to the conclusion that you definitely need Firefox 2.0 for all this, which you can get here: Firefox 2.0.
Make sure to rename the Firefox 2 application from Firefox.app into something else, for example FF2LovesPiggies.app, and/or don't place it directly in the Applications folder, or else it'll overwrite all your previously saved preferences, bookmarks etc. For example: save the old version of Firefox onto your desktop, unpack the .dmg file, rename Firefox.app to FF2LovesPiggies.app and just run it from your desktop.
Step 2. What do you want to scrape?
For this example, we will use Google Search at Google.com. Now that you've installed Piggy Bank and Solvent, you'll see a Piggy Bank icon at the top of your browser, and a 'spray can' icon (Solvent) and a 'coin' icon (sends data into Piggy Bank) in the status bar at the bottom right.
- Open Google.com (in a new Tab)
- Set your Google preferences to 100 results per page to get the maximum number of results per page, then save your preferences.
- Do a query/search for any word in Google.
- Click on the spray can icon in the bottom right to pop up the Solvent panel.
- Click on the 'capture' icon in the panel and guide your mouse to the data you want to get. In the right panel, you can see what you've captured, both in XPath format and, below that, as Item(s).
- Click the arrow of the first item of the set to open it and start naming its parts: select the line that contains the first word/sentence of the item's URI, title or description, click on 'Name' in the upper right corner, and pick 'Item's URI', 'Item's title' or 'Item's description' accordingly. Repeat this naming process for each of the three.
- Click 'generate' in the right panel and the code will appear in the left panel.
- Click 'run' in this panel to check in the right panel whether you're getting the data you want.
One of the things you might notice in the output is that it doesn't get all the text for the title, for example. You can fix this by replacing 'nodeValue' with 'textContent' and deleting '/text()[1]' in the following code (around the seventh line of the generated script):
try {
    data.addStatement(uri, dc + 'title', cleanString(getNode(document, element, './H2[1]/A[1]/EM[1]/text()[1]', nsResolver).nodeValue), true);
} catch (e) { log(e); }
will become:
try {
    data.addStatement(uri, dc + 'title', cleanString(getNode(document, element, './H2[1]/A[1]/EM[1]', nsResolver).textContent), true);
} catch (e) { log(e); }
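Why this fix works: '/text()[1]' selects only the first text node inside the element, and 'nodeValue' returns just that node's text, so anything inside nested tags is lost. Dropping '/text()[1]' selects the element itself, and 'textContent' concatenates the text of all its descendants. A minimal illustration using plain DOM calls, outside Solvent (the markup is made up):

// Suppose the highlighted part of a title renders as:
// <em>scraping <b>tools</b> fast</em>
var em = document.createElement('em');
em.innerHTML = 'scraping <b>tools</b> fast';
alert(em.firstChild.nodeValue); // "scraping " -- only the first text node
alert(em.textContent);          // "scraping tools fast" -- all descendant text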
This is basically a scraper for one page. Congrats, you've built your first (?) scraper.
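A note on what those data.addStatement calls do: Piggy Bank stores scraped data as RDF, so each call asserts one subject-predicate-object triple. Judging from the generated code, the fourth argument says whether the object is a literal string (true) or the URI of another resource (false):

// <uri> dc:title "Some page title" -- a literal value, hence true
data.addStatement(uri, dc + 'title', 'Some page title', true);
// <uri> rdf:type <http://simile.mit.edu/ns#Unknown> -- a resource, hence false
data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false);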
Step 3. How much do you want to scrape?
Now we're going to expand our one-page scraper into a multi-page scraper.
- Add another code tab by clicking on the plus ('+') sign on the right of 'Code'
- Click 'insert' and select 'Code to scrape several pages'. Have a quick look at this code and you'll find that Solvent has nicely marked the right place for our one-page code (which we've just made), under the following text:
// This function scrapes a page
and the place for the code that determines which URLs the scraper should apply our one-page code to, under:
// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current
// page.
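In its simplest form, that second function just returns an array of URL strings; the rest of this step fills such an array from Google's navigation links. A minimal hard-coded sketch, just to show the expected return value (the URLs here are made up):

var gatherPagesToScrape = function(document) {
    return [
        'http://www.google.com/search?q=whatever&start=100',
        'http://www.google.com/search?q=whatever&start=200'
    ];
};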
- Now go to the right panel again, click 'Capture' and capture/select the link to page 2 on Google. You'll see that it selects the numbers 3 to 10 as well.
- In the right panel click on Item 1 to open it. If you see only a number and no URL, you've selected it wrong. But you can easily fix this by deleting the '/span' at the end of the XPath in the right panel, so it looks like this:
//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a
instead of like this:
//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a/span
Now you have both the URIs and the page numbers.
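The reason this works: the href lives on the <a> element, while the <span> inside it contains only the page number. A quick illustrative check with plain DOM calls, outside Solvent (assuming Google's navigation markup looks roughly like <a href="/search?...&start=10"><span>2</span></a>):

var link = document.evaluate('//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a',
    document, null, XPathResult.ANY_TYPE, null).iterateNext();
alert(link.href);        // full URL of page 2 -- what the scraper needs
alert(link.textContent); // "2" -- only the page number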
- Give the URL the appropriate name ('Item's URI').
- In the left panel, select:
return [];
- Click 'generate'. The generated code is now placed in the right spot in the left panel. Now we're going to cut and paste the one-page code into this code:
- Go to the first code tab, select all and copy.
- Select the following code from the second tab:
var uri = document.location.href;
data.addStatement(uri, dc + "title", document.title, true);
and paste your one-page scraper over it (replacing it).
- Your code will look like this:
const rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const dc = "http://purl.org/dc/elements/1.1/";
const loc = "http://simile.mit.edu/2005/05/ontologies/location#";
// add other useful namespace prefixes here

// This function scrapes a page
var scrapePage = function(document) {
    var rdf = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
    var dc = 'http://purl.org/dc/elements/1.1/';
    var namespace = document.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        return (prefix == 'x') ? namespace : null;
    } : null;
    var getNode = function(document, contextNode, xpath, nsResolver) {
        return document.evaluate(xpath, contextNode, nsResolver, XPathResult.ANY_TYPE, null).iterateNext();
    }
    var cleanString = function(s) {
        return utilities.trimString(s);
    }
    var xpath = '//div[@id="res"]/div/div[@class="g"]/h2[@class="r"]/a[@class="l"]';
    var elements = utilities.gatherElementsOnXPath(document, document, xpath, nsResolver);
    for each (var element in elements) {
        // element.style.backgroundColor = 'red';
        try {
            var uri = cleanString(getNode(document, element, '.', nsResolver).href);
        } catch (e) { log(e); }
        data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false); // Use your own type here
        // log('Scraping URI ' + uri);
        try {
            data.addStatement(uri, dc + 'title', cleanString(getNode(document, element, '.', nsResolver).textContent), true);
        } catch (e) { log(e); }
        try {
            data.addStatement(uri, dc + 'description', cleanString(getNode(document, element, '.', nsResolver).nodeValue), true);
        } catch (e) { log(e); }
    }
}

// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current
// page.
var gatherPagesToScrape = function(document) {
    var rdf = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
    var dc = 'http://purl.org/dc/elements/1.1/';
    var namespace = document.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        return (prefix == 'x') ? namespace : null;
    } : null;
    var getNode = function(document, contextNode, xpath, nsResolver) {
        return document.evaluate(xpath, contextNode, nsResolver, XPathResult.ANY_TYPE, null).iterateNext();
    }
    var cleanString = function(s) {
        return utilities.trimString(s);
    }
    var xpath = '//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a';
    var elements = utilities.gatherElementsOnXPath(document, document, xpath, nsResolver);
    for each (var element in elements) {
        // element.style.backgroundColor = 'red';
        try {
            var uri = cleanString(getNode(document, element, '.', nsResolver).href);
        } catch (e) { log(e); }
        data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false); // Use your own type here
        // log('Scraping URI ' + uri);
        //alert(uri);
    }
}

// This function is called if there is a failure in any
// of the subscraping invocations
var failure = function(e) {
    alert("Error occurred: " + e);
};

// =========================================================
// first scrape the current page
scrapePage(document);

// then gather the next pages to scrape
var urls = gatherPagesToScrape(document);

// and tell piggy bank to scrape them (and what function should do it)
for each (var url in urls) {
    piggybank.scrapeURL(url, scrapePage, failure);
}
- Just for fun, click on RUN. Note that you can only RUN a script on a page that the scraper is meant for, in this case a Google search results page. But even then you'll get an error for now, because the generated gatherPagesToScrape doesn't yet return the array of URLs it promises, so on to step 4.
Step 4. A bit of editing
Although Solvent has brought us a long way, we're still going to do a little bit of editing in the code we've just created. What you want to do is go to the part of the code that scrapes several pages, which looks like this:
// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current
// page.
(...)
var xpath = '//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a';
var elements = utilities.gatherElementsOnXPath(document, document, xpath, nsResolver);
for each (var element in elements) {
    // element.style.backgroundColor = 'red';
    try {
        var uri = cleanString(getNode(document, element, '.', nsResolver).href);
    } catch (e) { log(e); }
    data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false); // Use your own type here
    // log('Scraping URI ' + uri);
    //alert(uri);
}
}
In this part of the code you want to insert three lines of code:
- var uris = new Array;
- uris.push(uri);
- return uris;
Look at this code to see where you have to place those three lines:
// This function should return an array of URLs (strings)
// of other pages to scrape (e.g., subsequent search results
// pages). It should not include the URL of the current
// page.
(...)
var xpath = '//div[@class="n"][@id="navbar"]/table/tbody/tr/td/a';
var elements = utilities.gatherElementsOnXPath(document, document, xpath, nsResolver);
var uris = new Array;
for each (var element in elements) {
    // element.style.backgroundColor = 'red';
    try {
        var uri = cleanString(getNode(document, element, '.', nsResolver).href);
    } catch (e) { log(e); }
    data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false); // Use your own type here
    // log('Scraping URI ' + uri);
    uris.push(uri);
}
return uris;
}
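If you want to sanity-check the edit before running the full scraper, you could temporarily add an alert of your own (not part of the generated code) after the existing gatherPagesToScrape call at the bottom of the script:

// Optional debugging aid: show which pages will be handed to Piggy Bank.
alert(urls.length + ' extra pages to scrape:\n' + urls.join('\n'));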
Now click RUN.
Step 5. Saving the scraper for next time
We've been working in the 'Code' tab in the left panel all this time, but let's have a look at the 'URLs' tab.
- Click on the 'URLs' tab and click 'Grab' on the right. This will produce the kind of URL pattern that this scraper is suitable for, e.g.:
http://www\.google\.com/search\?hl=en\&q=whatever\&btnG=Google\+Search
- But we want it to apply to all kinds of Google searches, so we have to simplify the URL pattern by getting rid of everything after 'search' and replacing it with either '.*' or '.+':
http://www\.google\.com/search.+
- You'll see 'Match' light up in yellow as long as the URL pattern is formulated correctly.
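The pattern is an ordinary regular expression; the backslashes escape the dots in the hostname so they match literal dots. If you want to convince yourself the simplified pattern does what you expect, you can test it with plain JavaScript (just illustrative; Solvent's yellow 'Match' performs the same check):

// Note the doubled backslashes: '\\.' in a string literal
// becomes the regex escape '\.' (a literal dot).
var pattern = new RegExp('http://www\\.google\\.com/search.+');
alert(pattern.test('http://www.google.com/search?hl=en&q=whatever')); // true
alert(pattern.test('http://www.google.com/maps'));                    // false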
- Now click on the 'save' icon to the left of the 'URLs' tab.
- Under 'Scraper's Info': Give your scraper an appropriate name, such as GoogleSearch.
- And fill in the URI of where you're going to put the file. For Mac users this will look like this if you save it on your Desktop:
file:///Users/YourName/Desktop/GoogleSearch/GoogleSearch
Under 'Author's Info' you can fill in your name.
Under 'Files and URLs':
The number of the Code Buffer indicates the tab that holds your full code; following this tutorial, it'll be the 2nd tab.
Browse to the place where you want to save the code file. For this tutorial, let's say on the Desktop (in a folder with the name of the scraper you've just built), and append .js to the name:
/Users/YourName/Desktop/GoogleSearch/GoogleSearch.js
Do the same for the metadata file, but append .n3 to the name.
Click 'Save' and you'll have the files ready on your desktop.