Scrapster

Introduction

Social networking or community sites such as Myspace, Facebook and Hyves in the Netherlands have stirred anxiety about the public display of the informal. Researchers of social software have concentrated on what a non-member in particular (a mother, a prospective boss or a teacher) can see about a person. The Scrapster project looks at some of the most popular social networking sites and assesses their scrapability. Scrapability covers the amount and type of data that can be gathered from those sites using tools and applications, which data is protected or restricted, and which data is perceived to be private but can in fact be retrieved. This data can provide insight into several aspects of this space of the web, such as:

  • Demographics
  • Political
  • Social
  • Economic

Sources

Six sources have been selected for analysis within the Scrapster project.

  • Hyves
  • Friendster
  • Linked In
  • Orkut
  • Myspace
  • Facebook

At the moment, however, the project has focused its attention on Hyves.

Policy

Robots.txt files are meant to tell crawlers such as Google's which data a particular site wants to open up to the world and which it does not. Such a file says a lot about the policy of a website, so we went through the list of sources and listed the policy extracted from each robots.txt file.

Site | Robots.txt
Facebook | excludes all bots from /profile.php, /album.php, /photo.php, /feeds.php
Friendster | excludes all bots from /websearch.php, /gallery.php, /usersearch.php, /group/search.php, /searchcollege.php, /searchschool.php
Hyves | allows all bots (index, follow)
Linked In | excludes all bots from /ppl/, /answers/, /answers/browse/using-linkedIn, /answers/using-linkedIn; allows /find
Myspace | excludes Internet Archive bot from all pages
Orkut | allows all bots
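
A policy like the ones listed above can be checked from a script before any scraping starts. The sketch below uses Python's standard urllib.robotparser. The robots.txt text is an assumed reconstruction of the Facebook row, not a copy of the real file, and load_policy and is_allowed are illustrative helper names.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of a robots.txt matching the Facebook row above;
# the real file may have been worded differently.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile.php
Disallow: /album.php
Disallow: /photo.php
Disallow: /feeds.php
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, agent: str = "*") -> bool:
    """Return True if the policy lets the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    policy = load_policy(ROBOTS_TXT)
    for path in ("/profile.php", "/index.php"):
        url = "http://www.facebook.com" + path
        print(path, "allowed" if is_allowed(policy, url) else "excluded")
```

Parsing text that is already in hand keeps the check separate from the download, so the same function can be run against each of the six sources.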

Hyves

Data

Within Hyves, interesting data to scrape can be found in the:

  • Profile
  • Connections/friends
  • Topics/Hyves

Profile

User profile data is interesting because it makes it possible to determine a virtual identity and provides (demographic) data on the specific network. The main issue with retrieving this information is the relative freedom users have in filling in the fields of their profile. When no required fields are present, no predefined set of important fields can be established. Although this is a restriction, scraping the data and analysing the information that did get filled in can still provide a lot of insight, not least because people tend to provide a lot of information willingly. A second issue is the user-based restriction settings, which let users set which information can be viewed by which people. Although some fields, such as email address, are often hidden, many people are eager to share profile information about their interests and demographics.
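
Since no field is required, a first analysis step could be to measure how often each field is actually filled in. The sketch below assumes that scraped profiles have already been turned into dictionaries. The field names and the field_fill_rates helper are hypothetical and not part of the project code.

```python
from collections import Counter


def field_fill_rates(profiles):
    """Return, per field, the fraction of profiles with a non-empty value."""
    counts = Counter()
    for profile in profiles:
        for field, value in profile.items():
            # Empty strings and None count as "not filled in".
            if value is not None and str(value).strip():
                counts[field] += 1
    total = len(profiles)
    if total == 0:
        return {}
    return {field: n / total for field, n in counts.items()}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"city": "Amsterdam", "email": ""},
        {"city": "Utrecht", "interests": "music"},
    ]
    print(field_fill_rates(sample))
```

Fields that never appear with a value are simply absent from the result, which gives a rough picture of which fields are worth scraping.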

Topics/Hyves

Hyves that relate to a topic (a group) instead of a specific user are interesting when looking at the issues that are current in the social network space. In combination with the profile information, the topic space of the network or of a certain demographic group can be made visible.

Connections/Friends

Scraped information about connections/friends can be used to map the network that links the data of specific users or user groups.
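
One way to hold such a network is an adjacency map built from scraped friend pairs. This is a minimal sketch that assumes friendships are mutual and that each scraped connection is available as a pair of user names. The names and helper functions are made up for illustration.

```python
from collections import defaultdict


def build_network(friend_pairs):
    """Build an undirected adjacency map from (user, friend) pairs."""
    network = defaultdict(set)
    for a, b in friend_pairs:
        if a == b:
            continue  # ignore self-links
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def mutual_friends(network, a, b):
    """Return the users that are connected to both a and b."""
    return network.get(a, set()) & network.get(b, set())


if __name__ == "__main__":
    pairs = [("anna", "bob"), ("bob", "carla"),
             ("anna", "carla"), ("carla", "dirk")]
    net = build_network(pairs)
    print(sorted(net["carla"]))
    print(mutual_friends(net, "anna", "bob"))
```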

Scraping issues

Access
Hyves can be accessed without logging into the website, but the information available is then limited. Although Hyves uses a POST to send the login information, it is possible to access the website from a script by using a URL that carries the required information.

A possible way to access the page and get the source data is:

lynx --source -nolist http://www.hyves.nl/friends/myfriends/\?login_username=$gebruikers\&login_password=$password
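
The same URL can also be assembled in a script, which takes care of escaping special characters in the user name or password. This is a sketch only: the endpoint and the parameter names are taken from the lynx command above, build_login_url is an illustrative helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the lynx command above.
FRIENDS_URL = "http://www.hyves.nl/friends/myfriends/"


def build_login_url(username: str, password: str, base: str = FRIENDS_URL) -> str:
    """Return the friends-page URL with the credentials in the query string."""
    query = urlencode({"login_username": username, "login_password": password})
    return f"{base}?{query}"


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(build_login_url("example user", "p&ssword"))
```

The resulting string can be handed to lynx or to any HTTP client. Because urlencode escapes characters such as & and spaces, no manual backslash escaping is needed.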

Code
Hyves uses basic HTML to generate its pages. By analysing the HTML, the preferred data can be scraped and stored as XML or in a database. The first coding issue is the HTML itself. When new features or fields are added, the make-up of the HTML changes. Because scraping tools rely on the specific structure of the HTML, such changes can cause a script to fail. Next to HTML, Hyves uses AJAX to retrieve information from the server, for instance to display a list of friends. Because of this, only the first 15 results are present in the HTML of the friends page. Scraping multiple pages within a user's Hyves pages therefore requires a close look at the source code and the HTML tree to see from which location the information can be gathered.
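
To illustrate how a scraper depends on the page structure, the sketch below pulls friend links out of static HTML with Python's standard html.parser. The markup is hypothetical: the class name "friend" and the sample HTML are assumptions, and the real Hyves source has to be inspected to find the right hooks. Results loaded later through AJAX would not be seen by this code.

```python
from html.parser import HTMLParser


class FriendLinkParser(HTMLParser):
    """Collect (href, name) pairs from <a class="friend"> elements.

    The class name "friend" is a hypothetical marker, not taken from
    the real Hyves markup.
    """

    def __init__(self):
        super().__init__()
        self.friends = []
        self._href = None  # href of the friend link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "friend" in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.friends.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_friends(html: str):
    """Return the friend links found in the static HTML."""
    parser = FriendLinkParser()
    parser.feed(html)
    parser.close()
    return parser.friends


if __name__ == "__main__":
    sample = ('<ul><li><a class="friend" href="/anna">Anna</a></li>'
              '<li><a href="/help">Help</a></li></ul>')
    print(extract_friends(sample))
```

An empty result on a page that is known to list friends is a useful signal that the markup has changed and the script needs to be adjusted.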

Hyves account data: who may see what?

Display name: always shown.
This consists of a name or nickname.

First and last name: visible to friends at the least
The following data can be visible to others:
  • First name
  • Last name

The level of visibility can be set as follows:
  • Friends only
  • Friends of friends
  • Hyvers
  • Everyone

The condition is therefore that the first and last name cannot be invisible. They must be visible to friends at the least.

Other account data: never visible to everyone
The following data can be visible to others:
  • Email
  • Country
  • Place of residence
  • Mobile number
  • MSN
  • Birthday
  • Year of birth

The level of visibility can be set as follows:
  • Nobody
  • Friends only
  • Friends of friends
  • Hyvers

These data are never visible to everyone. Only people with a Hyves account can potentially gain access to them.

Visibility cannot be configured for the following data:
  • Address
  • Postal code
  • Google Maps

Display of blogs
Some blogs are not shown when the visitor is not logged in with a Hyves account, because the level of visibility of a blog can be configured.

The following settings are possible:
  • Nobody
  • Friends only
  • Friends of friends
  • Everyone

A blog can thus be anything from private to its author to open to everyone. Notably, there is no option to publish a blog for Hyvers only. A blog can also be posted in a Hyve, in which case the blog is linked to that Hyve.

Polls
With a poll, someone can ask a multiple-choice question. Polls can be addressed to the following groups:
  • Friends only
  • Friends of friends
  • Hyvers

Polls are therefore not accessible without a Hyves account.
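
The visibility rules above can be summarised in a small model that answers which kinds of data a scraper could ever reach without logging in. This is a sketch of the settings as described in this section, not of how Hyves implements them. The Audience ordering, the dictionary keys and the helper names are assumptions made for illustration.

```python
from enum import IntEnum


class Audience(IntEnum):
    """Visibility levels from the lists above, ordered narrow to wide."""
    NOBODY = 0
    FRIENDS = 1
    FRIENDS_OF_FRIENDS = 2
    HYVERS = 3      # any logged-in Hyves member
    EVERYONE = 4    # including visitors who are not logged in


A = Audience

# Settings a user can choose per kind of data, as listed above.
ALLOWED_SETTINGS = {
    "name": {A.FRIENDS, A.FRIENDS_OF_FRIENDS, A.HYVERS, A.EVERYONE},
    "account_data": {A.NOBODY, A.FRIENDS, A.FRIENDS_OF_FRIENDS, A.HYVERS},
    "blog": {A.NOBODY, A.FRIENDS, A.FRIENDS_OF_FRIENDS, A.EVERYONE},
    "poll": {A.FRIENDS, A.FRIENDS_OF_FRIENDS, A.HYVERS},
}


def is_visible(setting: Audience, viewer: Audience) -> bool:
    """Check a setting against the narrowest audience the viewer belongs to.

    Use EVERYONE for a visitor who is not logged in.
    """
    if viewer is Audience.NOBODY:
        raise ValueError("a viewer cannot belong to the NOBODY audience")
    return setting >= viewer


def reachable_without_login(kind: str) -> bool:
    """True if some allowed setting exposes this kind of data to everyone."""
    return Audience.EVERYONE in ALLOWED_SETTINGS[kind]


if __name__ == "__main__":
    for kind in ALLOWED_SETTINGS:
        print(kind, reachable_without_login(kind))
```

In this model, only names and blogs can ever be reached by a scraper that does not log in. Account data and polls always require at least a Hyves account.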

Hyves privacy

To check privacy in Hyves, a particular private Hyve was scraped to see whether or not its data was scrapable. The following steps have been taken:

The result was that, although these users were under the impression that they were part of a private Hyve, their information could still be scraped.

Facebook

Various privacy settings within profile features & contact information.

  • As Facebook Grows, Longtime Users Draw Privacy Veil - Mary Jane Irwin 07.17.07
  • Private Facebook Pages Are Not So Private - Ryan Singel 06.28.07
  • CIA Gets in Your Face(book) - Chaddus Bruce 01.24.07
  • Privacy Fears Shock Facebook - Michael Calore 09.06.06

See also

Social networking site visualization (video)

Dapper content scraper demo


Topic revision: r18 - 31 Aug 2007, ErikBorra