Track the Trackers

Team Members

Yngvil Beyer, Erik Borra, Carolin Gerlitz, Anne Helmond, Koen Martens, Simeona Petkova, JC Plantin, Bernhard Rieder, Lonneke van der Velden, Esther Weltevrede

Introduction

“The cloud” is something of a buzzword, and what it refers to can be difficult to grasp. This project aims to make (some parts of) the cloud tangible. It focuses on devices that track users online and thereby reveal users' encounters with the cloud, both those that require the user's active participation (through widgets) and those that are automated (through tags, web bugs, pixels and beacons). For this purpose we have repurposed Ghostery, a browser plugin that informs users which companies are present on the websites they visit, to build a custom tool for tracker detection. We focus on automated tracking devices, which operate by default once a user requests a website, and on widgets, including social buttons, which require user action to set further data transmission in motion. We use a wide definition of “tracker” that includes a number of devices that allow for user-data collection, such as internal tracking devices, bugs, widgets, external analytics services and further interfaces to the cloud.


The newly developed tool also allows us to create connections among websites, defining relations based on their connection to the same tracking devices, which gives insight into the fluidity of content. In short, by repurposing Ghostery we are able to characterize different collections of URLs. We are further interested in studying tracking ecologies across a number of URL collections, issue spaces or web spheres: to see whether specific trackers are at work in particular countries, whether data-protective countries or web spheres deploy fewer tracking devices, and whether countries like Iran use trackers from major US corporations. In addition, we are interested in which trackers are at work in the news sphere and in specific issue spaces, such as health/addiction sites, adult and children's sites, privacy-concerned sites and technology blogs.

The wider aim of the project is to explicate and make more concrete the rather abstract claims about ongoing dataveillance in the back end, by providing detailed insights into the ecology and economy of tracking.

Research questions

General questions:
  • What is the relative presence of the different trackers?
  • Are there clusters of trackers, i.e. which trackers can be found on the sites of the trackers themselves?
  • What is the reach of specific trackers?
  • What is the political economy of the cloud when looking at ‘cloud technology’, i.e. where are the ‘power concentrations of the cloud’?
Sub-questions related to specific types of URLs:
  • Which trackers are at work in specific issue spaces, such as health/addiction sites, privacy-concerned sites or technology blogs?
  • What type of statistics do the privacy advocate sites recommend?
  • How ‘likable’ (or ‘social’ or ‘tracky’) are different URL lists, for instance adult sites compared to children’s sites?
  • Do data-protective countries or web spheres deploy fewer tracking devices? Do countries such as Iran use trackers from major US corporations?
Sub-questions related to the news sphere:
  • How 'social' is the news?
  • How 'tracky' is the news?
  • How did specific sites (e.g. the New York Times) become more social or tracky over time?
Further research:
  • Can we visualize the visible sharing elements (buttons) and the invisible tracking elements (cookies)?
  • Are there friendly relationships between beacons?
  • Self-hosted versus cloud-hosted analytics
  • When did the trackers start and when did they cease to exist? Related: The lifespan of a beacon :)
  • “Back-date” trackers

    • Find out how the tracker code has changed over time, so that we can detect the same technology across periods. For example, the tracking JavaScript for Google Analytics may have been coded differently in 2008 and 2010.
    • Find ‘old’ trackers by looking into the source code of the retrieved pages of the Internet Archive. Add those to the list of trackers. (Date them?)

  • Compare the adult entertainment sphere with the kids and teens entertainment sphere: which one is more likable or trackable?
  • The focus on widgets and tracking devices could be extended to content delivery APIs, in order to see not only what data flows from the website to tracking and analytics services, but also what data flows into the site itself.
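
As a rough illustration of the back-dating idea in the list above, the sketch below checks a page's source for two generations of the Google Analytics script, the older urchin.js and the later ga.js. The signature table and the function name `detect_ga_generations` are assumptions made for this example and are not part of the Tracker Tracker tool; a real study would need to collect such signatures from archived pages.

```python
import re

# Illustrative signature table (an assumption for this sketch, not a
# complete list): script file names of two Google Analytics generations.
GA_SIGNATURES = {
    "urchin.js (older)": re.compile(r"google-analytics\.com/urchin\.js"),
    "ga.js (later)": re.compile(r"google-analytics\.com/ga\.js"),
}


def detect_ga_generations(html):
    """Return the names of all signatures found in the page source."""
    return [name for name, pattern in GA_SIGNATURES.items()
            if pattern.search(html)]


# Hypothetical page fragments, written by hand for the example.
old_page = '<script src="http://www.google-analytics.com/urchin.js"></script>'
new_page = "ga.src = 'http://www' + '.google-analytics.com/ga.js';"

print(detect_ga_generations(old_page))
print(detect_ga_generations(new_page))
```

Keeping the signatures in one table means a newly found ‘old’ tracker can be added, and optionally dated, without changing the detection logic.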

Methodology

Our methodology is based on repurposing Ghostery. We have built a tool on top of Ghostery, designed as an alternative form of hyperlink analysis: rather than following links, it tracks the trackers. We named it ‘Tracker Tracker’.

About Ghostery: “Ghostery is your window into the invisible web – tags, web bugs, pixels and beacons that are included on web pages in order to get an idea of your online behavior.
Ghostery tracks the trackers and gives you a roll-call of the ad networks, behavioral data providers, web publishers, and other companies interested in your activity.” http://www.ghostery.com/

Comparing the “trackiness” of three national spheres
Ghostery itself distinguishes between five different types of elements: analytics, ads, widgets, trackers, privacy. We adopt these categories for our analysis.

Mapping the back-end of the web for a set of URLs
  1. Make URL lists of the top 100 results, ideally based on Alexa Top Sites, in a Google Docs spreadsheet. We composed 25 lists of URL types (the source is Alexa unless mentioned otherwise): news sites, country sites (Germany, Iran), Wired (Archive.org), privacy advocate sites (privacyadvocates.ca), adult industry, trackers (Ghostery), technology and science blogs (Technorati), Alexa top 100 (Dutch, Brazil, Russia), New York Times 1996-2011 (Internet Archive Wayback Machine), social media networking sites (Wikipedia), e-commerce/shopping sites, US embassies (US site), children's sites, dating sites, alcohol & drugs, addictions.
  2. Select three URL lists to compare, e.g. Germany, the Netherlands and Iran, and run them through the “Tracker Tracker” tool: https://testing.digitalmethods.net/tools/beta/trackerTracker/
  3. Map the results using Mondrian (presence of trackers across sites) or Gephi (relations between sites based on tracker presence)
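
The Gephi part of step 3 can be sketched as follows: two sites are connected when they share at least one tracker, and the edge weight is the number of trackers they share. The `detections` data, the site names and both function names are hypothetical; the sketch makes no claim about the actual output format of the Tracker Tracker tool. The CSV uses Source, Target and Weight columns, which Gephi's spreadsheet import recognizes as an edge list.

```python
import csv
import io
from itertools import combinations

# Hypothetical detection results: site -> set of tracker names found there.
detections = {
    "site-a.example": {"Google Analytics", "Facebook Connect"},
    "site-b.example": {"Google Analytics", "DoubleClick"},
    "site-c.example": {"Facebook Connect", "Google Analytics"},
}


def shared_tracker_edges(detections):
    """Return (site, site, weight) tuples; weight = number of shared trackers."""
    edges = []
    for a, b in combinations(sorted(detections), 2):
        weight = len(detections[a] & detections[b])
        if weight:  # skip pairs of sites with no tracker in common
            edges.append((a, b, weight))
    return edges


def edges_to_csv(edges):
    """Serialize the edges as a CSV edge list (Source, Target, Weight)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = shared_tracker_edges(detections)
print(edges_to_csv(edges))
```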
Mapping the increasing complexity of the back-end of the web over time
Proposed method:

  1. Select a website, e.g. http://www.newyorktimes.com
  2. Run it through the Internet Archive Tool: https://tools.issuecrawler.net/beta/internetArchiveWaybackMachineLinkRipper/
  3. For the scope of this study, take one single Archive URL per year
  4. Enter the URLs into the Tracker Tracker tool: https://testing.digitalmethods.net/tools/beta/trackerTracker/
  5. Write down the trackers per year
  6. Visualize the rise of trackers over time
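
Step 3 of the proposed method, taking a single archive URL per year, could be sketched as below. The sketch assumes Wayback Machine URLs that carry a 14-digit timestamp, optionally followed by a flag such as `id_`. Choosing the earliest snapshot of each year is an arbitrary choice made for this example, and all sample URLs except the 1996 `id_` one are invented.

```python
import re

# Wayback URL: /web/<14-digit timestamp><optional flags>/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def one_snapshot_per_year(urls):
    """Keep the earliest snapshot of each year; return {year: archive URL}."""
    chosen = {}
    for url in urls:
        match = WAYBACK.match(url)
        if not match:
            continue  # not a Wayback URL in the assumed format
        timestamp = match.group(1)
        year = int(timestamp[:4])
        if year not in chosen or timestamp < chosen[year][0]:
            chosen[year] = (timestamp, url)
    return {year: chosen[year][1] for year in sorted(chosen)}


snapshots = [
    "http://web.archive.org/web/19961230000000/http://www.nytimes.com/",
    "http://web.archive.org/web/19961112181513id_/http://www.nytimes.com/",
    "http://web.archive.org/web/19970101000000/http://www.nytimes.com/",
]
for year, url in one_snapshot_per_year(snapshots).items():
    print(year, url)
```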

Data and tools

URL collections: https://docs.google.com/spreadsheet/ccc?key=0Amv6UO8S5qbHdF93LVhERngxTWt6ektXVjZuTjRUN2c&hl=en_US#gid=5
Collected with: https://tools.issuecrawler.net/beta/harvestUrls/ and https://tools.issuecrawler.net/beta/linkRipper/

Tracker Tracker tool: https://tools.digitalmethods.net/beta/trackerTracker/
Visualization tools: http://www.theusrus.de/Mondrian/ and http://gephi.org

Findings

Trackers in Top10 Technology blogs according to Technorati

Top10_techblogs_frontpages.png

Trackers in Top10 News websites according to Alexa

News_Trackers_Top_10.png

50 Trackers from Ghostery list

trackers_ghostery.png

Top15 German sites tracked

Untitled.png

Top 10 Children Pre-School sites (Alexa)

TtT_children_sites.png

Top 100 Technology Blogs based on Technorati

Top100_Technology_Blogs_Technorati.png

Tools adjustment requests

To work with Internet Archive URLs

Input: http://web.archive.org/web/19961112181513id_/http://www.nytimes.com/

Array ( [0] => http://web.archive.org/web/19961112181513id_/http [1] => http://www.nytimes.com/ ) Retrieving: http://web.archive.org/web/19961112181513id_/http

> Does the tool break the URL? It outputs http://www.nytimes.com/, which is the live web URL rather than the archive URL. The tool appears to split the input at the second http:// and keep only the second part.
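
One possible way to avoid this, sketched here under the assumption that the tool receives whitespace-separated URLs, is to split the input on whitespace only and never on an embedded `http://`. The function names are hypothetical and do not describe how Tracker Tracker is implemented. Python's `urllib.parse.urlsplit` keeps the embedded original URL inside the path of the archive URL, so the archive URL stays whole.

```python
from urllib.parse import urlsplit


def parse_input_urls(text):
    """Split user input on whitespace only, never on an embedded 'http://'."""
    urls = []
    for token in text.split():
        parts = urlsplit(token)
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(token)
    return urls


def is_archive_url(url):
    """True if the URL points at a Wayback Machine snapshot."""
    parts = urlsplit(url)
    return parts.netloc == "web.archive.org" and parts.path.startswith("/web/")


user_input = (
    "http://web.archive.org/web/19961112181513id_/http://www.nytimes.com/\n"
    "http://www.nytimes.com/"
)
for url in parse_input_urls(user_input):
    print(is_archive_url(url), url)
```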


Tools adjustment requests

Include content delivery sites, such as youtube, flickr, amazon.
How? Detecting the platform URL alone is not sufficient, since a link may merely point to the site's presence on that platform. Candidates are widget codes, embed codes, or content farm URLs such as 6486946117_c8942b21bf_z.jpg.
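
A minimal sketch of this distinction: instead of matching any URL on a platform's domain, match only patterns typical of embedded content, so that a plain link to a profile page is not counted. The patterns below are illustrative and far from exhaustive, the server number in the sample image URL is invented, and `find_embeds` is a hypothetical helper rather than a feature of the tool.

```python
import re

# Illustrative embed patterns (assumptions for this sketch, not exhaustive).
EMBED_PATTERNS = {
    "YouTube embed": re.compile(
        r"https?://(?:www\.)?youtube\.com/(?:embed|v)/[\w-]+"),
    # Image file names of the form <photo id>_<secret>_<size letter>.jpg
    "Flickr image": re.compile(
        r"https?://farm\d+\.static\.?flickr\.com/\d+/\d+_[0-9a-f]+(?:_[a-z])?\.jpg"),
}


def find_embeds(html):
    """Return {pattern name: [matched URLs]} for embedded content only."""
    found = {}
    for name, pattern in EMBED_PATTERNS.items():
        matches = pattern.findall(html)
        if matches:
            found[name] = matches
    return found


# Hand-written sample: one profile link (presence) and two embeds (delivery).
sample_html = (
    '<a href="http://www.flickr.com/photos/example/">Our Flickr</a> '
    '<img src="http://farm8.staticflickr.com/7012/6486946117_c8942b21bf_z.jpg"> '
    '<iframe src="http://www.youtube.com/embed/abc123XYZ_-"></iframe>'
)
print(find_embeds(sample_html))
```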

Repurpose Firebug (?)

Include other plugin tools.

Further projects that use this tool:

Visualizing Facebook’s Alternative Fabric of the Web by Anne Helmond & Carolin Gerlitz, see blogpost for more information.

unlikeus_likeeconomy_web.010.jpg

Trackers used on the websites of Dutch political parties

Trackers on Dutch political websites by Anne Helmond.

A mapping of the use of trackers such as beacons, cookies, plugins, widgets and analytics on the websites of Dutch political parties. > Download full PDF.

trackers_nlpolitiekewebsites.png

Resources

Andrejevic, M. 2007. iSpy: Surveillance and Power in the Interactive Era. Lawrence: University Press of Kansas.
Berry, D. 2011. Philosophy of Software. London: Palgrave Macmillan.
Elmer, G. 2004. Profiling Machines: Mapping the Personal Information Economy. Cambridge, MA: MIT Press.
Langlois, G., McKelvey, F., Elmer, G. and Werbin, K. 2009. Mapping Commercial Web 2.0 Worlds: Towards a New Critical Ontogenesis. Fibreculture 14, online.
Roosendaal, A. 2010. Facebook Tracks and Traces Everyone: Like This! Tilburg Law School Research Paper No. 03/2011. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1717563
Topic revision: r29 - 11 Jun 2012, AnneHelmond