Mapping Open Data as an Idea
Team Members
Jonathan Gray and Liliana Bounegru (coordinators and facilitators), Mahsa Alimardani, Òscar Coromina, Carlo De Gaetano, Vivian Fernàndez, Salla Laaksonen, Michele Mauri, Trilce Navarrete, Gali Ronen, Albana Shala, Fernando van der Vlist. Visualisations by Federica Bardelli and Michele Mauri, with assistance from Alessandro Piscopo.
Keywords
open data, civil society, wikipedia, digital methods, google search
Introduction
In recent years, open data has transitioned from a relatively niche concept in legal, information policy and technical spheres, to gaining significant traction in political discourse at both global and national levels. Major institutions have promoted the idea of open data, governments around the world have launched open data portals and international organisations have also been active in promoting open data across the globe.
Open data plays a different role in various countries. The Global Open Data Index, an initiative of the Open Knowledge Foundation, ranks 97 countries based on the availability and accessibility of information. The UK has topped the 2014 index, with a score of 97%, followed by Denmark, France, Finland, Australia, New Zealand, Norway, the United States, Germany and India. Sierra Leone, Haiti and Mali are at the bottom of the list, with Guinea closing off the list scoring a mere 10%.
Advocates of open data argue that it enables transparency and accountability, encourages public participation, new civic applications and government efficiency, and allows economic growth (Longo 40-41; Pollock; Shadbolt; Trivedy and Battcock). Critics argue that open data may actually end up empowering the empowered or lead to neoliberalisation and marketisation of public services (Bates 389-391, 394; Gurstein; Longo 41-42).
In this research, we set out to map open data as an idea within the international civil society agenda and on Wikipedia, using digital methods to study the evolution of open data as an idea and the main concerns of the open data movement.
Mapping Open Data within the Civil Society Agenda
This study endeavoured to understand whether open data finds traction beyond the conventional domains of legal, information policy and technical circles. Our area of study was Civil Society Organisations (CSOs) within eight different sectors. Civil society, as defined by the Oxford Handbook of Civil Society, is a nuanced and debated concept that covers elements within society engaged in collective action towards important social and economic goals (Edwards, 2-8). Within this study, we look at the websites of organisations associated with issues that have generated significant societal engagement through petition websites such as avaaz.org and change.org. As such, this is a study of CSOs with a significant digital presence. We queried the occurrence of open data in over 871 different websites of CSOs associated with eight different sectors of civil society through Google search. Our overall results indicated that the level of resonance amongst organisations depends on their approach to open data: whether they treat open data as an instrument for a cause, or promote the concept of open data itself.
Mapping Open Data as a concept on Wikipedia
In this study we use Wikipedia as a platform for cross-lingual and cross-cultural research into the evolution of 'Open Data', utilizing medium-specific features. Wikipedia is especially important for understanding the evolution of the idea of open data, as previous research has indicated that Google highlights Wikipedia entries in its top results, though the criteria behind these rankings remain somewhat obfuscated (Rogers 100). Wikipedia is also an editorial source providing lists and list-building opportunities. As Wikipedia aims at neutrality and verifiability rather than cross-lingual uniformity, same-topic language versions may differ in relation to country-specific interests and national perspectives. The Wikipedia 'Open Data' entry has 21 language versions, namely English, Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Catalan, Serbian, Bulgarian, Russian, Ukrainian, Arabic, Indonesian, Chinese, Japanese, Thai, Hindi, Tamil and Khmer. The language versions are of different lengths and were created over a period of more than 8 years, between 2006 and as recently as 2014.
Research Questions
Research Goal
This study aims to explore open data as an idea, or movement, using digital methods to assess the online sphere. This project utilizes the digital methods tool set of Google Search, Wikipedia medium-specific research, and list building methodology, alongside visualisations.
RQ1: Where, or in which domains of civil society does the concept of open data find traction?
RQ2: How is open data as an issue reflected in the Wikipedia language spheres?
- A. When did open data become an issue in different Wikipedia language spheres?
- B. Are there significant differences between the Wikipedia open data Tables of Contents (TOCs)?
- C. Are there any controversies apparent in the open data Wikipedia articles?
- D. Which open data related topics are mentioned cross-lingually in the open data Wikipedia versions?
- E. Are there common references (sources) across the different Wikipedia open data language versions?
- F. What are the common wiki-links and back wiki-links shared by the different Wikipedia open data language versions?
Methodology
We used various digital methods to study the evolution of the concept open data, namely Google scraping, list-building and Wikipedia medium-specific features.
RQ1: To understand the resonance of open data amongst CSOs, we initially defined the pool of CSOs by building a list of eight different sectors of CSOs. We first aggregated keywords from high-ranking and high-volume petition websites (e.g. avaaz.org and change.org). From these petitions, a list of top topics was identified based on the topics listed: in the case of change.org the topics were listed on the site, while avaaz.org had a questionnaire that asked website users to vote on the most important topics. Lastly, the top petitions were manually reviewed and sorted.
The topics formed the starting point in defining the thematic classifications needed to identify, and organise lists of CSOs. The categories were developed in a three-step process. The first categories were identified to be Human Rights, Environment, Health, Government and Politics. The topics were cleaned to reflect larger issues in which associated organisations with an online presence could be identified.
The CSOs were identified, and their websites were classified into categories. This led to a second category classification, which resulted in: Human Rights, Environment, Health, Transparency and Financial Transparency. A last revision of the key topics, the CSOs and the categories led to one last re-classification of categories, to ensure all major topics were properly represented in our CSO list. Global Health Development, Climate Change, Human Rights, International Transparency, Financial Transparency, International Anti-War Organizations, International Anti-Nuclear Organizations, Media and Democracy formed our final list. Privacy and Global Health were also part of this original set; however, technical difficulties meant that we were unable to complete these areas within the allocated time. These categories were then put through the digital methods approach of list-building through URL links (Rogers 5) from editorial sources, such as Wikipedia lists, and the memberships of related networks and coalitions. Expertise from the team of researchers, such as Albana Shala from Free Press Unlimited, was used to curate the Media Development list.
These lists were then inserted into DMI's Google Scraper tool, to query the number of times these organisations mentioned open data. The results for each can be found in Table 1.0. The data we collected from each website was visualised as the last step of this research.
Table 1.0
Find Website lists here.
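The querying step can be sketched as follows. This is an illustrative reconstruction, not the DMI Google Scraper's own code: it assumes that each website is queried with a site-restricted exact-phrase search, and the function names `site_query` and `build_queries` are hypothetical.

```python
from urllib.parse import urlparse


def site_query(url: str, phrase: str = "open data") -> str:
    """Build a site-restricted exact-phrase query string for one website."""
    # Accept bare hosts ("example.org") as well as full URLs.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return f'site:{host} "{phrase}"'


def build_queries(urls, phrase: str = "open data"):
    """Return one query per distinct host, preserving the input order."""
    seen, queries = set(), []
    for url in urls:
        query = site_query(url, phrase)
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


if __name__ == "__main__":
    # Illustrative input only; the study's URL lists are much longer.
    for q in build_queries(["http://www.greenpeace.org/international/",
                            "example.org", "https://example.org/"]):
        print(q)
```

Collapsing each URL to its host before querying avoids counting the same organisation twice when a list contains several pages of one website.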
RQ2-A: To examine when open data became an issue in the different Wikipedia language spheres, we manually recovered some metrics from the Page Information of each language version, such as length in bytes, creation date, number of editors, number of authors.
RQ2-B: We analyzed the Table of Contents of each language version, using a custom-coded TOC scraper created by Michele Mauri, which utilizes the Wikimedia APIs. We studied the development of each language version's TOC and then compared the TOCs of all the language versions, to examine possible differences in (sub)headings across the language versions, which may reflect a focus on a specific aspect of Open Data, or the absence of such (Rogers 167, 171). The length in bytes of each TOC subsection per language version was also extracted, using the custom-made tool.
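A minimal sketch of how a TOC could be retrieved and measured is given below. It is not the custom scraper used in the study: it assumes the MediaWiki `action=parse` endpoint with `prop=sections`, whose response lists each heading with `toclevel`, `number`, `line` and `byteoffset` fields, and it treats `byteoffset` as the position of the heading in the page wikitext. The sample response and the function names are made up for illustration.

```python
from urllib.parse import urlencode


def toc_request_url(lang: str, title: str) -> str:
    """URL asking one Wikipedia language version for a page's section list."""
    params = {"action": "parse", "page": title,
              "prop": "sections", "format": "json"}
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def toc_lines(response: dict) -> list:
    """Render the section list as indented 'number heading' lines."""
    return ["  " * (s["toclevel"] - 1) + f'{s["number"]} {s["line"]}'
            for s in response["parse"]["sections"]]


def section_lengths(response: dict, page_length: int) -> dict:
    """Bytes from each heading to the next heading (or to the page end).

    Subsections are counted separately from their parent section.
    """
    sections = [s for s in response["parse"]["sections"]
                if s.get("byteoffset") is not None]
    offsets = [s["byteoffset"] for s in sections] + [page_length]
    return {s["line"]: offsets[i + 1] - offsets[i]
            for i, s in enumerate(sections)}


# Hand-written sample in the shape of an API response (not real data).
sample_response = {"parse": {"sections": [
    {"toclevel": 1, "number": "1", "line": "Overview", "byteoffset": 100},
    {"toclevel": 2, "number": "1.1", "line": "Definitions", "byteoffset": 400},
    {"toclevel": 1, "number": "2", "line": "References", "byteoffset": 800},
]}}

if __name__ == "__main__":
    print(toc_request_url("en", "Open data"))
    print("\n".join(toc_lines(sample_response)))
    print(section_lengths(sample_response, page_length=1000))
```

Running the same request against each language code would yield comparable TOCs, which can then be set side by side.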
RQ2-C: We used the Contropedia tool to perform a preliminary review of possible controversies within the English version. The English version was the first created (2006) and has the most edits, and thus most controversy potential.
RQ2-D: We manually examined all language versions for specific topic-related terminology, using a list of open data related arguments and terms, derived from a set of recommended literature linked to from the "Open Data" entry of the Open Government, which included the following entries: Public service delivery and policy; Efficiency and waste; Unlocking innovation and enabling new applications and services; Economic growth and new businesses. We also searched for privacy-related terms, namely: Privacy; Data protection; Fair information principles; Surveillance.
RQ2-E: As national points of view are often related to editors writing in their own language and using a specific set of references (Rogers 166), we analyzed the reference lists of all the 'open data' Wikipedia language versions, in search of commonalities or differences. We extracted the reference lists using the Harvester tool and the Triangulation tool to discover commonalities among them. Harvesting the French reference list proved to be more complex, as some URLs had additional wikiwix.com permalinks, which were displayed by the Harvester, and had to be manually removed to allow for cross-list comparison.
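The comparison of reference lists can be sketched as below. This is a simplified stand-in for the Harvester and Triangulation tools, not their actual implementation: it assumes each language version's references are available as a list of URLs, normalises them to host plus path, and drops any URL hosted on wikiwix.com instead of removing permalinks by hand. The sample URLs and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def normalise(url: str) -> str:
    """Reduce a URL to 'host/path' so trivially different forms match."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def triangulate(reference_lists: dict, drop_hosts=("wikiwix.com",)) -> dict:
    """Map each normalised source to the set of language versions citing it."""
    seen_in = defaultdict(set)
    for lang, urls in reference_lists.items():
        for url in urls:
            key = normalise(url)
            host = key.split("/", 1)[0]
            # Skip archive permalinks that would hide the original source.
            if any(host == h or host.endswith("." + h) for h in drop_hosts):
                continue
            seen_in[key].add(lang)
    return dict(seen_in)


def shared_by(seen_in: dict, n: int) -> list:
    """Sources cited by exactly n language versions."""
    return sorted(k for k, langs in seen_in.items() if len(langs) == n)


# Made-up sample lists, for illustration only.
sample_refs = {
    "en": ["http://opendefinition.org/", "http://example.org/a"],
    "fr": ["http://www.opendefinition.org",
           "http://archive.wikiwix.com/cache/?url=http://example.org/b"],
    "ja": ["https://opendefinition.org/"],
}

if __name__ == "__main__":
    overlap = triangulate(sample_refs)
    print(shared_by(overlap, 3), shared_by(overlap, 1))
```

Note that this normalisation ignores query strings and URL schemes, so it may merge sources that a stricter comparison would keep apart.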
RQ2-F: Using a custom-made script created by Michele Mauri, we extracted all the wiki-links (internal outbound Wikipedia links) and the back wiki-links (internal inbound Wikipedia links) of each language version, using the Wikimedia API. Using the query API, we first searched for all the language links to the English version, resulting in 21 pages. For each page, we then collected all the links to other Wikipedia articles. The original list of 21 pages was then reconciled again using inter-language links. The results were filtered to remove special pages (user pages, categories, templates). A bipartite network of languages and Wikipedia pages was then created. We repeated the same process for each of the language versions, collecting wiki-links and back wiki-links for all language versions of each page, and reconciling them. As inter-wiki linking is manual, some language versions of the same article were not interlinked. We therefore used Google Translate to convert all languages to English, reconciling pages using Google Refine clustering algorithms.
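The construction of the bipartite network can be sketched as follows. This is an illustrative outline rather than the script used in the study: it assumes the links of each language version have already been collected as records with `ns` (namespace) and `title` fields, as returned by the MediaWiki query API, and that a reconciliation table mapping local titles to English titles is available. The sample data and the function names are hypothetical.

```python
from collections import defaultdict


def build_bipartite(links_by_language: dict, to_english: dict) -> set:
    """Return (language, page) edges, with pages reconciled to English titles."""
    edges = set()
    for lang, links in links_by_language.items():
        for link in links:
            # Namespace 0 holds articles; user pages, categories and
            # templates live in other namespaces and are filtered out.
            if link["ns"] != 0:
                continue
            title = to_english.get((lang, link["title"]), link["title"])
            edges.add((lang, title))
    return edges


def languages_per_page(edges: set) -> dict:
    """Map each page node to the set of language nodes linking to it."""
    pages = defaultdict(set)
    for lang, page in edges:
        pages[page].add(lang)
    return dict(pages)


def single_language_share(pages: dict) -> float:
    """Fraction of page nodes connected to exactly one language."""
    return sum(1 for langs in pages.values() if len(langs) == 1) / len(pages)


# Made-up sample input, for illustration only.
sample_links = {
    "en": [{"ns": 0, "title": "Open access"},
           {"ns": 14, "title": "Category:Open data"}],
    "fr": [{"ns": 0, "title": "Libre accès"},
           {"ns": 0, "title": "Etalab"}],
}
sample_reconciliation = {("fr", "Libre accès"): "Open access"}

if __name__ == "__main__":
    pages = languages_per_page(
        build_bipartite(sample_links, sample_reconciliation))
    print(pages, single_language_share(pages))
```

Filtering on the namespace number rather than on title prefixes avoids having to know the localised names of namespaces in each language.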
Findings
RQ1: Our query results told us that about 40% of the CSO websites we studied mentioned open data. The average occurrence of open data per website was 50%, out of 991. Our understanding of the organisations also allowed us to parcel out the different CSO sectors into three types of resonance:
- Those promoting open data as a concept.
- Those for whom open data is instrumental to achieving some other goal.
- Low resonance with open data.
The following findings for each CSO sector are organised according to these 3 types:
International transparency was the CSO group that resonated most strongly with the open data query (results found in CSO List, table 2.6). This group mentioned open data 20,827 times within a pool of 539 websites, of which 240 had no mentions. Within this pool of websites, open data has a considerably high level of resonance. The top 16 websites to mention open data are seen in the first row of the image below. It is notable, however, that this set also had the largest pool of URLs, which had previously been used in open data studies. (20,827 mentions; 539 sites/240 have zero mentions)
Media Development was another high-scoring data set, with 3,693 mentions out of 92 websites (see the results in CSO lists figure 2.8). Only 26 organisations did not mention the term, meaning about 68% of the organisations under our study mentioned the term. Within the pool of websites, open data has a relatively high level of resonance. News organisations with media development programs, such as Deutsche Welle and Internews, were also amongst the top resonating websites. Advocacy media organisations such as Global Voices, ICFJ and Article 19 were amongst the top ranked websites to mention the term.
(3,480 mentions; 81 sites/26 have zero mentions)
Philanthropic organisations did not resonate in significant numbers, with only 214 mentions within a pool of 35 websites. 16 out of the 35 websites we queried mentioned the term, meaning about 45% of our sample pool worked with, or used, the concept of open data. At the top of this list were big foundations like the Wellcome Trust, the MacArthur Foundation, and the Getty Trust. These organisations, however, do not publish much content about their projects; their limited digital footprint may therefore explain their low level of resonance with open data.
(214 mentions; 35 sites/19 have zero mentions)
Human Rights organisations resulted in 330 mentions out of 30 websites.
(330 mentions; 30 sites/11 have zero mentions)
Climate Change organisations had 516 mentions of open data, a relatively small number considering its pool of 75 websites. Over half of this URL set had zero mentions of open data.
(516 mentions; 75 sites/43 have zero mentions)
Democracy oriented organisations had a fair amount of open data mentions, however most institutions were either academic institutions, or political foundations.
(851 mentions; 66 sites/40 have zero mentions)
Low resonance with open data.
Anti-War organisations do not resonate with open data, with only 1 mention, from War Resisters' International, within a pool of 15 websites. (1 mention; 15 sites/14 have zero mentions)
Anti-Nuclear organisations had about 101 mentions of open data, which came only from the greenpeace.org website.
(101 mentions; 21 sites/18 have zero mentions)
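The parenthetical tallies above can be converted into comparable per-sector figures. The sketch below uses only the numbers reported in parentheses for each sector; the dictionary layout and the function names are hypothetical.

```python
# (mentions, sites queried, sites with zero mentions), as reported
# in the parenthetical tallies for each sector.
TALLIES = {
    "International transparency": (20827, 539, 240),
    "Media development": (3480, 81, 26),
    "Philanthropic": (214, 35, 19),
    "Human rights": (330, 30, 11),
    "Climate change": (516, 75, 43),
    "Democracy": (851, 66, 40),
    "Anti-war": (1, 15, 14),
    "Anti-nuclear": (101, 21, 18),
}


def share_with_mentions(sites: int, zero: int) -> float:
    """Fraction of queried sites that mention open data at least once."""
    return (sites - zero) / sites


def mentions_per_site(mentions: int, sites: int) -> float:
    """Average number of mentions per queried site."""
    return mentions / sites


if __name__ == "__main__":
    for sector, (mentions, sites, zero) in TALLIES.items():
        print(f"{sector}: {share_with_mentions(sites, zero):.0%} of sites, "
              f"{mentions_per_site(mentions, sites):.1f} mentions per site")
```

For Media Development this gives about 68% of sites with at least one mention, in line with the share quoted above. Comparing shares rather than raw mention counts reduces the effect of the very different pool sizes.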
RQ2-A: In regards to the Wikipedia open data entry, the English version was the first created, in 2006, with the Tamil version interestingly following in 2009. Most language versions were created between 2010 and 2012, while the more recent versions, created in 2014, are Arabic, Serbian and Thai. While the English page has the most edits, the French page has by far the most extensive content (measured by bytes).
Wikipedia 'open data' language versions: date of creation, number of edits and length of page. Retrieved: 17 Jan 2015.
RQ2-B: Although some of the TOC terminology of the open data articles varies across the different Wikipedia language versions, the topics are similar. With the exception of the French TOC, no particular country-specific or language-specific topics are highlighted. The French TOC stands out as the most detailed by far, corresponding with the article's length. The shortest TOCs (such as Arabic, Hindi and Thai) are limited to an overview or definition of the term, links and references.
Wikipedia 'open data' language versions: Tables of Contents. Retrieved: 17 Jan 2015.
RQ2-C: Our analysis of the English version of the open data article using the Contropedia tool revealed no apparent controversial issues or major edits. As Contropedia is mostly used for detecting controversies within an entry rather than for cross-lingual entry comparison, and as no controversies were apparent in the English version, no additional language versions (which have significantly fewer edits) were examined using Contropedia.
Wikipedia 'open data' English version: Contropedia analysis. Retrieved: 17 Jan 2015.
RQ2-D: The most mentioned topics across all Wikipedia language versions of the open data article are transparency and democracy, as well as public service. The English version is the only one to refer to all topics as well as privacy, while the French version, although the largest in size, does not address the issue of privacy, but rather focuses on the civil, public and economic advantages of open data. Efficiency and waste are only present in the Italian version, while privacy is addressed in the English, German and Serbian versions. The Spanish version focuses on the open source, free software and open knowledge movements, while the Italian and Ukrainian pages refer mainly to open government. Some of the language versions, such as the Portuguese and Serbian pages, present arguments in favor of and against open data. Six of the language versions (namely Catalan, Swedish, Thai, Hindi, Khmer and Tamil) do not include any of the above-mentioned terminology.
Wikipedia 'open data' language versions: issue mapping. Retrieved: 17 Jan 2015.
RQ2-E: The twenty-one Wikipedia open data language versions refer to 192 sources in total. The French version has the most references (59), followed by the English version (30) and the Japanese page (15). The Dutch, Catalan and Thai versions have only a single reference each, while the Arabic, Bulgarian, Hindi and Tamil versions have no references at all. No references are common to all language versions. Only one source is common to six of the language versions, namely the Science Commons website. The Open Definition website was common to five of the language versions. Five sources are common to four languages, while nine sources are shared among three language versions and eight sources are common to two language versions. As 168 of all sources appear in only one language version, most references are thus language-version specific. However, the language or origin of the reference sources does not necessarily correspond with the language version, with many pages referencing English-language sources, as may be expected, while the Japanese and Khmer pages, for example, reference Italian sources.
Wikipedia 'open data' language versions: references (sources) network. Retrieved: 17 Jan 2015.
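The distribution of sources reported under RQ2-E can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are hypothetical.

```python
# Number of language versions sharing a source -> number of such sources,
# as reported in the RQ2-E findings.
sources_by_overlap = {6: 1, 5: 1, 4: 5, 3: 9, 2: 8, 1: 168}

total_sources = sum(sources_by_overlap.values())
single_version_share = sources_by_overlap[1] / total_sources

if __name__ == "__main__":
    print(total_sources)                   # 192
    print(f"{single_version_share:.1%}")   # 87.5%
```

The counts add up to the 192 sources reported, and 87.5% of them appear in a single language version only.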
RQ2-F: The wiki-links (internal outbound Wikipedia links) are composed of 940 nodes. 65% of the nodes are related to only one language, with the main links being Open Access, linked by 11 languages, followed by Open Source, linked by 10 languages, and then Internet, Open Content and Tim Berners-Lee. Thus, the most linked pages are related to open topics, such as open government, open source and open access, as well as copyright- and licensing-related topics. Most pages are linked by one language, with the French page linking to 193 unique pages, followed by Russian (39), English (38), Japanese (29), German (23), Spanish (20) and Italian (15). However, many language versions of the same pages were not interlinked within Wikipedia. Using Google Translate and Google Refine, we then reconciled those pages to create lists of unique pages. For most language versions, the unique links are mainly about national initiatives and institutions, yet some languages present interesting features. The Arabic version is the only one linking to censorship and privacy, while the Catalan version has only one unique link, to an open-data economic source. The French version, which is the largest in content, has the most unique links, but the majority are related to generic concepts.
Wikipedia 'open data' language versions: wiki-links network. Retrieved: 17 Jan 2015.
Wikipedia 'open data' language versions: wiki-links word cloud. Retrieved: 17 Jan 2015.
The back wiki-links (internal inbound Wikipedia links) are composed of 567 nodes. 56% of the nodes are related to only one language. The main links are Open Knowledge, Creative Commons and Linked Data, each linked to 8 language versions. The English version is the most linked-to page, followed by the French version. Some of the language versions also have back wiki-links to the English version of the Open Data entry.
Wikipedia 'open data' language versions: back wiki-links network. Retrieved: 17 Jan 2015.
Wikipedia 'open data' language versions: back wiki-links word cloud. Retrieved: 17 Jan 2015.
Discussion
Open data is a rapidly growing topic of interest on both global and national levels, related to various concepts, as is well reflected in the civil society agenda and the open data Wikipedia language spheres. Of the top 10 ranked countries in the Global Open Data Index, only seven have a related language version of the open data article. Countries such as Finland, Denmark and Norway, where open data seems to be an issue of public importance, have no language-specific versions, while Cambodia, which is ranked 76th, has an open data article in Khmer. Thus, there is no necessary correlation between the national significance of open data and its presence on Wikipedia.
In the civil society agenda, the most predominant resonance was found amongst international transparency and media development organisations. However, such open-data related terminology and arguments are not as present on the various Wikipedia language versions, focused mainly on democracy and transparency. The wiki-links and the back wiki-links also indicate a clear relation between open data and the free-software and open-source movements, as most inner-wiki-links (links to pages within Wikipedia) are related to these topics.
Within the civil society agenda, beyond the top resonating organisations to mention open data, a pattern evolved in relevance to open data. Through an examination of the nature of the organisations, and their online content, three types of organisations were identified, namely organisations promoting open data as a concept, organisations for whom open data is instrumental to achieving some other goal and organisations with overall low resonance with open data.
It was expected that organisations that use open data in itself would return high query counts. International transparency and media development organizations had the most mentions, and were most closely aligned with the ethos of open data, as most were digital rights organizations and advocacy initiatives related to access to information. The top ranking websites, however, seemed to belong to government affiliated organisations such as data.gov.uk. Philanthropic organisations returned surprising results, with some of the lowest mentions of open data, possibly because their websites often contain very little content regarding the specifics of their funded programs.
The appearance of open data amongst organisations that use it as an instrument towards another goal would perhaps underline the power of open data to effect change for significant social causes. Democracy promotion organisations, mainly academic institutions and politically-affiliated organizations, had the highest resonance with open data, while within the climate change data set, over half of the websites queried had no mention of open data. Human rights organisations had low resonance; the organisations with the most traction were often government affiliated, while NGOs made less use of the term. We surmised that these NGOs rely on information that is hard to aggregate and distribute through open data sources, such as demographics related to executions, torture, and prisoners. This finding ultimately speaks to the nature of the information and research needed by human rights organisations, an area that open data has yet to enter. The most surprising results belonged to anti-war organisations, which had 1 overall mention of open data. Significant results came from Facebook pages, which the Google Scraper could not query for mentions, and were thus omitted from our findings. However, the numbers in these results do not necessarily represent the traction of open data within this civil society sector, as those participating in this movement may not be organisations, but participants in protests, campaigns, and events which don't maintain digital footprints of their work that fit within the scope of this study. Anti-nuclear organisations only found traction with Greenpeace, which was already part of our climate change data set, so it is unclear if this result is an accurate reflection of this civil society sector.
Thus, in the civil society agenda, open data finds the most significant resonance within areas related to transparency, and media development, corresponding with the Wikipedia language spheres, in which the issue finds traction in a similar area, as a digital commons topic.
Given the limited scope of this study, further research into the ways each website utilises the concept of open data would be worthwhile. This study was not able to examine the depth to which some of these civil society organisations use open data. Furthermore, the list building process for the civil society organisations may have excluded a number of projects and networks that did not maintain strong digital footprints. Additionally, the limitations of time, and difficulties with using the Google Scraper tool with large data sets, meant that significant sets of data (such as our queries for open data amongst privacy organisations) had to be omitted. Thus, we acknowledge the limitations of digital methods in capturing the work of civil society organisations, especially given the fluidity and unconventional structures within civil society, as they may exist as events, protests, or campaigns without associated websites.
Due to time constraints, Google's sense of national spheres (Rogers 101) and language queries were largely not addressed. Further research could therefore be conducted into multiple-language queries of open data, to fully gauge civil society approaches amongst more localised organisations outside of the English language sphere, as has been studied within Wikipedia. The wiki-links and back wiki-links analysis methodology used in this study is in itself of significance, as it offers a unique perspective on interlinked wiki topics, and is certainly worthy of further research.
Conclusion
Open data is thus an evolving concept of rapidly growing interest, reflecting the shifting conceptions and practices of governance and democracy, as well as diverse visions and values of the various topic-related actors in different online spaces. Both the Wikipedia language spheres and the civil society agenda are mainly associated with digital commons, with the Wikipedia entries relating mostly to other open movements, such as open government, open software and open sources, while the civil society agenda is related to issues of digital rights and access to information advocacy. Our findings indicate that open data does not have significant traction outside the communities of digital rights and access to information. Considering the nature of the mediums under study, some of the results may be attributed to the limitations of digital methods. These limitations are inherent both in Wikipedia as a cultural reference for open data, where our findings indicated that the occurrence of open data as a topic often relied on the preferences and availability of language editors to make those pages available in their languages, and in the civil society agenda, where some areas do not have the digital footprints to be discovered through our methodology (i.e. movements or campaigns that maintain only Facebook pages, or a very limited online presence). As such, this research focuses on open data as an issue rather than on the practices of open data usage.
Bibliography
Bates, Jo. The Strategic Importance of Information Policy for the Contemporary Neoliberal State: The Case of Open Government Data in the United Kingdom. Government Information Quarterly. Vol. 31, Issue 3 (2014): 388-395. 17 Jan 2015. <http://www.sciencedirect.com/science/article/pii/S0740624X14000951>
Edwards, Michael, ed. The Oxford Handbook of Civil Society. 1st ed. Oxford University Press, 2011. Web. 17 Jan 2015.
Gurstein, Michael. Open data: Empowering the empowered or effective data use for everyone? First Monday: 16:2-7 (2011). 17 Jan 2015. <http://journals.uic.edu/ojs/index.php/fm/article/view/3316/2764>
Longo, Justin. #Opendata: Digital-Era Governance Thoroughbred or New Public Management Trojan Horse? Public Policy & Governance Review. Vol. 2, No. 2 (2011): 38. 16 Jan 2015. <http://ppgr.files.wordpress.com/2011/05/longo-ostry.pdf>
Pollock, Rufus. A Data Revolution That Works For All Of Us. Open Knowledge Foundation Blog. 24 Sep 2014. 17 Jan 2015. <http://blog.okfn.org/2014/09/24/a-data-revolution-that-works-for-all-of-us>
Rogers, Richard. Digital Methods. Cambridge, MA: MIT Press, 2013.
Shadbolt, Nigel. Britain is at the forefront of the open data revolution. The Guardian. 28 Jun 2012. 17 Jan 2015. <http://www.theguardian.com/commentisfree/2012/jun/28/britain-open-data-revolution>
Trivedy, Roy, and Mike Battcock. An open data revolution, but what's next? Oxfam Policy and Practice Blog. 18 Dec 2013. 16 Jan 2015. <http://policy-practice.oxfam.org.uk/blog/2013/12/open-data-revolution>