Situating code sharing platforms in the journalism ecology

GitHub's resonance in the online data journalism sphere

Project: GitHub as a Transparency Device
Project leaders: Stefania Milan, Liliana Bounegru, Jonathan Gray
Project Facilitator: Cristel Kolopaking
Report by: Lisa Langenkamp, Jonathan Albright, Cristel Kolopaking
Date: 13th of July 2015

Team Members

Jonathan Albright, University of Auckland; Ivo Furman, Istanbul Bilgi University; Lisa Langenkamp, University of Amsterdam; Sjoukje van der Meulen, University of Amsterdam; Janna Joceli Omena, NOVA University of Lisbon.

Keywords

Data Journalism, GitHub, Code Sharing Platforms, Digital Methods.

Introduction

Since 2008, GitHub has gained popularity and become the most widely used platform for code sharing. With the rise of computational journalism, it is no longer only coders who rely on the platform's storage and sharing capacities: journalists increasingly use its repositories to store data collaboratively. In addition, the use of scripts or programs to scrape data for data journalism is becoming ubiquitous and networked through GitHub's platform. It is therefore worth zooming in on the role of GitHub as a code sharing platform within computational or data journalism. This research aims to locate GitHub in the online data journalism sphere, both in terms of its resonance compared to other code sharing platforms and in terms of its usage within data journalism. The online data journalism sphere is demarcated by related events, by data journalism platforms such as Nieman Lab, Source, DataDrivenJournalism, Knight Journalism Lab and School of Data, by LinkedIn groups, and by a three-year Twitter collection of more than one million Tweets containing keywords and hashtags related to data journalism.

Research Questions

  1. Locate GitHub and alternative code sharing platforms in data journalism spheres:
    - What code sharing platforms are being used for journalism? Are there any alternatives or competing platforms to GitHub?
    - What is the resonance of GitHub in the data journalism space? (Google Scraper)
    - What role does GitHub have within this media ecology? (Minor in relation to data journalism)

  2. Document character of GitHub in online spheres:
    - How is GitHub discussed in relation to journalism? (MIT Media Cloud / Twitter)
    - What types of journalism practices are associated with it? (Twitter)
    - What is being said about GitHub in relation to data journalism? (Twitter)
    - What styles of collaboration and participation are discussed in relation to GitHub? (Twitter)
    - What values/skills are attached to it? (LinkedIn)

  3. What kinds of claims are made (situating GitHub in the journalism ecology)?

Methodology

In order to locate and trace the resonance of GitHub and alternative code sharing platforms, we examined various online spheres based on their relation to (data) journalism. We started with a qualitative analysis, searching for links between GitHub and the online journalism sphere both manually and through scrapers. Looking into specialty publications such as Nieman Lab, Source, DataDrivenJournalism, Knight Journalism Lab, School of Data, Columbia Journalism Review and the Tow Center Blog, we found that all of these rely on code and data sharing repositories from GitHub.

Subsequently we went through a list of events related to (data) journalism, which we found by querying Google for [data journalism event], [github event journalism] and [code journalism event]. Although we expected some introductory lectures on the use of GitHub within data journalism, GitHub was not mentioned at any of these events. We also examined 'r/journalism' on Reddit, the largest journalism-related subreddit, to see if there was any discussion of GitHub in relation to journalism. We pulled the most recent 1000 posts through the IO Magic API and queried for any domain, comment, or post text containing the text [github]. There were no mentions of GitHub in the entire dataset. We also ran queries for BitBucket, SourceForge, BeanStalkApp, Gitlab and CodePlex, but these returned no mentions either.

In addition, we consulted the ubiquitous social network Facebook through the Netvizz tool, searching for Pages on data journalism with the queries [github], [data journalism] and [newsroom code] to see if there was any mention of GitHub, but this also led to no results. Finally, we explored the sphere of MOOCs and university courses to see if GitHub was referred to in (data) journalism courses. It turned out that the MOOCs only offer technical courses, for instance on various programming skills, but no GitHub tutorial. The university journalism courses of Berkeley, Stanford and Georgetown did refer to GitHub.

After this exploratory round through different social, journalistic and tech-related spheres, we decided to examine four established spheres, namely Google, Twitter, the news sphere of the MIT Media Cloud, and LinkedIn. In order to find alternative platforms that might compete with GitHub, we ran a Google search in the Lippmannian Device based on queries related to code sharing platforms or 'GitHub alternatives':

Google

  1. We used the DMI Google Scraper for the first 100 search results for the following queries:
    - Code sharing platform - query: "code platform" OR "code sharing platform" OR "coding platform".
    - Data sharing platform - query: “data sharing platform”.
    - Data journalism platform - query: “data journalism platform”.
    - Open journalism - query: “open journalism”.

  2. Then we filtered for ‘GitHub’ in any text in the search results and identified six alternative platforms.

  3. Subsequently we queried the six alternative platforms together with GitHub in the results for their resonance.
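The filtering and counting in steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the DMI Google Scraper or the Lippmannian Device itself: the row fields (`query`, `title`, `snippet`, `url`), the function name `resonance_by_query` and the sample rows are all hypothetical, and the sketch simply counts how many results per query mention each platform name anywhere in their text.

```python
from collections import defaultdict

# Platform names taken from the report; everything else here is illustrative.
PLATFORMS = ["GitHub", "BitBucket", "GitLab", "CodePlex", "Beanstalkapp", "SourceForge"]

def resonance_by_query(results, platforms=PLATFORMS):
    """Return {query: {platform: number of results that mention it}}."""
    table = defaultdict(lambda: {p: 0 for p in platforms})
    for row in results:
        counts = table[row["query"]]  # make sure every query gets a row
        # Search title, snippet and URL together, case-insensitively.
        text = f"{row['title']} {row['snippet']} {row['url']}".lower()
        for p in platforms:
            if p.lower() in text:
                counts[p] += 1
    return dict(table)

# Invented rows standing in for a scraper export (not the collected data).
rows = [
    {"query": "data journalism platform", "title": "Open newsroom tools",
     "snippet": "Code hosted on GitHub and GitLab", "url": "https://example.org/tools"},
    {"query": "data journalism platform", "title": "Spreadsheet basics",
     "snippet": "No code required", "url": "https://example.org/sheets"},
    {"query": "open journalism", "title": "What is open journalism?",
     "snippet": "An overview", "url": "https://example.org/open"},
]

result = resonance_by_query(rows)
for query, counts in result.items():
    print(query, counts)
```

A plain substring match is a deliberate simplification; it counts a result once per platform regardless of how often the name occurs in it.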

News ecology

  1. To offset the bias of the Google Search results for GitHub resonance, we ran a query for ‘GitHub’ (Query A) and ‘data journalism’ (Query B) in MIT’s Media Cloud API against a corpus of 300 million news stories and 4 billion sentences from 2012-12-01 (the first date GitHub appeared in the database) until 2015-07-06.
  2. We also queried GitHub and the alternative platforms in Google Trends.

Twitter

  1. We downloaded the entire TCAT ‘data journalism’ dataset.

  2. We created an extract in Tableau and Excel to explore themes in conversational (discursive) GitHub-related activity.

  3. We did a co-hashtag analysis on #GitHub and on [GitHub] AND #ddj (keyword for data driven journalism) through Gephi.
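As a rough illustration of what a co-hashtag analysis computes, the sketch below builds a weighted edge list in which two hashtags are linked each time they appear in the same tweet. This is an assumption about the general technique, not the TCAT or Gephi implementation; the function name `cohashtag_edges` and the sample tweets are invented for the example.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")

def cohashtag_edges(tweets):
    """Count how often each pair of hashtags co-occurs within one tweet."""
    edges = Counter()
    for text in tweets:
        # Lowercase and de-duplicate so #GitHub and #github are one node.
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        for a, b in combinations(tags, 2):
            edges[(a, b)] += 1
    return edges

# Invented tweets, for illustration only.
sample = [
    "New repo on #GitHub for #ddj maps #dataviz",
    "#ddj workshop uses #github",
    "No hashtags here, just github.com",
]

edges = cohashtag_edges(sample)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

An edge list of this shape (source, target, weight) can be saved as CSV and loaded into a network tool such as Gephi for clustering and layout. Note that the third sample tweet contributes nothing, which mirrors the observation that keyword mentions without a hashtag fall outside a co-hashtag graph.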

LinkedIn

  1. We queried LinkedIn groups with the following terms: 'github', 'data journalism', 'newsroom'.

  2. We created a separate list with the LinkedIn groups.

  3. We used the Google Scraper to look for profiles on ‘GitHub’ and ‘journalism’.

  4. 50 profiles have been manually analysed by looking at the skills.

Findings

By exploring the Google sphere for alternative code sharing platforms to GitHub, we found through snowballing from the [data journalism] results that there are roughly five alternative platforms to GitHub: (1) BitBucket, (2) GitLab, (3) CodePlex, (4) Beanstalkapp and (5) SourceForge. A simple query in Google Trends shows how often each of these platforms has been searched for in comparison to the others and to GitHub. Since Google Trends allows a maximum of five queries and CodePlex drew few requests, we removed it from the comparison. Figure 1 shows that SourceForge was previously a popular platform for code sharing, but that since 2008 GitHub has been searched for more widely, confirming its rising popularity and its dominance as the main code sharing platform as of 2015.

Figure 1: Google Trends analysis for BitBucket, Gitlab, SourceForge, GitHub and Beanstalkapp.

After this quick comparison, we went on to measure the resonance of GitHub and the five alternative platforms within four different spheres, namely ‘journalism’, ‘computational journalism’, ‘data journalism’ and ‘open journalism’. Using the DMI Google Scraper / Lippmannian Device, we aimed to explore the top 100 results for each sphere and compare the resonance of the six platforms. Unfortunately, we were blocked from the scraper for three consecutive days and were therefore unable to complete this search. However, we did obtain some results for the resonance of GitHub within these particular spheres:

Google resonance GitHub

  1. GitHub was represented in only 3 of the top 100 search results on Google in relation to the queries ‘code platform’, ‘code sharing platform’ and ‘coding platform’.

  2. GitHub was represented in only 1 of the top 100 search results on Google in relation to ‘data sharing platform’.

  3. GitHub was represented in only 2 of the top 100 search results on Google in relation to ‘data journalism platform’.

  4. GitHub was represented in only 2 of the top 100 search results on Google in relation to ‘open journalism platform’.

As these results show, GitHub sits in the margins of discussion within the online data/open journalism spheres as well as in the code sharing platform spheres. Set against the Google Trends graph, it is interesting that GitHub is increasingly searched for by Google users, yet does not seem to have penetrated the journalism sphere to the extent that it is discussed as an internal practice. We therefore turned to the news media sphere to find out to what extent GitHub has been discussed there.

GitHub within the media ecology

Figure 2. MIT Media Cloud on GitHub and journalism.

The MIT Media Cloud results presented in figure 2 reflect the disparity in media coverage and the news agenda related to each topic. For GitHub, the media discussion has revolved around the features and technical aspects of the platform (“repository”, “coding”, “fork”, “bug”), whereas the data journalism topics relate to actors in the professional news realm (“guardian”, “fivethirtyeight”, “journalists”), the products of ‘data journalism’ (“visualization”, “handbook”, “spreadsheet”), and the practice of journalism. The “crossover” is a normalized (non-sample weighted) collection of words that appear in both datasets (‘Query A’ and ‘Query B’). There is no clear relation between the two: GitHub remains technical in appearance, and journalism is not yet discussed as a technically influenced practice. Having gained a grasp of how the two topics are discussed separately within the news realm, we then looked into the discussion of GitHub in relation to data journalism in the public sphere of Twitter.
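To make the idea of a “crossover” word set concrete, the following Python sketch computes the words shared by two sets of sentences, with each word's relative frequency in either set. It is only a sketch under stated assumptions: the report does not specify how Media Cloud normalizes counts, so dividing by the total word count per set is an assumption, and the function names and sample sentences are invented.

```python
import re
from collections import Counter

def word_frequencies(sentences):
    """Relative frequency of each word across a list of sentences."""
    words = []
    for sentence in sentences:
        words.extend(re.findall(r"[a-z]+", sentence.lower()))
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def crossover(freq_a, freq_b):
    """Words present in both sets, with their frequency in A and in B."""
    shared_words = freq_a.keys() & freq_b.keys()
    return {word: (freq_a[word], freq_b[word]) for word in sorted(shared_words)}

# Invented sentences standing in for Query A ('GitHub') and Query B ('data journalism').
query_a = ["Fork the repository to fix the bug", "Open data in a repository"]
query_b = ["Journalists publish open data", "A visualization by journalists"]

shared = crossover(word_frequencies(query_a), word_frequencies(query_b))
for word, (fa, fb) in shared.items():
    print(f"{word}: A={fa:.3f} B={fb:.3f}")
```

No stop-word list is applied here, so function words such as "a" show up in the overlap; a real comparison would normally filter them before interpreting the shared vocabulary.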

GitHub on Twitter

Figure 3. Co-hashtag analysis on GitHub and #DDJ.

For the analysis on Twitter we consulted a dataset of over one million Tweets related to data journalism. Querying the hashtag #GitHub in that dataset showed a low resonance of merely [X] tweets containing that hashtag. When we entered the keyword GitHub without the hashtag, however, there were more results, namely [X] tweets. The co-hashtag analysis performed with the query #GitHub showed that #ddj, which stands for ‘data driven journalism’, is quite popular and central to a cluster of related practices. We therefore decided to query for [github] AND [#ddj] in order to see exactly how GitHub is discussed in combination with the keyword for data journalism. Figure 3 shows distinct clusters around related practices: a programming language cluster, an educational cluster, a cartography/mapping cluster, a cluster of related stories, and a data visualization cluster. It is interesting to see that these practices are clearly separated from each other, with the only link running between the data visualization cluster and the practice of mapping or cartography, which is a logical connection.

Figure 4. Twitter timeline on GitHub.

Next to the Gephi output, we created a timeline of the tweets related to [GitHub] from the data journalism collection in Tableau, as can be seen in figure 4. The timeline shows that within the three-year dataset, most of the Tweets on GitHub were created in 2013, with a small decrease in 2014. The bigger dots in the graph represent @mentions, that is, direct conversations between Twitter users about GitHub, in which the term is not tied to a hashtag but can appear anywhere in the sentence. Zooming in on the bigger dots in 2013, it becomes clear that users were mostly discussing troubleshooting problems with their GitHub repositories on Twitter and asking friends for solutions. In 2014, there is a shift towards more promotional messages about GitHub in mentions between users. The prominent line of big green dots relates to the interactive ‘Happy New Year’ story that Twitter produced through GitHub.

Based on the same three-year tweet collection on data journalism, we also produced the top domains for [GitHub], as can be seen in figure 5. Figure 5 shows a strong pattern in outgoing links on Twitter, moving from github.com in 2012 and 2013 to github.io in 2013 and 2014, and moving strongly towards githut.info in 2015. There is also a small presence of twitter.com links. Having located the discursive language on GitHub in relation to data journalism on Twitter, we lastly wanted to know how GitHub is discussed in relation to (data) journalism within the professional sphere. We therefore turned to LinkedIn, in order to get a grasp of the discourse of employees who use GitHub as a benchmark in their profiles.

Figure 5. The aggregate count of outgoing URLs in the GitHub Twitter dataset.
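An aggregate count of outgoing domains per year, of the kind summarised in figure 5, could be produced along the following lines. This is a hypothetical sketch rather than the Tableau workflow used in the project: the input format (pairs of year and URL), the function names and the sample links are assumptions, and reducing a host to its last two labels is a simple heuristic that would be wrong for suffixes such as `.co.uk`.

```python
from collections import Counter
from urllib.parse import urlparse

def base_domain(url):
    """Reduce a URL to its last two host labels, e.g. user.github.io -> github.io."""
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:])

def domains_by_year(links):
    """Count outgoing links per (year, base domain) pair."""
    counts = Counter()
    for year, url in links:
        counts[(year, base_domain(url))] += 1
    return counts

# Invented (year, URL) pairs, for illustration only.
links = [
    (2013, "https://github.com/user/repo"),
    (2014, "http://someuser.github.io/project/"),
    (2014, "https://www.github.com/another/repo"),
    (2015, "http://githut.info/"),
]

counts = domains_by_year(links)
for (year, domain), n in sorted(counts.items()):
    print(year, domain, n)
```

Grouping by base domain rather than full host is what lets project pages hosted under many different `*.github.io` subdomains show up as a single category.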

GitHub on LinkedIn

Figure 6. Computing and coding skills on LinkedIn given by 50 profiles.

When querying LinkedIn for groups related to data journalism, we found 16 groups. Manual, qualitative analysis made clear that there were no discussions of GitHub within these groups. We therefore moved to personal profiles to see if we could define GitHub as a skill or core asset in relation to data journalism. The majority of the users linked to GitHub from their profile, showing their repositories as a technical portfolio. In addition, GitHub is mentioned as a skill in the majority of the profiles. To better understand in what way people conceive of GitHub as a skill, we classified the related skill mentions of the 50 analysed profiles into categories. Figure 6 shows that GitHub is mostly linked to a variety of programming languages, but also to certain software programs.

Conclusion/discussion

The Twitter data suggests that GitHub-related conversational activity on Twitter is largely based around topical data visualization discussion. Drilling down into the individual tweets, we found examples of Twitter being used to promote personal portfolios and GitHub projects; to alert other users directly about data journalism events (notification, event); and to link to teaching and course projects (educational). Many of the tweets linking to GitHub did not have a hashtag, which suggests that much of the discursive activity on Twitter related to GitHub takes place outside the realm of the hashtag.

GitHub, at least as a platform, is not well represented in Google results for related topical searches. Google’s PageRank may not be ideal for measuring the resonance of GitHub. The internal structure of GitHub might not lend itself to tracing from outside the platform, meaning that there are few ‘pages’ that are easily traceable using common digital methods tools (e.g. the Google Scraper). Git pages have few incoming/outgoing hyperlinks, and the platform's users and repositories are connected, updated, merged, and forked inside its own infrastructure.

When scraping Google with the DMI scraper tool, some of the results turned out to be Dutch. We compared the scraper's CSV output (with settings ‘any language’, ‘all regions’, and ‘set to .com’) with web browser queries on google.com and google.nl (with personalisation turned off and signed out of any account), and the DMI scraper data matched the Dutch Google (NL) browser search exactly. The scraper results were therefore not international, but based on Google NL.

Within the professional sphere, LinkedIn appears to be used to show different coding or language skills. GitHub is sometimes listed as a skill, but for the users analysed it is clear that LinkedIn mainly serves as a way to present their portfolio.

Further research

The TCAT dataset could be explored further to see if any temporal patterns in conversation exist (specifically, looking at tweets directly between users related to GitHub and/or tweets that link to github.com), and if so, what the broader conversation about GitHub entails and how it has shifted over time.
Topic revision: r4 - 28 Aug 2019, JannaJoceliOmena