Edinburgh

Team Members

Moritz Büchi, Benjamin Koeck, & Liu Yang

Introduction

Twitter can show which issues and themes are associated with a city and what people in that city are talking about. Edinburgh is a good case for investigating this: it is a significant British city, yet it is not as much of a brand as London. Tweets mentioning Edinburgh or sent from Edinburgh are therefore likely to concern actual city issues. We investigated two different approaches to exploring the content of tweets that mention Edinburgh. That is, we are interested in which general themes are closely associated with the city of Edinburgh and, in a second step of analysis, who the proponents of certain topics are.

Data: We used a one-week sample of tweets containing the keyword Edinburgh in any of the European languages, collected by TCAT via the 1% Twitter API. This data set contains roughly 76,000 tweets (41,000 users) from June 23-26, 2014. Additionally, we used a collection of geo-located tweets originating from Edinburgh in the same week, which contains 47,000 tweets from only 5,000 users.

Research Questions

RQ1: Which topics are associated with the city of Edinburgh?

RQ2: Who are the active Edinburgh tweeters?

RQ3: Can we separate local and tourist tweeting activity?

Methodology

Identifying topics: method 1, hashtag clusters

The first method uses Twitter's own structuring mechanism, the hashtag. For the week in question, a co-hashtag network was exported and analyzed in Gephi. Each node is a frequently used hashtag (more than 30 times in the corpus). An edge indicates that two hashtags co-occur, and its thickness indicates how frequently. Modularity calculations revealed a couple of interesting clusters. One of the two biggest hashtag clusters is about jobs and hiring in Edinburgh. Another concerns the political movement that pushes for Scottish independence. Its main hashtag, indyref, is associated with other hashtags such as RT, yes, hopeoverfear, and bizforscotland. The largest part of the co-hashtag network, though without a clear set of dominating nodes, is about travel in the UK, or in Scotland and Edinburgh in particular (green cluster in the figure below). Related hashtags are travel, trip, photography, architecture, and summer.
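The construction of such a co-hashtag network can be illustrated with a minimal Python sketch. This is not the TCAT export or the Gephi workflow used for the analysis; the sample tweets, the lowered threshold, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample tweets; the real corpus was collected with TCAT.
tweets = [
    "Vote #yes in the #indyref #hopeoverfear",
    "#indyref debate tonight in Edinburgh #yes",
    "Lovely #travel day in #Edinburgh #photography",
    "#travel #photography tips for Scotland",
]

# Illustrative threshold; the report kept hashtags used more than 30 times.
MIN_HASHTAG_COUNT = 2

def extract_hashtags(text):
    """Return the set of lowercased hashtags in one tweet."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}

tag_sets = [extract_hashtags(t) for t in tweets]

# Nodes: hashtags that occur often enough in the corpus.
tag_counts = Counter(tag for tags in tag_sets for tag in tags)
frequent = {tag for tag, n in tag_counts.items() if n >= MIN_HASHTAG_COUNT}

# Edges: number of tweets in which two frequent hashtags appear together.
edge_weights = Counter()
for tags in tag_sets:
    for a, b in combinations(sorted(tags & frequent), 2):
        edge_weights[(a, b)] += 1
```

The resulting weighted edge list is the kind of input on which a community detection method such as modularity optimization could then be run.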

Identifying topics: method 2, LDA topic modeling

The second method of thematic extraction goes beyond connecting the most frequent hashtags: it takes all the text of the tweet collection into account and tries to capture more nuanced topics, or those that are unlikely to be organized via hashtags. The basic question remains the same: Which topics are associated with Edinburgh on Twitter?

LDA topic modeling is a probabilistic, generative method for inferring topics from a collection of text documents. Basically, the only user input besides the text data is the number of topics. After several trials, this parameter was set to 20. A document, i.e. a tweet, is modeled as a mixture of the 20 topics; for a tweet clearly concerned with a single topic, the probability of each of the other 19 topics will accordingly be close to zero. The topics themselves are modeled as distributions over all the terms of the corpus. Consider the following tweet as an example:

So apparently the weather report from our Edinburgh office is that it is "sunny". Great detail folks!

From the LDA perspective, this document (tweet) has a topic distribution strongly skewed toward the initially hidden, or latent, topic weather, and a term such as sunny is very likely to have been generated by the topic weather. Other words, such as detail, have a much more even distribution over topics, as they may be used in many contexts.
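The generative story behind this description can be sketched in a few lines of Python. The two topics, their term probabilities, and the topic mixture are invented for illustration and are not estimates from the Edinburgh corpus; actual LDA inference works in the opposite direction, recovering such distributions from observed tweets.

```python
import random

# Hypothetical topics: each is a probability distribution over terms.
topics = {
    "weather": {"sunny": 0.50, "rain": 0.30, "edinburgh": 0.15, "detail": 0.05},
    "jobs": {"hiring": 0.50, "job": 0.30, "edinburgh": 0.15, "detail": 0.05},
}

# Hypothetical topic mixture of one tweet that is almost entirely about weather.
tweet_topic_mixture = {"weather": 0.95, "jobs": 0.05}

def draw(distribution, rng):
    """Draw one key from a dict that maps outcomes to probabilities."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_tweet(mixture, n_words, rng):
    """For each word: pick a topic from the mixture, then a term from that topic."""
    words = []
    for _ in range(n_words):
        topic = draw(mixture, rng)
        words.append(draw(topics[topic], rng))
    return words

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = generate_tweet(tweet_topic_mixture, 8, rng)
```

In this toy setup, sunny can only come from the weather topic, whereas detail is equally likely under both topics and so says little about which topic produced it.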

The data was imported into R (RStudio, tm package, mallet package) and preprocessed. Stopwords and punctuation were removed, and all text was converted to lowercase. From this data, a term-document matrix is produced in which every row is a word from the corpus and every column represents a tweet; the cells indicate how often a term occurs in a specific tweet (so most cells are zero). From this matrix, the distance or similarity of documents can be calculated, and these term counts are what the LDA algorithm works from.
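The preprocessing and the shape of the term-document matrix can be shown with a small sketch. It is written in Python rather than R and does not use the tm or mallet packages; the stopword list, the sample tweets, and the function names are assumptions for illustration only.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a full list.
STOPWORDS = {"the", "is", "in", "a", "it", "and"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOPWORDS]

# Hypothetical sample tweets.
tweets = [
    "It is sunny in Edinburgh!",
    "Edinburgh weather: sunny and warm.",
]

docs = [preprocess(t) for t in tweets]
vocabulary = sorted({token for doc in docs for token in doc})

# Rows are terms, columns are tweets, cells are term counts.
counts = [Counter(doc) for doc in docs]
term_document = {term: [c[term] for c in counts] for term in vocabulary}
```

With a realistic vocabulary of thousands of terms and tweets of at most a few dozen words, almost every cell in such a matrix is zero, which is why sparse representations are normally used.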

An important but difficult part of this analysis was interpreting the resulting topics. An R library called LDAvis was used to visualize the relevance of each topic (size of the circle), the thematic similarity between topics (high-dimensional distance matrix projected onto two axes using MDS), and the terms representative of each topic. The visualization, output as HTML, is interactive and allows detailed insights into the term distributions. To facilitate interpretation, the terms of a topic can be reordered via the adjustable lambda parameter, which shifts the ranking between terms that are exclusive to the topic and terms that are generally frequent in the topic.

For example, the term UK is important for the Edinburgh weather topic because it appears frequently in this context, but it is not very discriminating, as it may be used in many tweets on other topics. On the other hand, humidity is valuable for detecting the weather topic but might not feature in a majority of weather tweets (see below).
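The effect of lambda can be made concrete with the relevance measure described in the LDAvis paper (bibliography entry 2), which weights the log probability of a term within a topic against the log of its lift, i.e. that probability divided by the term's overall probability in the corpus. The probabilities below are invented to mirror the UK and humidity example and are not values from our model; the function names are likewise hypothetical.

```python
import math

# Hypothetical probabilities for two terms and the weather topic.
topic_term_prob = {"uk": 0.10, "humidity": 0.02}     # p(term | weather topic)
corpus_term_prob = {"uk": 0.08, "humidity": 0.002}   # p(term) in the whole corpus

def relevance(term, lam):
    """lam * log p(term | topic) + (1 - lam) * log lift."""
    p_term_given_topic = topic_term_prob[term]
    lift = p_term_given_topic / corpus_term_prob[term]
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

def rank_terms(lam):
    """Return the terms ordered from most to least relevant for this lambda."""
    return sorted(topic_term_prob, key=lambda term: relevance(term, lam), reverse=True)
```

With lambda set to 1 the ranking follows within-topic frequency alone, so uk comes first; with lambda set to 0 it follows lift alone, so the rarer but more exclusive humidity comes first.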

Findings

As previously mentioned, some of the main clusters identified are general, such as weather or news. Most interesting to us were politics (the referendum on Scottish independence) and travel in connection with the upcoming Fringe festival. Whereas the politics cluster was very concentrated and mostly discussed by a few actors, travel was very diverse. One interesting question is whether tourists write about Edinburgh differently than locals enjoying their weekend. However, this distinction proved to be rather complex. An overview of the two key clusters, politics and travel, is presented below.

Clusters identified:

general (weather)
news (gym accident)
politics (referendum)
travel (fringe festival, general travel)

Instagram Visualisation: http://cdb.io/1zcWfsu

Discussion & Conclusions

Politics

We exported the data by querying the top politically related hashtags in the TCAT dataset of Edinburgh: YES4Freedom OR voteyes OR indyref OR HopeOverFear. We then used a word cloud to explore users' views on the independence referendum. A big Yes shows up, but whose voice is it?
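What such an OR query and the word counts behind a word cloud amount to can be sketched as follows. This is a Python illustration, not TCAT's own query engine; it assumes case-insensitive matching on hashtags only, and the sample tweets are invented.

```python
import re
from collections import Counter

# Query terms taken from the text above.
QUERY_TERMS = ["YES4Freedom", "voteyes", "indyref", "HopeOverFear"]

# Hypothetical sample tweets.
tweets = [
    "Big crowd in Edinburgh tonight #indyref #VoteYes",
    "Sunny afternoon in the Meadows #edinburgh",
    "Yes yes yes! #HopeOverFear",
]

def hashtags(text):
    """Return the set of lowercased hashtags in one tweet."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}

def matches_query(text, terms=QUERY_TERMS):
    """True if the tweet carries at least one queried hashtag (assumed OR semantics)."""
    wanted = {term.lower() for term in terms}
    return bool(hashtags(text) & wanted)

selected = [t for t in tweets if matches_query(t)]

# Word frequencies of the selected tweets, with hashtags removed first.
word_counts = Counter(
    word
    for text in selected
    for word in re.findall(r"[a-z']+", re.sub(r"#\w+", " ", text.lower()))
)
```

A word cloud then simply scales each word by its count, which is why a frequently repeated word such as yes dominates the picture.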

Then we investigated the interactions between users. The top mentioned users are @citizentommy, a former member of the Scottish Parliament, and @indyreffilm, a screening event of the referendum campaign. The peak in tweets is caused by @citizentommy, who posted several tweets and received many retweets during these three days. These two top mentioned actors, however, do not interact much with each other.
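Counting mentions and checking whether the most mentioned accounts address each other can be sketched as below. The tweets, the senders, and the field names from_user and text are hypothetical; this only illustrates the kind of mention analysis described, not the tool output we worked with.

```python
import re
from collections import Counter

# Hypothetical sample tweets with sender and text.
tweets = [
    {"from_user": "alice", "text": "RT @citizentommy: Vote yes #indyref"},
    {"from_user": "bob", "text": "Great speech by @citizentommy #indyref"},
    {"from_user": "carol", "text": "Watching @indyreffilm tonight with @citizentommy fans"},
    {"from_user": "dave", "text": "Screening by @indyreffilm sold out"},
]

def mentions(text):
    """Return the lowercased user names mentioned in one tweet."""
    return [name.lower() for name in re.findall(r"@(\w+)", text)]

mention_counts = Counter()  # how often each account is mentioned
mention_edges = Counter()   # directed (sender, mentioned account) pairs

for tweet in tweets:
    sender = tweet["from_user"].lower()
    for target in mentions(tweet["text"]):
        mention_counts[target] += 1
        mention_edges[(sender, target)] += 1

# The two most mentioned accounts and the mentions exchanged between them.
first, second = [user for user, _ in mention_counts.most_common(2)]
direct_interactions = mention_edges[(first, second)] + mention_edges[(second, first)]
```

A value of zero for direct_interactions corresponds to the situation described above: two prominent accounts that are widely mentioned by others but do not mention each other.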

On the other hand, when we look at the intensity of tweets by geolocation, we find that the relatively intense areas correspond to individuals who actively tweet about the event but are seldom mentioned by others.

Another interesting perspective is how Edinburgh relates to other cities in the UK. In this graph we can see that the cities in Scotland are closely and intensely related to each other, and that they have a relatively loose relation with cities elsewhere.

Travel

Hashtags related to travelling were found to be very diverse. Furthermore, it was not clear whether terms such as good day, weekend, or park were being used by locals or by tourists. Further steps were therefore taken to distinguish locals from tourists.

Language separation: en vs. others

The first and most convenient step was a distinction based on language settings. We distinguished two categories: language = en, or others. To map out where people actually talk about travel-related topics, we used the geolocated data and put it on a map using CartoDB (see the comparison below).

Map panels: Language: other; Language: en

Users identified as tourists (language = not en) are more concentrated in the city centre (middle), around the beach, and along the way to the airport (left side). Locals, on the other hand, are spread across areas outside the city centre.

User specified location

To further distinguish locals from tourists, we obtained the user-specified location and used Google Refine to cluster locations with different spellings. After cleaning the data, we found that users who are not from Edinburgh are nevertheless mainly located within about 50 km of Edinburgh, rather than being as widely dispersed as in, e.g., the Amsterdam findings.
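One way to group different spellings is a key-collision approach in the style of the fingerprint method offered by Google Refine. We do not claim this is the exact clustering setting used for the analysis; the sketch below is a simplified Python approximation, and the sample locations are invented.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize a location string into a key: trim, lowercase, strip
    punctuation, then sort and deduplicate the remaining tokens."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

# Hypothetical user-specified locations.
locations = [
    "Edinburgh, Scotland",
    "scotland edinburgh",
    "Edinburgh",
    " edinburgh ",
    "Glasgow",
]

# Values that share a key are treated as spellings of the same place.
clusters = defaultdict(list)
for location in locations:
    clusters[fingerprint(location)].append(location)
```

Key collision only merges values that differ in case, punctuation, or word order; misspellings such as a missing letter would need a distance-based method instead.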

Distinction by timezones

As the user-specified location was not very helpful in this case, timezones were investigated. Helpfully, the timezone setting varies not only by time difference but also by location. Hence, there was a clear distinction between, for example, Edinburgh and London, although both are situated in the same timezone. As the timezones are stored in the user data, we had to link tweets to users before further analysis. After further cleaning, the topics, based on the most frequent hashtag mentions, were displayed in a Sankey diagram (below).
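Linking tweets to the timezone setting of their authors and counting hashtags per timezone can be sketched as follows. The user table, the tweets, and the variable names are hypothetical; the sketch only shows how source-target-value rows for a Sankey diagram could be derived, not the actual cleaning steps.

```python
import re
from collections import Counter

# Hypothetical user data: user name mapped to the timezone setting of the profile.
user_timezone = {"alice": "Edinburgh", "bob": "London", "carol": "Amsterdam"}

# Hypothetical tweets as (user name, text) pairs.
tweets = [
    ("alice", "Weekend in the park #edinburgh #weekend"),
    ("bob", "Up for the festival #edinburgh #travel"),
    ("carol", "First day in Scotland #travel #edinburgh"),
    ("erin", "No profile data for me #travel"),
]

# Count how often each hashtag is used by users of each timezone.
flows = Counter()
for user, text in tweets:
    timezone = user_timezone.get(user)
    if timezone is None:
        continue  # tweet cannot be linked to user data, so it is skipped
    for tag in set(re.findall(r"#(\w+)", text.lower())):
        flows[(timezone, tag)] += 1

# One (source, target, value) row per link of the Sankey diagram.
sankey_rows = sorted((tz, tag, n) for (tz, tag), n in flows.items())
```

Each row links a timezone on one side of the diagram to a hashtag on the other, with the count determining the width of the link.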

Bibliography

1. LDA topic modeling: latent Dirichlet allocation, see http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

2. LDAvis: http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

Topic revision: r2 - 29 Jul 2014, LiuYang

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.