Twitter can show us which issues and themes are associated with a city and what the people in a city are talking about. Edinburgh is a good case for investigating this: it is a relevant British city, yet it is not as much of a brand as London. This means that tweets mentioning Edinburgh, or tweets sent from Edinburgh, are likely to concern actual city issues. We investigated two different approaches to exploring the content of tweets that mention Edinburgh. That is, we are interested in which general themes are closely associated with the city of Edinburgh and, in a second step of analysis, in who the proponents of certain topics are.
Data: We used a one-week sample of tweets containing the keyword Edinburgh in any of the European languages, collected by TCAT via the 1% Twitter API. This data set contains roughly 76,000 tweets (41,000 users) from June 23-26, 2014. Additionally, we used a collection of geo-located tweets from the same week originating from Edinburgh, which contains 47,000 tweets from only 5,000 users.
RQ1: Which topics are associated with the city of Edinburgh?
RQ2: Who are the active Edinburgh tweeters?
RQ3: Can we separate local and tourist tweeting activity?
The second method of thematic extraction goes beyond merely connecting the most frequent hashtags: it takes all the text of the tweet collection into account and tries to capture more nuanced topics, or those that are unlikely to be organized via hashtags. The basic question remains the same: Which topics are associated with Edinburgh on Twitter?
LDA topic modeling is a probabilistic and generative method to infer topics from a collection of text documents. Basically, the only user input besides the text data is the number of topics. After several trials, this parameter was set to 20. A document, i.e. a tweet, is modeled as a mixture of the 20 topics; for a tweet clearly concerned with a single topic, the probability of each of the other 19 topics will accordingly be close to zero. The topics themselves are modeled as distributions over all the terms of the corpus. Consider the following tweet as an example:
So apparently the weather report from our Edinburgh office is that it is "sunny". Great detail folks!
From the LDA perspective, this document (tweet) has a topic distribution that is strongly skewed toward the initially hidden, or latent, topic weather, and a term such as sunny is very likely to have been generated by the topic weather. Other words, such as detail, have a much more even distribution over topics, as they may be used in many contexts.
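The mixture view described above can be made concrete with a small sketch. It uses two invented topics and made-up probabilities rather than the 20 topics fitted in this study; the names `phi` and `theta` follow common LDA notation and are not taken from our R scripts.

```python
# Toy illustration of the LDA mixture view; all numbers are invented.
# phi[k][w]: probability of term w under topic k (each row sums to 1).
phi = {
    "weather": {"sunny": 0.60, "report": 0.30, "detail": 0.10},
    "news":    {"sunny": 0.05, "report": 0.55, "detail": 0.40},
}
# theta[k]: share of each topic in one tweet (sums to 1).
theta = {"weather": 0.9, "news": 0.1}


def word_probability(word, theta, phi):
    """p(word | tweet) = sum over topics of p(topic | tweet) * p(word | topic)."""
    return sum(theta[k] * phi[k].get(word, 0.0) for k in theta)


def topic_responsibility(word, theta, phi):
    """Share of each topic in generating one occurrence of `word` in the tweet."""
    total = word_probability(word, theta, phi)
    return {k: theta[k] * phi[k].get(word, 0.0) / total for k in theta}


for word in ("sunny", "detail"):
    print(word, topic_responsibility(word, theta, phi))
```

In this toy setting, sunny is attributed almost entirely to the weather topic, while detail is shared more evenly between the two topics, which mirrors the contrast drawn in the text.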
The data was imported into R (RStudio, tm package, mallet package) and preprocessed: stopwords and punctuation were removed, and all text was converted to lowercase. From this data, a term-document matrix is produced in which every row is a word from the corpus and every column represents a tweet; the cells indicate how often a term is present in a specific tweet (so most cells are zero). From this matrix the distance or similarity of documents can be calculated, and these term counts are what the LDA algorithm works from.
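The preprocessing and matrix construction can be sketched in a few lines. This is a Python illustration of the same steps, not the R code (tm, mallet) used in the study; the stopword list is a tiny placeholder and the function names are hypothetical.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "is", "it", "in", "a", "from", "our", "that", "so"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def term_document_matrix(tweets):
    """Return (vocab, matrix): one row per term, one column per tweet."""
    docs = [Counter(preprocess(t)) for t in tweets]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for doc in docs] for term in vocab]
    return vocab, matrix


vocab, matrix = term_document_matrix(
    ["Sunny in Edinburgh!", "Edinburgh news: sunny, sunny day"]
)
for term, row in zip(vocab, matrix):
    print(term, row)
```

With thousands of tweets and terms, almost every cell is zero, so in practice a sparse representation is preferable to the nested lists used here.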
An important but difficult part of this analysis was interpreting the resulting topics. An R package called LDAvis was used to visualize the relevance of each topic (size of the circle), the thematic similarity of the topics (high-dimensional distance matrix projected onto two axes using MDS), and the terms representative of each topic. The visualization, output to HTML, is interactive and allows detailed insights into the term distributions. Depending on the adjustable lambda parameter, the terms of a topic can be reordered (to facilitate interpretation) to favor either terms that are exclusive to the topic or terms that are generally frequent in the topic.
For example, the term UK is important for the Edinburgh weather topic because it appears frequently in this context, but it is not very discriminating because it may be used in many tweets on other topics. The term humidity, on the other hand, is valuable for detecting the weather topic but might not feature in a majority of weather tweets (see below).
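The effect of lambda can be illustrated with the relevance measure defined in the LDAvis paper: relevance = lambda * log p(term | topic) + (1 - lambda) * log(p(term | topic) / p(term)). The sketch below applies it to two terms; the probabilities are invented for illustration and are not values from our model.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """LDAvis relevance: lam weights in-topic frequency, (1 - lam) weights lift."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))


# term -> (p(term | weather topic), p(term in whole corpus)); invented numbers.
terms = {
    "uk": (0.08, 0.06),         # frequent in the topic, but frequent everywhere
    "humidity": (0.02, 0.002),  # rarer in the topic, but almost exclusive to it
}


def rank_terms(terms, lam):
    """Order terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)


print(rank_terms(terms, 1.0))  # ranks by frequency within the topic
print(rank_terms(terms, 0.0))  # ranks by exclusivity to the topic
```

With these numbers, lambda = 1 puts uk first and lambda = 0 puts humidity first, which is the reordering described in the text.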
As previously mentioned, some of the main clusters identified are general, such as weather or news. The most interesting ones we found were politics (the referendum on Scottish independence) and travel in connection with the upcoming Fringe festival. Whereas the politics cluster was very concentrated and mostly discussed by a few actors, travel was very diverse. One interesting question is whether tourists write about Edinburgh differently than locals enjoying their weekend. However, this distinction proved to be rather complex. An overview of the two key clusters, politics and travel, is presented below.
Clusters identified:
general (weather)
news (gym accident)
politics (referendum)
Instagram Visualisation: http://cdb.io/1zcWfsu
Then we investigated the interactions between users. The most-mentioned users are @citizentommy, a former member of the Scottish Parliament, and @indyreffilm, a screening event of the referendum campaign. The peak in tweets is caused by @citizentommy, who posted several tweets and received many retweets during these three days. These two most-mentioned actors, however, do not interact much with each other.
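A ranking of most-mentioned users can be derived by counting @mentions in the tweet text. The following is a minimal sketch and not the TCAT export we used; the sample tweets are invented, and retweets count as mentions of the original author here, which is an assumption.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def top_mentions(tweets, n=10):
    """Count @mentions (case-insensitive) and return the n most frequent."""
    counts = Counter()
    for text in tweets:
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts.most_common(n)


sample = [
    "RT @citizentommy: Edinburgh tonight",   # invented example tweets
    "@IndyRefFilm with @citizentommy",
    "no mentions here",
]
print(top_mentions(sample, 2))
```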
On the other hand, when we look at the intensity of tweets by geolocation, we find that the relatively intense spots are simply individuals who are actively tweeting about the event but are seldom mentioned by others.
Another interesting perspective is how Edinburgh relates to other cities in the UK. In this graph we can see that the cities in Scotland are closely and intensely related to each other, while they have a relatively loose relation with cities elsewhere.
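One simple way to obtain such a city graph is to count how often two city names occur in the same tweet. The sketch below does exactly that; it is an illustration under assumptions (the city list and the sample tweets are invented, and naive substring matching is used), and it does not claim to reproduce how the graph above was built.

```python
from collections import Counter
from itertools import combinations

# Illustrative city list (assumption).
CITIES = ["edinburgh", "glasgow", "aberdeen", "london", "manchester"]


def co_mentions(tweets, cities=CITIES):
    """Count unordered city pairs that appear together in one tweet."""
    edges = Counter()
    for text in tweets:
        lowered = text.lower()
        present = sorted(c for c in cities if c in lowered)
        edges.update(combinations(present, 2))
    return edges


sample_tweets = [
    "Edinburgh to Glasgow by train",
    "Glasgow and Edinburgh derby",
    "London calling Edinburgh",
]
print(co_mentions(sample_tweets))
```

The resulting pair counts can serve as edge weights in a network tool of choice.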
Travel
Hashtags related to travelling were found to be very diverse. Furthermore, it was not known whether locals or tourists were using terms such as good day, weekend, or park. To distinguish locals from tourists, further steps were taken.
Language separation: en vs. others
The first (and most convenient) step was a distinction based on language settings. We distinguished two categories: language = en and others. To map out where people actually talk about travel-related topics, we used the geolocated data and put it on a map using CartoDB (see the comparison below).
Language: other Language: en
We found that users identified as tourists (language = not en) are more concentrated in the city centre (middle), around the beach, and along the way to the airport (left side). Locals, on the other hand, are spread across areas outside the city centre.
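The visual impression from the two maps could be complemented by a simple number, for instance the median distance of each group's tweets from the city centre. The following sketch shows one way to compute it; the record fields (`lang`, `lat`, `lon`), the centre coordinates, and the sample points are assumptions for illustration, and this calculation was not part of the original analysis.

```python
import math
from statistics import median

CENTRE = (55.9533, -3.1883)  # approximate Edinburgh city centre (assumption)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def median_distance_by_group(tweets):
    """Split tweets into 'en' and 'other' and return median km from the centre."""
    groups = {"en": [], "other": []}
    for t in tweets:
        key = "en" if t["lang"] == "en" else "other"
        groups[key].append(haversine_km(CENTRE, (t["lat"], t["lon"])))
    return {k: median(v) for k, v in groups.items() if v}


sample_points = [
    {"lang": "de", "lat": 55.9533, "lon": -3.1883},  # invented points
    {"lang": "en", "lat": 55.90, "lon": -3.30},
]
print(median_distance_by_group(sample_points))
```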
User specified location
To further distinguish locals from tourists, we obtained the user-specified location and used Google Refine to cluster locations with different spellings. After cleaning the data, we found that users who are not from Edinburgh are nevertheless mainly located within about 50 km of Edinburgh, rather than being as diverse as, e.g., in the Amsterdam findings.
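Clustering of spelling variants can be sketched with a key-collision "fingerprint" approach, similar in spirit to the clustering offered by Google Refine. Which clustering method was actually applied to our data is not recorded here, so the code below is an illustration only, and the sample values are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a location string so that spelling variants share one key."""
    value = value.strip().lower()
    # Drop accents by decomposing characters and discarding non-ASCII parts.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate tokens so that word order does not matter.
    return " ".join(sorted(set(value.split())))


def cluster_locations(values):
    """Group raw location strings by their fingerprint key."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return dict(clusters)


print(cluster_locations(["Edinburgh, Scotland", "Scotland - Edinburgh", "Glasgow"]))
```

This approach merges variants that differ in case, punctuation, accents, or word order; misspellings such as "Edinburg" would need a distance-based method instead.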
Distinction by timezones
As the user-specified location was not very helpful in this case, we investigated time zones. Luckily, the time zone setting varies not only by time difference but also by location. Hence, there was a clear distinction between Edinburgh and London, for example, although both are situated in the same time zone. As the time zones are stored in the user data, we had to link tweets to users before further analysis. After the data was cleaned further, topics based on the most frequent hashtag mentions were displayed in the Sankey diagram below.
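Linking tweets to users and counting hashtags per time zone can be sketched as follows. The resulting (time zone, hashtag) counts are the kind of source-target-weight triples a Sankey diagram needs. The field names `user_id`, `text`, and `time_zone` and the sample records are assumptions for illustration, not the schema of our export.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def timezone_hashtag_flows(tweets, users):
    """Count tweets per (user time zone, hashtag) pair.

    A hashtag is counted once per tweet; users without a known
    time zone are grouped under "unknown".
    """
    flows = Counter()
    for t in tweets:
        tz = users.get(t["user_id"], {}).get("time_zone") or "unknown"
        for tag in {h.lower() for h in HASHTAG.findall(t["text"])}:
            flows[(tz, tag)] += 1
    return flows


sample_users = {1: {"time_zone": "Edinburgh"}, 2: {"time_zone": "London"}}
sample_records = [
    {"user_id": 1, "text": "#Fringe #fringe in town"},  # invented records
    {"user_id": 2, "text": "Off to the #fringe"},
    {"user_id": 3, "text": "#indyref"},
]
print(timezone_hashtag_flows(sample_records, sample_users).most_common())
```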
1. LDA topic modeling: latent Dirichlet allocation, see http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
2. LDAvis: http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf