Team Members
Christian, Samuel, Simeona, Bronwen
Research Question
Is YouTube becoming less user-generated?
Sub-Question: What was watched on the YouTube of 2007, 2008, 2009, and 2010?
Context
YouTube is a web platform that is generally perceived to facilitate user-generated content, communities and social interactions. It was founded in 2005 as a ‘consumer media company for people to watch and share original videos worldwide through a web experience’. In November 2006 YouTube was acquired by Google.
In 2009 several financial analysts publicly criticized Google's acquisition of YouTube. They claimed it was moving toward profitability too slowly, or that it had no chance of ever being profitable. In response, Google embarked on a variety of "monetization" strategies, including cross-licensing agreements, better integration with Google AdWords and DoubleClick, and modifications to the user interface to make YouTube seem more like television. It now has video agreements with several US entertainment corporations: Columbia Broadcasting System, National Broadcasting Company, Universal Music Group, Sony BMG Music Entertainment Group and the National Hockey League.
While no one disputes that it is possible to upload a quirky video to YouTube, if most or all attention on YouTube never flows to this independent content, how different is YouTube from a traditional mass medium? Attention on YouTube has been difficult to study because historical data about viewership is unobtainable.
In an attempt at answering this broad question we have subdivided our research into time frames of one month.
Steps in Method
Before the DMI Summer School began, collaborators wrote a YouTube crawler and compiled meta information about popular videos on YouTube for 28 months. This resulted in a set of ~130 million records describing 440,036 unique YouTube videos seen between 27/11/2007 and 21/03/2010. The video archive was generated by a crawler based in Cambridge, MA, USA, so it provides a US-centric, English view of YouTube.
Christian wrote a web interface that enables us to explore this historical archive.
The crawler discovered videos in a variety of ways:
Most (213913) of the videos were discovered by the Digg Explorer.
Some (49969) were discovered by the Most Today Explorer.
Some (38134) were discovered by the Top Today Explorer.
A few (2371) were discovered by the Technorati Explorer.
The rest (132864) were discovered by feeding the crawler with search terms that the research team thought might be of interest because they are controversial topics. Examples: kkk, muslims, islam, army, scientology, censor ...
Crawler activity:
The gaps in the data mark periods when the crawler was inactive. Possible reasons include being blocked by YouTube and technical problems with the server or network.
First Attempt
First we sampled the top 1000 most viewed videos between 27/11/2007 and 21/03/2010. We extracted the text from their title, tags, category and description, and clouded the data in Wordle. The visuals that we produced were then animated (click here to see the video). While exploring the data at this stage, we encountered a few issues. The general drawbacks of tag clouds have been discussed elsewhere, but below we outline the issues that directly affected our work.
1. The "Peanut Butter Jelly" effect
One of the videos was a novelty song called "It's Peanut Butter Jelly Time." The uploader had posted the lyrics (mainly repetitions of the line, "It's Peanut Butter Jelly Time") in the video's description field. The repetition of these words in the description field meant that the cloud generated for this series was disproportionately dominated by them.
We have overcome this issue by 'cleaning' the text fields (removing multiple repetitions of words). This is important because we want to see common themes amongst all videos, rather than prominent terms in single videos.
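A minimal sketch of this cleaning step, assuming each text field is a plain string and that "removing repetitions" means keeping only the first occurrence of each word. The function name `dedupe_words` is hypothetical and is not taken from the project's own code.

```python
def dedupe_words(text):
    """Keep only the first occurrence of each whitespace-separated word.

    Assumption: comparison is exact (case-sensitive), so "Jelly" and
    "jelly" would both be kept unless the text is lowercased first.
    """
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


description = (
    "It's Peanut Butter Jelly Time Peanut Butter Jelly Time Peanut Butter Jelly"
)
cleaned = dedupe_words(description)
print(cleaned)  # It's Peanut Butter Jelly Time
```

After this step a single video can contribute each word at most once per field, so a term only becomes prominent if it appears across many videos.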
2. Yes, it's a "VIDEO"!!
Most of the tags, titles and descriptions for the YouTube videos included the word 'video' somewhere in their text fields. The word appeared very prominently in every single tag cloud, yet it is redundant for our research: we already know that we are working with videos, so its prominence detracted from what else we could learn from the tag clouds. Christian devised a stop-word filter, but so far the only word we have justifiably removed is "video". Because the word was capitalised in different ways, it often appeared in the clouds more than once. This revealed a further issue (the capitalisation of words), so during the cleaning process we converted all of the text to lowercase. This still does not account for misspellings or alternate spellings, which will hopefully have little impact on the data.
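The lowercasing and stop-word removal described above could look something like the following sketch. The names `STOP_WORDS` and `normalise` are hypothetical, and stripping surrounding punctuation is an added assumption so that forms such as "VIDEO!!" are also caught.

```python
import string

# Only "video" has been removed so far; further words could be added here.
STOP_WORDS = {"video"}


def normalise(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip surrounding punctuation, drop stop words."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in stop_words:
            words.append(word)
    return words


print(normalise('Funny VIDEO of a Video cat'))
# ['funny', 'of', 'a', 'cat']
```

Lowercasing before the stop-word check means "Video", "VIDEO" and "video" are treated as the same term, both for removal and for counting in the cloud.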
3. Text overload
Some of the videos have a disproportionately large amount of text in their titles or descriptions compared with others. Some have few tags, while others have many. The amount of text is not necessarily correlated with how many times a video has been viewed, and this presents us with an issue: when we cloud, a keyword may become prominent because of a video that has had fewer views than others but has overloaded us with data.
We need to develop a "weighting" system whereby videos that have had more views are multiplied by a factor giving their text more weight than videos that have not been viewed as much. For example, let's hypothesise that for the month of March 2008 we have video X and video Y.
Video X has been viewed 1,000,000 times but video Y has been viewed 10,000,000 times.
Video X provides us with 200 keywords but video Y provides us with only 20.
Without weighting, video X becomes more prominent in the tag cloud because it has more text.
Video Y should be more prominent, because it was viewed more times.
A weighting system counteracts this by multiplying video Y's keywords with a factor that accounts for viewership, and multiplying by another factor to correct for videos that have an unusually small number of keywords (effectively repeating the words).
Problems remain: If video X has 200 words but Video Y has 20 words, repeating the words in Video Y ten times is not an effective way to balance them. Since video Y's words are repeated, they will become more powerful in the tag cloud. This would only be defensible if indeed Video Y had a more focused topic. (This kind of thing may not be worth worrying too much about as tag clouds are meant to be aesthetic and broadly diagnostic rather than precise.)
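One possible reading of the weighting idea, sketched under stated assumptions: instead of repeating words, each video's view count is spread evenly across its keywords, so a video's total weight in the cloud is proportional to its views regardless of how much text it has. This is an illustration, not the scheme the team implemented; the function name and the `views` and `keywords` fields are hypothetical.

```python
from collections import Counter


def weighted_keyword_counts(videos):
    """Sum per-keyword weights, where each video contributes
    views / number_of_keywords to every one of its keywords."""
    totals = Counter()
    for video in videos:
        keywords = video["keywords"]
        if not keywords:
            continue  # a video with no text contributes nothing
        per_keyword = video["views"] / len(keywords)
        for keyword in keywords:
            totals[keyword] += per_keyword
    return totals


# The hypothetical videos X and Y from the example above.
video_x = {"views": 1_000_000, "keywords": [f"x{i}" for i in range(200)]}
video_y = {"views": 10_000_000, "keywords": [f"y{i}" for i in range(20)]}

weights = weighted_keyword_counts([video_x, video_y])
print(weights["x0"])  # 5000.0
print(weights["y0"])  # 500000.0
```

Under this sketch each of video Y's keywords outweighs each of video X's by a factor of 100, which illustrates the remaining problem noted above: a video with few keywords concentrates its weight on them, whether or not its topic is actually more focused.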
Second Attempt
We are now working with the top 10,000 videos from the archived period that have had their text fields cleaned and weighted (according to the processes outlined above). These are organised by number of views per month, and we are producing comparative tag clouds that reveal how keywords have changed dynamically over time from the start of the crawler's expedition until it went offline.
We filter the videos according to number of views and apply the clouding process to the top 100 viewed videos for each month.
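The monthly filtering step can be sketched as follows, assuming the archive has been reduced to one record per video per month. The field names `month`, `video_id` and `views` are hypothetical placeholders for whatever the archive actually stores.

```python
from collections import defaultdict


def top_videos_per_month(records, n=100):
    """Group records by month and keep the n most viewed in each month."""
    by_month = defaultdict(list)
    for record in records:
        by_month[record["month"]].append(record)
    return {
        month: sorted(recs, key=lambda r: r["views"], reverse=True)[:n]
        for month, recs in by_month.items()
    }


# Toy data, for illustration only.
records = [
    {"month": "2008-03", "video_id": "a", "views": 10},
    {"month": "2008-03", "video_id": "b", "views": 30},
    {"month": "2008-03", "video_id": "c", "views": 20},
    {"month": "2008-04", "video_id": "a", "views": 5},
]
top = top_videos_per_month(records, n=2)
print([r["video_id"] for r in top["2008-03"]])  # ['b', 'c']
```

The text fields of each month's selected videos would then be cleaned, weighted and passed to the clouding step.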
Findings
Conclusion
This project is part of the DmiSummer2011Projects