YouTube’s recommendation system challenge - Analyzing the content of YouTube’s recommendation algorithms
Team Members
Amaury L (Project Leader, Check First)
Hajar LAGLIL (Facilitator)
Ramata MAIGA
Inoussa KORA CHIABI
Busra KULAKLI
Perpetue MADUNGU TUMWAKA
Contents
Summary of Key Findings
1. Introduction
2. Initial Data Sets
3. Research Questions
4. Methodology
5. Findings
6. Discussion
7. Conclusion
8. References
We studied YouTube’s recommendation system using the query “covid”. By analyzing the tags and titles of YouTube videos, we evaluated how the language of recommended videos deviates from French to other languages. We also wanted to see which countries are associated with YouTube’s recommendations from one level to the next.
Firstly, the analysis of Limburg (Belgium) shows that YouTube suggests to Limburg users videos about COVID, China and China’s COVID policy. Several communities of videos emerge. Some of them are intriguing and highlight how the query “covid” can lead to suggestions of videos about the medical, comedic, media or political aspects of COVID.
Secondly, the geo-spatialization of video content shows that when somebody living in Belgium uses “covid” as a keyword, the videos mainly recommended at the first level have titles relating to France, the UK, Russia, Ukraine, Canada, Australia, India, China, Brazil and the USA. At level 2, we find the same countries as at the first level, plus new countries (Germany, Ivory Coast and Afghanistan).
In order to monitor and analyze YouTube’s recommendation algorithm (as with other social networks), we chose to verify whether the algorithm recommends videos related to Covid or videos irrelevant to Covid.
The datasets come from the CrossOver project (https://crossover.social/), which analyzes platform algorithms with a focus on trends and recommendations. We focus on the videos’ titles and tags.
The research question was:
What content is recommended to Belgian users?
Our starting hypothesis is that when a Belgian user searches on YouTube, the algorithm should mostly recommend videos related to the search term in the first and second positions.
To test our hypothesis, we used Gephi and CorTexT Manager as social network analysis tools and RAWGraphs for visualizing the results.
Tools used
i) Gephi
ii) CorTexT Manager
We converted the JSON file to CSV and filtered it to keep only the French content. We then processed it in CorTexT Manager using the following scripts: data parsing, terms extraction, network mapping, named entity recognizer, geocoding and geospatial exploration. The aim was to obtain the locations associated with Covid in the first-level videos and to see how quickly the recommendations deviate geographically. We also performed language detection because our corpus was still multilingual, even though we had filtered on the pre-existing column entitled "Relevants languages". We then looked at how quickly the language of the recommendations deviates. Finally, we represented the top 150 nodes, with communities detected by the Louvain algorithm.
iii) RAWGraphs
After obtaining the results in CorTexT Manager, we made visualizations with RAWGraphs (the matrix, the histogram and the circle).
Methodology
The dataset was provided as JSON files, which we converted to CSV format.
We were interested in the French-language part of the Belgian YouTube recommendations obtained with the query “covid”.
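The conversion and filtering step could be sketched as follows. This is a minimal illustration and not necessarily the script used in the project: it assumes the JSON file is an array of flat records, and the field names title, tags and language are hypothetical stand-ins for the actual CrossOver columns.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fieldnames, language_field=None, keep_language=None):
    """Convert a JSON array of flat records to CSV text, optionally keeping one language."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Skip records in other languages when a filter is requested.
        if keep_language is not None and record.get(language_field) != keep_language:
            continue
        row = {}
        for name in fieldnames:
            value = record.get(name, "")
            # Flatten list values (e.g. tags) into a single cell.
            if isinstance(value, list):
                value = "|".join(str(item) for item in value)
            row[name] = value
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical records; the real dataset has different fields.
sample = json.dumps([
    {"title": "Covid : le point", "tags": ["covid", "news"], "language": "fr"},
    {"title": "Covid update", "tags": ["covid"], "language": "en"},
])
french_csv = json_records_to_csv(
    sample, ["title", "tags", "language"], language_field="language", keep_language="fr"
)
print(french_csv)
```

Joining list values with a pipe keeps each tag list in one CSV cell; any other separator that does not occur in the tags would work equally well.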
Gephi
The first step is to keep only the top nodes, those with a degree of at least 700 (connections). We chose a minimum of 700 connections because below this threshold the graph becomes unreadable (the dataset initially contains 10973 nodes and 324729 edges). After filtering, the graph is based on 0.9 % of the nodes of the dataset, that is 99 nodes and 4948 edges. The second step is to set the node size, which ranges between 2 and 200 according to the number of connections. We use the Force Atlas 2 algorithm (scaling 50, gravity 1, normalized edge weights, edges displayed by target node) to lay out the graph. The graph shows the tags of videos suggested to Belgian users (Limburg territory) when they search for “covid” on YouTube.
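The degree filter described above was applied in Gephi; the same idea can be sketched in plain Python on an edge list. The toy edges and the threshold of 2 below are illustrative assumptions (the project used a threshold of 700 on the full tag network), and the function name filter_by_degree is hypothetical.

```python
from collections import Counter


def filter_by_degree(edges, min_degree):
    """Keep nodes whose degree is at least min_degree, and the edges between them."""
    degree = Counter()
    for source, target in edges:
        # Each edge adds one connection to both of its endpoints.
        degree[source] += 1
        degree[target] += 1
    kept_nodes = {node for node, count in degree.items() if count >= min_degree}
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


# Toy tag network, not taken from the dataset.
edges = [
    ("covid", "news"),
    ("covid", "china"),
    ("covid", "vaccine"),
    ("news", "china"),
    ("vaccine", "health"),
]
kept_nodes, kept_edges = filter_by_degree(edges, min_degree=2)
print(sorted(kept_nodes), len(kept_edges))

# Share of nodes kept in the report: 99 out of 10973.
share_percent = round(100 * 99 / 10973, 1)
print(share_percent)
```

The last lines simply check the proportion quoted in the text: 99 of 10973 nodes is about 0.9 %.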
CorTexT Manager
We used the platform CorTexT Manager, which is the digital platform of the LISIS Unit and a project launched and sustained by IFRIS and INRAE. This platform aims at empowering open research and studies in the humanities about the dynamics of science, technology, innovation and knowledge production.
First of all, we uploaded our CSV file to CorTexT Manager. Then we used the Data Parsing script, which handles a wide range of data formats: isi files (as downloaded from the Web of Science), Factiva datasets, Pubmed, RIS files, batches of simple text files or any file in CSV format (using the “robust csv” parser in that case). It can also parse xls files from Excel or LibreOffice (.xls, not .xlsx). A Europress parser is also available, along with other database-specific parsers. Then we applied the Terms Extraction script to the title and description fields; it automatically identifies terms pertaining to a given corpus. The Natural Language Processing tools it relies on identify not only simple terms but also multi-terms (called n-grams).
In order to make a network analysis with these extracted terms, we used the script Network Mapping. The maps feature homogeneous or heterogeneous nodes which can be linked according to different types of proximity measures.
We were also interested in the geographic evolution of YouTube’s recommendations from level 1 to level 2. For that we used the Named Entity Recognizer script, which identifies and indexes persons, places, organizations, etc. At the moment it can handle 6 different languages. In English, one can select among 19 kinds of entities. We selected LOC-GPE, which combines geopolitical entities (countries, cities, states) and locations (non-GPE locations, mountain ranges, bodies of water).
We used the CorTexT geocoding engine, which has been built to handle semi-structured addresses written by humans. It is therefore able to resolve complex situations such as:
Different formats that rely on national postal services (or data providers) and vary widely across countries;
Non-geographic information (building names, lab names, person names…) that is ambiguous and could be multi-located;
Ambiguous toponyms (e.g. is “Paris” the Paris in Canada or the capital city of France? Is “Osaka”, in Japan, the region name or the city name?);
Alternative and vernacular toponyms.
Finally, we used the Geospatial Exploration script, which is designed to run after the CorTexT Geocoding script because it needs coordinates (longitude, latitude) in a specific format (e.g. 104.068108|30.652751 for longitude|latitude).
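To make the coordinate format concrete, here is a small sketch that reads and writes the longitude|latitude string quoted above. The helper names parse_lon_lat and format_lon_lat are hypothetical and are not part of CorTexT Manager.

```python
def parse_lon_lat(value):
    """Parse a 'longitude|latitude' string into a (longitude, latitude) pair of floats."""
    lon_text, lat_text = value.split("|")
    lon, lat = float(lon_text), float(lat_text)
    # Reject values outside the valid geographic ranges (catches swapped fields).
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError(f"coordinates out of range: {value!r}")
    return lon, lat


def format_lon_lat(lon, lat):
    """Format a coordinate pair as 'longitude|latitude'."""
    return f"{lon}|{lat}"


lon, lat = parse_lon_lat("104.068108|30.652751")
print(lon, lat)
print(format_lon_lat(lon, lat))
```

The order matters: longitude comes first, which is the reverse of the common latitude, longitude convention.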
We were also interested in how quickly the language of YouTube’s recommendations diverges. For that, we used a trained model that detects the language of the YouTube videos at level 1 and of their recommendations at level 2.
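Once each video has a detected language, the deviation between levels can be measured as the share of level 1 to level 2 links where the language changes. The sketch below assumes the language labels have already been produced by a detection model; the pairs are invented for illustration and the function name language_deviation is hypothetical.

```python
from collections import Counter


def language_deviation(pairs):
    """Return the share of (level 1, level 2) pairs whose language differs,
    and a count of the level 2 languages reached by those deviating pairs."""
    total = len(pairs)
    if total == 0:
        return 0.0, Counter()
    changed = [(src, dst) for src, dst in pairs if src != dst]
    return len(changed) / total, Counter(dst for _, dst in changed)


# Invented (level 1 language, level 2 language) pairs, not taken from the dataset.
pairs = [("fr", "fr"), ("fr", "en"), ("fr", "en"), ("fr", "nl")]
rate, targets = language_deviation(pairs)
print(rate, targets.most_common())
```

A rate close to 0 would mean recommendations stay in the language of the source video; a rate close to 1 would mean they almost always switch.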
RAWGraphs
RAWGraphs is used to visualize the results.
We observe that when Limburg users search for “covid” on YouTube, the algorithm mainly suggests videos about Covid. However, many videos are about China and its Covid policy, and others are not related to Covid at all. We found communities of videos (as a reminder, based on the videos’ tags) around: China and Covid, China, politics, virus, business news, disease, medicine, Germany and Australia (we focus on these tags in the graphs), comedy and funny, and so on.
The intriguing findings are video tags such as comedy, funny and Germany. Why is there a “comedy” or “funny” community?
The communities of video tags around the medical aspect are linked to the health community, and the small “comedy/funny” community is linked to the medical and political communities (“Politics” is visible on the graphs).
Fig 1: Tags of the YouTube videos (first recommendation) suggested to Belgian users (Limburg) with “covid” as keyword.
We analyzed the tag data of videos from Liege with the Force Atlas 2 algorithm. The size of the nodes is a function of the number of connections. For the statistics we used degree, weighted degree and modularity. With these parameters we found several communities with two main themes: the community of everything related to health (orange links in Figure 2) and the community of the policies implemented by China (green links in Figure 2). The thickness of the links is a function of the number of videos suggested on the keywords. In conclusion, for this region, with COVID as keyword, YouTube suggests videos related to health, China and the policies China implemented against COVID. All the communities are linked to NEWS, probably because news media cover the themes that connect the clusters.
Fig 2: Network of connections between video tags by Belgian province
Figure 3: Network mapping of recommended video tags in East Flanders (Belgium)
The analysis of the East Flanders data, with the same parameters as above, shows three easily distinguishable communities. The first community groups China and the set of policies adopted (green nodes in Figure 3). The second community groups the rest of the world (USA, Europe) with the policies and their consequences (bordy’s tunnel).
The third community includes research associated with the treatment of HIV/COVID, all research on HIV/COVID, education on behavioral change (awareness raising), and biology and medicine in general (blue nodes in Figure 3).
Figure 4: Network mapping of the content of YouTube’s recommendations
Figure 5: Network mapping of the content of YouTube’s recommendations
This part of the study looks at the most recommended YouTube video titles for the keyword "Covid". The analysis of the dataset identified nine clusters, all of which are interconnected.
The light green cluster is the medical cluster, which covers covid variants and vaccines.
The red cluster groups terms related to China’s lockdown, which is directly associated with hunger, anxiety and food shortages.
The blue cluster is mainly about the CNN reporter David Culver, who wrote about his experience of leaving Shanghai after living for 50 days under Covid lockdown.
The dark green cluster centers on Speaker Nancy Pelosi negotiating directly with the Trump Administration; she was crucial to securing vital relief for families and small businesses in the early COVID-19 pandemic relief laws.
Finally, the yellow cluster is about former FDA Commissioner Dr. Scott Gottlieb, who offered insight into what to watch out for with the new COVID-19 Omicron variant, including transmissibility, severity of symptoms, efficacy of current vaccines against it, and what we might expect from the pandemic in the upcoming months.
Figure 6: Geo-spatial exploration of the videos at level 1 of YouTube’s recommendations
As the figure shows, when somebody living in Belgium uses “covid” as a keyword, the videos mainly recommended at the first level have titles relating to France, the UK, Russia, Ukraine, Canada, Australia, India, China, Brazil and the USA.
Figure 7: Geo-spatial exploration of the videos at level 2 of YouTube’s recommendations
At level 2, as shown in Figure 7, YouTube’s recommendations relate mainly to the same countries, with the addition of Germany, Ivory Coast and Afghanistan.
Figure 8: Top recommended videos, level 1
Figure 8 shows the top five recommended YouTube videos at level 1. The main challenge in this analysis is to explain why, for the keyword “covid”, the first recommended video does not have the word “covid” in its title, whereas the four following videos do.
Figure 9: Top recommended videos, level 2
At the second level of recommendation, Figure 9 shows that videos with our keyword “covid” in the title disappear from the top five recommendations.
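The check behind Figures 8 and 9 amounts to asking what share of the top five titles contains the keyword at each level. A minimal sketch follows; the titles are invented placeholders that only mimic the pattern described above (they are not the actual recommended videos), and the function name keyword_share is hypothetical.

```python
def keyword_share(titles, keyword, top_n=5):
    """Share of the first top_n titles that contain the keyword (case-insensitive)."""
    top = titles[:top_n]
    if not top:
        return 0.0
    hits = sum(keyword.lower() in title.lower() for title in top)
    return hits / len(top)


# Placeholder titles, ordered by rank; not taken from the dataset.
level1_titles = [
    "Breaking news tonight",
    "Covid: what we know",
    "COVID-19 vaccine explained",
    "Life after covid",
    "Covid variants update",
]
level2_titles = [
    "Breaking news tonight",
    "Election debate highlights",
    "Stand-up comedy special",
    "Market report",
    "Travel vlog",
]
level1_share = keyword_share(level1_titles, "covid")
level2_share = keyword_share(level2_titles, "covid")
print(level1_share, level2_share)
```

Matching on the title alone is a deliberately narrow test: a video can be about Covid without naming it in the title, so tags or descriptions could be checked the same way.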
Figure 10 shows the titles of the top 10 first-level recommended videos for our keyword (large rectangles) and, in the smaller cells, how many other videos each of them leads to at the second level of recommendation.
Figure 10: Tree map of the titles of recommended videos (levels 1 and 2)
Figure 11: Multi-set bar chart of the five most frequent languages in recommended videos at level 1 (red) and level 2 (yellow)
We also wanted to look at which languages dominate YouTube’s recommendations. We find that at both the first and the second level, English videos are the most frequent (Figure 11).
As shown in Figure 12, even when the keyword is in one language (English, for example), the recommended videos can be in other languages.
Figure 12: Deviation of the recommended videos from the French keyword to other languages.
As the results show, particularly the geo-spatialization of video content, the YouTube algorithm recommends videos related to the countries that receive the most media coverage and/or were the most affected by the Covid pandemic. It is notable that videos related to Ivory Coast, Afghanistan and Ukraine are recommended. In the case of Ukraine, did the war in Ukraine boost media coverage of the Covid situation there? That is a genuine research question that could be investigated.
The network mapping revealed communities around media (news, bbc, ...), politics (covid politics, Joe Biden, Prime Minister, and so on), China, medicine and more. The focus on Limburg revealed a specific community around “comedy” and “funny”, which is linked to the political and health communities. As intriguing as the appearance of this community is, its links to other communities are even more so. We know that during the lockdown people let their imagination speak and were creative and funny in what they published (TF1, March 2020). This may explain the appearance of the “comedy” and “funny” communities.
Our results highlight the parts of the world most affected by, or most publicized in relation to, the Covid pandemic. The results also bring out the subject of well-being, as seen in the pink community in Fig 5. This work also gave rise to a desire to know more about the edges between the comedy/funny communities and the political or medical communities, and to dig deeper into Australia and Afghanistan.
Rieder, B., Matamoros-Fernández, A., & Coromina, Ò. (2018). From ranking algorithms to ‘ranking cultures’: Investigating the modulation of visibility in YouTube search results. Convergence, 24(1), 50–68.
Arthurs, J., Drakopoulou, S., & Gandini, A. (2018). Researching YouTube. Convergence: The International Journal of Research into New Media Technologies, 24(1), 3–15.
Schwemmer, C., & Ziewiecki, S. (2018). Social Media Sellout: The Increasing Role of Product Promotion on YouTube. Social Media + Society, 4(3), 205630511878672.
Breucker P., Cointet J., Hannud Abdo A., Orsal G., de Quatrebarbes C., Duong T., Martinez C., Ospina Delgado J.P., Medina Zuluaga L.D., Gómez Peña D.F., Sánchez Castaño T.A., Marques da Costa J., Laglil H., Villard L., Barbier M. (2016). CorTexT Manager (version v2).
TF1. Mieux vaut en rire. des vidéos et photos drôles sur le confinement. 24 March 2022.
Scott, J. (1988). Social network analysis. Sociology, 22(1), 109-127.
Azam, M. & de Federico, A. (2016). Sociologie de l’art et analyse des réseaux sociaux. Sociologie de l'Art, PS2526, 13-36.
ToolName | Analyzing the content of Youtube’s recommendation system |
ToolUri | http://tools.digitalmethods.net/Dmi/WinterSchool2023YoutubeSystemRecommendation |
ShortDescription | YouTube system recommendation |