Youtube API, PostgreSQL database on AWS and EDA
Photo by Alexander Shatov on Unsplash
Agenda
1. Introduction
2. Objectives
3. Youtube API
4. Database
5. Results and Discussion
6. Conclusions
1. Introduction
In this project, Grubby's youtube channel was analyzed in terms of views, likes, dislikes and comments on its videos about the new game Age of Empires IV. Data was collected by the Youtube API and was placed in a PostgreSQL database in the cloud on a server of Amazon Web Services (AWS). The whole project was done in the Python programming language. The methods and code of how the data was obtained and allocated in the database will be exposed later. Finally, an exploratory data analysis (EDA) was performed using Power BI.
Age of Empires IV is a real-time strategy (RTS) video game developed by Relic Entertainment in partnership with World's Edge and published by Xbox Game Studios. The game was released on October 28th, 2021. The game drew a lot of attention as it is the only triple RTS game developed and released since Starcraft 2, which was released in 2010 by Blizzard Entertainment. Its success is demonstrated in its numbers, all-time peak players last month 73,928 (steamdb), 19,162 very positives reviews on steam and thousands of games played every day.
Manuel Schenkhuizen, also known as Grubby, is a Dutch real-time strategy gamer and former professional esports player. He competed in the RTS games Warcraft III (WC3), Warcraft III: The Frozen Throne and Starcraft II. Grubby is the most successful WC3 player of all time, as an Orc player, having won more than 38 LAN tournaments (in person events), of which six were World Championships (WIKIPEDIA). Now he livestreams at the livestream platform Twitch and he has a youtube channel (FollowGrubby) with more than 1 million views, 227,000 subscribers and 500 videos published to this date.
All the code can be found in the link of my github at the end of the post.
2. Objectives
The objective of this project was to put into practice the use of one of the best technologies for obtaining data currently employed by most companies, the APIs, and the use of the cloud database to store the data obtained, perform EDA in the collected data, and analyze which videos performed best and their motives.
3. Youtube API
Figure 1 - Pandas dataframe with the data collected from the API
The first part of the project was to get the api data from youtube. For this, an API key was created and the channel ID was obtained, essences for the operation of the API. In their possession, an API call is made using the request library and the data is obtained in JSON format. From the API, vídeo_id, vídeo_title, upload_date, view_count, like_count, dislike_count and comment_count were requested. The data has already been pre-requested in order of publication of the videos, from the most recent to the oldest. Then the data was saved in a Panda's dataframe in order to save it in a database, as seen in Figure 1. Finally, the data was saved in a csv file.
4. Database
The cloud database chosen was relational type, PostgreSQL, and it was created on the AWS platform. After the connection was created and made to the jupyter notebook, the first commit was made to load the data into the database.
Figure 2 - Data successfully loaded to the database
Figure 3 - Updated data
In Figure 2, we see that all data collected from API, vídeo_id, vídeo_title, upload_date, view_count, like_count, dislike_count and comment_count, were successfully imported. They were organized by date, as seen in the query above the table. To test whether the code is working for when new videos are added to the channel and when the numbers are updated by the API, a second API call and updated new data was made in the database, as seen in Figure 3. In it, we see two new videos, '2v2 WITH THEVIPER' and 'The unfinished CLASH ft TheViper', and also the updated data for all other videos.
5. Results and Discussion
With the database working properly, the next step was to do an analysis of the data. In my other projects I almost always use Python and the matplotlib and seaborn libraries to do the analysis, so I decided to use Power BI this time. All the steps in detail that were done in Power BI, in the 'Transform Data' part, and the previews are available on github as well. In addition to the metrics obtained by the API, I implemented two more metrics for a deeper analysis. They are: 'views per like', how many views are required to generate 1 like and 'views per comment', how many views are required to generate 1 comment. In both we want the smallest possible number, because the smaller these numbers, the fewer views are needed to generate 1 comment or 1 like, generating greater engagement in the videos.
Figure 4 - View_count for each vídeo
Figure 5 - View_count and like_count for each video
As the scope of the project is the game Age of Empires, the analysis will contain only the videos on this subject. In Figure 4, we see which videos have the highest number of views and the average number of views. Videos containing educational content, which would be videos helping other players improve in the game, have the highest number of views and all are above the midline of views. The other videos, which are mostly gameplay, don't perform as well. This is due to the game being new and many people look for videos on youtube to help them improve. In Figure 5, we have the same views for each video, but with the amount of likes they received. We soon realize that, generally, the more views, the greater the tendency to have more likes, but for example, we have the videos 'AoE4 Civilizations TIER LIST' and 'Struggling in AoE4? Watch THIS!' have not had a similar amount of likes from other videos with the same number of likes.
Figure 6 - Views per like
To generate engagement on youtube, likes are key and the more likes a video has the better. In Figure 6, we have the required number of views to generate a like. Comparing videos with almost the same number of views, we have those videos with the participation of someone else (collabs) generate more engagement, for example, 'The Unfinished FT CLASH. TheViper' generated more engagement than the video 'What if I only make BARRACKS?', even both. The contents containing how to play guides are also the ones that generate the most engagement, compared to other videos with the number of similar views, for example, 'THE ULTIMATE CHINESE GUIDE' and 'Can't beat longbowmen? WATCH THIS!'.
Figure 7 - like_count for each vídeo
Figure 7 confirms that the category of guides is always well received by viewers and generates an engagement, in the likes, above average.
Figure 8 - Number of comments for each vídeo
Figure 9 - Comments per view
In the area of comments, we have in figure 8 practically the same result of engagement of likes. Tab videos are always above average comments. In Figure 9, we have how many views are needed to generate 1 comment. An interesting video to highlight is the 'COLLAB ft. TheViper! – AoE4 Balance Discussion Podcast While Watching Replay', because it was the one that had the best result in, perhaps by the podcast format, the only one of all the videos analyzed.
In the area of comments, we have in figure 8 practically the same result of engagement of likes. Tab videos are always above average comments. In Figure 9, we have how many views are needed to generate 1 comment. An interesting video to highlight is the 'COLLAB ft. TheViper! – AoE4 Balance Discussion Podcast While Watching Replay', because it was the one that had the best result in, perhaps by the podcast format, the only one of all the videos analyzed.
Figure 10 - Dislikes for each video
The channel in general does not have a lot of dislikes, but a video catches the eye. In Figure 10, you can see that the video 'AoE4 Civilizations TIER LIST' received a much larger number of dislikes than the other videos. It is a tierlist video, one of the categories that is making success at the moment, namely a video with personal opinion and apparently people who did not agree gave dislike in the video.
Figure 11 - Number of views for the last quarter of the year
Regarding the number of views in the last quarter of the year, we have in figure 11, that in October (almost no uploads) and November, the number of views were above average and, in December below average.
Figure 12 - Number of views for the last quarter of the year with video title
The reason for November is always above the average views and the month of December below is in figure 12. In it we see that the videos of guides and help were all uploaded in the month of November, while the rest in the following month.
6. Conclusions
It was possible to collect the data from the youtube API and store it in a database in the cloud. The analysis showed us that videos with the themes of guides are the most popular and the ones that produce the most community engagement. Finally, we have a video containing personal opinions, in which generated the largest number of dislikes among all the videos analyzed.
GitHub: Link for code and pdf version
Comments