This project was created to try various machine learning models on YouTube's trending video statistics, obtained from Kaggle, for educational purposes. The main dataset used in this project is the United States one, last updated on December 5th, 2021. Datasets for other countries can be downloaded from: YouTube Trending Video Dataset (updated daily)
Image retrieved from: Galaxy Marketing YouTube Stats
This dataset was created using a web scraper built on the YouTube Data API, which is now part of Google Cloud Platform. The scraper itself can be found at the following link: https://github.com/mitchelljy/Trending-YouTube-Scraper. The daily-updated dataset is available on Kaggle: YouTube Trending Video Dataset (updated daily).
The scraper can create usable data in the form of '.csv' files for different countries. Every dataset comes with a column called categoryId, whose values differ by region (there are a total of five regions in the dataset), most likely corresponding to:
- Americas (North and South America)
- Europe
- Africa
- Asia
- Australia
Each csv file comes with a 'JSON' file from which users can retrieve the corresponding category IDs. An example of a category is music. I'll start by creating models with data from the United States only, then potentially test on data from other countries to see if the models are consistent.
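As a rough illustration of how the category IDs could be read from the JSON file, here is a minimal sketch. The JSON layout (an "items" list whose entries carry an "id" and a "snippet" with a "title") is an assumption based on the YouTube Data API video category response, and the inline sample and the build_category_map helper are hypothetical stand-ins rather than the project's actual code.

```python
import json

# Inline stand-in for a category JSON file. The layout is an assumption;
# check it against the real file shipped with the dataset.
SAMPLE_JSON = """
{
  "items": [
    {"id": "10", "snippet": {"title": "Music"}},
    {"id": "24", "snippet": {"title": "Entertainment"}}
  ]
}
"""


def build_category_map(raw_json: str) -> dict:
    """Return a {category id (int): category title (str)} lookup."""
    items = json.loads(raw_json)["items"]
    # IDs are stored as strings in the JSON, so convert them to integers
    # to match a numeric category column in the csv.
    return {int(item["id"]): item["snippet"]["title"] for item in items}


category_map = build_category_map(SAMPLE_JSON)
print(category_map)
```

With a real file, the same helper could be fed the text of the JSON file for the chosen country, and the resulting dictionary could then be mapped onto the category column of the csv data.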
The csv file has 95391 rows and 16 columns. The category ID JSON file is used to add one more column. I then created the following:
- 'category': descriptive name for each 'categoryId'
- 'trending_date_dt': Python datetime version of 'trending_date'
- 'published_date': Python datetime version of 'publishedAt'
- 'time_till_trending': time elapsed between 'published_date' and 'trending_date_dt'
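The derived columns listed above could be built roughly as follows with pandas. This is only a sketch under stated assumptions: the two sample rows are made up, the ISO 8601 timestamp format is assumed, the category_map dictionary is a hypothetical lookup, and 'time_till_trending' is assumed to be the trending datetime minus the published datetime.

```python
import pandas as pd

# Two made-up rows standing in for the US csv file.
# The timestamp format is an assumption about the raw data.
df = pd.DataFrame(
    {
        "publishedAt": ["2021-12-01T15:00:00Z", "2021-12-03T08:30:00Z"],
        "trending_date": ["2021-12-02T00:00:00Z", "2021-12-05T00:00:00Z"],
        "categoryId": [10, 24],
    }
)

# Hypothetical id-to-name lookup, normally built from the JSON file.
category_map = {10: "Music", 24: "Entertainment"}

# Descriptive category name for each numeric id.
df["category"] = df["categoryId"].map(category_map)

# Parse both timestamp columns into timezone-aware datetimes (UTC).
df["trending_date_dt"] = pd.to_datetime(df["trending_date"], utc=True)
df["published_date"] = pd.to_datetime(df["publishedAt"], utc=True)

# Assumed definition: elapsed time from publication to trending.
df["time_till_trending"] = df["trending_date_dt"] - df["published_date"]

print(df[["category", "time_till_trending"]])
```

Keeping 'time_till_trending' as a timedelta makes it easy to convert later, for example to hours with `df["time_till_trending"].dt.total_seconds() / 3600`, if a model needs a plain numeric feature.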