This project/research was created in order to try various Machine Learning models on YouTube's Trending video statistics obtained from Kaggle for educational purposes. Link to Dataset: https://www.kaggle.com/datasnaek/youtube-new

GateraGael/Machine-Learning-Project-Youtube-Trend-Analysis

YouTube Trending Data Analysis Machine Learning Project

This project was created in order to try various Machine Learning models on YouTube's Trending video statistics obtained from Kaggle for educational purposes. The main dataset used in this project is the one from the United States, last updated on December 5th, 2021. Datasets from various countries can be downloaded from: YouTube Trending Video Dataset (updated daily)

YouTube Trending Statistics (image retrieved from: Galaxy Marketing YouTube Stats)

Table of contents

  1. Introduction
    1. USA Dataset
    2. Test with Colab Notebook

Introduction

This dataset was created using a web scraper that used the YouTube Data API, which is now a part of Google Cloud Platform. The scraper itself can be found at the following link: https://github.com/mitchelljy/Trending-YouTube-Scraper. The daily-updated dataset is available on Kaggle: YouTube Trending Video Dataset (updated daily).

The scraper produces usable data in the form of '.csv' files for different countries. Every dataset comes with a column called category_id, whose values differ by region. There are five regions in the dataset, most likely corresponding to:

  1. Americas (North and South America)
  2. Europe
  3. Africa
  4. Asia
  5. Australia

Each file comes with a 'JSON' file from which users can retrieve the corresponding category IDs. An example of a category is music. I'll start by creating models with data from the United States only, then potentially test them on data from other countries to see whether the models are consistent.
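As a rough illustration of how the category IDs could be read from the accompanying JSON file, here is a minimal sketch. It assumes the file has a top-level "items" list whose entries carry an "id" string and a "snippet" with a "title". The sample string and the helper name `build_category_lookup` are hypothetical, not taken from this repository.

```python
import json

# Hypothetical excerpt imitating the assumed layout of a category file:
# a top-level "items" list, each entry holding an "id" and a "snippet" title.
SAMPLE_CATEGORY_JSON = """
{
  "items": [
    {"id": "10", "snippet": {"title": "Music"}},
    {"id": "24", "snippet": {"title": "Entertainment"}}
  ]
}
"""


def build_category_lookup(raw_json: str) -> dict:
    """Map each numeric category id to its human-readable title."""
    data = json.loads(raw_json)
    return {int(item["id"]): item["snippet"]["title"] for item in data["items"]}


category_lookup = build_category_lookup(SAMPLE_CATEGORY_JSON)
print(category_lookup[10])
```

For a real file, the same helper could be fed the text read from disk. Because the IDs are region-specific, a separate lookup per country file is probably the safer choice.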

USA Dataset

The CSV file has 95391 rows and 16 columns. The category ID JSON file contributes an additional column. I then created the following:

  • 'category' descriptive qualitative representation of the 'categoryId'
  • 'trending_date_dt' Python datetime version of the 'trending_date'
  • 'published_date' Python datetime version of the 'publishedAt'
  • 'time_till_trending' time elapsed between 'published_date' and 'trending_date_dt'
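The derived columns listed above could be produced along these lines. This is only a sketch under stated assumptions: both raw timestamps are taken to look like "2021-12-05T00:00:00Z", and 'time_till_trending' is taken to mean the time elapsed between publication and trending, as the column name suggests. The helper `add_derived_columns`, the sample row, and the "Unknown" fallback are hypothetical.

```python
from datetime import datetime

# Assumed layout of the raw 'publishedAt' and 'trending_date' strings.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def add_derived_columns(row: dict, category_lookup: dict) -> dict:
    """Return a copy of one CSV row with the four derived fields added."""
    enriched = dict(row)
    # Readable category name; "Unknown" covers ids missing from the lookup.
    enriched["category"] = category_lookup.get(int(row["categoryId"]), "Unknown")
    # Parse both raw timestamp strings into Python datetime objects.
    enriched["trending_date_dt"] = datetime.strptime(row["trending_date"], TIMESTAMP_FORMAT)
    enriched["published_date"] = datetime.strptime(row["publishedAt"], TIMESTAMP_FORMAT)
    # Assumption: elapsed time from publication until the video trended.
    enriched["time_till_trending"] = (
        enriched["trending_date_dt"] - enriched["published_date"]
    )
    return enriched


# Hypothetical row; the values are illustrative, not taken from the dataset.
sample_row = {
    "categoryId": "10",
    "publishedAt": "2021-12-03T17:00:00Z",
    "trending_date": "2021-12-05T00:00:00Z",
}
result = add_derived_columns(sample_row, {10: "Music"})
print(result["category"], result["time_till_trending"])
```

With 95391 rows, the same steps would more likely be done column-wise in pandas, using `pd.to_datetime` and subtracting one datetime column from the other. The per-row version here just keeps the sketch dependency-free.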
