This is my static data visualization research assignment for "Grand Challenges in Computer Science". The project asked us to pose a question that can be answered from a dataset and, after analysis, to present the question and our research as a poster.
My question was whether pop music is getting repetitive. The dataset was scraped from metrolyrics.
You can download my dataset here.
These instructions will get you a copy of the project up and running on your local machine for development.
First, run through the dependency installation process.
Install a python3 virtualenv, activate it, then run:
pip3 install -r requirements.txt
Fonts used in my visualizations: xkcd font and Humor-Sans
The Python scripts are named in the order of intended use. A brief description of each script is below.
This Python script generates a text file called searchSeq, which holds one metrolyrics link per letter (A, B, ..., Y, Z), each pointing to the parent web page for artists starting with that letter.
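A minimal sketch of this step, assuming one index page per letter. The URL pattern below is a placeholder, not the real metrolyrics path, and the function names are hypothetical.

```python
import string

# Hypothetical URL pattern; the real metrolyrics path is not given in this README.
BASE_URL = "https://example.com/artists-{letter}.html"


def build_search_links(base_url=BASE_URL):
    """Return one index-page URL per letter A..Z."""
    return [base_url.format(letter=letter) for letter in string.ascii_uppercase]


def write_search_seq(path="searchSeq", base_url=BASE_URL):
    """Write the links to a text file, one per line, and return them."""
    links = build_search_links(base_url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(links) + "\n")
    return links
```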
For every link in searchSeq, get all of the artists listed under that letter and place them in a file called Artist_[A-Z].
For every artist in each Artist_ file, append to queue.txt the year each of their songs was produced and the song link.
Read queue.txt into a list of jobs, then spawn 50 workers to retrieve the lyrics asynchronously and build the directory structure.
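The worker step could be sketched roughly as follows. This uses a thread pool from the standard library, which may differ from what the script actually does. The tab-separated `year<TAB>link` line format and the `fetch` callback are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_queue(lines):
    """Turn 'year<TAB>link' lines (assumed format) into (year, link) tuples."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        year, link = line.split("\t", 1)
        jobs.append((year, link))
    return jobs


def fetch_all(jobs, fetch, max_workers=50):
    """Run fetch(link) for every job on a thread pool.

    Returns {link: (year, result)}; result is None if the fetch raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, link): (year, link) for year, link in jobs}
        for future in as_completed(futures):
            year, link = futures[future]
            try:
                results[link] = (year, future.result())
            except Exception as error:
                # One failed download should not stop the rest.
                results[link] = (year, None)
                print(f"failed: {link}: {error}")
    return results
```

Passing the download function in as `fetch` keeps the pool logic testable without network access; a real run would supply a function that requests the page and extracts the lyrics.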
For each lyric file in the dataset directory, remove copyright notices and instrumental lyrics. Calculate the repetition ratio by putting the lyrics into a set, and record all needed information in lyricDataset.csv.
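One way to read the set-based ratio is unique words divided by total words. The sketch below assumes that definition and a simple word tokenizer; the actual script may split or normalize the lyrics differently.

```python
import re


def repetition_ratio(lyrics):
    """Return unique words / total words; lower values mean more repetition.

    Assumed definition for illustration, not necessarily the script's own.
    """
    words = re.findall(r"[a-z']+", lyrics.lower())
    if not words:
        return None  # nothing to measure, e.g. an empty or instrumental file
    return len(set(words)) / len(words)
```

For example, `repetition_ratio("la la la la")` gives 0.25 (one unique word out of four), while a lyric with no repeated words gives 1.0.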
Python scripts used to generate visualizations used in my poster.
This generates the line graph of Year vs Compression Ratio used in my poster.
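A rough sketch of how such a graph could be produced from lyricDataset.csv. The column names `year` and `ratio`, the per-year averaging, and the output file name are assumptions, since the CSV layout is not described here.

```python
import csv
from collections import defaultdict
from statistics import mean


def yearly_mean(rows, year_key="year", ratio_key="ratio"):
    """Average the ratio per year; returns (year, mean) pairs sorted by year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[int(row[year_key])].append(float(row[ratio_key]))
    return sorted((year, mean(values)) for year, values in by_year.items())


def plot_yearly_mean(csv_path="lyricDataset.csv", out_path="ratio_by_year.png"):
    """Read the CSV and save a line graph of year against mean ratio."""
    # Imported here so the aggregation above works without Matplotlib installed.
    import matplotlib.pyplot as plt

    with open(csv_path, newline="", encoding="utf-8") as handle:
        points = yearly_mean(csv.DictReader(handle))
    years = [year for year, _ in points]
    ratios = [ratio for _, ratio in points]
    fig, ax = plt.subplots()
    ax.plot(years, ratios)
    ax.set_xlabel("Year")
    ax.set_ylabel("Compression Ratio")
    fig.savefig(out_path)
    plt.close(fig)
```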
Generates a histogram showing the distribution of the dataset.
Generates a histogram showing the distribution of the top100 songs dataset from here.
- Beautiful Soup - The web scraper
- Matplotlib - The graphing library