AI-lab task by Overfitter
Fine-tuning and prompting a transformer (GPT2) for a song-lyrics generator.
Group members and Contributors:
| Tr33Bug | gusse-dev | CronJorian | BFuertsch |
|---|---|---|---|
- README.md
- 10_DataEngineering.ipynb
- 11_createDataset.py
- 20_GPT2_TrainingLoop.ipynb
- 21_GPT2_TrainingLoop.py
- startTraining.sh
- 30_ModelEvaluation.ipynb
- 40_Prompting.ipynb
- 41_Prompting.py
- 50_GeneratorGUI.py
Notebook to generate, clean, and analyze the lyrics dataset for the lyrics generator.
- generate: to generate the dataset, we use three lists from IMDb and the LyricsGenius framework to crawl the song lyrics via the API of [genius.com](https://www.genius.com):
  - Top 49 20th Century: https://www.imdb.com/list/ls058480497/
  - Top 100 AllTime: https://www.imdb.com/list/ls064818015/
  - Top 100 Rapper: https://www.imdb.com/list/ls054191097/
  - LyricsGenius-framework: https://github.com/johnwmillr/LyricsGenius
- clean: cleaning the `.txt` files and deleting all unnecessary characters such as `()`, etc.
- analyze: viewing graphs, merging the lists of artists, dropping short songs and artists with few songs, and counting the most used words.
In the end, we export the generated dataset files to `df_rap.csv`, `df_songs.csv`, and `df_top.csv`.
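As a rough illustration of the cleaning step, the sketch below strips leftover markers and bracket characters from a single lyrics string. The function name `clean_lyrics` and the exact rules (dropping `[Chorus]`-style section markers, removing parentheses) are assumptions made for this example; the notebook may clean the files differently.

```python
import re

def clean_lyrics(text: str) -> str:
    """Remove common crawl residue from one song's lyrics (illustrative rules only)."""
    # Drop section markers such as "[Chorus]" or "[Verse 1]" (assumed residue).
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Remove parentheses but keep the words inside them.
    text = text.replace("(", "").replace(")", "")
    # Collapse runs of spaces and tabs left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that are now empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Keeping the words inside parentheses is a design choice of this sketch: backing vocals often carry rhyme and rhythm that a lyrics model may benefit from.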
Python script to generate the datasets from the folders and save them as `df_rap.csv`, `df_songs.csv`, and `df_top.csv`.
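The general idea of turning folders of lyrics files into one CSV can be sketched as follows. This is not the actual script: the folder layout (`<root>/<artist>/<song>.txt`), the column names, and the function name `build_dataset` are all hypothetical, and the sketch uses the standard-library `csv` module instead of pandas so that it runs without extra packages.

```python
import csv
from pathlib import Path

def build_dataset(root: Path, out_csv: Path) -> int:
    """Collect every .txt file under root/<artist>/ into one CSV; return the row count."""
    rows = []
    for txt in sorted(root.glob("*/*.txt")):
        rows.append({
            "artist": txt.parent.name,  # folder name taken as the artist (assumed layout)
            "title": txt.stem,          # file name taken as the song title (assumed)
            "lyrics": txt.read_text(encoding="utf-8").strip(),
        })
    # newline="" lets the csv module handle line breaks inside multi-line lyrics.
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["artist", "title", "lyrics"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```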
Notebook to export the test data for evaluation and train the GPT2 model on the dataset.
Python script exported from `20_GPT2_TrainingLoop.ipynb` to train remotely on the KILab pool PC.
Helper script to perform the remote training on the KILab pool PC.
Notebook to evaluate the training of our models and compare them to the pretrained GPT2. For that, we load the training results and the models and calculate the BLEU score.
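For readers unfamiliar with BLEU, here is a minimal, self-contained sketch of a sentence-level score against a single reference, with uniform weights over 1- to 4-grams and no smoothing. It is meant to show what the metric measures, not how the notebook computes it; the notebook may rely on a library implementation with different tokenization or smoothing. The function names `ngrams` and `bleu` are chosen for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU against one reference, uniform weights, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision makes the whole score zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because generated lyrics rarely share 4-grams with a held-out song, unsmoothed sentence-level scores are often zero; library implementations usually offer smoothing or corpus-level aggregation for that reason.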
Notebook to perform prompting with OpenPrompt on the models.
Python script exported from `40_Prompting.ipynb` to train remotely on the KILab pool PC.
Python script to demonstrate the song lyrics generation via a GUI.
For `10_DataEngineering.ipynb`, an API token for the genius.com API must be stored in a file called `geniusToken.txt`.
cd ML-NLP-LyricsGen-Transformer/
touch geniusToken.txt
echo TOKEN > geniusToken.txt
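How the token file might then be consumed is sketched below. The helper names `read_genius_token` and `fetch_lyrics` are hypothetical and the notebook may load the token differently; `lyricsgenius.Genius` and `search_artist` are the LyricsGenius entry points, and calling them requires network access and a valid token.

```python
from pathlib import Path

def read_genius_token(path="geniusToken.txt"):
    """Return the API token stored in the file, without surrounding whitespace."""
    # `echo TOKEN > geniusToken.txt` appends a newline, so strip it.
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{path} is empty; put your genius.com API token in it")
    return token

def fetch_lyrics(artist_name, max_songs=3):
    """Fetch a few songs for one artist (needs network access and a valid token)."""
    import lyricsgenius  # imported lazily so the token helper works without the package
    genius = lyricsgenius.Genius(read_genius_token())
    artist = genius.search_artist(artist_name, max_songs=max_songs)
    if artist is None:  # the artist was not found
        return {}
    return {song.title: song.lyrics for song in artist.songs}
```

Keeping the token in a separate file, rather than in the notebook itself, makes it easier to keep it out of version control.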
To use the dataset in the notebooks, run `11_createDataset.py` to create the CSV files stored in `./datasets/`.
pip install pandas
python 11_createDataset.py