This repository contains tutorials, materials for testing purposes, and other documents relating to natural language processing and machine learning for the Knowledge Commons. The author, Tianyi (Titi) Kou-Herrema, is a German Studies PhD candidate with a deep interest in applying computational methods to scholarly work. She was hired as a research assistant for the Knowledge Commons project from Summer 2023 to Fall 2024, during which time she developed this project while working primarily with Ian Scott and Stephanie Vasko (with much support from the rest of the tech team, including Mike Thicke, Cassie Lem, Dimitrios Tzouris, and Bonnie Russell).
This project proceeds in stages, and each stage contains several steps. The overall workflow follows the sequence: "data acquisition - data extraction - data preprocessing - subset training - evaluation - implementation." The three end goals are: improving search functionality, offering related-record recommendations, and implementing subject tagging.
Another goal of this project is to demonstrate that, despite the use of computational methods to handle large quantities of textual data, human decisions were made at almost every step, many of which were based on my experience and knowledge. A common misconception is that we can feed data into a black box and magic simply happens, producing beautiful and meaningful output on the other end. That is not the case. Developers and researchers make decisions based on their best judgment, learning and improving along the way. In hindsight, some of the steps taken early on may seem naive, but as a researcher I was working with the knowledge I had at the time, and this documentation also records the process of acquiring that expertise.
- InvenioRDM: Read the InvenioRDM documentation
- Using the API to access files and stats: Postman & REST API (see the sketch after this list)
- Docker environments
- Short-term goals: testing Python libraries for extracting text from different types of deposited files
- Long-term goals: clean files and build (a) structured data frame(s); perform topic modeling or other analysis on the data at hand
- The script for text extraction comparison using different Python libraries on various file types: "stage1/tutorial1-textout.md"
- Materials for testing purposes are stored in the folder "text4test"
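For illustration, here is a minimal sketch of querying an InvenioRDM instance's REST API with Python's `requests` library. The base URL, token, and pagination parameters are placeholders, not values from this project.

```python
import requests

# Hypothetical InvenioRDM instance and API token; replace with real values.
BASE_URL = "https://example-invenio-instance.org/api"
TOKEN = "YOUR_API_TOKEN"

headers = {"Authorization": f"Bearer {TOKEN}"}

# Query the records endpoint; InvenioRDM paginates search results.
response = requests.get(
    f"{BASE_URL}/records",
    headers=headers,
    params={"size": 25, "page": 1},
)
response.raise_for_status()

# InvenioRDM wraps search results in a "hits" object.
for hit in response.json()["hits"]["hits"]:
    print(hit["id"], hit["metadata"].get("title"))
```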
During this process, I also used a Miro board to guide my steps and drew workflow diagrams to explain the process to colleagues.
You can also find a detailed explanation in "guide.md" in the "stage2" folder.
This step accesses and downloads files from the Invenio API and extracts text data from all downloaded files. Since the files come in various formats (e.g., PDF, Word, JPG, MP3), I developed a strategy to extract text based on each format (a sketch of the per-format dispatch follows the deliverables below). To save local storage space, files that are successfully processed are deleted afterward.
- Script "apiinvenio-9th.py"
- CSV file "output9clean.csv" (for data security stored elsewhere)
Examine the CSV file from the previous step to get a general idea of the data structure, check for missing values, and identify files that could not be processed (a short sketch of these checks follows the deliverable below). Invenio sets a hard limit of 10,000 records per query; this will be addressed later on.
- Markdown file "stage2/examine-output9.md"
After comparing NLTK and SpaCy for data preprocessing, I decided to use SpaCy for initial cleaning because it’s lighter, more up-to-date, and has a manageable learning curve. The goal of this step is to produce a clean, processed CSV for future use. Some files contain over 3 million tokens, so I developed strategies to process them in chunks to avoid out-of-memory (OOM) errors (see the chunking sketch after the deliverables below).
- Script "csv-preprocessing2.py"
- CSV file "processed_output.csv" (for data security stored elsewhere)
As I wrapped up the previous stage, I had a clean, processed CSV file. This stage is mainly about applying vectorization to the preprocessed output so I can build further applications on this data. Embedding/vectorization is itself a complicated field, and I spent time learning about it; notes can be found in the deliverables.
It is good practice to always examine and evaluate the data acquired from the previous step. At the same time, I studied embeddings and vectorization at a theoretical level so that I understand what happens behind the scenes, rather than applying tools to data without comprehension. I considered continuing with SpaCy in this stage; however, after learning about SpaCy's functions and limitations, I decided to use SpaCy only in Stage 2 and move on to an mBERT vectorization method in Stage 3 (a minimal embedding sketch follows the deliverables below).
- Script "stage3/examineprocessed_outputcsv.py"
- Markdown file "stage3/learningnotes.md"
Vectorization takes a long time to process. To be more efficient, I created a subset of 100 records to test and evaluate (see the sketch after the deliverable below).
- Script "stage3/ml-step1.py"