- This project was awarded the Summer Undergraduate Research Fellowship 2023 and is carried out under the guidance of Dr. Rajiv Ratn Shah of MIDAS-IIITD.
- It aims to apply techniques from controllable scientific text generation to high school physics.
- The project stems from the hypothesis that solving Physics Word Problems (PWPs) requires understanding concepts grounded in physics formulae, which makes it a fundamentally different task from solving Math Word Problems (MWPs).
- Topics are collected from Indian high school physics textbooks, along with questions from datasets such as SCIMAT (Kollepara et al. 2021); these datasets contain inconsistencies, which we fix.
- Linear transformations are applied to the questions to enlarge the dataset, on the premise that seeing linearly transformed variants of a question helps the language model learn the underlying concept.
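The augmentation idea above can be sketched as follows. This is a minimal illustration, not the project's released code: it assumes each question is stored as a template with named numeric quantities plus a formula for the answer, and the names `augment`, `template`, `values`, and `solve` are hypothetical. Each quantity x is replaced by `scale * x + shift`, and the answer is recomputed from the formula so that question and answer stay consistent.

```python
def augment(template, values, solve, scale=1.0, shift=0.0):
    """Return a (question, answer) pair with linearly transformed quantities.

    template: question text with named placeholders (hypothetical format)
    values:   dict mapping each placeholder name to its original number
    solve:    function that computes the answer from the quantities
    """
    # Apply the linear map x -> scale * x + shift to every quantity.
    new_values = {name: scale * x + shift for name, x in values.items()}
    # Recompute the answer from the formula instead of transforming it,
    # because the answer is generally not linear in the inputs.
    return template.format(**new_values), solve(**new_values)


template = "A car moves at {speed:g} m/s for {time:g} s. How far does it travel?"
values = {"speed": 10, "time": 5}

question, answer = augment(
    template, values, solve=lambda speed, time: speed * time, scale=2
)
print(question)  # A car moves at 20 m/s for 10 s. How far does it travel?
print(answer)    # 200
```

Recomputing the answer through `solve` keeps every generated variant tied to the same underlying formula, which is the property the augmentation relies on.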
- Vicuna is an open-source chat model fine-tuned from LLaMA, and fine-tuning it further can improve results on specific applications. This document gives an overview of how we fine-tuned Vicuna with the LoRA technique in both 8-bit and 16-bit precision.
- Low-Rank Adaptation, or LoRA (Hu et al. 2021), fine-tunes large neural networks efficiently by freezing the pretrained weights and learning the weight update as a product of two low-rank matrices. Because only a small number of parameters are trained, updates are faster and cheaper, which is especially useful when infrastructure for fine-tuning is limited.
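- We train the LoRA adapters with the base model loaded in 8-bit and in 16-bit precision. The LoRA rank is a separate hyperparameter: it sets the size of the adapter matrices and does not determine the numeric precision.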
- We followed this repository for fine-tuning with LoRA: Link
- We use our hand-annotated dataset of 9.5K physics questions. We split it into training and test sets and fine-tune the model on the training set using supervised fine-tuning.
- We then run inference on the test set to evaluate the fine-tuned model.
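The low-rank update used in the fine-tuning steps above can be illustrated with a small pure-Python sketch. It follows the formulation in Hu et al. 2021, h = W0·x + (alpha / r)·B·A·x, where W0 stays frozen and only A and B are trained. The matrix sizes, `alpha=16`, and the 4096-dimensional layer with rank 8 are illustrative assumptions, not the settings used in this project.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x, alpha, rank):
    """Compute h = W0 x + (alpha / rank) * B (A x).

    W0 is the frozen pretrained weight; only A (rank x d_in) and
    B (d_out x rank) would receive gradient updates during training.
    """
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    return d_out * d_in, rank * (d_in + d_out)


W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, d_out=2, d_in=3
A = [[0.5, -0.5, 1.0]]                   # rank x d_in (illustrative values)
B = [[0.0], [0.0]]                       # d_out x rank, zero-initialised
x = [1.0, 2.0, 3.0]

# With B initialised to zero, the adapted layer starts out identical
# to the pretrained layer.
h = lora_forward(W0, A, B, x, alpha=16, rank=1)
print(h)  # [7.0, 2.0]

full, lora = param_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

For the assumed 4096 x 4096 layer at rank 8, the adapter trains 256 times fewer parameters than full fine-tuning of that matrix, which is where the efficiency gain comes from.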
- Wikipedia articles are retrieved using similarity search between the sub-topics and the titles of Wikipedia pages.
- These are stored as embeddings in a vector database (e.g. Pinecone).
- At inference time, the question is sent to the vector database, where Approximate Nearest Neighbor (ANN) search finds the N passages most relevant to solving it.
- The question and the N retrieved passages are then passed to the language model as the input prompt. Inference is run on the test set again to obtain the results.
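The retrieval steps above can be sketched in a few lines. This is a toy stand-in under stated assumptions, not the released pipeline: it uses a bag-of-words vector instead of a learned embedding model, and an exact brute-force cosine search instead of ANN search in a vector database. The function names and the prompt layout are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    tokens = (token.strip(".,?!") for token in text.lower().split())
    return Counter(token for token in tokens if token)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, n=2):
    """Return the n passages most similar to the question (exact search)."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:n]


def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one input prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "Newton's second law states that force equals mass times acceleration.",
    "Ohm's law relates voltage, current and resistance.",
    "Kinetic energy equals half the mass times velocity squared.",
]
question = "What force gives a mass of 2 kg an acceleration of 3 m/s^2?"

top = retrieve(question, passages, n=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a deployed setup, `embed` would be replaced by the embedding model and `retrieve` by a query against the vector database; the prompt construction step stays the same.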
- We release our data augmentation code for generating the dataset, along with the training and test questions used.
- We additionally release the code for the retrieval pipeline.