This repository contains code and resources for automating the classification of court orders into substantive and non-substantive categories using machine learning (ML) models and large language models (LLMs). The project leverages text extraction techniques, natural language processing (NLP), and model-based classification to streamline the analysis of legal documents.
Court hearings produce numerous orders daily, but not all contribute to resolving a case. Distinguishing between substantive hearings (those advancing case resolution) and non-substantive ones (e.g., adjournments) is critical for analyzing court performance. Manual classification is labor-intensive, so this project automates the process by applying data science techniques to classify hearings based on their textual content.
We have implemented two primary approaches:
- Machine Learning Classifier (LightGBM): Achieves 89% accuracy with labeled training data.
- Large Language Model (LLM) Classifier: Achieves 81% accuracy without needing labeled training data, reducing the cost of data preparation.
All code is designed to run on Google Colab, ensuring reproducibility and accessibility.
- Text Extraction: Extracts text from court order PDFs using
pdfplumber
andpytesseract
(OCR). - Preprocessing: Tokenizes, cleans, and normalizes text for further analysis.
- Machine Learning Models: Trains classifiers such as LightGBM and Naive Bayes on extracted text.
- Large Language Model (LLM) Integration: Leverages the pre-trained capabilities of LLMs via Ollama for unsupervised classification.
- Binary Classification: Predicts whether each court order is substantive or non-substantive.
- All notebooks are available and pre-configured for Colab execution.
- Simply upload the notebook to Colab or access the shared link to start running the models.
-
Text Extraction: Extract text from court order PDFs using the
pdfplumber
library or OCR with Tesseract. -
Preprocessing: Clean and preprocess the text using tokenization, lemmatization, and stopword removal.
-
Classification:
- Train the machine learning classifier (LightGBM) using labeled data.
- Alternatively, use the LLM-based classifier to categorize court orders without the need for labeled training data.
-
Results:
- Machine Learning Classifier: 89% accuracy.
- Large Language Model (LLM) Classifier: 81% accuracy.
The dataset used consists of court orders from various hearings. Each order is classified into two categories:
- Substantive: Court orders advancing the resolution of a case.
- Non-Substantive: Orders that are procedural or involve adjournments.
For reproducibility, we use an SQLite database to store the court orders, which can be processed and classified via the scripts provided.
- Extract PDFs: The orders are stored in an SQLite database as BLOBs. The script retrieves the BLOB data and converts it into text.
- Preprocess Text: Text data is cleaned and transformed using tokenization, TF-IDF vectorization, and lemmatization.
- Model Training:
- LightGBM Classifier: Trains using supervised learning on the extracted features.
- LLM Classifier: Uses pre-trained language models to classify without labeled training data.
- Evaluation: Evaluate model performance using accuracy, precision, recall, and F1-score.
Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE
file for details.