Handwriting data creation

This is our second project for the Machine Learning course (CS-433) at EPFL.

Team Members

David Schulmeister
Amene Gafsi
Rosa Mayila

Overview

This project generates synthetic "handwritten" math exercises to simulate student solutions. The output is designed for use in training machine learning models for tasks such as text detection, OCR, and automated grading. By leveraging language models, the pipeline creates LaTeX-based exercises, converts them into PDFs and PNG images, and applies realistic augmentations like noise, blur, and simulated mistakes. This approach produces data that closely resembles scanned handwritten documents while incorporating irregular formatting and errors to mimic human handwriting.

Key Features

LaTeX-Based Exercise Generation: Uses a language model to create math exercises with challenging equations involving square roots, powers, and text explanations. Generated exercises are formatted in LaTeX.
Multi-Language Support: Exercises can be generated in various languages, including English, French, German, and Italian.
Mistake Simulation: Adds a realistic touch by striking through a word using a LaTeX \strikeMistake command to mimic student corrections.
Handwriting Irregularities: Dynamically adjusts text placement with a custom \processtext LaTeX command to introduce word irregularities.
PDF and PNG Conversion:
- Compiles LaTeX documents into PDFs.
- Converts PDFs into high-resolution PNG images for broader usability.
Image Augmentations:
- Adds noise and blur to PNG images to simulate real-world scanned handwriting artifacts.
- Generates multiple versions of the same exercise with different visual distortions.
Flexible Design Templates:
- Supports various fonts (e.g., "ML4Science" and "JaneAusten").
- Allows customization of page colors, text colors, and optional grid overlays.

Repository Structure

Key Files and Directories:

latex_generator.py
Handles LaTeX-based exercise generation using a language model API. It:
- Generates exercises with equations and explanations in LaTeX.
- Supports language variation and mistake simulation.
- Saves generated LaTeX content to structured directories.
main.py
The main script orchestrating the full pipeline. It:
- Generates LaTeX exercises.
- Applies headers and templates to the LaTeX files.
- Converts LaTeX to PDFs and PDFs to PNG images.
- Adds noise and blur to simulate scanned handwritten artifacts.
utils.py
Contains utility functions to:
- Compile LaTeX files into PDFs.
- Convert PDFs to PNG images.
- Add noise and blur to images.
- Manage directories, clean auxiliary files, and generate LaTeX headers.
os_utils.py
Provides helper functions to run shell commands, check file existence, create folders, and retrieve subfolders.
data/
Directory where all generated data is stored:
- latex/: Contains generated LaTeX files.
- generated/: Contains compiled PDFs and PNG images with augmentations.
.env
Stores the API key for the language model API.

Dependencies and Requirements

System Requirements

Programming Language: Python 3.8+
System Tools:
- xelatex (for LaTeX compilation, part of TeX Live or MiKTeX)
- pdftoppm (from poppler-utils for PDF-to-PNG conversion)
Python Packages:
- openai (for the language model API client)
- numpy (for noise generation)
- Pillow (for image processing: noise and blur)
- python-dotenv (to load environment variables from .env; imported as dotenv)

Setup

Install System Dependencies: Ensure xelatex, pdfcrop, and pdftoppm are installed and available in your system's PATH.

Example (Ubuntu/Debian):
```
sudo apt-get update
sudo apt-get install texlive-full poppler-utils
```
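
Before running the pipeline, it can help to confirm that the tools are actually visible on your PATH. The following is a minimal sketch using only the Python standard library; the function name `find_missing_tools` is a hypothetical helper and is not part of this repository.

```python
import shutil

# Tools named in the setup step above; adjust the list if your setup differs.
REQUIRED_TOOLS = ["xelatex", "pdfcrop", "pdftoppm"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing, install the corresponding package (TeX Live or MiKTeX for xelatex, poppler-utils for pdftoppm) and run the check again.
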
Install Python Dependencies: In your Python virtual environment, run:
```
pip install -r requirements.txt
```
(If you don't have a requirements.txt, ensure all necessary packages listed above are installed.)
Set Up API Key: The model requires an API key for access. Create a .env file in your project directory:
```
echo "API_KEY=your_api_key_here" > .env
```
Replace your_api_key_here with your actual API key. Make sure you have permission to use the provided model endpoint.
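
As an illustration of how the key could be read at runtime, here is a small sketch. It assumes the variable is named API_KEY as shown above; the helper names `read_api_key` and `load_api_key` are hypothetical and may differ from what latex_generator.py actually does.

```python
import os


def read_api_key(env=None):
    """Return API_KEY from `env` (defaults to os.environ) or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "API_KEY is missing or still set to the placeholder; check your .env file."
        )
    return key


def load_api_key():
    """Load .env via python-dotenv if it is installed, then read API_KEY."""
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package
        load_dotenv()  # looks for a .env file and loads it into os.environ
    except ImportError:
        pass  # fall back to variables already set in the environment
    return read_api_key()
```

Failing early with a readable message makes a missing or placeholder key easier to spot than an authentication error returned later by the API.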

Directory Structure: Ensure the following directory structure exists:

handwriting_data_creation/
├── .env
├── latex_generator.py
├── main.py
├── utils.py
├── os_utils.py
├── data/
│   ├── latex/
│   └── png/
└── requirements.txt
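
If the data folders do not exist yet, they can be created with a few lines of Python. This is a sketch under the assumption that only data/latex/ and data/png/ are needed, as in the tree above; `ensure_data_dirs` is a hypothetical helper, not a function from os_utils.py.

```python
from pathlib import Path


def ensure_data_dirs(project_root="."):
    """Create data/latex and data/png under `project_root` if they are missing."""
    data_dir = Path(project_root) / "data"
    created = []
    for name in ("latex", "png"):
        path = data_dir / name
        # parents=True also creates data/; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_data_dirs()` from the project root is safe to repeat, since existing folders are left untouched.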

Usage

1. Run the Pipeline: Run the main script, main.py, which starts the pipeline from its if __name__ == "__main__": block.

Example:

python main.py

This will:

Load the API_KEY from .env.
Interact with the model to generate prompts and exercises.
Create LaTeX documents, compile them to PDFs, and convert to PNGs.
Add noise and blur to the generated images.
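
The compile and convert steps listed above can be sketched as two external commands. This is an illustration only: the helper names and the 300 dpi default are assumptions, and the actual functions in utils.py may use different options.

```python
import subprocess
from pathlib import Path


def xelatex_command(tex_file, out_dir):
    """Build the xelatex call that writes the PDF into `out_dir`."""
    return [
        "xelatex",
        "-interaction=nonstopmode",  # do not stop and prompt on LaTeX errors
        f"-output-directory={out_dir}",
        str(tex_file),
    ]


def pdftoppm_command(pdf_file, out_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to <out_prefix>-N.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_file), str(out_prefix)]


def tex_to_png(tex_file, out_dir, dpi=300):
    """Compile `tex_file` to PDF, then render the PDF pages as PNG images."""
    tex_file, out_dir = Path(tex_file), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if a tool exits with an error.
    subprocess.run(xelatex_command(tex_file, out_dir), check=True)
    pdf_file = out_dir / tex_file.with_suffix(".pdf").name
    subprocess.run(pdftoppm_command(pdf_file, out_dir / tex_file.stem, dpi), check=True)
    return pdf_file
```

Keeping the command construction separate from the `subprocess.run` calls makes it possible to inspect or log the exact command line when a compilation fails.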

2. Adjust Parameters: You can customize the following parameters in main.py:

- languages: List of supported languages (e.g., ["English", "French", "German"]).
- fonts: Fonts used in LaTeX documents (e.g., ["ML4Science", "JaneAusten"]).
- nbr_of_texfiles: Number of exercises to generate.
- pagecolors: List of page background colors (e.g., ["white", "yellow"]).
- textcolors: List of text colors (e.g., ["black", "darkblue", "red"]).
- Augmentations: Adjust noise and blur levels in the add_noise_and_blur function.
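
To make the effect of these parameters concrete, the sketch below draws one design per exercise from the lists. It assumes that each exercise gets a randomly chosen combination; whether main.py samples randomly or iterates over all combinations is not stated here, and `sample_configs` is a hypothetical helper.

```python
import random

# Example values taken from the parameter descriptions above.
languages = ["English", "French", "German"]
fonts = ["ML4Science", "JaneAusten"]
pagecolors = ["white", "yellow"]
textcolors = ["black", "darkblue", "red"]
nbr_of_texfiles = 5


def sample_configs(n, seed=None):
    """Return `n` dicts, each with one randomly chosen value per parameter."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return [
        {
            "language": rng.choice(languages),
            "font": rng.choice(fonts),
            "pagecolor": rng.choice(pagecolors),
            "textcolor": rng.choice(textcolors),
        }
        for _ in range(n)
    ]


if __name__ == "__main__":
    for index, config in enumerate(sample_configs(nbr_of_texfiles, seed=0)):
        print(index, config)
```

Passing a seed is useful when you want to regenerate the same set of designs, for example to compare different noise and blur settings on identical exercises.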

3. Viewing Results: After running the pipeline, check the output directories:

- LaTeX files will be stored under data/latex/.
- PDFs and augmented PNGs will be available under data/png/.

Each exercise will have:
- A clean version (PNG).
- Noisy and blurred versions of the PNG files.

You can view the generated images in any image viewer or open the LaTeX files for further customization.

Sample Dataset

An example of the generated dataset can be downloaded here:
Download Sample Dataset (Google Drive, 3GB)

Troubleshooting

Compilation Errors: If LaTeX compilation fails, ensure that xelatex and the required LaTeX packages (tikz, mathspec, fontspec) are installed.
Missing Tools: If PDF-to-PNG conversion fails, ensure pdftoppm from poppler-utils is installed.
API Errors: Check that your API_KEY is set correctly and you have access to the model endpoint.
