This project showcases the Haystack 2.0 API, highlighting its pipelines. It consists of a few parts:
- Scraping the TUM CIT website with either Scrapy (scripts/TUM_RAG.ipynb) or Beautiful Soup (scripts/TUM_RAG_with_beautiful_soup.ipynb)
- HTML text processing, chunking, and writing to different stores
- A Retrieval-Augmented Question Answering system, which handles English and German input differently.
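
To illustrate what handling English and German input differently could look like, here is a minimal sketch in plain Python. It is not taken from the notebooks: the stopword lists, the prompt templates, and the names `detect_language` and `build_prompt` are all hypothetical, and the project may well detect the language another way (for example with a dedicated library or a Haystack router component).

```python
import re

# Hypothetical hint words; a real system would likely use a proper
# language-detection library instead of this tiny heuristic.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "wie", "wo", "was",
                "ich", "für", "ein", "eine", "nicht", "kann"}
ENGLISH_HINTS = {"the", "and", "is", "how", "where", "what", "for",
                 "a", "an", "not", "can", "do", "i"}

# Hypothetical prompt templates, one per supported language.
PROMPTS = {
    "en": ("Answer the question in English using only the context.\n"
           "Context: {context}\nQuestion: {query}"),
    "de": ("Beantworte die Frage auf Deutsch und nutze nur den Kontext.\n"
           "Kontext: {context}\nFrage: {query}"),
}


def detect_language(query: str) -> str:
    """Return "de" if the query contains more German than English hint words, else "en"."""
    words = re.findall(r"[a-zäöüß]+", query.lower())
    german = sum(word in GERMAN_HINTS for word in words)
    english = sum(word in ENGLISH_HINTS for word in words)
    return "de" if german > english else "en"


def build_prompt(query: str, context: str) -> tuple[str, str]:
    """Pick the prompt template matching the query language and fill it in."""
    lang = detect_language(query)
    return lang, PROMPTS[lang].format(context=context, query=query)


if __name__ == "__main__":
    lang, prompt = build_prompt("Wo ist die Mensa?", "Die Mensa ist in Garching.")
    print(lang)
    print(prompt)
```

Ties and queries without any hint word fall back to English here; that default is an arbitrary choice for the sketch.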
Go to tum_crawler with cd tum_crawler
In a terminal, run scrapy crawl tum (might require installing Scrapy first).
Go back to the project root.
Run docker-compose up to start the Qdrant databases.
Create a .env file at the project root containing your token: OAI_TOKEN="sk-..."
Read & execute scripts/TUM_RAG.ipynb
Read & execute scripts/TUM_RAG_with_beautiful_soup.ipynb
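
For reference, the .env step above can be consumed from Python roughly as follows. This is a standard-library sketch under stated assumptions: the helper names `load_env_file` and `get_oai_token` are hypothetical, and the notebooks may load the token differently (for instance through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file; returns {} if the file is missing."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_oai_token(path: str = ".env") -> str:
    """Prefer an already exported OAI_TOKEN, otherwise fall back to the .env file."""
    token = os.environ.get("OAI_TOKEN") or load_env_file(path).get("OAI_TOKEN")
    if not token:
        raise RuntimeError("OAI_TOKEN not found; create .env at the project root")
    return token
```

Calling `get_oai_token()` from the project root would then return the value stored in .env, or raise a clear error if the file or the variable is missing.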