
Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Step-by-step guide on TowardsDataScience: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8


Context

  • Third-party commercial large language model (LLM) providers such as OpenAI's GPT-4 have democratized LLM use via simple API calls.
  • However, there are cases where teams require self-managed or private model deployments, for reasons such as data privacy and data residency rules.
  • The proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.
  • When we host open-source LLMs locally, whether on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the obvious choice, the costs can easily skyrocket beyond budget.
  • In this project, we explore how to run quantized versions of open-source LLMs with local CPU inference for document question-and-answer (Q&A).


Quickstart

  • Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the models/ folder
  • To start parsing user queries into the application, launch a terminal from the project directory and run the following command: poetry run python main.py "<user query>" (the query flow inside main.py is sketched below)
  • For example, poetry run python main.py "What is the minimum guarantee payable by Adidas?"
  • Note: Omit the prepended poetry run if you are NOT using Poetry
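
For orientation, the query flow that main.py implements looks roughly like the sketch below. It assumes the classic LangChain 0.0.x APIs for C Transformers, Hugging Face embeddings, FAISS, and RetrievalQA; the model file name, paths, and retrieval settings are illustrative assumptions, not necessarily the repository's actual code.

    import sys

    from langchain.chains import RetrievalQA
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.llms import CTransformers
    from langchain.vectorstores import FAISS

    # Load the quantized Llama 2 GGML binary via C Transformers (CPU inference).
    llm = CTransformers(
        model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # illustrative file name
        model_type="llama",
        config={"max_new_tokens": 256, "temperature": 0.01},
    )

    # Reload the FAISS vector store produced by db_build.py.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectordb = FAISS.load_local("vectorstore/db_faiss", embeddings)

    # Retrieve the most relevant chunks and have the LLM answer the CLI query.
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 2}),
        return_source_documents=True,
    )
    result = qa_chain({"query": sys.argv[1]})
    print(result["result"])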


Tools

  • LangChain: Framework for developing applications powered by language models
  • C Transformers: Python bindings for Transformer models implemented in C/C++ using the GGML library
  • FAISS: Open-source library for efficient similarity search and clustering of dense vectors
  • Sentence-Transformers (all-MiniLM-L6-v2): Open-source pre-trained transformer model that embeds text into a 384-dimensional dense vector space for tasks like clustering or semantic search (see the snippet after this list)
  • Llama-2-7B-Chat: Open-source fine-tuned Llama 2 model designed for chat dialogue, trained with publicly available instruction datasets and over 1 million human annotations
  • Poetry: Tool for dependency management and Python packaging
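
To make the embedding bullet concrete, here is a minimal sketch using the sentence-transformers package directly; the query string is only an example, and the model name is the one cited above.

    from sentence_transformers import SentenceTransformer

    # all-MiniLM-L6-v2 maps input text to a 384-dimensional dense vector.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vector = model.encode("What is the minimum guarantee payable by Adidas?")
    print(vector.shape)  # (384,)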

Files and Content

  • /assets: Images relevant to the project
  • /config: Configuration files for LLM application
  • /data: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report - 177-page PDF document)
  • /models: Binary file of GGML quantized LLM model (i.e., Llama-2-7B-Chat)
  • /src: Python code for the key components of the LLM application, namely llm.py, utils.py, and prompts.py
  • /vectorstore: FAISS vector store for documents
  • db_build.py: Python script to ingest the dataset and generate the FAISS vector store (sketched after this list)
  • main.py: Main Python script to launch the application and to pass user query via command line
  • pyproject.toml: TOML file specifying the versions of the dependencies used (Poetry)
  • requirements.txt: List of Python dependencies (and version)
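
As a rough illustration of the ingestion step, a db_build.py along these lines would load the PDF, split it into chunks, embed them, and persist a FAISS index. The loader and splitter classes are standard LangChain 0.0.x components, while the chunk size, overlap, and paths are illustrative assumptions rather than the project's actual settings.

    from langchain.document_loaders import DirectoryLoader, PyPDFLoader
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import FAISS

    # Load the annual report PDF(s) from data/ page by page.
    loader = DirectoryLoader("data/", glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    # Split pages into overlapping chunks sized for the embedding model.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(documents)

    # Embed the chunks and persist a FAISS index for main.py to reuse.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectordb = FAISS.from_documents(chunks, embeddings)
    vectordb.save_local("vectorstore/db_faiss")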


About

Project repository for the module "Qualitätssicherung von KI-Systemen" (Quality Assurance of AI Systems)
