Inserted GenAI project

Added the instructions and the code
andkret · Nov 28, 2024 · b93568e · b93568e
1 parent f7c3eb6
commit b93568e
Showing 1 changed file with 137 additions and 0 deletions.
diff --git a/sections/04-HandsOnCourse.md b/sections/04-HandsOnCourse.md
@@ -3,11 +3,148 @@ Data Engineering Course: Building A Data Platform
 
 ## Contents
 
+- [GenAI Retrieval Augmented Generation with Ollama and ElasticSearch](04-HandsOnCourse.md#genai-retrieval-augmented-generation-with-ollama-and-elasticsearch)
 - [Free Data Engineering Course with AWS, TDengine, Docker and Grafana](04-HandsOnCourse.md#free-data-engineering-course-with-aws-tdengine-docker-and-grafana)
 - [Monitor your data in dbt & detect quality issues with Elementary](04-HandsOnCourse.md#monitor-your-data-in-dbt-and-detect-quality-issues-with-elementary)
 - [Solving Engineers 4 Biggest Airflow Problems](04-HandsOnCourse.md#solving-engineers-4-biggest-airflow-problems)
 - [The best alternative to Airlfow? Mage.ai](04-HandsOnCourse.md#the-best-alternative-to-airlfow?-mage.ai)
 
+## GenAI Retrieval Augmented Generation with Ollama and ElasticSearch
+
+- This how-to is based on this one from Elasticsearch: https://www.elastic.co/search-labs/blog/rag-with-llamaIndex-and-elasticsearch
+- Instead of Elasticsearch cloud we're going to run everything locally
+- The simplest way to get this done is to just clone this GitHub Repo for the code and docker setup
+- I've tried this on an M1 Mac. Changes for Windows with WSL will come later.
+- The biggest problems I had were with installing the dependencies, not with the code itself.
+
+### Install Ollama
+1. Download Ollama from here https://ollama.com/download/mac
+2. Unzip it, drag it into Applications and install
+3. Run `ollama run mistral` (this downloads the Mistral 7B model, about 4.1 GB)
+4. Create a new folder in Documents "Elasticsearch-RAG"
+5. Open that folder in VSCode
+
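Before moving on, it can help to confirm that the Ollama server is up and that the Mistral model was actually downloaded. The following is a minimal sketch, not part of the project code. It assumes Ollama listens on its default local address `http://localhost:11434` and uses its `/api/tags` endpoint, which lists the local models. The function names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def has_model(tags_response: dict, model: str) -> bool:
    """Return True if a model named `model` (with any tag) is in an /api/tags response."""
    for entry in tags_response.get("models", []):
        # Model names look like "mistral:latest"; compare the part before the tag.
        if entry.get("name", "").split(":")[0] == model:
            return True
    return False


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the local Ollama server which models have been downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (needs a running Ollama server):
#   print(has_model(fetch_tags(), "mistral"))
```

If the call fails with a connection error, Ollama is most likely not running yet.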
+### Install Elasticsearch & Kibana (Docker)
+1. Use the docker-compose file from the Log Monitoring course: https://github.com/team-data-science/GenAI-RAG/blob/main/docker-compose.yml
+2. Download Docker Desktop from here: https://www.docker.com/products/docker-desktop/
+3. Install Docker Desktop and sign in within the app (or create a user, which sends you to the browser)
+
+**For Windows Users**
+Configure WSL2 to use at most 4 GB of RAM:
+```
+wsl --shutdown
+notepad "$env:USERPROFILE/.wslconfig"
+```
+.wslconfig file:
+```
+[wsl2]
+# Limits VM memory in WSL 2 to at most 4 GB
+memory=4GB
+```
+**Modify the Linux kernel map count in WSL**
+Do this before starting the containers, because Elasticsearch requires a higher value than the default to work
+`sudo sysctl -w vm.max_map_count=262144`
+
+4. Go to the Elasticsearch-RAG folder and run `docker compose up`
+5. If you want to use your own Elasticsearch image, make sure it is Elasticsearch 8.11 or later (this project uses 8.16)
+6. If you get this error on a Mac, just open the console in the Docker app: *error getting credentials - err: exec: docker-credential-desktop: executable file not found in $PATH, out:*
+7. Install the Xcode command line tools: `xcode-select --install`
+8. Make sure you're on Python 3.8.1 or later (I installed 3.13.0 from https://www.python.org/downloads/)
+
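The version requirement from the steps above (8.11 or later) can also be checked from Python once the containers are running. This is only a sketch under assumptions: the address `http://localhost:9200` and plain, unauthenticated HTTP depend on what the docker-compose file configures, so adjust both if your setup differs. The helper names are invented for this example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: port and plain HTTP depend on the docker-compose file
MIN_VERSION = (8, 11)  # minimum Elasticsearch version named in the steps above


def parse_version(version: str) -> tuple:
    """Turn '8.16.0' into (8, 16, 0); ignore suffixes such as '-SNAPSHOT'."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(version: str, minimum: tuple = MIN_VERSION) -> bool:
    """Compare the parsed version against the minimum, component by component."""
    return parse_version(version) >= minimum


def fetch_es_version(base_url: str = ES_URL) -> str:
    """Read version.number from the Elasticsearch root endpoint."""
    with urllib.request.urlopen(base_url, timeout=5) as resp:
        return json.load(resp)["version"]["number"]


# Usage (needs a running Elasticsearch container):
#   print(meets_minimum(fetch_es_version()))
```

If security is enabled in your Elasticsearch image, the request needs credentials and this unauthenticated sketch will be rejected.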
+### Setup the virtual Python environment
+
+#### preparation on a Mac
+##### install brew
+1. Check whether Homebrew is already installed: `which brew`
+2. If it isn't, install it: `/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`
+3. Add it to the path: `export PATH="/opt/homebrew/bin:$PATH"`
+4. Verify the installation: `brew --version`
+
+##### install pyenv
+```
+brew install pyenv
+brew install pyenv-virtualenv
+```
+
+Modify the path so that pyenv is in the path variable
+`nano ~/.zshrc`
+
+```
+export PYENV_ROOT="$HOME/.pyenv"
+export PATH="$PYENV_ROOT/bin:$PATH"
+eval "$(pyenv init --path)"
+eval "$(pyenv init -)"
+eval "$(pyenv virtualenv-init -)"
+```
+
+Install the dependencies for building Python versions
+`brew install openssl readline sqlite3 xz zlib`
+
+Reload to apply changes
+`source ~/.zshrc`
+
+install python
+```
+pyenv install 3.11.6
+pyenv version
+```
+
+Set the Python version globally
+`pyenv global 3.11.6`
+
+Create, activate, and (once you no longer need it) delete a virtual environment:
+```
+pyenv virtualenv <python-version> <your-virtualenv-name>
+pyenv activate <your-virtualenv-name>
+pyenv virtualenv-delete <your-virtualenv-name>
+```
+
+#### Windows without pyenv
+Set up the virtual Python environment: go to the Elasticsearch-RAG folder in the WSL shell and run
+`python3 -m venv .elkrag`
+Activate the environment
+`source .elkrag/bin/activate`
+
+
+### Install required libraries (do one at a time so you see errors):
+If `pip` is not on your path, use `python3 -m pip install <package-name>` instead.
+```
+pip install llama-index
+pip install llama-index-embeddings-ollama
+pip install llama-index-llms-ollama
+pip install llama-index-vector-stores-elasticsearch
+pip install python-dotenv
+```
+
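Since installing the dependencies was the main pain point, a quick check that all five packages ended up in the active environment can save some debugging. The sketch below uses only the standard library's `importlib.metadata`. The function name `missing_packages` is made up for this example.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = [
    "llama-index",
    "llama-index-embeddings-ollama",
    "llama-index-llms-ollama",
    "llama-index-vector-stores-elasticsearch",
    "python-dotenv",
]


def missing_packages(names):
    """Return the distributions from `names` that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Prints the packages that still need a `pip install`, or "none".
    print("Missing:", missing_packages(REQUIRED) or "none")
```

Run it with the virtual environment activated, otherwise it reports on the system Python instead.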
+### Write the data to Elasticsearch
+1. Create or copy in the index.py file
+2. Download the conversations.json file from the folder Code Examples/GenAI-RAG
+3. Run the program index.py: https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/index.py
+4. If you get an error during execution, check whether the pydantic version is <2.0 with `pip show pydantic`. If it isn't, run `pip install "pydantic<2.0"`
+
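If index.py fails early, it is worth ruling out a broken or incomplete conversations.json first. The sketch below is an assumption-heavy illustration, not the code from index.py: it assumes the file is a JSON array and that each record has the keys `conversation_id` and `conversation`, as in the sample data of the Elastic blog post this how-to is based on. Check the actual file before relying on these names.

```python
import json

# Assumption: every record carries these two keys (as in the Elastic blog post's sample data).
REQUIRED_KEYS = ("conversation_id", "conversation")


def valid_records(records):
    """Keep only the records that contain all required keys."""
    return [r for r in records if all(key in r for key in REQUIRED_KEYS)]


def load_conversations(path):
    """Load the JSON file and return the records that look indexable."""
    with open(path, encoding="utf-8") as handle:
        return valid_records(json.load(handle))


# Usage (assumes conversations.json sits in the current folder):
#   print(len(load_conversations("conversations.json")), "usable records")
```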
+### Check the data in Elasticsearch
+1. Go to Kibana at http://localhost:5601/app/management/data/index_management/indices and check for the new index called `calls`
+2. Go to Dev Tools at http://localhost:5601/app/dev_tools#/console/shell and try out this query: `GET calls/_search?size=1`
+
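The same check works without Kibana by sending the search request straight to Elasticsearch. This is a sketch under the assumption that Elasticsearch is reachable at `http://localhost:9200` without authentication, which depends on the docker-compose file. The function names are invented for this example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: depends on the docker-compose file


def build_search_url(base_url, index, size=1):
    """Build the URL for the same request as `GET calls/_search?size=1`."""
    return f"{base_url}/{index}/_search?size={size}"


def total_hits(response):
    """Read hits.total.value from a search response."""
    return response["hits"]["total"]["value"]


def search(index="calls", size=1, base_url=ES_URL):
    """Run the search and return the parsed JSON response."""
    with urllib.request.urlopen(build_search_url(base_url, index, size), timeout=5) as resp:
        return json.load(resp)


# Usage (needs the running container and an indexed `calls` index):
#   print(total_hits(search()))
```

A total of zero usually means index.py did not write anything, so re-check the indexing step before querying.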
+### Query data from Elasticsearch and create an output with Mistral
+1. If everything looks good, run the query.py file: https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/query.py
+2. try a few queries :)
+
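To make the idea behind this step more concrete: retrieval augmented generation means putting the text retrieved from Elasticsearch in front of the question and letting the model answer from it. The sketch below is not the code from query.py, which relies on the LlamaIndex libraries installed earlier. It is a stripped-down illustration that talks to Ollama's `/api/generate` endpoint directly, assuming the default address `http://localhost:11434`. The prompt wording and function names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def build_prompt(question, contexts):
    """Combine retrieved passages and the question into a single prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_payload(prompt, model="mistral"):
    """Request body for /api/generate; stream=False asks for one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, base_url=OLLAMA_URL):
    """POST the prompt to Ollama and return the generated answer text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["response"]


# Usage (needs a running Ollama server with the mistral model):
#   print(generate(build_prompt("What did the customer ask?", ["<retrieved text>"])))
```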
+### Install libraries to extract text from pdfs
+Install the library to extract text from the PDF:
+`pip install PyMuPDF`
+In VS Code I had to open the command palette (Shift+Command+P), run the Python command that clears the workspace cache and reloads the window, and only then was the library detected.
+
+### Extract data from CV and put it into Elasticsearch
+I created a CV with ChatGPT: https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/Liam_McGivney_CV.pdf
+
+The file cvpipeline.py has the python code for the indexing. It's not working right now though!
+https://github.com/andkret/Cookbook/blob/master/Code%20Examples/GenAI-RAG/cvpipeline.py
+
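For the text extraction part on its own, a minimal sketch with PyMuPDF could look like the following. It is not the code from cvpipeline.py and it does not cover the indexing step. It only shows how the text can be pulled out of the PDF, assuming PyMuPDF is installed (its import name is `fitz`). The helper names are made up for this example.

```python
def extract_pdf_text(path):
    """Return the text of all pages in a PDF, using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helper below also works without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    return " ".join(text.split())


# Usage (assumes the CV was downloaded into the current folder):
#   print(normalize_whitespace(extract_pdf_text("Liam_McGivney_CV.pdf"))[:200])
```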
+
+I'll keep developing this and update it once it's working.
+
 
 ## Free Data Engineering Course with AWS TDengine Docker and Grafana