From 420d614226643846dd2b335eb25624fae02c3848 Mon Sep 17 00:00:00 2001 From: ChengZi Date: Wed, 4 Dec 2024 15:47:22 +0800 Subject: [PATCH] optimize cognee Signed-off-by: ChengZi --- .../build_RAG_with_milvus_and_cognee.ipynb | 437 ++++++++++-------- 1 file changed, 234 insertions(+), 203 deletions(-) diff --git a/bootcamp/tutorials/integration/build_RAG_with_milvus_and_cognee.ipynb b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_cognee.ipynb index 2c875c4a9..314620542 100644 --- a/bootcamp/tutorials/integration/build_RAG_with_milvus_and_cognee.ipynb +++ b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_cognee.ipynb @@ -2,17 +2,21 @@ "cells": [ { "cell_type": "markdown", - "source": [ - "\"Open \n", - " \"GitHub" - ], + "id": "ca0af8235fb9fc06", "metadata": { "collapsed": false }, - "id": "ca0af8235fb9fc06" + "source": [ + "\"Open \n", + " \"GitHub" + ] }, { "cell_type": "markdown", + "id": "11a8105a1399bb07", + "metadata": { + "collapsed": false + }, "source": [ "### Build RAG with Milvus and Cognee\n", "\n", @@ -21,11 +25,7 @@ "With strong support for vector stores like Milvus, graph databases, and LLMs, Cognee provides a flexible and customizable framework for building retrieval-augmented generation (RAG) systems. Its production-ready architecture ensures improved accuracy and efficiency for AI-powered applications. \n", "\n", "In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Milvus and Cognee.\n" - ], - "metadata": { - "collapsed": false - }, - "id": "11a8105a1399bb07" + ] }, { "cell_type": "code", @@ -36,130 +36,120 @@ }, "outputs": [], "source": [ - "! pip install pymilvus cognee" + "! pip install pymilvus git+https://github.com/topoteretes/cognee.git" ] }, { "cell_type": "markdown", - "source": [ - "> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)." - ], + "id": "41c91a9956ad5cab", "metadata": { "collapsed": false }, - "id": "41c91a9956ad5cab" + "source": [ + "> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)." + ] }, { "cell_type": "markdown", - "source": [ - "To configure Milvus as the vector database, set the `VECTOR_DB_PROVIDER` to `milvus` and specify the `VECTOR_DB_URL` and `VECTOR_DB_KEY`. Since we are using Milvus Lite to store data in this demo, only the `VECTOR_DB_URL` needs to be provided." - ], + "id": "f905770ba241d2d8", "metadata": { "collapsed": false }, - "id": "f905770ba241d2d8" + "source": [ + "By default, it use OpenAI as the LLM in this example. You should prepare the [api key](https://platform.openai.com/docs/quickstart), and set it in the config `set_llm_api_key()` function.\n", + "\n", + "To configure Milvus as the vector database, set the `VECTOR_DB_PROVIDER` to `milvus` and specify the `VECTOR_DB_URL` and `VECTOR_DB_KEY`. Since we are using Milvus Lite to store data in this demo, only the `VECTOR_DB_URL` needs to be provided." + ] }, { "cell_type": "code", + "execution_count": 41, + "id": "44f80293c073bc3b", + "metadata": { + "ExecuteTime": { + "end_time": "2024-12-04T04:27:47.585495Z", + "start_time": "2024-12-04T04:27:47.581022Z" + }, + "collapsed": false + }, "outputs": [], "source": [ "import os\n", "\n", - "os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\n", + "import cognee\n", + "\n", + "cognee.config.set_llm_api_key(\"YOUR_OPENAI_API_KEY\")\n", + "\n", "\n", "os.environ[\"VECTOR_DB_PROVIDER\"] = \"milvus\"\n", "os.environ[\"VECTOR_DB_URL\"] = \"./milvus.db\"" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2024-12-04T04:27:47.585495Z", - "start_time": "2024-12-04T04:27:47.581022Z" - } - }, - "id": "44f80293c073bc3b", - "execution_count": 9 + ] }, { "cell_type": "markdown", + "id": "588aec63a9589d58", + "metadata": { + "collapsed": false + }, "source": [ "> As for the environment variables `VECTOR_DB_URL` and `VECTOR_DB_KEY`:\n", "> - Setting the `VECTOR_DB_URL` as a local file, e.g.`./milvus.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n", "> - If you have large scale of data, you can set up a more performant Milvus server on [docker or kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server uri, e.g.`http://localhost:19530`, as your `VECTOR_DB_URL`.\n", "> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `VECTOR_DB_URL` and `VECTOR_DB_KEY`, which correspond to the [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud." - ], - "metadata": { - "collapsed": false - }, - "id": "588aec63a9589d58" + ] }, { "cell_type": "markdown", + "id": "5be22ba3778abd63", + "metadata": { + "collapsed": false + }, "source": [ "### Prepare the data\n", "\n", "We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge in our RAG, which is a good data source for a simple RAG pipeline.\n", "\n", "Download the zip file and extract documents to the folder `milvus_docs`." - ], - "metadata": { - "collapsed": false - }, - "id": "5be22ba3778abd63" + ] }, { "cell_type": "code", - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2024-12-03 23:27:50-- https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip\r\n", - "Resolving github.com (github.com)... 140.82.114.3\r\n", - "Connecting to github.com (github.com)|140.82.114.3|:443... connected.\r\n", - "HTTP request sent, awaiting response... 302 Found\r\n", - "Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/267273319/c52902a0-e13c-4ca7-92e0-086751098a05?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241204T042750Z&X-Amz-Expires=300&X-Amz-Signature=48fb1d999a151a69e1f8f16dfd3c3da0e035aa92f81cf5fb60911b0ed4556447&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dmilvus_docs_2.4.x_en.zip&response-content-type=application%2Foctet-stream [following]\r\n", - "--2024-12-03 23:27:50-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/267273319/c52902a0-e13c-4ca7-92e0-086751098a05?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241204%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241204T042750Z&X-Amz-Expires=300&X-Amz-Signature=48fb1d999a151a69e1f8f16dfd3c3da0e035aa92f81cf5fb60911b0ed4556447&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dmilvus_docs_2.4.x_en.zip&response-content-type=application%2Foctet-stream\r\n", - "Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...\r\n", - "Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.109.133|:443... connected.\r\n", - "HTTP request sent, awaiting response... 200 OK\r\n", - "Length: 613094 (599K) [application/octet-stream]\r\n", - "Saving to: ‘milvus_docs_2.4.x_en.zip.5’\r\n", - "\r\n", - "milvus_docs_2.4.x_e 100%[===================>] 598.72K 1.53MB/s in 0.4s \r\n", - "\r\n", - "2024-12-03 23:27:51 (1.53 MB/s) - ‘milvus_docs_2.4.x_en.zip.5’ saved [613094/613094]\r\n", - "\r\n", - "replace milvus_docs/__MACOSX/._en? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C\r\n" - ] - } - ], - "source": [ - "! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip\n", - "! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs" - ], + "execution_count": null, + "id": "170036eeaff4d65", "metadata": { - "collapsed": false, "ExecuteTime": { "end_time": "2024-12-04T04:27:53.605474Z", "start_time": "2024-12-04T04:27:50.566656Z" - } + }, + "collapsed": false }, - "id": "170036eeaff4d65", - "execution_count": 10 + "outputs": [], + "source": [ + "! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip\n", + "! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs" + ] }, { "cell_type": "markdown", - "source": [ - "We load all markdown files from the folder `milvus_docs/en/faq`. For each document, we just simply use \"# \" to separate the content in the file, which can roughly separate the content of each main part of the markdown file." - ], + "id": "c3d842331863acc1", "metadata": { "collapsed": false }, - "id": "c3d842331863acc1" + "source": [ + "We load all markdown files from the folder `milvus_docs/en/faq`. For each document, we just simply use \"# \" to separate the content in the file, which can roughly separate the content of each main part of the markdown file." + ] }, { "cell_type": "code", + "execution_count": 2, + "id": "218714f8a38620cd", + "metadata": { + "ExecuteTime": { + "end_time": "2024-12-04T04:27:58.751584Z", + "start_time": "2024-12-04T04:27:58.744840Z" + }, + "collapsed": false + }, "outputs": [], "source": [ "from glob import glob\n", @@ -171,101 +161,107 @@ " file_text = file.read()\n", "\n", " text_lines += file_text.split(\"# \")" - ], - "metadata": { - "collapsed": false, - "ExecuteTime": { - "end_time": "2024-12-04T04:27:58.751584Z", - "start_time": "2024-12-04T04:27:58.744840Z" - } - }, - "id": "218714f8a38620cd", - "execution_count": 11 + ] }, { "cell_type": "markdown", + "id": "a1353cb69767aec9", + "metadata": { + "collapsed": false + }, "source": [ "## Build RAG\n", "\n", "### Resetting Cognee Data" - ], - "metadata": { - "collapsed": false - }, - "id": "a1353cb69767aec9" + ] }, { "cell_type": "code", + "execution_count": null, + "id": "8df449343b75f43f", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ - "import cognee\n", - "\n", "await cognee.prune.prune_data()\n", "await cognee.prune.prune_system(metadata=True)" - ], - "metadata": { - "collapsed": false - }, - "id": "8df449343b75f43f", - "execution_count": null + ] }, { "cell_type": "markdown", - "source": [ - "With a clean slate ready, we can now add our dataset and process it into a knowledge graph." - ], + "id": "f2412c46b55f1fc7", "metadata": { "collapsed": false }, - "id": "f2412c46b55f1fc7" + "source": [ + "With a clean slate ready, we can now add our dataset and process it into a knowledge graph." + ] }, { "cell_type": "markdown", - "source": [ - "### Adding Data and Cognifying" - ], + "id": "b3e4a273b0cbc1ee", "metadata": { "collapsed": false }, - "id": "b3e4a273b0cbc1ee" + "source": [ + "### Adding Data and Cognifying" + ] }, { "cell_type": "code", - "outputs": [], - "source": [ - "await cognee.add(data=text_lines, dataset_name=\"milvus_faq\")\n", - "await cognee.cognify()" - ], + "execution_count": null, + "id": "639dffbb4e60d7fc", "metadata": { "collapsed": false }, - "id": "639dffbb4e60d7fc" + "outputs": [], + "source": [ + "await cognee.add(data=text_lines, dataset_name=\"milvus_faq\")\n", + "await cognee.cognify()\n", + "\n", + "# [DocumentChunk(id=UUID('6889e7ef-3670-555c-bb16-3eb50d1d30b0'), updated_at=datetime.datetime(2024, 12, 4, 6, 29, 46, 472907, tzinfo=datetime.timezone.utc), text='Does the query perform in memory? What are incremental data and historical data?\\n\\nYes. When ...\n", + "# ..." + ] }, { "cell_type": "markdown", - "source": [ - "The `add` method loads the dataset (Milvus FAQs) into Cognee and the `cognify` method processes the data to extract entities, relationships, and summaries, constructing a knowledge graph." - ], + "id": "8f82418abc9f0537", "metadata": { "collapsed": false }, - "id": "8f82418abc9f0537" + "source": [ + "The `add` method loads the dataset (Milvus FAQs) into Cognee and the `cognify` method processes the data to extract entities, relationships, and summaries, constructing a knowledge graph." + ] }, { "cell_type": "markdown", + "id": "f96e00d1ea3aea43", + "metadata": { + "collapsed": false + }, "source": [ "### Querying for Summaries\n", "\n", "Now that the data has been processed, let's query the knowledge graph." - ], - "metadata": { - "collapsed": false - }, - "id": "f96e00d1ea3aea43" + ] }, { "cell_type": "code", - "outputs": [], + "execution_count": 23, + "id": "d0fe3f8ebcac034d", + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'id': 'de5c6713-e079-5d0b-b11d-e9bacd1e0d73', 'text': 'Milvus stores two data types: inserted data and metadata.'}\n" + ] + } + ], "source": [ "from cognee.api.v1.search import SearchType\n", "\n", @@ -273,61 +269,84 @@ "search_results = await cognee.search(SearchType.SUMMARIES, query_text=query_text)\n", "\n", "print(search_results[0])" - ], - "metadata": { - "collapsed": false - }, - "id": "d0fe3f8ebcac034d" + ] }, { "cell_type": "markdown", - "source": [ - "This query searches the knowledge graph for a summary related to the query text, and the most related candidate is printed." - ], + "id": "b2c93b72f2162102", "metadata": { "collapsed": false }, - "id": "b2c93b72f2162102" + "source": [ + "This query searches the knowledge graph for a summary related to the query text, and the most related candidate is printed." + ] }, { "cell_type": "markdown", + "id": "a54fbbf4d15eacbc", + "metadata": { + "collapsed": false + }, "source": [ "### Querying for Chunks\n", "\n", "Summaries offer high-level insights, but for more granular details, we can query specific chunks of data directly from the processed dataset. These chunks are derived from the original data that was added and analyzed during the knowledge graph creation." - ], - "metadata": { - "collapsed": false - }, - "id": "a54fbbf4d15eacbc" + ] }, { "cell_type": "code", + "execution_count": 24, + "id": "4f6baf214d0f6c3b", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "from cognee.api.v1.search import SearchType\n", "\n", "query_text = \"How is data stored in milvus?\"\n", "search_results = await cognee.search(SearchType.CHUNKS, query_text=query_text)" - ], - "metadata": { - "collapsed": false - }, - "id": "4f6baf214d0f6c3b" + ] }, { "cell_type": "markdown", - "source": [ - "Let's format and display it for better readability!" - ], + "id": "c121035194a6ce8f", "metadata": { "collapsed": false }, - "id": "c121035194a6ce8f" + "source": [ + "Let's format and display it for better readability!" + ] }, { "cell_type": "code", - "outputs": [], + "execution_count": 26, + "id": "f4bc483b753adc42", + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ID: 4be01c4b-9ee5-541c-9b85-297883934ab3\n", + "\n", + "Text:\n", + "\n", + "Where does Milvus store data?\n", + "\n", + "Milvus deals with two types of data, inserted data and metadata.\n", + "\n", + "Inserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n", + "\n", + "Metadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n", + "\n", + "###\n", + "\n" + ] + } + ], "source": [ "def format_and_print(data):\n", " print(\"ID:\", data[\"id\"])\n", @@ -339,116 +358,128 @@ "\n", "\n", "format_and_print(search_results[0])" - ], - "metadata": { - "collapsed": false - }, - "id": "f4bc483b753adc42" + ] }, { "cell_type": "markdown", + "id": "3a084fa4cb6b1c20", + "metadata": { + "collapsed": false + }, "source": [ "In our previous steps, we queried the Milvus FAQ dataset for both summaries and specific chunks of data. While this provided detailed insights and granular information, the dataset was large, making it challenging to clearly visualize the dependencies within the knowledge graph.\n", "\n", "To address this, we will reset the Cognee environment and work with a smaller, more focused dataset. This will allow us to better demonstrate the relationships and dependencies extracted during the cognify process. By simplifying the data, we can clearly see how Cognee organizes and structures information in the knowledge graph." - ], - "metadata": { - "collapsed": false - }, - "id": "3a084fa4cb6b1c20" + ] }, { "cell_type": "markdown", - "source": [ - "### Reset Cognee" - ], + "id": "5960d05c899d17c2", "metadata": { "collapsed": false }, - "id": "5960d05c899d17c2" + "source": [ + "### Reset Cognee" + ] }, { "cell_type": "code", + "execution_count": null, + "id": "c61e3f717b8b2c5f", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "await cognee.prune.prune_data()\n", "await cognee.prune.prune_system(metadata=True)" - ], - "metadata": { - "collapsed": false - }, - "id": "c61e3f717b8b2c5f" + ] }, { "cell_type": "markdown", - "source": [ - "### Adding the Focused Dataset\n", - "\n", - "Here, a smaller dataset is added and processed to ensure a focused and easily interpretable knowledge graph." - ], + "id": "8d1b704691014472", "metadata": { "collapsed": false }, - "id": "8d1b704691014472" + "source": [ + "### Adding the Focused Dataset\n", + "\n", + "Here, a smaller dataset with only one line of text is added and processed to ensure a focused and easily interpretable knowledge graph." + ] }, { "cell_type": "code", + "execution_count": null, + "id": "5018e6af97cdef2f", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ - "text = \"Milvus is a vector database and stores vectors\"\n", + "# We only use one line of text as the dataset, which simplifies the output later\n", + "text = \"\"\"\n", + " Natural language processing (NLP) is an interdisciplinary\n", + " subfield of computer science and information retrieval.\n", + " \"\"\"\n", "\n", "await cognee.add(text)\n", "await cognee.cognify()" - ], - "metadata": { - "collapsed": false - }, - "id": "5018e6af97cdef2f" + ] }, { "cell_type": "markdown", + "id": "4db3b6403c0ddb9b", + "metadata": { + "collapsed": false + }, "source": [ "### Querying for Insights\n", "\n", "By focusing on this smaller dataset, we can now clearly analyze the relationships and structure within the knowledge graph." - ], - "metadata": { - "collapsed": false - }, - "id": "4db3b6403c0ddb9b" + ] }, { "cell_type": "code", + "execution_count": null, + "id": "14c49e5f657c3a6c", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ - "query_text = \"Tell me about Milvus\"\n", + "query_text = \"Tell me about NLP\"\n", "search_results = await cognee.search(SearchType.INSIGHTS, query_text=query_text)\n", "\n", "for result_text in search_results:\n", - " print(result_text)" - ], - "metadata": { - "collapsed": false - }, - "id": "14c49e5f657c3a6c" + " print(result_text)\n", + "\n", + "# Example output:\n", + "# ({'id': UUID('bc338a39-64d6-549a-acec-da60846dd90d'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 1, 211808, tzinfo=datetime.timezone.utc), 'name': 'natural language processing', 'description': 'An interdisciplinary subfield of computer science and information retrieval.'}, {'relationship_name': 'is_a_subfield_of', 'source_node_id': UUID('bc338a39-64d6-549a-acec-da60846dd90d'), 'target_node_id': UUID('6218dbab-eb6a-5759-a864-b3419755ffe0'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 15, 473137, tzinfo=datetime.timezone.utc)}, {'id': UUID('6218dbab-eb6a-5759-a864-b3419755ffe0'), 'updated_at': datetime.datetime(2024, 11, 21, 12, 23, 1, 211808, tzinfo=datetime.timezone.utc), 'name': 'computer science', 'description': 'The study of computation and information processing.'})\n", + "# (...)\n", + "#\n", + "# It represents nodes and relationships in the knowledge graph:\n", + "# - The first element is the source node (e.g., 'natural language processing').\n", + "# - The second element is the relationship between nodes (e.g., 'is_a_subfield_of').\n", + "# - The third element is the target node (e.g., 'computer science')." + ] }, { "cell_type": "markdown", - "source": [ - "This output represents the results of a knowledge graph query, showcasing entities (nodes) and their relationships (edges) as extracted from the processed dataset. Each tuple includes a source entity, a relationship type, and a target entity, along with metadata like unique IDs, descriptions, and timestamps. The graph highlights key concepts, such as \"Milvus\" and \"vector,\" and their semantic connections, providing a structured understanding of the dataset." - ], + "id": "f331e5714f44fc5c", "metadata": { "collapsed": false }, - "id": "f331e5714f44fc5c" + "source": [ + "This output represents the results of a knowledge graph query, showcasing entities (nodes) and their relationships (edges) as extracted from the processed dataset. Each tuple includes a source entity, a relationship type, and a target entity, along with metadata like unique IDs, descriptions, and timestamps. The graph highlights key concepts and their semantic connections, providing a structured understanding of the dataset." + ] }, { "cell_type": "markdown", - "source": [], - "metadata": { - "collapsed": false - }, - "id": "f6a9306de97b5423" + "id": "e0d1fec8", + "metadata": {}, + "source": [ + "Congratulations, you have learned the basic usage of cognee with Milvus. If you want to know more advanced usage of cognee, please refer to its official [page](https://github.com/topoteretes/cognee) .\n" + ] } ], "metadata": { @@ -460,14 +491,14 @@ "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.10.13" } }, "nbformat": 4,