
switch method
Laure-di committed Oct 3, 2024
1 parent ea655a8 commit f5b4d4d
Showing 1 changed file with 55 additions and 35 deletions: tutorials/how-to-implement-rag/index.mdx
```python
import os

import psycopg2
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Establish connection to PostgreSQL database using environment variables
conn = psycopg2.connect(
    database=os.getenv("SCW_DB_NAME"),
    user=os.getenv("SCW_DB_USER"),
    password=os.getenv("SCW_DB_PASSWORD"),
    host=os.getenv("SCW_DB_HOST"),
    port=os.getenv("SCW_DB_PORT")
)

# Create a cursor to execute SQL commands
cur = conn.cursor()
```



### Set Up Document Loaders for Object Storage

In this section, we will use LangChain to load documents stored in your Scaleway Object Storage bucket. The document loader retrieves the contents of each document for further processing, such as vectorization or embedding generation.
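As a rough sketch of how such a loader might be configured (assuming the `S3DirectoryLoader` from `langchain_community` and `SCW_*` environment variables similar to those used elsewhere in this tutorial; the names and parameters are assumptions, not the tutorial's exact code):

```python
import os

from langchain_community.document_loaders import S3DirectoryLoader

# Hypothetical loader pointing LangChain at the Scaleway Object Storage bucket
# through its S3-compatible endpoint (variable names are assumptions).
document_loader = S3DirectoryLoader(
    bucket=os.getenv("SCW_BUCKET_NAME", ""),
    endpoint_url=f"https://s3.{os.getenv('SCW_DEFAULT_REGION', '')}.scw.cloud",
    aws_access_key_id=os.getenv("SCW_ACCESS_KEY", ""),
    aws_secret_access_key=os.getenv("SCW_SECRET_KEY", ""),
)
```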
PGVector creates the vector store in your PostgreSQL database to store the embeddings.
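A minimal sketch of how the embeddings client and this PGVector store might be configured (assuming the `OpenAIEmbeddings` wrapper from `langchain_openai` pointed at Scaleway's OpenAI-compatible endpoint and the `PGVector` class from `langchain_community`; the model name and environment variable names are assumptions, not the tutorial's exact code):

```python
import os

from langchain_community.vectorstores.pgvector import PGVector
from langchain_openai import OpenAIEmbeddings

# Hypothetical embeddings client against an OpenAI-compatible endpoint.
embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv("SCW_SECRET_KEY", ""),
    openai_api_base=os.getenv("SCW_EMBEDDINGS_ENDPOINT", ""),  # assumed variable name
    model="sentence-transformers/sentence-t5-xxl",  # assumed model; use one available in your project
)

# Hypothetical PGVector store reusing the PostgreSQL settings from the connection above.
connection_string = (
    f"postgresql+psycopg2://{os.getenv('SCW_DB_USER')}:{os.getenv('SCW_DB_PASSWORD')}"
    f"@{os.getenv('SCW_DB_HOST')}:{os.getenv('SCW_DB_PORT')}/{os.getenv('SCW_DB_NAME')}"
)
vector_store = PGVector(
    connection_string=connection_string,
    embedding_function=embeddings,
    collection_name="rag_documents",  # assumed collection name
)
```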

Use the S3FileLoader to load documents and split them into chunks. Then, embed and store them in your PostgreSQL database.

1. Load Metadata for Improved Efficiency: By loading the metadata for all objects in your bucket first, you can speed up the process significantly. This allows you to quickly check whether a document has already been embedded without needing to load the entire document.

```python
import boto3

# BUCKET_NAME is defined earlier in the tutorial.
endpoint_s3 = f"https://s3.{os.getenv('SCW_DEFAULT_REGION', '')}.scw.cloud"
session = boto3.session.Session()
client_s3 = session.client(service_name='s3', endpoint_url=endpoint_s3,
                           aws_access_key_id=os.getenv("SCW_ACCESS_KEY", ""),
                           aws_secret_access_key=os.getenv("SCW_SECRET_KEY", ""))
paginator = client_s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=BUCKET_NAME)
```
#### Why list metadata first?
The key reason for listing object metadata first is to avoid reprocessing documents that have already been embedded. In the context of Retrieval-Augmented Generation (RAG), reprocessing the same document multiple times is redundant and inefficient. Checking the database for each object key before downloading the file lets us skip documents that have already been embedded.
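This check relies on a tracking table, object_loaded, which the queries in this section assume already exists. A minimal schema sketch consistent with those queries (the column name comes from the queries; the types and constraints are assumptions):

```python
# Hypothetical schema for the tracking table used by the deduplication check.
cur.execute("""
    CREATE TABLE IF NOT EXISTS object_loaded (
        id SERIAL PRIMARY KEY,
        object_key TEXT UNIQUE
    )
""")
conn.commit()
```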

In this code sample we:
- Set Up a Boto3 Session: We initialize a Boto3 session. Boto3 is the AWS SDK for Python and is fully compatible with Scaleway Object Storage; the session manages the configuration, including credentials and settings, that Boto3 uses for API requests.
- Create an S3 Client: We establish an S3 client to interact with the Scaleway Object Storage service.
- Set Up Pagination for Listing Objects: We prepare pagination to handle potentially large lists of objects efficiently.
- Iterate Through the Bucket: This initiates the pagination process, allowing us to list all objects within the specified Scaleway Object Storage bucket seamlessly.

2. Iterate Through Metadata: Next, we will iterate through the metadata to determine if each object has already been embedded. If an object hasn’t been processed yet, we will embed it and load it into the database.

```python
import logging

logger = logging.getLogger(__name__)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0, add_start_index=True, length_function=len, is_separator_regex=False)

for page in page_iterator:
    for obj in page.get('Contents', []):
        # Skip objects whose key is already recorded in the tracking table.
        cur.execute("SELECT object_key FROM object_loaded WHERE object_key = %s", (obj['Key'],))
        response = cur.fetchone()
        if response is None:
            file_loader = S3FileLoader(
                bucket=BUCKET_NAME,
                key=obj['Key'],
                endpoint_url=endpoint_s3,
                aws_access_key_id=os.getenv("SCW_ACCESS_KEY", ""),
                aws_secret_access_key=os.getenv("SCW_SECRET_KEY", "")
            )
            file_to_load = file_loader.load()
            chunks = text_splitter.split_text(file_to_load[0].page_content)
            try:
                embeddings_list = [embeddings.embed_query(chunk) for chunk in chunks]
                vector_store.add_embeddings(chunks, embeddings_list)
                # Record the object as processed only once its chunks are stored.
                cur.execute("INSERT INTO object_loaded (object_key) VALUES (%s)",
                            (obj['Key'],))
            except Exception as e:
                logger.error(f"An error occurred: {e}")

conn.commit()
```

- S3FileLoader: The S3FileLoader loads each file individually from your ***Scaleway Object Storage bucket*** using the object's key from the bucket listing. It ensures that only the specific file is loaded from the bucket, minimizing the amount of data being retrieved at any given time.
- RecursiveCharacterTextSplitter: The RecursiveCharacterTextSplitter breaks each document into smaller chunks of text. This is crucial because embeddings models, like those used in Retrieval-Augmented Generation (RAG), typically have a limited context window (the number of tokens they can process at once).
- Chunk Size: Here, the chunk size is set to 500 characters. The choice is based on the context size supported by the embeddings model: models have a maximum number of tokens they can process in a single pass, often around 512 tokens or fewer depending on the specific model you are using, so 500 characters leave a buffer, as different models tokenize characters into variable-length tokens.
- Chunk Overlap: In this configuration the overlap is set to 0 characters; adding a small overlap (for example, 20 characters) helps preserve continuity between chunks and prevents loss of meaning or context between segments.
- Embedding the Chunks: For each document, the text is split into smaller chunks using the text splitter, and an embedding is generated for each chunk using the embeddings.embed_query(chunk) function. This function transforms each chunk into a vector representation that can later be used for similarity search.
- Embedding Storage: After generating the embeddings for each chunk, they are stored in a vector database (e.g., PostgreSQL with pgvector) using the vector_store.add_embeddings(chunks, embeddings_list) method. Each embedding is stored alongside its corresponding text chunk, enabling retrieval during a query (see the retrieval sketch after this list).
- Avoiding Redundant Processing: The script checks the object_loaded table in PostgreSQL to see if a document has already been processed (i.e., the object_key exists in the table). If it has, the file is skipped, avoiding redundant downloads, vectorization, and database inserts. This ensures that only new or modified documents are processed, reducing the system's computational load and saving both time and resources.
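Once the embeddings are stored, the vector store can be queried for the chunks most similar to a question. A brief sketch of what retrieval might look like with the same vector_store (the query text and k value are illustrative):

```python
# Retrieve the chunks most similar to a user question (illustrative query).
relevant_chunks = vector_store.similarity_search("How do I create a bucket?", k=4)
for doc in relevant_chunks:
    print(doc.page_content)
```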

#### Why 500 characters?

The chunk size of 500 characters is chosen to fit comfortably within the context size limits of typical embeddings models, which often range between 512 and 1024 tokens. Since most models tokenize text into smaller units (tokens) based on words, punctuation, and subwords, the exact number of tokens for 500 characters will vary depending on the language and the content. By keeping chunks small, we avoid exceeding the model’s context window, which could lead to truncated embeddings or poor performance during inference.
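To check how many tokens a 500-character chunk actually uses for a given tokenizer, you can count them directly. A small sketch using the tiktoken library (the encoding name is an assumption; pick the one that matches your embeddings model):

```python
import tiktoken

# Count tokens for a roughly 500-character sample chunk (encoding choice is an assumption).
encoding = tiktoken.get_encoding("cl100k_base")
sample_chunk = "Scaleway Object Storage is an S3-compatible object store. " * 8
print(len(sample_chunk), "characters ->", len(encoding.encode(sample_chunk)), "tokens")
```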

This approach ensures that only new or modified documents are loaded into memory and embedded, saving significant computational resources and reducing redundant work.

