-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add milvus vector db integration #419
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking time to add a new vector db integration and making corrections to the mongo file. Please have a look at the requested changes.
backend/modules/vector_db/milvus.py
Outdated
|
||
logger.debug(f"[Milvus] Deleted {len(data_point_vectors)} data point vectors") | ||
|
||
def get_embedding_dimensions(self, embeddings: Embeddings) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is already present in the base class. Please check if you can use that directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah actually I thought about leveraging the method from the base class but
logger.debug(f"[VectorDB] Embedding a dummy doc to get vector dimensions")
made me question it a little.
It's a singular string right ? and [VectorDB] not a variable ?
I was also thinking if there's any specific reason for adding f-string
without any {}
or expressions within {}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also thinking if there's any specific reason for adding f-string without any {} or expressions within {}
No reason. Looks like a mistake. Feel free to edit this.
logger.debug(f"[VectorDB] Embedding a dummy doc to get vector dimensions") made me question it a little.
You can change this message and make it generic across all vector dbs. LMK if you need any help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
backend/modules/vector_db/milvus.py
Outdated
|
||
logger.debug(f"[Milvus] Created new collection {collection_name}") | ||
|
||
# TODO: take a look at this again. Have Doubts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on what the doubts are? We can resolve them together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually initially i was picking the ids list from the langchain Document objects and then using them to delete via the delete
method defined by milvus client, but noticed while creating collection, the field id
is set with auto_id=True
which basically assigns random number as id value to the documents, so to handle this i thought of using the metadata taking ref from other vector dbs implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's preferable to delete documents by fields in the metadata.
Best way would be to accept both params:
- If id is provided, delete by id
- If id is not provided and metadata is not empty, delete documents by any of the attributes in metadata
backend/modules/vector_db/milvus.py
Outdated
self.milvus_client.create_collection( | ||
collection_name=collection_name, | ||
dimension=vector_size, | ||
metric_type="COSINE", # https://milvus.io/docs/metric.md#Metric-Types : check for other supported metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is cosine the best metric_type?
Also, please make this configurable from the class initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, by make this configurable
do you mean metric_type
?
The definition of best metric actually is subjective, it depends on the data and the use caase actually as per my understanding. if let's say the embedding vectors are sparse and our use case demands a full text exact search then BM25
is more preferrable, if data is not normalized then Inner Product (IP)
would make much more sense, but in general for dense vectors cosine similarity
captures the orientation of the vectors and hence the semantic meaning and is default choice in lot of vector DBs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, by make this configurable do you mean metric_type ?
Yes, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my initial thought was to make it configurable and there are actually multiple paths to do it, since this is a generic / common field across all attributes so it makes more sense to me to add it directly under VectorDBConfig
as an attribute but that would demand changes in rest of the vectordb implementations so to avoid that for now, I am thinking of leveraging the config
attribute which is meant for additional params.
Will update the code in the next commit and pls lemme know if that implementation looks good to start with.
backend/modules/vector_db/milvus.py
Outdated
|
||
if incremental and len(documents) > 0: | ||
# Instead of using document IDs, we'll delete based on metadata matching | ||
for doc in documents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if batch delete possible here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think extracting the deletion logic into another method makes sense here to keep the method simple and readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's good point, we can make it modular following the separation of concerns principle, will do.
backend/modules/vector_db/milvus.py
Outdated
filter=delete_expr, | ||
) | ||
|
||
Milvus( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use self.get_vector_store here
@AbhishekRP2002 Please add an example config in the compose.env and comment so anyone who wants to use Milvus can refer to it. |
Customized setup
, I am thinking.