Semantic search integration #131

Open
sergiimk opened this issue Sep 12, 2024 · 0 comments

According to this design document, we want to implement a semantic search capability in kamu.

Given a free-form text prompt from the user, the search API should return the N most relevant datasets.

Relevance will be based on the vector distance between the prompt and the dataset metadata in the embedding space.

Metadata should initially include:

  • Dataset name
  • Schema column names
  • Description
  • Tags
  • Readme and other textual attachments
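
As an illustration of the ranking described above, here is a minimal sketch that flattens the listed metadata fields into one text, embeds it, and returns the N datasets closest to the prompt by cosine distance. Everything named here (`DatasetMetadata`, `embed`, `search`) is hypothetical and not part of kamu. The bag-of-words `embed` is only a toy stand-in for a real embedding model, so it matches on shared words rather than on meaning.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    # Hypothetical container for the metadata fields listed above.
    name: str
    columns: List[str] = field(default_factory=list)
    description: str = ""
    tags: List[str] = field(default_factory=list)
    readme: str = ""

    def to_text(self) -> str:
        # Flatten all fields into a single string to be embedded.
        parts = [self.name, *self.columns, self.description, *self.tags, self.readme]
        return " ".join(parts)


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a sparse word-count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    # 1 - cosine similarity; returns 1.0 when either vector is empty.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def search(prompt: str, datasets: List[DatasetMetadata], n: int) -> List[str]:
    # Rank datasets by distance to the prompt and keep the N closest names.
    query = embed(prompt)
    ranked = sorted(datasets, key=lambda d: cosine_distance(query, embed(d.to_text())))
    return [d.name for d in ranked[:n]]


datasets = [
    DatasetMetadata(
        name="eth-blocks",
        columns=["block_number", "gas_used"],
        description="Ethereum blockchain blocks",
        tags=["crypto"],
    ),
    DatasetMetadata(
        name="covid-cases",
        columns=["date", "province", "cases"],
        description="Daily covid case counts",
        tags=["health"],
    ),
]
top = search("daily covid cases by province", datasets, n=1)
```

In a real deployment the dataset vectors would be computed once and stored in the vector DB, so that only the prompt is embedded at query time.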

Design document should include:

  • Proposed model for generating embeddings (low-cost, low-footprint)
  • Deployment proposal for where the embedding model will run
  • An eventually consistent mechanism for synchronizing dataset metadata with the vector DB
  • Semantic search API
  • How the kamu search command will interact with the new system
  • How the WebUI will interact with the new system
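
One possible shape for the eventually consistent synchronization item is sketched below, under stated assumptions: metadata changes arrive as versioned events that may be duplicated or reordered, and the index applies them idempotently by keeping only the newest version per dataset. `MetadataEvent` and `VectorIndex` are hypothetical names. The in-memory dictionary stands in for the vector DB, and the embedding step is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MetadataEvent:
    # Hypothetical change event; `version` increases with every metadata update.
    dataset_id: str
    version: int
    text: Optional[str]  # None marks the dataset as deleted


class VectorIndex:
    # In-memory stand-in for the vector DB (assumption, not a real client).
    def __init__(self) -> None:
        self.points: Dict[str, Tuple[int, Optional[str]]] = {}

    def apply(self, event: MetadataEvent) -> bool:
        # Ignore stale or duplicate events so replays and reordering are harmless.
        current = self.points.get(event.dataset_id)
        if current is not None and current[0] >= event.version:
            return False
        # Deletions are kept as tombstones so a late older update cannot
        # resurrect a deleted dataset.
        self.points[event.dataset_id] = (event.version, event.text)
        return True

    def live_ids(self) -> List[str]:
        # Datasets currently visible to search (tombstones excluded).
        return sorted(k for k, (_, text) in self.points.items() if text is not None)
```

Because `apply` is idempotent and order-insensitive per dataset, the event stream can be redelivered after a failure and the index still converges to the latest metadata.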

Scope:

  • An English-only embedding model is acceptable at this stage

Based on preliminary research, we would like to use the Qdrant vector DB for storing the embeddings and performing the search.

See also:
