Investigate search with dense and sparse embedding #110

Open
svenseeberg opened this issue Dec 22, 2024 · 0 comments
Labels
component:chat Chat Back End enhancement New feature or request

Comments

svenseeberg commented Dec 22, 2024

Investigate whether OpenSearch is an option for combined sparse and dense vector search. txtai is another option.

Alternatively, we can use PostgreSQL with pgvector. A very simple SQL setup could look like this:

CREATE DATABASE document_embeddings;
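\c document_embeddings
-- The vector type and the hnsw index method are provided by the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;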
CREATE TABLE document_chunks (id SERIAL PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL, url VARCHAR(512), embedding vector(384), sparse_embedding vector(1024));
CREATE TABLE documents (id SERIAL PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL, url VARCHAR(512));
CREATE INDEX ON document_chunks USING hnsw (embedding vector_l2_ops);
CREATE INDEX ON document_chunks USING hnsw (sparse_embedding vector_l2_ops);
CREATE INDEX idx_documents_url_btree ON documents (url);
CREATE INDEX idx_chunks_url_btree ON document_chunks (url);
CREATE USER document_embeddings WITH ENCRYPTED PASSWORD 'CHANGEME';
GRANT ALL PRIVILEGES ON DATABASE document_embeddings TO document_embeddings;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO document_embeddings;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO document_embeddings;
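
The schema above stores the sparse embedding in a fixed-width `vector(1024)` column, so a sparse model output has to be expanded to 1024 values before it can be inserted. The following is a minimal sketch of that step, assuming the sparse output arrives as an `{index: weight}` mapping; the helper names `densify` and `to_vector_literal` are hypothetical and not part of pgvector or of this project. It only relies on pgvector's text input format for vectors, `[v1,v2,...]`.

```python
def densify(sparse_weights, dim=1024):
    """Expand an {index: weight} mapping into a dense list of length dim.

    Assumption: indices are integers in [0, dim); everything else is zero.
    """
    dense = [0.0] * dim
    for index, weight in sparse_weights.items():
        if not 0 <= index < dim:
            raise ValueError(f"index {index} outside vector of size {dim}")
        dense[index] = float(weight)
    return dense


def to_vector_literal(values):
    """Format numbers as pgvector's text representation, e.g. '[0.5,0.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


if __name__ == "__main__":
    # Tiny 4-dimensional example instead of the 1024 used in the schema.
    print(to_vector_literal(densify({0: 0.5, 3: 2}, dim=4)))
```

The resulting string can be passed as a query parameter for the `embedding` or `sparse_embedding` column.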

And a retrieval can then be done with the following query:

SELECT *, ({a} * distance + (1 - {a}) * sparse_distance) AS total_distance
FROM (
    SELECT d.url, d.title, d.content,
           MIN(c.embedding <-> '{embedding}') AS distance,
           MIN(c.sparse_embedding <-> '{sparse_vector}') AS sparse_distance
    FROM document_chunks c
    LEFT JOIN documents d ON c.url = d.url
    GROUP BY d.url, d.title, d.content
) AS per_document
ORDER BY total_distance ASC
LIMIT 10;
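
To make the scoring in the query easier to reason about, here is a small pure-Python sketch of the same ranking logic, without a database. It assumes chunks are given as `(url, dense_vector, sparse_vector)` tuples; the function names `l2` and `rank_documents` are made up for illustration. `l2` stands in for pgvector's `<->` operator, which computes Euclidean (L2) distance.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_documents(chunks, query_dense, query_sparse, a=0.5, limit=10):
    """Rank documents by a weighted sum of dense and sparse distances.

    For each url, take the minimum dense distance and the minimum sparse
    distance over all of its chunks, then combine them as
    a * distance + (1 - a) * sparse_distance, smallest first.
    """
    best = defaultdict(lambda: [math.inf, math.inf])
    for url, dense, sparse in chunks:
        entry = best[url]
        entry[0] = min(entry[0], l2(dense, query_dense))
        entry[1] = min(entry[1], l2(sparse, query_sparse))
    scored = [(url, a * d + (1 - a) * s) for url, (d, s) in best.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:limit]


if __name__ == "__main__":
    chunks = [
        ("u1", [0.0, 0.0], [3.0, 4.0]),
        ("u1", [3.0, 4.0], [0.0, 0.0]),
        ("u2", [3.0, 4.0], [0.0, 0.0]),
    ]
    print(rank_documents(chunks, [0.0, 0.0], [0.0, 0.0], a=0.5))
```

One property this makes visible: because the two `MIN` aggregates are independent, the best dense distance and the best sparse distance for a document can come from different chunks, as they do for `u1` in the example. Also note that the two distances are combined without normalization, so the weight `a` only balances them meaningfully if both are on a comparable scale.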
@svenseeberg svenseeberg added the component:chat Chat Back End label Dec 22, 2024
@svenseeberg svenseeberg added the enhancement New feature or request label Dec 22, 2024
@svenseeberg svenseeberg changed the title Investigate OpenSearch with dense and sparse embedding Investigate search with dense and sparse embedding Dec 22, 2024