Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A redesign of tugraph-db vector index #819

Open
zhjwpku opened this issue Dec 16, 2024 · 0 comments
Open

A redesign of tugraph-db vector index #819

zhjwpku opened this issue Dec 16, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@zhjwpku
Copy link
Member

zhjwpku commented Dec 16, 2024

Background

The current implementation of vector indexing in TuGraph-DB creates a single index on the vector attribute column of a Label. For in-memory graph indexes like HNSW, it allows for incremental index construction (currently only supported by VSAG's HNSW index). However, this implementation has a drawback for indexes that do not support incremental construction, such as cluster-based indexes (e.g., IVF-FLAT) and disk-based graph indexes (e.g., DiskANN), where a single index becomes insufficient.

In TuGraph-DB, the update operation for vector index columns is converted into a delete operation followed by an add operation. The delete is a mark-delete, meaning vector data is only added and not deleted.

This feature is aimed at TuGraph-DB v5.x.

General Design

Inspired by the compaction concept from RocksDB, we segment the Vids of vertices, where each IndexChunk corresponds to a specific range of Vids with no overlap between the [min_vid, max_vid] of different IndexChunks. Utilizing this characteristic, we can construct an LSM-tree with a layer depth of one. This setup allows us to merge multiple segments using background threads.

image

Terms & Data structures

  • Vector Index

    • When we create a vector index on a vector attribute column of a specific Label, we are essentially creating a Vector Index.
  • MemIndexChunk

    • Substructure of the Vector Index, which is a in-memory data structure that stores vector data for a specific range of Vids. Internally, it maintains a Faiss Flat index to provide exhaustive search capabilities. At any given time, a Vector Index has only one active MemIndexChunk. When a MemIndexChunk reaches a certain amount of data, it is sealed. At this point, the Vector Index creates a new MemIndexChunk to handle further writes.
      Updates to a sealed MemIndexChunk are written to a Delta data structure. Depending on the volume of data written and the feasibility of compaction, there can be multiple sealed MemIndexChunks simultaneously.
  • IndexChunk

    • Substructure of the Vector Index, which is a persistent vector index that stores data for a specific range of Vids. Internally, it maintains the corresponding index type for the Vector Index, such as HNSW, IVF-FLAT, DISKANN, etc. This structure is generated through the compaction of a MemIndexChunk or the compaction of multiple IndexChunks.
  • IdMapper

    • Substructure of an IndexChunk, maps each Vid to a labelid in the vector index (or can be simply understood as the position of the vector in the vector index). This data structure is generated and written to a disk file during compaction. When the database restarts, this data structure is directly read from the disk and loaded into memory.
  • DeleteMap

    • Substructure of an IndexChunk, it records which labelid in the vector index has been marked as deleted. Although this structure resides in memory, when an element is deleted, WAL (Write-Ahead Logging) data is constructed based on the corresponding Vid and written to RocksDB. When the database restarts, this data structure is restored based on the WAL corresponding to the IndexChunk and the IdMapper.
  • Delta

    • Substructure of both IndexChunk and MemChunkIndex. After these two data structures are sealed, subsequent update (non-delete) operations are written to this data structure. The difference is that for IndexChunk, the Vids corresponding to the update operations need to be persisted to RocksDB, and this data structure needs to be reconstructed from RocksDB upon restart. In contrast, the Delta in MemChunkIndex does not need to be persisted.

CRUD

  • Insert

    • For newly written data, due to the incrementing nature of Tugraph-db Vids, most data will fall within the Vid range of the Active MemIndexChunk. However, due to concurrency, a small number of Vids may fall into Sealed MemIndexChunks or IndexChunk.
  • Delete

    • To delete a vector corresponding to a Vid, the specific IndexChunk/MemIndexChunk can be easily identified based on the Vid (thanks to the non-overlapping nature of our LSM-tree). The label can then be marked as deleted by referencing the IdMapper. For an IndexChunk, the Vid associated with the delete operation must be persisted to RocksDB. During queries, an IDSelector will be constructed to filter out the deleted data.
  • Update

    • In Tugraph-db, updates to the vector index column are internally transformed into a delete and an insert. Therefore, the Vid of the old vertex will be marked as deleted in the corresponding IndexChunk/MemIndexChunk, and the new index data will be written into the Delta data structure. For an IndexChunk, the Vid associated with the update operation must be persisted to RocksDB.
  • Query

    • A query will search through all MemIndexChunks and IndexChunks. The results from multiple queries will be merged into a single top-k result in the context of the Vector Index (handled using a heap structure based on the returned scores) and then returned to the client.

Compaction

Inspired by the compaction concept from RocksDB, sealed MemIndexChunks are converted into IndexChunks, or multiple IndexChunks are merged into a single larger IndexChunk. We need to optimize the algorithm to provide a better compaction scheduling strategy that improves query recall while minimizing write amplification.

It is worth noting that the compaction process involves re-reading vector values from the original data based on Vid ranges and regenerating the vector index. The compaction merge operation must occur between adjacent IndexChunks. Before starting the compaction, an incrementing sequence value x is read within an Index. After the index file and IdMapper are persisted to disk, the WAL for the merged IndexChunk is replayed starting from x. This ensures that the newly generated IndexChunk is equivalent to the multiple IndexChunks before the merge.

zhjwpku added a commit to zhjwpku/tugraph-db that referenced this issue Dec 16, 2024
This is an implementation of TuGraph-family#819

Signed-off-by: Junwang Zhao <[email protected]>
@zhjwpku zhjwpku added the enhancement New feature or request label Dec 16, 2024
ljcui pushed a commit that referenced this issue Dec 16, 2024
* Redesign vector index

This is an implementation of #819

Signed-off-by: Junwang Zhao <[email protected]>

* set vector index related column family to nullptr

---------

Signed-off-by: Junwang Zhao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant