You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current implementation of vector indexing in TuGraph-DB creates a single index on the vector attribute column of a Label. For in-memory graph indexes like HNSW, it allows for incremental index construction (currently only supported by VSAG's HNSW index). However, this implementation has a drawback for indexes that do not support incremental construction, such as cluster-based indexes (e.g., IVF-FLAT) and disk-based graph indexes (e.g., DiskANN), where a single index becomes insufficient.
In TuGraph-DB, the update operation for vector index columns is converted into a delete operation followed by an add operation. The delete is a mark-delete, meaning vector data is only added and not deleted.
This feature is aimed at TuGraph-DB v5.x.
General Design
Inspired by the compaction concept from RocksDB, we segment the Vids of vertices, where each IndexChunk corresponds to a specific range of Vids with no overlap between the [min_vid, max_vid] of different IndexChunks. Utilizing this characteristic, we can construct an LSM-tree with a layer depth of one. This setup allows us to merge multiple segments using background threads.
Terms & Data structures
Vector Index
When we create a vector index on a vector attribute column of a specific Label, we are essentially creating a Vector Index.
MemIndexChunk
Substructure of the Vector Index, which is a in-memory data structure that stores vector data for a specific range of Vids. Internally, it maintains a Faiss Flat index to provide exhaustive search capabilities. At any given time, a Vector Index has only one active MemIndexChunk. When a MemIndexChunk reaches a certain amount of data, it is sealed. At this point, the Vector Index creates a new MemIndexChunk to handle further writes.
Updates to a sealed MemIndexChunk are written to a Delta data structure. Depending on the volume of data written and the feasibility of compaction, there can be multiple sealed MemIndexChunks simultaneously.
IndexChunk
Substructure of the Vector Index, which is a persistent vector index that stores data for a specific range of Vids. Internally, it maintains the corresponding index type for the Vector Index, such as HNSW, IVF-FLAT, DISKANN, etc. This structure is generated through the compaction of a MemIndexChunk or the compaction of multiple IndexChunks.
IdMapper
Substructure of an IndexChunk, maps each Vid to a labelid in the vector index (or can be simply understood as the position of the vector in the vector index). This data structure is generated and written to a disk file during compaction. When the database restarts, this data structure is directly read from the disk and loaded into memory.
DeleteMap
Substructure of an IndexChunk, it records which labelid in the vector index has been marked as deleted. Although this structure resides in memory, when an element is deleted, WAL (Write-Ahead Logging) data is constructed based on the corresponding Vid and written to RocksDB. When the database restarts, this data structure is restored based on the WAL corresponding to the IndexChunk and the IdMapper.
Delta
Substructure of both IndexChunk and MemChunkIndex. After these two data structures are sealed, subsequent update (non-delete) operations are written to this data structure. The difference is that for IndexChunk, the Vids corresponding to the update operations need to be persisted to RocksDB, and this data structure needs to be reconstructed from RocksDB upon restart. In contrast, the Delta in MemChunkIndex does not need to be persisted.
CRUD
Insert
For newly written data, due to the incrementing nature of Tugraph-db Vids, most data will fall within the Vid range of the Active MemIndexChunk. However, due to concurrency, a small number of Vids may fall into Sealed MemIndexChunks or IndexChunk.
Delete
To delete a vector corresponding to a Vid, the specific IndexChunk/MemIndexChunk can be easily identified based on the Vid (thanks to the non-overlapping nature of our LSM-tree). The label can then be marked as deleted by referencing the IdMapper. For an IndexChunk, the Vid associated with the delete operation must be persisted to RocksDB. During queries, an IDSelector will be constructed to filter out the deleted data.
Update
In Tugraph-db, updates to the vector index column are internally transformed into a delete and an insert. Therefore, the Vid of the old vertex will be marked as deleted in the corresponding IndexChunk/MemIndexChunk, and the new index data will be written into the Delta data structure. For an IndexChunk, the Vid associated with the update operation must be persisted to RocksDB.
Query
A query will search through all MemIndexChunks and IndexChunks. The results from multiple queries will be merged into a single top-k result in the context of the Vector Index (handled using a heap structure based on the returned scores) and then returned to the client.
Compaction
Inspired by the compaction concept from RocksDB, sealed MemIndexChunks are converted into IndexChunks, or multiple IndexChunks are merged into a single larger IndexChunk. We need to optimize the algorithm to provide a better compaction scheduling strategy that improves query recall while minimizing write amplification.
It is worth noting that the compaction process involves re-reading vector values from the original data based on Vid ranges and regenerating the vector index. The compaction merge operation must occur between adjacent IndexChunks. Before starting the compaction, an incrementing sequence value x is read within an Index. After the index file and IdMapper are persisted to disk, the WAL for the merged IndexChunk is replayed starting from x. This ensures that the newly generated IndexChunk is equivalent to the multiple IndexChunks before the merge.
The text was updated successfully, but these errors were encountered:
zhjwpku
added a commit
to zhjwpku/tugraph-db
that referenced
this issue
Dec 16, 2024
* Redesign vector index
This is an implementation of #819
Signed-off-by: Junwang Zhao <[email protected]>
* set vector index related column family to nullptr
---------
Signed-off-by: Junwang Zhao <[email protected]>
Background
The current implementation of vector indexing in TuGraph-DB creates a single index on the vector attribute column of a Label. For in-memory graph indexes like HNSW, it allows for incremental index construction (currently only supported by VSAG's HNSW index). However, this implementation has a drawback for indexes that do not support incremental construction, such as cluster-based indexes (e.g., IVF-FLAT) and disk-based graph indexes (e.g., DiskANN), where a single index becomes insufficient.
In TuGraph-DB, the update operation for vector index columns is converted into a delete operation followed by an add operation. The delete is a mark-delete, meaning vector data is only added and not deleted.
This feature is aimed at TuGraph-DB v5.x.
General Design
Inspired by the compaction concept from RocksDB, we segment the Vids of vertices, where each IndexChunk corresponds to a specific range of Vids with no overlap between the [min_vid, max_vid] of different IndexChunks. Utilizing this characteristic, we can construct an LSM-tree with a layer depth of one. This setup allows us to merge multiple segments using background threads.
Terms & Data structures
Vector Index
MemIndexChunk
Updates to a sealed MemIndexChunk are written to a Delta data structure. Depending on the volume of data written and the feasibility of compaction, there can be multiple sealed MemIndexChunks simultaneously.
IndexChunk
IdMapper
DeleteMap
Delta
CRUD
Insert
Delete
Update
Query
Compaction
Inspired by the compaction concept from RocksDB, sealed MemIndexChunks are converted into IndexChunks, or multiple IndexChunks are merged into a single larger IndexChunk. We need to optimize the algorithm to provide a better compaction scheduling strategy that improves query recall while minimizing write amplification.
It is worth noting that the compaction process involves re-reading vector values from the original data based on Vid ranges and regenerating the vector index. The compaction merge operation must occur between adjacent IndexChunks. Before starting the compaction, an incrementing sequence value
x
is read within an Index. After the index file and IdMapper are persisted to disk, the WAL for the merged IndexChunk is replayed starting fromx
. This ensures that the newly generated IndexChunk is equivalent to the multiple IndexChunks before the merge.The text was updated successfully, but these errors were encountered: