[BUG]: Llama Sharp LLamaEmbedder Chunking #1011

Open
koureasstavros opened this issue Dec 3, 2024 · 5 comments

Comments

@koureasstavros

Description

I am using the following code to generate embeddings from very large documents. Of course, each document's tokens can exceed the maximum context of the selected model, so the document's tokens should be split into chunks, each chunk embedded, and the average embedding taken at the end.

string param_model_path = Parameters.model_embedding_local_path; //USED FOR CUSTOM MODEL PATH

var parameters = new ModelParams(param_model_path)
{
    ContextSize = param_model_embedding_global_maxtokens, // The longest length of chat as memory.
    GpuLayerCount = 0, // How many layers to offload to GPU. Please adjust it according to your GPU memory.
    Embeddings = true, // Embedding size cannot change; it is fixed by the model's embedding layer.
    VocabOnly = false, // Loading the full model (not vocab only) needs more memory, which scales with ContextSize.
    PoolingType = LLama.Native.LLamaPoolingType.Mean,
    BatchSize = 1024,
    UBatchSize = 1024
};

LLamaWeights weights = LLamaWeights.LoadFromFile(parameters);
model_llama = new LLamaEmbedder(weights, parameters);

IQueryable<Shared.Models.Database.Content> contents_lama = _dbContext.Contents.Where(x => types.Contains(x.content_type));
foreach (Shared.Models.Database.Content content_lama in contents_lama)
{
    float[]? text_embedding = model_llama.GetEmbeddings(content_lama.content_document, cancellationToken).Result.Single();
    content_lama.content_embedding = FloatToDouble(text_embedding);
    content_lama.content_timestamp_emd = DateTime.Now;
    _dbContext.Entry(content_lama).State = EntityState.Modified;
}

_dbContext.SaveChanges();

I have tested the above with multiple settings for BatchSize and UBatchSize (like 512 or 2048) but I always get the following error:

One or more errors occurred. (Input contains more tokens than configured batch size (Parameter 'batch'))'

Then I even tried to create my own token-chunking method and average calculator, which works perfectly with the Azure OpenAI ADA model. I incorporated it with LLamaSharp as well, but the same error occurred even when each chunk was less than 512 tokens.

I tested the following language models:

  • Phi-3.5-mini-instruct-Q6_K_L.gguf
  • Meta-Llama-3-8B-Instruct.Q3_K_L.gguf

I tested using the following LLamaSharp versions:

  • 0.19.0
  • 0.18.0

Note that in another GitHub issue (#921), martindevans mentions that the "embedder can't split input into multiple batches at the moment".

So LLamaEmbedder with LLama.Native.LLamaPoolingType.Mean is not supposed to do chunking and averaging?

Reproduction Steps

Try to embed a small text like "This is a test".
Then try to embed a large text, such as a document, using the batch and mean pooling parameters.

Environment & Configuration

  • Operating system: Windows and Linux
  • .NET runtime version: .NET 8
  • LLamaSharp version: 0.19.0
  • CUDA version (if you are using cuda backend): None
  • CPU & GPU device: Lenovo ThinkPad Laptop

Known Workarounds

No response

@martindevans
Member

The embedder does not currently do any chunking of large documents - it simply takes all of the content you feed it and processes it in one go. It's up to you to ensure that's small enough to fit within the configured BatchSize.

LLama.Native.LLamaPoolingType.Mean will do a mean average of the embeddings within that batch.
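
As an illustration (a sketch only, reusing model_llama from the code above and assuming the same GetEmbeddings overload shown there):

// Fits in one batch: Mean pooling returns a single pooled vector for the whole input.
float[] small = (await model_llama.GetEmbeddings("This is a test", CancellationToken.None)).Single();

// An input whose token count exceeds BatchSize is not split automatically; it fails with the
// "Input contains more tokens than configured batch size" error reported above.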

@koureasstavros
Author

koureasstavros commented Dec 4, 2024

Regarding this statement:

LLama.Native.LLamaPoolingType.Mean will do a mean average of the embeddings within that batch.

I assume that I can set the UBatchSize to the context size of the selected model, and the BatchSize to something bigger than the UBatchSize. So if I have a custom chunking mechanism that splits into fixed token counts smaller than BatchSize, it should work.

But based on that GitHub issue (#921), BatchSize cannot be different from UBatchSize:

@params.UBatchSize != @params.BatchSize

@martindevans
Member

UBatchSize to the context size of the selected model, and the BatchSize to a bigger size than the UBatchSize

UBatchSize doesn't really have anything to do with the context size of the model; instead it simply sets how much work the GPU will do at once. So if you have e.g. BatchSize=100, UBatchSize=20, you can't possibly submit more than 100 tokens (that's the max batch size), and if you do submit 100 tokens the GPU will internally process them in 100/20 = 5 chunks.

However, this isn't relevant to embedding models because of:

if (@params.UBatchSize != @params.BatchSize)
    throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", nameof(@params));

The UBatch mechanism doesn't support non-causal models; all the work must be processed at once. That restriction is simply copied from llama.cpp.

So if I have a custom chunking mechanism with a fixed token count split smaller than BatchSize, it should work.

This sounds right to me.

I think you should be able to:

  1. Tokenize document
  2. Split into chunks <= BatchSize
  3. Embed each chunk (this produces one vector per chunk)
  4. Take mean average of all chunk embedding vectors (element-wise sum vectors, divide by number of chunks)
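
For example, here is a rough sketch of that approach. This is not an API LLamaSharp provides; it reuses the GetEmbeddings call from the code above, and for brevity it splits on whitespace-separated words as a stand-in for real token counting. In practice you should tokenize first and split on token counts so that each chunk is guaranteed to fit within BatchSize.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using LLama;

// Sketch only: word-based chunking here approximates token-based chunking.
static async Task<float[]> EmbedLargeDocumentAsync(
    LLamaEmbedder embedder, string document, int maxWordsPerChunk, CancellationToken ct = default)
{
    // 1 + 2. Split the document into chunks of at most maxWordsPerChunk words.
    var words = document.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    for (var i = 0; i < words.Length; i += maxWordsPerChunk)
        chunks.Add(string.Join(' ', words.Skip(i).Take(maxWordsPerChunk)));
    if (chunks.Count == 0)
        throw new ArgumentException("Document is empty", nameof(document));

    // 3. Embed each chunk (Mean pooling produces one vector per chunk).
    var chunkEmbeddings = new List<float[]>();
    foreach (var chunk in chunks)
        chunkEmbeddings.Add((await embedder.GetEmbeddings(chunk, ct)).Single());

    // 4. Element-wise mean of all chunk vectors.
    var mean = new float[chunkEmbeddings[0].Length];
    foreach (var vector in chunkEmbeddings)
        for (var d = 0; d < mean.Length; d++)
            mean[d] += vector[d];
    for (var d = 0; d < mean.Length; d++)
        mean[d] /= chunkEmbeddings.Count;

    return mean;
}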

@koureasstavros
Copy link
Author

So I am curious, what exactly is the batch doing?
Why do I have to do my own chunking and then have LLamaSharp do another round of "batching"?

@martindevans
Copy link
Member

Batching is a fairly low-level implementation detail which makes processing large amounts of data more efficient.

For example, if you want the model to process a large prompt before generating some text, it's more efficient to process that prompt in a single large batch than to process it one. token. at. a. time.

Alternatively, if you're generating multiple different sequences all at once (e.g. 100 parallel conversations), rather than processing each conversation one at a time you can process all 100 simultaneously in a batch.

BatchSize sets how many tokens you can submit to the model at once, and how many results can be returned. If BatchSize is 100 you can't submit more than 100 tokens at once, and you can't generate more than 100 tokens at once. It's usually fine to make this larger (it costs some memory to hold the larger batch).

However, the GPU can't always handle a whole batch. UBatchSize sets how much work is actually submitted to the GPU at once - a batch of work is split up into small batches and processed, then all of the results are returned at the end. Normally this would be tuned to fit your GPU capability.

None of this is really relevant to embedding though, since there's that UBatchSize == BatchSize requirement.
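
So for an embedding model, the practical rule of thumb (my reading of the above, not an official recommendation) is to keep BatchSize equal to UBatchSize and make both at least as large as the biggest chunk you plan to embed:

// Reusing the variable names from the code at the top of this issue.
var parameters = new ModelParams(param_model_path)
{
    Embeddings = true,
    PoolingType = LLama.Native.LLamaPoolingType.Mean,
    BatchSize = 1024,  // maximum tokens per GetEmbeddings call
    UBatchSize = 1024  // must equal BatchSize for non-causal (embedding) models
};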
