[BUG]: Llama Sharp LLamaEmbedder Chunking #1011

Open
koureasstavros opened this issue Dec 3, 2024 · 5 comments

Comments

@koureasstavros

Description

I am using the following code to generate embeddings from very large documents. Of course, each document's tokens can exceed the maximum context of the selected model, so the document's tokens should be split into chunks, each chunk embedded, and the average embedding taken at the end.

string param_model_path = Parameters.model_embedding_local_path; //USED FOR CUSTOM MODEL PATH

var parameters = new ModelParams(param_model_path)
{
    ContextSize = param_model_embedding_global_maxtokens, // The longest length of chat as memory.
    GpuLayerCount = 0, // How many layers to offload to GPU. Please adjust it according to your GPU memory.
    Embeddings = true, // Embedding size cannot change; it is fixed by the model's embedding layer.
    VocabOnly = false, // Loading the full model (not vocab only) needs more memory, which scales with ContextSize.
    PoolingType = LLama.Native.LLamaPoolingType.Mean,
    BatchSize = 1024,
    UBatchSize = 1024
};

LLamaWeights weights = LLamaWeights.LoadFromFile(parameters);
model_llama = new LLamaEmbedder(weights, parameters);

IQueryable<Shared.Models.Database.Content> contents_lama = _dbContext.Contents.Where(x => types.Contains(x.content_type));
foreach (Shared.Models.Database.Content content_lama in contents_lama)
{
    float[]? text_embedding = model_llama.GetEmbeddings(content_lama.content_document, cancellationToken).Result.Single();
    content_lama.content_embedding = FloatToDouble(text_embedding);
    content_lama.content_timestamp_emd = DateTime.Now;
    _dbContext.Entry(content_lama).State = EntityState.Modified;
}

_dbContext.SaveChanges();

I have tested the above with multiple settings for BatchSize and UBatchSize (like 512 or 2048) but I always get the following error:

One or more errors occurred. (Input contains more tokens than configured batch size (Parameter 'batch'))'

Then I even tried to create my own token-chunking method and average calculator, which works perfectly with the Azure OpenAI ADA model. I incorporated it with LLamaSharp as well, but the same error occurred even when each chunk was less than 512 tokens.

I tested the following language models:

  • Phi-3.5-mini-instruct-Q6_K_L.gguf
  • Meta-Llama-3-8B-Instruct.Q3_K_L.gguf

I tested using the following LLamaSharp versions:

  • 0.19.0
  • 0.18.0

Note that in another GitHub issue (#921), martindevans mentions that the "embedder can't split input into multiple batches at the moment".

So LLamaEmbedder with LLama.Native.LLamaPoolingType.Mean is not supposed to do chunking and averaging?

Reproduction Steps

Try to embed a small text like "This is a test".
Then try to embed a large text, such as a document, using the batch and mean pooling parameters.

Environment & Configuration

  • Operating system: Windows and Linux
  • .NET runtime version: .NET 8
  • LLamaSharp version: 0.19.0
  • CUDA version (if you are using cuda backend): None
  • CPU & GPU device: Lenovo ThinkPad Laptop

Known Workarounds

No response

@martindevans
Member

The embedder does not currently do any chunking of large documents - it simply takes all of the content you feed it and processes it in one go. It's up to you to ensure that's small enough to fit within the configured BatchSize.

LLama.Native.LLamaPoolingType.Mean will do a mean average of the embeddings within that batch.
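
As an illustration (a sketch only, reusing model_llama from the code above and assuming the same GetEmbeddings overload shown there):

// Fits in one batch: Mean pooling returns a single pooled vector for the whole input.
float[] small = (await model_llama.GetEmbeddings("This is a test", CancellationToken.None)).Single();

// An input whose token count exceeds BatchSize is not split automatically; it fails with the
// "Input contains more tokens than configured batch size" error reported above.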

@koureasstavros
Author

koureasstavros commented Dec 4, 2024

Regarding this statement:

LLama.Native.LLamaPoolingType.Mean will do a mean average of the embeddings within that batch.

I assume that I can set the UBatchSize to the context size of the selected model, and the BatchSize to something bigger than the UBatchSize. So if I have a custom chunking mechanism that splits into fixed token counts smaller than BatchSize, it should work.

But based on that GitHub issue (#921), BatchSize cannot be different from UBatchSize:

@params.UBatchSize != @params.BatchSize

@martindevans
Member

UBatchSize to the context size of the selected model, and the BatchSize to a bigger size than the UBatchSize

UBatchSize doesn't really have anything to do with the context size of the model; instead it simply sets how much work the GPU will do at once. So if you have e.g. BatchSize=100, UBatchSize=20, you can't possibly submit more than 100 tokens (that's the max batch size), and if you do submit 100 tokens the GPU will internally process them in 100/20 = 5 chunks.

However, this isn't relevant to embedding models because of:

if (@params.UBatchSize != @params.BatchSize)
    throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", nameof(@params));

The UBatch mechanism doesn't support non-causal models; all the work must be processed at once. That restriction is simply copied from llama.cpp.

So if I have a custom chunking mechanism with a fixed token count split smaller than BatchSize, it should work.

This sounds right to me.

I think you should be able to:

  1. Tokenize document
  2. Split into chunks <= BatchSize
  3. Embed each chunk (this produces one vector per chunk)
  4. Take mean average of all chunk embedding vectors (element-wise sum vectors, divide by number of chunks)
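
For example, here is a rough sketch of that approach. This is not an API LLamaSharp provides; it reuses the GetEmbeddings call from the code above, and for brevity it splits on whitespace-separated words as a stand-in for real token counting. In practice you should tokenize first and split on token counts so that each chunk is guaranteed to fit within BatchSize.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using LLama;

// Sketch only: word-based chunking here approximates token-based chunking.
static async Task<float[]> EmbedLargeDocumentAsync(
    LLamaEmbedder embedder, string document, int maxWordsPerChunk, CancellationToken ct = default)
{
    // 1 + 2. Split the document into chunks of at most maxWordsPerChunk words.
    var words = document.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    for (var i = 0; i < words.Length; i += maxWordsPerChunk)
        chunks.Add(string.Join(' ', words.Skip(i).Take(maxWordsPerChunk)));
    if (chunks.Count == 0)
        throw new ArgumentException("Document is empty", nameof(document));

    // 3. Embed each chunk (Mean pooling produces one vector per chunk).
    var chunkEmbeddings = new List<float[]>();
    foreach (var chunk in chunks)
        chunkEmbeddings.Add((await embedder.GetEmbeddings(chunk, ct)).Single());

    // 4. Element-wise mean of all chunk vectors.
    var mean = new float[chunkEmbeddings[0].Length];
    foreach (var vector in chunkEmbeddings)
        for (var d = 0; d < mean.Length; d++)
            mean[d] += vector[d];
    for (var d = 0; d < mean.Length; d++)
        mean[d] /= chunkEmbeddings.Count;

    return mean;
}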

@koureasstavros
Copy link
Author

So I am curious, what exactly is the batch doing?
Why do I have to do my own chunking and then have LLamaSharp do another round of "batching"?

@martindevans
Copy link
Member

Batching is a fairly low-level implementation detail which makes processing large amounts of data more efficient.

For example, if you want the model to process a large prompt before generating some text, it's more efficient to process that prompt in a single large batch than to process it one. token. at. a. time.

Alternatively, if you're generating multiple different sequences all at once (e.g. 100 parallel conversations), rather than processing each conversation one at a time you can process all 100 simultaneously in a batch.

BatchSize sets how many tokens you can submit to the model at once, and how many results can be returned. If BatchSize is 100 you can't submit more than 100 tokens at once, and you can't generate more than 100 tokens at once. It's usually fine to make this larger (it costs some memory to hold the larger batch).

However, the GPU can't always handle a whole batch. UBatchSize sets how much work is actually submitted to the GPU at once - a batch of work is split up into small batches and processed, then all of the results are returned at the end. Normally this would be tuned to fit your GPU capability.

None of this is really relevant to embedding though, since there's that UBatchSize == BatchSize requirement.
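
So for an embedding model, the practical rule of thumb (my reading of the above, not an official recommendation) is to keep BatchSize equal to UBatchSize and make both at least as large as the biggest chunk you plan to embed:

// Reusing the variable names from the code at the top of this issue.
var parameters = new ModelParams(param_model_path)
{
    Embeddings = true,
    PoolingType = LLama.Native.LLamaPoolingType.Mean,
    BatchSize = 1024,  // maximum tokens per GetEmbeddings call
    UBatchSize = 1024  // must equal BatchSize for non-causal (embedding) models
};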
