I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
jose-mendez-santos changed the title from "[Bug]: <title> Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0)" to "[Bug]: Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0)" on Dec 12, 2024
Do you need to file an issue?
Describe the bug
When an input document is split into chunks, the library ignores the overlap configured in settings.yaml if that overlap is 0 and falls back to the default value of 100 instead. With an overlap of 1, chunking behaves as configured. As a result, there is currently no way to split the original text without overlap.
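The report does not identify the cause. One common Python pattern that would produce exactly this symptom (0 becomes 100, while 1 stays 1) is a truthiness-based fallback. The sketch below is a guess at the kind of bug involved, not GraphRAG's actual code; `DEFAULT_OVERLAP`, `resolve_overlap_buggy`, and `resolve_overlap_fixed` are hypothetical names.

```python
from typing import Optional

# Hypothetical default, chosen to match the fallback value reported in this issue.
DEFAULT_OVERLAP = 100


def resolve_overlap_buggy(configured: Optional[int]) -> int:
    # `or` treats 0 as "missing" because 0 is falsy, so an explicit 0
    # is silently replaced by the default.
    return configured or DEFAULT_OVERLAP


def resolve_overlap_fixed(configured: Optional[int]) -> int:
    # Only fall back when the value is genuinely absent.
    return DEFAULT_OVERLAP if configured is None else configured


if __name__ == "__main__":
    for value in (None, 0, 1):
        print(value, resolve_overlap_buggy(value), resolve_overlap_fixed(value))
```

If the real cause is of this kind, comparing against `None` rather than relying on truthiness would let `overlap: 0` pass through unchanged.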
Steps to reproduce
The problematic configuration in the settings.yaml file is as follows:
chunks:
  size: 1200
  overlap: 0
  group_by_columns: [id]
  encoding_model: o200k_base
Expected Behavior
The input document has a total of 5,273 tokens, so with overlap: 0 the chunk token counts should sum to 5,273. As shown in the image below, they instead sum to 5,673, which is 400 tokens higher.
With an overlap of 1, the total is 5,277 tokens, only 4 tokens above the original text. This shows that the overlap setting is honored for values greater than 0 but not for 0, which is the case where no overlap is desired.
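The reported totals can be checked with a little arithmetic. The sketch below assumes a plain sliding window that emits chunks of `size` tokens and advances by `size - overlap` tokens each step; that windowing rule is an assumption made for illustration, and `chunk_spans` and `total_chunk_tokens` are hypothetical helpers, not GraphRAG functions.

```python
from typing import List, Tuple


def chunk_spans(n_tokens: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token spans for a window of `size` tokens
    that advances by `size - overlap` tokens per step (assumed rule)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + size, n_tokens)))
        start += step
    return spans


def total_chunk_tokens(n_tokens: int, size: int, overlap: int) -> int:
    """Sum of chunk lengths, counting overlapped tokens once per chunk."""
    return sum(end - start for start, end in chunk_spans(n_tokens, size, overlap))


if __name__ == "__main__":
    # 5,273 input tokens and size 1200, as in the report.
    for overlap in (0, 1, 100):
        print(overlap, total_chunk_tokens(5273, 1200, overlap))
```

Under this assumption, an effective overlap of 100 yields five chunks totalling 5,673 tokens and an overlap of 1 yields 5,277, both matching the figures above, while a true overlap of 0 yields exactly 5,273. This is consistent with the configured 0 being replaced by 100.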
GraphRAG Config Used
Logs and screenshots
No response
Additional Information