[Bug]: Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0) #1506

jose-mendez-santos · 2024-12-12T13:02:01Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When processing a document (input) and splitting it into chunks, the library does not respect the configuration specified in the settings.yaml file if the overlap is set to 0. Instead, it defaults to a value of 100. However, when the overlap is set to 1, the expected behavior is observed. This means there is currently no way to split the original text without applying overlap.
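The report does not identify the cause. One common Python pattern that produces exactly this symptom (0 replaced by the default, 1 honored) is a truthiness-based fallback, since `0` is falsy. The sketch below is an assumption for illustration only and has not been checked against the GraphRAG source; `DEFAULT_OVERLAP`, `resolve_overlap_buggy`, and `resolve_overlap_fixed` are hypothetical names.

```python
from typing import Optional

# Hypothetical default, matching the value of 100 observed in this report.
DEFAULT_OVERLAP = 100


def resolve_overlap_buggy(configured: Optional[int]) -> int:
    # `or` falls back whenever the left side is falsy, and 0 is falsy,
    # so an explicit overlap of 0 is silently replaced by the default.
    return configured or DEFAULT_OVERLAP


def resolve_overlap_fixed(configured: Optional[int]) -> int:
    # Fall back only when the value is actually missing (None).
    return DEFAULT_OVERLAP if configured is None else configured


if __name__ == "__main__":
    for value in (None, 0, 1):
        print(value, resolve_overlap_buggy(value), resolve_overlap_fixed(value))
```

If something like this is the cause, comparing against `None` instead of relying on truthiness would let `overlap: 0` pass through unchanged.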

Steps to reproduce

The problematic configuration in the settings.yaml file is as follows:

chunks:
  size: 1200
  overlap: 0
  group_by_columns: [id]
  encoding_model: o200k_base

Expected Behavior

The input document has a total of 5,273 tokens, so with overlap: 0 the chunk token counts should sum to 5,273. Instead, the total sums to 5,673 (shown in a screenshot in the original report), which is 400 tokens higher. That matches an overlap of 100 tokens applied at each of the 4 boundaries between 5 chunks.

When using an overlap of 1, the result is 5,277 tokens, only 4 tokens above the original text. This indicates that the overlap setting is respected for values greater than 0. However, it is not respected when set to 0, that is, when no overlap is desired.
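The token totals above can be reproduced with a minimal sliding-window sketch. It assumes a plain token-window splitter where each chunk starts `size - overlap` tokens after the previous one; `chunk_lengths` is a hypothetical helper, not GraphRAG's implementation, and it works on token counts only, with no tokenizer.

```python
def chunk_lengths(total_tokens: int, size: int, overlap: int) -> list[int]:
    """Return the token count of each chunk from a sliding window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # distance between consecutive chunk starts
    lengths = []
    start = 0
    while start < total_tokens:
        end = min(start + size, total_tokens)
        lengths.append(end - start)
        if end == total_tokens:
            break  # the window reached the end of the document
        start += step
    return lengths


if __name__ == "__main__":
    for overlap in (0, 1, 100):
        lengths = chunk_lengths(5273, 1200, overlap)
        print(f"overlap={overlap}: chunks={lengths}, total={sum(lengths)}")
```

Under these assumptions, an overlap of 100 gives a total of 5,673 and an overlap of 1 gives 5,277, matching the reported numbers, while a true overlap of 0 gives 5,273.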

GraphRAG Config Used

chunks:
  size: 1200
  overlap: 0
  group_by_columns: [id]
  encoding_model: o200k_base

Logs and screenshots

No response

Additional Information

GraphRAG Version: 1.0.0
Operating System: Windows 11
Python Version: Python 3.11.9
Related Issues:

jose-mendez-santos added the bug (Something isn't working) and triage (Default label assignment, indicates new issue needs reviewed by a maintainer) labels Dec 12, 2024

jose-mendez-santos changed the title ~~[Bug]: <title> Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0)~~ [Bug]: Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0) Dec 12, 2024
