[Bug]: Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0) #1506

Open
3 tasks done
jose-mendez-santos opened this issue Dec 12, 2024 · 0 comments
Labels
  • bug: Something isn't working
  • triage: Default label assignment, indicates new issue needs reviewed by a maintainer

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When an input document is split into chunks, the library ignores the value configured in settings.yaml if overlap is set to 0 and uses a default of 100 instead. When overlap is set to 1, the configured value is respected. As a result, there is currently no way to split the original text without overlap.
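
One common way this symptom arises in Python is a truthiness check on the configured value, because 0 is falsy and falls through to the default. The sketch below only illustrates that pattern. The function names and the `DEFAULT_OVERLAP` constant are hypothetical, are not taken from the GraphRAG source, and this has not been confirmed as the actual cause.

```python
from typing import Optional

# Hypothetical default, matching the value of 100 observed in this issue.
DEFAULT_OVERLAP = 100


def resolve_overlap_truthy(configured: Optional[int]) -> int:
    """Fallback via `or`: treats 0 the same as a missing value."""
    return configured or DEFAULT_OVERLAP


def resolve_overlap_explicit(configured: Optional[int]) -> int:
    """Fallback only when the value is actually missing (None)."""
    return DEFAULT_OVERLAP if configured is None else configured


if __name__ == "__main__":
    for value in (None, 0, 1):
        print(value, resolve_overlap_truthy(value), resolve_overlap_explicit(value))
```

With the `or` fallback, a configured 0 is replaced by the default while 1 passes through unchanged, which matches the behavior reported here. The explicit `None` check keeps 0 as a valid setting.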

Steps to reproduce

The problematic configuration in the settings.yaml file is as follows:

chunks:
  size: 1200
  overlap: 0
  group_by_columns: [id]
  encoding_model: o200k_base

Expected Behavior

The input document contains 5,273 tokens, so with overlap: 0 the token counts of the chunks should sum to exactly 5,273. Instead, as shown in the image below, they sum to 5,673, which is 400 tokens more than the original text.

Image

With an overlap of 1, the chunks sum to 5,277 tokens, only 4 tokens more than the original text. This shows that overlap is applied correctly for values greater than 0, but a value of 0 (no overlap desired) is not honored.

Image
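
The totals above are consistent with a plain sliding window over 5,273 tokens that produces five chunks, where each of the four chunk boundaries repeats `overlap` tokens. The sketch below reproduces that arithmetic. It assumes each chunk starts `size - overlap` tokens after the previous one; the helper `chunk_lengths` is hypothetical and is not GraphRAG's splitter.

```python
from typing import List


def chunk_lengths(total_tokens: int, size: int, overlap: int) -> List[int]:
    """Token count of each chunk for a sliding window (assumed model)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # distance between consecutive chunk starts
    lengths = []
    start = 0
    while start < total_tokens:
        end = min(start + size, total_tokens)
        lengths.append(end - start)
        if end == total_tokens:
            break  # last chunk reached the end of the document
        start += step
    return lengths


if __name__ == "__main__":
    for overlap in (0, 1, 100):
        lengths = chunk_lengths(5273, 1200, overlap)
        # Sum of chunk sizes = original tokens + overlap * (chunks - 1)
        print(f"overlap={overlap}: {len(lengths)} chunks, {sum(lengths)} tokens")
```

Under this model an overlap of 100 gives 5,273 + 4 × 100 = 5,673 tokens and an overlap of 1 gives 5,273 + 4 × 1 = 5,277 tokens, so the total observed with overlap: 0 matches what an effective overlap of 100 would produce.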

GraphRAG Config Used

chunks:
  size: 1200
  overlap: 0
  group_by_columns: [id]
  encoding_model: o200k_base

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 1.0.0
  • Operating System: Windows 11
  • Python Version: Python 3.11.9
  • Related Issues:
@jose-mendez-santos added the bug and triage labels on Dec 12, 2024
@jose-mendez-santos changed the title from "[Bug]: <title> Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0)" to "[Bug]: Incorrect Handling of overlap Parameter in Chunking Process (with overlap: 0)" on Dec 12, 2024