[Bug]: Loss of entity records of previous text after training new text #1519

xldistance · 2024-12-15T15:46:00Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

After training (indexing) one text, I can query its entity records normally. After training a new text, the entity records of the previous text can no longer be queried. Is this because the parquet files from the previous run are overwritten?
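One way to check the overwrite hypothesis is to fingerprint the parquet files before and after indexing the second text. The following is a minimal sketch using only the standard library; the helper names `fingerprint_parquet_files` and `changed_files` are hypothetical and not part of GraphRAG, and the directory is assumed to be the `storage.base_dir` from the config below (`input/artifacts`).

```python
import hashlib
from pathlib import Path


def fingerprint_parquet_files(artifacts_dir):
    """Map each *.parquet file name in artifacts_dir to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(Path(artifacts_dir).glob("*.parquet")):
        digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before, after):
    """Return the names present in both fingerprints whose digest differs."""
    return sorted(
        name for name in before if name in after and before[name] != after[name]
    )
```

Call `fingerprint_parquet_files("input/artifacts")` once before and once after the second indexing run, then pass both results to `changed_files`. If the files named in the comment below show up in the result, they were rewritten in place.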

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  # exllamav2
  api_key: xxx
  type: openai_chat # or azure_openai_chat
  model: Rombos-Coder-V2.5-Qwen-32b-exl2_5.0bpw
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 16000
  api_base: http://127.0.0.1:5001/v1
  requests_per_minute: 5_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  temperature: 0.5 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: asyncio # or threaded

embeddings:
  async_mode: asyncio # or threaded
  vector_store: 
    type: lancedb
    db_uri: 'output\lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: bge-m3:Q4
    api_base: http://localhost:11434/v1
    max_tokens: 8192
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "input/reports"

storage:
  type: file # or blob
  base_dir: "input/artifacts"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

No response

Additional Information

GraphRAG Version: 1.0
Operating System:
Python Version:
Related Issues:

xldistance · 2024-12-15T15:51:08Z

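    from graphrag.query.structured_search.global_search.community_context import (
        GlobalCommunityContext,
    )
    from graphrag.query.structured_search.global_search.search import GlobalSearch

    # llm, token_encoder, communities, reports and entities are created earlier (not shown)
    global_context_builder = GlobalCommunityContext(
        communities=communities,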
        community_reports=reports,
        entities=entities,
        token_encoder=token_encoder,
    )

    global_context_builder_params = {
        "use_community_summary": False,
        "shuffle_data": True,
        "include_community_rank": True,
        "min_community_rank": 0.5,
        "community_rank_name": "rank",
        "include_community_weight": True,
        "community_weight_name": "occurrence weight",
        "normalize_community_weight": True,
        "max_tokens": 12_000,
        "context_name": "Reports",
    }

    map_llm_params = {
        "max_tokens": 12_000,
        "temperature": 0.5,
        "response_format": {"type": "json_object"},
    }

    reduce_llm_params = {
        "max_tokens": 12_000,
        "temperature": 0.5,
    }

    global_search_engine = GlobalSearch(
        llm=llm,
        context_builder=global_context_builder,
        token_encoder=token_encoder,
        max_data_tokens=12_000,
        map_llm_params=map_llm_params,
        reduce_llm_params=reduce_llm_params,
        allow_general_knowledge=True,
        json_mode=True,
        context_builder_params=global_context_builder_params,
        concurrent_coroutines=32,
        response_type="multiple paragraphs",
    )

global_search only uses the files create_final_communities.parquet, create_final_community_reports.parquet, and create_final_entities.parquet. Each time a new text is trained (indexed), these files are overwritten, so the data from the previously trained text is lost.
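Until the cause is confirmed, a possible workaround is to copy the parquet files aside before indexing the next text. The sketch below assumes the artifacts live in `input/artifacts` (the `storage.base_dir` from the config above); `snapshot_artifacts`, the backup folder, and the label are hypothetical names chosen for illustration and are not part of GraphRAG.

```python
import shutil
from pathlib import Path


def snapshot_artifacts(artifacts_dir, backup_root, label):
    """Copy every *.parquet file in artifacts_dir into backup_root/label.

    Returns the list of copied paths. Raises FileExistsError if the label
    was already used, so an earlier snapshot is never overwritten.
    """
    source = Path(artifacts_dir)
    target = Path(backup_root) / label
    target.mkdir(parents=True, exist_ok=False)
    copied = []
    for path in sorted(source.glob("*.parquet")):
        # copy2 also preserves file timestamps
        copied.append(Path(shutil.copy2(path, target / path.name)))
    return copied
```

For example, `snapshot_artifacts("input/artifacts", "artifact_backups", "text_1")` before indexing the second text keeps the first run's entities, communities, and community reports available. This only preserves the old files; it does not merge them with the new index.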

xldistance added the bug and triage labels Dec 15, 2024

xldistance closed this as completed Dec 21, 2024

xldistance reopened this Dec 23, 2024
