
Bad performance for running run_retrieve_tevatron.sh #6

Open
acphile opened this issue Nov 20, 2024 · 0 comments

Comments


acphile commented Nov 20, 2024

Hi, I tried to build the index of the wiki corpus using the script you provide in scripts/run_retrieve_tevatron.sh. However, I found that the retrieval evaluation performance is very poor.
The commands I ran are:

```shell
for s in $(seq -f "%02g" 0 4)
do
CUDA_VISIBLE_DEVICES=${s} python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path BAAI/bge-large-en-v1.5 \
  --normalize True \
  --fp16 \
  --per_device_eval_batch_size 128 \
  --passage_max_len 512 \
  --dataset_name "TIGER-Lab/LongRAG" \
  --dataset_config "hotpot_qa_corpus" \
  --dataset_split "train" \
  --dataset_number_of_shards 4 \
  --encode_output_path emb_bge_official/corpus_emb_${s}.pkl \
  --dataset_shard_index ${s} >${s}.log 2>&1 &
done
```

```shell
CUDA_VISIBLE_DEVICES=0 python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path BAAI/bge-large-en-v1.5 \
  --normalize True \
  --query_prefix "Represent this sentence for searching relevant passages: " \
  --fp16 \
  --per_device_eval_batch_size 256 \
  --dataset_name "TIGER-Lab/LongRAG" \
  --dataset_config "hotpot_qa" \
  --dataset_split "subset_1000" \
  --encode_output_path query_hotpot_1000.pkl \
  --query_max_len 32 \
  --encode_is_query
```

```shell
CUDA_VISIBLE_DEVICES=0 python -m tevatron.retriever.driver.search \
  --query_reps query_hotpot_1000.pkl \
  --passage_reps "emb_bge_official/corpus_emb*.pkl" \
  --depth 200 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to hqa_official_rank_200_new.txt
```

After checking the Tevatron implementation, I believe it does not implement the max_p design described in the 'Similarity search' part of Section 2.1 of your paper. Would you mind providing your implementation and the commands for similarity search that can be used to reproduce the BGE-large row in your Table 4? Thank you very much!
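For context, my understanding of the max_p aggregation from Section 2.1 is sketched below: each long retrieval unit is split into short passages, and a unit's score for a query is the maximum query-passage similarity over its passages, rather than scoring long units directly. This is only a minimal illustrative sketch, not code from this repo or from Tevatron; the function name and the `unit_of_passage` mapping are my own assumptions.

```python
# Hypothetical sketch of max_p aggregation (not from this repo):
# score(query, unit) = max over passages p in unit of sim(query, p).
import numpy as np

def maxp_rank(query_emb, passage_embs, unit_of_passage, depth=200):
    """Rank retrieval units by their best-matching passage.

    query_emb:       (d,) normalized query embedding
    passage_embs:    (n, d) normalized passage embeddings
    unit_of_passage: list mapping passage index -> unit id (assumed mapping)
    """
    # Inner product == cosine similarity since embeddings are normalized.
    sims = passage_embs @ query_emb
    unit_best = {}
    for pid, score in enumerate(sims):
        uid = unit_of_passage[pid]
        if score > unit_best.get(uid, float("-inf")):
            unit_best[uid] = score  # keep the max passage score per unit
    ranked = sorted(unit_best, key=lambda uid: -unit_best[uid])
    return ranked[:depth]
```

If this matches your design, the plain Tevatron search above would only return flat passage rankings, and an extra grouping step like this would be needed before evaluation.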
