

CODEC

Complex Document and Entity Collection

Table of Contents
  1. Overview
  2. Paper
  3. Dataset
  4. Change Log
  5. Tasks
  6. Complex Topics
  7. Document Corpus
  8. Entity KB
  9. Judgments
  10. Query Reformulations
  11. Entity-Centric Search
  12. Evaluation
  13. System Performance
  14. Future Work

Colab demo showing indexing, query reformulations, entity links, and evaluation: Open In Colab

Overview

CODEC is a new document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers across history, economics, and politics. For example, ‘How has the UK’s Open Banking Regulation benefited Challenger Banks?’

CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. It includes expert judgments on 6,186 documents (147.3 per topic) and 11,323 entities (269.6 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations (9.2 per topic), providing data for query performance prediction and automatic rewriting evaluation.

CODEC Diagram

Paper

This work will be presented at SIGIR 2022: https://arxiv.org/abs/2205.04546

Correct citation:

@inproceedings{mackie2022codec,
 title={CODEC: Complex Document and Entity Collection},
 author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
 booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 year={2022}
}

Dataset

CODEC provides 42 topics for document and entity retrieval:

The full CODEC document corpus is available for research purposes: FULL.

CODEC entity KB is KILT's snapshot of Wikipedia (~30GB).

Colab demo showing indexing, query reformulations, entity links, and evaluation: Open In Colab

Dataset is available via ir-datasets.
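
A minimal loading sketch (assuming the `codec` dataset ID from the ir-datasets catalogue; check the catalogue for the exact subsets and field names):

```python
# pip install ir_datasets
import ir_datasets

dataset = ir_datasets.load("codec")

# Each query is a namedtuple; print one to inspect its fields.
print(next(iter(dataset.queries_iter())))

# Relevance judgments use the standard TREC qrel fields.
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break

# Documents stream from the corpus download (large; iterate lazily).
print(next(iter(dataset.docs_iter())))
```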

Change Log

Major dataset changes that existing users should be aware of:

  • 25th April: CODEC v1 released.

Tasks

CODEC is a test collection that provides two tasks: document ranking and entity ranking. This dataset benchmarks a social science researcher who is attempting to find supporting entities and documents that will form the basis of a long-form essay discussing the topic from various perspectives. The researcher would explore the topic to (1) identify relevant sources and (2) understand key concepts.

Document ranking systems have to return a relevance-ranked list of documents for a given natural language query. Entity ranking systems have to return a relevance-ranked list of entities for a given natural language query. Document ranking uses CODEC’s new document corpus and entity ranking uses KILT as the entity knowledge base. For the experimental setup, we provide four pre-defined ‘standard’ folds for k-fold cross-validation to allow parameter tuning. Initial retrieval or re-ranking of provided baseline runs can both be evaluated using this dataset.
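
As an illustration of the experimental setup, here is a minimal cross-validation sketch; the fold assignment below is hypothetical, so use the released 'standard' folds for comparable results:

```python
import itertools

# Hypothetical fold assignment: the dataset releases the official
# 'standard' folds; this interleaved split is illustration only.
topic_ids = [f"q{i}" for i in range(1, 43)]   # 42 CODEC topics
folds = [topic_ids[i::4] for i in range(4)]   # 4 roughly equal folds

for held_out in range(4):
    train = list(itertools.chain(*(folds[i] for i in range(4) if i != held_out)))
    test = folds[held_out]
    # 1. tune parameters (e.g. BM25 k1/b) on the train topics
    # 2. evaluate the tuned system on the held-out test topics
    print(f"fold {held_out}: {len(train)} train / {len(test)} test topics")
```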

Complex Topics

CODEC provides 42 complex topics intended to benchmark the role of a researcher. Social science experts from history (history teacher, published history scholar), economics (FX trader, accountant, investment banker), and politics (political scientist, politician) helped to generate interesting and factually-grounded topics. The authors developed the following criteria for complex topics:
  • Open-ended essay-style
  • Natural language question
  • Multiple points of view
  • Concern multiple key entities
  • Complex
  • Requires knowledge

Each topic contains a query and narrative. The query is the question the researcher seeks to understand by exploring documents and entities, i.e., the text input posed to the search system. The narratives provide an overview of the topic (key concepts, arguments, facts, etc.) and allow non-domain-experts to understand the topic.

CODEC Topics
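
As a loading sketch, assuming the topics ship as a JSON file keyed by topic ID with query and narrative fields (the filename and schema here are assumptions; check the repository's data files for the actual layout):

```python
import json

# Hypothetical filename and schema: adjust to the released topics file.
with open("topics.json") as f:
    topics = json.load(f)

for topic_id, topic in topics.items():
    print(topic_id, topic["query"])   # the question posed to the search system
    # topic["narrative"] holds the expert-written overview of the topic
```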

Document Corpus

We use Common Crawl to curate a 729,824-document corpus with focused content across finance, history, and politics.

The corpus is released in JSON Lines format with the following fields:

  • id: Unique identifier, the MD5 hash of the document's URL.
  • url: Location of the webpage (URL).
  • title: Title of the webpage, if available.
  • contents: The text content of the webpage after removing advertising and formatting noise. New lines provide some structure between the extracted sections of the webpage while remaining easy for neural systems to process.
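
For example, a minimal sketch that streams the corpus and checks the id/URL relationship (the filename and the hex-digest encoding of the MD5 id are assumptions):

```python
import hashlib
import json

# Placeholder filename: substitute the released corpus file.
with open("codec_documents.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        # Assumed id encoding: hex digest of the MD5 hash of the URL.
        assert doc["id"] == hashlib.md5(doc["url"].encode("utf-8")).hexdigest()
        print(doc["id"], doc["title"])
        break  # inspect only the first record
```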

Document distribution:

| Domain | Document Count |
| --- | --- |
| reuters.com | 172,127 |
| forbes.com | 147,399 |
| cnbc.com | 100,842 |
| britannica.com | 93,484 |
| latimes.com | 88,486 |
| usatoday.com | 31,803 |
| investopedia.com | 21,459 |
| bbc.co.uk | 21,414 |
| history.state.gov | 9,187 |
| brookings.edu | 9,058 |
| ehistory.osu.edu | 8,805 |
| history.com | 6,749 |
| spartacus-educational.com | 3,904 |
| historynet.com | 3,811 |
| historyhit.com | 3,173 |
| ... | ... |
| TOTAL | 721,701 |

Entity KB

CODEC uses KILT’s Wikipedia KB for the entity ranking task, which is based on the 2019/08/01 Wikipedia snapshot. KILT contains 5.9M preprocessed articles and is freely available to use: link.
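
A minimal sketch for streaming the KILT knowledge source (per KILT's documentation it ships as a JSON Lines file named kilt_knowledgesource.json with wikipedia_id, wikipedia_title, and text fields; verify against the snapshot you download):

```python
import json

# kilt_knowledgesource.json is the ~30GB KILT Wikipedia snapshot (JSON Lines).
with open("kilt_knowledgesource.json") as f:
    for line in f:
        page = json.loads(line)
        # "text" is a list of passages from the Wikipedia page.
        print(page["wikipedia_id"], page["wikipedia_title"], len(page["text"]))
        break  # inspect only the first record
```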

Judgments

CODEC uses a 2-stage assessment approach that balances adequate coverage of current systems with allowing annotators to explore topics using an interactive search system. This yields 6,186 document judgments (147.3 per topic) and 11,323 entity judgments (269.6 per topic):

The raw judgments are released: link.

| Judgment | Document Ranking | Entity Ranking |
| --- | --- | --- |
| 0 | 2,353 | 7,053 |
| 1 | 2,210 | 2,241 |
| 2 | 1,207 | 1,252 |
| 3 | 416 | 777 |
| TOTAL | 6,186 | 11,323 |

Query Reformulations

During the assessment process, researchers use a live search system to explore each complex topic. We release the full 387 queries and mapped relevance judgments: link

An example of these manual query reformulations:

CODEC query reformulations

Entity-Centric Search

CODEC provides aligned document and entity judgments, which allow new entity-centric search models to be developed.

Evaluation

We provide TREC-style query-relevance files for entity ranking (link) and document ranking (link).

The official measures for both tasks include MAP, NDCG@10, and Recall@1000.
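
These measures can be computed with trec_eval or, in Python, with pytrec_eval; a minimal sketch with placeholder qrels and run:

```python
# pip install pytrec_eval
import pytrec_eval

# Placeholder qrels and run: load these from the released qrel files
# and your system's TREC-format run instead.
qrels = {"q1": {"doc_a": 2, "doc_b": 0}}
run = {"q1": {"doc_a": 10.5, "doc_b": 9.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg_cut.10", "recall.1000"})
results = evaluator.evaluate(run)

for query_id, measures in results.items():
    print(query_id, measures["map"], measures["ndcg_cut_10"], measures["recall_1000"])
```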

System Performance

Systems:

Sparse retrieval BM25 and BM25+RM3 runs use Pyserini with Porter stemming and stopword removal. We cross-validate and release the tuned parameters here.
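
For illustration, a Pyserini sketch of BM25+RM3 retrieval (the index path and parameter values are placeholders, not the released tuned settings):

```python
# pip install pyserini  (requires Java 11+)
from pyserini.search.lucene import LuceneSearcher

# Placeholder index path: first build a Lucene index over the CODEC corpus.
searcher = LuceneSearcher("indexes/codec")
searcher.set_bm25(k1=0.9, b=0.4)   # placeholder values, not the tuned parameters
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search("How has the UK's Open Banking Regulation benefited Challenger Banks?", k=1000)
for i, hit in enumerate(hits[:5], start=1):
    print(f"{i:2} {hit.docid:36} {hit.score:.4f}")
```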

ANCE is a dense retrieval model. We use an ANCE model fine-tuned on MS MARCO and Pyserini’s wrapper for easy indexing. ANCE-FirstP takes the first 512 BERT tokens of each document as its representation, while ANCE-MaxP splits each document into a maximum of four 512-token shards and takes the maximum shard score as the document’s score.
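
A dense-retrieval sketch using Pyserini's FAISS searcher (the index path and encoder checkpoint are assumptions; see Pyserini's documentation for ANCE resources):

```python
from pyserini.search.faiss import FaissSearcher, AnceQueryEncoder

# Placeholder paths: an ANCE FAISS index built over CODEC, and the
# MS MARCO fine-tuned ANCE query encoder from Hugging Face.
encoder = AnceQueryEncoder("castorini/ance-msmarco-passage")
searcher = FaissSearcher("indexes/codec-ance", encoder)

hits = searcher.search("How has the UK's Open Banking Regulation benefited Challenger Banks?", k=1000)
for hit in hits[:5]:
    print(hit.docid, hit.score)
```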

T5 is a state-of-the-art language-model re-ranker that casts text re-ranking as a sequence-to-sequence task. We use PyGaggle’s MonoT5 model, which is fine-tuned on MS MARCO, and employ a max-passage approach similar to Nogueira et al. (2020) to re-rank all initial retrieval runs.
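
A PyGaggle re-ranking sketch (the candidate texts are placeholders; in practice they come from an initial retrieval run):

```python
# pip install pygaggle
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()   # defaults to a MonoT5 checkpoint fine-tuned on MS MARCO

query = Query("How has the UK's Open Banking Regulation benefited Challenger Banks?")
# Placeholder candidates: in practice, the documents (or their passages)
# come from a first-stage run such as BM25+RM3.
texts = [
    Text("Open Banking requires large UK banks to share customer data ...", {"docid": "d1"}, 0),
    Text("Challenger banks have grown their market share since 2018 ...", {"docid": "d2"}, 0),
]

for result in reranker.rerank(query, texts):
    print(result.metadata["docid"], result.score)
```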

Document ranking:

| System | MAP | NDCG@10 | Recall@1000 |
| --- | --- | --- | --- |
| BM25 | 0.213 | 0.322 | 0.762 |
| BM25+RM3 | 0.233 | 0.327 | 0.800 |
| ANCE-MaxP | 0.186 | 0.363 | 0.689 |
| BM25+T5 | 0.340 | 0.468 | 0.762 |
| BM25+RM3+T5 | 0.346 | 0.472 | 0.800 |
| ANCE-MaxP+T5 | 0.316 | 0.481 | 0.689 |

Entity ranking:

| System | MAP | NDCG@10 | Recall@1000 |
| --- | --- | --- | --- |
| BM25 | 0.181 | 0.397 | 0.615 |
| BM25+RM3 | 0.209 | 0.412 | 0.685 |
| ANCE-FirstP | 0.076 | 0.269 | 0.340 |
| BM25+T5 | 0.172 | 0.361 | 0.615 |
| BM25+RM3+T5 | 0.179 | 0.362 | 0.685 |
| ANCE-FirstP+T5 | 0.136 | 0.407 | 0.340 |

Future Work

We envision CODEC as an evolving collection, with additional judgments and tasks added in the future, e.g., knowledge-grounded generation, passage ranking, and entity linking. The topics could also be enhanced with facet and semantic annotations to support research on tail and non-KG entities.

Please suggest any future extensions or bug fixes on GitHub or by email ([email protected]).