- Overview
- Paper
- Dataset
- Change Log
- Tasks
- Complex Topics
- Document Corpus
- Entity KB
- Judgments
- Query Reformulations
- Entity-Centric Search
- Evaluation
- System Performance
- Future Work
CODEC is a new document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers across history, economics, and politics. For example, ‘How has the UK’s Open Banking Regulation benefited Challenger Banks?’
CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations, including entity links. It includes expert judgments on 6,186 documents (147.3 per topic) and 11,323 entities (269.6 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations (9.2 per topic), providing data for query performance prediction and automatic rewriting evaluation.
This work will be presented at SIGIR 2022: https://arxiv.org/abs/2205.04546
Correct citation:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffrey},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
CODEC provides 42 topics for document and entity retrieval:
CODEC full document corpus is available for research purposes: FULL.
CODEC entity KB is KILT's snapshot of Wikipedia (~30GB).
Colab demo showing indexing, query reformulations, entity links, and evaluation:
Dataset is available via ir-datasets.
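The snippet below is a minimal sketch of loading the collection through the ir-datasets Python API; the dataset identifier "codec" and the exact record field names are assumptions to verify against the ir-datasets catalogue.

```python
# Minimal sketch: loading CODEC via the ir-datasets Python package
# (pip install ir-datasets). The dataset ID "codec" and the record field
# names are assumptions; check the ir-datasets catalogue for exact values.
import ir_datasets

dataset = ir_datasets.load("codec")

for query in dataset.queries_iter():
    print(query)   # topic record (query_id, query text, narrative)
    break

for qrel in dataset.qrels_iter():
    print(qrel)    # relevance judgment (query_id, doc_id, relevance)
    break

for doc in dataset.docs_iter():
    print(doc)     # corpus document (doc_id, url, title, contents)
    break
```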
Major dataset changes that existing users should be aware of:
- 25th April: CODEC v1 released.
CODEC is a test collection that provides two tasks: document ranking and entity ranking. The dataset models a social science researcher attempting to find supporting entities and documents that will form the basis of a long-form essay discussing the topic from various perspectives. The researcher would explore the topic to (1) identify relevant sources and (2) understand key concepts.
Document ranking systems have to return a relevance-ranked list of documents for a given natural language query, and entity ranking systems have to return a relevance-ranked list of entities. Document ranking uses CODEC’s new document corpus, and entity ranking uses KILT as the entity knowledge base. For the experimental setup, we provide four pre-defined ‘standard’ folds for k-fold cross-validation to allow parameter tuning. Both initial retrieval and re-ranking of the provided baseline runs can be evaluated using this dataset.
CODEC provides 42 complex topics intended to benchmark the role of a researcher. Social science experts from history (history teacher, published history scholar), economics (FX trader, accountant, investment banker), and politics (political scientist, politician) helped to generate interesting and factually grounded topics. The authors developed the following criteria for complex topics:
- Open-ended essay-style
- Natural language question
- Multiple points of view
- Concern multiple key entities
- Complex
- Requires knowledge
Each topic contains a query and narrative. The query is the question the researcher seeks to understand by exploring documents and entities, i.e., the text input posed to the search system. The narratives provide an overview of the topic (key concepts, arguments, facts, etc.) and allow non-domain-experts to understand the topic.
We use Common Crawl to curate a 729,824 document corpus with focused content across finance, history, and politics.
The corpus is released in JSON Lines (jsonl) format with the following fields:
- id: Unique identifier, the MD5 hash of the document URL.
- url: Location of the webpage (URL).
- title: Title of the webpage, if available.
- contents: Text content of the webpage after removing unnecessary advertising and formatting. New lines provide some structure between the extracted sections of the webpage while remaining easy for neural systems to process.
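As a rough illustration of the format, the sketch below streams the corpus file line by line; the file name is a placeholder for wherever the downloaded corpus is stored.

```python
# Sketch: reading the CODEC corpus in JSON Lines format. The file name
# "codec_documents.jsonl" is a placeholder for the downloaded corpus file.
import json

with open("codec_documents.jsonl", encoding="utf-8") as corpus:
    for line in corpus:
        doc = json.loads(line)
        print(doc["id"], doc["url"], doc["title"])
        print(doc["contents"][:200])  # first 200 characters of the cleaned text
        break
```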
Document distribution:
Domain | Document Count |
---|---|
reuters.com | 172,127 |
forbes.com | 147,399 |
cnbc.com | 100,842 |
britannica.com | 93,484 |
latimes.com | 88,486 |
usatoday.com | 31,803 |
investopedia.com | 21,459 |
bbc.co.uk | 21,414 |
history.state.gov | 9,187 |
brookings.edu | 9,058 |
ehistory.osu.edu | 8,805 |
history.com | 6,749 |
spartacus-educational.com | 3,904 |
historynet.com | 3,811 |
historyhit.com | 3,173 |
... | ... |
TOTAL | 729,824 |
CODEC uses KILT’s Wikipedia KB for the entity ranking task, which is based on the 2019/08/01 Wikipedia snapshot. KILT contains 5.9M preprocessed articles and is freely available to use: link.
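A minimal sketch of streaming the KB is below, assuming the standard KILT knowledge-source format (one JSON object per line with wikipedia_id, wikipedia_title, and text as a list of paragraphs); verify the file and field names against the downloaded dump.

```python
# Sketch: streaming KILT's Wikipedia snapshot. The file name and field names
# follow the published KILT knowledge-source format and should be verified
# against the downloaded dump.
import json

with open("kilt_knowledgesource.json", encoding="utf-8") as kb:
    for line in kb:
        page = json.loads(line)
        print(page["wikipedia_id"], page["wikipedia_title"])
        print(page["text"][0])  # first paragraph of the article
        break
```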
CODEC uses a 2-stage assessment approach to balance adequate coverage of current systems with allowing annotators to explore topics using an interactive search system. This creates 6,186 document judgments (147.3 per topic) and 11,323 entity judgments (269.6 per topic):
The raw judgments are released: link.
Judgment | Document Ranking | Entity Ranking |
---|---|---|
0 | 2,353 | 7,053 |
1 | 2,210 | 2,241 |
2 | 1,207 | 1,252 |
3 | 416 | 777 |
TOTAL | 6,186 | 11,323 |
During the assessment process, researchers use a live search system to explore the complex topic. We release the full 387 queries and mapped relevance judgments: link
An example of these manual query reformulations:
CODEC provides aligned document and entity judgments, which allows new entity-centric search models to be developed.
We provide TREC-style query-relevance (qrels) files for entity ranking (link) and document ranking (link).
The official measures for both tasks include MAP, NDCG@10, and Recall@1000.
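For example, a TREC-format run can be scored against the qrels with the ir-measures package; the file names below are placeholders.

```python
# Sketch: scoring a TREC-format run against the CODEC qrels using ir-measures
# (pip install ir-measures). File names are placeholders.
import ir_measures
from ir_measures import MAP, nDCG, R

qrels = list(ir_measures.read_trec_qrels("codec_document_qrels.txt"))
run = list(ir_measures.read_trec_run("bm25_document_run.txt"))

# Official CODEC measures: MAP, NDCG@10, and Recall@1000.
print(ir_measures.calc_aggregate([MAP, nDCG@10, R@1000], qrels, run))
```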
Systems:
Sparse retrieval BM25 and BM25+RM3 runs use Pyserini with Porter stemming and stopword removal. We cross-validate and release the tuned parameters here.
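A minimal Pyserini sketch of these sparse runs is shown below; the index path and parameter values are illustrative rather than the released tuned settings.

```python
# Sketch: BM25 and BM25+RM3 retrieval over a Lucene index of the CODEC corpus
# with Pyserini. The index path and parameters are illustrative, not the
# released tuned settings.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/codec-lucene")
searcher.set_bm25(k1=0.9, b=0.4)
# Uncomment for BM25+RM3 pseudo-relevance feedback:
# searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search("How has the UK's Open Banking Regulation benefited Challenger Banks?", k=1000)
for rank, hit in enumerate(hits[:5], start=1):
    print(rank, hit.docid, round(hit.score, 4))
```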
ANCE is a dense retrieval model. We use an ANCE model fine-tuned on MS MARCO and Pyserini’s wrapper for easy indexing. ANCE+FirstP represents each document by its first 512 BERT tokens, while ANCE+MaxP splits the document into a maximum of four 512-token shards and takes the maximum shard score as the document score.
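A rough sketch of dense retrieval with Pyserini's FAISS searcher, assuming a pre-built ANCE index of the corpus (the index path is a placeholder):

```python
# Sketch: ANCE dense retrieval with Pyserini's FAISS wrapper. The index path is
# a placeholder; the query encoder is the public MS MARCO ANCE checkpoint.
from pyserini.search.faiss import FaissSearcher, AnceQueryEncoder

encoder = AnceQueryEncoder("castorini/ance-msmarco-passage")
searcher = FaissSearcher("indexes/codec-ance-firstp", encoder)

hits = searcher.search("How has the UK's Open Banking Regulation benefited Challenger Banks?", k=1000)
for hit in hits[:5]:
    print(hit.docid, round(hit.score, 4))
```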
T5 is a state-of-the-art language model re-ranker that casts text re-ranking as a sequence-to-sequence task. We use Pygaggle’s MonoT5 model, which is fine-tuned on MS MARCO, and employ a max-passage approach similar to Nogueira et al. (2020) to re-rank all initial retrieval runs.
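The sketch below shows the re-ranking step with PyGaggle's MonoT5; the candidate passages are toy examples, and in practice each document is split into passages with the document score taken as the maximum passage score.

```python
# Sketch: MonoT5 re-ranking with PyGaggle. MonoT5() loads the MS MARCO
# fine-tuned checkpoint by default; the passages below are toy examples.
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()
query = Query("How has the UK's Open Banking Regulation benefited Challenger Banks?")

# Candidate passages from first-stage retrieval (document id kept in metadata).
passages = [
    Text("Open Banking rules require UK banks to share customer data ...", {"docid": "doc-a"}),
    Text("Challenger banks have grown their customer base since 2018 ...", {"docid": "doc-b"}),
]

reranked = reranker.rerank(query, passages)
# Max-passage: a document's final score is the maximum over its passage scores.
for text in sorted(reranked, key=lambda t: t.score, reverse=True):
    print(text.metadata["docid"], round(text.score, 4))
```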
Document Ranking:
System | MAP | NDCG@10 | Recall@1000 |
---|---|---|---|
BM25 | 0.213 | 0.322 | 0.762 |
BM25+RM3 | 0.233 | 0.327 | 0.800 |
ANCE-MaxP | 0.186 | 0.363 | 0.689 |
BM25+T5 | 0.340 | 0.468 | 0.762 |
BM25+RM3+T5 | 0.346 | 0.472 | 0.800 |
ANCE-MaxP+T5 | 0.316 | 0.481 | 0.689 |
Entity Ranking:
System | MAP | NDCG@10 | Recall@1000 |
---|---|---|---|
BM25 | 0.181 | 0.397 | 0.615 |
BM25+RM3 | 0.209 | 0.412 | 0.685 |
ANCE-FirstP | 0.076 | 0.269 | 0.340 |
BM25+T5 | 0.172 | 0.361 | 0.615 |
BM25+RM3+T5 | 0.179 | 0.362 | 0.685 |
ANCE-FirstP+T5 | 0.136 | 0.407 | 0.340 |
We envision CODEC to be an evolving collection, with additional judgments and tasks added in the future, e.g., knowledge-grounded generation, passage ranking, and entity linking. The topics could also be further enhanced with facet annotations and semantic annotations to support research on tail and non-KG entities.
Please suggest any future extensions or bug fixes on GitHub or via email ([email protected]).