Collecting statistics from AM dataset (#100)

* added statistics: Relation_argument_outer_token_distance.md
* minor fix
* micro fix
* micro fix
* micro fix
* added commands and collapsed all histograms
* minor edits
* added text_length_tokens.md and minor edit
* micro fix
* add span_lengths_tokens.md
* edited texts in markdowns
* micro fix
* added relation_argument_outer_token_distance_per_label.md
* micro fix
* deleted ~~relation_argument_outer_token_distance.md~~
* moved collected statistics content to pie/abstrct/readme.md
* minor fix
* edited text in abstrct/readme
* minor edit abstrct/readme
* removed abstrct histograms from statistics folder
* moved content and files to aae2.readme, and removed from statistics folder
* minor fix
* minor edit
* moved content and files to argmicro/readme.md
* moved content and files to cdcp/readme.md
* minor fix
* moved content and files to scidtb_argmin/readme.md
* minor fix
* moved content and files to sciarg/readme.md
* deleted AM_statistics folder
* minor edit
* fixed typos
* minor adjustments
* minor adjustments
* improve usage examples
* add usage example to argmicro
* fix argmicro usage example
* add cdcp usage example
* add usage example to scidtb_argmin minor and fix
* add details to document converter
* improve usage example for sciarg
* move usage example for sciarg

---------

Co-authored-by: Arne Binder <[email protected]>
ArneBinder · Nov 7, 2024 · e68f60d
1 parent 65ed8a1

Showing 42 changed files with 1,041 additions and 47 deletions.
diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md
@@ -4,6 +4,29 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t
 
 Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).
 
+### Usage
+
+```python
+from pie_datasets import load_dataset
+from pie_datasets.builders.brat import BratDocumentWithMergedSpans
+from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions
+
+# load default version
+dataset = load_dataset("pie/aae2")
+assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students')))
+```
+
 ### Dataset Summary
 
 Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
@@ -28,17 +51,6 @@ The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithM
 
 See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).
 
-### Usage
-
-```python
-from pie_datasets import load_dataset, builders
-
-# load default version
-datasets = load_dataset("pie/aae2")
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
-```
-
 ### Data Splits
 
 | Statistics                                                       |                      Train |                     Test |
@@ -109,7 +121,7 @@ The dataset provides document converters for the following target document types
 See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
 definitions.
 
-#### Label Statistics after Document Conversion
+#### Relation Label Statistics after Document Conversion
 
 When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`,
 we apply a relation-conversion method (see above) that changes the label counts for the relations, as follows:
@@ -129,6 +141,154 @@ we apply a relation-conversion method (see above) that changes the label counts
 | support: `supports` |  5958 |     89.3 % |
 | attack: `attacks`   |   715 |     10.7 % |
 
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=aae2_base metric=METRIC
+```
+
+where `METRIC` is the name of one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config for this dataset at `configs/dataset/aae2_base.yaml` within the repo directory:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/aae2
+  revision: 1015ee38bd8a36549b344008f7a49af72956a7fe
+```
+
+For token-based metrics, this uses the `bert-base-uncased` tokenizer loaded via `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+For relation-label statistics, we collect those from the default relation conversion method, i.e., `connect_first`, resulting in three distinct relation labels.
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+The collapsible sections below contain histograms showing the distribution of these relation distances (x-axis: outer token distance; y-axis: count).
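A minimal sketch of the outer distance as defined in this section. It assumes that each argumentative unit is given as a hypothetical `(start, end)` pair of token offsets with an exclusive `end`; the helper `outer_token_distance` and the offsets are illustrative only and are not taken from `evaluate_documents.py`.

```python
from typing import Tuple

# Assumption: a span is (start, end) in token offsets, end exclusive.
Span = Tuple[int, int]


def outer_token_distance(head: Span, tail: Span) -> int:
    """Number of tokens from the first token of the earlier unit
    to the last token of the later unit (inclusive)."""
    start = min(head[0], tail[0])
    end = max(head[1], tail[1])
    return end - start


# Made-up offsets: a claim followed by the premise that supports it.
claim = (10, 18)
premise = (20, 31)
print(outer_token_distance(premise, claim))  # 21
```

Because the distance uses the minimum start and maximum end, it does not depend on which argument is the head and which is the tail.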
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances
+```
+
+</details>
+
+##### train (322 documents)
+
+|                   |  len | max |    mean | min |     std |
+| :---------------- | ---: | --: | ------: | --: | ------: |
+| ALL               | 9002 | 514 | 102.582 |   9 |   93.76 |
+| attacks           |  810 | 442 | 127.622 |  10 | 109.283 |
+| semantically_same |  552 | 514 | 301.638 |  25 |  73.756 |
+| supports          | 7640 | 493 |  85.545 |   9 |  74.023 |
+
+<details>
+  <summary>Histogram (split: train, 322 documents)</summary>
+
+![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png)
+
+</details>
+
+##### test (80 documents)
+
+|                   |  len | max |    mean | min |    std |
+| :---------------- | ---: | --: | ------: | --: | -----: |
+| ALL               | 2372 | 442 | 100.711 |  10 | 92.698 |
+| attacks           |  184 | 402 | 115.891 |  12 | 98.751 |
+| semantically_same |  146 | 442 | 299.671 |  34 | 72.921 |
+| supports          | 2042 | 437 |  85.118 |  10 | 75.023 |
+
+<details>
+  <summary>Histogram (split: test, 80 documents)</summary>
+
+![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is measured from the first token of an argumentative unit to the last token of that unit.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+The collapsible sections below contain histograms showing the distribution of these span lengths (x-axis: number of tokens; y-axis: count).
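A minimal sketch of how a character-level span can be measured in tokens. It uses a simple regex tokenizer as a stand-in for `bert-base-uncased` (an assumption made so the example runs offline, so the counts will differ from the reported statistics); the helper names and the sentence are made up for illustration.

```python
import re
from typing import List, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Stand-in tokenizer: character (start, end) offsets of words and punctuation marks."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_length_in_tokens(text: str, start: int, end: int) -> int:
    """Count the tokens that lie fully inside the character span [start, end)."""
    return sum(1 for s, e in token_offsets(text) if s >= start and e <= end)


# Made-up sentence; the span excludes the "shell" word "Hence," and the final period.
text = "Hence, team work teaches children how to cooperate."
start = text.index("team")
end = text.index(".")
print(span_length_in_tokens(text, start, end))  # 7
```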
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics |  train |   test |
+| :--------- | -----: | -----: |
+| no. doc    |    322 |     80 |
+| len        |   4823 |   1266 |
+| mean       | 17.157 | 16.317 |
+| std        |  8.079 |  7.953 |
+| min        |      3 |      3 |
+| max        |     75 |     50 |
+
+<details>
+  <summary>Histogram (split: train, 322 documents)</summary>
+
+![slt_aae2_train.png](img%2Fslt_aae2_train.png)
+
+</details>
+<details>
+  <summary>Histogram (split: test, 80 documents)</summary>
+
+![slt_aae2_test.png](img%2Fslt_aae2_test.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is measured from the first token of the document to the last one.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+The collapsible sections below contain histograms showing the distribution of these token lengths (x-axis: number of tokens; y-axis: count).
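A minimal sketch of how the summary rows (*mean*, *std*, *min*, *max*) can be derived from per-document token counts. The helper `summarize` and the toy counts are hypothetical, and the population standard deviation is an assumption; the source does not say whether the script reports the population or the sample variant.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarize(values: Sequence[int]) -> Dict[str, float]:
    """Summary statistics in the same shape as the tables in this README."""
    return {
        "len": len(values),
        "mean": round(mean(values), 3),
        # Assumption: population standard deviation; swap in statistics.stdev
        # for the sample variant.
        "std": round(pstdev(values), 3),
        "min": min(values),
        "max": max(values),
    }


# Toy token counts, not taken from the dataset.
token_counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(summarize(token_counts))
```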
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics |   train |   test |
+| :--------- | ------: | -----: |
+| no. doc    |     322 |     80 |
+| mean       | 377.686 |  378.4 |
+| std        |  64.534 | 66.054 |
+| min        |     236 |    269 |
+| max        |     580 |    532 |
+
+<details>
+  <summary>Histogram (split: train, 322 documents)</summary>
+
+![tl_aae2_train.png](img%2Ftl_aae2_train.png)
+
+</details>
+<details>
+  <summary>Histogram (split: test, 80 documents)</summary>
+
+![tl_aae2_test.png](img%2Ftl_aae2_test.png)
+
+</details>
+
 ## Dataset Creation
 
 ### Curation Rationale

diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png
diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png
diff --git a/dataset_builders/pie/aae2/img/slt_aae2_test.png b/dataset_builders/pie/aae2/img/slt_aae2_test.png
diff --git a/dataset_builders/pie/aae2/img/slt_aae2_train.png b/dataset_builders/pie/aae2/img/slt_aae2_train.png
diff --git a/dataset_builders/pie/aae2/img/tl_aae2_test.png b/dataset_builders/pie/aae2/img/tl_aae2_test.png
diff --git a/dataset_builders/pie/aae2/img/tl_aae2_train.png b/dataset_builders/pie/aae2/img/tl_aae2_train.png