Collecting statistics from AM dataset (#100)
* added statistics: Relation_argument_outer_token_distance.md

* minor fix

* micro fix

* micro fix

* micro fix

* added commands and collapsed all histograms

* minor edits

* added text_length_tokens.md and
minor edit

* micro fix

* add span_lengths_tokens.md

* edited texts in markdowns

* micro fix

* added relation_argument_outer_token_distance_per_label.md

* micro fix

* deleted ~~relation_argument_outer_token_distance.md~~

* moved collected statistics content to pie/abstrct/readme.md

* minor fix

* edited text in abstrct/readme

* minor edit abstrct/readme

* removed abstrct histograms from statistics folder

* moved content and files to aae2.readme, and removed from statistics folder

* minor fix

* minor edit

* moved content and files to argmicro/readme.md

* moved content and files to cdcp/readme.md

* minor fix

* moved content and files to scidtb_argmin/readme.md

* minor fix

* moved content and files to sciarg/readme.md

* deleted AM_statistics folder

* minor edit

* fixed typos

* minor adjustments

* minor adjustments

* improve usage examples

* add usage example to argmicro

* fix argmicro usage example

* add cdcp usage example

* add usage example to scidtb_argmin minor and fix

* add details to document converter

* improve usage example for sciarg

* move usage example for sciarg

---------

Co-authored-by: Arne Binder <[email protected]>
idalr and ArneBinder authored Nov 7, 2024
1 parent 65ed8a1 commit e68f60d
Showing 42 changed files with 1,041 additions and 47 deletions.
184 changes: 172 additions & 12 deletions dataset_builders/pie/aae2/README.md
Expand Up @@ -4,6 +4,29 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t

Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).

### Usage

```python
from pie_datasets import load_dataset
from pie_datasets.builders.brat import BratDocumentWithMergedSpans
from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions

# load default version
dataset = load_dataset("pie/aae2")
assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)

# if required, normalize the document type (see section Document Converters below)
dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)

# get first relation in the first document
doc = dataset_converted["train"][0]
print(doc.binary_relations[0])
# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0)
print(doc.binary_relations[0].resolve())
# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students')))
```

### Dataset Summary

Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
Expand All @@ -28,17 +51,6 @@ The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithM

See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).

### Usage

```python
from pie_datasets import load_dataset, builders

# load default version
datasets = load_dataset("pie/aae2")
doc = datasets["train"][0]
assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
```

### Data Splits

| Statistics | Train | Test |
Expand Down Expand Up @@ -109,7 +121,7 @@ The dataset provides document converters for the following target document types
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

#### Label Statistics after Document Conversion
#### Relation Label Statistics after Document Conversion

When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`,
we apply a relation-conversion method (see above) that changes the label counts for the relations, as follows:
Expand All @@ -129,6 +141,154 @@ we apply a relation-conversion method (see above) that changes the label counts
| support: `supports` | 5958 | 89.3 % |
| attack: `attacks` | 715 | 10.7 % |

### Collected Statistics after Document Conversion

We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
After checking out that code, the statistics and plots can be generated by the command:

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=METRIC
```

where `METRIC` is the name of one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).

This also requires the following dataset config for this dataset in `configs/dataset/aae2_base.yaml` within the repo directory:

```yaml
_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/aae2
  revision: 1015ee38bd8a36549b344008f7a49af72956a7fe
```

For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).

For relation-label statistics, we collect those from the default relation conversion method, i.e., `connect_first`, resulting in three distinct relation labels.

#### Relation argument (outer) token distance per label

The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.

We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
We also present histograms in the collapsible sections, showing the distribution of these relation distances (x-axis: distance in tokens; y-axis: counts).
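
As a minimal sketch of the outer distance described above, assuming each relation argument is given as a token-index pair `(start, end)` with an exclusive `end`. The name `outer_token_distance` and the example indices are hypothetical and are not taken from the metric implementation:

```python
from typing import Tuple

TokenSpan = Tuple[int, int]  # (start, end) token indices, end exclusive (assumption)


def outer_token_distance(head: TokenSpan, tail: TokenSpan) -> int:
    """Distance from the first token of the earlier argument to the last token of the later one."""
    return max(head[1], tail[1]) - min(head[0], tail[0])


# Hypothetical example: a premise at tokens 20..29 that supports a claim at tokens 10..14.
distance = outer_token_distance(head=(20, 30), tail=(10, 15))
print(distance)  # 20
```

Because the measure only looks at the outermost boundaries, it does not depend on which argument is the head and which is the tail.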

<details>
<summary>Command</summary>

```
python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances
```

</details>

##### train (322 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | ------: |
| ALL | 9002 | 514 | 102.582 | 9 | 93.76 |
| attacks | 810 | 442 | 127.622 | 10 | 109.283 |
| semantically_same | 552 | 514 | 301.638 | 25 | 73.756 |
| supports | 7640 | 493 | 85.545 | 9 | 74.023 |
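
As a quick consistency check on the train table above, the `ALL` row can be recovered from the per-label rows: the counts add up, and the overall mean is the count-weighted average of the per-label means. This is only an illustrative sketch; the variable names are hypothetical and the input numbers are copied from the table (so they carry its rounding):

```python
# (len, mean) per relation label, copied from the train table
per_label = {
    "attacks": (810, 127.622),
    "semantically_same": (552, 301.638),
    "supports": (7640, 85.545),
}

# Total number of relations across all labels
total_len = sum(n for n, _ in per_label.values())

# Count-weighted average of the per-label means
pooled_mean = sum(n * mean for n, mean in per_label.values()) / total_len

print(total_len)              # 9002
print(round(pooled_mean, 3))  # 102.582
```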

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png)

</details>

##### test (80 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | -----: |
| ALL | 2372 | 442 | 100.711 | 10 | 92.698 |
| attacks | 184 | 402 | 115.891 | 12 | 98.751 |
| semantically_same | 146 | 442 | 299.671 | 34 | 72.921 |
| supports | 2042 | 437 | 85.118 | 10 | 75.023 |

<details>
<summary>Histogram (split: test, 80 documents)</summary>

![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png)

</details>

#### Span lengths (tokens)

The span length is measured from the first token of an argumentative unit to the last token of that unit.

We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
We also present histograms in the collapsible sections, showing the distribution of these span lengths (x-axis: number of tokens; y-axis: counts).
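
To illustrate how a character-level span translates into a token count, here is a minimal sketch. It uses a plain whitespace tokenizer as a stand-in (an assumption made so that the example runs without downloading a model); the statistics in this section use `bert-base-uncased`, which splits text differently and will give different counts. The helper names and the example sentence are hypothetical:

```python
import re
from typing import List, Tuple


def whitespace_tokens(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def span_length_in_tokens(text: str, span_start: int, span_end: int) -> int:
    """Count tokens that lie fully inside the character span [span_start, span_end)."""
    return sum(
        1 for start, end in whitespace_tokens(text)
        if start >= span_start and end <= span_end
    )


text = "Hence, teamwork teaches children how to get along with others."
# The span skips the "shell" word "Hence," and covers the rest of the sentence.
span_start = text.index("teamwork")
span_end = len(text)
n_tokens = span_length_in_tokens(text, span_start, span_end)
print(n_tokens)  # 9
```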

<details>
<summary>Command</summary>

```
python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens
```

</details>

| statistics | train | test |
| :--------- | -----: | -----: |
| no. doc | 322 | 80 |
| len | 4823 | 1266 |
| mean | 17.157 | 16.317 |
| std | 8.079 | 7.953 |
| min | 3 | 3 |
| max | 75 | 50 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![slt_aae2_train.png](img%2Fslt_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![slt_aae2_test.png](img%2Fslt_aae2_test.png)

</details>

#### Token length (tokens)

The token length is measured from the first token of the document to the last one.

We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
We also present histograms in the collapsible sections, showing the distribution of these token lengths (x-axis: number of tokens; y-axis: counts).
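
The summary columns used throughout this section can be sketched as follows for a list of per-document token counts. This is an illustration under assumptions: the function name `summarize` and the input values are hypothetical, and the population standard deviation (`statistics.pstdev`) is chosen here because the source does not state which variant the script reports:

```python
import statistics
from typing import Dict, List


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Compute the summary statistics reported in the tables (len, mean, std, min, max)."""
    return {
        "len": len(lengths),
        "mean": statistics.mean(lengths),
        "std": statistics.pstdev(lengths),  # population std (assumption)
        "min": min(lengths),
        "max": max(lengths),
    }


# Hypothetical token counts for eight short documents
token_counts = [2, 4, 4, 4, 5, 5, 7, 9]
stats = summarize(token_counts)
print(stats)
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample standard deviation instead, which is slightly larger for small inputs.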

<details>
<summary>Command</summary>

```
python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens
```

</details>

| statistics | train | test |
| :--------- | ------: | -----: |
| no. doc | 322 | 80 |
| mean | 377.686 | 378.4 |
| std | 64.534 | 66.054 |
| min | 236 | 269 |
| max | 580 | 532 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![tl_aae2_train.png](img%2Ftl_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![tl_aae2_test.png](img%2Ftl_aae2_test.png)

</details>

## Dataset Creation

### Curation Rationale
Binary file added dataset_builders/pie/aae2/img/slt_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/slt_aae2_train.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_train.png