Add data: AMI DialSum Corpus
gcunhase committed Dec 4, 2019
1 parent af79b5f commit 4c9feb8
Showing 7,832 changed files with 70,424 additions and 7 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,6 +1,6 @@
.idea/*
*.pyc
.DS_Store
-data/ami_*/*
+data/ami-*/*
data/*/.DS_Store
venv*/*
13 changes: 7 additions & 6 deletions README.md
@@ -5,7 +5,7 @@
* Transforms into CNN-DailyMail News dataset (`.story` files with article and highlight in it)

### Contents
-[Requirements](#requirements)[About AMI Meeting Corpus](#ami-corpus)[How to Use](#how-to-use)[How to Cite](#acknowledgement)
+[Requirements](#requirements)[About AMI Meeting Corpus](#ami-corpus)[AMI DialSum Corpus](#ami-dialsum-meeting-corpus)[How to Use](#how-to-use)[How to Cite](#acknowledgement)

## Requirements
Tested on Python 3.6+, Ubuntu 16.04, Mac OS
@@ -86,16 +86,17 @@ python main_obtain_meeting2summary_data.py --summary_type abstractive
* Return all the collected words as a paragraph
* Output: `data/ami-summary/extractive/`

+## AMI DialSum Meeting Corpus
+* [DialSum](https://github.com/MiuLab/DialSum): modified version of the AMI Meeting Dataset
+* Use script `ami_dialsum_meeting_story.py`:
+* This script takes 2 text files (`in` and `sum`) and formats it into a series of `.story` files compatible with the CNN/DM format
+* Each line in file `in` corresponds to a meeting transcript with summary present in the same line in file `sum`/.
+
## Notes
* XML reader in Python:
* Minidom vs Element Tree: [Reading XML files in Python](http://stackabuse.com/reading-and-writing-xml-files-in-python/)
* Minidom: XML parser for Python

-* Script `ami_dialsum_meeting_story.py`:
-* This script takes 2 text files (`in` and `sum`) and formats it into a series of `.story` files compatible with the CNN/DM format
-* Each line in file `in` corresponds to a meeting transcript with summary present in the same line in file `sum`/.
-* Implemented to deal with a modified version of the AMI Meeting Dataset called [DialSum](https://github.com/MiuLab/DialSum).
-
* TODO
* Overlapping meeting transcript
* Decision abstract
400 changes: 400 additions & 0 deletions data/ami_dialsum_corpus/test/in

Large diffs are not rendered by default.
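
The README text in this commit describes `ami_dialsum_meeting_story.py` as pairing line *i* of file `in` with line *i* of file `sum` and writing one `.story` file per pair. The script itself is not shown in this view, so the following is only a minimal sketch of that pairing under stated assumptions: the function name `write_story_files`, the `meeting_NNNN.story` file naming, and the single `@highlight` block per story are all hypothetical and may differ from what the repository actually does.

```python
from pathlib import Path


def write_story_files(in_path, sum_path, out_dir, prefix="meeting"):
    """Pair line i of `in_path` with line i of `sum_path`; write one .story file per pair.

    Hypothetical helper: name, arguments and output naming are assumptions,
    not the repository's actual implementation.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # One meeting transcript per line in `in`, one summary per line in `sum`.
    with open(in_path, encoding="utf-8") as f_in:
        transcripts = [line.strip() for line in f_in]
    with open(sum_path, encoding="utf-8") as f_sum:
        summaries = [line.strip() for line in f_sum]

    if len(transcripts) != len(summaries):
        raise ValueError(
            f"line count mismatch: {len(transcripts)} transcripts "
            f"vs {len(summaries)} summaries"
        )

    written = []
    for i, (transcript, summary) in enumerate(zip(transcripts, summaries)):
        story_path = out_dir / f"{prefix}_{i:04d}.story"
        # CNN/DM-style layout: article text, then an "@highlight" marker
        # followed by the summary, separated by blank lines.
        story_path.write_text(
            f"{transcript}\n\n@highlight\n\n{summary}\n", encoding="utf-8"
        )
        written.append(story_path)
    return written
```

Under these assumptions, a call such as `write_story_files("data/ami_dialsum_corpus/test/in", "data/ami_dialsum_corpus/test/sum", "out/test")` would produce one `.story` file per line of the test split. The path of the `sum` file is inferred from the `in` path listed in this commit and is not confirmed by the visible diff.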
