GitHub - singhalprerana/SST_data_extraction: To extract labeled data from Stanford Sentiment treebank

Files in this repository:
.DS_Store
README.txt
SOStr.txt
STree.txt
datasetSentences.txt
datasetSplit.txt
dictionary.txt
original_rt_snippets.txt
sentiment_labels.txt
sst5_dev.csv
sst5_test.csv
sst5_train_phrases.csv
sst5_train_sentences.csv
sst_dev.csv
sst_test.csv
sst_train_phrases.csv
sst_train_sentences.csv
xtract_sst.py

Stanford Sentiment Treebank V1.0

This is the dataset of the paper:

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts
Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

If you use this dataset in your research, please cite the above paper.

@incollection{SocherEtAl2013:RNTN,
title = {{Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank}},
author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang and Christopher Manning and Andrew Ng and Christopher Potts},
booktitle = {{EMNLP}},
year = {2013}
}

This dataset includes:
1. original_rt_snippets.txt contains 10,605 processed snippets from the original pool of Rotten Tomatoes HTML files. Please note that some snippets may contain multiple sentences.

2. dictionary.txt contains all phrases and their IDs, separated by a vertical bar (|).

3. sentiment_labels.txt contains all phrase IDs and the corresponding sentiment labels, separated by a vertical bar (|).
Note that you can recover the 5 classes by mapping the positivity probability using the following cut-offs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, very positive, respectively.
Please note that phrase IDs and sentence IDs are not the same.

4. SOStr.txt and STree.txt encode the structure of the parse trees.
STree.txt encodes the trees in a parent pointer format. Each line corresponds to one sentence in the datasetSentences.txt file. If you are not familiar with this format, the Matlab code of this paper shows how to read it.

5. datasetSentences.txt contains the sentence index and the sentence string, separated by a tab. These are the sentences of the train/dev/test sets.

6. datasetSplit.txt contains the sentence index (corresponding to the index in datasetSentences.txt file) followed by the set label separated by a comma:
	1 = train
	2 = test
	3 = dev
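
The cut-offs in item 3 can be applied with a few lines of Python. The sketch below is not the repository's xtract_sst.py; the function names and the sample lines are hypothetical. It assumes that sentiment_labels.txt may start with a non-numeric header line (such lines are skipped) and that a phrase in dictionary.txt may itself contain a vertical bar (so the split is made on the last one).

```python
# Class names in the order of the README's cut-offs.
CLASS_NAMES = ["very negative", "negative", "neutral", "positive", "very positive"]


def sentiment_class(p):
    """Map a positivity probability to a class index 0..4.

    Intervals: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0].
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability out of range: %r" % p)
    for index, upper in enumerate((0.2, 0.4, 0.6, 0.8)):
        if p <= upper:
            return index
    return 4


def parse_dictionary(lines):
    """Parse 'phrase|id' lines into {id: phrase}."""
    phrases = {}
    for line in lines:
        # Split on the last '|' (assumption: a phrase may contain one).
        phrase, sep, phrase_id = line.rstrip("\n").rpartition("|")
        if sep and phrase_id.strip().isdigit():
            phrases[int(phrase_id)] = phrase
    return phrases


def parse_labels(lines):
    """Parse 'id|probability' lines into {id: probability}."""
    labels = {}
    for line in lines:
        phrase_id, sep, value = line.strip().partition("|")
        # Assumption: any line without a numeric id is a header; skip it.
        if sep and phrase_id.isdigit():
            labels[int(phrase_id)] = float(value)
    return labels


# Made-up sample lines, not taken from the dataset.
dictionary = parse_dictionary(["good|1", "not good|2"])
labels = parse_labels(["phrase ids|sentiment values", "1|0.8", "2|0.25"])
for phrase_id, phrase in dictionary.items():
    print(phrase, "->", CLASS_NAMES[sentiment_class(labels[phrase_id])])
```

For the real files, the same functions can be given open file objects, for example `parse_labels(open("sentiment_labels.txt", encoding="utf-8"))`; the encoding is an assumption.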

Please note that the datasetSentences.txt file has more sentences/lines than the original_rt_snippets.txt file.
Each row in the latter represents a snippet as shown on RT, whereas each row in the former is a sub-sentence as determined by the Stanford parser.
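
For readers without Matlab, the parent pointer format mentioned in item 4 can be sketched in Python. This is an illustration under stated assumptions, not a transcription of the paper's code: it assumes each STree.txt line is a list of integers separated by vertical bars, where entry i (counting from 1) is the parent of node i and 0 marks the root, and that the matching SOStr.txt line holds the sentence tokens, also separated by vertical bars, as nodes 1..n. The function names and the example tree are hypothetical.

```python
def parse_parents(stree_line):
    """Parse one STree.txt line into a list of parent indices."""
    return [int(field) for field in stree_line.strip().split("|")]


def phrase_spans(parents, tokens):
    """Return {node: phrase text} for every node of one tree.

    Assumptions: nodes are numbered from 1, nodes 1..len(tokens) are the
    leaves in sentence order, and parents[i - 1] is the parent of node i
    (0 for the root).
    """
    leaves = {}
    for leaf in range(1, len(tokens) + 1):
        leaves[leaf] = [leaf]
        # Walk from the leaf up to the root, registering the leaf
        # with every ancestor on the way.
        node = parents[leaf - 1]
        while node != 0:
            leaves.setdefault(node, []).append(leaf)
            node = parents[node - 1]
    return {
        node: " ".join(tokens[i - 1] for i in sorted(covered))
        for node, covered in leaves.items()
    }


# Made-up example: "not (very good)" with node 4 = "very good", node 5 = root.
tokens = "not|very|good".split("|")
parents = parse_parents("5|4|4|5|0")
spans = phrase_spans(parents, tokens)
print(spans[4])
print(spans[5])
```

Collecting the leaves under each node, rather than concatenating children in index order, keeps the words in sentence order even when an internal node has a lower-numbered right child. The resulting phrase strings could then be looked up in dictionary.txt to find their phrase IDs.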

To keep results comparable across research, please use the provided train/dev/test splits when training and evaluating models.
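
A minimal sketch of applying the provided splits, based on the formats described in items 5 and 6. It is not the repository's xtract_sst.py; the function names and sample lines are hypothetical, and it assumes that any line whose index is not a number (such as a header) should be skipped.

```python
# Set labels as listed in item 6.
SPLIT_NAMES = {1: "train", 2: "test", 3: "dev"}


def parse_sentences(lines):
    """Parse 'index<TAB>sentence' lines into {index: sentence}."""
    sentences = {}
    for line in lines:
        index, sep, sentence = line.rstrip("\n").partition("\t")
        # Assumption: lines without a numeric index are headers.
        if sep and index.strip().isdigit():
            sentences[int(index)] = sentence
    return sentences


def parse_split(lines):
    """Parse 'index,label' lines into {index: split name}."""
    split = {}
    for line in lines:
        index, sep, label = line.strip().partition(",")
        if sep and index.isdigit() and label.isdigit():
            split[int(index)] = SPLIT_NAMES[int(label)]
    return split


def group_by_split(sentences, split):
    """Group sentence strings into train/test/dev lists."""
    groups = {name: [] for name in SPLIT_NAMES.values()}
    for index, sentence in sentences.items():
        groups[split[index]].append(sentence)
    return groups


# Made-up sample lines, not taken from the dataset.
sentences = parse_sentences([
    "sentence_index\tsentence",
    "1\tA fine film .",
    "2\tDull and slow .",
    "3\tWorth a look .",
])
split = parse_split(["sentence_index,splitset_label", "1,1", "2,2", "3,3"])
groups = group_by_split(sentences, split)
print(groups["train"])
print(groups["dev"])
```

A sentence index missing from datasetSplit.txt, or a label other than 1, 2 or 3, raises a KeyError here, which seems preferable to silently placing a sentence in the wrong set.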