Mention-anomaly-based Event Detection and Tracking in Twitter
Author: Adrien GUILLE
Details of this program are described in the following paper:
Adrien Guille and Cécile Favre (2014)
Mention-Anomaly-Based Event Detection and Tracking in Twitter.
In Proceedings of the 2014 IEEE/ACM International Conference on
Advances in Social Network Mining and Analysis (ASONAM 2014),
pp. 375-382, DOI: 10.1109/ASONAM.2014.6921613
Please cite this paper when using the program.
- input/: input files that describe the corpus in which we want to detect events
- MABED.jar: Java program that does the event detection
- README.txt: this file
- parameters.txt: Java properties file in which parameters are set
- stopwords.txt: a list of common stopwords to remove when generating the vocabulary
- lib/: program dependencies
If the program is called with the argument "-split", it expects the following file in the "dataset/" directory:
- <name_file.csv>.text: the content of all tweets, one tweet per line. Each line should be formatted as "timestamp","tweet message", where the timestamp uses the same format as the <time_slice>.time files described below (YYYY-MM-DD HH:mm:ss.S).
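As an illustration of the line format above, here is a minimal Java sketch that splits one such line into its timestamp and message. It is not taken from MABED itself: the class name TweetCsvLine and the method parse are hypothetical, and the sketch assumes that both fields are always wrapped in double quotes and that quotes inside a message are not escaped in any special way.

```java
class TweetCsvLine {
    // Separator between the two quoted fields: ","
    private static final String SEPARATOR = "\",\"";

    // Returns {timestamp, message}. The line is cut at the FIRST separator,
    // so commas and quotes later in the message are kept as they are.
    static String[] parse(String line) {
        int cut = line.indexOf(SEPARATOR);
        boolean quoted = line.startsWith("\"") && line.endsWith("\"");
        if (!quoted || cut < 1 || cut + SEPARATOR.length() > line.length() - 1) {
            throw new IllegalArgumentException("Not a \"timestamp\",\"message\" line: " + line);
        }
        String timestamp = line.substring(1, cut);
        String message = line.substring(cut + SEPARATOR.length(), line.length() - 1);
        return new String[] {timestamp, message};
    }

    public static void main(String[] args) {
        String[] fields = parse("\"2009-11-01 00:01:24.0\",\"hello, world\"");
        System.out.println("timestamp: " + fields[0]);
        System.out.println("message:   " + fields[1]);
    }
}
```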
If the program is called with the argument "-run", it expects two sets of files in the "input/" directory:
- <time_slice>.text: content of the messages, one line per message;
- <time_slice>.time: timestamp of the messages, each line maps to the message that has the same line number in <time_slice>.text. Timestamps should be formatted according to this format: YYYY-MM-DD HH:mm:ss.S (e.g. 2009-11-01 00:01:24.0)
Time-slices are expected to be numbered starting from 0, and each file should be named with the time-slice number zero-padded to 8 digits (e.g. 00000000.text, 00000000.time, 00000001.text, 00000001.time).
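The naming and timestamp conventions above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class TimeSliceFiles and its methods are hypothetical, and the pattern "yyyy-MM-dd HH:mm:ss.S" is assumed to be the java.text.SimpleDateFormat equivalent of the format written above (in SimpleDateFormat, upper-case "YYYY" and "DD" mean week year and day of year, so the lower-case letters are needed).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class TimeSliceFiles {
    // Assumed SimpleDateFormat pattern for "YYYY-MM-DD HH:mm:ss.S".
    static final String TIMESTAMP_PATTERN = "yyyy-MM-dd HH:mm:ss.S";

    // Builds the 8-digit, zero-padded file name of a time-slice.
    static String fileName(int sliceIndex, String extension) {
        return String.format(Locale.ROOT, "%08d.%s", sliceIndex, extension);
    }

    // Parses one line of a .time file in the JVM's default time zone.
    static Date parseTimestamp(String line) {
        SimpleDateFormat format = new SimpleDateFormat(TIMESTAMP_PATTERN, Locale.ROOT);
        format.setLenient(false); // reject impossible dates such as month 13
        try {
            return format.parse(line.trim());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad timestamp: " + line, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileName(0, "text"));
        System.out.println(fileName(1, "time"));
        System.out.println(parseTimestamp("2009-11-01 00:01:24.0"));
    }
}
```

A check like this can be run over the files in "input/" before launching MABED, to catch misnamed files or malformed timestamps early.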
All the parameters are set in the parameters.txt file:
- prepareCorpus (boolean): if you are running MABED for the first time, or if the content of the input directory has been modified, this parameter should be set to 'true', otherwise 'false'.
- timeSliceLength (int): length of each time-slice, expressed in minutes (e.g. 30);
- numberOfThreads (int): number of threads used by MABED (if greater than 1, the parallelized implementation of MABED is executed);
- k (int): desired number of events (e.g. 40);
- p (int): maximum number of related words describing each event (e.g. 10);
- theta (double): minimum weight of each related word (e.g. 0.7);
- sigma (double): merging threshold (e.g. 0.5);
- stopwords (String): name of the file that lists the stopwords, one word per line (e.g. stopwords.txt);
- minSupport (double): minimum support of words in the vocabulary (e.g. 0);
- maxSupport (double): maximum support of words in the vocabulary (e.g. 1).
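Since parameters.txt is a Java properties file, its content can be read with java.util.Properties. The sketch below shows what such a file might look like and how the values could be read; it is not MABED's own loading code. The values are the "e.g." values from the list above, except prepareCorpus = true and numberOfThreads = 1, which are illustrative choices. The class name ParametersExample is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ParametersExample {
    // Possible content of parameters.txt (example values only).
    static final String EXAMPLE =
          "prepareCorpus = true\n"
        + "timeSliceLength = 30\n"
        + "numberOfThreads = 1\n"
        + "k = 40\n"
        + "p = 10\n"
        + "theta = 0.7\n"
        + "sigma = 0.5\n"
        + "stopwords = stopwords.txt\n"
        + "minSupport = 0\n"
        + "maxSupport = 1\n";

    // Loads properties from text in the standard "key = value" syntax.
    static Properties load(String content) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = load(EXAMPLE);
        int k = Integer.parseInt(properties.getProperty("k"));
        double theta = Double.parseDouble(properties.getProperty("theta"));
        boolean prepare = Boolean.parseBoolean(properties.getProperty("prepareCorpus"));
        System.out.println("k=" + k + ", theta=" + theta + ", prepareCorpus=" + prepare);
    }
}
```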
-
Requirements: Java 7 or later
-
Execute the program MABED.jar with the following command: "java -jar MABED.jar -run". It should process the input and save the output in the "output/" directory.
-
To generate input files from a '.csv' file containing all tweets (timestamps and messages), execute the program MABED.jar with the following command: "java -jar MABED.jar -split timeSliceLength name_file.csv". It should process the file containing all tweets, and save the split files in the "input/" directory.
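To give an idea of how a tweet could be assigned to a numbered time-slice during splitting, here is a small Java sketch. It rests on an assumption that this README does not confirm: time-slice 0 is taken to start at the timestamp of the first tweet, and each slice covers timeSliceLength minutes. The class SliceIndexSketch and the method sliceIndex are hypothetical and are not part of MABED.jar.

```java
import java.util.concurrent.TimeUnit;

class SliceIndexSketch {
    // Index of the time-slice that contains a message, ASSUMING slice 0
    // starts at the first message's timestamp (both given in milliseconds).
    static int sliceIndex(long firstMillis, long messageMillis, int timeSliceLengthMinutes) {
        if (timeSliceLengthMinutes <= 0) {
            throw new IllegalArgumentException("timeSliceLength must be positive");
        }
        if (messageMillis < firstMillis) {
            throw new IllegalArgumentException("message is older than the first message");
        }
        long sliceMillis = TimeUnit.MINUTES.toMillis(timeSliceLengthMinutes);
        return (int) ((messageMillis - firstMillis) / sliceMillis);
    }

    public static void main(String[] args) {
        long start = 0L;
        long fortyFiveMinutes = TimeUnit.MINUTES.toMillis(45);
        // With 30-minute slices, a tweet sent 45 minutes after the start
        // falls in the second slice (index 1, i.e. file 00000001.text).
        System.out.println(sliceIndex(start, fortyFiveMinutes, 30));
    }
}
```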