Nota bene: This Java implementation is no longer maintained. For an up-to-date implementation, see: https://github.com/AdrienGuille/pyMABED
Mention-anomaly-based Event Detection and Tracking in Twitter
Author: Adrien GUILLE
Details of this program are described in the following papers:
Adrien Guille and Cécile Favre (2015)
Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach.
Springer Social Network Analysis and Mining,
vol. 5, iss. 1, art. 18, DOI: 10.1007/s13278-015-0258-0
Adrien Guille and Cécile Favre (2014)
Mention-Anomaly-Based Event Detection and Tracking in Twitter.
In Proceedings of the 2014 IEEE/ACM International Conference on
Advances in Social Network Mining and Analysis (ASONAM 2014),
pp. 375-382, DOI: 10.1109/ASONAM.2014.6921613
Please cite one of these papers when using the program.
- input/: input files that describe the corpus in which we want to detect events
- MABED.jar: Java program that does the event detection
- README.txt: this file
- parameters.txt: Java properties file in which parameters are set
- stopwords.txt: a list of common stopwords to remove when generating the vocabulary
- lib/: program dependencies
The program expects two sets of files in the "input/" directory:
- <time_slice>.text: content of the messages, one line per message;
- <time_slice>.time: timestamps of the messages, one per line; each line corresponds to the message on the same line number in <time_slice>.text. Timestamps must use the format YYYY-MM-DD HH:mm:ss (e.g. 2009-11-01 00:01:24).
Time-slices are expected to be numbered starting from 0, and file names are expected to have 8 digits (e.g. 00000000.text, 00000000.time, 00000001.text, 00000001.time).
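The naming convention above can be illustrated with a small helper. This is only a sketch and not part of MABED: the class and method names are hypothetical, it uses java.time (Java 8 or later), and it assumes that time-slices are contiguous intervals of timeSliceLength minutes counted from the first timestamp of the corpus. Note that the README format YYYY-MM-DD HH:mm:ss corresponds to the Java pattern "yyyy-MM-dd HH:mm:ss".

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of MABED.jar.
public class TimeSliceNaming {
    // Java pattern equivalent to the README's "YYYY-MM-DD HH:mm:ss".
    public static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Index of the time-slice containing `timestamp`, counting from 0 at
    // `corpusStart`. Assumes contiguous slices and timestamp >= corpusStart.
    public static long sliceIndex(String corpusStart, String timestamp,
                                  int timeSliceLengthMinutes) {
        LocalDateTime start = LocalDateTime.parse(corpusStart, TIMESTAMP);
        LocalDateTime t = LocalDateTime.parse(timestamp, TIMESTAMP);
        return Duration.between(start, t).toMinutes() / timeSliceLengthMinutes;
    }

    // 8-digit, zero-padded file name, e.g. 00000001.text
    public static String fileName(long sliceIndex, String extension) {
        return String.format("%08d.%s", sliceIndex, extension);
    }

    public static void main(String[] args) {
        long i = sliceIndex("2009-11-01 00:00:00", "2009-11-01 00:31:24", 30);
        System.out.println(fileName(i, "text")); // prints 00000001.text
        System.out.println(fileName(i, "time")); // prints 00000001.time
    }
}
```

With 30-minute slices, a message posted 31 minutes after the start of the corpus would go into the second slice, i.e. the files 00000001.text and 00000001.time.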
All the parameters are set in the parameters.txt file:
- prepareCorpus (boolean): set this to 'true' when running MABED for the first time or whenever the content of the input directory has been modified; otherwise set it to 'false';
- timeSliceLength (int): length of each time-slice, expressed in minutes (e.g. 30);
- numberOfThreads (int): number of threads used by MABED (if greater than 1, the parallelized implementation of MABED is executed);
- k (int): desired number of events (e.g. 40);
- p (int): maximum number of related words describing each event (e.g. 10);
- theta (double): minimum weight of each related word (e.g. 0.7);
- sigma (double): merging threshold (e.g. 0.5);
- stopwords (String): name of the file that lists the stopwords, one word per line (e.g. stopwords.txt);
- minSupport (double): minimum support of words in the vocabulary (e.g. 0);
- maxSupport (double): maximum support of words in the vocabulary (e.g. 1).
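As an illustration of the parameters listed above, the following sketch builds a sample parameters.txt from the example values in this README and reads it with java.util.Properties. It is not MABED's own loading code: the class name is hypothetical, numberOfThreads=1 is an illustrative choice rather than a documented default, and String.join requires Java 8 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical example, not part of MABED.jar.
public class ParametersExample {
    // Sample parameters.txt content using the example values from this README.
    // numberOfThreads=1 is an assumption made for illustration.
    public static final String SAMPLE = String.join("\n",
            "prepareCorpus=true",
            "timeSliceLength=30",
            "numberOfThreads=1",
            "k=40",
            "p=10",
            "theta=0.7",
            "sigma=0.5",
            "stopwords=stopwords.txt",
            "minSupport=0",
            "maxSupport=1");

    // Parses Java properties text into a Properties object.
    public static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        int k = Integer.parseInt(props.getProperty("k"));
        double theta = Double.parseDouble(props.getProperty("theta"));
        boolean prepare = Boolean.parseBoolean(props.getProperty("prepareCorpus"));
        // prints k=40, theta=0.7, prepareCorpus=true
        System.out.println("k=" + k + ", theta=" + theta + ", prepareCorpus=" + prepare);
    }
}
```

The lines held in SAMPLE could be saved as parameters.txt as a starting point, then adjusted to the corpus at hand.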
- Requirements: Java 7 or later
- Execute the program MABED.jar with the following command: "java -jar MABED.jar -run". The program processes the input and saves the output in the "output/" directory.