Releases: angelosalatino/cso-classifier
CSO Classifier v3.3
This release extends version 3.2 with a new feature that lets you refine the classification process by focusing on specific areas of the Computer Science Ontology (CSO). When you pass one or more topics in the filter_by parameter (a list), the classifier extracts the sub-branches of those CSO topics and, when classifying, narrows the output down to the sub-topics available in those areas. This is especially helpful when you want to explore specific branches of the CSO, such as identifying only the concepts related to artificial intelligence and semantic web within a given paper, which can be achieved by setting filter_by = ["artificial intelligence", "semantic web"] (see Parameters). If this parameter is set, the classifier returns the standard classification results plus four extra sets of results (syntactic_filtered, semantic_filtered, union_filtered, enhanced_filtered) containing only the filtered topics. This gives users both the full picture and a focused view within the chosen areas.
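The filtering idea described above can be sketched in a few lines. This is a minimal illustration, not the classifier's actual code: the toy hierarchy, the helper names sub_branches and add_filtered_views, and the choice to keep the filter topics themselves in the allowed set are all assumptions made for the example.

```python
from collections import deque

# Toy hierarchy (hypothetical, NOT the real CSO): parent -> direct sub-topics.
TOY_ONTOLOGY = {
    "artificial intelligence": ["machine learning", "natural language processing"],
    "machine learning": ["neural networks"],
    "semantic web": ["ontology", "linked data"],
    "computer networks": ["routing protocols"],
}


def sub_branches(roots, ontology):
    """Return the roots plus every topic reachable below them (breadth-first)."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        topic = queue.popleft()
        for child in ontology.get(topic, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def add_filtered_views(result, filter_by, ontology):
    """Copy the result and add a `<key>_filtered` list for each standard key."""
    allowed = sub_branches(filter_by, ontology)
    out = dict(result)
    for key in ("syntactic", "semantic", "union", "enhanced"):
        out[key + "_filtered"] = [t for t in result.get(key, []) if t in allowed]
    return out


# Made-up classification output, used only to exercise the helpers.
result = {
    "syntactic": ["neural networks", "routing protocols"],
    "semantic": ["ontology", "neural networks"],
    "union": ["neural networks", "routing protocols", "ontology"],
    "enhanced": ["machine learning", "computer networks", "semantic web"],
}
filtered = add_filtered_views(
    result, ["artificial intelligence", "semantic web"], TOY_ONTOLOGY
)
print(filtered["union_filtered"])  # ['neural networks', 'ontology']
```

The standard keys are left untouched and the filtered views are added next to them, which mirrors the "full picture plus focused view" behaviour described in this release.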
CSO Classifier v3.2
This release extends version 3.1 by letting users export the weights associated with the identified topics. If enabled, the classification result includes two new keys, syntactic_weights and semantic_weights, which contain the identified syntactic and semantic topics as keys and their weights as values. This component is disabled by default and can be enabled by setting get_weights = True when calling the CSO Classifier.
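As a rough sketch of how such a result might be consumed, the snippet below ranks topics by weight. Only the key names syntactic_weights and semantic_weights come from this release note; the topics, the weight values, and the helper names rank_topics and best_weight_per_topic are made up for illustration.

```python
# Illustrative result only; topics and weights are invented for this sketch.
result = {
    "syntactic": ["semantic web", "ontology"],
    "semantic": ["ontology", "linked data"],
    "syntactic_weights": {"semantic web": 1.0, "ontology": 0.95},
    "semantic_weights": {"ontology": 0.9, "linked data": 0.7},
}


def rank_topics(result, module):
    """Sort the topics of one module ('syntactic' or 'semantic') by descending weight."""
    weights = result.get(module + "_weights", {})
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


def best_weight_per_topic(result):
    """Merge both modules, keeping the higher weight when a topic appears twice."""
    merged = {}
    for module in ("syntactic", "semantic"):
        for topic, weight in result.get(module + "_weights", {}).items():
            merged[topic] = max(weight, merged.get(topic, 0.0))
    return merged


print(rank_topics(result, "semantic"))
print(best_weight_per_topic(result))
```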
CSO Classifier v3.1
This release brings in two main changes. The first concerns the library (and the code) used to compute the Levenshtein similarity. Previously we relied on python-Levenshtein, which required python3-devel. This new version uses rapidfuzz, which is as fast as the previous library and much easier to install on the various systems.
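For readers unfamiliar with the metric, here is a standard-library-only sketch of a Levenshtein similarity. It is not the classifier's code and it does not call rapidfuzz; normalising by the longer string's length is one common convention and an assumption here, since libraries may normalise differently.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(similarity("ontology", "ontologies"))
```

A pure-Python loop like this is far slower than a compiled library, which is why a package such as rapidfuzz is the practical choice when many string pairs must be compared.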
The second change is related to an updated list of dependencies. We updated some libraries, including igraph.
CSO Classifier v3.0
This release welcomes some improvements under the hood. In particular:
- we refactored the code, reorganising scripts into more elegant classes
- we added functionality to automatically set up and update the classifier to the latest version of CSO
- we added the explanation feature, which returns chunks of text that allowed the classifier to infer a given topic
- the syntactic module now takes advantage of the spaCy POS tagger (as previously done only by the semantic module)
- the grammar for the chunk parser is now more robust:
{<JJ.*>*<HYPH>*<JJ.*>*<HYPH>*<NN.*>*<HYPH>*<NN.*>+}
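To see what the chunk grammar above accepts, the sketch below emulates it with Python's re module over a string of part-of-speech tags. This is an approximation for illustration only: the classifier feeds the grammar to a chunk parser, whereas here the pattern is translated by hand, the tags are assigned manually rather than by a tagger, and find_chunks is a hypothetical helper.

```python
import re

# Hand translation of the grammar: each <TAG> element becomes a regex group.
JJ = r"(?:<JJ[^<>]*>)"    # adjectives: JJ, JJR, JJS
HYPH = r"(?:<HYPH>)"      # hyphen tokens
NN = r"(?:<NN[^<>]*>)"    # nouns: NN, NNS, NNP, NNPS
CHUNK = re.compile(f"{JJ}*{HYPH}*{JJ}*{HYPH}*{NN}*{HYPH}*{NN}+")


def find_chunks(tagged_tokens):
    """Return the token sequences whose tags match the chunk pattern."""
    tag_string = ""
    offsets = []  # start offset of each token's tag inside tag_string
    for _, tag in tagged_tokens:
        offsets.append(len(tag_string))
        tag_string += f"<{tag}>"
    chunks = []
    for match in CHUNK.finditer(tag_string):
        words = [
            word
            for (word, _), start in zip(tagged_tokens, offsets)
            if match.start() <= start < match.end()
        ]
        chunks.append(words)
    return chunks


# Tags assigned by hand for the example sentence.
tagged = [
    ("real", "JJ"), ("-", "HYPH"), ("time", "NN"), ("systems", "NNS"),
    ("use", "VBP"),
    ("deep", "JJ"), ("neural", "JJ"), ("networks", "NNS"),
]
print(find_chunks(tagged))
# [['real', '-', 'time', 'systems'], ['deep', 'neural', 'networks']]
```

The optional HYPH elements are what let hyphenated terms such as "real-time systems" survive as a single chunk instead of being split at the hyphen.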
In addition, in the post-processing module, we added an outlier detection component. This component improves the accuracy of the result set by removing erroneous topics that are conceptually distant from the others. It is enabled by default and can be disabled by setting delete_outliers = False when calling the CSO Classifier (see Parameters).
Please be aware that, because we substantially restructured the code into classes, the way of running the classifier has changed too. Thus, if you are using a previous version of the classifier, we encourage you to update it (pip install -U cso-classifier) and modify your calls to the classifier accordingly. Read our usage examples.
We would like to thank James Dunham @jamesdunham from CSET (Georgetown University) for suggesting to us how to improve the code.
CSO Classifier v2.3.2
Version alignment with PyPI. Similar to version 2.3.1.
CSO Classifier v2.3.1
Bug fix. Added some exception handling.
CSO Classifier v2.3
This new release contains a bug fix and the latest version of the CSO ontology.
Bug fix: when running in batch mode, the classifier was treating the keyword field as an array instead of a string. As a result, instead of processing keywords (separated by commas), it was processing each single letter, hence inferring wrong topics. This has now been fixed. In addition, if the keyword field is actually an array, the classifier will first 'stringify' it and then process it.
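The class of bug described above is easy to reproduce in plain Python, because iterating over a string yields characters. The sketch below shows the pitfall and one way to normalise the field; normalise_keywords and split_keywords are hypothetical helpers written for this example, not the classifier's own functions.

```python
def normalise_keywords(keywords):
    """Return keywords as one comma-separated string, whatever the input type."""
    if keywords is None:
        return ""
    if isinstance(keywords, (list, tuple)):
        # 'Stringify' an array of keywords before any further processing.
        return ", ".join(str(k) for k in keywords)
    return keywords


def split_keywords(keywords):
    """Split a keyword field into individual keywords, dropping empty entries."""
    text = normalise_keywords(keywords)
    return [k.strip() for k in text.split(",") if k.strip()]


as_string = "semantic web, ontology"

# The pitfall: treating a string like an array iterates over single letters.
print(list(as_string)[:3])        # ['s', 'e', 'm']

# After normalisation, both input shapes give the same keywords.
print(split_keywords(as_string))                     # ['semantic web', 'ontology']
print(split_keywords(["semantic web", "ontology"]))  # ['semantic web', 'ontology']
```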
We also downloaded and packed the latest version of the CSO ontology.
CSO Classifier v2.2
In this version (release v2.2), we (i) updated the requirements needed to run the classifier, (ii) removed all unnecessary warnings, and (iii) enabled multiprocessing. In particular, we removed all useless requirements that were installed in development mode, by cleaning the requirements.txt file.
When processing certain research papers, the classifier displayed warnings raised by the kneed library. Since the classifier automatically adapts to such cases, we decided to hide these warnings so that users are not concerned by them.
This version of the classifier provides improved scalability through multiprocessing. Once the number of workers is set (i.e. num_workers >= 1), each worker is given a copy of the CSO Classifier and a chunk of the corpus to process. The results are then aggregated once all processes have completed. Please be aware that this function is only available in batch mode.
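The chunk-then-aggregate pattern described above can be sketched with the standard multiprocessing module. This is an assumption-laden illustration, not the classifier's implementation: split_corpus, classify_chunk, and classify_in_parallel are hypothetical names, and classify_chunk is a placeholder that only splits keywords instead of running a real classifier copy.

```python
from multiprocessing import Pool


def split_corpus(corpus, num_workers):
    """Split a {paper_id: paper} dict into at most num_workers chunks."""
    items = list(corpus.items())
    size = max(1, -(-len(items) // num_workers))  # ceiling division, never 0
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]


def classify_chunk(chunk):
    """Placeholder worker standing in for a per-process classifier copy."""
    return {
        paper_id: sorted(set(paper["keywords"].split(", ")))
        for paper_id, paper in chunk.items()
    }


def classify_in_parallel(corpus, num_workers):
    """Process each chunk in its own worker, then merge the partial results."""
    chunks = split_corpus(corpus, num_workers)
    with Pool(processes=num_workers) as pool:
        partial_results = pool.map(classify_chunk, chunks)
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged
```

When trying this out, call classify_in_parallel(corpus, num_workers=2) from under an if __name__ == "__main__": guard, since platforms that spawn worker processes re-import the main module.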
CSO Classifier v2.1
The CSO Classifier is an application that takes as input the text from abstract, title, and keywords of a research paper and outputs a list of relevant concepts from CSO. This new release (version v2.1) aims at improving its scalability.
Compared to its previous version (v2.0), the classifier relies on a cached word2vec model which connects the words within the model vocabulary directly with the CSO topics. Thanks to this cache, the classifier can quickly retrieve all CSO topics that could be inferred from the given tokens, reducing the processing time. In addition, this cache is lighter (~64MB) than the actual word2vec model (~366MB), which saves additional time when loading.
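The benefit of such a cache is that topic retrieval becomes a dictionary lookup per token rather than a similarity search over the embedding model. The following sketch illustrates the idea only; the cache contents, its structure, and the helper topics_for_tokens are assumptions for the example and do not reflect the actual cache file format.

```python
# Toy cache (hypothetical contents): token -> CSO topics it can lead to.
TOKEN_TO_TOPICS = {
    "ontology": ["ontology", "semantic web"],
    "rdf": ["resource description framework", "semantic web"],
    "neural": ["neural networks"],
}


def topics_for_tokens(tokens, cache):
    """Collect candidate topics with one dictionary lookup per token.

    Returns a dict mapping each topic to the number of tokens that led to it.
    Tokens missing from the cache are simply skipped.
    """
    found = {}
    for token in tokens:
        for topic in cache.get(token.lower(), []):
            found[topic] = found.get(topic, 0) + 1
    return found


print(topics_for_tokens(["RDF", "ontology", "unknown"], TOKEN_TO_TOPICS))
```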
Thanks to this improvement, the CSO Classifier is around 24x faster and can easily be run on large corpora of scholarly data.
CSO Classifier v2.0
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this repository, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.