Preprocess Text: add option to extract specific (keyword) N-grams #1011
Labels: enhancement, meal (This will take a day or two), text expert (Requires knowledge of Text add-on), wishlist
Is your feature request related to a problem? Please describe.
Certain types of documents, such as scientific publications, are often accompanied by a list of keywords that typically contain N-grams, such as "generative neural networks", "fertility rates" or "consumer preferences". If the publications' metadata can be downloaded, the keywords usually appear in a separate column, separated by commas, semicolons or some other separator. Using Preprocess Text, these N-grams can easily be extracted by tokenizing with the regexp [^;]+ (for ";" as the separator).
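For illustration, a minimal Python sketch of that extraction outside of Orange (the keyword string is made up and the column handling is simplified):

```python
import re

# Hypothetical metadata cell: keyword phrases separated by ";".
keywords_cell = "generative neural networks; fertility rates; consumer preferences"

# The regexp mentioned above, [^;]+, keeps each keyword phrase as one token.
keyword_ngrams = [kw.strip().lower() for kw in re.findall(r"[^;]+", keywords_cell)]
print(keyword_ngrams)
# ['generative neural networks', 'fertility rates', 'consumer preferences']
```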
When analyzing the full texts or abstracts, it would be very useful if these same N-grams were also recognized as belonging together rather than as separate words - including keyword N-grams from other documents that appear in a document's main text (but not in its keywords). Of course, N-grams can be extracted by defining an N-gram range in Preprocess Text, but this also produces many meaningless or less meaningful N-grams, especially if in-between stopwords have already been removed.
Describe the solution you'd like
Ideally, I would like to be able to connect two corpora as input to Preprocess Text: one with the main texts or abstracts from all documents, and one with all the keywords (1-grams and N-grams) from all documents, presumably already tokenized with Preprocess Text. The second input would only be used for a "keyword N-gram construction" step after tokenization of the first input (not necessarily at the end, like regular N-gram construction). A rough sketch of what such a step could do is given below.
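To make the intent concrete, here is a Python sketch of a possible "keyword N-gram construction" step; the function name, the "_" separator and the greedy longest-match-first strategy are my assumptions, not an existing Orange API:

```python
def merge_keyword_ngrams(tokens, keyword_ngrams, sep="_"):
    """Greedily replace token runs that match a known keyword N-gram
    with a single merged token, preferring the longest match.
    Single-word keywords need no merging and are left untouched."""
    ngram_set = {tuple(kw.split()) for kw in keyword_ngrams}
    max_n = max((len(ng) for ng in ngram_set), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in ngram_set:
                out.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:  # no keyword N-gram starts at position i
            out.append(tokens[i])
            i += 1
    return out

tokens = "we study fertility rates in two cohorts".split()
print(merge_keyword_ngrams(tokens, ["fertility rates"]))
# ['we', 'study', 'fertility_rates', 'in', 'two', 'cohorts']
```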
Another option would be to allow specifying an "N-gram keyword lexicon" file in the Filtering step, but that would require a two-step approach where the list of keywords has to be re-created and reloaded every time documents are added.
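If the lexicon-file route were taken, the file could be as simple as one keyword N-gram per line; a minimal loading sketch (the filename and format are hypothetical):

```python
# Hypothetical lexicon file: one keyword N-gram per line, e.g. "fertility rates".
with open("keyword_lexicon.txt", encoding="utf-8") as f:
    lexicon = [line.strip().lower() for line in f if line.strip()]
```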
Describe alternatives you've considered
As mentioned above, the regular N-gram construction option can be used, but it produces a lot of noise.