This repository contains all files used in the sequential pattern mining applied to 72,019 sentences with entity associations from PubMed abstracts classified as positive in the Text Classification Step. The files are described below:
- sequential-pattern-mining-pubmed-abstract-sentences-gh.R: R script for sequential pattern mining in PubMed abstract sentences on the anticancer activity of polyphenols.
- anotated_sentences.tsv: TSV file with a list of 72,019 sentences annotated with polyphenol, cancer and gene entities, for sequential pattern mining. Save this file in the same folder as the sequential-pattern-mining-pubmed-abstract-sentences-gh.R script, because the script needs it to run.
For more information about this and other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.
The files below contain the mined patterns, which were used to create rules for extracting information about anticancer activity from PubMed abstracts:
- Patterns-unique-trasaction.tsv: list of patterns mined with a unique term.
- Patterns-polifenol-cancer-associations.tsv: list of patterns mined with polyphenol-cancer entity associations.
- Patterns-polifenol-gene-associations.tsv: list of patterns mined with polyphenol-gene entity associations.
- Patterns-gene-cancer-associations.tsv: list of patterns mined with gene-cancer entity associations.
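
To give a rough idea of what a mined pattern with an entity association looks like, here is a minimal Python sketch. It is not the algorithm used by the R script in this repository: it simply enumerates ordered (possibly gapped) token subsequences by brute force and keeps those that occur in enough sentences. The toy sentences, the placeholder tags `POLYPHENOL`, `CANCER` and `GENE`, and the function name `mine_sequential_patterns` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_sequential_patterns(sequences, min_count, max_len=3):
    """Return {pattern: count} for ordered token patterns of length 1..max_len.

    A pattern is a tuple of tokens that appear in that order (gaps allowed)
    in at least `min_count` sequences. Brute force, for toy data only.
    """
    counts = Counter()
    for tokens in sequences:
        seen = set()
        for n in range(1, max_len + 1):
            # combinations() preserves token order, so every tuple it
            # yields is an ordered subsequence of the sentence.
            seen.update(combinations(tokens, n))
        # Each sentence contributes at most once to a pattern's count.
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical toy sentences with entity mentions replaced by placeholder tags.
sentences = [
    "POLYPHENOL inhibits growth of CANCER cells".split(),
    "POLYPHENOL inhibits proliferation in CANCER".split(),
    "GENE expression is reduced by POLYPHENOL".split(),
]

patterns = mine_sequential_patterns(sentences, min_count=2)
for pattern, count in sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" -> ".join(pattern), count)
```

In this toy run, the ordered pattern `POLYPHENOL -> inhibits -> CANCER` is kept because it occurs in two of the three sentences, which is the kind of polyphenol-cancer association a rule for information extraction could be built on. Brute-force enumeration grows quickly with sentence length, so a dedicated sequential pattern mining implementation is the practical choice for the full set of sentences.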
The sequential pattern mining contributed to the creation of a dictionary with 25 rules for the Information Extraction Step. See the Rules Dictionary Implementation for more information.