Find features
We use str_extract_all in the stringr package to extract locus tags from the searchPMC results, using the prefix, number of digits and optional suffixes as the pattern string.
y <- searchPMC(txt, "BPS[SL][0-9]{4}")
str_extract_all(y$mention, "BPS[SL][0-9]{4}[abc]?")
[[1]]
[1] "BPSS1279" "BPSL1771" "BPSS0842"
[[2]]
[1] "BPSL2311" "BPSL2312"
[[3]]
[1] "BPSS0404" "BPSL2945"
[[4]]
[1] "BPSL2787" "BPSL2810" "BPSS0417" "BPSS0429" "BPSS1825" "BPSS1834"
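As a minimal sketch of how such a pattern string can be assembled from its parts, the base R snippet below builds the same regular expression from a prefix, a digit count and an optional suffix. The variable names are illustrative and not part of pmcXML; the sentence is taken from the example article (PMC3418162). Calling str_extract_all(mention, pattern) from stringr should return the same matches.

```r
# Pattern pieces as described in the text: prefix, four digits, optional suffix
prefix <- "BPS[SL]"
digits <- 4
suffix <- "[abc]"
pattern <- paste0(prefix, "[0-9]{", digits, "}", suffix, "?")

# Sentence from the example article (PMC3418162)
mention <- paste(
  "Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase),",
  "BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842",
  "(benzoylformate decarboxylase) were up-regulated throughout the infection period."
)

# Base R equivalent of str_extract_all(mention, pattern)[[1]]
tags <- regmatches(mention, gregexpr(pattern, mention))[[1]]
print(tags)
# [1] "BPSS1279" "BPSL1771" "BPSS0842"
```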
In addition, many locus tags are arranged as pairs marking the start and end of a region such as a genomic island or operon. We also extract these pairs and expand the range using seqIds and the ordered list of locus tags from the GFF3 file.
unlist( str_extract_all(y$mention, "BPS[SL][0-9]{4}-BPS[SL][0-9]{4}") )
[1] "BPSL2787-BPSL2810" "BPSS0417-BPSS0429" "BPSS1825-BPSS1834" "BPSS1493-BPSS1511"
seqIds("BPSS0417-BPSS0429", tags= bplocus)
[1] "BPSS0417" "BPSS0418" "BPSS0419" "BPSS0420" "BPSS0421" "BPSS0422" "BPSS0423" "BPSS0424" "BPSS0425" "BPSS0426" "BPSS0427" "BPSS0428" "BPSS0429"
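The idea behind range expansion can be sketched in a few lines of base R. This is not the seqIds implementation; expand_range and ordered_tags are hypothetical names, and the tag vector is a made-up stand-in for the ordered list read from the GFF3 file.

```r
# Hypothetical stand-in for the ordered locus tags from the GFF3 file
ordered_tags <- sprintf("BPSS%04d", 410:435)

# Expand "start-end" into every tag between the two positions (inclusive)
expand_range <- function(range, tags) {
  ends <- strsplit(range, "-", fixed = TRUE)[[1]]
  pos <- match(ends, tags)
  if (length(ends) != 2 || anyNA(pos)) {
    stop("both ends of the range must be present in 'tags'")
  }
  tags[pos[1]:pos[2]]
}

expanded <- expand_range("BPSS0417-BPSS0429", ordered_tags)
print(length(expanded))
# [1] 13
```

Looking up positions in the ordered list, rather than counting from one tag number to the other, keeps the expansion correct when the annotation skips numbers or contains suffixed tags.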
The findTags function extracts tags and expands ranges using the pmcText or pmcTable output. The resulting data.frame includes the PMC id, section, locus tag, a flag indicating whether the tag was cited indirectly within a range, and the citation (sentence or collapsed row).
x <- findTags(txt, bplocus, prefix = "BPS[SL]" , suffix= "[abc]")
[1] "9 matches"
[1] "Expanded 2 matches to 48, 19 tags"
[1] "79 locus tags cited (78 unique)"
x[1:10,]
id source locus range mention
1 PMC3418162 Results BPSS1279 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.
2 PMC3418162 Results BPSL1771 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.
3 PMC3418162 Results BPSS0842 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.
4 PMC3418162 Results BPSL2311 FALSE Nevertheless, none of the components of the anaerobic respiratory chain showed significant changes in expression except for BPSL2311 (putative respiratory nitrate reductase delta chain) and BPSL2312 (putative respiratory nitrate reductase gamma chain) that were induced at the early stage of infection.
5 PMC3418162 Results BPSL2312 FALSE Nevertheless, none of the components of the anaerobic respiratory chain showed significant changes in expression except for BPSL2311 (putative respiratory nitrate reductase delta chain) and BPSL2312 (putative respiratory nitrate reductase gamma chain) that were induced at the early stage of infection.
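To illustrate the shape of this result, the sketch below builds a similar one-row-per-tag data.frame in base R. It is not the findTags code: tag_rows, the input data.frame and the ordered tag vector are assumptions for illustration, and the second input sentence is invented so that the range flag has something to show.

```r
pattern <- "BPS[SL][0-9]{4}[abc]?"
ordered_tags <- sprintf("BPSS%04d", 410:435)  # hypothetical GFF3 order

# First sentence is from PMC3418162; the second is a made-up example
sentences <- data.frame(
  id = c("PMC3418162", "EXAMPLE"),
  source = c("Results", "Results"),
  mention = c(
    paste("Nevertheless, none of the components of the anaerobic respiratory",
          "chain showed significant changes in expression except for BPSL2311",
          "(putative respiratory nitrate reductase delta chain) and BPSL2312",
          "(putative respiratory nitrate reductase gamma chain) that were",
          "induced at the early stage of infection."),
    "Hypothetical sentence citing the range BPSS0417-BPSS0429."
  ),
  stringsAsFactors = FALSE
)

tag_rows <- function(sentences, pattern, ordered_tags) {
  range_pattern <- paste0(pattern, "-", pattern)
  rows <- lapply(seq_len(nrow(sentences)), function(i) {
    text <- sentences$mention[i]
    # Tags written out in the sentence
    direct <- regmatches(text, gregexpr(pattern, text))[[1]]
    # Tags implied by "start-end" pairs
    ranges <- regmatches(text, gregexpr(range_pattern, text))[[1]]
    inside <- as.character(unlist(lapply(ranges, function(r) {
      pos <- match(strsplit(r, "-", fixed = TRUE)[[1]], ordered_tags)
      if (anyNA(pos)) character(0) else ordered_tags[pos[1]:pos[2]]
    })))
    indirect <- setdiff(inside, direct)
    locus <- c(direct, indirect)
    if (length(locus) == 0) return(NULL)
    data.frame(
      id = sentences$id[i], source = sentences$source[i], locus = locus,
      range = rep(c(FALSE, TRUE), c(length(direct), length(indirect))),
      mention = text, stringsAsFactors = FALSE
    )
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

out <- tag_rows(sentences, pattern, ordered_tags)
print(out[, c("id", "locus", "range")])
```

Each tag keeps the full sentence in the mention column, which is why the same citation repeats across rows in the output above.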
The pmcXML package includes a few other functions to find species and genes, and we are working on functions to find accessions, sequences and coordinates within the full text, tables and supplements. In most articles, many gene names are not included in the RefSeq GFF3 file, and more work is needed to track down the source of these genes (most are from B. pseudomallei, but many gene names cited in the methods may be from other species).
table2(findSpecies(doc))
[1] "Found 96 species mentions"
Total
Burkholderia pseudomallei 91
Burkholderia cenocepacia 2
Bordetella pertussis 1
Burkholderia mallei 1
Caenorhabditis elegans 1
x<- findGenes(txt)
[1] "Found 45 gene mentions (32 unique)"
[1] " possible operons: ftsABH, rpoABCZ"
table(x$gene)
ahpC bimA bopE bprC bspR catAC cbiG cydB dnaB dnaE dspA fhaB
1 2 1 1 1 1 1 2 1 1 1 2
fhaC ftsABH groEL groES hcp1 katG minD minE oxyR parA parB parC
4 1 1 1 3 1 1 1 2 1 2 1
rpoABCZ rpoS tssD tssH tssM virA virAG virG
1 3 2 1 1 1 1 1
sort(unique(x$gene[!x$gene %in% bpgenes]))
[1] "bimA" "bprC" "bspR" "catAC" "cbiG" "dspA" "fhaB" "fhaC" "ftsABH"
[10] "hcp1" "parA" "rpoABCZ" "tssD" "tssH" "tssM" "virA" "virAG" "virG"
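One possible follow-up for the operon-like names is to split them into single genes before comparing against the reference list. The sketch below assumes that a three-letter stem followed by three or more capital letters denotes an operon; this is consistent with the two names flagged above (and leaves names such as groEL alone), but it is a guess rather than the rule findGenes applies. split_operon is a hypothetical helper.

```r
# Split an operon-like name (e.g. "ftsABH") into individual gene names
split_operon <- function(name) {
  # Assumed convention: three lowercase letters + three or more capitals
  if (!grepl("^[a-z]{3}[A-Z]{3,}$", name)) {
    return(name)
  }
  stem <- substr(name, 1, 3)
  members <- strsplit(substring(name, 4), "")[[1]]
  paste0(stem, members)
}

fts <- split_operon("ftsABH")
rpo <- split_operon("rpoABCZ")
gro <- split_operon("groEL")
print(fts)
# [1] "ftsA" "ftsB" "ftsH"
```

The split names could then be checked with the same `%in% bpgenes` test used above.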
Finally, we created a loop that takes the list of references from ncbiPMC, downloads each XML file, parses the full text and tables, and extracts all matching locus tags. In this case, the 2990 locus tag citations are saved to a file. Currently, the supplements are not included in the loop; they are downloaded separately because some additional code is still needed to reformat tables before extracting tags.
pmcLoop(bp, tags= bpgff, prefix = "BPS[SL]" , suffix= "[abc]", file="bp.tab")
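The general structure of such a loop can be sketched as follows. This is not the pmcLoop source: collect_tags is a hypothetical function, and stub_find is a placeholder for the download, parse and extract step so the example runs without network access.

```r
# Loop over article ids, collect per-article results, write one tab-delimited file
collect_tags <- function(ids, find_fun, file) {
  results <- list()
  for (id in ids) {
    # Skip articles that fail instead of aborting the whole run
    res <- tryCatch(find_fun(id), error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    })
    if (!is.null(res) && nrow(res) > 0) {
      results[[id]] <- res
    }
  }
  if (length(results) == 0) {
    return(invisible(NULL))
  }
  all_tags <- do.call(rbind, results)
  write.table(all_tags, file = file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(all_tags)
}

# Placeholder for the real per-article step (download, parse, extract)
stub_find <- function(id) {
  if (id == "PMC_bad") stop("download failed")
  data.frame(id = id, locus = "BPSS0417", stringsAsFactors = FALSE)
}

out_file <- tempfile(fileext = ".tab")
tags <- collect_tags(c("PMC_a", "PMC_bad", "PMC_b"), stub_find, out_file)
print(nrow(tags))
# [1] 2
```

Wrapping each article in tryCatch means one failed download does not lose the tags already collected. Writing with quote = FALSE assumes the citation text contains no tab characters.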