This repository has been archived by the owner on Mar 21, 2019. It is now read-only.

Find features

cstubben edited this page Sep 8, 2014 · 3 revisions

We use str_extract_all from the stringr package to extract locus tags from the searchPMC results. The pattern string combines the tag prefix, the number of digits, and any optional suffixes.

y <- searchPMC(txt, "BPS[SL][0-9]{4}")
	
str_extract_all(y$mention, "BPS[SL][0-9]{4}[abc]?")
[[1]]
[1] "BPSS1279" "BPSL1771" "BPSS0842"
[[2]]
[1] "BPSL2311" "BPSL2312"
[[3]]
[1] "BPSS0404" "BPSL2945"
[[4]]
[1] "BPSL2787" "BPSL2810" "BPSS0417" "BPSS0429" "BPSS1825" "BPSS1834"
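
For readers without stringr installed, the same extraction can be sketched in base R with gregexpr and regmatches. This is only an illustration of the pattern (prefix, four digits, optional one-letter suffix); the sentences below are invented inputs, not searchPMC output.

```r
# Base R only; no packages required.
# The sentences are made-up examples, not text from a real article.
mentions <- c(
  "Genes such as BPSS1279, BPSL1771 and BPSS0842 were up-regulated.",
  "BPSL2311 and BPSL2312a were induced early.",
  "No locus tags here."
)

# prefix + four digits + optional one-letter suffix
pattern <- "BPS[SL][0-9]{4}[abc]?"

# gregexpr() locates every match; regmatches() pulls out the matched text,
# returning one character vector per input string (empty if no match)
tags <- regmatches(mentions, gregexpr(pattern, mentions))
```

As with str_extract_all, the result is a list with one element per mention, so a mention without any tag yields an empty character vector rather than being dropped.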

Many locus tags also appear as pairs marking the start and end of a region, such as a genomic island or operon. We extract these pairs and expand each range with seqIds, using the ordered list of locus tags from the GFF3 file.

unlist( str_extract_all(y$mention, "BPS[SL][0-9]{4}-BPS[SL][0-9]{4}") )
[1] "BPSL2787-BPSL2810" "BPSS0417-BPSS0429" "BPSS1825-BPSS1834" "BPSS1493-BPSS1511"

seqIds("BPSS0417-BPSS0429", tags= bplocus)
[1] "BPSS0417" "BPSS0418" "BPSS0419" "BPSS0420" "BPSS0421" "BPSS0422" "BPSS0423" "BPSS0424" "BPSS0425" "BPSS0426" "BPSS0427" "BPSS0428" "BPSS0429"
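
The idea behind range expansion can be sketched as a positional lookup in an ordered tag vector. The helper expand_range below is hypothetical and is not the package's seqIds; the toy tag list stands in for the tags read from a GFF3 file.

```r
# Hypothetical helper, NOT the package's seqIds(): expand "start-end" into
# every tag between the two endpoints of an ordered tag vector.
expand_range <- function(range, tags) {
  ends <- strsplit(range, "-", fixed = TRUE)[[1]]
  if (length(ends) != 2) stop("expected a range of the form 'start-end'")
  pos <- match(ends, tags)            # positions of both endpoints
  if (anyNA(pos)) stop("range endpoint not found in tag list: ", range)
  tags[seq(min(pos), max(pos))]       # slice works in either direction
}

# Toy ordered tag list (an assumption for illustration only)
toy_tags <- sprintf("BPSS%04d", 415:431)

expanded <- expand_range("BPSS0417-BPSS0429", toy_tags)
```

Looking up positions in the ordered list, rather than counting from one number to the next, means the expansion would still follow the annotation if tag numbers are skipped or carry suffixes.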

The findTags function extracts tags and expands ranges using the pmcText or pmcTable output. The resulting data.frame includes the PMC id, the section, the locus tag, a flag indicating whether the tag was cited indirectly within a range, and the citation (sentence or collapsed row).

x <- findTags(txt, bplocus, prefix = "BPS[SL]" , suffix= "[abc]")
[1] "9 matches"
[1] "Expanded 2 matches to 48, 19 tags"
[1] "79 locus tags cited (78 unique)"

x[1:10,]
   id         source  locus    range mention                                                                                                                                                                                                                                                                                                                         
1  PMC3418162 Results BPSS1279 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.                                                                                                       
2  PMC3418162 Results BPSL1771 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.                                                                                                       
3  PMC3418162 Results BPSS0842 FALSE Anaerobic metabolism pathway genes such as BPSS1279 (threonine dehydratase), BPSL1771 (cobalamin biosynthesis protein CbiG) and BPSS0842 (benzoylformate decarboxylase) were up-regulated throughout the infection period.                                                                                                       
4  PMC3418162 Results BPSL2311 FALSE Nevertheless, none of the components of the anaerobic respiratory chain showed significant changes in expression except for BPSL2311 (putative respiratory nitrate reductase delta chain) and BPSL2312 (putative respiratory nitrate reductase gamma chain) that were induced at the early stage of infection.                   
5  PMC3418162 Results BPSL2312 FALSE Nevertheless, none of the components of the anaerobic respiratory chain showed significant changes in expression except for BPSL2311 (putative respiratory nitrate reductase delta chain) and BPSL2312 (putative respiratory nitrate reductase gamma chain) that were induced at the early stage of infection.                   
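
A minimal sketch of how such a one-row-per-tag table could be assembled in base R is shown below. It is not the findTags implementation: the ids, sentences, and tag list are invented, and it assumes that range endpoints count as directly cited while only the tags between them are flagged.

```r
# Invented input: placeholder ids and sentences, not real PMC content.
mentions <- data.frame(
  id      = c("PMC0000001", "PMC0000001"),
  source  = c("Results", "Results"),
  mention = c("BPSS0001 and BPSS0002 were up-regulated.",
              "The island BPSS0004-BPSS0006 was deleted."),
  stringsAsFactors = FALSE
)
toy_tags  <- sprintf("BPSS%04d", 1:8)   # stand-in for the ordered GFF3 tags
tag_pat   <- "BPSS[0-9]{4}"
range_pat <- paste0(tag_pat, "-", tag_pat)

rows <- lapply(seq_len(nrow(mentions)), function(i) {
  txt    <- mentions$mention[i]
  direct <- regmatches(txt, gregexpr(tag_pat, txt))[[1]]
  ranges <- regmatches(txt, gregexpr(range_pat, txt))[[1]]
  # Expand each "start-end" pair by position in the ordered tag list
  inside <- as.character(unlist(lapply(ranges, function(r) {
    pos <- match(strsplit(r, "-", fixed = TRUE)[[1]], toy_tags)
    if (anyNA(pos)) return(character(0))
    toy_tags[seq(min(pos), max(pos))]
  })))
  # Assumption: only tags that never appear literally are "indirect"
  indirect <- setdiff(inside, direct)
  locus    <- c(direct, indirect)
  if (length(locus) == 0) return(NULL)
  data.frame(id = mentions$id[i], source = mentions$source[i],
             locus = locus, range = locus %in% indirect,
             mention = txt, stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
```

Repeating the full mention on every row keeps each locus tag traceable to the sentence that cited it, at the cost of some duplication.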

The pmcXML package includes a few other functions to find species and genes, and we are working on functions to find accessions, sequences and coordinates within the full text, tables and supplements. In most articles, many gene names are not included in the RefSeq GFF3 file, and more work is needed to track down the source of these genes (most are from B. pseudomallei, but many gene names cited in the methods may be from other species).

table2(findSpecies(doc))
[1] "Found 96 species mentions"
                          Total
Burkholderia pseudomallei    91
Burkholderia cenocepacia      2
Bordetella pertussis          1
Burkholderia mallei           1
Caenorhabditis elegans        1

x<- findGenes(txt)
[1] "Found 45 gene mentions (32 unique)"
[1] "  possible operons: ftsABH, rpoABCZ"

table(x$gene)
   ahpC    bimA    bopE    bprC    bspR   catAC    cbiG    cydB    dnaB    dnaE    dspA    fhaB 
      1       2       1       1       1       1       1       2       1       1       1       2 
   fhaC  ftsABH   groEL   groES    hcp1    katG    minD    minE    oxyR    parA    parB    parC 
      4       1       1       1       3       1       1       1       2       1       2       1 
rpoABCZ    rpoS    tssD    tssH    tssM    virA   virAG    virG 
      1       3       2       1       1       1       1       1

sort(unique(x$gene[!x$gene %in% bpgenes]))
 [1] "bimA"    "bprC"    "bspR"    "catAC"   "cbiG"    "dspA"    "fhaB"    "fhaC"    "ftsABH" 
[10] "hcp1"    "parA"    "rpoABCZ" "tssD"    "tssH"    "tssM"    "virA"    "virAG"   "virG" 

Finally, we created a loop that takes the list of references from ncbiPMC, downloads each XML file, parses the full text and tables, and extracts all matching locus tags. In this case, the 2990 locus tag citations are saved to a file. Supplements are not yet included in the loop; they are downloaded separately because additional code is still needed to reformat their tables before extracting tags.

pmcLoop(bp, tags= bpgff, prefix = "BPS[SL]" , suffix= "[abc]",  file="bp.tab")
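
The general shape of such a loop can be sketched in base R as follows. This is not pmcLoop: the download and parsing step is replaced by a hypothetical fetch_text stub that reads from an in-memory list, and all ids and sentences are invented.

```r
# Stand-in for "download and parse one article"; ids and text are invented.
articles <- list(
  PMC0000001 = "BPSS0001 and BPSS0002 were up-regulated.",
  PMC0000002 = "No locus tags are mentioned here.",
  PMC0000003 = "Deletion of BPSL0010 reduced virulence."
)
fetch_text <- function(id) articles[[id]]   # hypothetical stub

pattern  <- "BPS[SL][0-9]{4}[abc]?"
out_file <- tempfile(fileext = ".tab")

all_hits <- list()
for (id in names(articles)) {
  # tryCatch keeps one failed article from stopping the whole run
  txt <- tryCatch(fetch_text(id), error = function(e) NULL)
  if (is.null(txt)) next
  tags <- regmatches(txt, gregexpr(pattern, txt))[[1]]
  if (length(tags) == 0) next               # skip articles without tags
  all_hits[[id]] <- data.frame(id = id, locus = tags, mention = txt,
                               stringsAsFactors = FALSE)
}

# Combine the per-article results and save them as a tab-delimited file
hits <- do.call(rbind, all_hits)
write.table(hits, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Collecting per-article data frames in a list and binding them once at the end avoids repeatedly growing a data frame inside the loop.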