This repository has been archived by the owner on Mar 21, 2019. It is now read-only.

Using XPath

cstubben edited this page Dec 18, 2014 · 14 revisions

This section describes some internal XML parsing details using the XML package. The pmcOAI function uses xmlParse to generate the XML tree within the R session, so the result is stored as an object of class XMLInternalDocument and can be queried with the XPath expressions below.

id <- "PMC3418162"
doc <- pmcOAI(id)
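
The same kind of object can be built without calling pmcOAI, which is handy for trying out XPath expressions offline. The following is a minimal sketch that assumes only the XML package; the miniature article and the names `txt`, `toy`, `is_internal` and `top` are made up for illustration and are not the content of PMC3418162.

```r
library(XML)

# Hypothetical miniature article, NOT the real PMC3418162 record
txt <- "<article>
  <front><article-meta><title-group>
    <article-title>Main title</article-title>
  </title-group></article-meta></front>
  <body><sec><title>Background</title><p>Some text.</p></sec></body>
  <back><ref-list><ref><mixed-citation>
    <article-title>A cited paper</article-title>
  </mixed-citation></ref></ref-list></back>
</article>"

# asText = TRUE parses the string itself instead of treating it as a file name
toy <- xmlParse(txt, asText = TRUE)

# xmlParse keeps the tree as an internal document, the class named above
is_internal <- inherits(toy, "XMLInternalDocument")

# XPath queries go through xpathSApply; here, the element children of the root
top <- xpathSApply(toy, "/article/*", xmlName)
print(top)
```

Any of the queries in this section should run against `toy` in the same way as against `doc`, although the counts will of course differ.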

This query lists all 87 tags and counts the number of occurrences of each (only the first 16 are displayed).

summary(doc)   # or
table( xpathSApply(doc, "//*", xmlName)) 
           abstract                    aff                article     article-categories 
                  1                      2                      1                      1 
         article-id           article-meta          article-title                   back 
                  4                      1                     62                      1 
               body                   bold                caption                    col 
                  1                     40                     13                     13 
           colgroup                contrib          contrib-group       copyright-holder 
                  3                      3                      1                      1 
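
Because table prints tags alphabetically, it can help to sort the counts and check how many distinct tags there are. This is a small sketch on a made-up document (the names `toy`, `tags` and `counts` are illustrative only), assuming the XML package is installed.

```r
library(XML)

# Hypothetical toy document with a few repeated tags
toy <- xmlParse("<article><body>
  <sec><title>A</title><p>x</p><p>y</p></sec>
  <sec><title>B</title><p>z</p></sec>
</body></article>", asText = TRUE)

# "//*" selects every element node; xmlName returns each tag name
tags <- xpathSApply(toy, "//*", xmlName)

# Sort so that the most frequent tags come first
counts <- sort(table(tags), decreasing = TRUE)

n_distinct <- length(counts)   # number of distinct tag names
print(n_distinct)
print(head(counts, 3))         # the three most frequent tags
```

Applied to the article above, `length(table(...))` would be one way to confirm the number of distinct tags.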

You can search down the XML tree to find the three main nodes of a PMC XML file: the front with the abstract, the body with the main text, and the back with the references.

xpathSApply(doc, "//article/*", xmlName)
[1] "front" "body"  "back"

In some cases, tag names are not specific, so searching up the tree may help find a specific type of tag. For example, article-title tags occur both within the references and within the main title group, so the title-group parent must be included in the path to return only the main title.

table( xpathSApply(doc, "//article-title/..", xmlName) )
mixed-citation    title-group 
            61              1 
xpathSApply(doc, "//title-group/article-title", xmlValue)
[1] "Burkholderia pseudomallei transcriptional adaptation in macrophages"
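
The upward search can be tried on a small made-up document. This sketch assumes only the XML package; the titles and the names `toy`, `parents` and `main` are hypothetical.

```r
library(XML)

# Hypothetical document: one main title and two cited titles
toy <- xmlParse("<article>
  <front><title-group><article-title>Main title</article-title></title-group></front>
  <back>
    <mixed-citation><article-title>Cited one</article-title></mixed-citation>
    <mixed-citation><article-title>Cited two</article-title></mixed-citation>
  </back>
</article>", asText = TRUE)

# ".." steps from each article-title up to its parent element
parents <- xpathSApply(toy, "//article-title/..", xmlName)
print(table(parents))

# Naming the parent in the path keeps only the main title
main <- xpathSApply(toy, "//title-group/article-title", xmlValue)
print(main)
```

Note that an XPath expression returns a node set, so a parent holding several matching children would be counted once in the first query.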

Captions may be associated with figures, tables, media and supplements, so listing only table captions requires adding the table-wrap node before the caption.

table( xpathSApply(doc, "//caption/parent::node()", xmlName) )
                   fig                  media supplementary-material             table-wrap 
                     8                      1                      1                      3 

xpathSApply(doc, "//table-wrap/caption", xmlValue)
[1] "Twenty-five common up-regulated genes of B. pseudomallei during intracellular growth in host macrophages relative to in vitro growth"         
[2] "Gene function enrichment analysis of B. pseudomallei common up-regulated and down-regulated genes throughout growth within host macrophages"
[3] "List of oligonucleotides used in real-time qPCR experiments" 
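
An alternative to writing one XPath per parent type is to collect every caption together with its parent tag and filter in R. This is a sketch on made-up captions, assuming the XML package; the data frame `caps` and its column names are illustrative choices.

```r
library(XML)

# Hypothetical body with one figure caption and two table captions
toy <- xmlParse("<body>
  <fig><caption><p>A figure</p></caption></fig>
  <table-wrap><caption><p>First table</p></caption></table-wrap>
  <table-wrap><caption><p>Second table</p></caption></table-wrap>
</body>", asText = TRUE)

# Pair each caption's text with the tag name of its parent node
caps <- data.frame(
  parent  = xpathSApply(toy, "//caption", function(x) xmlName(xmlParent(x))),
  caption = xpathSApply(toy, "//caption", xmlValue),
  stringsAsFactors = FALSE
)

# Filter in R instead of in the XPath expression
table_caps <- caps$caption[caps$parent == "table-wrap"]
print(table_caps)
```

Keeping the parent as a column makes it straightforward to subset figures, tables or supplements later without re-querying the tree.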

The first query below lists all 23 section titles (only the first 9 are shown), and the second lists only the 8 main sections in the document (not subsections).

sec <- xpathSApply(doc, "//body//sec/title", xmlValue)
sec
 [1] "Background"                                                        
 [2] "Results"                                                           
 [3] "Infection model and bacterial RNA isolation"                       
 [4] "The global gene expression profile"                                
 [5] "Intracellular metabolism and ion transport"                        
 [6] "Expression of virulence and virulence-associated factors"          
 [7] "Stress responses genes"                                            
 [8] "DNA topology and growth arrest within macrophages"                 
 [9] "Discussion"
 ...
   
xpathSApply(doc, "//body/sec/title", xmlValue)
[1] "Background"             "Results"                "Discussion"             "Conclusions"            "Methods"               
[6] "Competing interests"    "Authors’ contributions" "Supplementary Material"
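
The two queries differ only in `//` versus `/` between body and sec: `//` matches sec nodes at any depth below body, while `/` matches only direct children. A minimal sketch on a made-up nested document (names are illustrative; the XML package is assumed):

```r
library(XML)

# Hypothetical body: "Results" contains one subsection
toy <- xmlParse("<article><body>
  <sec><title>Results</title>
    <sec><title>Infection model</title></sec>
  </sec>
  <sec><title>Discussion</title></sec>
</body></article>", asText = TRUE)

# Descendant step: titles of sections at every depth
all_titles <- xpathSApply(toy, "//body//sec/title", xmlValue)

# Child step: titles of top-level sections only
main_titles <- xpathSApply(toy, "//body/sec/title", xmlValue)

# Titles that belong to subsections only
sub_only <- setdiff(all_titles, main_titles)
print(sub_only)
```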

In many cases, we would like to find features associated with subsections such as "Expression of virulence and virulence-associated factors" above. Our current solution is to use xmlAncestors to count the ancestors of each section title, which gives its level in the tree hierarchy. We then use the path.string function to write the full path string, so a subsection at any depth carries a delimited list of all its parent titles.

n <- xpathSApply(doc, "//body//sec/title", function(y) length(xmlAncestors(y) ))
path <- path.string(sec, n)
path
 [1] "Background"                                                                 
 [2] "Results"                                                                    
 [3] "Results; Infection model and bacterial RNA isolation"                       
 [4] "Results; The global gene expression profile"                                
 [5] "Results; Intracellular metabolism and ion transport"                        
 [6] "Results; Expression of virulence and virulence-associated factors" 
 ...
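
To illustrate how titles and ancestor counts can be combined into path strings, here is a hypothetical re-implementation. It is a sketch only, not the package's path.string: the function name `path_string_sketch`, its `sep` argument and the toy document are assumptions, and it assumes every sec element has a title.

```r
library(XML)

# Hypothetical helper for illustration; NOT the package's path.string
path_string_sketch <- function(titles, n, sep = "; ") {
  level <- n - min(n) + 1        # 1 = main section, 2 = subsection, ...
  stack <- character(0)          # titles of the currently open sections
  out <- character(length(titles))
  for (i in seq_along(titles)) {
    # keep the open ancestors above this level, then add the current title
    stack <- c(stack[seq_len(level[i] - 1)], titles[i])
    out[i] <- paste(stack, collapse = sep)
  }
  out
}

# Made-up nested sections, three levels deep
toy <- xmlParse("<body>
  <sec><title>Background</title></sec>
  <sec><title>Results</title>
    <sec><title>Infection model</title></sec>
    <sec><title>Expression profile</title>
      <sec><title>Deep</title></sec>
    </sec>
  </sec>
  <sec><title>Discussion</title></sec>
</body>", asText = TRUE)

titles <- xpathSApply(toy, "//body//sec/title", xmlValue)
n <- xpathSApply(toy, "//body//sec/title", function(y) length(xmlAncestors(y)))
paths <- path_string_sketch(titles, n)
print(paths)
```

The sketch uses depth relative to the shallowest title rather than the raw ancestor count, so it does not depend on how many nodes sit above body in a given document.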