Using XPath
This section describes some internal XML parsing details using the XML package. The pmcOAI function uses xmlParse to generate the XML tree within the R session, so the result is stored as an object of class XMLInternalDocument and can be queried using the XPath expressions below.
id <- "PMC3418162"
doc <- pmcOAI(id)
This query lists all 87 tags and counts the number of occurrences of each (only the first 16 are displayed).
summary(doc) # or
table( xpathSApply(doc, "//*", xmlName))
          abstract                aff            article article-categories
                 1                  2                  1                  1
        article-id       article-meta      article-title               back
                 4                  1                 62                  1
              body               bold            caption                col
                 1                 40                 13                 13
          colgroup            contrib      contrib-group   copyright-holder
                 3                  3                  1                  1
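The tag count can be tried without downloading anything. The following is a minimal sketch that assumes only the XML package is installed; the miniature document in txt is hypothetical and merely imitates the shape of a PMC record.

```r
library(XML)

# Hypothetical miniature article, standing in for a real PMC record
txt <- "<article>
  <front><title-group><article-title>Main title</article-title></title-group></front>
  <body><sec><title>Background</title><p>Some text</p></sec></body>
  <back><mixed-citation><article-title>Cited title</article-title></mixed-citation></back>
</article>"

# Parse from a string instead of a file or URL
toy <- xmlParse(txt, asText = TRUE)
inherits(toy, "XMLInternalDocument")      # the class that the XPath queries operate on

tags <- xpathSApply(toy, "//*", xmlName)  # one tag name per element node
table(tags)                               # occurrences of each tag name
```

Because "//*" matches every element in the tree, the length of tags equals the total number of element nodes, and table() collapses it to one count per distinct name.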
You can search down the XML tree to find the three main nodes of a PMC XML file: the front (with the abstract), the body (with the main text), and the back (with the references).
xpathSApply(doc, "//article/*", xmlName)
[1] "front" "body" "back"
In some cases, tag names are not specific, so searching up the tree can help identify a particular use of a tag. For example, article-title tags appear both in the references and in the main title group, so the parent tag must be included in the path to return only the main title.
table( xpathSApply(doc, "//article-title/..", xmlName) )
mixed-citation    title-group
            61              1
xpathSApply(doc, "//title-group/article-title", xmlValue)
[1] "Burkholderia pseudomallei transcriptional adaptation in macrophages"
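The same disambiguation can be reproduced on a small example. This is a sketch on a hypothetical two-title document (not the real record), assuming only the XML package. It also shows an equivalent form that filters with a parent:: predicate instead of spelling out the parent in the path.

```r
library(XML)

# Hypothetical document with one main title and one cited title
toy <- xmlParse(
  "<article>
     <front><title-group><article-title>Main title</article-title></title-group></front>
     <back><mixed-citation><article-title>Cited title</article-title></mixed-citation></back>
   </article>",
  asText = TRUE)

# Step up one level from every article-title to see which tags contain it
parents <- xpathSApply(toy, "//article-title/..", xmlName)
table(parents)

# Keep only the article-title whose parent is title-group (two equivalent paths)
main1 <- xpathSApply(toy, "//title-group/article-title", xmlValue)
main2 <- xpathSApply(toy, "//article-title[parent::title-group]", xmlValue)
```

Both main1 and main2 hold only the main title. The predicate form can be convenient when the condition on the parent is one of several filters.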
Captions may also be associated with figures, tables and supplements, so listing only table captions requires adding the table-wrap node before the caption.
table( xpathSApply(doc, "//caption/parent::node()", xmlName) )
                   fig                  media supplementary-material             table-wrap
                     8                      1                      1                      3
xpathSApply(doc, "//table-wrap/caption", xmlValue)
[1] "Twenty-five common up-regulated genes of B. pseudomallei during intracellular growth in host macrophages relative to in vitro growth"
[2] "Gene function enrichment analysis of B. pseudomallei common up-regulated and down-regulated genes throughout growth within host macrophages"
[3] "List of oligonucleotides used in real-time qPCR experiments"
The first query below lists all 23 section titles (only the first 9 are shown), and the second lists only the 8 main sections in the document (not subsections).
sec <- xpathSApply(doc, "//body//sec/title", xmlValue)
sec
[1] "Background"
[2] "Results"
[3] "Infection model and bacterial RNA isolation"
[4] "The global gene expression profile"
[5] "Intracellular metabolism and ion transport"
[6] "Expression of virulence and virulence-associated factors"
[7] "Stress responses genes"
[8] "DNA topology and growth arrest within macrophages"
[9] "Discussion"
xpathSApply(doc, "//body/sec/title", xmlValue)
[1] "Background" "Results" "Discussion" "Conclusions" "Methods"
[6] "Competing interests" "Authors’ contributions" "Supplementary Material"
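The only difference between the two queries is the axis between body and sec: "//" matches sec elements at any depth below body, while "/" matches only direct children. A minimal sketch on a hypothetical nested document, assuming only the XML package:

```r
library(XML)

# Hypothetical body with one nested subsection
toy <- xmlParse(
  "<article><body>
     <sec><title>Results</title>
       <sec><title>Infection model</title></sec>
     </sec>
     <sec><title>Discussion</title></sec>
   </body></article>",
  asText = TRUE)

# Descendant axis: titles of sections at every level
all_titles <- xpathSApply(toy, "//body//sec/title", xmlValue)

# Child axis: titles of top-level sections only
top_titles <- xpathSApply(toy, "//body/sec/title", xmlValue)

# Titles that belong to subsections
setdiff(all_titles, top_titles)
```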
In many cases, we would like to find features associated with subsections such as "Expression of virulence and virulence-associated factors" above. Our current solution is to use xmlAncestors to count the ancestors of each section title, which gives its level in the tree/hierarchy. We then use the path.string function to write the entire path string, so a subsection at any depth is labeled with a delimited list of all its parent titles.
n <- xpathSApply(doc, "//body//sec/title", function(y) length(xmlAncestors(y) ))
path <- path.string(sec, n)
path
[1] "Background"
[2] "Results"
[3] "Results; Infection model and bacterial RNA isolation"
[4] "Results; The global gene expression profile"
[5] "Results; Intracellular metabolism and ion transport"
[6] "Results; Expression of virulence and virulence-associated factors"
...
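To make the idea concrete, here is a sketch of how such path strings could be built from a vector of titles and a vector of depths. The function path_sketch is a hypothetical stand-in written in base R; it is not the package's path.string, whose implementation may differ. It assumes titles are in document order and that the depth never increases by more than one level between consecutive titles.

```r
# Hypothetical stand-in for path.string(); not the package's implementation
path_sketch <- function(titles, depth, sep = "; ") {
  depth <- depth - min(depth) + 1L   # rescale so top-level sections have depth 1
  current <- character(0)            # titles on the path down to the current section
  out <- character(length(titles))
  for (i in seq_along(titles)) {
    # Keep the ancestors above this level, then append this title
    current <- c(current[seq_len(depth[i] - 1L)], titles[i])
    out[i] <- paste(current, collapse = sep)
  }
  out
}

# Titles and ancestor counts as they might come back from the queries above
titles <- c("Background", "Results",
            "Infection model and bacterial RNA isolation", "Discussion")
depth  <- c(4L, 4L, 5L, 4L)
paths  <- path_sketch(titles, depth)
paths
```

Only the relative depth matters here, so it makes no difference how many ancestors a top-level title has in the real tree.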