Home
Neven Jovanović edited this page Jun 11, 2020 · 3 revisions
Notes and explanations for various sub-projects.
The work described below was carried out by Petar Soldo as a LiLa Erasmus intern at the Università Cattolica del Sacro Cuore, CIRCSE, Milan, Italy, in the summer semester 2019/2020.
The subset as an XQuery variable:
    declare variable $docs := (
      "aa-vv-supetarski.xml", "sisgor-g-prosopopeya.xml", "modr-n-navic.xml",
      "marulus-m-carmina008.xml", "sisgor-g-odae.xml", "bunic-j-de-r.xml", "tubero-comm-rhac.xml",
      "andreis-f-epist-nadasd.xml", "benesa-d_epigr03_croala5095251.croala-lat1.xml",
      "gradic-s-oratio.xml", "boskovic-r-ecl.xml", "kunic-r-hymnus-cererem.xml", "milasin-f-viator.xml"
    );
- Define a subset of CroALa files and copy it to another directory: create the target directory first, then run the XQuery script create-subset-from-selected-files.xq in BaseX.
- Alternatively, clone the `croatiae-auctores-latini-textus` repository, which already contains the subset.
- Create a database from the subset: createCroALaDBfromsubset.xq
- Create a list of words in the subset: wordlist-from-subset-db.xq
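
The word-list step can be illustrated with a small sketch. This is not the content of wordlist-from-subset-db.xq, which is not reproduced on this page; the function name `local:wordlist` and the inline sample are hypothetical, and a real run would read from the subset database instead of an inline element.

```xquery
(: Hypothetical helper: distinct, lower-cased, sorted words found in a node. :)
declare function local:wordlist($node as node()) as xs:string* {
  let $words :=
    for $m in analyze-string(string($node), '\w+')/fn:match
    return lower-case($m)
  for $w in distinct-values($words)
  order by $w
  return $w
};

(: Inline sample standing in for a document from the subset. :)
local:wordlist(<p>Arma virumque cano, Troiae qui primus ab oris; arma!</p>)
```

For the sample, the result is the sequence "ab", "arma", "cano", "oris", "primus", "qui", "troiae", "virumque"; the repeated "arma" is reported once.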

1. Inside the `TEI/text` node of the document, tokenize all text nodes; wrap words in a `w` tag and punctuation in a `pc` tag.
2. Skip all elements that carry the attribute `@ana="editorial"`.
3. Replace the original `TEI/text` node with the updated node.
4. Export the files into the subset-tokenized directory.
Tasks 1-3 are performed by the XQuery script subset-tokenize-w-pc.xq. Task 4 is done by the script subset-export-files.xq.
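
The replacement step (task 3) could look roughly like the following sketch. It assumes a processor with XQuery Update support, such as BaseX; the helper name `local:replace-text` and the tiny sample documents are hypothetical, and it is not taken from subset-tokenize-w-pc.xq. The `*:text` name test is used so that the sketch also works when the elements are in the TEI namespace.

```xquery
(: Hypothetical helper: return a copy of $doc whose text child is replaced by $new. :)
declare function local:replace-text($doc as element(), $new as element()) as element() {
  copy $c := $doc
  modify replace node $c/*:text with $new
  return $c
};

(: Sample input: an untokenized document and an already tokenized text node. :)
let $doc := <TEI><teiHeader/><text><body><p>Salve!</p></body></text></TEI>
let $new := <text><body><p><w>Salve</w><pc>!</pc></p></body></text>
return local:replace-text($doc, $new)
```

The copy/modify/return form leaves the original document untouched and returns a modified copy, which is convenient when the result is to be exported to a separate directory.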
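The algorithm outlined above uses a recursive function that treats `text()` nodes differently from other nodes: text nodes are tokenized, while elements are copied and processed recursively. Elements with `@ana="editorial"` and `g` elements are copied unchanged: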
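    (: Recursively copy an element, replacing its text nodes with w and pc elements. :)
    declare function local:copy-nodes-filter-text($element as element()) as element() {
      (: Editorial elements and g elements are copied unchanged. :)
      if ($element[@ana = "editorial" or name() = "g"]) then $element
      else element { node-name($element) } {
        $element/@*,
        for $child in $element/node()
        return
          if ($child instance of text())
          then for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
          else if ($child instance of element())
          then local:copy-nodes-filter-text($child)
          else $child (: comments and processing instructions pass through :)
      }
    };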
The actual tokenization is done with the following function:
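    (: Wrap each run of word characters in w and everything else in pc. :)
    declare function local:tokenize-words-pc($token as xs:string) as element()* {
      for $part in analyze-string($token, '\w+')/*
      return
        if ($part instance of element(fn:match))
        then element w { $part/string() }
        else element pc { $part/string() }
    };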
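
To see what the function iterates over, the following short query lists the children of the `analyze-string` result for a sample token. The input string "cano," is just an illustration.

```xquery
(: Each fn:match holds a run matched by \w+; each fn:non-match holds the rest. :)
for $part in analyze-string("cano,", '\w+')/*
return local-name($part) || ": " || string($part)
```

This returns the two strings "match: cano" and "non-match: ,", in that order.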
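The `analyze-string` XQuery function does the key work here: it splits a string into `fn:match` and `fn:non-match` elements in their original order, so both the words and the punctuation between them are preserved.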
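
As a minimal end-to-end sketch, the query below repeats a condensed copy of the two functions and applies them to a small inline fragment. The sample paragraph is hypothetical, and the `instance of` tests are one possible way to write the node checks, not necessarily how subset-tokenize-w-pc.xq does it.

```xquery
(: Wrap each run of word characters in w and everything else in pc. :)
declare function local:tokenize-words-pc($token as xs:string) as element()* {
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part instance of element(fn:match))
    then element w { string($part) }
    else element pc { string($part) }
};

(: Recursively copy an element, tokenizing its text nodes. :)
declare function local:copy-nodes-filter-text($element as element()) as element() {
  if ($element[@ana = "editorial" or name() = "g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if ($child instance of text())
      then for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
      else if ($child instance of element())
      then local:copy-nodes-filter-text($child)
      else $child
  }
};

(: Hypothetical sample: one editorial note inside running text. :)
local:copy-nodes-filter-text(
  <p>Salve, <note ana="editorial">sic</note> munde!</p>
)
```

Serialized without indentation, the result is `<p><w>Salve</w><pc>,</pc><note ana="editorial">sic</note><w>munde</w><pc>!</pc></p>`. The editorial note is left as it was, and the whitespace between tokens is discarded by `tokenize`, so the output contains no spaces between the `w` and `pc` elements.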