Neven Jovanović edited this page Jun 11, 2020 · 3 revisions

Croatiae auctores Latini (CroALa) – exploration and documentation

Notes and explanations for various sub-projects.

Linguistic analysis of a subset of CroALa texts

Work by Petar Soldo as a LiLa Erasmus intern at CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy, summer semester 2019/2020.

The subset as an XQuery variable:

declare variable $docs := ("aa-vv-supetarski.xml", "sisgor-g-prosopopeya.xml", "modr-n-navic.xml", 
"marulus-m-carmina008.xml", "sisgor-g-odae.xml", "bunic-j-de-r.xml", "tubero-comm-rhac.xml", 
"andreis-f-epist-nadasd.xml", "benesa-d_epigr03_croala5095251.croala-lat1.xml", 
"gradic-s-oratio.xml", "boskovic-r-ecl.xml", "kunic-r-hymnus-cererem.xml", "milasin-f-viator.xml");

  1. Define a subset of CroALa files and copy it to a separate directory: create the target directory first, then run the XQuery script create-subset-from-selected-files.xq in BaseX.
  2. Alternatively, clone the croatiae-auctores-latini-textus repository, which already contains the subset.
  3. Create a database from the subset with createCroALaDBfromsubset.xq.
  4. Create a list of the words in the subset with wordlist-from-subset-db.xq.
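
Step 4 can be sketched as follows. This is only an illustration, not the content of wordlist-from-subset-db.xq: the two inline documents stand in for the subset database, the TEI namespace is assumed, and the function name local:wordlist and the output element names are hypothetical.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical stand-in for the subset database built in step 3. :)
declare variable $sample := (
  <TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>Arma virumque cano, arma.</p></body></text></TEI>,
  <TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>Cano carmina.</p></body></text></TEI>
);

(: Hypothetical helper: one w element per distinct lower-cased word form,
   with its frequency in @n, sorted alphabetically.
   Only the text part of each document is read, not the header. :)
declare function local:wordlist($docs as element()*) as element(wordlist) {
  <wordlist>{
    for $m in $docs/tei:text//text() ! analyze-string(., '\w+')/fn:match
    group by $form := lower-case($m)
    order by $form
    return <w n="{count($m)}">{$form}</w>
  }</wordlist>
};

local:wordlist($sample)
```

For the two sample documents this returns four entries: arma (2), cano (2), carmina (1), virumque (1).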

Tokenize words and punctuation in original text (not metadata) of documents

  1. Inside the TEI/text node of the document, tokenize all text nodes, wrapping words in w elements and punctuation in pc elements.
  2. Leave every element carrying @ana="editorial" untouched (the script also leaves g elements untouched).
  3. Replace the original TEI/text node with the updated node.
  4. Export the files into the subset-tokenized directory.

Tasks 1-3 are performed by the XQuery script subset-tokenize-w-pc.xq; task 4 is performed by subset-export-files.xq.
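
Task 3 could be done with an XQuery Update transform expression, which BaseX supports. The sketch below is an assumption about one possible approach, not the code of subset-tokenize-w-pc.xq. The helper names local:process and local:replace-text are hypothetical, and local:process is a placeholder that only inserts a comment where the real tokenizer would rewrite the content.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical sample document. :)
declare variable $sample :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader><title>Sample</title></teiHeader>
    <text><body><p>Arma virumque cano.</p></body></text>
  </TEI>;

(: Placeholder for the tokenizer: copies the text element and marks it with a comment. :)
declare function local:process($text as element(tei:text)) as element(tei:text) {
  element { node-name($text) } { $text/@*, comment { "tokenized" }, $text/node() }
};

(: Returns a copy of the document in which only TEI/text is replaced;
   the header and the original document stay unchanged. :)
declare function local:replace-text($doc as element(tei:TEI)) as element(tei:TEI) {
  copy $new := $doc
  modify replace node $new/tei:text with local:process($new/tei:text)
  return $new
};

local:replace-text($sample)
```

Because the transform expression works on a copy, the stored document is changed only if the result is written back or exported afterwards.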

The algorithm outlined above uses a recursive function to distinguish between text() nodes and others:

declare function local:copy-nodes-filter-text($element) {
  (: editorial additions and g elements are copied unchanged :)
  if ($element[@ana="editorial" or name()="g"]) then $element
  else element { node-name($element) }
             { $element/@*,
               for $child in $element/node()
                  return if ($child/self::element())
                    then local:copy-nodes-filter-text($child)
                    else if ($child/self::text())
                    then for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
                    (: comments and processing instructions pass through as they are :)
                    else $child
             }
};

The actual tokenization is done with the following function:

declare function local:tokenize-words-pc($token){
  (: analyze-string returns fn:match children for word runs and fn:non-match for the rest :)
  for $part in analyze-string($token, '\w+')/*
   return  if ($part/self::fn:match) then element w { $part/string()}
      else element pc { $part/string()}
};
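
To see the two functions working together, here is a self-contained run on a hypothetical fragment. Assumptions: the fragment is in no namespace for brevity, type annotations are added, and comments and processing instructions are passed through unchanged.

```xquery
xquery version "3.1";

(: Hypothetical input: one paragraph with an editorial note inside it. :)
declare variable $sample :=
  <p>Arma, <note ana="editorial">sic in MS.</note> virumque cano.</p>;

(: Wrap word runs in w and everything else in pc. :)
declare function local:tokenize-words-pc($token as xs:string) as element()* {
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part/self::fn:match) then element w { string($part) }
    else element pc { string($part) }
};

(: Rebuild the tree; only text nodes outside editorial and g elements are tokenized. :)
declare function local:copy-nodes-filter-text($element as element()) as element() {
  if ($element[@ana = "editorial" or name() = "g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if ($child instance of element())
      then local:copy-nodes-filter-text($child)
      else if ($child instance of text())
      then for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
      else $child
  }
};

local:copy-nodes-filter-text($sample)
```

The result is a p element containing w "Arma", pc ",", the untouched note, w "virumque", w "cano" and pc ".". Whitespace between tokens is not carried over into the output, because the text is split on whitespace before the words and punctuation are wrapped.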

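The analyze-string XQuery function does the key work here: it splits a string into fn:match and fn:non-match elements in their original order, so words and punctuation can be wrapped in a single pass.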
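
A small sketch of what analyze-string returns for a single token; the helper name local:parts and the sample token are hypothetical.

```xquery
xquery version "3.1";

(: Hypothetical helper: list each part of the result as "local-name: content". :)
declare function local:parts($token as xs:string) as xs:string* {
  for $part in analyze-string($token, "\w+")/*
  return local-name($part) || ": " || string($part)
};

local:parts("(cano),")
(: → "non-match: (", "match: cano", "non-match: )," :)
```

Note that adjacent punctuation characters such as ")," come back as a single non-match, so with the tokenizing function above they end up in one pc element.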
