Annotating PDFs to generate RDF? #4535

tpluscode · 2024-02-21T14:19:53Z

I have a personal project where I publish a large number of scanned PDF documents containing information from the public transportation industry (bus technical details, manufacturers, etc.).

What I want to do is to publish all that information as RDF Linked Data. Short of manually copying and pasting to Turtle, or something, I was looking for a PDF annotation tool which would let me label entities, their relations and properties in a form close enough that I can more easily produce RDF in my desired ontology/structure.

INCEpTION comes very close, but I failed to find the right functionality for exporting the annotations in a way that is useful to me. What are my options?

reckart · 2024-02-21T18:29:44Z

reckart
Feb 21, 2024
Maintainer

The current approach would be:

1. Define or import a knowledge base that contains the concepts you want to use as labels.
2. Define a span layer Entity with a single concept property iri.
   - If you want entities to be limited to concepts from the knowledge base, set the allowed type to classes only (or possibly to instances only).
3. Define a relation layer Relation with a single concept property iri.
   - If you want relations to be limited to properties from the knowledge base, set the allowed type to property.
4. Annotate a few entities and link them using the relations.
5. Export the document as UIMA CAS JSON 0.4.0.
6. Use DKPro Cassis to load the exported JSON file:

from cassis import load_cas_from_json

# Load the UIMA CAS JSON file exported from INCEpTION
with open('cas.json', 'rb') as f:
    cas = load_cas_from_json(f)

Then select the relations and write them out as triples, effectively creating a file in the N-Triples format:

# Governor = subject, relation IRI = predicate, Dependent = object
for rel in cas.select('webanno.custom.Relation'):
    print(f'<{rel.get("Governor.iri")}> <{rel.get("iri")}> <{rel.get("Dependent.iri")}> .')

This would create statements from all relations where the source of the relation is the subject, the relation is the predicate and the target of the relation is the object.
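For readers who would rather not depend on cassis, the same mapping can be sketched with only the standard library. This is a rough illustration, not part of INCEpTION or cassis: it assumes the export lists its feature structures under %FEATURE_STRUCTURES and stores references as @-prefixed numeric IDs. The sample data, the example.org IRIs and the function name are made up.

```python
import json

# Hypothetical, hand-written fragment in the spirit of a UIMA CAS JSON export.
SAMPLE = json.loads("""
{
  "%FEATURE_STRUCTURES": [
    {"%ID": 1, "%TYPE": "uima.cas.Sofa", "sofaString": "Autosan built the H10."},
    {"%ID": 2, "%TYPE": "webanno.custom.Entity", "@sofa": 1, "begin": 0, "end": 7,
     "iri": "http://example.org/Autosan"},
    {"%ID": 3, "%TYPE": "webanno.custom.Entity", "@sofa": 1, "begin": 18, "end": 21,
     "iri": "http://example.org/H10"},
    {"%ID": 4, "%TYPE": "webanno.custom.Relation", "@sofa": 1, "begin": 18, "end": 21,
     "@Governor": 2, "@Dependent": 3, "iri": "http://example.org/built"}
  ]
}
""")


def relation_triples(cas_json, relation_type="webanno.custom.Relation"):
    """Return one N-Triples line per relation annotation."""
    # Index all feature structures by ID so references can be resolved.
    by_id = {fs["%ID"]: fs for fs in cas_json["%FEATURE_STRUCTURES"]}
    lines = []
    for fs in cas_json["%FEATURE_STRUCTURES"]:
        if fs.get("%TYPE") != relation_type:
            continue
        subject = by_id[fs["@Governor"]]["iri"]   # source of the relation
        obj = by_id[fs["@Dependent"]]["iri"]      # target of the relation
        lines.append(f"<{subject}> <{fs['iri']}> <{obj}> .")
    return lines


for line in relation_triples(SAMPLE):
    print(line)
```

Relations whose endpoints have no iri value would raise a KeyError here; a real script would probably want to skip or report them.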

If you use a local knowledge base that you build in INCEpTION, the IRIs of its concepts are auto-generated with random identifiers. If you import an externally built knowledge base, you usually get more descriptive, human-readable IRIs.

Example local knowledge base

Example text annotated with entities and relations

In either case, you could merge the N-Triples statements mentioned above with your knowledge base to enrich it with information about the subjects.
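If the knowledge base is also available as N-Triples, an offline merge can be as simple as a union of statement lines, because the format puts one statement on each line. The helper below is a hypothetical sketch of that idea; it only removes textually identical lines and does not validate the RDF.

```python
def merge_ntriples(*documents):
    """Merge N-Triples documents, dropping blank lines, comments and exact duplicates."""
    seen = set()
    merged = []
    for document in documents:
        for line in document.splitlines():
            statement = line.strip()
            if not statement or statement.startswith("#"):
                continue
            if statement not in seen:
                seen.add(statement)
                merged.append(statement)
    return "\n".join(merged) + "\n"


# Made-up example data
kb = "<http://example.org/H10> <http://example.org/label> \"H10\" .\n"
annotations = (
    "<http://example.org/Autosan> <http://example.org/built> <http://example.org/H10> .\n"
    "<http://example.org/H10> <http://example.org/label> \"H10\" .\n"
)
print(merge_ntriples(kb, annotations))
```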

Example concept enriched with the N-Triples generated from the annotations by the script snippet above, after importing the results back into the INCEpTION knowledge base

Here is the toy project from which I generated the screenshots above.
annotating-with-knowledge-base-example.zip

I hope in future versions of INCEpTION, we can have a project template and associated data exporter that may avoid having to use an external script -- at least for simple cases.

Feedback and suggestions are welcome.

4 replies

tpluscode Feb 22, 2024
Author

Thank you very much for an inspiring example! I am looking for a more sophisticated approach where I would not only map to existing entities from a KG but also extract additional knowledge. Based on your example, I think I am getting closer to my desired result.

Here's my exported project: demo-31652343198146808040265.zip

I created an Entity layer as you proposed and mapped it to an existing SPARQL KB.
I created a Spec layer with:
1. a Link-type property, which allows me to attach specific values as details of an Entity
2. a required property named "property" (a string for now, for simplicity), which defines the relation semantically
3. a unit property, needed for numeric values (this will also map to an external KB in the real thing)

I love the way links are displayed on the right hand side by the way 🤩

The desired output would be something like this, for a single property:

<>
  ex:describes <https://new.wikibus.org/vehicle/autosan/h10> ; # Entity.iri
  schema:length [             # Spec link
    schema:value "11 200" ;   # Spec text
    schema:unit unit:MilliM ; # Spec unit
  ] ;
.
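To make the intended mapping concrete, here is a minimal sketch of how one Spec annotation could be turned into that shape once its pieces have been extracted. The function and parameter names are hypothetical, and the ex:, schema: and unit: prefixes are assumed to be declared elsewhere in the output.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def spec_to_turtle(entity_iri, predicate, value, unit):
    """Render one Entity/Spec pair in the Turtle shape shown above."""
    return (
        "<>\n"
        f"  ex:describes <{entity_iri}> ;\n"           # Entity.iri
        f"  {predicate} [\n"                            # Spec "property" value
        f'    schema:value "{escape_literal(value)}" ;\n'  # Spec covered text
        f"    schema:unit {unit} ;\n"                   # Spec unit
        "  ] ;\n"
        ".\n"
    )


print(spec_to_turtle(
    "https://new.wikibus.org/vehicle/autosan/h10",
    "schema:length",
    "11 200",
    "unit:MilliM",
))
```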

As a general problem with the UIMA CAS JSON format, I found that it will not be simple to retrace those Entity-Spec links to create the triples. Is that something which the Python library makes easy to navigate?
Additionally, I'm missing the actual annotated value. I saw that I would have to find the value myself by following @sofa/begin/end to substring the sofaString. That is not a big deal and maybe also easily accessible from DKPro-cassis?

Side quest: I'm mostly writing JS, but it doesn't look like there is any JS library that would be useful for transforming the exported formats to triples?

reckart Feb 24, 2024
Maintainer

> Side quest: I'm mostly writing JS but doesn't look like there is any JS lib which will be useful to transform the exported formats to triples?

There is currently no JS library I know of that supports the UIMA CAS JSON or UIMA CAS XMI formats. It should not be too difficult to implement one based on the UIMA JSON specs. There is also a set of test files against which such an implementation can be tested.

reckart Feb 24, 2024
Maintainer

> As a general problem with the UIMA CAS JSON format, I found that it will not be simple to retrace those Entity-Spec links to create the triples. Is that something which the python library makes easy to navigate?

You use the select and get methods. The endpoints of a relation are accessible from the relation annotations as Dependent and Governor using the get method. In order to find which relations a span annotation connects to, you would have to search for relations that have an endpoint matching that span annotation.

> Additionally, I'm missing the actual annotated value. I saw that I would have to find the value myself by following @sofa/begin/end to substring the sofaString. That is not a big deal and maybe also easily accessible from DKPro-cassis?

Once you have an annotation, you can use its get_covered_text() method.
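For anyone working on the raw JSON instead of through cassis, both lookups described above can be approximated with plain dictionaries. This is only a sketch under assumptions: feature structures are expected under %FEATURE_STRUCTURES, references are @-prefixed numeric IDs, and the sample data and helper names are invented for illustration.

```python
# Hypothetical miniature of an exported CAS, written by hand.
CAS = {
    "%FEATURE_STRUCTURES": [
        {"%ID": 1, "%TYPE": "uima.cas.Sofa", "sofaString": "Autosan built the H10."},
        {"%ID": 2, "%TYPE": "webanno.custom.Entity", "@sofa": 1, "begin": 0, "end": 7},
        {"%ID": 3, "%TYPE": "webanno.custom.Entity", "@sofa": 1, "begin": 18, "end": 21},
        {"%ID": 4, "%TYPE": "webanno.custom.Relation", "@sofa": 1,
         "begin": 18, "end": 21, "@Governor": 2, "@Dependent": 3},
    ]
}


def index_by_id(cas_json):
    """Map each feature structure ID to its dictionary."""
    return {fs["%ID"]: fs for fs in cas_json["%FEATURE_STRUCTURES"]}


def covered_text(cas_json, annotation):
    """Slice the annotated text out of the sofaString via @sofa/begin/end."""
    # Caveat: UIMA offsets count UTF-16 code units, so text containing
    # characters outside the BMP would need an offset conversion omitted here.
    sofa = index_by_id(cas_json)[annotation["@sofa"]]
    return sofa["sofaString"][annotation["begin"]:annotation["end"]]


def relations_touching(cas_json, span_id, relation_type="webanno.custom.Relation"):
    """Find relations that have the given span as Governor or Dependent."""
    return [
        fs for fs in cas_json["%FEATURE_STRUCTURES"]
        if fs.get("%TYPE") == relation_type
        and span_id in (fs.get("@Governor"), fs.get("@Dependent"))
    ]


entity = index_by_id(CAS)[3]
print(covered_text(CAS, entity))
print([rel["%ID"] for rel in relations_touching(CAS, entity["%ID"])])
```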

tpluscode Feb 26, 2024
Author

It is very far from my sphere of interest to actually work with UIMA CAS as a model. All I really need is to rebuild the relations from annotations and be able to transform it further with familiar tooling.

As mentioned in #4549, I was able to do a quick and dirty modification of the CAS JSON so that it becomes valid JSON-LD:

- Rewrite numeric identifiers into blank node IDs.
- Replace @foo properties, because JSON-LD apparently does not allow them.
- Not strictly necessary, but I also renamed %ID and %TYPE to @id and @type respectively.
- Apply a JSON-LD @context with mappings of the CAS terms plus dedicated terms to match the data model.
- To reduce the output, filter to keep only feature structures whose type matches /^webanno.custom./, /^uima.cas.Sofa$/ or /^uima.cas.FSArray$/.
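The steps above could look roughly like this in Python. This is not the code from the gist, only a sketch of the same idea: stripping the leading @ from reference keys is my assumption about how to "replace" them, the dots in the filter patterns are escaped, and the @context step is left out because its contents are not given here. Keys other than %ID, %TYPE and @-references are copied unchanged, so references held inside arrays would still need the same rewrite.

```python
import re

# Type filters taken from the list above (dots escaped).
KEEP_TYPES = [
    re.compile(r"^webanno\.custom\."),
    re.compile(r"^uima\.cas\.Sofa$"),
    re.compile(r"^uima\.cas\.FSArray$"),
]


def blank_node(numeric_id):
    """Turn a numeric CAS ID into a blank node identifier."""
    return f"_:b{numeric_id}"


def to_jsonld(cas_json):
    """Rewrite CAS JSON feature structures into JSON-LD-style nodes."""
    graph = []
    for fs in cas_json["%FEATURE_STRUCTURES"]:
        if not any(pattern.search(fs["%TYPE"]) for pattern in KEEP_TYPES):
            continue  # drop tokens, sentences and other layers
        node = {}
        for key, value in fs.items():
            if key == "%ID":
                node["@id"] = blank_node(value)
            elif key == "%TYPE":
                node["@type"] = value
            elif key.startswith("@"):
                # Reference feature: assumed to hold the ID of another node.
                node[key[1:]] = {"@id": blank_node(value)}
            else:
                node[key] = value
        graph.append(node)
    return {"@graph": graph}


# Made-up sample input
CAS = {
    "%FEATURE_STRUCTURES": [
        {"%ID": 1, "%TYPE": "uima.cas.Sofa", "sofaString": "Autosan built the H10."},
        {"%ID": 2, "%TYPE": "webanno.custom.Entity", "@sofa": 1, "begin": 0, "end": 7,
         "iri": "http://example.org/Autosan"},
        {"%ID": 3, "%TYPE": "example.Token", "@sofa": 1, "begin": 0, "end": 7},
    ]
}
print(to_jsonld(CAS))
```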

I put the full example in a gist: https://gist.github.com/tpluscode/010da6cd2af6a8018bb6abf608cb5606

The annotations are not perfect, so the output is a little wonky but it proves a point.

reckart · 2024-02-26T21:52:45Z

reckart
Feb 26, 2024
Maintainer

In #4549 you wrote:

> A simple RDF binding to the CAS format itself would totally work too. Did you have in mind embedding it in the NIF export? In other words, exporting only the annotations from CAS as RDF. Not the tokens and all?

Some time back, I implemented a CAS-to-RDF binding. The RDF data it produces looks somewhat like this:

<file:/Obama.txt#6429>
        rdf:type                 rdfcas:FeatureStructure , <uima:webanno.custom.Entity> ;
        rdfcas:indexedIn         <file:/Obama.txt#1> ;
        cas:AnnotationBase-sofa  <file:/Obama.txt#1> ;
        tcas:Annotation-begin    "159"^^xsd:int ;
        tcas:Annotation-end      "175"^^xsd:int ;
        <uima:webanno.custom.Entity-iri>
                "http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448306" .

<file:/Obama.txt#6434>
        rdf:type                 rdfcas:FeatureStructure , <uima:webanno.custom.Entity> ;
        rdfcas:indexedIn         <file:/Obama.txt#1> ;
        cas:AnnotationBase-sofa  <file:/Obama.txt#1> ;
        tcas:Annotation-begin    "179"^^xsd:int ;
        tcas:Annotation-end      "184"^^xsd:int ;
        <uima:webanno.custom.Entity-iri>
                "http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448281" .

<file:/Obama.txt#6439>
        rdf:type                 rdfcas:FeatureStructure , <uima:webanno.custom.Relation> ;
        rdfcas:indexedIn         <file:/Obama.txt#1> ;
        cas:AnnotationBase-sofa  <file:/Obama.txt#1> ;
        tcas:Annotation-begin    "159"^^xsd:int ;
        tcas:Annotation-end      "175"^^xsd:int ;
        <uima:webanno.custom.Relation-Dependent>
                <file:/Obama.txt#6429> ;
        <uima:webanno.custom.Relation-Governor>
                <file:/Obama.txt#6434> ;
        <uima:webanno.custom.Relation-iri>
                "http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448284" .

I think it would be fairly straightforward to add this RDF binding to INCEpTION if it helps anybody (i.e. in this case you). The format is a complete CAS <-> RDF conversion. From the perspective of the CAS, tokens and sentences are also just annotations, so those would be included in the RDF format as well. The benefit over NIF would be that this binding should be able to represent all of the annotation data from INCEpTION, not just the handful of layers supported by NIF.

That said, this RDF binding probably gives you a similar kind of access to the CAS data as your JSON-LD approach - except that things in fact do contain proper IRIs in the right places instead of plain numbers.

Now let's take a step back and consider the idea again:

So the idea is that we essentially want to create triples by annotating a subject and an object and linking them with a relation that serves as the predicate, and then to export these annotations in an RDF format, right? The problem is that in order to do this kind of annotation, you have to create a custom span layer and a custom relation layer, each with features that allow linking them against the KB. If INCEpTION instead came with a set of predefined layers for this purpose, it would probably be rather easy to implement a format that is aware of exactly these predefined layers and can write them out in a simple RDF format, excluding tokens/sentences. Does that make sense?

So I imagine we could set up a project template for say "statement annotation" that comes with two layers, e.g. "Resource" and "Property". Both would have a feature "iri" pointing to a knowledge base. Maybe the "Resource" layer would additionally have a feature "literal" in case it should not link to a KB but rather represent a value (e.g. for a year, monetary value, measurement etc.).
Based on that, a format could be defined that exports all "resource-property-resource" triples in RDF syntax.

That would basically be what I did in the Python script - but without the script - because with predefined layers, INCEpTION would know the semantics of the layers and could directly provide a suitable export format.
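To illustrate what such an export could boil down to, here is a minimal sketch of the proposed data model and a triple writer. Nothing here exists in INCEpTION; the class names, field names and example.org IRIs are placeholders for the "Resource" and "Property" layers described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """A 'Resource' annotation: linked to the KB via iri, or holding a literal."""
    iri: Optional[str] = None
    literal: Optional[str] = None


@dataclass
class Statement:
    """Two resources linked by a 'Property' annotation with a KB iri."""
    subject: Resource
    property_iri: str
    obj: Resource


def term(resource):
    """Render a resource as an N-Triples IRI or plain string literal."""
    if resource.iri is not None:
        return f"<{resource.iri}>"
    text = (resource.literal or "").replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def export_statements(statements):
    """Write resource-property-resource statements as N-Triples lines."""
    lines = []
    for statement in statements:
        if statement.subject.iri is None:
            # RDF does not allow a literal in subject position.
            raise ValueError("subject must be linked to an IRI")
        lines.append(
            f"{term(statement.subject)} <{statement.property_iri}> {term(statement.obj)} ."
        )
    return lines


statements = [
    Statement(Resource(iri="http://example.org/Autosan"),
              "http://example.org/built",
              Resource(iri="http://example.org/H10")),
    Statement(Resource(iri="http://example.org/H10"),
              "http://example.org/length",
              Resource(literal="11 200")),
]
print("\n".join(export_statements(statements)))
```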

What do you think is more interesting:

- a general CAS <-> RDF binding,
- the JSON-LD approach,
- or the idea of a project template with pre-defined layers for statement annotation and an associated RDF-based export (only) format?

9 replies

tpluscode Feb 27, 2024
Author

The example uima:webanno.custom.Entity-iri property would be marked as IRI, obviously, so that the output is...

Good point :)

And on second thought, maybe this setting won't be necessary, or implicit, if the property is a KB property, right? I guess that all KB values are IRIs

reckart Feb 27, 2024
Maintainer

From the perspective of the RDF format, it seems more sensible that values that are IRIs are properly identifiable as such and not wrapped opaquely in a string. INCEpTION can provide the necessary information to the writer so that it will output IRIs with pointy brackets instead of quotes:

<admin---new-project-1/Obama.txt#6434>
        rdf:type                 rdfcas:FeatureStructure , <uima:webanno.custom.Entity> ;
        rdfcas:indexedIn         <admin---new-project-1/Obama.txt#1> ;
        cas:AnnotationBase-sofa  <admin---new-project-1/Obama.txt#1> ;
        tcas:Annotation-begin    "179"^^xsd:int ;
        tcas:Annotation-end      "184"^^xsd:int ;
        <uima:webanno.custom.Entity-iri>
                <http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448281> .
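The difference comes down to how each feature value is rendered. The helper below is a hypothetical sketch, not the writer from the draft: it assumes the caller already knows whether a feature is KB-linked, and it mimics the xsd:int typing seen in the output above.

```python
def render_value(value, is_iri=False):
    """Render a feature value as a Turtle object term."""
    if is_iri:
        # KB-linked feature: emit a proper IRI reference.
        return f"<{value}>"
    if isinstance(value, bool):
        # Checked before int because bool is a subclass of int in Python.
        return f'"{str(value).lower()}"^^xsd:boolean'
    if isinstance(value, int):
        return f'"{value}"^^xsd:int'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


iri = "http://www.ukp.informatik.tu-darmstadt.de/inception/1.0#5557c69bcb2645ac80764c7a898ab448281"
print(render_value(179))
print(render_value(iri))               # opaque string, as in the first output
print(render_value(iri, is_iri=True))  # proper IRI, as in the output above
```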

reckart Feb 27, 2024
Maintainer

Ok, we have a first "draft" here: #4568

reckart Feb 27, 2024
Maintainer

I guess the current implementation can explode in many ways, e.g. if the filename/project name has a format that the Jena RDF library would choke on...

tpluscode Feb 27, 2024
Author

> I guess the current implementation can explode in many ways, e.g. if the filename/project name has a format that the Jena RDF library would choke on...

You mean <admin---new-project-1/Obama.txt#1>? I expect Jena to nicely encode characters as needed when creating named nodes.
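Whether the RDF library encodes such names on its own is not settled in this thread, so one defensive option is to percent-encode the name parts before building the IRI. The sketch below shows that idea with the Python standard library; the function name and the project/document/ID layout are assumptions modelled on the IRIs shown above.

```python
from urllib.parse import quote


def document_iri(project, document, fs_id):
    """Build a relative IRI such as project/document#id with encoded path segments."""
    # safe="" also encodes "/" so a slash inside a name cannot add a path segment.
    return f"{quote(project, safe='')}/{quote(document, safe='')}#{fs_id}"


print(document_iri("admin - new project 1", "Obama.txt", 1))
print(document_iri("reports/2024", "bus specs.pdf", 6434))
```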
