Clean up code in jena.ext.xerces #2828

Ostrzyciel · 2024-11-06T23:19:06Z

Change

This is a continuation of #2797 – trimming out the bloat in datatyped literal validations.

That issue tackled the worst offender – allocating two hashmaps per each validated literal. This is however not the whole story. For every literal validation we still allocate a new ValidationState:

jena/jena-core/src/main/java/org/apache/jena/datatypes/xsd/XSDDatatype.java

Lines 267 to 271 in c7eaf83

    
           public Object parse(String lexicalForm) throws DatatypeFormatException { 
        
               try { 
        
                   ValidationContext context = new ValidationState(); 
        
                   ValidatedInfo resultInfo = new ValidatedInfo(); 
        
                   typeDeclaration.validate(lexicalForm, context, resultInfo);

This makes no sense, because that object is never really mutated (or, to be precise, does not need to be mutable). At the same time, it has quite a few fields. Although the JVM can probably figure out how to handle this efficiently with escape analysis, this is still unnecessary bloat.

The entire org.apache.jena.ext.xerces package contains a lot of unused code carried over from xerces. A lot of the infrastructure is not needed in Jena, because the more complex XML features make no sense in the context of RDF literals. For example, a large part of the original job of ValidationState was to check if ID and IDREF attributes are correct with respect to one another. This, along with a few other quirky XML thingies "SHOULD NOT" be used in RDF according to the spec, so I think we can safely remove this.

The plan

Remove code for special handling of datatypes: QName, ENTITY, ID, IDREF, NOTATION, IDREFS, ENTITIES, NMTOKENS – none of which are valid in RDF.
- Note that although ID/IDREF validation is implemented in Jena, it currently does not work, because the ValidationState is allocated once per literal, not once per document. And what would a "document" even mean in RDF?
Instantiate only one instance of ValidationState per Jena instance, or something to a similar effect.
Remove all kinds of unused code from the xerces package to help maintainability, make the JARs a bit smaller etc.

How do I do this?

What is the process for making breaking changes to Jena APIs? I assume that even if a public class is not used in the Jena codebase, it can be removed only in a MAJOR release. So, should I do something like:

Make a PR deprecating all the different public methods and classes and marking them for removal in Jena 6
Before Jena 6.0 release (but after last Jena 5.x release) make a PR actually removing the code

?

Notes on unused code

According to my IntelliJ, these methods of ValidationState are not used anywhere in the Jena codebase:
- All set* methods, except setEntityState, which is used only in ValidationManager, which is in turn an unused class.
- resetIDTables, reset, useNamespaces
- Of course, this needs to be double-checked.
Other unused code:
- ConfigurableValidationState class
- DatatypeValidator interface
- EntityState interface – it is only used in unused methods of ValidationState and ValidationManager
- SchemaDVFactory methods: createTypeRestriction, createTypeList, createTypeUnion – this refers to some advanced XSD features... I think? Anyway, this doesn't seem to be used in RDF.
  - Implementations of these methods in child classes (BaseSchemaDVFactory and BaseDVFactory) are, by extension, also unused.
- XSDDatatype: there is a huge comment block section that probably should be removed. The inner static class XSDGenericType is also unused.
- ValidatedInfo methods: stringValue, isComparable, copyFrom
- NamespaceContext fields: XML_URI, XMLNS_URI and methods pushContext, popContext, declarePrefix, getDeclaredPrefixCount, getDeclaredPrefixAt
  - Actually, the whole interface is effectively unused... there are no classes implementing it. There is some code passing around instances of NamespaceContext, but this must be always null.

Are you interested in contributing a pull request for this task?

Yes

The text was updated successfully, but these errors were encountered:

afs · 2024-11-10T10:54:02Z

Great analysis! ext.xerces exists to provide XSD datatypes and nothing more.

Your plan looks good and doesn't need to be coupled to Jena6, nor deprecation classes.

public within ext.xerces doesn't contribute to the Jena API; it's only the Java need for one package to access classes in another. So I don't think this is a Jena6 matter and deleting and slimming down the code can happen as "internal" improvements.

It has been valuable to have all the details of datatypes implemented in an efficient manner. Jena forked off the Xerces codebase when it switched to using the JDK parser, or user choice of XML parser, and then Jena stopped shipping a Xerces jar and ICU4J (which is huge). ext.xerces is not connected to any XML processing including RDF/XML parsing. Presumably why NamespaceContext isn't used.

There is a use of ext.xerces.impl.xpath.regex in ARQ. It gives exact implementation of REGEX in SPARQL (XPath/XQuery Functions and Operators regular expressions).
The JDK regex subsystem is very close but not exact (there is a regex flag not provided) - JDK regex is used by default.

I've opened the task #2833 to extract the regex engine for SPARQL/ARQ's use.

The only use in the datatypes area of xpath.regex I can find is in XSSimpleTypeDecl to support pattern facets. Such custom XSD datatypes are (probably!) not needed. IMO It can stay "just in case" or go.

I experimented by passing null from XSDDatatype.parse and deleting ConfigurableValidationState, ValidationManager, ValidationState.

 typeDeclaration.validate(lexicalForm, null, resultInfo);

The build passed just fine.

Ostrzyciel · 2024-11-10T11:04:20Z

@afs Thanks! That clears a lot of things up. I will then start making some PRs for this in my spare time.

afs · 2024-11-10T11:06:34Z

This is very much a case of "less is more". No rush, doesn't need to be all done at once.

Issue: apache#2828 Remove the code for validating datatypes that according to the RDF 1.1 spec SHOULD NOT be used: QName, ENTITY, ID, IDREF, NOTATION, ENTITIES, NMTOKENS, IDREFS. Also remove code for validating XSD lists and unions.

Issue: #2828 Remove the code for validating datatypes that according to the RDF 1.1 spec SHOULD NOT be used: QName, ENTITY, ID, IDREF, NOTATION, ENTITIES, NMTOKENS, IDREFS. Also remove code for validating XSD lists and unions.

Ostrzyciel · 2024-11-20T16:39:04Z

@afs could you reopen this? GitHub closed this automatically, but it will need 1–2 more PRs to complete. I can't reopen this myself.

Ostrzyciel added the task Task label Nov 6, 2024

afs mentioned this issue Nov 10, 2024

Extract the Xerces implementation of regular expressions for SPARQL use. #2833

Closed

Ostrzyciel mentioned this issue Nov 17, 2024

GH-2828: Remove validation for XML-only datatypes #2846

Merged

4 tasks

afs closed this as completed in #2846 Nov 20, 2024

afs reopened this Nov 20, 2024

Ostrzyciel added a commit to Ostrzyciel/jena that referenced this issue Dec 23, 2024

apacheGH-2828: Remove unused code in ext.xerces

4d6efa5

Ostrzyciel added a commit to Ostrzyciel/jena that referenced this issue Dec 23, 2024

apacheGH-2828: Remove unused code in ext.xerces

648803c

Ostrzyciel linked a pull request Dec 23, 2024 that will close this issue

GH-2828: Remove unused code in ext.xerces #2906

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up code in jena.ext.xerces #2828

Clean up code in jena.ext.xerces #2828

Ostrzyciel commented Nov 6, 2024

afs commented Nov 10, 2024 •

edited

Loading

Ostrzyciel commented Nov 10, 2024

afs commented Nov 10, 2024

Ostrzyciel commented Nov 20, 2024

Clean up code in jena.ext.xerces #2828

Clean up code in jena.ext.xerces #2828

Comments

Ostrzyciel commented Nov 6, 2024

Change

The plan

How do I do this?

Notes on unused code

Are you interested in contributing a pull request for this task?

afs commented Nov 10, 2024 • edited Loading

Ostrzyciel commented Nov 10, 2024

afs commented Nov 10, 2024

Ostrzyciel commented Nov 20, 2024

afs commented Nov 10, 2024 •

edited

Loading