How to map a single dataset containing multiple sources to itself? #255

KonradHoeffner · 2022-01-14T14:01:44Z

Is it possible to use LIMES with more than two sources that are all included in the same file?
The sources should be mapped to each other, but of course I don't want to map a source to itself, and I also don't want duplicate pairs (A,B) and (B,A).
To clarify with an example, let's say I have a class :Country with many instances, and each country has a population of individuals.
All of this data is in the same file, countries.ttl.
Now I want to find out which individuals live in more than one country.

:Germany a :Country;
 rdfs:label "Germany".

:Azerbaijan a :Country;
 rdfs:label "Azerbaijan".

:person123 a :Person;
 rdfs:label "Alex Müller";
 :country :Germany.

:person456 a :Person;
 rdfs:label "Alex Mueller";
 :country :Azerbaijan.

This can be done in the following manner, declaring source and target alike:

    <SOURCE>
        <ID>c1</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c1</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c1 a :Person; :country ?x.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </SOURCE>

    <TARGET>
        <ID>c2</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c2</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c2 a :Person; :country ?y.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </TARGET>

    <METRIC>trigrams(c1.label,c2.label)</METRIC>

However, this will generate a false match of every person to itself, and it will also match each pair twice, once in each direction.
I would like to add a restriction like "STR(?x) < STR(?y)", but it seems that one cannot reference variables from the source in the restriction of the target.
A workaround is to throw away all matches with a score of exactly 1.0, but this wastes resources and also discards correct matches whose labels happen to be exactly equal.
Also, this will map people in a country to others in the same country, which is not intended.

    <ACCEPTANCE>
        <THRESHOLD>1</THRESHOLD>
        <FILE>exact.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </ACCEPTANCE>
    
    <REVIEW>
        <THRESHOLD>0.8</THRESHOLD>
        <FILE>close.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </REVIEW>

Another way is to perform postprocessing to remove all duplicate and self matches, but that seems inefficient in both developer and execution time.
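For reference, the postprocessing option could look roughly like the sketch below. It assumes the link file contains one link per line with three whitespace-separated fields (source URI, target URI, score). The function name `dedupe_links` and the example URIs are made up for illustration and are not part of LIMES.

```python
def dedupe_links(lines):
    """Drop self-matches and keep one direction of each symmetric pair."""
    best = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        source, target, score = parts[0], parts[1], float(parts[2])
        if source == target:
            continue  # self-match, never wanted
        # Canonical order: the lexicographically smaller URI comes first,
        # so (A,B) and (B,A) collapse to the same key.
        key = (source, target) if source < target else (target, source)
        if key not in best or score > best[key]:
            best[key] = score  # keep the highest score seen for the pair
    return sorted((s, t, sc) for (s, t), sc in best.items())


# Hypothetical input, not taken from a real LIMES run.
example = [
    "<http://example.org/a> <http://example.org/b> 0.8",
    "<http://example.org/b> <http://example.org/a> 0.8",
    "<http://example.org/a> <http://example.org/a> 1.0",
]
links = dedupe_links(example)
```

Note that this only removes self-matches and mirrored pairs; it does not filter out matches between people of the same country.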

Lastly, I could write a script that enumerates all n*(n-1)/2 unique non-self-matching pairs and generates as many LIMES configuration files, but that has its own problems.
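As a rough illustration of that enumeration, the following sketch lists every unordered pair of distinct sources once. The helper name `source_pairs` and the third country are hypothetical; generating the actual configuration files is left out.

```python
from itertools import combinations


def source_pairs(sources):
    """Return every unordered pair of distinct sources exactly once."""
    return list(combinations(sorted(sources), 2))


# "France" is a made-up third source, added only to get more than one pair.
countries = ["Germany", "Azerbaijan", "France"]
pairs = source_pairs(countries)

# One configuration file would be needed per entry in `pairs`.
n = len(countries)
expected_count = n * (n - 1) // 2
```

With three sources this yields three pairs, but the number of configuration files grows quadratically with the number of sources.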

Is there any way to solve this task efficiently using LIMES or do I need to use one of the mentioned imperfect options?


KonradHoeffner · 2022-01-18T15:18:16Z

Thanks to @MSherif, I had partial success with MINUS(TRIGRAMS(c1.label,c2.label)|0.5,EXACTMATCH(c1.x,c2.y)|1); however, that still contains duplicates, and it seems those cannot be removed with LIMES as there is no "less than" operator.

MSherif · 2022-03-09T10:56:55Z

We added the new lessThan String measure. Please test it and close the issue if it is OK.

KonradHoeffner · 2022-03-09T14:59:09Z

Unfortunately it doesn't seem to work for me. Did I make a mistake with the combined metric? I don't really understand the documentation on what exactly MINUS, MAX and LESS_THAN output.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LIMES SYSTEM "limes.dtd">
<LIMES>
	<PREFIX>
		<NAMESPACE>http://hitontology.eu/ontology/</NAMESPACE>
		<LABEL>hito</LABEL>
	</PREFIX>
	<PREFIX>
		<NAMESPACE>http://www.w3.org/1999/02/22-rdf-syntax-ns#</NAMESPACE>
		<LABEL>rdf</LABEL>
	</PREFIX>
	<PREFIX>
		<NAMESPACE>http://www.w3.org/2000/01/rdf-schema#</NAMESPACE>
		<LABEL>rdfs</LABEL>
	</PREFIX>
	<PREFIX>
		<NAMESPACE>http://www.w3.org/2002/07/owl#</NAMESPACE>
		<LABEL>owl</LABEL>
	</PREFIX>
	<PREFIX>
		<NAMESPACE>http://www.w3.org/2004/02/skos/core#</NAMESPACE>
		<LABEL>skos</LABEL>
	</PREFIX>
	
	<SOURCE>
		<ID>c1</ID>
		<ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
		<VAR>?c1</VAR>
		<PAGESIZE>-1</PAGESIZE>
		<RESTRICTION>?c1 a hito:FeatureClassified</RESTRICTION>
		<PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
		<PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
		<OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
		<TYPE>SPARQL</TYPE>
	</SOURCE>

	<TARGET>
		<ID>c2</ID>
		<ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
		<VAR>?c2</VAR>
		<PAGESIZE>-1</PAGESIZE>
		<RESTRICTION>?c2 a hito:FeatureClassified</RESTRICTION>
		<PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
		<PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
		<OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
		<TYPE>SPARQL</TYPE>
	</TARGET>

	<METRIC>MINUS(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>

	<ACCEPTANCE>
		<THRESHOLD>1</THRESHOLD>
		<FILE>catalogue-exact.ttl</FILE>
		<RELATION>skos:closeMatch</RELATION>
	</ACCEPTANCE>
	
	<REVIEW>
		<THRESHOLD>0.5</THRESHOLD>
		<FILE>catalogue-close.ttl</FILE>
		<RELATION>skos:closeMatch</RELATION>
	</REVIEW>

	<EXECUTION>
		<REWRITER>default</REWRITER>
		<PLANNER>default</PLANNER>
		<ENGINE>default</ENGINE>
	</EXECUTION>

	<OUTPUT>CSV</OUTPUT>
</LIMES>

Despite saying that c1.cat should be less than c2.cat, the resulting catalogue-close.ttl still contains symmetric pairs:

<http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient>   <http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> 0.618421052631579
<http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement>    <http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement>    0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> <http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient>   0.618421052631579
<http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders>   0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders>   <http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> 0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders>  <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> 0.52
<http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement>    <http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement>    0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders>  0.52

MSherif · 2022-03-09T16:09:15Z

MIN(m1, m2) computes the intersection of the two mappings m1 and m2. If an entry (i.e., a link) exists in both mappings, the minimal similarity is taken.

MAX(m1, m2) computes the union of the two mappings m1 and m2. If an entry (i.e., a link) exists in both mappings, the maximal similarity is taken.

MINUS(m1, m2) computes the difference of the two mappings, i.e., the set difference m1 - m2.

MSherif · 2022-03-09T16:28:11Z

Please try <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>

KonradHoeffner · 2022-03-10T08:09:51Z

Thank you for the detailed explanation, this is extremely helpful! Could you add this to the official documentation at http://dice-group.github.io/LIMES/#/user_manual/configuration_file/defining_link_specifications?id=boolean-operations? I know what minimum, maximum and set difference are but the interaction with the thresholds was not clear to me. However what I still don't know is: What is the similarity score output of the MINUS operator? The ones from the first parameter? And what if something is below the threshold?

KonradHoeffner · 2022-03-10T08:18:01Z

Unfortunately, <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC> does not do the trick. If I replace this in the full specification given above (you can run it yourself to verify if you want), it gives a bunch of identical results:

<http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness>   <http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness>   1.0
<http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare>   <http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare>   1.0
<http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices>  <http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices>  1.0
<http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication>  <http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication>  1.0
[...]

However, this should not be possible, because for example http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness has only one catalogue, and this cannot be smaller than itself, as specified in LESS_THAN(c1.cat,c2.cat).
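The reasoning here rests on strict string comparison being irreflexive and asymmetric. The sketch below checks that property on made-up data, independently of LIMES: the feature names, the catalogue values and the helper `less_than_pairs` are all hypothetical, and it assumes each feature has exactly one catalogue.

```python
def less_than_pairs(cat_of):
    """All (a, b) where the catalogue of a sorts strictly before that of b."""
    return {(a, b) for a in cat_of for b in cat_of if cat_of[a] < cat_of[b]}


# Hypothetical features with one catalogue each.
cat_of = {"f1": "CatA", "f2": "CatB", "f3": "CatA"}
allowed = less_than_pairs(cat_of)

# No element is strictly less than itself, so no self-pairs can appear.
has_self_pair = any(a == b for a, b in allowed)

# If a < b holds, b < a cannot, so no pair appears in both directions.
has_mirrored_pair = any((b, a) in allowed for a, b in allowed)
```

Under these assumptions, any link set intersected with `allowed` can contain neither self-matches nor both directions of a pair, and pairs from the same catalogue (here `f1` and `f3`) drop out as well.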

Output of LIMES

$ limes test-sparql.xml                   
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
09:13:15.813 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.821 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.859 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.873 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.874 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:16.205 [main] [] WARN  org.apache.sis.system:228 - The “SIS_DATA” environment variable is not set.
09:13:17.171 [main] [] INFO  org.aksw.limes.core.controller.Controller:237 - Mapping task finished in 1218 ms
09:13:17.175 [main] [] INFO  org.aksw.limes.core.controller.Controller:241 - Mapping size: 620 (accepted) + 1520 (need verification) = 2140 (total)
09:13:17.176 [main] [] INFO  org.aksw.limes.core.controller.Controller:108 - Writing result files...
09:13:17.176 [main] [] INFO  org.aksw.limes.core.io.serializer.SerializerFactory:32 - Getting serializer with name CSV
09:13:17.199 [main] [] INFO  org.aksw.limes.core.controller.Controller:111 - Writing statistics file...

MSherif · 2022-03-10T09:32:22Z

> Thank you for the detailed explanation, this is extremely helpful! Could you add this to the official documentation at http://dice-group.github.io/LIMES/#/user_manual/configuration_file/defining_link_specifications?id=boolean-operations? I know what minimum, maximum, and set differences are but the interaction with the thresholds was not clear to me. However what I still don't know is: What is the similarity score output of the MINUS operator? The ones from the first parameter? And what if something is below the threshold?

Actually, MIN(m1, m2) returns the entries (i.e., links) with the minimum of their similarities in m1 and m2, where an entry missing from either mapping is assumed to have a similarity of 0. Therefore, if a link l exists only in m1, for instance, then we consider m2 to contain the same link l with a similarity of 0, and we do not return l, as it would have the minimum similarity of 0. MAX(m1, m2) has the same semantics.
MINUS(m1, m2) returns links from m1 with their respective similarities, but only if such links do not exist in m2.
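To make the description above concrete, here is a minimal sketch that models a mapping as a Python dict from (source, target) links to similarities. It follows the explanation given in this thread, not the actual LIMES implementation; the function names and example values are invented.

```python
def mapping_min(m1, m2):
    """Minimum similarity per link; a missing link counts as 0 and is dropped."""
    out = {}
    for link in m1.keys() | m2.keys():
        sim = min(m1.get(link, 0.0), m2.get(link, 0.0))
        if sim > 0.0:
            out[link] = sim
    return out


def mapping_max(m1, m2):
    """Maximum similarity per link; a missing link counts as 0."""
    out = {}
    for link in m1.keys() | m2.keys():
        sim = max(m1.get(link, 0.0), m2.get(link, 0.0))
        if sim > 0.0:
            out[link] = sim
    return out


def mapping_minus(m1, m2):
    """Links of m1 that do not occur in m2, with their m1 similarities."""
    return {link: sim for link, sim in m1.items() if link not in m2}


# Hypothetical mappings: m1 from a similarity measure, m2 from a filter.
m1 = {("a", "b"): 0.6, ("b", "a"): 0.6, ("a", "a"): 1.0}
m2 = {("a", "b"): 1.0}

min_result = mapping_min(m1, m2)
max_result = mapping_max(m1, m2)
minus_result = mapping_minus(m1, m2)
```

In this toy model, MIN keeps only the link that passes the filter, whereas MINUS removes exactly that link and keeps the self-match and the mirrored pair.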

MSherif · 2022-03-10T11:27:32Z

Done updating the LIMES docs

KonradHoeffner changed the title ~~How to map a single dataset to itself?~~ How to map a single dataset containing multiple sources to itself? Jan 14, 2022

KonradHoeffner mentioned this issue Jan 14, 2022

Katalogeinträge verlinken limes hitontology/ontology#94

Closed

KonradHoeffner mentioned this issue Jan 18, 2022

Please add a "less than" < string operator #256

Closed
