
Guidelines or suggestions for data reconciliation (updated from time to time; collecting advice from everyone)

Junjun Cao edited this page Feb 21, 2025 · 35 revisions

0. Regarding the latent "Schema" behind the data (to be finished)

Currently, data reconciliation is performed on the CSV files, so we do not have a clear view of the overall schema. We should pay extra attention to two aspects:

0.1 Detect attributes that belong to a relationship rather than to any individual entity in the CSV files (I am not sure if it's suitable to put this here)

Some attributes are not associated with any single entity but with more than one entity. In other words, they are attributes of a relationship. For example:

recording_id | track | number | tune
https://thesession.org/recordings/626 | 1 | 1 | Beidh Aonach Amarach
https://thesession.org/recordings/626 | 1 | 2 | Bean An Ti Ar Lar
https://thesession.org/recordings/4 | 1 | 1 | Johnny Boyle's

The values of track and number are meaningful only in the context of both the tune value and the recording_id value, which may make the reconciliation a bit...
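
One way to spot such relationship attributes is to test which columns determine a value. The sketch below is a minimal illustration, assuming the CSV has already been loaded as a list of dictionaries; the function names and the sample rows (rec-A, tune-X, and so on) are hypothetical and not taken from TheSession data.

```python
from collections import defaultdict


def is_determined_by(rows, key_columns, target_column):
    """Return True if each combination of key_columns maps to exactly one target value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        seen[key].add(row[target_column])
    return all(len(values) == 1 for values in seen.values())


def is_relationship_attribute(rows, entity_columns, target_column):
    """A column is treated as a relationship attribute when no single entity
    column determines it, but the entity columns taken together do."""
    single = any(
        is_determined_by(rows, [column], target_column) for column in entity_columns
    )
    combined = is_determined_by(rows, list(entity_columns), target_column)
    return combined and not single


# Hypothetical rows: the same tune appears on two recordings at different positions.
rows = [
    {"recording_id": "rec-A", "artist": "artist-1", "track": "1", "number": "1", "tune": "tune-X"},
    {"recording_id": "rec-A", "artist": "artist-1", "track": "1", "number": "2", "tune": "tune-Y"},
    {"recording_id": "rec-B", "artist": "artist-2", "track": "1", "number": "1", "tune": "tune-Y"},
    {"recording_id": "rec-B", "artist": "artist-2", "track": "2", "number": "1", "tune": "tune-X"},
]

entity_columns = ["recording_id", "tune"]
for column in ("track", "number", "artist"):
    print(column, is_relationship_attribute(rows, entity_columns, column))
```

On a small sample a column with unique values determines everything trivially, so the check is only meaningful on a reasonably large extract.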

0.2 Detect the redundancy among the CSV files

Sometimes the CSV files provided by the database website contain duplicated properties. Watch out for this.
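
A quick first pass for this kind of redundancy is to compare the headers of the CSV files. The sketch below assumes each file is available as text and has a header row; the file names and columns are hypothetical. A shared column is only a candidate: an identifier used to join two files is expected to appear in both.

```python
import csv
import io
from collections import defaultdict


def shared_columns(csv_texts):
    """Map each column name to the files whose header contains it,
    keeping only names that occur in two or more files."""
    owners = defaultdict(list)
    for filename, text in csv_texts.items():
        header = next(csv.reader(io.StringIO(text)), [])
        for column in header:
            owners[column.strip()].append(filename)
    return {column: files for column, files in owners.items() if len(files) > 1}


# Hypothetical file contents for illustration only.
csv_texts = {
    "recordings.csv": "recording_id,artist,title\nrec-A,artist-1,Some Album\n",
    "tracks.csv": "recording_id,artist,tune\nrec-A,artist-1,tune-X\n",
}

for column, files in shared_columns(csv_texts).items():
    print(column, "appears in", files)
```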

1. Q and P in WikiData

1.0 Basic attention

  • Clarification of namespaces:

wd: is http://www.wikidata.org/entity/, not https://www.wikidata.org/wiki/
wdt: is http://www.wikidata.org/prop/direct/, not https://www.wikidata.org/wiki/Property:

  • Do not mix the two forms.
  • Be cautious of ambiguous terms:

For example, the "recording" entity of TheSession is not "recorded music" (Q49017950) but an "album" (Q482994), whereas in MusicBrainz "recorded music" and "album" coexist and are distinct.

1.1 Q

1.1.1 Find the class/type for the entities/instances

Classes are worth reconciling manually.

1.1.1.1 For the first column

For example, <entity> rdf:type <entity>. or <entity> wdt:P31 <entity>.

In the future, we may add semantics such as rdf:type owl:equivalentProperty wdt:P31.

1.1.1.2 For columns other than the first column

For example, the type column for MusicBrainz entities: the spreadsheet header of certain entity files contains a type column, whose values can be sub-types of the entity.

1.1.2 Reconcile each instance with those of Wikidata (see below)

Issue to be discussed: Should we reconcile the type for the property values?

At present, it seems we do not have to. But if we do, we will get a more comprehensive ontology that includes rdfs:domain and rdfs:range for every property, which may help when translating natural-language questions to SPARQL.

1.2 P

(1) Reconciliation for Properties can only be done manually.

In addition to reading the textual description of a property, it is recommended to compare the rdfs:domain and rdfs:range of the local property with those of the candidate Wikidata property. If the domain and range of the local property fall within the domain and range of the Wikidata candidate, the reconciliation is generally more reliable.

(2) Reconciliation for property values can mostly be done automatically in OpenRefine.
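
The domain/range comparison in (1) can be sketched as a subset test. This is a minimal illustration, assuming the domain and range of each property have already been collected as sets of class labels and that the candidate's sets already include the relevant subclasses; the dictionaries and class labels below are hypothetical.

```python
def falls_within(local_classes, candidate_classes):
    """Return True if every local class is among the candidate's classes."""
    return bool(local_classes) and set(local_classes) <= set(candidate_classes)


def domain_range_compatible(local_property, candidate_property):
    """Check that both the domain and the range of the local property
    fall within those of the candidate property."""
    return falls_within(
        local_property["domain"], candidate_property["domain"]
    ) and falls_within(local_property["range"], candidate_property["range"])


# Hypothetical descriptions of a local property and two candidates.
local_performer = {"domain": {"album"}, "range": {"human"}}
candidate_a = {"domain": {"album", "single"}, "range": {"human", "musical group"}}
candidate_b = {"domain": {"film"}, "range": {"human"}}

print(domain_range_compatible(local_performer, candidate_a))
print(domain_range_compatible(local_performer, candidate_b))
```

A passing check only raises confidence in a candidate; the final decision on a property still needs a manual look.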

1.2.0 About rdfs:label

Some properties [such as name (wdt:P2561) and title (wdt:P1476)] are essentially similar to rdfs:label. Using rdfs:label is preferred for the convenience of LLM2SPARQL.

1.2.1 Note the type of property values

Since the introduction of OWL, properties can be divided into two types:

(1) object property: the value is another item that has a URI; for example, see day of week (P2894)
(2) data property: the value is not a URI but an rdfs:Literal...

Perhaps it is advisable to keep the type of each property consistent, because a clear distinction between object properties and data properties contributes to the accuracy of LLM2SPARQL.

If you ask ChatGPT a question, it usually renders a property as either an object property or a data property. To disambiguate, you probably have to use isIRI(?x). For example:

See a specific question "Find in TheSession performers who are Canadians. And find the recordings they performed in TheSession".
The expected SPARQL can be:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT distinct ?recording ?performer 
WHERE {
  GRAPH <http://sample/thesession/reconciled> {
    ?recording a wd:Q482994 ;
               wdt:P175 ?performer .
    FILTER isIRI(?performer)  # Without the FILTER, it will report "Virtuoso S1TAT Error Query did not complete due to ANYTIME timeout."
  }
  SERVICE <https://query.wikidata.org/sparql> { 
    ?performer wdt:P27 wd:Q16
  }
}

  • Caution: Sometimes the original database intentionally treats certain attributes as data properties rather than as structured entities (object properties). For example, the values of wdt:P175 (performer) for recordings in TheSession behave more like data property values, meaning they are not well structured. This can lead to ambiguity, such as the same name referring to different individuals, which makes reconciliation more difficult.

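
Before loading a reconciled column, it can help to check whether its values are consistently IRIs or consistently literals. The sketch below is a rough approximation of the isIRI test, assuming that only http and https IRIs occur in the data; the function names and sample values are hypothetical.

```python
from urllib.parse import urlparse


def is_iri(value):
    """Rough check: an http(s) URL with a host counts as an IRI."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def classify_property(values):
    """Classify a property by the kind of values it holds."""
    kinds = {"iri" if is_iri(value) else "literal" for value in values}
    if not kinds:
        return "empty"
    if kinds == {"iri"}:
        return "object"
    if kinds == {"literal"}:
        return "data"
    return "mixed"


# Hypothetical performer values: one reconciled, one left as a plain name.
performers = ["http://www.wikidata.org/entity/Q16", "Some Performer"]
print(classify_property(performers))
```

A "mixed" result marks the properties for which a query needs the FILTER isIRI(...) guard.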
1.2.2 Wikidata:WikiProject_Music

We can also refer to this: https://www.wikidata.org/wiki/Wikidata:WikiProject_Music to get a lot of recommended properties for LinkedMusic.

1.2.3 Is it worth reconciling every property value?

This involves the consideration of avoiding "Super Nodes"; see https://github.com/DDMAL/linkedmusic-datalake/discussions/205

2. Ask ChatGPT to recommend properties or types for entities

3. Check the context where the property is used in Wikidata

3.1 Especially the context of subject->property->object
3.2 Checking the "subclass of" or "instance of" property is also useful

This raises another notable issue: a word can have seemingly identical but actually different meanings depending on the context. For example, when reconciling the sub-type "Club" of the entity "Place" in MusicBrainz, there are two potential matches in Wikidata: "club (Q988108)", which is a subclass of organization, and "nightclub (Q622425)", which is a subclass of music venue. We would rather choose Q622425, which aligns more closely with the type "Place".
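
The "Club" decision can be sketched as a walk up the "subclass of" chain of each candidate. The two edges for Q988108 and Q622425 come from the example above; the edge from music venue to place is an assumption added for illustration, and in practice the hierarchy would be fetched from Wikidata rather than hard-coded.

```python
def ancestors(cls, subclass_of):
    """Collect every class reachable from cls through subclass-of edges."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def candidates_under(candidates, expected_ancestor, subclass_of):
    """Keep the candidates that have expected_ancestor somewhere above them."""
    return [c for c in candidates if expected_ancestor in ancestors(c, subclass_of)]


subclass_of = {
    "Q988108": ["organization"],   # club, from the example above
    "Q622425": ["music venue"],    # nightclub, from the example above
    "music venue": ["place"],      # assumed edge, for illustration only
}

print(candidates_under(["Q988108", "Q622425"], "place", subclass_of))
```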

4. In addition to Wikidata, we can reconcile with other metadata schema/ontology such as schema.org

The recommended list of schemas/ontologies, in descending order of priority, is: ...

5. Reconciliation for entities:

All entity URIs should follow the formats denoted by the namespace prefixes wd and wdt:

@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

Be very careful that the scheme is "http", not "https", and that for wd the path is /entity/, not /wiki/...
Otherwise the reconciled URI will not be recognized by the Wikidata SPARQL endpoint.
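
A small normalizer can catch the "https" and "/wiki/" mistakes before the data is loaded. The sketch below is hypothetical helper code, assuming every input is a Wikidata URL that ends in a Q or P identifier, and assuming P identifiers should always go to the wdt namespace.

```python
import re

ENTITY_NS = "http://www.wikidata.org/entity/"
DIRECT_PROP_NS = "http://www.wikidata.org/prop/direct/"

# A Q or P identifier at the very end, preceded by "/" or ":" (as in "Property:P175").
_ID_AT_END = re.compile(r"[/:]([QP][1-9]\d*)$")


def normalize_wikidata_uri(url):
    """Rewrite a Wikidata page or entity URL into the wd/wdt namespace form."""
    url = url.strip()
    match = _ID_AT_END.search(url)
    if "wikidata.org" not in url or match is None:
        raise ValueError(f"not a recognizable Wikidata URL: {url!r}")
    wikidata_id = match.group(1)
    namespace = ENTITY_NS if wikidata_id.startswith("Q") else DIRECT_PROP_NS
    return namespace + wikidata_id


print(normalize_wikidata_uri("https://www.wikidata.org/wiki/Q482994"))
print(normalize_wikidata_uri("https://www.wikidata.org/wiki/Property:P175"))
```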

5.1 Suggestion for preferred entities

When candidates are similar and it is difficult to decide which one to map to, it is suggested to choose the entity with the comparatively smaller Q ID or P ID number. IDs are assigned sequentially, so a smaller number generally indicates an older entity that tends to be more widely used.
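
As a tie-breaker, the comparison has to be numeric, because comparing the IDs as strings would rank "P175" before "P27". A minimal sketch with hypothetical helper names:

```python
def numeric_id(wikidata_id):
    """Return the integer part of a Q or P identifier, e.g. 'Q622425' -> 622425."""
    prefix, digits = wikidata_id[:1], wikidata_id[1:]
    if prefix not in ("Q", "P") or not digits.isdigit():
        raise ValueError(f"not a Wikidata identifier: {wikidata_id!r}")
    return int(digits)


def prefer_smaller_id(candidates):
    """Pick the candidate with the smallest numeric ID; Q and P must not be mixed."""
    if len({candidate[:1] for candidate in candidates}) != 1:
        raise ValueError("candidates must be non-empty and all Q or all P")
    return min(candidates, key=numeric_id)


print(prefer_smaller_id(["Q988108", "Q622425"]))
print(prefer_smaller_id(["P175", "P27"]))
```

This only orders candidates that are otherwise equally plausible; it does not replace checking the type and context of each candidate.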

5.2 Keep records for reconciliation

Especially for values that you have to reconcile manually in OpenRefine, you should keep a spreadsheet that records the mapped entities from Wikidata.

Some records can be regarded as "controlled vocabularies". For example, in MusicBrainz, a type of entity usually has several sub-types, such as the sub-types of label, which include Bootleg Production, Original Production, Production, Imprint...

Refer to https://github.com/DDMAL/linkedmusic-datalake/wiki/Wikidata:-Things-we-should-add
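
Such a record can be as simple as a CSV file that is written once and reloaded whenever the same vocabulary shows up again. The sketch below is one possible layout, not an agreed format; the column names are hypothetical, and the sample row reuses the Club/Q622425 decision from section 3.

```python
import csv
import io

FIELDS = ["local_value", "wikidata_uri", "method", "note"]


def write_mapping_records(records, stream):
    """Write reconciliation records as CSV with a fixed header."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def load_mapping(stream):
    """Read the CSV back into a local_value -> wikidata_uri lookup."""
    return {row["local_value"]: row["wikidata_uri"] for row in csv.DictReader(stream)}


records = [
    {
        "local_value": "Club",
        "wikidata_uri": "http://www.wikidata.org/entity/Q622425",
        "method": "manual",
        "note": "sub-type of Place in MusicBrainz",
    }
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_mapping_records(records, buffer)
buffer.seek(0)
print(load_mapping(buffer))
```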

6. For items that are not easy to map to an exact property or type, we have two fallback methods:

1. Add to Wikidata

2. Use the URL of the webpage of the corresponding database

2.1 If necessary, we use a hash # (the document fragment delimiter), such as https://musicbrainz.org/doc/Event#Cancelled in MusicBrainz

2.2 If necessary, we can use a fake URL.
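
Building the fallback URL from 2.1 can be sketched as below. The helper name is hypothetical, and percent-encoding the term is an assumption: the real anchor names on the database's documentation pages may follow a different convention and should be checked.

```python
from urllib.parse import quote, urldefrag


def fallback_uri(doc_url, term):
    """Attach a term to a documentation page URL as a fragment identifier."""
    base, _old_fragment = urldefrag(doc_url)  # drop any fragment already present
    return f"{base}#{quote(term.strip(), safe='')}"


print(fallback_uri("https://musicbrainz.org/doc/Event", "Cancelled"))
```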

7. Special situation:

7.1 Sometimes we can model a special situation in its own way. For an example, please refer to https://github.com/DDMAL/linkedmusic-datalake/issues/107

7.2 About wdt:P2888:exact match

  • We should also use owl:sameAs, because it can surface latent data when the reasoning function (in Virtuoso) is activated, see:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT DATA {
  GRAPH <urn:reason.example> {
    <http://InstanceA_local> <http://property> <http://InstanceB_local> .
    <http://InstanceA_local> owl:sameAs <http://InstanceA_wiki> .
    <http://InstanceB_local> owl:sameAs <http://InstanceB_wiki> .  # This reasoning condition doesn't take effect, to be investigated in the future.
  }
}

After inserting the data above, to check which properties http://InstanceA_wiki has, run the query with the reasoning function activated:

DEFINE input:same-as "yes"
SELECT distinct ?p ?o 
FROM <urn:reason.example> 
WHERE {
    <http://InstanceA_wiki> ?p ?o .  
}

The result can be:

p | o
http://www.w3.org/2002/07/owl#sameAs | http://InstanceA_wiki
http://property | http://InstanceB_local

7.3 About a property nested in another property

Here is a case from RISM, with the original data including blank nodes:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ns1: <https://rism.online/api/v1#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<https://rism.online/sources/1000000001> a ns1:Source ;
ns1:hasRelationship [ dcterms:relation <https://rism.online/people/40005939> ;
                      ns1:hasRole <http://id.loc.gov/vocabulary/relators/arr> ] .

Here ns1:hasRole acts like an adverbial that qualifies the relation. I am not sure whether it is worthwhile to reconcile ns1:hasRole.
