Guidelines or suggestions for data reconciliation (updated from time to time; collecting advice from everyone)
Currently, data reconciliation is performed on the CSV files, so we do not have a clear view of the overall schema. We should pay extra attention to two aspects:
0.1 Detect attributes that belong to a relationship rather than to any individual entity in the CSV files (I am not sure if it's suitable to put this here)
Some attributes are not associated with any single entity but with more than one entity. In other words, they are attributes of a relationship. For example, in the table below, track and number describe how a tune appears on a recording rather than the recording or the tune itself:
recording_id | track | number | tune |
---|---|---|---|
https://thesession.org/recordings/626 | 1 | 1 | Beidh Aonach Amarach |
https://thesession.org/recordings/626 | 1 | 2 | Bean An Ti Ar Lar |
https://thesession.org/recordings/4 | 1 | 1 | Johnny Boyle's |
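One way to spot such relationship attributes is to test whether a column is fixed by a single entity column or only by a combination of columns. The following is a minimal sketch, not an existing project script: the helper `depends_only_on` is hypothetical, and the sample rows are copied from the table above.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the table above.
SAMPLE_CSV = """recording_id,track,number,tune
https://thesession.org/recordings/626,1,1,Beidh Aonach Amarach
https://thesession.org/recordings/626,1,2,Bean An Ti Ar Lar
https://thesession.org/recordings/4,1,1,Johnny Boyle's
"""


def depends_only_on(rows, key_columns, column):
    """Return True if each combination of key_columns maps to one value of column."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[k] for k in key_columns)
        seen[key].add(row[column])
    return all(len(values) == 1 for values in seen.values())


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# "number" is not fixed by the recording alone ...
number_by_recording = depends_only_on(rows, ["recording_id"], "number")
# ... but it is fixed by the (recording, tune) pair, i.e. by the relationship.
number_by_pair = depends_only_on(rows, ["recording_id", "tune"], "number")
print(number_by_recording, number_by_pair)
```

On a three-row sample such checks are only indicative (a check against `tune` alone would also pass here), so they should be run on the full CSV file.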
Sometimes the CSV files exported from the database website contain duplicated properties. Be cautious about this.
- Clarification of namespaces:
wd: http://www.wikidata.org/entity/ (not https://www.wikidata.org/wiki/)
wdt: http://www.wikidata.org/prop/direct/ (not https://www.wikidata.org/wiki/Property:)
- Don't mix them up.
- Be cautious of ambiguous terms:
For example, the "recording" entity of TheSession is not "recorded music" (Q49017950) but actually an "album" (Q482994); in MusicBrainz, however, "recorded music" and "album" coexist and are different.
Classes are worth reconciling manually.
Such as
<entity> rdf:type <entity> .
or <entity> wdt:P31 <entity> .
In the future, we may add semantics like
rdf:type owl:equivalentProperty wdt:P31 .
Take the type of MusicBrainz entities as an example: the spreadsheet header of certain entity files contains a type column, and this type can be a sub-type of the entity.
- Note: The type among the spreadsheet headers of series does not refer to the actual type of series (see: https://musicbrainz.org/doc/Series). To avoid confusion and contradiction (in the mapping relationship), we should rename it to something more descriptive, such as "typeOfContainedEntity", mapping to https://musicbrainz.org/doc/Series#Type.
Issue to be discussed: Should we reconcile the types of the property values?
At present, it seems we don't have to. But if we do, we will get a more comprehensive ontology that includes rdfs:domain and rdfs:range for every property, which may be helpful for translating natural language questions to SPARQL.
(1) Reconciliation for properties can only be done manually.
In addition to reading the literal description of a property, it is recommended to compare the rdfs:domain and rdfs:range of the local property with those of the corresponding Wikidata property: e.g., if the domain/range of the local property falls within the domain/range of the candidate Wikidata property, the reconciliation is generally more reliable.
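The domain/range comparison described above can be sketched as a subclass check. This is only an illustration under stated assumptions: the class names, the `SUBCLASS_OF` edges, and both helper functions are hypothetical; in practice the hierarchy would come from the local ontology and from Wikidata "subclass of" statements.

```python
# Hypothetical subclass edges (child -> parents), for illustration only.
SUBCLASS_OF = {
    "ex:Album": {"ex:MusicalWork"},
    "ex:MusicalWork": {"ex:CreativeWork"},
    "ex:Person": {"ex:Agent"},
}


def is_within(cls, target, subclass_of=SUBCLASS_OF):
    """Return True if cls equals target or is a (transitive) subclass of it."""
    stack, visited = [cls], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in visited:
            continue
        visited.add(current)
        stack.extend(subclass_of.get(current, ()))
    return False


def domain_range_compatible(local, candidate):
    """Check that the local domain and range fall within the candidate's."""
    return (is_within(local["domain"], candidate["domain"])
            and is_within(local["range"], candidate["range"]))


local_property = {"domain": "ex:Album", "range": "ex:Person"}
candidate_ok = {"domain": "ex:CreativeWork", "range": "ex:Agent"}
candidate_bad = {"domain": "ex:Person", "range": "ex:Agent"}

print(domain_range_compatible(local_property, candidate_ok))
print(domain_range_compatible(local_property, candidate_bad))
```

A passing check only supports a candidate; the final decision on a property match remains manual.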
(2) Reconciliation for property values can mostly be done automatically in OpenRefine.
Some properties [such as name (wdt:P2561) and title (wdt:P1476)] are basically similar to rdfs:label, which is the preferred choice for the convenience of LLM2SPARQL.
Since the introduction of OWL, properties can be broadly divided into 2 types:
(1) object property: the value is another item that has a URI; for example, see day of week (P2894)
(2) data property: the value is not a URI but an rdfs:Literal...
Keeping the type of each property consistent is recommended, because a clear distinction between object properties and data properties contributes to the accuracy of LLM2SPARQL. For example:
If you ask ChatGPT a question, it usually renders a property as either an object property or a data property. To disambiguate, you probably have to use isIRI(?x). For example, see the specific question "Find in TheSession performers who are Canadians. And find the recordings they performed in TheSession".
The expected SPARQL can be:
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?recording ?performer
WHERE {
  GRAPH <http://sample/thesession/reconciled> {
    ?recording a wd:Q482994 ;
               wdt:P175 ?performer .
    FILTER isIRI(?performer)  # Without the FILTER, it will report "Virtuoso S1TAT Error Query did not complete due to ANYTIME timeout."
  }
  SERVICE <https://query.wikidata.org/sparql> {
    ?performer wdt:P27 wd:Q16 .
  }
}
- Caution: Sometimes the original database intentionally treats certain attributes as data properties (plain literals) rather than object properties (structured entities). For example, the values of wdt:P175 (performer) for recordings in TheSession behave more like a data property, meaning the values are not well structured. This can lead to ambiguity, such as the same name referring to different individuals, which makes reconciliation more difficult.
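Before loading a reconciled column, it may help to see how many of its values would become IRIs and how many would stay literals, since a mixed column is what makes the isIRI filter necessary. The sketch below is a rough pre-load check, not an equivalent of SPARQL's isIRI (which tests the RDF term type, not the string shape); the helper names and the sample values, including the placeholder QID, are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_iri(value):
    """Rough check: the value parses with an http(s) scheme and a host."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def split_values(values):
    """Partition values into (IRI-like, plain-literal) lists, keeping order."""
    iris, literals = [], []
    for value in values:
        (iris if looks_like_iri(value) else literals).append(value)
    return iris, literals


# Hypothetical performer column: one reconciled cell (placeholder QID)
# and one name that could not be reconciled.
performers = ["http://www.wikidata.org/entity/Q00000", "John Doe"]
iris, literals = split_values(performers)
print(len(iris), len(literals))
```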
We can also refer to this: https://www.wikidata.org/wiki/Wikidata:WikiProject_Music to get a lot of recommended properties for LinkedMusic.
This involves avoiding "Super Nodes"; see https://github.com/DDMAL/linkedmusic-datalake/discussions/205
3.1 Especially the context of subject->property->object
3.2 Checking the "subclass of" or "instance of" property is also useful
This raises another notable issue: a word can have "seemingly the same but actually different" meanings depending on the context. For example, when reconciling the sub-type "Club" of the entity "Place" in MusicBrainz, there are 2 potential matches in Wikidata: "club (Q988108)", which is a subclass of organization, and "nightclub (Q622425)", which is a subclass of music venue. We would rather choose Q622425, which aligns more closely with the type "Place".
The recommended list of schema/ontology sorted in descending order based on priority is: ...
All entity URIs should follow the formats denoted by the namespace prefixes wd and wdt:
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
Be very careful that it's "http" instead of "https"; for wd, it's /entity/ instead of /wiki/ ...
Otherwise the reconciled URI won't be recognized by the Wikidata SPARQL endpoint.
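URLs copied from a browser usually have the page form, so a small normalization step can catch them before they reach the graph. This is a minimal sketch with a hypothetical function name; mapping a Property: page URL to the wdt: namespace is an assumption that fits properties used as predicates.

```python
import re

WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

# Non-canonical forms and the namespace each one should be rewritten to.
_REWRITES = (
    (re.compile(r"^https?://www\.wikidata\.org/wiki/(Q\d+)$"), WD),
    (re.compile(r"^https://www\.wikidata\.org/entity/(Q\d+)$"), WD),
    (re.compile(r"^https?://www\.wikidata\.org/wiki/Property:(P\d+)$"), WDT),
)


def normalize_wikidata_uri(uri):
    """Rewrite page-style or https Wikidata URLs to the wd:/wdt: namespaces."""
    uri = uri.strip()
    for pattern, namespace in _REWRITES:
        match = pattern.match(uri)
        if match:
            return namespace + match.group(1)
    return uri  # already canonical, or not a Wikidata URL


print(normalize_wikidata_uri("https://www.wikidata.org/wiki/Q482994"))
print(normalize_wikidata_uri("https://www.wikidata.org/wiki/Property:P175"))
```

URIs that are already canonical, and URIs from other databases, pass through unchanged.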
When several similar candidates exist and it is difficult to decide which one to map to, it is suggested to choose the entity with the comparatively smaller Q ID or P ID number. Generally, a smaller ID number indicates that the entity is more widely used.
Especially for entities that you have to reconcile manually in OpenRefine, it is best to keep a spreadsheet recording the mapped Wikidata entities.
Some records can be regarded as "controlled vocabularies". For example, in MusicBrainz, a type of entity usually has several sub-types, such as the sub-types of label, which include Bootleg Production, Original Production, Production, Imprint ...
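The smaller-ID rule can be applied as a tie-breaker once the semantic checks no longer separate the candidates. The following is a minimal sketch with hypothetical helper names; the two QIDs are the "club" and "nightclub" candidates mentioned earlier, used here only as sample input.

```python
import re


def numeric_id(entity_id):
    """Extract the integer part of a Q or P identifier, e.g. 'Q482994' -> 482994."""
    match = re.fullmatch(r"[QP](\d+)", entity_id)
    if match is None:
        raise ValueError(f"not a Q/P identifier: {entity_id!r}")
    return int(match.group(1))


def prefer_smaller_id(candidates):
    """Pick the candidate with the smallest numeric ID as a tie-breaker."""
    return min(candidates, key=numeric_id)


choice = prefer_smaller_id(["Q988108", "Q622425"])
print(choice)
```

The IDs are compared as integers because a plain string comparison would rank "Q1000000" before "Q999".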
Refer to https://github.com/DDMAL/linkedmusic-datalake/wiki/Wikidata:-Things-we-should-add
6. For items that are not easy to map to an exact property or type, we have two fallback methods:
2.1 If necessary, we use a hash # (document fragment delimiter), such as https://musicbrainz.org/doc/Event#Cancelled in MusicBrainz
2.2 If necessary, we can use a fake URL.
Please refer to https://github.com/DDMAL/linkedmusic-datalake/issues/107
- We should also use owl:sameAs, because it can supplement latent data once the reasoning function is activated (in Virtuoso); see:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT DATA {
  GRAPH <urn:reason.example> {
    <http://InstanceA_local> <http://property> <http://InstanceB_local> .
    <http://InstanceA_local> owl:sameAs <http://InstanceA_wiki> .
    <http://InstanceB_local> owl:sameAs <http://InstanceB_wiki> .  # This reasoning condition doesn't take effect, to be investigated in the future.
  }
}
After inserting the data above, to check which properties http://InstanceA_wiki has, you can run a query with the reasoning function activated:
DEFINE input:same-as "yes"
SELECT DISTINCT ?p ?o
FROM <urn:reason.example>
WHERE {
  <http://InstanceA_wiki> ?p ?o .
}
The result can be:
p | o |
---|---|
http://www.w3.org/2002/07/owl#sameAs | http://InstanceA_wiki |
http://property | http://InstanceB_local |
Here is a case from RISM, with the original data including blank nodes:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ns1: <https://rism.online/api/v1#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<https://rism.online/sources/1000000001> a ns1:Source ;
    ns1:hasRelationship [ dcterms:relation <https://rism.online/people/40005939> ;
                          ns1:hasRole <http://id.loc.gov/vocabulary/relators/arr> ] .
The ns1:hasRole serves as an adverbial to describe the relation. I am not sure: is it worthwhile to reconcile ns1:hasRole?