Skip to content

Latest commit

 

History

History
30 lines (22 loc) · 4.44 KB

DATA_CONTRACTS.md

File metadata and controls

30 lines (22 loc) · 4.44 KB

Note

Those new to Wikidata should consider first reading the Wikidata and Scribe Guide.

Scribe uses community curated data from Wikidata in all applications via the Scribe-Data CLI. To make sure that Scribe-Data is easy to maintain, we've made the decision that data output labels reflect the data that comprises them. As the data in Wikidata can change, that also means that labels can change, meaning that return values for Scribe applications cannot be hard coded. If they were hard coded, then we'd need to update end applications each time that there's new data or the data changes meaning that we wouldn't have parallel update processes for applications and data.

Data labels are generated from underlying data via the Scribe-Data metadata files that then are used to check outputs. An example of labels for data coming from the data itself is:

  • For a verbs query we check Wikidata and see that a verb has the following entities for an indicative, present tense, first person, singular conjugation form
    • This verb form for to go in French would be je vais, in German ich gehe, etc
  • The Wikidata QIDs for the above entities are Q682111, Q192613, Q21714344, and Q110786
  • The meta data files determine the translation of QIDs to unique labels and further their order
  • The above conjugation would thus be indicativePresentFirstPersonSingular
  • With similar translations we get all conjugations for a tense:
    • indicativePresentFirstPersonSingular
    • indicativePresentSecondPersonSingular
    • indicativePresentThirdPersonSingular
    • indicativePresentFirstPersonPlural
    • indicativePresentSecondPersonPlural
    • indicativePresentThirdPersonPlural
  • The above comprises what the user would see in a conjugation view in a Scribe application

This process has countless benefits for Scribe-Data as it allows us to check that queries are correct, to get consistent data from Wikidata via dumps or the Wikidata Query Service and also to generate queries from Wikidata dumps.

But what if the data changed? Data on Wikidata is modeled by a diverse community, meaning not all data is modeled the same, and sometimes it does change. An example of this is nouns in Spanish, where for many languages on Wikidata different genders of a word have their own unique LID identifier - brother and sister would have their own identifiers. At time of writing brother and sister are on the same identifier in Spanish. We thus need special processes for dealing with Spanish data, but then if we hard coded them and the data changes, then the end applications would break.

A similar situation could arise for conjugations like those above. Say the data was modeled without indicative at first, and then the community decides that verbs for a certain language need to include their grammatical mood (indicative is a mood). Our applications would be coded to accept presentFirstPersonSingular, presentSecondPersonSingular, etc, but then all of this would then be broken as the data labels would change.

The solution to this is the data contracts. Scribe-Data generates both data and the corresponding contracts, with both then being transmitted from Scribe-Server to end applications. The contracts dictate which columns in which tables should be used to access features for end applications, or combinations of data can be codified so that end applications always know which queries they need to run on the SQLite databases. We run checks to make sure that the current contracts are valid given new data, and if the data changes then we just update the contract values.