Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relational database schemas for subgraphs #1020

Closed
Jannis opened this issue Jun 27, 2019 · 8 comments
Closed

Relational database schemas for subgraphs #1020

Jannis opened this issue Jun 27, 2019 · 8 comments

Comments

@Jannis
Copy link
Contributor

Jannis commented Jun 27, 2019

This is required for #297.

@lutter
Copy link
Collaborator

lutter commented Aug 5, 2019

Goals

We want to switch the storage of entities from JSONB to a relational schema, meaning that each entity type is stored in its own table. The relational schema for a subgraph is generated from the GraphQL schema.

Storage in a relational schema will be added in such a way that already existing subgraphs that use JSONB storage will continue to function. Only newly deployed subgraphs will be stored in a relational schema.

Migrating existing subgraphs to relational storage is not currently planned.

Mapping GraphQL to a relational schema

The mapping from GraphQL to SQL is generated from the GraphQL schema we already store, and needs to be passed into Store.create_subgraph_deployment. The store will need to be able to load the mapping from the database by itself so that it can perform queries against a subgraph. The generated mapping should be cached in the store.

We will continue to store all entities for a subgraph in a subgraph-specific database schema/namespace.

Each type in the GraphQL schema is mapped to a database table, and each attribute of the type is stored in a column of that table. GraphQL types are mapped to SQL types in the following manner:

GraphQL type SQL type
Boolean boolean
BigInt numeric
Bytes bytea
BigDecimal numeric
ID varchar or bytea
Int integer
String varchar
List array

Note that for ID, we generally use varchar, but the block explorer uses bytea for the ID property.

We will only support lists of scalars, i.e., not lists of lists.

Not-null constraints in GraphQL will be mapped to not null constraints in SQL.

Properties that reference other entities, either a single entity, or an array of entities, will be stored as the ID type or as an array of ID.

For example, the GraphQL entity type

type Musician @entity {
    id: ID!
    name: String!
    mainBand: Band
    bands: [Band!]!
    writtenSongs: [Song]! @derivedFrom(field: "writtenBy")
}

would be stored in a SQL table

create table musicians (
        id                   varchar primary key,
        name                 varchar not null,
        main_band            varchar,
        bands                varchar[] not null
     -- derived fields (not stored in this table)
     -- written_songs        varchar[] not null
);

Interfaces

The biggest headache in generating a relational schema for a GraphQL schema are interfaces. The fact that a certain entity implements an interface has no bearing on the relational schema, but GraphQL attributes that reference interfaces need to be treated specially.

Assume we have the GraphQL schema

interface Pet {
    name: String!
}

type Cat implements Pet @entity {
    id: ID!,
    name: String!
}

type Dog implements Pet @entity {
    id: ID!,
    name: String!
}

type Owner @entity {
    id: ID!,
    firstPet: Pet!,
    pets: [Pet!]!
}

The tables for Cat and Dog are straightforward, but the table for Owner needs to deal with a polymorphic reference. This blog post has a nice overview of how to store such references. Since we do not enforce referential integrity in our database schema, we can use the Polymorphic Joins approach from the post, so that the table for Owner would look like

create table owners (
    id             varchar primary key,
    first_pet_type varchar,
    first_pet_id   varchar,
    pets_type      varchar[],
    pets_id        varchar[]
    -- add check constraints to ensure that 
    -- first_pet_type and pets_type only contain
    -- 'Cat' and 'Dog'
)

Serialization

Currently, we serialize entities into JSON objects before storing them in the database. For relational schemas, we will need to convert each attribute of an entity to the corresponding SQL type, and use attribute values directly in queries.

Writing entities

Apart from serialization, writing entities is fairly straightforward, and a Musician entity could be added to the store with a query like

insert into musicians(id, name, main_band, bands)
  values('lennon', 'John Lennon', 'Beatles', '{beatles,quarrymen}')

The above will of course use bind variables in real code. Updating and deleting entities will be handled similarly.

Querying entities

Currently, we only ever query one entity type in the same query; queries against the relational schema therefore only ever reference a single table. The biggest change for querying will be in how we turn an EntityFilter into conditions for a SQL query's where clause, specifically how attributes are accessed in store/postgres/src/filter.rs. With JSONB, the name of a musician is accessed as data->'name'->>'data'. With relational storage, this becomes just name.

Time-travel queries

We will not implement the current scheme of storing entity history in a separate history table; instead, we will follow the approach outlined in this issue

All tables that are generated will have a block_range column added that indicates for which range of blocks an entity version is valid. The id column will no longer be the primary key, instead, we use the exclude constraint shown in the above issue to ensure there is only ever one valid version for an entity at any given block.

The time-travel support will also be used to perform block reversions.

@leoyvens
Copy link
Collaborator

leoyvens commented Aug 5, 2019

Plan looks good. I have a comment about interfaces. Currently we do not require mappings to tell us the concrete type of an interface relationship, and instead require all types implementing an interface to not have conflicting ids. So we wouldn't know what to put in first_pet_type, and instead know only first_pet_id and have to join with all implementors of Pet when traversing the relationship.

@lutter
Copy link
Collaborator

lutter commented Aug 5, 2019

We can create schema that way - it'll cause more work when querying, since we have to look at all Pet entities, but is doable.

@Jannis
Copy link
Contributor Author

Jannis commented Aug 14, 2019

@leoyvens One thing we could do there is to perform lookups by ID/IDs to identify the type before writing the references to the db. This requires the referenced entities to already exist before referencing them — which feels like a good constraint.

@lutter A few comments:

  • The blog post linked for polymorphic joins only demonstrates 1->1 or n->1 relationships but not 1->n relationships like in your example. How does it work in your example?

  • One thing we're only partially supporting right now is GraphQL's union types, e.g. union Pet = Cat | Dog. Assuming we require Cat and Dog to be concrete entity types, it looks like we can implement them similarly to interfaces: dedicated entity tables for the concrete Cat and Dog types, plus type hints when referencing 1 or more pets. Thinking about it, it shouldn't be hard to support unions involving other unions and interfaces as well, because we can resolve them to all candidate types when we need to.

  • What I'm missing is an implementation plan (with rough estimates). Could you break the work down a bit, thinking about the sequencing of it?

Other than that, this looks good.

@lutter
Copy link
Collaborator

lutter commented Aug 15, 2019

@leoyvens One thing we could do there is to perform lookups by ID/IDs to identify the type before writing the references to the db. This requires the referenced entities to already exist before referencing them — which feels like a good constraint.

We made a decision to not enforce referential integrity when writing data at all. According to @yanivtal, we want to make it possible to write an entity that references a Pet before we write the actual Pet entity; and with that, we can't know the type of Pet we are dealing with. References to interfaces would therefore be stored like any other reference, as the ID, and resolution whether it is a Cat or Dog would happen when we run a query.

* The blog post linked for polymorphic joins only demonstrates 1->1 or n->1 relationships but not 1->n relationships like in your example. How does it work in your example?

The relational schema follows the GraphQL schema pretty closely: for 1->n relationships, we'd store an array of ID.

* One thing we're only partially supporting right now is GraphQL's `union` types, e.g. `union Pet = Cat | Dog`. Assuming we require `Cat` and `Dog` to be concrete entity types, it looks like we can implement them similarly to interfaces: dedicated entity tables for the concrete `Cat` and `Dog` types, plus type hints when referencing 1 or more pets. Thinking about it, it shouldn't be hard to support unions involving other unions and interfaces as well, because we can resolve them to all candidate types when we need to.

Yes, I think on the storage side, unions and interfaces are pretty much the same, both for storing references to them, as well as when we query them.

@lutter
Copy link
Collaborator

lutter commented Sep 11, 2019

Huge chunk of this was merged with PR #1159.

Remaining work:

@Jannis
Copy link
Contributor Author

Jannis commented Oct 28, 2019

@lutter I suppose we can close this, right?

@lutter
Copy link
Collaborator

lutter commented Oct 28, 2019

Yep!

@lutter lutter closed this as completed Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants