add: graphql factory documentation (#2887)
* fix: link to `new` graphql + rest home docs

* chore: move `rest` api docs to new location

* add: `graphql` factory documentation

* fix: correct `broken` links
Jabolol authored Jan 28, 2025
1 parent 7da5eeb commit f11dc89
Showing 10 changed files with 242 additions and 17 deletions.
202 changes: 202 additions & 0 deletions apps/docs/docs/contribute-data/api-crawling/graphql-api.md
---
title: GraphQL API Crawler
sidebar_position: 3
---

## GraphQL Resource Factory

Many of our data ingestion workflows rely on GraphQL, but constructing queries
and introspecting types can involve repetitive steps. This **GraphQL Resource
Factory** eliminates boilerplate by handling introspection, query building, and
parameter management for you.

This guide will explain how to use the factory to bootstrap a
[`graphql_factory`](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/factories/graphql.py)
asset in Dagster. We'll walk through the process of defining a configuration,
building the factory, and customizing the asset to suit your needs.

---

## Step by Step: Defining Your GraphQL Resource

This example demonstrates how to create a GraphQL asset that fetches
transactions from the
[Open Collective API](https://docs.opencollective.com/help/contributing/development/api).
The API exposes a `transactions` query that returns a list of transactions.

Currently, it exposes hundreds of nested fields, making it **cumbersome** to
write queries manually (we have done this by hand in the past, and it's not
fun). The GraphQL Resource Factory will generate the query for us, extract the
relevant data, and return a clean, usable asset, all with minimal effort.

### 1. Create the Configuration

The first step is to define a configuration object that describes your GraphQL
resource. For the Open Collective transactions example, we set the endpoint URL,
define query parameters, and specify a transformation function to extract the
data we need.

We also set a `max_depth` parameter to limit the depth of the introspection
query. The generated query will only explore fields up to this depth,
preventing it from becoming too large while still capturing all the data our
asset needs.

```python
from ..factories.graphql import GraphQLResourceConfig

config = GraphQLResourceConfig(
name="transactions",
endpoint="https://api.opencollective.com/graphql/v2",
max_depth=3, # Limit the introspection depth
parameters={
"limit": {
"type": "Int!",
"value": 10,
},
"type": {
"type": "TransactionType!",
"value": "CREDIT",
},
"dateFrom": {
"type": "DateTime!",
"value": "2024-01-01T00:00:00Z",
},
"dateTo": {
"type": "DateTime!",
"value": "2024-12-31T23:59:59Z",
},
},
transform_fn=lambda result: result["transactions"]["nodes"], # Optional transformation function
target_query="transactions", # The query to execute
target_type="TransactionCollection", # The type containing the data
)
```

:::tip
For the full `GraphQLResourceConfig` spec, see the [source definition](https://github.com/opensource-observer/oso/blob/05fe8b9192a08f6446225a89f4455c6b3723c5de/warehouse/oso_dagster/factories/graphql.py#L99).
:::

In this configuration, we define the following fields:

- **name**: A unique identifier for the Dagster asset.
- **endpoint**: The URL of the GraphQL API.
- **max_depth**: The maximum depth of the introspection query. This will
generate a query that explores all fields recursively up to this depth.
- **parameters**: A dictionary of query parameters. The keys are the parameter
names, and the values are dictionaries with the parameter type and value.
- **transform_fn**: A function that processes the raw GraphQL response and
returns the desired data.
- **target_query**: The name of the query to execute.
- **target_type**: The name of the GraphQL type that contains the data of
interest.
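A `transform_fn` can also reshape the payload rather than just select a key.
Here is a hypothetical sketch (the function name and the flattening logic are
our own illustration, not part of the factory API) that extracts the nodes and
flattens the nested `amount` object:

```python
def extract_transactions(result: dict) -> list[dict]:
    """Pull transaction nodes out of the raw GraphQL response and
    flatten the nested `amount` object into top-level columns."""
    rows = []
    for node in result["transactions"]["nodes"]:
        amount = node.pop("amount", None) or {}
        node["amount_value"] = amount.get("value")
        node["amount_currency"] = amount.get("currency")
        rows.append(node)
    return rows

# Would be passed in the config as: transform_fn=extract_transactions
```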

The factory will create the following query automatically, recursively
introspecting all the fields up to the specified depth:

```graphql
query (
$limit: Int!
$type: TransactionType!
$dateFrom: DateTime!
$dateTo: DateTime!
) {
transactions(
limit: $limit
type: $type
dateFrom: $dateFrom
dateTo: $dateTo
) {
offset
limit
totalCount
nodes {
id
legacyId
uuid
group
type
kind
amount {
value
currency
valueInCents
}
oppositeTransaction {
id
legacyId
uuid
# ... other generated fields ...
merchantId
invoiceTemplate
}
merchantId
balanceInHostCurrency {
value
currency
valueInCents
}
invoiceTemplate
}
kinds
}
}
```
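Conceptually, the depth-limited generation above can be sketched as follows.
This is a simplified illustration, not the factory's actual implementation;
real introspection also has to handle arguments, unions, and cyclic types:

```python
def build_selection(fields: dict, max_depth: int, depth: int = 1) -> str:
    """Recursively build a GraphQL selection set. `fields` maps a field
    name to None (scalar) or to a dict of subfields (object type).
    Object fields deeper than `max_depth` are silently dropped."""
    parts = []
    for name, subfields in fields.items():
        if subfields is None:  # scalar: always included
            parts.append(name)
        elif depth < max_depth:  # object: recurse one level deeper
            inner = build_selection(subfields, max_depth, depth + 1)
            parts.append(f"{name} {{ {inner} }}")
    return " ".join(parts)
```

With `max_depth=1` only top-level scalars survive; each extra level admits one
more layer of nesting, which is why `max_depth=3` above reaches fields like
`amount.valueInCents` but stops there.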

### 2. Build the Factory

:::tip
The GraphQL factory function takes a mandatory `config`
argument. The other arguments are directly passed to the underlying
`dlt_factory` function, allowing you to customize the behavior of the asset.

For the full reference of the allowed arguments, check out the Dagster
[`asset`](https://docs.dagster.io/api/python-api/assets) documentation.
:::

The `graphql_factory` function converts your configuration into a callable
Dagster asset. It takes the configuration object and returns a factory
function that our infrastructure uses to automatically create the asset.

```python
from ..factories.graphql import graphql_factory

# ... config definition ...

open_collective_transactions = graphql_factory(
config,
key_prefix="open_collective",
)
```
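When the asset materializes, the generated query is ultimately sent as a plain
GraphQL POST whose `variables` payload is derived from `parameters`. A generic
standard-library sketch of that step (function names are ours; this is not the
factory's actual HTTP client):

```python
import json
import urllib.request


def build_variables(parameters: dict) -> dict:
    """Flatten {name: {"type": ..., "value": ...}} into {name: value},
    matching the shape of GraphQLResourceConfig.parameters."""
    return {name: spec["value"] for name, spec in parameters.items()}


def run_query(endpoint: str, query: str, parameters: dict) -> dict:
    """POST the query and return the `data` portion of the response."""
    payload = json.dumps(
        {"query": query, "variables": build_variables(parameters)}
    ).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]
```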

---

## How to Run and View Results

:::tip
If you have not set up your local Dagster environment yet, please follow
our [quickstart guide](../../guides/dagster/index.md).
:::

Once your Dagster instance is running, follow the
[Dagster Asset Guide](../../guides/dagster/index.md) to materialize the assets.
Our example assets are located under `assets/open_collective/transactions`.

![Dagster Open Collective Asset List](crawl-api-graphql-pipeline.png)

Running the pipeline fetches `10` transactions from the Open Collective API and
stores them in BigQuery:

![Dagster Open Collective Result](crawl-api-example-opencollective.png)

---

## Conclusion

The GraphQL Resource Factory is a powerful tool for creating reusable assets
that fetch data from GraphQL APIs. By defining a configuration object and
building a factory function, you can quickly create assets that crawl even
complex APIs.

This allows you to focus on the data you need and the transformations you want
to apply, rather than the mechanics of constructing queries and managing API
interactions.
18 changes: 18 additions & 0 deletions apps/docs/docs/contribute-data/api-crawling/index.md
---
title: Crawl an API
sidebar_position: 4
---

Currently, we offer factories for two types of APIs: REST and GraphQL. These
factories allow you to ingest data from APIs with minimal effort. Here are the
guides to help you get started:

- [REST API Crawler](./rest-api.md): This guide will walk you through the
process of creating a REST API crawler using our `rest` factory. It will show
you how to define the endpoints you want to fetch, create a configuration
object, and build the asset.
- [GraphQL API Crawler](./graphql-api.md): This guide will explain how to use
the GraphQL Resource Factory to bootstrap a `graphql_factory` asset in Dagster
that fetches all fields from a GraphQL endpoint automatically.

Reach out to us on [Discord](https://www.opensource.observer/discord) for help.
---
title: Rest API Crawler
sidebar_position: 2
---

## What Are We Trying to Achieve?
We have a handy factory function called
that takes your configuration and returns a callable **factory** that wires all
assets up with the specified configuration.

For a minimal configuration, we just need to supply a `key_prefix` to the
factory function. This will be used to create the asset keys in the Dagster
environment. It accepts a list of strings as input. Each element will be
represented as a level in the key hierarchy.
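As an illustration of that convention (our own sketch, not the factory's
code; the asset name is an assumed example), the `key_prefix` list combines
with the asset name to form the key path:

```python
# Hypothetical illustration of how a key_prefix list becomes levels in
# the Dagster asset key hierarchy.
key_prefix = ["defillama", "tvl"]
asset_name = "protocols"  # assumed example asset name

asset_key_path = [*key_prefix, asset_name]
asset_key = "/".join(asset_key_path)  # one level per list element
```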

:::tip
Under the hood, this will create a set of Dagster assets, managing all of
pipeline, the data will be ingested into your OSO warehouse.

:::tip
If you have not set up your local Dagster environment yet, please follow
our [quickstart guide](../../guides/dagster/index.md).
:::

Once your Dagster instance is running, follow the
[Dagster Asset Guide](../../guides/dagster/index.md) to materialize the assets.
Our example assets are located under `assets/defillama/tvl`.

![Dagster DefiLlama Asset List](crawl-api-example-defillama.png)

---

## Expanding Your Crawler

In practice, you may do more than just retrieve data:

- **Pagination**: `dlt` supports adding a paginator if you have large result
  sets.
- **Transformations**: You can add transformations before loading, such as
cleaning up invalid fields or renaming columns.

Our tooling is flexible enough to let you customize these details without losing
the simplicity of the factory approach.
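For the pagination bullet above, here is a minimal sketch of what a paginator
declaration can look like in a dlt REST API source config. It is written as a
plain dict; the field names follow dlt's REST API source conventions, so
verify them against your installed dlt version before relying on them:

```python
# Sketch: PokeAPI client with an offset/limit paginator declared in the
# config instead of hand-written page loops. In real code this dict is
# typed as RESTAPIConfig.
pokeapi_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
        "paginator": {
            "type": "offset",       # dlt's offset/limit paginator
            "limit": 100,           # page size per request
            "total_path": "count",  # JSON path to the total item count
        },
    },
    "resources": ["pokemon"],
}
```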

Here's a more advanced example showing automatic pagination and specific field
selection using the Pokémon API:

```py
from dlt.sources.rest_api.typing import RESTAPIConfig

# ... (configuration collapsed in the diff view) ...

dlt_assets = create_rest_factory_asset(config=config)
pokemon_assets = dlt_assets(key_prefix=["pokemon", "moves"])
```

After running the pipeline, you'll find the Pokémon moves assets in your data
warehouse:

![BigQuery Pokemon Moves Table Data](crawl-api-advanced.png)

removes repetitive tasks and helps you maintain a consistent approach to
ingestion. Whenever you need to add or remove endpoints, you simply update your
configuration object.

Does this factory not fit your needs? Check the
[GraphQL API Crawler](./graphql-api.md)!
2 changes: 1 addition & 1 deletion apps/docs/docs/contribute-data/dagster.md
Before writing a fully custom Dagster asset,
we recommend you first see if the previous guides on
[BigQuery datasets](./bigquery.md),
[database replication](./database.md),
[API crawling](./api-crawling/index.md)
may be a better fit.
This guide should only be used in the rare cases where you cannot
use the other methods.
2 changes: 1 addition & 1 deletion apps/docs/docs/contribute-data/index.md
We're always looking for new data sources to integrate with OSO. Here are the current options:

- 🗂️ [BigQuery Public Datasets](./bigquery.md) - Preferred and easiest route for maintaining a dataset
- 🗄️ [Database Replication](./database.md) - Provide access to your database for replication as an OSO dataset
- 🌐 [API Crawling](./api-crawling/index.md) - Crawl REST and GraphQL APIs easily by writing a plugin
- 📁 [Files into Google Cloud Storage (GCS)](./gcs.md) - Drop Parquet/CSV files in our GCS bucket for loading into BigQuery
- ⚙️ [Custom Dagster Assets](./dagster.md) - Write a custom Dagster asset for unique data sources
- 📜 Static Files - Coordinate hand-off for high-quality data via static files. This path is
2 changes: 1 addition & 1 deletion apps/docs/docs/guides/dagster/index.md
Head over to [http://localhost:3000](http://localhost:3000) to access Dagster's
UI. _Et voilà_! You have successfully set up Dagster locally.

This is just the beginning. Check out how to create a
[DLT Dagster Asset](../../contribute-data/api-crawling/index.md) next and start building!
