add: graphql factory documentation (#2887)
* fix: link to `new` graphql + rest home docs

* chore: move `rest` api docs to new location

* add: `graphql` factory documentation

* fix: correct `broken` links
Jabolol authored Jan 28, 2025
1 parent 7da5eeb commit f11dc89
Showing 10 changed files with 242 additions and 17 deletions.
202 changes: 202 additions & 0 deletions apps/docs/docs/contribute-data/api-crawling/graphql-api.md
---
title: GraphQL API Crawler
sidebar_position: 3
---

## GraphQL Resource Factory

Many of our data ingestion workflows rely on GraphQL, but constructing queries
and introspecting types can involve repetitive steps. This **GraphQL Resource
Factory** eliminates boilerplate by handling introspection, query building, and
parameter management for you.

This guide will explain how to use the factory to bootstrap a
[`graphql_factory`](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/factories/graphql.py)
asset in Dagster. We'll walk through the process of defining a configuration,
building the factory, and customizing the asset to suit your needs.

---

## Step by Step: Defining Your GraphQL Resource

This example demonstrates how to create a GraphQL asset that fetches
transactions from the
[Open Collective API](https://docs.opencollective.com/help/contributing/development/api).
The API exposes a `transactions` query that returns a list of transactions.

Currently, it exposes hundreds of nested fields, making it **cumbersome** to
write queries manually (we have done this by hand in the past, and it's not
fun). The GraphQL Resource Factory will generate the query for us, extract the
relevant data, and return a clean, usable asset, all with minimal effort.

### 1. Create the Configuration

The first step is to define a configuration object that describes your GraphQL
resource. For the Open Collective transactions example, we set the endpoint URL,
define query parameters, and specify a transformation function to extract the
data we need.

We also set a `max_depth` parameter to limit the depth of the introspection
query. The generated query will only explore fields up to this depth,
preventing it from becoming too large while still capturing all the data our
asset needs.

```python
from ..factories.graphql import GraphQLResourceConfig

config = GraphQLResourceConfig(
name="transactions",
endpoint="https://api.opencollective.com/graphql/v2",
max_depth=3, # Limit the introspection depth
parameters={
"limit": {
"type": "Int!",
"value": 10,
},
"type": {
"type": "TransactionType!",
"value": "CREDIT",
},
"dateFrom": {
"type": "DateTime!",
"value": "2024-01-01T00:00:00Z",
},
"dateTo": {
"type": "DateTime!",
"value": "2024-12-31T23:59:59Z",
},
},
transform_fn=lambda result: result["transactions"]["nodes"], # Optional transformation function
target_query="transactions", # The query to execute
target_type="TransactionCollection", # The type containing the data
)
```

:::tip
For the full `GraphQLResourceConfig` spec, see the [source definition](https://github.com/opensource-observer/oso/blob/05fe8b9192a08f6446225a89f4455c6b3723c5de/warehouse/oso_dagster/factories/graphql.py#L99).
:::

In this configuration, we define the following fields:

- **name**: A unique identifier for the Dagster asset.
- **endpoint**: The URL of the GraphQL API.
- **max_depth**: The maximum depth of the introspection query. This will
generate a query that explores all fields recursively up to this depth.
- **parameters**: A dictionary of query parameters. The keys are the parameter
names, and the values are dictionaries with the parameter type and value.
- **transform_fn**: A function that processes the raw GraphQL response and
returns the desired data.
- **target_query**: The name of the query to execute.
- **target_type**: The name of the GraphQL type that contains the data of
interest.
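A `transform_fn` can also reshape the payload rather than just select a key.
Here is a hypothetical sketch (the function name and the flattening logic are
our own illustration, not part of the factory API) that extracts the nodes and
flattens the nested `amount` object:

```python
def extract_transactions(result: dict) -> list[dict]:
    """Pull transaction nodes out of the raw GraphQL response and
    flatten the nested `amount` object into top-level columns."""
    rows = []
    for node in result["transactions"]["nodes"]:
        amount = node.pop("amount", None) or {}
        node["amount_value"] = amount.get("value")
        node["amount_currency"] = amount.get("currency")
        rows.append(node)
    return rows

# Would be passed in the config as: transform_fn=extract_transactions
```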

The factory will create the following query automatically, recursively
introspecting all the fields up to the specified depth:

```graphql
query (
$limit: Int!
$type: TransactionType!
$dateFrom: DateTime!
$dateTo: DateTime!
) {
transactions(
limit: $limit
type: $type
dateFrom: $dateFrom
dateTo: $dateTo
) {
offset
limit
totalCount
nodes {
id
legacyId
uuid
group
type
kind
amount {
value
currency
valueInCents
}
oppositeTransaction {
id
legacyId
uuid
# ... other generated fields ...
merchantId
invoiceTemplate
}
merchantId
balanceInHostCurrency {
value
currency
valueInCents
}
invoiceTemplate
}
kinds
}
}
```
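Conceptually, the depth-limited generation above can be sketched as follows.
This is a simplified illustration, not the factory's actual implementation;
real introspection also has to handle arguments, unions, and cyclic types:

```python
def build_selection(fields: dict, max_depth: int, depth: int = 1) -> str:
    """Recursively build a GraphQL selection set. `fields` maps a field
    name to None (scalar) or to a dict of subfields (object type).
    Object fields deeper than `max_depth` are silently dropped."""
    parts = []
    for name, subfields in fields.items():
        if subfields is None:  # scalar: always included
            parts.append(name)
        elif depth < max_depth:  # object: recurse one level deeper
            inner = build_selection(subfields, max_depth, depth + 1)
            parts.append(f"{name} {{ {inner} }}")
    return " ".join(parts)
```

With `max_depth=1` only top-level scalars survive; each extra level admits one
more layer of nesting, which is why `max_depth=3` above reaches fields like
`amount.valueInCents` but stops there.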

### 2. Build the Factory

:::tip
The GraphQL factory function takes a mandatory `config`
argument. The other arguments are directly passed to the underlying
`dlt_factory` function, allowing you to customize the behavior of the asset.

For the full reference of the allowed arguments, check out the Dagster
[`asset`](https://docs.dagster.io/api/python-api/assets) documentation.
:::

The `graphql_factory` function converts your configuration into a callable
Dagster asset. It takes the configuration object and returns a factory
function that our infrastructure uses to automatically create the asset.

```python
from ..factories.graphql import graphql_factory

# ... config definition ...

open_collective_transactions = graphql_factory(
config,
key_prefix="open_collective",
)
```
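When the asset materializes, the generated query is ultimately sent as a plain
GraphQL POST whose `variables` payload is derived from `parameters`. A generic
standard-library sketch of that step (function names are ours; this is not the
factory's actual HTTP client):

```python
import json
import urllib.request


def build_variables(parameters: dict) -> dict:
    """Flatten {name: {"type": ..., "value": ...}} into {name: value},
    matching the shape of GraphQLResourceConfig.parameters."""
    return {name: spec["value"] for name, spec in parameters.items()}


def run_query(endpoint: str, query: str, parameters: dict) -> dict:
    """POST the query and return the `data` portion of the response."""
    payload = json.dumps(
        {"query": query, "variables": build_variables(parameters)}
    ).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]
```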

---

## How to Run and View Results

:::tip
If you have not set up your local Dagster environment yet, please follow
our [quickstart guide](../../guides/dagster/index.md).
:::

Once your Dagster instance is running, follow the
[Dagster Asset Guide](../../guides/dagster/index.md) to materialize the assets.
Our example assets are located under `assets/open_collective/transactions`.

![Dagster Open Collective Asset List](crawl-api-graphql-pipeline.png)

Running the pipeline fetches `10` transactions from the Open Collective API and
stores them in BigQuery:

![Dagster Open Collective Result](crawl-api-example-opencollective.png)

---

## Conclusion

The GraphQL Resource Factory is a powerful tool for creating reusable assets
that fetch data from GraphQL APIs. By defining a configuration object and
building a factory function, you can quickly create assets that crawl even
complex APIs.

This allows you to focus on the data you need and the transformations you want
to apply, rather than the mechanics of constructing queries and managing API
interactions.
18 changes: 18 additions & 0 deletions apps/docs/docs/contribute-data/api-crawling/index.md
---
title: Crawl an API
sidebar_position: 4
---

Currently, we offer factories for two types of APIs: REST and GraphQL. These
factories allow you to ingest data from APIs with minimal effort. Here are the
guides to help you get started:

- [REST API Crawler](./rest-api.md): This guide will walk you through the
process of creating a REST API crawler using our `rest` factory. It will show
you how to define the endpoints you want to fetch, create a configuration
object, and build the asset.
- [GraphQL API Crawler](./graphql-api.md): This guide will explain how to use
the GraphQL Resource Factory to bootstrap a `graphql_factory` asset in Dagster
that fetches all fields from a GraphQL endpoint automatically.

Reach out to us on [Discord](https://www.opensource.observer/discord) for help.
---
title: Rest API Crawler
sidebar_position: 2
---

## What Are We Trying to Achieve?
We have a handy factory function called
that takes your configuration and returns a callable **factory** that wires all
assets up with the specified configuration.

For a minimal configuration, we just need to supply a `key_prefix` to the
factory function. This will be used to create the asset keys in the Dagster
environment. It accepts a list of strings as input. Each element will be
represented as a level in the key hierarchy.
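As an illustration of that convention (our own sketch, not the factory's
code; the asset name is an assumed example), the `key_prefix` list combines
with the asset name to form the key path:

```python
# Hypothetical illustration of how a key_prefix list becomes levels in
# the Dagster asset key hierarchy.
key_prefix = ["defillama", "tvl"]
asset_name = "protocols"  # assumed example asset name

asset_key_path = [*key_prefix, asset_name]
asset_key = "/".join(asset_key_path)  # one level per list element
```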

:::tip
Under the hood, this will create a set of Dagster assets, managing all of
pipeline, the data will be ingested into your OSO warehouse.

:::tip
If you have not set up your local Dagster environment yet, please follow
our [quickstart guide](../../guides/dagster/index.md).
:::

Once your Dagster instance is running, follow the
[Dagster Asset Guide](../../guides/dagster/index.md) to materialize the assets.
Our example assets are located under `assets/defillama/tvl`.

![Dagster DefiLlama Asset List](crawl-api-example-defillama.png)

---

## Expanding Your Crawler

In practice, you may do more than just retrieve data:

- **Pagination**: `dlt` supports adding a paginator if you have large result
  sets.
- **Transformations**: You can add transformations before loading, such as
cleaning up invalid fields or renaming columns.

Our tooling is flexible enough to let you customize these details without losing
the simplicity of the factory approach.
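For the pagination bullet above, here is a minimal sketch of what a paginator
declaration can look like in a dlt REST API source config. It is written as a
plain dict; the field names follow dlt's REST API source conventions, so
verify them against your installed dlt version before relying on them:

```python
# Sketch: PokeAPI client with an offset/limit paginator declared in the
# config instead of hand-written page loops. In real code this dict is
# typed as RESTAPIConfig.
pokeapi_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
        "paginator": {
            "type": "offset",       # dlt's offset/limit paginator
            "limit": 100,           # page size per request
            "total_path": "count",  # JSON path to the total item count
        },
    },
    "resources": ["pokemon"],
}
```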

Here's a more advanced example showing automatic pagination and specific field
selection using the Pokémon API:

```py
from dlt.sources.rest_api.typing import RESTAPIConfig

# ... (configuration collapsed in the diff view) ...

dlt_assets = create_rest_factory_asset(config=config)
pokemon_assets = dlt_assets(key_prefix=["pokemon", "moves"])
```

After running the pipeline, you'll find the Pokémon moves assets in your data
warehouse:

![BigQuery Pokemon Moves Table Data](crawl-api-advanced.png)

removes repetitive tasks and helps you maintain a consistent approach to
ingestion. Whenever you need to add or remove endpoints, you simply update your
configuration object.

Does this factory not fit your needs? Check the
[GraphQL API Crawler](./graphql-api.md)!
2 changes: 1 addition & 1 deletion apps/docs/docs/contribute-data/dagster.md
Before writing a fully custom Dagster asset,
we recommend you first see if the previous guides on
[BigQuery datasets](./bigquery.md),
[database replication](./database.md),
[API crawling](./api-crawling/index.md)
may be a better fit.
This guide should only be used in the rare cases where you cannot
use the other methods.
2 changes: 1 addition & 1 deletion apps/docs/docs/contribute-data/index.md
We're always looking for new data sources to integrate with OSO. Here are the current options:

- 🗂️ [BigQuery Public Datasets](./bigquery.md) - Preferred and easiest route for maintaining a dataset
- 🗄️ [Database Replication](./database.md) - Provide access to your database for replication as an OSO dataset
- 🌐 [API Crawling](./api-crawling/index.md) - Crawl REST and GraphQL APIs easily by writing a plugin
- 📁 [Files into Google Cloud Storage (GCS)](./gcs.md) - Drop Parquet/CSV files in our GCS bucket for loading into BigQuery
- ⚙️ [Custom Dagster Assets](./dagster.md) - Write a custom Dagster asset for unique data sources
- 📜 Static Files - Coordinate hand-off for high-quality data via static files. This path is
2 changes: 1 addition & 1 deletion apps/docs/docs/guides/dagster/index.md
Head over to [http://localhost:3000](http://localhost:3000) to access Dagster's
UI. _Et voilà_! You have successfully set up Dagster locally.

This is just the beginning. Check out how to create a
[DLT Dagster Asset](../../contribute-data/api-crawling/index.md) next and start building!
