Add high level overview to normalization doc. (#6445)
* Add high level overview to normalization

* Address review comments

Co-authored-by: Abhi Vaidyanatha <[email protected]>
avaidyanatha and Abhi Vaidyanatha authored Sep 28, 2021
1 parent 911998b commit 6b19bf4
Showing 1 changed file with 26 additions and 13 deletions.
39 changes: 26 additions & 13 deletions docs/understanding-airbyte/basic-normalization.md
@@ -1,22 +1,17 @@
# Basic Normalization

## High-Level Overview

{% hint style="info" %}
The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Information past that can be read for advanced or educational purposes.
{% endhint %}

When you run your first Airbyte sync without basic normalization, you'll notice that your data gets written to your destination as one data column with a JSON blob that contains all of your data. This is the `_airbyte_raw_` table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be retransformed without needing to sync data again.
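
To make the raw-table idea concrete, here is a minimal sketch of what one row of an `_airbyte_raw_<stream>` table might hold. The column names are assumptions for illustration; the key point is that the entire source record sits untouched in a single JSON column:

```python
import json

# Hypothetical shape of one row in an `_airbyte_raw_<stream>` table; the
# column names here are assumptions for illustration. The whole source
# record is kept untouched in a single JSON column.
raw_row = {
    "_airbyte_ab_id": "a1b2c3",                      # assumed: unique row id
    "_airbyte_emitted_at": "2021-09-28T00:00:00Z",   # assumed: sync timestamp
    "_airbyte_data": json.dumps({"id": 1, "name": "alice"}),
}

# Because the raw record is untouched, it can be re-parsed and
# retransformed at any time without re-syncing from the source:
original = json.loads(raw_row["_airbyte_data"])
print(original)  # → {'id': 1, 'name': 'alice'}
```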

If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to infer a schema and create tables matching your data, converting it to the native format of your destination. This runs after your sync and may take a long time if you have a large amount of data synced. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself.
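
The schema-inference step can be sketched roughly like this. This is not Airbyte's actual implementation, just an illustration of the idea: each top-level field of the blob becomes a column, with its JSON type mapped to a destination-native type (the type names are assumptions):

```python
import json

# A minimal sketch (not Airbyte's actual implementation) of turning a raw
# JSON blob into a column schema: each top-level field becomes a column,
# with its Python/JSON type mapped to an assumed destination-native type.
TYPE_MAP = {str: "VARCHAR", int: "BIGINT", float: "FLOAT", bool: "BOOLEAN"}

blob = json.dumps({"id": 1, "name": "alice", "active": True})
record = json.loads(blob)

schema = {col: TYPE_MAP[type(val)] for col, val in record.items()}
print(schema)  # → {'id': 'BIGINT', 'name': 'VARCHAR', 'active': 'BOOLEAN'}
```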

## Example

Basic Normalization uses a fixed set of rules to map a JSON object from a source to the types and formats that are native to the destination. For example, if a source emits data that looks like this:

@@ -50,6 +45,24 @@
The [normalization rules](basic-normalization.md#Rules) are _not_ configurable.

Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`. If Basic Normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.
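
The "query within the datastore" idea can be sketched with SQLite standing in for the destination warehouse (assuming a SQLite build with the JSON1 functions; real destinations use their own SQL dialects via dbt). The table and column names follow the pattern described above:

```python
import json
import sqlite3

# Sketch of the "transform inside the datastore" idea, using SQLite's JSON
# functions as a stand-in for the destination warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _airbyte_raw_users (_airbyte_data TEXT)")
conn.execute(
    "INSERT INTO _airbyte_raw_users VALUES (?)",
    (json.dumps({"id": 1, "name": "alice"}),),
)

# The normalized table is produced by a single query that never moves the
# data out of the database, which is what avoids extra network time/cost:
conn.execute(
    """
    CREATE TABLE users AS
    SELECT json_extract(_airbyte_data, '$.id')   AS id,
           json_extract(_airbyte_data, '$.name') AS name
    FROM _airbyte_raw_users
    """
)
rows = conn.execute("SELECT id, name FROM users").fetchall()
print(rows)  # → [(1, 'alice')]
```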

## Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this process produces a table in the destination with a single JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.

So, after EL comes the T \(transformation\), and the first T step that Airbyte applies on top of the extracted data is called "Normalization".

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:

![](../.gitbook/assets/connecting-EL-with-T-4.png)

In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:
- the Airbyte base-normalization Python package, which generates dbt SQL model files
- dbt, which compiles and executes the models on top of the data in the destinations that support it
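
The first bullet above can be illustrated with a toy sketch: generating a dbt-style SQL model from a stream's JSON schema. The real base-normalization package also handles nested objects, arrays, naming collisions, and per-destination SQL dialects; everything here is simplified for illustration and the names are hypothetical:

```python
# Toy sketch of generating a dbt-style SQL model from a stream's JSON
# schema. The schema shape and SQL below are simplified illustrations,
# not the output of Airbyte's actual base-normalization package.
stream_schema = {
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}
}

select_list = ",\n       ".join(
    f"json_extract(_airbyte_data, '$.{col}') AS {col}"
    for col in stream_schema["properties"]
)
model_sql = f"SELECT {select_list}\nFROM _airbyte_raw_users"
print(model_sql)
```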

## Destinations that Support Basic Normalization

* [BigQuery](../integrations/destinations/bigquery.md)
