Add high level overview to normalization doc. (#6445)
* Add high level overview to normalization

* Address review comments

Co-authored-by: Abhi Vaidyanatha <[email protected]>
avaidyanatha and Abhi Vaidyanatha authored Sep 28, 2021
1 parent 911998b commit 6b19bf4
Showing 1 changed file with 26 additions and 13 deletions.
39 changes: 26 additions & 13 deletions docs/understanding-airbyte/basic-normalization.md
@@ -1,22 +1,17 @@
# Basic Normalization

## High-Level Overview

{% hint style="info" %}
The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Information past that can be read for advanced or educational purposes.
{% endhint %}

When you run your first Airbyte sync without basic normalization, you'll notice that your data gets written to your destination as one data column with a JSON blob that contains all of your data. This is the `_airbyte_raw_` table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be retransformed without needing to sync data again.
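
To make the raw-table idea concrete, here is a minimal sketch of what one row of an `_airbyte_raw_<stream>` table might hold. The column names are assumptions for illustration; the key point is that the entire source record sits untouched in a single JSON column:

```python
import json

# Hypothetical shape of one row in an `_airbyte_raw_<stream>` table; the
# column names here are assumptions for illustration. The whole source
# record is kept untouched in a single JSON column.
raw_row = {
    "_airbyte_ab_id": "a1b2c3",                      # assumed: unique row id
    "_airbyte_emitted_at": "2021-09-28T00:00:00Z",   # assumed: sync timestamp
    "_airbyte_data": json.dumps({"id": 1, "name": "alice"}),
}

# Because the raw record is untouched, it can be re-parsed and
# retransformed at any time without re-syncing from the source:
original = json.loads(raw_row["_airbyte_data"])
print(original)  # → {'id': 1, 'name': 'alice'}
```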

If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to infer a schema and create tables matching your data, converting it to the native format of your destination. This runs after your sync and may take a long time if you have a large amount of data synced. If you don't enable Basic Normalization, you'll have to transform the JSON data from that column yourself.
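
The schema-inference step can be sketched roughly like this. This is not Airbyte's actual implementation, just an illustration of the idea: each top-level field of the blob becomes a column, with its JSON type mapped to a destination-native type (the type names are assumptions):

```python
import json

# A minimal sketch (not Airbyte's actual implementation) of turning a raw
# JSON blob into a column schema: each top-level field becomes a column,
# with its Python/JSON type mapped to an assumed destination-native type.
TYPE_MAP = {str: "VARCHAR", int: "BIGINT", float: "FLOAT", bool: "BOOLEAN"}

blob = json.dumps({"id": 1, "name": "alice", "active": True})
record = json.loads(blob)

schema = {col: TYPE_MAP[type(val)] for col, val in record.items()}
print(schema)  # → {'id': 'BIGINT', 'name': 'VARCHAR', 'active': 'BOOLEAN'}
```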

## Example

Basic Normalization uses a fixed set of rules to map a JSON object from a source to the types and formats that are native to the destination. For example, if a source emits data that looks like this:

@@ -50,6 +45,24 @@
The [normalization rules](basic-normalization.md#Rules) are _not_ configurable.

Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`. If Basic Normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the datastore. This implementation avoids extra network time and costs.
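
The "query within the datastore" idea can be sketched with SQLite standing in for the destination warehouse (assuming a SQLite build with the JSON1 functions; real destinations use their own SQL dialects via dbt). The table and column names follow the pattern described above:

```python
import json
import sqlite3

# Sketch of the "transform inside the datastore" idea, using SQLite's JSON
# functions as a stand-in for the destination warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _airbyte_raw_users (_airbyte_data TEXT)")
conn.execute(
    "INSERT INTO _airbyte_raw_users VALUES (?)",
    (json.dumps({"id": 1, "name": "alice"}),),
)

# The normalized table is produced by a single query that never moves the
# data out of the database, which is what avoids extra network time/cost:
conn.execute(
    """
    CREATE TABLE users AS
    SELECT json_extract(_airbyte_data, '$.id')   AS id,
           json_extract(_airbyte_data, '$.name') AS name
    FROM _airbyte_raw_users
    """
)
rows = conn.execute("SELECT id, name FROM users").fetchall()
print(rows)  # → [(1, 'alice')]
```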

## Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this process produces a table in the destination with a single JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.

So, after EL comes the T \(transformation\), and the first T step that Airbyte applies on top of the extracted data is called "Normalization".

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, we can represent the ELT process in the diagram below. These are the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:

![](../.gitbook/assets/connecting-EL-with-T-4.png)

In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:
- the Airbyte base-normalization Python package, which generates dbt SQL model files
- dbt, which compiles and executes the models on top of the data in the destinations that support it
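
The first bullet above can be illustrated with a toy sketch: generating a dbt-style SQL model from a stream's JSON schema. The real base-normalization package also handles nested objects, arrays, naming collisions, and per-destination SQL dialects; everything here is simplified for illustration and the names are hypothetical:

```python
# Toy sketch of generating a dbt-style SQL model from a stream's JSON
# schema. The schema shape and SQL below are simplified illustrations,
# not the output of Airbyte's actual base-normalization package.
stream_schema = {
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}
}

select_list = ",\n       ".join(
    f"json_extract(_airbyte_data, '$.{col}') AS {col}"
    for col in stream_schema["properties"]
)
model_sql = f"SELECT {select_list}\nFROM _airbyte_raw_users"
print(model_sql)
```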

## Destinations that Support Basic Normalization

* [BigQuery](../integrations/destinations/bigquery.md)
