Basic Normalization

Basic Normalization High-Level Overview

INFO

The high-level overview contains all the information you need to use Basic Normalization when pulling from APIs. Everything beyond it is for advanced or educational purposes.

When you run your first Airbyte sync without the basic normalization, you'll notice that your data gets written to your destination as one data column with a JSON blob that contains all of your data. This is the _airbyte_raw_ table that you may have seen before. Why do we create this table? A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be retransformed without needing to sync data again.

If you have Basic Normalization enabled, Airbyte automatically uses this JSON blob to create a schema and tables for your data, converting it to the format of your destination. This runs after your sync and may take a long time if you synced a large amount of data. If you don't enable Basic Normalization, you'll have to transform the JSON data in that column yourself.
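For illustration, here is a minimal Python sketch of the flattening you would otherwise do by hand on a raw row. The id and timestamp values are made up for the example, and the cast follows the car schema used throughout this page:

```python
import json

# One row as it might appear in an _airbyte_raw_* table: metadata
# columns plus the JSON blob in _airbyte_data (values are made up).
raw_row = {
    "_airbyte_ab_id": "example-uuid-0001",
    "_airbyte_emitted_at": "2021-07-01T00:00:00Z",
    "_airbyte_data": '{"make": "alfa romeo", "model": "4C coupe", "horsepower": "247"}',
}

# Without Basic Normalization you would flatten the blob yourself,
# producing one typed column per top-level JSON field.
blob = json.loads(raw_row["_airbyte_data"])
flat_row = {
    "_airbyte_ab_id": raw_row["_airbyte_ab_id"],
    "_airbyte_emitted_at": raw_row["_airbyte_emitted_at"],
    "make": blob["make"],
    "model": blob["model"],
    "horsepower": int(blob["horsepower"]),  # cast to the declared type
}
```

Basic Normalization performs the equivalent work as SQL inside your destination, so you never have to write this flattening yourself.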

Example

Basic Normalization uses a fixed set of rules to map a JSON object from a source to the types and formats that are native to the destination. For example, if a source emits data that looks like this:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "horsepower": "247"
}

The destination connectors produce the following raw table in the destination database:

CREATE TABLE "_airbyte_raw_cars" (
    -- metadata added by airbyte
    "_airbyte_ab_id" VARCHAR, -- uuid value assigned by connectors to each row of the data written in the destination
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE, -- time at which the record was emitted
    "_airbyte_data" JSONB -- data stored as a JSON blob
);

Then, basic normalization would create the following table:

CREATE TABLE "cars" (
    -- metadata added by airbyte
    "_airbyte_ab_id" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    -- data from source
    "make" VARCHAR,
    "model" VARCHAR,
    "horsepower" INTEGER
);

Normalization metadata columns

You'll notice that some metadata columns are added to keep track of important information about each record.

  • Some are introduced at the destination connector level and propagated by the normalization process from the raw table to the final table:
    • _airbyte_ab_id: uuid value assigned by connectors to each row of the data written in the destination.
    • _airbyte_emitted_at: time at which the record was emitted and recorded by the destination connector.
  • Others are created at the normalization step:
    • _airbyte_<table_name>_hashid: hash value assigned by Airbyte normalization, derived from a hash function of the record data.
    • _airbyte_normalized_at: time at which the record was last normalized (useful to track when incremental transformations are performed).
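As a rough intuition for the hashid column, here is an illustrative Python sketch: a deterministic hash derived from the record's data. The real value is computed by generated dbt SQL inside the destination, so the exact function and canonicalization here are assumptions, not Airbyte's implementation:

```python
import hashlib
import json

def record_hashid(record: dict) -> str:
    """Illustrative only: derive a deterministic hash from the record's
    data, similar in spirit to _airbyte_<table_name>_hashid. The real
    value is produced by generated dbt SQL, not by this function."""
    canonical = json.dumps(record, sort_keys=True)  # stable key order
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

hashid = record_hashid({"make": "alfa romeo", "model": "4C coupe", "horsepower": 247})
```

Because the hash depends only on the record data, re-running normalization over unchanged rows yields the same hashid, which is what makes it usable as a stable join key between parent and child tables.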

Additional metadata columns can be added on some tables depending on the usage:

  • On the Slowly Changing Dimension (SCD) tables:
    • _airbyte_start_at: equivalent to the cursor column defined on the table; denotes when the row was first seen.
    • _airbyte_end_at: denotes until when the row was seen with these particular values. If this column is not NULL, the record has been updated and is no longer the most up-to-date version. If NULL, the row is the latest version of the record.
    • _airbyte_active_row: denotes whether the row is the latest version of the record.
    • _airbyte_unique_key_scd: hash of primary keys + cursors, used to de-duplicate the SCD table.
  • On de-duplicated (and SCD) tables:
    • _airbyte_unique_key: hash of primary keys, used to de-duplicate the final table.
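To see how the SCD columns relate to each other, here is a hedged Python sketch that derives them for one record's version history ordered by a cursor field. The function name and logic are illustrative; the real derivation lives in the generated dbt models:

```python
def build_scd_rows(versions: list[dict], cursor_field: str) -> list[dict]:
    """Illustrative sketch of the SCD metadata columns for the version
    history of a single record. Not Airbyte's actual dbt logic."""
    ordered = sorted(versions, key=lambda v: v[cursor_field])
    rows = []
    for i, version in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        rows.append({
            **version,
            "_airbyte_start_at": version[cursor_field],
            # NULL (None) end marks the latest version of the record
            "_airbyte_end_at": nxt[cursor_field] if nxt else None,
            "_airbyte_active_row": nxt is None,
        })
    return rows
```

Note the invariant this encodes: exactly one row per record has `_airbyte_end_at` NULL, and that same row has `_airbyte_active_row` set.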

The normalization rules are not configurable. They are designed to pick a reasonable set of defaults to hit the 80/20 rule of data normalization. We recognize that normalization is a detail-oriented problem and that a fixed set of rules cannot normalize your data in a way that covers all use cases. If this feature does not meet your normalization needs, we always put the full JSON blob in the destination as well, so that you can parse that object however best suits your use case. We will be adding more advanced normalization functionality shortly. Airbyte is focused on the EL of ELT; if you need a featureful tool for transformations, we suggest trying out dbt.

Airbyte places the JSON blob version of your data in a table called _airbyte_raw_<stream name>. If basic normalization is turned on, it will place a separate copy of the data in a table called <stream name>. Under the hood, Airbyte is using dbt, which means that the data only ingresses into the data store one time. The normalization happens as a query within the data store. This implementation avoids extra network time and costs.

Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL (Extract Load) steps of an ELT process. These steps can also be referred to in Airbyte's dialect as "Source" and "Destination".

However, this process produces a table in the destination with a single JSON blob column. For the typical analytics use case, you probably want this JSON blob normalized so that each field is its own column.

So, after EL, comes the T (transformation), and the first T step that Airbyte applies on top of the extracted data is called "Normalization".

Airbyte runs this step before handing the final data over to other tools that will manage further transformation down the line.

To summarize, the ELT process consists of the steps that happen between your "Source Database or API" and the final "Replicated Tables", with examples of implementation underneath:

In Airbyte, the current normalization option is implemented using a dbt Transformer composed of:

  • Airbyte base-normalization python package to generate dbt SQL model files
  • dbt to compile and execute the models on top of the data in the destinations that support it.

Destinations that Support Basic Normalization
  • BigQuery
  • MS SQL Server
  • MySQL
    • The server must support the WITH keyword.
    • Requires MySQL >= 8.0 or MariaDB >= 10.2.1.
  • Postgres
  • Redshift
  • Snowflake

Basic Normalization can be configured when you create a connection, and afterward in the connection's Transformation tab. Select the option: Normalized tabular data.

Rules

Typing

Airbyte tracks types using JsonSchema's primitive types. Here is how these types will map onto standard SQL types. Note: The names of the types may differ slightly across different destinations.

Airbyte uses the types described in the catalog to determine the correct type for each column. It does not try to use the values themselves to infer the type.
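As a hedged illustration of such a mapping, here is a sketch in Python. The generic SQL type names below are assumptions for illustration; each destination uses its own native equivalents, so consult the destination's documentation for the exact names:

```python
# Illustrative mapping from JsonSchema primitive types to generic SQL
# types. Exact type names vary per destination.
JSON_TO_SQL = {
    "string": "VARCHAR",
    "number": "FLOAT",
    "integer": "INTEGER",
    "boolean": "BOOLEAN",
    # object and array columns are also expanded into their own tables
    # (see the Nesting section); the raw value stays available as text.
    "object": "VARCHAR",
    "array": "VARCHAR",
}

def column_type(json_schema_type: str) -> str:
    # Fall back to a text column when the type is missing or unknown.
    return JSON_TO_SQL.get(json_schema_type, "VARCHAR")
```

The key point from the paragraph above: the input to this mapping is the declared catalog type, never a type inferred from the values themselves.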

Nesting

Basic Normalization attempts to expand any nested arrays or objects it receives into separate tables to allow more ergonomic querying of your data.

Arrays

Basic Normalization expands arrays into separate tables. For example, if the source provides the following data:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": [
    { "name": "4C spider", "release_year": 2013 },
    { "name": "4C spider italia", "release_year": 2018 }
  ]
}

The resulting normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);




CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "name" VARCHAR,
    "release_year" VARCHAR
);
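The expansion above can be sketched in Python: the parent keeps its scalar fields, and each object in the array becomes a row in a child table linked back through the parent's hash id. The hash derivation here is illustrative only (the real one comes from generated dbt SQL):

```python
import hashlib
import json

record = {
    "make": "alfa romeo",
    "model": "4C coupe",
    "limited_editions": [
        {"name": "4C spider", "release_year": 2013},
        {"name": "4C spider italia", "release_year": 2018},
    ],
}

# Illustrative parent hash id (the real value comes from generated dbt SQL).
cars_hashid = hashlib.md5(
    json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()

# Parent row keeps the scalar fields; the array moves to a child table.
cars_row = {
    "_airbyte_cars_hashid": cars_hashid,
    "make": record["make"],
    "model": record["model"],
}

# One child row per array element, linked back via the foreign hash id.
limited_editions_rows = [
    {"_airbyte_cars_foreign_hashid": cars_hashid, **edition}
    for edition in record["limited_editions"]
]
```

Joining the child table back to the parent on `_airbyte_cars_foreign_hashid = _airbyte_cars_hashid` recovers the original nesting.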

If the nested items in the array are not objects, they are expanded into a string field of comma-separated values, e.g.:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "limited_editions": ["4C spider", "4C spider italia"]
}

The resulting normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);




CREATE TABLE "limited_editions" (
    "_airbyte_limited_editions_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "data" VARCHAR
);

Objects

In the case of a nested object e.g.:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "powertrain_specs": { "horsepower": 247, "transmission": "6-speed" }
}

The normalized schema would be:

CREATE TABLE "cars" (
    "_airbyte_cars_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "make" VARCHAR,
    "model" VARCHAR
);




CREATE TABLE "powertrain_specs" (
    "_airbyte_powertrain_specs_hashid" VARCHAR,
    "_airbyte_cars_foreign_hashid" VARCHAR,
    "_airbyte_emitted_at" TIMESTAMP_WITH_TIMEZONE,
    "_airbyte_normalized_at" TIMESTAMP_WITH_TIMEZONE,

    "horsepower" INTEGER,
    "transmission" VARCHAR
);

Naming Collisions for un-nested objects

When extracting nested objects or arrays, the Basic Normalization process needs to figure out new names for the expanded tables.

For example, consider a cars table with a nested array column cars, whose items have the same schema as the parent table:

{
  "make": "alfa romeo",
  "model": "4C coupe",
  "cars": [
    { "make": "audi", "model": "A7" },
    { "make": "lotus", "model": "elise" },
    { "make": "chevrolet", "model": "mustang" }
  ]
}

The expanded table would have a conflict in terms of naming since both are named cars. To avoid name collisions and ensure a more consistent naming scheme, Basic Normalization chooses the expanded name as follows:

  • cars for the original parent table
  • cars_da3_cars for the expanded nested columns, following this naming scheme in 3 parts: <Json path>_<Hash>_<nested column name>
    • JSON path: the entire JSON path string, with '_' characters used as delimiters, to reach the table that contains the nested column name.
    • Hash: hash of the entire JSON path to reach the nested column, reduced to 3 characters. This ensures the name is unique (in case part of the name gets truncated; see below).
    • Nested column name: name of the column being expanded into its own table.

By following this strategy, nested columns should "never" collide with other table names. If one does, an exception will likely be thrown, either by the normalization process or by dbt running afterward.
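The 3-part naming scheme can be sketched as follows in Python. The hash function and path encoding here are assumptions for illustration, so the digest will not match the `da3` shown above; only the shape of the name is the point:

```python
import hashlib

def expanded_table_name(json_path: list[str], column: str) -> str:
    """Illustrative version of the <Json path>_<Hash>_<nested column name>
    scheme. The exact hash used by Airbyte's normalization is not
    reproduced here."""
    path = "_".join(json_path)  # JSON path joined with '_' delimiters
    # Hash of the full path to the nested column, reduced to 3 characters.
    digest = hashlib.md5(
        "/".join(json_path + [column]).encode("utf-8")).hexdigest()[:3]
    return f"{path}_{digest}_{column}"
```

For the example above, `expanded_table_name(["cars"], "cars")` yields a name of the form `cars_<3-char hash>_cars`, which cannot collide with the parent table `cars` even though the nested column shares its name.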

CREATE TABLE "cars" (