Airbyte Protocol

Airbyte Protocol

Goals

The Airbyte Protocol describes a series of standard components and all the interactions between them in order to declare an ELT pipeline. All message passing across components is done via serialized JSON messages for inter-process communication.

This document describes the protocol as it exists in its CURRENT form. Stay tuned for an RFC on how the protocol will evolve.

This document is intended to contain ALL the rules of the Airbyte Protocol in one place. Anything not contained in this document is NOT part of the Protocol. At the time of writing, there is one known exception, which is the Supported Data Types, which contains rules on data types that are part of the Protocol. That said, there are additional articles, e.g. A Beginner's Guide to the Airbyte Catalog that repackage the information in this document for different audiences.

Key Concepts

There are 2 major components in the Airbyte Protocol: Source and Destination. These components are referred to as Actors. A source is an application that is described by a series of standard interfaces. This application extracts data from an underlying data store. A data store in this context refers to the tool where the data is actually stored. A data store includes: databases, APIs, anything that produces data, etc. For example, the Postgres Source is a Source that pulls from Postgres (which is a data store). A Destination is an application that is described by a series of standard interfaces that loads data into a data store.

The key primitives that the Protocol uses to describe data are Catalog, Configured Catalog, Stream, Configured Stream, and Field:

  • Stream - A Stream describes the schema of a resource and various metadata about how a user can interact with that resource. A resource in this context might refer to a database table, a resource in a REST API, or a data stream.
  • Field - A Field refers to a "column" in a Stream. In a database this would be a column; in a JSON object it is a field.
  • Catalog - A Catalog is a list of Streams that describes the data in the data store that a Source represents.

An Actor can advertise information about itself with an Actor Specification. One of the main pieces of information the specification shares is what information is needed to configure an Actor.

Each of these concepts is described in greater depth in their respective section.

Actor Interface

This section describes important details about the interface over actors. It reviews parts of the interface that are the same across all actors. It also describes some invariants for all methods in actor interfaces.

Common Interface Methods

The following part of the interface is identical across all actors:

spec() -> ConnectorSpecification

check(Config) -> AirbyteConnectionStatus

These methods are described in their respective sections (spec, check).

Interface Invariants

The output of each method in actor interface is wrapped in an AirbyteMessage. This struct is an envelope for the return value of any message in the described interface. See the section the AirbyteMessage section below for more details. For the sake of brevity, interface diagrams will elide these AirbyteMessages.

Additionally, all methods described in the protocol can emit AirbyteLogMessages and AirbyteTraceMessages (for more details see Logging). These messages allow an actor to emit logs and other informational metadata. All subsequent method signatures will assume that any number of messages of these types (wrapped in the AirbyteMessage) may be emitted.

Each method in the protocol has 3 parts:

  1. Input: these are the arguments passed to the method.
  2. Data Channel Egress (Output): all outputs from a method are via STDOUT. While some method signatures declare a single return value, in practice, any number of AirbyteLogMessages and AirbyteTraceMessages may be emitted. An actor is responsible for closing STDOUT to declare that it is done.
  3. Data Channel Ingress: after a method begins running, data can be passed to it via STDIN. For example, records are passed to a Destination on STDIN so that it can load them into a data warehouse.

Sources are a special case and do not have a Data Channel Ingress.

Additional Invariants

  • All arguments passed to an Actor and all messages emitted from an Actor are serialized JSON.
  • All messages emitted from Actors must be wrapped in an AirbyteMessage(ref) envelope.
  • Messages not wrapped in the AirbyteMessage must be dropped (e.g. not be passed from Source to Destination). However certain implementations of the Airbyte Protocol may choose to store and log unknown messages for debugging purposes.
  • Each message must be on its own line. Multiple messages cannot be sent on the same line. The JSON objects cannot be serialized across multiple lines.
  • STDERR should only be used for log messages (for errors). All other Data Channel Data moves on STDIN and STDOUT.

Common Interface

Spec

spec() -> ConnectorSpecification

The spec command allows an actor to broadcast information about itself and how it can be configured.

Input:

  1. none.

Output:

  1. spec - a ConnectorSpecification wrapped in an AirbyteMessage of type spec. See the Actor Specification for more details on the information in the spec.

Check

check(Config) -> AirbyteConnectionStatus

The check command validates that, given a configuration, that the Actor is able to connect and access all resources that it needs in order to operate. e.g. Given some Postgres credentials, it determines whether it can connect to the Postgres database. If it can, it will return a success response. If it fails (perhaps the password is incorrect), it will return a failed response and (when possible) a helpful error message. If an actor's check command succeeds, it is expected that all subsequent methods in the sync will also succeed.

Input:

  1. config - A configuration JSON object that has been validated using ConnectorSpecification#connectionSpecification (see ActorSpecification for information on connectionSpecification).

Output:

  1. connectionStatus - an AirbyteConnectionStatus Message wrapped in an AirbyteMessage of type connection_status.

Source

A Source is an application that extracts data from an underlying data store. A Source implements the following interface:

spec() -> ConnectorSpecification

check(Config) -> AirbyteConnectionStatus

discover(Config) -> AirbyteCatalog

read(Config, ConfiguredAirbyteCatalog, State) -> Stream<AirbyteRecordMessage | AirbyteStateMessage>

spec and check are the same as the commands described in the Common Commands section.

Discover

The discover method detects and describes the structure of the data in the data store and which Airbyte configurations can be applied to that data. For example, given a Postges source and valid Config, discover would return a list of available tables as streams.

Input:

  1. config - A configuration JSON object that has been validated using ConnectorSpecification#connectionSpecification (see ActorSpecification for information on connectionSpecification).

Output:

  1. catalog - an AirbyteCatalog wrapped in an AirbyteMessage of type catalog. See the Catalog Section for details.

Read

read extracts data from the underlying data store and emits it as AirbyteRecordMessages. It also emits AirbyteStateMessages to allow checkpointing replication.

Input:

  1. config - A configuration JSON object that has been validated using ConnectorSpecification#connectionSpecification (see ActorSpecification for information on connectionSpecification).
  2. configured catalog - A ConfiguredAirbyteCatalog is built on top of the catalog returned by discover. The ConfiguredAirbyteCatalog specifies HOW the data in the catalog should be replicated. The catalog is documented in the Catalog Section.
  3. state - An JSON object that represents a checkpoint in the replication. This object is only ever written or read by the source, so it is a JSON blob with whatever information is necessary to keep track of how much of the data source has already been read (learn more in the State & Checkpointing Section).

Output:

  1. message stream - An iterator of AirbyteRecordMessages and AirbyteStateMessages piped to the Data Channel Egress i.e: stdout.
    • A source outputs AirbyteStateMessages in order to allow checkpointing data replication. State is described in more detail below in the State & Checkpointing section.
    • Only AirbyteRecordMessages that contain streams that are in the catalog will be processed. Those that do not will be ignored. See Schema Mismatches for more details.
    • AirbyteRecordMessages from multiple streams can be multiplexed/mixed together, and do not need to be emitted serially as a group.

Destination

A destination receives data on the Data Channel Ingress and loads it into an underlying data store (e.g. data warehouse or database).

It implements the following interface.

spec() -> ConnectorSpecification

check(Config) -> AirbyteConnectionStatus

write(Config, AirbyteCatalog, Stream<AirbyteMessage>(stdin)) -> Stream<AirbyteStateMessage>

For the sake of brevity, we will not re-describe spec and check. They are exactly the same as those commands described for the Source.

Write

Input:

  1. config - A configuration JSON object that has been validated using the ConnectorSpecification.
  2. catalog - An AirbyteCatalog. This catalog should be a subset of the catalog returned by the discover command. Any AirbyteRecordMessagess that the destination receives that do not match the structure described in the catalog will fail.
  3. message stream - (this stream is consumed on stdin--it is not passed as an arg). It will receive a stream of JSON-serialized AirbyteMesssage.

Output:

  1. message stream - A stream of AirbyteStateMessages piped to stdout. The destination connector should only output state messages if they were previously received as input on stdin. Outputting a state message indicates that all records which came before it have been successfully written to the destination. Implementations of this spec will likely want to move messages filtering and validation upstream of the destination itself
  • The destination should read in the AirbyteMessages and write any that are of type AirbyteRecordMessage to the underlying data store.
  • The destination should ignore fields or streams that are out of sync with the catalog. The destination should always make its best effort to load what data is there that does match that catalog. e.g. if the User Stream has the fields first_name and last_name in the catalog, but the record has first_name and eye_color, the destination should persist first_name, even though last_name is missing. It should ignore eye_color as extraneous.

This concludes the overview of the Actor Interface. The remaining content will dive deeper into each concept covered so far.

Actor Specification

The specification allows the Actor to share information about itself.

The connectionSpecification is JSONSchema that describes what information needs to the actor for it operate. e.g. If using a Postgres Source, the ConnectorSpecification would specify that a hostname, port, and password are required in order for the connector to function. This JSONSchema can be used to validate that the provided inputs are valid. e.g. If port is one of the fields and the JsonSchema in the connectionSpecification specifies that this field should be a number, if a user inputs "airbyte", they will receive an error. For connection specification, Airbyte adheres to JsonSchema validation rules. The Airbyte implementation of the Protocol is able to render this JSONSchema to produce a form for users to fill in the information for an Actor.

The specification also contains information about what features the Actor supports.

  • protocol_version describes which version of the protocol the Connector supports. The default value is "0.2.0".
  • supported_destination_sync_modes - describes which sync modes a destination is able to support. See Sync Modes.

documentationUrl and changelogUrl are optional fields that link to additional information about the connector.

The following are fields that still exist in the specification but are slated to be removed as they leak choices about how Airbyte implements the protocol as opposed to being strictly necessary part of the protocol.

  • supportsIncremental is deprecated and can be ignored. It is vestigial from when full refresh / incremental was specified at the Actor level.
  • supportsNormalization determines whether the Destination supports Basic Normalization
  • supportsDBT - determines whether the Destination supports Basic Normalization
  • authSpecification and advanced_auth will be removed from the protocol and as such are not documented. Information on their use can be found here.
ConnectorSpecification:

 d