diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 5a2ce45c98c21..d6a603c07efa6 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -1,13 +1,11 @@ # Table of contents * [Overview](../README.md) -* [Getting Started](getting-started-tutorial.md) +* [Tutorials](tutorials/README.md) + * [Getting Started](tutorials/getting-started.md) + * [Config & Persistence](tutorials/airbyte-config-persistence.md) * [Changelog](changelog.md) * [Roadmap](roadmap.md) -* [Deploying Airbyte](deploying-airbyte/README.md) - * [On Your Workstation](deploying-airbyte/on-your-workstation.md) - * [On AWS \(EC2\)](deploying-airbyte/on-aws-ec2.md) - * [On GCP \(Compute Engine\)](deploying-airbyte/on-gcp-compute-engine.md) * [Connectors](integrations/README.md) * [Sources](integrations/sources/README.md) * [Braintree](integrations/sources/braintree.md) @@ -48,6 +46,10 @@ * [Snowflake](integrations/destinations/snowflake.md) * [Custom or New Connector](integrations/custom-connectors.md) * [Connector Changelog](integrations/integrations-changelog.md) +* [Deploying Airbyte](deploying-airbyte/README.md) + * [On Your Workstation](deploying-airbyte/on-your-workstation.md) + * [On AWS \(EC2\)](deploying-airbyte/on-aws-ec2.md) + * [On GCP \(Compute Engine\)](deploying-airbyte/on-gcp-compute-engine.md) * [Architecture](architecture/README.md) * [High-level View](architecture/high-level-view.md) * [Airbyte Specification](architecture/airbyte-specification.md) diff --git a/docs/tutorial/airbyte_config_persistence.ipynb b/docs/tutorial/airbyte_config_persistence.ipynb deleted file mode 100644 index 9ac1d4040f865..0000000000000 --- a/docs/tutorial/airbyte_config_persistence.ipynb +++ /dev/null @@ -1,911 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "cd ~/Workspace/airbyte" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "# Airbyte Configuration Persistence Tutorial\n", - "\n", - "Once you manage to spin up a local instance of Airbyte, following steps in the [Getting started Tutorial](../getting-started-tutorial.md), you may want to gain a better understanding of what configuration files are available in Airbyte and how to work with it.\n", - "\n", - "As we continue to improve the User experience around those aspects to make it simpler in the UI, this tutorial will go over how to manually import and export Airbyte configurations of connectors.\n", - "\n", - "This may be useful if you need for example to version control, make a backup, share with your team or if you just want to debug and learn more about Airbyte internals.\n", - "\n", - "Here are the goals for this tutorial:\n", - "1. Access to replication logs files \n", - "2. Export & Import Airbyte Configuration data files\n", - "3. 
Export normalization models to use in your own DBT project\n", - "\n", - "## Setting up a local Postgres Destination\n", - "\n", - "For this tutorial, we are going to use 2 types of destinations to run our demo where data will be written:\n", - "- Local File Destination\n", - "- Local Postgres Database\n", - "\n", - "The local files will be written by default to the directory `/tmp/airbyte_local`.\n", - "\n", - "The postgres database that we are going to spin up below will be running locally with the following configuration where data will be written:\n", - "\n", - " - Host: localhost\n", - " - Port: 3000\n", - " - User: postgres\n", - " - Password: password\n", - " - DB Name: postgres\n", - "\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "File Content in the local destination (may not exist yet):\n", - "find: /tmp/airbyte_local: No such file or directory\n", - "\n", - "Start a Postgres container named local-airbyte-postgres-destination\n", - "846846105d7566fce316a5516a51bda6f4b5b0c8da29adb25a043175c1c0f27b\n", - "\n", - "Docker Containers currently running:\n", - "CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n", - "846846105d75 postgres \"docker-entrypoint.s…\" 1 second ago Up Less than a second 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination\n" - ] - } - ], - "source": [ - "#!/usr/bin/env bash\n", - "\n", - "echo \"File Content in the local destination (may not exist yet):\"\n", - "find /tmp/airbyte_local\n", - "\n", - "echo \"\"\n", - "\n", - "docker ps | grep -q local-airbyte-postgres-destination\n", - "if [ $? -eq 0 ]; then\n", - " echo \"Postgres Database local-airbyte-postgres-destination is already up\"\n", - "else \n", - " echo \"Start a Postgres container named local-airbyte-postgres-destination\"\n", - " docker run --rm --name local-airbyte-postgres-destination -e POSTGRES_PASSWORD=password -p 3000:5432 -d postgres\n", - "fi\n", - "\n", - "echo \"\"\n", - "\n", - "echo \"Docker Containers currently running:\"\n", - "\n", - "docker ps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " File Content in the local destination (may not exist yet):\n", - " find: /tmp/airbyte_local: No such file or directory\n", - " \n", - " Start a Postgres container named local-airbyte-postgres-destination\n", - " 8e24a9682a1ec2e7539c7ada5d993120d3337cff07a54603fcdb8d44f4013aab\n", - " \n", - " Docker Containers currently running:\n", - " CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n", - " 8e24a9682a1e postgres \"docker-entrypoint.s…\" 1 second ago Up Less than a second 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination\n", - "\n", - "\n", - "## Starting Airbyte Server\n", - "\n", - "As we've seen in the previous tutorial, we can spin up Airbyte instance after installing it:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mWARNING\u001b[0m: The API_URL variable is not set. Defaulting to a blank string.\n", - "Creating network \"airbyte_default\" with the default driver\n", - "Creating volume \"airbyte_workspace\" with default driver\n", - "Creating volume \"airbyte_data\" with default driver\n", - "Creating volume \"airbyte_db\" with default driver\n", - "Creating init ... \n", - "\u001b[1BCreating airbyte-data-seed ... \n", - "Creating airbyte-db ... 
\n", - "\u001b[1BCreating airbyte-server ... mdone\u001b[0m\n", - "\u001b[3BCreating airbyte-scheduler ... mdone\u001b[0m\n", - "\u001b[1BCreating airbyte-webapp ... mdone\u001b[0m\n", - "\u001b[1Bting airbyte-webapp ... \u001b[32mdone\u001b[0m\n", - "\n", - "Docker Containers currently running:\n", - "CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n", - "ee669992c01a airbyte/webapp:0.7.1-alpha \"/docker-entrypoint.…\" 1 second ago Up Less than a second 0.0.0.0:8000->80/tcp airbyte-webapp\n", - "e04136aeb41b airbyte/scheduler:0.7.1-alpha \"/bin/bash -c './wai…\" 1 second ago Up Less than a second airbyte-scheduler\n", - "6cbe878300ff airbyte/server:0.7.1-alpha \"/bin/bash -c './wai…\" 1 second ago Up Less than a second 8000/tcp, 0.0.0.0:8001->8001/tcp airbyte-server\n", - "1814d4f26db8 airbyte/db:0.7.1-alpha \"docker-entrypoint.s…\" 2 seconds ago Up 1 second 5432/tcp airbyte-db\n", - "846846105d75 postgres \"docker-entrypoint.s…\" 10 seconds ago Up 9 seconds 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination\n" - ] - } - ], - "source": [ - "#!/usr/bin/env bash\n", - "\n", - "docker-compose up -d\n", - "\n", - "echo -e \"\\n\"\n", - "\n", - "echo \"Docker Containers currently running:\"\n", - "docker ps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " WARNING: The API_URL variable is not set. Defaulting to a blank string.\n", - " Creating network \"airbyte_default\" with the default driver\n", - " Creating volume \"airbyte_workspace\" with default driver\n", - " Creating volume \"airbyte_data\" with default driver\n", - " Creating volume \"airbyte_db\" with default driver\n", - " Creating init ... \n", - " Creating airbyte-data-seed ... \n", - " Creating airbyte-db ... \n", - " Creating airbyte-server ... mdone\n", - " Creating airbyte-scheduler ... \n", - " Creating airbyte-webapp ... mdone\n", - " ting airbyte-webapp ... done\n", - " \n", - " Docker Containers currently running:\n", - " CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n", - " 39cec3eb5953 airbyte/webapp:0.7.1-alpha \"/docker-entrypoint.…\" 1 second ago Up Less than a second 0.0.0.0:8000->80/tcp airbyte-webapp\n", - " f0ff3f8f2b2b airbyte/scheduler:0.7.1-alpha \"/bin/bash -c './wai…\" 1 second ago Up 1 second airbyte-scheduler\n", - " 50448db21673 airbyte/server:0.7.1-alpha \"/bin/bash -c './wai…\" 1 second ago Up Less than a second 8000/tcp, 0.0.0.0:8001->8001/tcp airbyte-server\n", - " 2aa496838b99 airbyte/db:0.7.1-alpha \"docker-entrypoint.s…\" 2 seconds ago Up 1 second 5432/tcp airbyte-db\n", - " 8e24a9682a1e postgres \"docker-entrypoint.s…\" 8 seconds ago Up 7 seconds 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination\n", - "\n", - "Note that if you already went through the previous tutorial or already used Airbyte in the past, you may not need to complete the Onboarding process this time.\n", - "\n", - "Otherwise, please complete the different steps until you reach the Airbyte Dashboard page.\n", - "\n", - "After a few seconds, the UI should be ready to go at http://localhost:8000/\n", - "\n", - "## Notes about running this tutorial on Mac OS vs Linux\n", - "\n", - "Note that Docker for Mac is not a real Docker host, now it actually runs a virtual machine behind the scenes and hides it from you to make things simpler. \n", - "Here's the simpler version, unless you want to dig deeper... 
just like our current use case where we want to inspect the content of internal Docker volumes...\n", - "\n", - "Here are some related links as references on accessing Docker Volumes:\n", - "- on Mac OS [Using Docker containers in 2019](https://stackoverflow.com/a/55648186)\n", - "- official doc [Use Volume](https://docs.docker.com/storage/volumes/#backup-restore-or-migrate-data-volumes)\n", - "\n", - "From these discussions, we will be using on Mac OS either:\n", - "1. any docker container/image to browse the virtual filesystem by mounting the volume in order to access them, for example with [busybox](https://hub.docker.com/_/busybox)\n", - "2. or extract files from the volume by copying them onto the host with [Docker cp](https://docs.docker.com/engine/reference/commandline/cp/)\n", - "\n", - "However as a side remark on Linux, accessing to named Docker Volume can be easier since you simply need to:\n", - " \n", - " docker volume inspect \n", - " \n", - "Then look at the \"Mountpoint\" value, this is where the volume is actually stored in the host filesystem and you can directly retrieve files directly from that folder.\n", - "\n", - "Back to this tutorial, commands shown below should work for both Mac OS and Linux !\n", - "\n", - "## Export Initial Setup\n", - "\n", - "Now let's first make a backup of the configuration state of your Airbyte instance by running the following commands.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "TUTORIAL_DIR=$(pwd)/build/persitence-tutorial\n", - "mkdir -p $TUTORIAL_DIR/my-setup\n", - "\n", - "docker cp airbyte-server:/data $TUTORIAL_DIR/my-setup" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Configure Exchange Rate source and File destination\n", - "\n", - "Head back to http://localhost:8000/ and add more connectors. 
\n", - "Here is an example of configuration from an API source:\n", - "\n", - "![airbyte_config_persistence_api_source](./airbyte_config_persistence_1.png)\n", - "\n", - "and a local file destination:\n", - "\n", - "![airbyte_config_persistence_local_file](./airbyte_config_persistence_2.png)\n", - "\n", - "## Run a Sync job\n", - "\n", - "- once the source and destination are created\n", - "- the catalog and frequency can be configured\n", - "- then run the \"Sync Now\" button\n", - "- finally inspect logs in the UI\n", - "\n", - "![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_3.png)\n", - "\n", - "## Exploring Logs folders\n", - "\n", - "We can read from the lines reported in the logs the working directory that is being used to run the synchronization process from.\n", - "\n", - "As an example in the previous run, it is being ran in `/tmp/workspace/5/0` and we notice the different docker commands being used internally are starting with:\n", - "\n", - " docker run --rm -i -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -w /data/5/0 --network host ...\n", - "\n", - "From there, we can observe that Airbyte is using a docker named volume called `airbyte_workspace` that is mounted in the container at the location `/data`.\n", - "\n", - "Following [Docker Volume documentation](https://docs.docker.com/storage/volumes/), we can inspect and manipulate persisted configuration data in these volumes.\n", - "For example, we can run any docker container/image to browse the content of this named volume by mounting it in a similar way, let's use the [busybox](https://hub.docker.com/_/busybox) image.\n", - "\n", - " docker run -it --rm --volume airbyte_workspace:/data busybox\n", - "\n", - "This will drop you into an `sh` shell to allow you to do what you want inside a BusyBox system from which we can browse the filesystem and accessing to logs files:\n", - "\n", - " ls /data/5/0/\n", - "\n", - "Example Output:\n", - "\n", - " catalog.json normalize tap_config.json\n", - " logs.log singer_rendered_catalog.json target_config.json\n", - "\n", - "Or you can simply run:\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[0;0mcatalog.json\u001b[m \u001b[0;0msinger_rendered_catalog.json\u001b[m\n", - "\u001b[0;0mlogs.log\u001b[m \u001b[0;0mtap_config.json\u001b[m\n", - "\u001b[1;34mnormalize\u001b[m \u001b[0;0mtarget_config.json\u001b[m\n" - ] - } - ], - "source": [ - "docker run -it --rm --volume airbyte_workspace:/data busybox ls /data/5/0" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "lines_to_next_cell": 0 - }, - "source": [ - "Example Output:\n", - "\n", - " catalog.json singer_rendered_catalog.json\n", - " logs.log tap_config.json\n", - " normalize target_config.json\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"{\"streams\":[{\"stream\":{\"name\":\"exchange_rate\",\"json_schema\":{\"type\":\"object\",\"properties\":{\"CHF\":{\"type\":\"number\"},\"HRK\":{\"type\":\"number\"},\"date\":{\"type\":\"string\"},\"MXN\":{\"type\":\"number\"},\"ZAR\":{\"type\":\"number\"},\"INR\":{\"type\":\"number\"},\"CNY\":{\"type\":\"number\"},\"THB\":{\"type\":\"number\"},\"AUD\":{\"type\":\"number\"},\"ILS\":{\"type\":\"number\"},\"KRW\":{\"type\":\"number\"},\"JPY\":{\"type\":\"number\"},\"PLN\":{\"type\":\"number\"},\"GBP\":{\"type\":\"number\"},\"IDR\":{\"type\":\"number\"},\"HUF\":{\"type\":\"number\"},\"PHP\":{\"type\":\"number\"},\"TRY\":{\"type\":\"number\"},\"RUB\":{\"type\":\"number\"},\"HKD\":{\"type\":\"number\"},\"ISK\":{\"type\":\"number\"},\"EUR\":{\"type\":\"number\"},\"DKK\":{\"type\":\"number\"},\"CAD\":{\"type\":\"number\"},\"MYR\":{\"type\":\"number\"},\"USD\":{\"type\":\"number\"},\"BGN\":{\"type\":\"number\"},\"NOK\":{\"type\":\"number\"},\"RON\":{\"type\":\"number\"},\"SGD\":{\"type\":\"number\"},\"CZK\":{\"type\":\"number\"},\"SEK\":{\"type\":\"number\"},\"NZD\":{\"type\":\"number\"},\"BRL\":{\"type\":\"number\"}}},\"supported_sync_modes\":[\"full_refresh\"],\"default_cursor_field\":[]},\"sync_mode\":\"full_refresh\",\"cursor_field\":[]}]}" - ] - } - ], - "source": [ - "docker run -it --rm --volume airbyte_workspace:/data busybox cat /data/5/0/catalog.json " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " {\"streams\":[{\"stream\":{\"name\":\"exchange_rate\",\"json_schema\":{\"type\":\"object\",\"properties\":{\"CHF\":{\"type\":\"number\"},\"HRK\":{\"type\":\"number\"},\"date\":{\"type\":\"string\"},\"MXN\":{\"type\":\"number\"},\"ZAR\":{\"type\":\"number\"},\"INR\":{\"type\":\"number\"},\"CNY\":{\"type\":\"number\"},\"THB\":{\"type\":\"number\"},\"AUD\":{\"type\":\"number\"},\"ILS\":{\"type\":\"number\"},\"KRW\":{\"type\":\"number\"},\"JPY\":{\"type\":\"number\"},\"PLN\":{\"type\":\"number\"},\"GBP\":{\"type\":\"number\"},\"IDR\":{\"type\":\"number\"},\"HUF\":{\"type\":\"number\"},\"PHP\":{\"type\":\"number\"},\"TRY\":{\"type\":\"number\"},\"RUB\":{\"type\":\"number\"},\"HKD\":{\"type\":\"number\"},\"ISK\":{\"type\":\"number\"},\"EUR\":{\"type\":\"number\"},\"DKK\":{\"type\":\"number\"},\"CAD\":{\"type\":\"number\"},\"MYR\":{\"type\":\"number\"},\"USD\":{\"type\":\"number\"},\"BGN\":{\"type\":\"number\"},\"NOK\":{\"type\":\"number\"},\"RON\":{\"type\":\"number\"},\"SGD\":{\"type\":\"number\"},\"CZK\":{\"type\":\"number\"},\"SEK\":{\"type\":\"number\"},\"NZD\":{\"type\":\"number\"},\"BRL\":{\"type\":\"number\"}}},\"supported_sync_modes\":[\"full_refresh\"],\"default_cursor_field\":[]},\"sync_mode\":\"full_refresh\",\"cursor_field\":[]}]}\n", - "\n", - "## Check local data folder\n", - "\n", - "Since the job completed successfully, a new file should be available in the special `/local/` directory in the container which is mounted from `/tmp/airbyte_local` on the host machine.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "In the container:\n", - "/local\n", - "/local/data\n", - "/local/data/exchange_rate_raw.csv\n", - "\n", - "On the host:\n", - "/tmp/airbyte_local\n", - "/tmp/airbyte_local/data\n", - "/tmp/airbyte_local/data/exchange_rate_raw.csv\n" - ] - } - ], - "source": [ - "#!/usr/bin/env bash\n", - "\n", - "echo \"In the container:\"\n", - "\n", - "docker run -it --rm -v /tmp/airbyte_local:/local busybox find 
/local\n", - "\n", - "echo \"\"\n", - "echo \"On the host:\"\n", - "\n", - "find /tmp/airbyte_local" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " In the container:\n", - " /local\n", - " /local/data\n", - " /local/data/exchange_rate_raw.csv\n", - " \n", - " On the host:\n", - " /tmp/airbyte_local\n", - " /tmp/airbyte_local/data\n", - " /tmp/airbyte_local/data/exchange_rate_raw.csv\n", - "\n", - "\n", - "## Backup Exchange Rate Source and Destination configurations\n", - "\n", - "In the following steps, we will play with persistence of configurations so let's make a backup of our newly added connectors for now:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "mkdir -p $TUTORIAL_DIR/exchange-rate-setup\n", - "\n", - "docker cp airbyte-server:data $TUTORIAL_DIR/exchange-rate-setup" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Shutting down Airbyte server and clear previous configurations\n", - "\n", - "Whenever you want to stop the Airbyte server, you can run: `docker-compose down`\n", - "\n", - "From [docker documentation](https://docs.docker.com/compose/reference/down/)\n", - "```\n", - "This command stops containers and removes containers, networks, volumes, and images created by up.\n", - "\n", - "By default, the only things removed are:\n", - "\n", - "- Containers for services defined in the Compose file\n", - "- Networks defined in the networks section of the Compose file\n", - "- The default network, if one is used\n", - "\n", - "Networks and volumes defined as external are never removed.\n", - "\n", - "Anonymous volumes are not removed by default. However, as they don’t have a stable name, they will not be automatically mounted by a subsequent up. For data that needs to persist between updates, use host or named volumes.\n", - "```\n", - "\n", - "So since Airbyte is using named volumes to store the configurations, if you run \n", - "`docker-compose up` again, your connectors configurations from earlier steps will still be available.\n", - "\n", - "Let's wipe our configurations on purpose and use the following option:\n", - "\n", - "```\n", - "-v, --volumes Remove named volumes declared in the `volumes`\n", - " section of the Compose file and anonymous volumes\n", - " attached to containers.\n", - "```\n", - "\n", - "Note that the `/tmp/airbyte_local:/local` that we saw earlier is a [bind mount](https://docs.docker.com/storage/bind-mounts/) so data that was replicated locally won't be affected by the next command.\n", - "\n", - "However it will get rid of the named volume workspace so all logs and generated files by Airbyte will be lost.\n", - "\n", - "We can then run:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mWARNING\u001b[0m: The API_URL variable is not set. Defaulting to a blank string.\n", - "Stopping airbyte-webapp ... \n", - "Stopping airbyte-scheduler ... \n", - "Stopping airbyte-server ... \n", - "Stopping airbyte-db ... \n", - "\u001b[1BRemoving airbyte-webapp ... mdone\u001b[0m\n", - "Removing airbyte-scheduler ... \n", - "Removing airbyte-server ... \n", - "Removing airbyte-db ... \n", - "Removing airbyte-data-seed ... \n", - "Removing init ... 
\n", - "\u001b[2BRemoving network airbyte_defaultdone\u001b[0m\n", - "Removing volume airbyte_workspace\n", - "Removing volume airbyte_data\n", - "Removing volume airbyte_db\n", - "\u001b[33mWARNING\u001b[0m: The API_URL variable is not set. Defaulting to a blank string.\n", - "Creating network \"airbyte_default\" with the default driver\n", - "Creating volume \"airbyte_workspace\" with default driver\n", - "Creating volume \"airbyte_data\" with default driver\n", - "Creating volume \"airbyte_db\" with default driver\n", - "Creating init ... \n", - "\u001b[1BCreating airbyte-data-seed ... \n", - "Creating airbyte-db ... \n", - "\u001b[1BCreating airbyte-scheduler ... mdone\u001b[0m\n", - "Creating airbyte-server ... \n", - "\u001b[1BCreating airbyte-webapp ... mdone\u001b[0m\n", - "\u001b[1Bting airbyte-webapp ... \u001b[32mdone\u001b[0m" - ] - } - ], - "source": [ - "docker-compose down -v\n", - "docker-compose up -d" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " WARNING: The API_URL variable is not set. Defaulting to a blank string.\n", - " Stopping airbyte-webapp ... \n", - " Stopping airbyte-scheduler ... \n", - " Stopping airbyte-server ... \n", - " Stopping airbyte-db ... \n", - " Removing airbyte-webapp ... mdone\n", - " Removing airbyte-scheduler ... \n", - " Removing airbyte-server ... \n", - " Removing airbyte-data-seed ... \n", - " Removing airbyte-db ... \n", - " Removing init ... \n", - " Removing network airbyte_defaultdone\n", - " Removing volume airbyte_workspace\n", - " Removing volume airbyte_data\n", - " Removing volume airbyte_db\n", - " WARNING: The API_URL variable is not set. Defaulting to a blank string.\n", - " Creating network \"airbyte_default\" with the default driver\n", - " Creating volume \"airbyte_workspace\" with default driver\n", - " Creating volume \"airbyte_data\" with default driver\n", - " Creating volume \"airbyte_db\" with default driver\n", - " Creating init ... \n", - " Creating airbyte-data-seed ... \n", - " Creating airbyte-db ... \n", - " Creating airbyte-server ... mdone\n", - " Creating airbyte-scheduler ... mdone\n", - " Creating airbyte-webapp ... mdone\n", - " ting airbyte-webapp ... done\n", - "\n", - "Wait a moment for the webserver to start and go refresh the page http://localhost:8000/.\n", - "\n", - "We are prompted with the onboarding process again...\n", - "\n", - "Let's ignore that step, close the page and go back to the notebook to import configurations from our initial setup instead.\n", - "\n", - "## Restore our initial setup\n", - "\n", - "We can play and restore files in the named docker volume `data` and thus retrieve files that were created from earlier:\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "docker cp $TUTORIAL_DIR/my-setup/data/config airbyte-server:data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now refresh back the page http://localhost:8000/ again, wait a little bit for the server to pick up the freshly imported configurations...\n", - "Tada! We don't need to complete the onboarding process anymore! \n", - "and we have the list of connectors that were created previously available again. 
Thus you can use this ability of export/import files from named volumes to share with others the configuration of your connectors.\n", - "\n", - "Warning: and it will include credentials, so be careful too!\n", - "\n", - "## Configure some Covid (data) source and Postgres destinations\n", - "\n", - "Let's re-iterate the source and destination creation, this time, with a file accessible from a public API:\n", - "\n", - " Here are some example of public API CSV:\n", - " https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv\n", - "\n", - "![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_4.png)\n", - "\n", - "And a local Postgres Database:\n", - "\n", - "![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_5.png)\n", - "\n", - "\n", - "After setting up the connectors, we can trigger the sync and study the logs:\n", - "\n", - "![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_6.png)\n", - "\n", - "Since we wiped the workspace volume and restarted the Airbyte Server, notice that thi process ran in the `/tmp/workspace/5/0` as well.\n", - "\n", - "## Export and customize Normalization step with DBT\n", - "\n", - "In the previous connector configuration, selected a Postgres Database destination and chose to enable the \"Basic Normalization\" option.\n", - "\n", - "In Airbyte, data is written in destination in a JSON blob format in tables with suffix \"_raw\" as it is taking care of the `E` and `L` in `ELT`. \n", - "\n", - "The normalization option adds a last `T` transformation step that takes care of converting such JSON tables into flat tables. \n", - "To do so, Airbyte is currently using [DBT](https://docs.getdbt.com/) to handle such tasks which can be manually triggered in the normalization container like this:\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Running with dbt=0.18.1\n", - "dbt version: 0.18.1\n", - "python version: 3.7.9\n", - "python path: /usr/local/bin/python\n", - "os info: Linux-4.19.121-linuxkit-x86_64-with-debian-10.6\n", - "Using profiles.yml file at ./profiles.yml\n", - "Using dbt_project.yml file at /data/5/0/normalize/dbt_project.yml\n", - "\n", - "Configuration:\n", - " profiles.yml file [OK found and valid]\n", - " dbt_project.yml file [OK found and valid]\n", - "\n", - "Required dependencies:\n", - " - git [OK found]\n", - "\n", - "Connection:\n", - " host: localhost\n", - " port: 3000\n", - " user: postgres\n", - " database: postgres\n", - " schema: quarantine\n", - " search_path: None\n", - " keepalives_idle: 0\n", - " sslmode: None\n", - " Connection test: OK connection ok\n", - "\n", - "Running with dbt=0.18.1\n", - "Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source\n", - "\n", - "15:00:57 | Concurrency: 32 threads (target='prod')\n", - "15:00:57 | \n", - "15:00:57 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]\n", - "15:00:58 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [\u001b[32mSELECT 17911\u001b[0m in 0.34s]\n", - "15:00:58 | \n", - "15:00:58 | Finished running 1 table model in 0.52s.\n", - "\n", - "\u001b[32mCompleted successfully\u001b[0m\n", - "\n", - "Done. 
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1\n" - ] - } - ], - "source": [ - "#!/usr/bin/env bash\n", - "\n", - "# find latest workspace where normalization was run\n", - "NORMALIZE_WORKSPACE=`docker run --rm -i -v airbyte_workspace:/data busybox find /data -path \"*normalize/models*\" | sed -E \"s;/data/([0-9]+/[0-9]+/)normalize/.*;\\1;g\" | sort | \n", - "uniq | tail -n 1`\n", - "\n", - "docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization debug --profiles-dir=. --project-dir=.\n", - "docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization run --profiles-dir=. --project-dir=." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " Running with dbt=0.18.1\n", - " dbt version: 0.18.1\n", - " python version: 3.7.9\n", - " python path: /usr/local/bin/python\n", - " os info: Linux-4.19.121-linuxkit-x86_64-with-debian-10.6\n", - " Using profiles.yml file at ./profiles.yml\n", - " Using dbt_project.yml file at /data/5/0/normalize/dbt_project.yml\n", - " \n", - " Configuration:\n", - " profiles.yml file [OK found and valid]\n", - " dbt_project.yml file [OK found and valid]\n", - " \n", - " Required dependencies:\n", - " - git [OK found]\n", - " \n", - " Connection:\n", - " host: localhost\n", - " port: 3000\n", - " user: postgres\n", - " database: postgres\n", - " schema: quarantine\n", - " search_path: None\n", - " keepalives_idle: 0\n", - " sslmode: None\n", - " Connection test: OK connection ok\n", - " \n", - " Running with dbt=0.18.1\n", - " Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source\n", - " \n", - " 14:37:10 | Concurrency: 32 threads (target='prod')\n", - " 14:37:10 | \n", - " 14:37:10 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]\n", - " 14:37:11 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 17911 in 0.33s]\n", - " 14:37:11 | \n", - " 14:37:11 | Finished running 1 table model in 0.50s.\n", - " \n", - " Completed successfully\n", - " \n", - " Done. 
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1\n", - "\n", - "As seen earlier, it is possible to browse the workspace folders and examine further logs if an error occurs.\n", - "\n", - "In particular, we can also take a look at the DBT models generated by Airbyte and export them to the local host filesystem:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "with \n", - "covid_epidemiology_node as (\n", - " select \n", - " emitted_at,\n", - " {{ dbt_utils.current_timestamp_in_utc() }} as normalized_at,\n", - " cast({{ json_extract_scalar('data', ['date']) }} as {{ dbt_utils.type_string() }}) as date,\n", - " cast({{ json_extract_scalar('data', ['new_recovered']) }} as {{ dbt_utils.type_float() }}) as new_recovered,\n", - " cast({{ json_extract_scalar('data', ['new_tested']) }} as {{ dbt_utils.type_float() }}) as new_tested,\n", - " cast({{ json_extract_scalar('data', ['total_deceased']) }} as {{ dbt_utils.type_float() }}) as total_deceased,\n", - " cast({{ json_extract_scalar('data', ['new_deceased']) }} as {{ dbt_utils.type_float() }}) as new_deceased,\n", - " cast({{ json_extract_scalar('data', ['new_confirmed']) }} as {{ dbt_utils.type_float() }}) as new_confirmed,\n", - " cast({{ json_extract_scalar('data', ['total_confirmed']) }} as {{ dbt_utils.type_float() }}) as total_confirmed,\n", - " cast({{ json_extract_scalar('data', ['total_tested']) }} as {{ dbt_utils.type_float() }}) as total_tested,\n", - " cast({{ json_extract_scalar('data', ['total_recovered']) }} as {{ dbt_utils.type_float() }}) as total_recovered,\n", - " cast({{ json_extract_scalar('data', ['key']) }} as {{ dbt_utils.type_string() }}) as key\n", - " from {{ source('quarantine', 'covid_epidemiology_raw') }}\n", - "),\n", - "covid_epidemiology_with_id as (\n", - " select\n", - " *,\n", - " {{ dbt_utils.surrogate_key([\n", - " 'date',\n", - " 'new_recovered',\n", - " 'new_tested',\n", - " 'total_deceased',\n", - " 'new_deceased',\n", - " 'new_confirmed',\n", - " 'total_confirmed',\n", - " 'total_tested',\n", - " 'total_recovered',\n", - " 'key'\n", - " ]) }} as _covid_epidemiology_hashid\n", - " from covid_epidemiology_node\n", - ")\n", - "select * from covid_epidemiology_with_id" - ] - } - ], - "source": [ - "#!/usr/bin/env bash\n", - "\n", - "rm -rf $TUTORIAL_DIR/normalization-files\n", - "mkdir -p $TUTORIAL_DIR/normalization-files\n", - "\n", - "docker cp airbyte-server:/tmp/workspace/$NORMALIZE_WORKSPACE/normalize/ $TUTORIAL_DIR/normalization-files\n", - "\n", - "NORMALIZE_DIR=$TUTORIAL_DIR/normalization-files/normalize\n", - "cd $NORMALIZE_DIR\n", - "cat $NORMALIZE_DIR/models/generated/*.sql" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " with \n", - " covid_epidemiology_node as (\n", - " select \n", - " emitted_at,\n", - " {{ dbt_utils.current_timestamp_in_utc() }} as normalized_at,\n", - " cast({{ json_extract_scalar('data', ['date']) }} as {{ dbt_utils.type_string() }}) as date,\n", - " cast({{ json_extract_scalar('data', ['new_recovered']) }} as {{ dbt_utils.type_float() }}) as new_recovered,\n", - " cast({{ json_extract_scalar('data', ['new_tested']) }} as {{ dbt_utils.type_float() }}) as new_tested,\n", - " cast({{ json_extract_scalar('data', ['total_deceased']) }} as {{ dbt_utils.type_float() }}) as total_deceased,\n", - " cast({{ json_extract_scalar('data', ['new_deceased']) }} as {{ dbt_utils.type_float() }}) as 
new_deceased,\n", - " cast({{ json_extract_scalar('data', ['new_confirmed']) }} as {{ dbt_utils.type_float() }}) as new_confirmed,\n", - " cast({{ json_extract_scalar('data', ['total_confirmed']) }} as {{ dbt_utils.type_float() }}) as total_confirmed,\n", - " cast({{ json_extract_scalar('data', ['total_tested']) }} as {{ dbt_utils.type_float() }}) as total_tested,\n", - " cast({{ json_extract_scalar('data', ['total_recovered']) }} as {{ dbt_utils.type_float() }}) as total_recovered,\n", - " cast({{ json_extract_scalar('data', ['key']) }} as {{ dbt_utils.type_string() }}) as key\n", - " from {{ source('quarantine', 'covid_epidemiology_raw') }}\n", - " ),\n", - " covid_epidemiology_with_id as (\n", - " select\n", - " *,\n", - " {{ dbt_utils.surrogate_key([\n", - " 'date',\n", - " 'new_recovered',\n", - " 'new_tested',\n", - " 'total_deceased',\n", - " 'new_deceased',\n", - " 'new_confirmed',\n", - " 'total_confirmed',\n", - " 'total_tested',\n", - " 'total_recovered',\n", - " 'key'\n", - " ]) }} as _covid_epidemiology_hashid\n", - " from covid_epidemiology_node\n", - " )\n", - " select * from covid_epidemiology_with_id\n", - "\n", - "If you have [dbt cli](https://docs.getdbt.com/dbt-cli/cli-overview/) installed on your machine, you can then view, edit, customize and run the dbt models in your project if you want to bypass the normalization steps generated by Airbyte!\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "lines_to_next_cell": 0 - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Running with dbt=0.18.1\n", - "Installing https://github.com/fishtown-analytics/dbt-utils.git@0.6.2\n", - " Installed from revision 0.6.2\n", - "\n", - "\u001b[0mRunning with dbt=0.18.1\n", - "Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source\n", - "\n", - "16:01:26 | Concurrency: 32 threads (target='prod')\n", - "16:01:26 | \n", - "16:01:26 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]\n", - "16:01:26 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [\u001b[32mSELECT 17911\u001b[0m in 0.26s]\n", - "16:01:26 | \n", - "16:01:26 | Finished running 1 table model in 0.41s.\n", - "\n", - "\u001b[32mCompleted successfully\u001b[0m\n", - "\n", - "Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1\n", - "\u001b[0m" - ] - } - ], - "source": [ - "#!/usr/bin/env bash \n", - "\n", - "dbt deps --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR\n", - "dbt run --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR --full-refresh" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Example Output:\n", - "\n", - " Running with dbt=0.18.1\n", - " Installing https://github.com/fishtown-analytics/dbt-utils.git@0.6.2\n", - " Installed from revision 0.6.2\n", - " \n", - " Running with dbt=0.18.1\n", - " Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source\n", - " \n", - " 15:37:54 | Concurrency: 32 threads (target='prod')\n", - " 15:37:54 | \n", - " 15:37:55 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]\n", - " 15:37:55 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. 
[SELECT 17911 in 0.30s]\n", - " 15:37:55 | \n", - " 15:37:55 | Finished running 1 table model in 0.51s.\n", - " \n", - " Completed successfully\n", - " \n", - " Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1\n", - " " - ] - } - ], - "metadata": { - "jupytext": { - "cell_metadata_filter": "-all", - "notebook_metadata_filter": "-all", - "text_representation": { - "extension": ".md", - "format_name": "markdown" - } - }, - "kernelspec": { - "display_name": "Bash", - "language": "bash", - "name": "bash" - }, - "language_info": { - "codemirror_mode": "shell", - "file_extension": ".sh", - "mimetype": "text/x-sh", - "name": "bash" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/tutorial/airbyte_config_persistence.md b/docs/tutorial/airbyte_config_persistence.md deleted file mode 100644 index da24f82c19680..0000000000000 --- a/docs/tutorial/airbyte_config_persistence.md +++ /dev/null @@ -1,517 +0,0 @@ -# Airbyte Configuration Persistence Tutorial - -Once you manage to spin up a local instance of Airbyte, following steps in the [Getting started Tutorial](../getting-started-tutorial.md), you may want to gain a better understanding of what configuration files are available in Airbyte and how to work with it. - -As we continue to improve the User experience around those aspects to make it simpler in the UI, this tutorial will go over how to manually import and export Airbyte configurations of connectors. - -This may be useful if you need for example to version control, make a backup, share with your team or if you just want to debug and learn more about Airbyte internals. - -Here are the goals for this tutorial: -1. Access to replication logs files -2. Export & Import Airbyte Configuration data files -3. Export normalization models to use in your own DBT project - -## Setting up a local Postgres Destination - -For this tutorial, we are going to use 2 types of destinations to run our demo where data will be written: -- Local File Destination -- Local Postgres Database - -The local files will be written by default to the directory `/tmp/airbyte_local`. - -The postgres database that we are going to spin up below will be running locally with the following configuration where data will be written: - - - Host: localhost - - Port: 3000 - - User: postgres - - Password: password - - DB Name: postgres - - - - - -```bash -#!/usr/bin/env bash - -echo "File Content in the local destination (may not exist yet):" -find /tmp/airbyte_local - -echo "" - -docker ps | grep -q local-airbyte-postgres-destination -if [ $? 
-eq 0 ]; then - echo "Postgres Database local-airbyte-postgres-destination is already up" -else - echo "Start a Postgres container named local-airbyte-postgres-destination" - docker run --rm --name local-airbyte-postgres-destination -e POSTGRES_PASSWORD=password -p 3000:5432 -d postgres -fi - -echo "" - -echo "Docker Containers currently running:" - -docker ps -``` - -Example Output: - - File Content in the local destination (may not exist yet): - find: /tmp/airbyte_local: No such file or directory - - Start a Postgres container named local-airbyte-postgres-destination - 8e24a9682a1ec2e7539c7ada5d993120d3337cff07a54603fcdb8d44f4013aab - - Docker Containers currently running: - CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES - 8e24a9682a1e postgres "docker-entrypoint.s…" 1 second ago Up Less than a second 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination - - -## Starting Airbyte Server - -As we've seen in the previous tutorial, we can spin up Airbyte instance after installing it: - - -```bash -#!/usr/bin/env bash - -docker-compose up -d - -echo -e "\n" - -echo "Docker Containers currently running:" -docker ps -``` - -Example Output: - - WARNING: The API_URL variable is not set. Defaulting to a blank string. - Creating network "airbyte_default" with the default driver - Creating volume "airbyte_workspace" with default driver - Creating volume "airbyte_data" with default driver - Creating volume "airbyte_db" with default driver - Creating init ... - Creating airbyte-data-seed ... - Creating airbyte-db ... - Creating airbyte-server ... mdone - Creating airbyte-scheduler ... - Creating airbyte-webapp ... mdone - ting airbyte-webapp ... done - - Docker Containers currently running: - CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES - 39cec3eb5953 airbyte/webapp:0.7.1-alpha "/docker-entrypoint.…" 1 second ago Up Less than a second 0.0.0.0:8000->80/tcp airbyte-webapp - f0ff3f8f2b2b airbyte/scheduler:0.7.1-alpha "/bin/bash -c './wai…" 1 second ago Up 1 second airbyte-scheduler - 50448db21673 airbyte/server:0.7.1-alpha "/bin/bash -c './wai…" 1 second ago Up Less than a second 8000/tcp, 0.0.0.0:8001->8001/tcp airbyte-server - 2aa496838b99 airbyte/db:0.7.1-alpha "docker-entrypoint.s…" 2 seconds ago Up 1 second 5432/tcp airbyte-db - 8e24a9682a1e postgres "docker-entrypoint.s…" 8 seconds ago Up 7 seconds 0.0.0.0:3000->5432/tcp local-airbyte-postgres-destination - -Note that if you already went through the previous tutorial or already used Airbyte in the past, you may not need to complete the Onboarding process this time. - -Otherwise, please complete the different steps until you reach the Airbyte Dashboard page. - -After a few seconds, the UI should be ready to go at http://localhost:8000/ - -## Notes about running this tutorial on Mac OS vs Linux - -Note that Docker for Mac is not a real Docker host, now it actually runs a virtual machine behind the scenes and hides it from you to make things simpler. -Here's the simpler version, unless you want to dig deeper... just like our current use case where we want to inspect the content of internal Docker volumes... - -Here are some related links as references on accessing Docker Volumes: -- on Mac OS [Using Docker containers in 2019](https://stackoverflow.com/a/55648186) -- official doc [Use Volume](https://docs.docker.com/storage/volumes/#backup-restore-or-migrate-data-volumes) - -From these discussions, we will be using on Mac OS either: -1. 
any docker container/image to browse the virtual filesystem by mounting the volume in order to access them, for example with [busybox](https://hub.docker.com/_/busybox) -2. or extract files from the volume by copying them onto the host with [Docker cp](https://docs.docker.com/engine/reference/commandline/cp/) - -However as a side remark on Linux, accessing to named Docker Volume can be easier since you simply need to: - - docker volume inspect - -Then look at the "Mountpoint" value, this is where the volume is actually stored in the host filesystem and you can directly retrieve files directly from that folder. - -Back to this tutorial, commands shown below should work for both Mac OS and Linux ! - -## Export Initial Setup - -Now let's first make a backup of the configuration state of your Airbyte instance by running the following commands. - - -```bash -TUTORIAL_DIR=$(pwd)/build/persitence-tutorial -mkdir -p $TUTORIAL_DIR/my-setup - -docker cp airbyte-server:/data $TUTORIAL_DIR/my-setup -``` - -## Configure Exchange Rate source and File destination - -Head back to http://localhost:8000/ and add more connectors. -Here is an example of configuration from an API source: - -![airbyte_config_persistence_api_source](./airbyte_config_persistence_1.png) - -and a local file destination: - -![airbyte_config_persistence_local_file](./airbyte_config_persistence_2.png) - -## Run a Sync job - -- once the source and destination are created -- the catalog and frequency can be configured -- then run the "Sync Now" button -- finally inspect logs in the UI - -![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_3.png) - -## Exploring Logs folders - -We can read from the lines reported in the logs the working directory that is being used to run the synchronization process from. - -As an example in the previous run, it is being ran in `/tmp/workspace/5/0` and we notice the different docker commands being used internally are starting with: - - docker run --rm -i -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -w /data/5/0 --network host ... - -From there, we can observe that Airbyte is using a docker named volume called `airbyte_workspace` that is mounted in the container at the location `/data`. - -Following [Docker Volume documentation](https://docs.docker.com/storage/volumes/), we can inspect and manipulate persisted configuration data in these volumes. -For example, we can run any docker container/image to browse the content of this named volume by mounting it in a similar way, let's use the [busybox](https://hub.docker.com/_/busybox) image. 
- - docker run -it --rm --volume airbyte_workspace:/data busybox - -This will drop you into an `sh` shell to allow you to do what you want inside a BusyBox system from which we can browse the filesystem and accessing to logs files: - - ls /data/5/0/ - -Example Output: - - catalog.json normalize tap_config.json - logs.log singer_rendered_catalog.json target_config.json - -Or you can simply run: - - - -```bash -docker run -it --rm --volume airbyte_workspace:/data busybox ls /data/4/0 -``` - -Example Output: - - catalog.json singer_rendered_catalog.json - logs.log tap_config.json - normalize target_config.json - - - -```bash -docker run -it --rm --volume airbyte_workspace:/data busybox cat /data/5/0/catalog.json -``` - -Example Output: - - {"streams":[{"stream":{"name":"exchange_rate","json_schema":{"type":"object","properties":{"CHF":{"type":"number"},"HRK":{"type":"number"},"date":{"type":"string"},"MXN":{"type":"number"},"ZAR":{"type":"number"},"INR":{"type":"number"},"CNY":{"type":"number"},"THB":{"type":"number"},"AUD":{"type":"number"},"ILS":{"type":"number"},"KRW":{"type":"number"},"JPY":{"type":"number"},"PLN":{"type":"number"},"GBP":{"type":"number"},"IDR":{"type":"number"},"HUF":{"type":"number"},"PHP":{"type":"number"},"TRY":{"type":"number"},"RUB":{"type":"number"},"HKD":{"type":"number"},"ISK":{"type":"number"},"EUR":{"type":"number"},"DKK":{"type":"number"},"CAD":{"type":"number"},"MYR":{"type":"number"},"USD":{"type":"number"},"BGN":{"type":"number"},"NOK":{"type":"number"},"RON":{"type":"number"},"SGD":{"type":"number"},"CZK":{"type":"number"},"SEK":{"type":"number"},"NZD":{"type":"number"},"BRL":{"type":"number"}}},"supported_sync_modes":["full_refresh"],"default_cursor_field":[]},"sync_mode":"full_refresh","cursor_field":[]}]} - -## Check local data folder - -Since the job completed successfully, a new file should be available in the special `/local/` directory in the container which is mounted from `/tmp/airbyte_local` on the host machine. - - -```bash -#!/usr/bin/env bash - -echo "In the container:" - -docker run -it --rm -v /tmp/airbyte_local:/local busybox find /local - -echo "" -echo "On the host:" - -find /tmp/airbyte_local -``` - -Example Output: - - In the container: - /local - /local/data - /local/data/exchange_rate_raw.csv - - On the host: - /tmp/airbyte_local - /tmp/airbyte_local/data - /tmp/airbyte_local/data/exchange_rate_raw.csv - - -## Backup Exchange Rate Source and Destination configurations - -In the following steps, we will play with persistence of configurations so let's make a backup of our newly added connectors for now: - - -```bash -mkdir -p $TUTORIAL_DIR/exchange-rate-setup - -docker cp airbyte-server:data $TUTORIAL_DIR/exchange-rate-setup -``` - -## Shutting down Airbyte server and clear previous configurations - -Whenever you want to stop the Airbyte server, you can run: `docker-compose down` - -From [docker documentation](https://docs.docker.com/compose/reference/down/) -``` -This command stops containers and removes containers, networks, volumes, and images created by up. - -By default, the only things removed are: - -- Containers for services defined in the Compose file -- Networks defined in the networks section of the Compose file -- The default network, if one is used - -Networks and volumes defined as external are never removed. - -Anonymous volumes are not removed by default. However, as they don’t have a stable name, they will not be automatically mounted by a subsequent up. 
For data that needs to persist between updates, use host or named volumes. -``` - -So since Airbyte is using named volumes to store the configurations, if you run -`docker-compose up` again, your connectors configurations from earlier steps will still be available. - -Let's wipe our configurations on purpose and use the following option: - -``` --v, --volumes Remove named volumes declared in the `volumes` - section of the Compose file and anonymous volumes - attached to containers. -``` - -Note that the `/tmp/airbyte_local:/local` that we saw earlier is a [bind mount](https://docs.docker.com/storage/bind-mounts/) so data that was replicated locally won't be affected by the next command. - -However it will get rid of the named volume workspace so all logs and generated files by Airbyte will be lost. - -We can then run: - -```bash -docker-compose down -v -docker-compose up -d -``` -Example Output: - - WARNING: The API_URL variable is not set. Defaulting to a blank string. - Stopping airbyte-webapp ... - Stopping airbyte-scheduler ... - Stopping airbyte-server ... - Stopping airbyte-db ... - Removing airbyte-webapp ... mdone - Removing airbyte-scheduler ... - Removing airbyte-server ... - Removing airbyte-data-seed ... - Removing airbyte-db ... - Removing init ... - Removing network airbyte_defaultdone - Removing volume airbyte_workspace - Removing volume airbyte_data - Removing volume airbyte_db - WARNING: The API_URL variable is not set. Defaulting to a blank string. - Creating network "airbyte_default" with the default driver - Creating volume "airbyte_workspace" with default driver - Creating volume "airbyte_data" with default driver - Creating volume "airbyte_db" with default driver - Creating init ... - Creating airbyte-data-seed ... - Creating airbyte-db ... - Creating airbyte-server ... mdone - Creating airbyte-scheduler ... mdone - Creating airbyte-webapp ... mdone - ting airbyte-webapp ... done - -Wait a moment for the webserver to start and go refresh the page http://localhost:8000/. - -We are prompted with the onboarding process again... - -Let's ignore that step, close the page and go back to the notebook to import configurations from our initial setup instead. - -## Restore our initial setup - -We can play and restore files in the named docker volume `data` and thus retrieve files that were created from earlier: - - - - -```bash -docker cp $TUTORIAL_DIR/my-setup/data/config airbyte-server:data -``` - -Now refresh back the page http://localhost:8000/ again, wait a little bit for the server to pick up the freshly imported configurations... -Tada! We don't need to complete the onboarding process anymore! -and we have the list of connectors that were created previously available again. Thus you can use this ability of export/import files from named volumes to share with others the configuration of your connectors. - -Warning: and it will include credentials, so be careful too! 
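One practical way to hand such an export to a teammate is to bundle the copied `data` folder into an archive first. This is only a sketch, assuming the `$TUTORIAL_DIR/my-setup` export created earlier in this tutorial (the archive name is arbitrary), and remember that the bundle still contains connector credentials in clear text:

```bash
#!/usr/bin/env bash

# Bundle the configuration exported earlier in this tutorial
tar -czf my-airbyte-setup.tar.gz -C $TUTORIAL_DIR/my-setup data

# Whoever receives the archive can unpack it and copy it back
# into their own running airbyte-server container:
#   tar -xzf my-airbyte-setup.tar.gz
#   docker cp data/config airbyte-server:data
```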
- -## Configure some Covid (data) source and Postgres destinations - -Let's re-iterate the source and destination creation, this time, with a file accessible from a public API: - - Here are some examples of public API CSV: - https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv - -![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_4.png) - -And a local Postgres Database: - -![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_5.png) - -After setting up the connectors, we can trigger the sync and study the logs: - -![airbyte_config_persistence_ui_logs](./airbyte_config_persistence_6.png) - -Since we wiped the workspace volume and restarted the Airbyte Server, notice that the process ran in the `/tmp/workspace/5/0` as well but the logs for ExchangeRate are gone... - -## Export and customize Normalization step with DBT - -In the previous connector configuration, selected a Postgres Database destination and chose to enable the "Basic Normalization" option. - -In Airbyte, data is written in destination in a JSON blob format in tables with suffix "_raw" as it is taking care of the `E` and `L` in `ELT`. - -The normalization option adds a last `T` transformation step that takes care of converting such JSON tables into flat tables. -To do so, Airbyte is currently using [DBT](https://docs.getdbt.com/) to handle such tasks which can be manually triggered in the normalization container like this: - - - - -```bash -#!/usr/bin/env bash - -# find latest workspace where normalization was run -NORMALIZE_WORKSPACE=`docker run --rm -i -v airbyte_workspace:/data busybox find /data -path "*normalize/models*" | sed -E "s;/data/([0-9]+/[0-9]+/)normalize/.*;\1;g" | sort | -uniq | tail -n 1` - -docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization debug --profiles-dir=. --project-dir=. -docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization run --profiles-dir=. --project-dir=. -``` -Example Output: - - Running with dbt=0.18.1 - dbt version: 0.18.1 - python version: 3.7.9 - python path: /usr/local/bin/python - os info: Linux-4.19.121-linuxkit-x86_64-with-debian-10.6 - Using profiles.yml file at ./profiles.yml - Using dbt_project.yml file at /data/5/0/normalize/dbt_project.yml - - Configuration: - profiles.yml file [OK found and valid] - dbt_project.yml file [OK found and valid] - - Required dependencies: - - git [OK found] - - Connection: - host: localhost - port: 3000 - user: postgres - database: postgres - schema: quarantine - search_path: None - keepalives_idle: 0 - sslmode: None - Connection test: OK connection ok - - Running with dbt=0.18.1 - Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source - - 14:37:10 | Concurrency: 32 threads (target='prod') - 14:37:10 | - 14:37:10 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN] - 14:37:11 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 17911 in 0.33s] - 14:37:11 | - 14:37:11 | Finished running 1 table model in 0.50s. - - Completed successfully - - Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1 - -As seen earlier, it is possible to browse the workspace folders and examine further logs if an error occurs. 
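For instance, the replication logs of the attempt we inspected earlier can be read straight out of the named workspace volume. A minimal sketch, assuming the same `/tmp/workspace/5/0` attempt as above (adjust the job and attempt numbers to match your own run):

```bash
# Print the last lines of the sync logs stored in the airbyte_workspace volume
docker run --rm -v airbyte_workspace:/data busybox tail -n 50 /data/5/0/logs.log
```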
- -In particular, we can also take a look at the DBT models generated by Airbyte and export them to the local host filesystem: - - -```bash -#!/usr/bin/env bash - -rm -rf $TUTORIAL_DIR/normalization-files -mkdir -p $TUTORIAL_DIR/normalization-files - -docker cp airbyte-server:/tmp/workspace/$NORMALIZE_WORKSPACE/normalize/ $TUTORIAL_DIR/normalization-files - -NORMALIZE_DIR=$TUTORIAL_DIR/normalization-files/normalize -cd $NORMALIZE_DIR -cat $NORMALIZE_DIR/models/generated/*.sql -``` -Example Output: - - with - covid_epidemiology_node as ( - select - emitted_at, - {{ dbt_utils.current_timestamp_in_utc() }} as normalized_at, - cast({{ json_extract_scalar('data', ['date']) }} as {{ dbt_utils.type_string() }}) as date, - cast({{ json_extract_scalar('data', ['new_recovered']) }} as {{ dbt_utils.type_float() }}) as new_recovered, - cast({{ json_extract_scalar('data', ['new_tested']) }} as {{ dbt_utils.type_float() }}) as new_tested, - cast({{ json_extract_scalar('data', ['total_deceased']) }} as {{ dbt_utils.type_float() }}) as total_deceased, - cast({{ json_extract_scalar('data', ['new_deceased']) }} as {{ dbt_utils.type_float() }}) as new_deceased, - cast({{ json_extract_scalar('data', ['new_confirmed']) }} as {{ dbt_utils.type_float() }}) as new_confirmed, - cast({{ json_extract_scalar('data', ['total_confirmed']) }} as {{ dbt_utils.type_float() }}) as total_confirmed, - cast({{ json_extract_scalar('data', ['total_tested']) }} as {{ dbt_utils.type_float() }}) as total_tested, - cast({{ json_extract_scalar('data', ['total_recovered']) }} as {{ dbt_utils.type_float() }}) as total_recovered, - cast({{ json_extract_scalar('data', ['key']) }} as {{ dbt_utils.type_string() }}) as key - from {{ source('quarantine', 'covid_epidemiology_raw') }} - ), - covid_epidemiology_with_id as ( - select - *, - {{ dbt_utils.surrogate_key([ - 'date', - 'new_recovered', - 'new_tested', - 'total_deceased', - 'new_deceased', - 'new_confirmed', - 'total_confirmed', - 'total_tested', - 'total_recovered', - 'key' - ]) }} as _covid_epidemiology_hashid - from covid_epidemiology_node - ) - select * from covid_epidemiology_with_id - -If you have [dbt cli](https://docs.getdbt.com/dbt-cli/cli-overview/) installed on your machine, you can then view, edit, customize and run the dbt models in your project if you want to bypass the normalization steps generated by Airbyte! - - -```bash -#!/usr/bin/env bash - -dbt deps --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR -dbt run --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR --full-refresh -``` -Example Output: - - Running with dbt=0.18.1 - Installing https://github.com/fishtown-analytics/dbt-utils.git@0.6.2 - Installed from revision 0.6.2 - - Running with dbt=0.18.1 - Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source - - 15:37:54 | Concurrency: 32 threads (target='prod') - 15:37:54 | - 15:37:55 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN] - 15:37:55 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 17911 in 0.30s] - 15:37:55 | - 15:37:55 | Finished running 1 table model in 0.51s. - - Completed successfully - - Done. 
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1 - \ No newline at end of file diff --git a/docs/tutorial/airbyte_config_persistence_1.png b/docs/tutorial/airbyte_config_persistence_1.png deleted file mode 100644 index 426e215cb63c6..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_1.png and /dev/null differ diff --git a/docs/tutorial/airbyte_config_persistence_2.png b/docs/tutorial/airbyte_config_persistence_2.png deleted file mode 100644 index e409378182202..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_2.png and /dev/null differ diff --git a/docs/tutorial/airbyte_config_persistence_3.png b/docs/tutorial/airbyte_config_persistence_3.png deleted file mode 100644 index 33f7a53cb98be..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_3.png and /dev/null differ diff --git a/docs/tutorial/airbyte_config_persistence_4.png b/docs/tutorial/airbyte_config_persistence_4.png deleted file mode 100644 index de685e7d25d81..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_4.png and /dev/null differ diff --git a/docs/tutorial/airbyte_config_persistence_5.png b/docs/tutorial/airbyte_config_persistence_5.png deleted file mode 100644 index 8563fd3feed3d..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_5.png and /dev/null differ diff --git a/docs/tutorial/airbyte_config_persistence_6.png b/docs/tutorial/airbyte_config_persistence_6.png deleted file mode 100644 index 3a53ed86aadcc..0000000000000 Binary files a/docs/tutorial/airbyte_config_persistence_6.png and /dev/null differ diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md new file mode 100644 index 0000000000000..84ce15b788618 --- /dev/null +++ b/docs/tutorials/README.md @@ -0,0 +1,2 @@ +# Tutorials + diff --git a/docs/tutorials/airbyte-config-persistence.md b/docs/tutorials/airbyte-config-persistence.md new file mode 100644 index 0000000000000..2bc29d437451f --- /dev/null +++ b/docs/tutorials/airbyte-config-persistence.md @@ -0,0 +1,527 @@ +# Config & Persistence + +Once you manage to spin up a local instance of Airbyte, following steps in the [Getting started Tutorial](getting-started.md), you may want to gain a better understanding of what configuration files are available in Airbyte and how to work with it. + +As we continue to improve the User experience around those aspects to make it simpler in the UI, this tutorial will go over how to manually import and export Airbyte configurations of connectors. + +This may be useful if you need, for example, to version control, make a backup, share with your team, or if you just want to debug and learn more about Airbyte internals. + +Here are the goals for this tutorial: + +1. Access replication logs files +2. Export & Import Airbyte Configurations +3. Export normalization models to use in your own DBT project + +## Setting up a local Postgres Destination + +For this tutorial, we are going to use 2 types of destinations to run our demo where data will be written: + +* Local File Destination +* Local Postgres Database + +The local files will be written by default to the directory `/tmp/airbyte_local`. 
+
+The `postgres` database that we are going to spin up below will be running locally with the following configuration where data will be written:
+
+* Host: localhost
+* Port: 3000
+* User: postgres
+* Password: password
+* DB Name: postgres
+
+```bash
+#!/usr/bin/env bash
+
+echo "File Content in the local destination (may not exist yet):"
+find /tmp/airbyte_local
+
+echo ""
+
+docker ps | grep -q local-airbyte-postgres-destination
+if [ $? -eq 0 ]; then
+  echo "Postgres Database local-airbyte-postgres-destination is already up"
+else
+  echo "Start a Postgres container named local-airbyte-postgres-destination"
+  docker run --rm --name local-airbyte-postgres-destination -e POSTGRES_PASSWORD=password -p 3000:5432 -d postgres
+fi
+
+echo ""
+
+echo "Docker Containers currently running:"
+
+docker ps
+```
+
+Example Output:
+
+```text
+File Content in the local destination (may not exist yet):
+find: /tmp/airbyte_local: No such file or directory
+
+Start a Postgres container named local-airbyte-postgres-destination
+8e24a9682a1ec2e7539c7ada5d993120d3337cff07a54603fcdb8d44f4013aab
+
+Docker Containers currently running:
+CONTAINER ID   IMAGE      COMMAND                  CREATED        STATUS                  PORTS                    NAMES
+8e24a9682a1e   postgres   "docker-entrypoint.s…"   1 second ago   Up Less than a second   0.0.0.0:3000->5432/tcp   local-airbyte-postgres-destination
+```
+
+## Starting Airbyte Server
+
+As we've seen in the previous tutorial, we can spin up an Airbyte instance after installing it:
+
+```bash
+#!/usr/bin/env bash
+
+docker-compose up -d
+
+echo -e "\n"
+
+echo "Docker Containers currently running:"
+docker ps
+```
+
+Example Output:
+
+```text
+WARNING: The API_URL variable is not set. Defaulting to a blank string.
+Creating network "airbyte_default" with the default driver
+Creating volume "airbyte_workspace" with default driver
+Creating volume "airbyte_data" with default driver
+Creating volume "airbyte_db" with default driver
+Creating init ... done
+Creating airbyte-data-seed ... done
+Creating airbyte-db ... done
+Creating airbyte-server ... done
+Creating airbyte-scheduler ... done
+Creating airbyte-webapp ... done
+
+Docker Containers currently running:
+CONTAINER ID   IMAGE                           COMMAND                  CREATED         STATUS                  PORTS                              NAMES
+39cec3eb5953   airbyte/webapp:0.7.1-alpha      "/docker-entrypoint.…"   1 second ago    Up Less than a second   0.0.0.0:8000->80/tcp               airbyte-webapp
+f0ff3f8f2b2b   airbyte/scheduler:0.7.1-alpha   "/bin/bash -c './wai…"   1 second ago    Up 1 second                                                airbyte-scheduler
+50448db21673   airbyte/server:0.7.1-alpha      "/bin/bash -c './wai…"   1 second ago    Up Less than a second   8000/tcp, 0.0.0.0:8001->8001/tcp   airbyte-server
+2aa496838b99   airbyte/db:0.7.1-alpha          "docker-entrypoint.s…"   2 seconds ago   Up 1 second             5432/tcp                           airbyte-db
+8e24a9682a1e   postgres                        "docker-entrypoint.s…"   8 seconds ago   Up 7 seconds            0.0.0.0:3000->5432/tcp             local-airbyte-postgres-destination
+```
+
+Note that if you already went through the previous tutorial or have already used Airbyte in the past, you may not need to complete the Onboarding process this time.
+
+Otherwise, please complete the different steps until you reach the Airbyte Dashboard page.
+
+After a few seconds, the UI should be ready to go at [http://localhost:8000/](http://localhost:8000/)
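+
+The `docker-compose up -d` command returns before the containers are fully ready. If you are scripting this tutorial, a minimal sketch like the following (assuming `curl` is available on the host and that you kept the default `8000` port) can wait for the webapp before moving on:
+
+```bash
+#!/usr/bin/env bash
+
+# Poll the Airbyte webapp until it starts answering HTTP requests.
+until curl -s -o /dev/null http://localhost:8000/; do
+  echo "Waiting for the Airbyte UI to become available..."
+  sleep 2
+done
+echo "Airbyte UI is up at http://localhost:8000/"
+```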
+
+## Notes about running this tutorial on macOS vs Linux
+
+Note that Docker for Mac is not a real Docker host: it actually runs a virtual machine behind the scenes and hides it from you to make things simpler. That simpler setup works well until you want to dig deeper, just like our current use case where we want to inspect the content of internal Docker volumes.
+
+Here are some related links as references on accessing Docker Volumes:
+
+* on macOS, [Using Docker containers in 2019](https://stackoverflow.com/a/55648186)
+* official doc [Use Volume](https://docs.docker.com/storage/volumes/#backup-restore-or-migrate-data-volumes)
+
+From these discussions, on macOS we will be using either:
+
+1. any docker container/image to browse the virtual filesystem by mounting the volume in order to access it, for example with [busybox](https://hub.docker.com/_/busybox)
+2. or extract files from the volume by copying them onto the host with [Docker cp](https://docs.docker.com/engine/reference/commandline/cp/)
+
+However, as a side remark, on Linux accessing a named Docker Volume can be easier since you simply need to run:
+
+```text
+docker volume inspect <volume_name>
+```
+
+Then look at the `Mountpoint` value: this is where the volume is actually stored in the host filesystem, and you can retrieve files directly from that folder.
+
+Back to this tutorial, the commands shown below should work on both macOS and Linux!
+
+## Export Initial Setup
+
+Now, let's first make a backup of the configuration state of your Airbyte instance by running the following commands:
+
+```bash
+TUTORIAL_DIR=$(pwd)/build/persistence-tutorial
+mkdir -p $TUTORIAL_DIR/my-setup
+
+docker cp airbyte-server:/data $TUTORIAL_DIR/my-setup
+```
+
+## Configure Exchange Rate source and File destination
+
+Head back to [http://localhost:8000/](http://localhost:8000/) and add more connectors. Here is an example of configuration for an API source:
+
+![airbyte\_config\_persistence\_api\_source](../.gitbook/assets/airbyte_config_persistence_1.png)
+
+and a local file destination:
+
+![airbyte\_config\_persistence\_local\_file](../.gitbook/assets/airbyte_config_persistence_2.png)
+
+## Run a Sync job
+
+* Once the source and destination are created,
+* the catalog and sync frequency can be configured,
+* then click the "Sync Now" button,
+* and finally inspect the logs in the UI.
+
+![airbyte\_config\_persistence\_ui\_logs](../.gitbook/assets/airbyte_config_persistence_3.png)
+
+## Exploring Logs folders
+
+From the lines reported in the logs, we can read the working directory that is used to run the synchronization process.
+
+In the previous run, for example, it ran in `/tmp/workspace/5/0`, and we notice that the docker commands used internally start with:
+
+```text
+docker run --rm -i -v airbyte_workspace:/data -v /tmp/airbyte_local:/local -w /data/5/0 --network host ...
+```
+
+From there, we can observe that Airbyte is using a named docker volume called `airbyte_workspace` that is mounted in the container at the location `/data`.
+
+Following the [Docker Volume documentation](https://docs.docker.com/storage/volumes/), we can inspect and manipulate persisted configuration data in these volumes. For example, we can run any docker container/image to browse the content of this named volume by mounting it in a similar way; let's use the [busybox](https://hub.docker.com/_/busybox) image.
+ +```text +docker run -it --rm --volume airbyte_workspace:/data busybox +``` + +This will drop you into an `sh` shell to allow you to do what you want inside a BusyBox system from which we can browse the filesystem and accessing to logs files: + +```text +ls /data/5/0/ +``` + +Example Output: + +```text +catalog.json normalize tap_config.json +logs.log singer_rendered_catalog.json target_config.json +``` + +Or you can simply run: + +```bash +docker run -it --rm --volume airbyte_workspace:/data busybox ls /data/4/0 +``` + +Example Output: + +```text +catalog.json singer_rendered_catalog.json +logs.log tap_config.json +normalize target_config.json +``` + +```bash +docker run -it --rm --volume airbyte_workspace:/data busybox cat /data/5/0/catalog.json +``` + +Example Output: + +```text +{"streams":[{"stream":{"name":"exchange_rate","json_schema":{"type":"object","properties":{"CHF":{"type":"number"},"HRK":{"type":"number"},"date":{"type":"string"},"MXN":{"type":"number"},"ZAR":{"type":"number"},"INR":{"type":"number"},"CNY":{"type":"number"},"THB":{"type":"number"},"AUD":{"type":"number"},"ILS":{"type":"number"},"KRW":{"type":"number"},"JPY":{"type":"number"},"PLN":{"type":"number"},"GBP":{"type":"number"},"IDR":{"type":"number"},"HUF":{"type":"number"},"PHP":{"type":"number"},"TRY":{"type":"number"},"RUB":{"type":"number"},"HKD":{"type":"number"},"ISK":{"type":"number"},"EUR":{"type":"number"},"DKK":{"type":"number"},"CAD":{"type":"number"},"MYR":{"type":"number"},"USD":{"type":"number"},"BGN":{"type":"number"},"NOK":{"type":"number"},"RON":{"type":"number"},"SGD":{"type":"number"},"CZK":{"type":"number"},"SEK":{"type":"number"},"NZD":{"type":"number"},"BRL":{"type":"number"}}},"supported_sync_modes":["full_refresh"],"default_cursor_field":[]},"sync_mode":"full_refresh","cursor_field":[]}]} +``` + +## Check local data folder + +Since the job completed successfully, a new file should be available in the special `/local/` directory in the container which is mounted from `/tmp/airbyte_local` on the host machine. + +```bash +#!/usr/bin/env bash + +echo "In the container:" + +docker run -it --rm -v /tmp/airbyte_local:/local busybox find /local + +echo "" +echo "On the host:" + +find /tmp/airbyte_local +``` + +Example Output: + +```text +In the container: +/local +/local/data +/local/data/exchange_rate_raw.csv + +On the host: +/tmp/airbyte_local +/tmp/airbyte_local/data +/tmp/airbyte_local/data/exchange_rate_raw.csv +``` + +## Backup Exchange Rate Source and Destination configurations + +In the following steps, we will play with persistence of configurations so let's make a backup of our newly added connectors for now: + +```bash +mkdir -p $TUTORIAL_DIR/exchange-rate-setup + +docker cp airbyte-server:data $TUTORIAL_DIR/exchange-rate-setup +``` + +## Shutting down Airbyte server and clear previous configurations + +Whenever you want to stop the Airbyte server, you can run: `docker-compose down` + +From [docker documentation](https://docs.docker.com/compose/reference/down/) + +```text +This command stops containers and removes containers, networks, volumes, and images created by up. + +By default, the only things removed are: + +- Containers for services defined in the Compose file +- Networks defined in the networks section of the Compose file +- The default network, if one is used + +Networks and volumes defined as external are never removed. + +Anonymous volumes are not removed by default. However, as they don’t have a stable name, they will not be automatically mounted by a subsequent up. 
For data that needs to persist between updates, use host or named volumes.
+```
+
+So since Airbyte is using named volumes to store the configurations, if you run `docker-compose up` again, your connector configurations from earlier steps will still be available.
+
+Let's wipe our configurations on purpose and use the following option:
+
+```text
+    -v, --volumes           Remove named volumes declared in the `volumes`
+                            section of the Compose file and anonymous volumes
+                            attached to containers.
+```
+
+Note that the `/tmp/airbyte_local:/local` that we saw earlier is a [bind mount](https://docs.docker.com/storage/bind-mounts/), so data that was replicated locally won't be affected by the next command.
+
+However, it will get rid of the named workspace volume, so all logs and files generated by Airbyte will be lost.
+
+We can then run:
+
+```bash
+docker-compose down -v
+docker-compose up -d
+```
+
+Example Output:
+
+```text
+WARNING: The API_URL variable is not set. Defaulting to a blank string.
+Stopping airbyte-webapp ... done
+Stopping airbyte-scheduler ... done
+Stopping airbyte-server ... done
+Stopping airbyte-db ... done
+Removing airbyte-webapp ... done
+Removing airbyte-scheduler ... done
+Removing airbyte-server ... done
+Removing airbyte-data-seed ... done
+Removing airbyte-db ... done
+Removing init ... done
+Removing network airbyte_default
+Removing volume airbyte_workspace
+Removing volume airbyte_data
+Removing volume airbyte_db
+WARNING: The API_URL variable is not set. Defaulting to a blank string.
+Creating network "airbyte_default" with the default driver
+Creating volume "airbyte_workspace" with default driver
+Creating volume "airbyte_data" with default driver
+Creating volume "airbyte_db" with default driver
+Creating init ... done
+Creating airbyte-data-seed ... done
+Creating airbyte-db ... done
+Creating airbyte-server ... done
+Creating airbyte-scheduler ... done
+Creating airbyte-webapp ... done
+```
+
+Wait a moment for the webserver to start, then refresh the page [http://localhost:8000/](http://localhost:8000/).
+
+We are prompted with the onboarding process again...
+
+Let's ignore that step, close the page and go back to the terminal to import configurations from our initial setup instead.
+
+## Restore our initial setup
+
+We can restore files into the named docker volume `data` and thus bring back the files that were created earlier:
+
+```bash
+docker cp $TUTORIAL_DIR/my-setup/data/config airbyte-server:data
+```
+
+Now refresh the page [http://localhost:8000/](http://localhost:8000/) again and wait a little bit for the server to pick up the freshly imported configurations... Tada! We don't need to complete the onboarding process anymore, and the list of connectors that were created previously is available again. You can thus use this ability to export/import files from named volumes to share your connector configurations with others.
+
+Warning: the exported configuration includes credentials, so be careful about how you share it!
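+
+Since the exported configuration is a plain directory on the host, it is easy to snapshot it before sharing or version-controlling it. For example, here is a minimal sketch (reusing the `$TUTORIAL_DIR` variable defined earlier) that archives the backup with a timestamp:
+
+```bash
+#!/usr/bin/env bash
+
+# Bundle the exported configuration into a timestamped tarball.
+# Remember that it contains connector credentials, so store it securely.
+tar -czf $TUTORIAL_DIR/airbyte-config-$(date +%Y%m%d-%H%M%S).tar.gz -C $TUTORIAL_DIR/my-setup data
+```
+
+To restore it later, extract the archive and `docker cp` the `data/config` folder back into the `airbyte-server` container as shown above.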
+
+## Configure some Covid \(data\) source and Postgres destinations
+
+Let's repeat the source and destination creation, this time with a file accessible from a public API:
+
+```text
+Here are some examples of public API CSV:
+https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv
+```
+
+![airbyte\_config\_persistence\_ui\_logs](../.gitbook/assets/airbyte_config_persistence_4.png)
+
+And a local Postgres Database:
+
+![airbyte\_config\_persistence\_ui\_logs](../.gitbook/assets/airbyte_config_persistence_5.png)
+
+After setting up the connectors, we can trigger the sync and study the logs:
+
+![airbyte\_config\_persistence\_ui\_logs](../.gitbook/assets/airbyte_config_persistence_6.png)
+
+Since we wiped the workspace volume and restarted the Airbyte server, notice that the process ran in `/tmp/workspace/5/0` again, but the logs for the ExchangeRate sync are gone...
+
+## Export and customize Normalization step with DBT
+
+In the previous connector configuration, we selected a Postgres database destination and chose to enable the "Basic Normalization" option.
+
+In Airbyte, data is written to the destination as JSON blobs in tables with the suffix "\_raw", since Airbyte takes care of the `E` and `L` in `ELT`.
+
+The normalization option adds a last `T` transformation step that takes care of converting such JSON tables into flat tables. To do so, Airbyte is currently using [DBT](https://docs.getdbt.com/) to handle such tasks, which can be manually triggered in the normalization container like this:
+
+```bash
+#!/usr/bin/env bash
+
+# find latest workspace where normalization was run
+NORMALIZE_WORKSPACE=`docker run --rm -i -v airbyte_workspace:/data busybox find /data -path "*normalize/models*" | sed -E "s;/data/([0-9]+/[0-9]+/)normalize/.*;\1;g" | sort |
+uniq | tail -n 1`
+
+docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization debug --profiles-dir=. --project-dir=.
+docker run --rm -i -v airbyte_workspace:/data -w /data/$NORMALIZE_WORKSPACE/normalize --network host --entrypoint /usr/local/bin/dbt airbyte/normalization run --profiles-dir=. --project-dir=.
+```
+
+Example Output:
+
+```text
+Running with dbt=0.18.1
+dbt version: 0.18.1
+python version: 3.7.9
+python path: /usr/local/bin/python
+os info: Linux-4.19.121-linuxkit-x86_64-with-debian-10.6
+Using profiles.yml file at ./profiles.yml
+Using dbt_project.yml file at /data/5/0/normalize/dbt_project.yml
+
+Configuration:
+  profiles.yml file [OK found and valid]
+  dbt_project.yml file [OK found and valid]
+
+Required dependencies:
+ - git [OK found]
+
+Connection:
+  host: localhost
+  port: 3000
+  user: postgres
+  database: postgres
+  schema: quarantine
+  search_path: None
+  keepalives_idle: 0
+  sslmode: None
+  Connection test: OK connection ok
+
+Running with dbt=0.18.1
+Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source
+
+14:37:10 | Concurrency: 32 threads (target='prod')
+14:37:10 |
+14:37:10 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN]
+14:37:11 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 17911 in 0.33s]
+14:37:11 |
+14:37:11 | Finished running 1 table model in 0.50s.
+
+Completed successfully
+
+Done. 
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1 +``` + +As seen earlier, it is possible to browse the workspace folders and examine further logs if an error occurs. + +In particular, we can also take a look at the DBT models generated by Airbyte and export them to the local host filesystem: + +```bash +#!/usr/bin/env bash + +rm -rf $TUTORIAL_DIR/normalization-files +mkdir -p $TUTORIAL_DIR/normalization-files + +docker cp airbyte-server:/tmp/workspace/$NORMALIZE_WORKSPACE/normalize/ $TUTORIAL_DIR/normalization-files + +NORMALIZE_DIR=$TUTORIAL_DIR/normalization-files/normalize +cd $NORMALIZE_DIR +cat $NORMALIZE_DIR/models/generated/*.sql +``` + +Example Output: + +```text +with +covid_epidemiology_node as ( + select + emitted_at, + {{ dbt_utils.current_timestamp_in_utc() }} as normalized_at, + cast({{ json_extract_scalar('data', ['date']) }} as {{ dbt_utils.type_string() }}) as date, + cast({{ json_extract_scalar('data', ['new_recovered']) }} as {{ dbt_utils.type_float() }}) as new_recovered, + cast({{ json_extract_scalar('data', ['new_tested']) }} as {{ dbt_utils.type_float() }}) as new_tested, + cast({{ json_extract_scalar('data', ['total_deceased']) }} as {{ dbt_utils.type_float() }}) as total_deceased, + cast({{ json_extract_scalar('data', ['new_deceased']) }} as {{ dbt_utils.type_float() }}) as new_deceased, + cast({{ json_extract_scalar('data', ['new_confirmed']) }} as {{ dbt_utils.type_float() }}) as new_confirmed, + cast({{ json_extract_scalar('data', ['total_confirmed']) }} as {{ dbt_utils.type_float() }}) as total_confirmed, + cast({{ json_extract_scalar('data', ['total_tested']) }} as {{ dbt_utils.type_float() }}) as total_tested, + cast({{ json_extract_scalar('data', ['total_recovered']) }} as {{ dbt_utils.type_float() }}) as total_recovered, + cast({{ json_extract_scalar('data', ['key']) }} as {{ dbt_utils.type_string() }}) as key + from {{ source('quarantine', 'covid_epidemiology_raw') }} +), +covid_epidemiology_with_id as ( + select + *, + {{ dbt_utils.surrogate_key([ + 'date', + 'new_recovered', + 'new_tested', + 'total_deceased', + 'new_deceased', + 'new_confirmed', + 'total_confirmed', + 'total_tested', + 'total_recovered', + 'key' + ]) }} as _covid_epidemiology_hashid + from covid_epidemiology_node +) +select * from covid_epidemiology_with_id +``` + +If you have [dbt cli](https://docs.getdbt.com/dbt-cli/cli-overview/) installed on your machine, you can then view, edit, customize and run the dbt models in your project if you want to bypass the normalization steps generated by Airbyte! + +```bash +#!/usr/bin/env bash + +dbt deps --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR +dbt run --profiles-dir=$NORMALIZE_DIR --project-dir=$NORMALIZE_DIR --full-refresh +``` + +Example Output: + +```text +Running with dbt=0.18.1 +Installing https://github.com/fishtown-analytics/dbt-utils.git@0.6.2 + Installed from revision 0.6.2 + +Running with dbt=0.18.1 +Found 1 model, 0 tests, 0 snapshots, 0 analyses, 302 macros, 0 operations, 0 seed files, 1 source + +15:37:54 | Concurrency: 32 threads (target='prod') +15:37:54 | +15:37:55 | 1 of 1 START table model quarantine.covid_epidemiology....................................................... [RUN] +15:37:55 | 1 of 1 OK created table model quarantine.covid_epidemiology.................................................. [SELECT 17911 in 0.30s] +15:37:55 | +15:37:55 | Finished running 1 table model in 0.51s. + +Completed successfully + +Done. 
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1 +``` + diff --git a/docs/getting-started-tutorial.md b/docs/tutorials/getting-started.md similarity index 94% rename from docs/getting-started-tutorial.md rename to docs/tutorials/getting-started.md index a60f38cf277ff..b76c8641256ca 100644 --- a/docs/getting-started-tutorial.md +++ b/docs/tutorials/getting-started.md @@ -24,13 +24,13 @@ Once you see an Airbyte banner, the UI is ready to go at [http://localhost:8000/ You should see an onboarding page. Enter your email if you want updates about Airbyte and continue. -![](.gitbook/assets/airbyte_get-started%20%289%29%20%282%29.png) +![](../.gitbook/assets/airbyte_get-started%20%289%29%20%282%29.png) ## 2. Set up your first connection Now you will see a wizard that allows you choose the data you want to send through Airbyte. -![](.gitbook/assets/02_set-up-sources%20%289%29.png) +![](../.gitbook/assets/02_set-up-sources%20%289%29.png) As of our alpha launch, we have one database source \(Postgres\) and two API sources \(an exchange rate API and the Stripe API\). We're currently building an integration framework that makes it easy to create sources and destinations, so you should expect many more soon. Please reach out to us if you need a specific connector or would like to help build one. @@ -76,7 +76,7 @@ DB Name: postgres After adding the destination, you can choose what tables and columns you want to sync. -![](.gitbook/assets/03_set-up-connection%20%283%29%20%283%29.png) +![](../.gitbook/assets/03_set-up-connection%20%283%29%20%283%29.png) For this demo, we recommend leaving the defaults and selecting "Every 5 Minutes" as the frequency. Click `Set Up Connection` to finish setting up the sync. @@ -84,7 +84,7 @@ For this demo, we recommend leaving the defaults and selecting "Every 5 Minutes" You should now see a list of sources with the source you just added. Click on it to find more information about your connection. This is the page where you can update any settings about this source and how it syncs. There should be a `Completed` job under the history section. If you click on that run, it will show logs from that run. -![](.gitbook/assets/04_source-details%20%289%29%20%282%29.png) +![](../.gitbook/assets/04_source-details%20%289%29%20%282%29.png) One of biggest problems we've seen in tools like Fivetran is the lack of visibility when debugging. In Airbyte, allowing full log access and the ability to debug and fix connector problems is one of our highest priorities. We'll be working hard to make these logs accessible and understandable.