
AWS S3 Staging for Snowflake Destinations not working correctly #3935

Closed
tylerdelange opened this issue Jun 7, 2021 · 7 comments · Fixed by #3960

Comments

@tylerdelange

Expected Behavior

I expect the connector to first stage the ingested data in an S3 location and then copy it into the Snowflake destination when complete.

Current Behavior

When running data ingestion (from MySQL), the data is loaded as if the destination were in the "Standard Inserts" configuration. Selecting the "AWS S3 Staging" configuration also doesn't seem to be saved.

More Information

AWS S3 Staging for the Snowflake destination connector is not currently working. The proper information is entered and all connection tests pass, but when running ingestion, instead of the files being staged into S3 first and then copied into Snowflake, the regular "Standard Inserts" method is used. It seems like selecting "AWS S3 Staging" just doesn't stick: if I select it, test the connection (it passes), then leave and go back into the Snowflake destination settings, "Standard Inserts" is shown as the default again.
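One way to check whether the backend actually kept the S3 settings is to query the config API directly. This is only a sketch: it assumes the API exposes a destinations/get endpoint analogous to the sources/get call shown later in this thread, and the destinationId is a placeholder.

(async () => {
  // Ask the Airbyte config API what it actually stored for this destination.
  const response = await fetch("http://localhost:8001/api/v1/destinations/get", {
    method: "POST",
    headers: { "Accept": "application/json", "Content-Type": "application/json" },
    body: JSON.stringify({ destinationId: "<your-destination-id>" }),
  });
  const destination = await response.json();
  // If loading_method still contains the S3 fields, the backend kept the
  // configuration and only the UI display is wrong.
  console.log(JSON.stringify(destination.connectionConfiguration.loading_method, null, 2));
})();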

Steps to Reproduce

  1. Set Up Snowflake Destination
  2. Select AWS S3 Staging
  3. Save configuration and exit
  4. Go back into Snowflake Destination Settings
  5. Notice that "AWS S3 Staging" has reverted back to "Standard Inserts"

Severity of the bug for you

Medium

Airbyte Version

Airbyte version: 0.24.4-alpha
airbyte/destination-snowflake:0.3.7
airbyte/source-mysql:0.3.5

@tylerdelange tylerdelange added the type/bug Something isn't working label Jun 7, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Jun 7, 2021
@marcosmarxm
Member

@sherifnada this is not a problem with the connector. I set up a Postgres source with CDC and the error happens there too.
@tylerdelange the source/destination is created correctly, but the UI cannot load the component correctly.
@jamakase fyi

@marcosmarxm
Member

Setting up a new source with CDC:

[screenshot]

Visiting the settings page after setting up the source:

[screenshot]

curl -X POST "http://localhost:8001/api/v1/sources/get" \
 -H "Accept: application/json" \
 -H "Content-Type: application/json" \
 -d '{"sourceId":"353715d2-f303-402f-b572-9b2e47e7bca7"}' 
{
  "sourceDefinitionId": "decd338e-5647-4c0b-adf4-da0e75f5a750",
  "sourceId": "353715d2-f303-402f-b572-9b2e47e7bca7",
  "workspaceId": "5ae6b09b-fdec-41af-aaf7-7d94cfc33ef6",
  "connectionConfiguration": {
    "replication_method": {
      "replication_slot": "slot1",
      "publication": "pub1"
    },
    "ssl": false,
    "password": "**********",
    "username": "postgres",
    "database": "postgres",
    "port": 2000,
    "host": "localhost"
  },
  "name": "pg-source-with-cdc",
  "sourceName": "Postgres"
}

@marcosmarxm marcosmarxm added area/frontend and removed area/connectors Connector related issues labels Jun 8, 2021
@marcosmarxm
Member

marcosmarxm commented Jun 8, 2021

Based on the File connector, I added a property called method that is common to all oneOf options. This solves the problem, but I don't know if it is the best way to solve it. @sherifnada wdyt?
[screenshots]

@sherifnada
Contributor

@marcosmarxm I don't follow the fix you found -- are you saying you changed something in the spec.json that made this work?

@marcosmarxm
Member

marcosmarxm commented Jun 8, 2021

yes, I added the required field method to each one.

{
...
      "loading_method": {
        "type": "object",
        "title": "Loading Method",
        "description": "Loading method used to send data to Snowflake.",
        "order": 7,
        "oneOf": [
          {
            "title": "Standard Inserts",
            "additionalProperties": false,
            "description": "Uses <pre>INSERT</pre> statements to send batches of records to Snowflake. Easiest (no setup) but not recommended for large production workloads due to slow speed.",
            "required": ["method"],
            "properties": {
              "method": {
                "type": "string",
                "enum": ["Standard"],
                "default": "Standard"
              }
            }
          },
          {
            "title": "AWS S3 Staging",
            "additionalProperties": false,
            "description": "Writes large batches of records to a file, uploads the file to S3, then uses <pre>COPY INTO table</pre> to upload the file. Recommended for large production workloads for better speed and scalability.",
            "required": [
              "method",
              "s3_bucket_name",
              "access_key_id",
              "secret_access_key"
            ],
            "properties": {
              "method": {
                "type": "string",
                "enum": ["S3"],
                "default": "S3",
                "order": 0
              },
              "s3_bucket_name": { "required": ["method"],
            "properties": {
              "method": {
                "type": "string",
                "enum": ["Standard"],
                "default": "Standard"
              }
                "title": "S3 Bucket Name",
                "type": "string",
                "description": "The name of the staging S3 bucket. Airbyte will write files to this bucket and read them via <pre>COPY</pre> statements on Snowflake.",
                "examples": ["airbyte.staging"],
                "order": 1
              },
              "s3_bucket_region": {
                "title": "S3 Bucket Region",
                "type": "string",
                "default": "",
                "description": "The region of the S3 staging bucket to use if utilising a copy strategy.",
                "enum": [
                  "",
                  "us-east-1",
                  "us-east-2",
                  "us-west-1",
                  "us-west-2",
                  "af-south-1",
                  "ap-east-1",
                  "ap-south-1",
                  "ap-northeast-1",
                  "ap-northeast-2",
                  "ap-northeast-3",
                  "ap-southeast-1",
                  "ap-southeast-2",
                  "ca-central-1",
                  "cn-north-1",
                  "cn-northwest-1",
                  "eu-west-1",
                  "eu-west-2",
                  "eu-west-3",
                  "eu-south-1",
                  "eu-north-1",
                  "sa-east-1",
                  "me-south-1"
                ],
                "order": 2
              },
              "access_key_id": {
                "type": "string",
                "description": "The Access Key Id granting allow one to access the above S3 staging bucket. Airbyte requires Read and Write permissions to the given bucket.",
                "title": "S3 Key Id",
                "airbyte_secret": true,
                "order": 3
              },
              "secret_access_key": {
                "type": "string",
                "description": "The corresponding secret to the above access key id.",
                "title": "S3 Access Key",
                "airbyte_secret": true,
                "order": 4
              }
            }
          },
          {
            "title": "GCS Staging",
            "additionalProperties": false,
            "description": "Writes large batches of records to a file, uploads the file to GCS, then uses <pre>COPY INTO table</pre> to upload the file. Recommended for large production workloads for better speed and scalability.",
            "required": ["method", "project_id", "bucket_name", "credentials_json"],
            "properties": { "required": ["method"],
            "properties": {
              "method": {
                "type": "string",
                "enum": ["Standard"],
                "default": "Standard"
              }
              "method": {
                "type": "string",
                "enum": ["GCS"],
                "default": "GCS",
                "order": 0
              },
              "project_id": {
                "title": "GCP Project ID",
                "type": "string",
                "description": "The name of the GCP project ID for your credentials.",
                "examples": ["my-project"],
                "order": 1
              },
              "bucket_name": {
                "title": "GCS Bucket Name",
                "type": "string",
                "description": "The name of the staging GCS bucket. Airbyte will write files to this bucket and read them via <pre>COPY</pre> statements on Snowflake.",
                "examples": ["airbyte-staging"],
                "order": 2
              },
              "credentials_json": {
                "title": "Google Application Credentials",
                "type": "string",
                "description": "The contents of the JSON key file that has read/write permissions to the staging GCS bucket. You will separately need to grant bucket access to your Snowflake GCP service account. See the <a href=\"https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys\">GCP docs</a> for more information on how to generate a JSON key for your service account.",
                "airbyte_secret": true,
                "multiline": true,
                "order": 3
              }
            }
          }
        ]
      },
      "basic_normalization": {
        "type": "boolean",
        "default": true,
        "description": "Whether or not to normalize the data in the destination. See <a href=\"https://docs.airbyte.io/architecture/basic-normalization\">basic normalization</a> for more details.",
        "title": "Basic Normalization",
        "examples": [true, false],
        "order": 8
      }
    }
  }
}

@jamakase
Contributor

jamakase commented Jun 8, 2021

@marcosmarxm this is intended behaviour from the UI's point of view, due to the way we handle oneOf. Because the backend doesn't return which of the options was selected, we have to guess the selected path on the frontend: we traverse all the oneOf json_schemas until we find one that matches.

              case "formGroup":
                const selectedValues = get(formValues, subConditionItems.path);

                const subPathSchema = buildYupFormForJsonSchema({
                  type: "object",
                  ...subConditionItems.jsonSchema,
                });

                if (subPathSchema.isValidSync(selectedValues)) {
                  return key;
                }

Because the path with Standard replication is actually just an empty object with no properties, it matches the yup validator. We should probably add a check for unknown fields (yup's noUnknown) to the validation, so that the empty schema only matches a truly empty object.
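A minimal sketch (not the actual frontend code; the config object is made up) of why the empty Standard schema matches an S3 configuration, and how yup's noUnknown combined with strict mode would prevent that:

import * as yup from "yup";

// A saved config for the S3 option, which has no "method" discriminator yet.
const s3Config = { s3_bucket_name: "airbyte.staging", access_key_id: "xxx" };

// The "Standard Inserts" option has no properties, so its schema is an empty object.
const standardSchema = yup.object().shape({});
console.log(standardSchema.isValidSync(s3Config)); // true: any object matches

// Rejecting unknown fields (strict mode keeps them from being stripped during cast)
// makes the empty schema match only a truly empty object.
const strictStandardSchema = yup.object().shape({}).noUnknown().strict(true);
console.log(strictStandardSchema.isValidSync(s3Config)); // false
console.log(strictStandardSchema.isValidSync({})); // true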

You can also find more info in this discussion #962

@cgardens
Contributor

cgardens commented Jun 8, 2021

@marcosmarxm @jamakase I'm going to try to write down my understanding of what's going on here based on reading these comments.

  1. If a user configures snowflake to use S3 upload and saves that configuration, it works successfully.
  2. If they then reload the page, the UI will make it look like that connector is configured for Standard (not S3 upload). The UI in this case is just not reflecting the truth. The connector is configured as expected, but the UI makes it look like it is not.

Fixes

Option 1

Add a method field to the Snowflake spec like we do in other places (example, example, example) so that the UI can tell these apart.

Option 2

Set additionalProperties: false for the standard section

Option 3

On the frontend, the yup check should not match if there are any unknown fields. @jamakase is this easy to do?

Is that all true? Or am I missing something? Assuming this is correct, @marcosmarxm can you do option 1 and @jamakase can you hit option 3? Seems like both things make sense to knock out.
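For illustration only (the values are made up), a saved configuration carrying the method discriminator from option 1 lets the UI select the matching oneOf path directly instead of guessing by validation:

// Hypothetical connectionConfiguration returned by the API once every oneOf
// option in the spec carries a "method" discriminator (option 1).
const connectionConfiguration = {
  loading_method: {
    method: "S3", // matches the enum of exactly one oneOf option
    s3_bucket_name: "airbyte.staging",
    s3_bucket_region: "us-east-1",
    access_key_id: "**********",
    secret_access_key: "**********",
  },
};

// The frontend can now pick the oneOf path whose "method" enum contains this
// value rather than validating the object against every sub-schema.
const selectedPath = connectionConfiguration.loading_method.method; // "S3"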

As a follow-up, this seems related to #3385. I will bump the priority of resolving that issue, which should get at the root cause. In the meantime we should figure out how to fix this instance (and identify any others that might have the same problem).
