BigQuery external table requires replacement even though no changes have been made #10919

Closed · jamiet-msm opened this issue Jan 17, 2022 · 14 comments

jamiet-msm commented Jan 17, 2022

Terraform Version

Terraform v0.14.9

Affected Resource(s)

  • google_bigquery_table

Terraform Configuration Files

variable "project" {}
variable "region" {}
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.5.0"
    }
  }
}

provider "google" {
  region = var.region
}

resource "google_storage_bucket" "bucket" {
  name     = "${var.project}-bucket"
  location = "eu"
  project  = var.project
}

resource "google_storage_bucket_object" "fake_message" {
  /*
    The hive partitioned table defined in google_bigquery_table.hive_table
    requires that the partition locations exist when the table is created.
    To satisfy that we create this fake message; without it the deployment
    of the table fails with:

     > Cannot query hive partitioned data for table messages without any associated files
  */
  name    = "publish/dt=2000-01-01/hr=00/min=00/fake_message.json"
  content = "{\"column1\": \"XXX\"}"
  bucket  = google_storage_bucket.bucket.name
}

resource "google_bigquery_dataset" "hive_store" {
  dataset_id = "hive_store"
  location   = "EU"
  project    = var.project
}

resource "google_bigquery_table" "hive_table" {
  /*
    to query:
    bq query --nouse_legacy_sql "select * from hive_store.messages"
    */
  dataset_id = google_bigquery_dataset.hive_store.dataset_id
  project    = var.project
  table_id   = "messages"
  external_data_configuration {
    autodetect    = false
    source_uris   = ["gs://${google_storage_bucket.bucket.name}/publish/*"]
    source_format = "NEWLINE_DELIMITED_JSON"
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.bucket.name}/publish/{dt:STRING}/{hr:STRING}/{min:STRING}"
      require_partition_filter = false
    }
  }
  schema = <<EOF
  [
    {
        "name": "column1",
        "type": "STRING",
        "mode": "NULLABLE"
    }
]
  EOF
  depends_on = [
    google_storage_bucket_object.fake_message
  ]
}
```

Debug Output

https://github.com/jamiet-msm/hive_partitioned_table_issue/blob/master/debug_output

Expected Behavior

I should be able to apply this configuration multiple times without Terraform attempting to destroy anything or make changes.
In other words, if I apply and then make zero changes to the code, Terraform should make zero changes to the deployed infrastructure.

Actual Behavior

On subsequent applies Terraform attempts to remove the columns that BigQuery has added as hive partitioning columns, which forces replacement of the table:

```
2022/01/17 15:34:25 [DEBUG] command: asking for input: "Do you want to perform these actions?"
  # google_bigquery_table.hive_table must be replaced
-/+ resource "google_bigquery_table" "hive_table" {
      ~ creation_time       = 1642433648197 -> (known after apply)
      ~ etag                = "6ysTmjOM2o+a+c1Kx1uEsg==" -> (known after apply)
      ~ expiration_time     = 0 -> (known after apply)
      ~ id                  = "projects/msm-groupdata-datalake-dev/datasets/hive_store/tables/messages" -> (known after apply)
      - labels              = {} -> null
      ~ last_modified_time  = 1642433648334 -> (known after apply)
      ~ location            = "EU" -> (known after apply)
      ~ num_bytes           = 0 -> (known after apply)
      ~ num_long_term_bytes = 0 -> (known after apply)
      ~ num_rows            = 0 -> (known after apply)
      ~ schema              = jsonencode(
          ~ [ # forces replacement
                {
                    mode = "NULLABLE"
                    name = "column1"
                    type = "STRING"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "dt"
                  - type = "STRING"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "hr"
                  - type = "STRING"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "min"
                  - type = "STRING"
                },
            ]
        )
      ~ self_link           = "https://bigquery.googleapis.com/bigquery/v2/projects/msm-groupdata-datalake-dev/datasets/hive_store/tables/messages" -> (known after apply)
      ~ type                = "EXTERNAL" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ external_data_configuration {
          - ignore_unknown_values = false -> null
          - max_bad_records       = 0 -> null
          + schema                = (known after apply)
            # (4 unchanged attributes hidden)

            # (1 unchanged block hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```

Steps to Reproduce

  1. Run terraform apply -var project=projectname -var region=europe-west2; everything gets deployed.
  2. Run bq query --nouse_legacy_sql "select * from projectname.hive_store.messages" to confirm the table is queryable.
  3. Run terraform apply -var project=projectname -var region=europe-west2 again; Terraform determines that some columns need to be removed from the table. The debug output from this operation is linked above.

Important Factoids

I am attempting to create a BigQuery external table that follows a hive partitioning layout, see Querying externally partitioned data for more information.

As far as I know I have defined the table in my Terraform configuration correctly (I admit it's possible that I have not; however, I have tried different combinations and can't get the desired behaviour).

The problem here is that the partitioning columns, which are not originally defined in the table schema, get added to the table schema by BigQuery. When I apply again, Terraform observes that those columns are not part of the schema as defined in the configuration and attempts to remove them.

This is a problem because we have an automated deployment pipeline that automatically deploys changes to our environments, but it only does so if Terraform doesn't do anything destructive. If Terraform reports that it WILL do anything destructive then we intentionally halt the deployment, because we want a human being to say whether the destruction is valid or not. With the behaviour I've described herein, every single apply causes a destruction, and hence we've lost our ability to deploy automatically. This is a major problem for us: automation is the bedrock of our team's successes, and if we have to deploy everything manually then we lose the ability to deploy many times a day, which is what we have been doing up to now.

I acknowledge that it is BigQuery's API that is reporting that these columns are now part of the schema; however, we need the Terraform provider to recognise that the added columns are the partitioning columns and therefore not attempt to make any changes.

I did try to circumvent the problem by defining those columns as part of the schema, thus my resource changed to:

resource "google_bigquery_table" "hive_table" {
  /*
    to query:
    bq query --nouse_legacy_sql "select * from hive_store.messages"
    */
  dataset_id          = google_bigquery_dataset.hive_store.dataset_id
  project             = var.project
  table_id            = "messages"
  deletion_protection = false
  external_data_configuration {
    autodetect    = false
    source_uris   = ["gs://${google_storage_bucket.bucket.name}/publish/*"]
    source_format = "NEWLINE_DELIMITED_JSON"
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.bucket.name}/publish/{dt:STRING}/{hr:STRING}/{min:STRING}"
      require_partition_filter = false
    }
  }
  schema = <<EOF
  [
    {
        "name": "column1",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "mode": "NULLABLE",
        "name": "dt",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "hr",
        "type": "STRING"
    },
    {
        "mode": "NULLABLE",
        "name": "min",
        "type": "STRING"
    }
]
  EOF
  depends_on = [
    google_storage_bucket_object.fake_message
  ]
}
```

However, attempting to apply the configuration with those changes caused this error:

```
Error: googleapi: Error 400: Error while reading table: messages, error message: Failed to add partition key dt (type: TYPE_STRING) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [dt:TYPE_STRING, hr:TYPE_STRING, min:TYPE_STRING]., invalid
```

which makes sense, of course: BigQuery derives the partition key columns from hive_partitioning_options, so listing them again in the schema duplicates them.

Quite simply, I need to be able to apply this table without causing changes on subsequent applies.

jamiet-msm added the bug label Jan 17, 2022
megan07 (Contributor) commented Jan 21, 2022

Hi @jamiet-msm ! I'm sorry you're running into this issue. Would you mind letting me know what version of the provider you are using? Running terraform version should give you that information. Thanks!

jamiet-msm (Author) commented:

> Hi @jamiet-msm ! I'm sorry you're running into this issue. Would you mind letting me know what version of the provider you are using? Running terraform version should give you that information. Thanks!

Hi @megan07

Sorry, I should have included that, shouldn't I? We're on 4.5.0:

```hcl
terraform {
  backend "gcs" {}
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.5.0"
    }
    google-beta = {
      source  = "hashicorp/google"
      version = "4.5.0"
    }
  }
}
```

megan07 (Contributor) commented Jan 21, 2022

Oh! No worries! I see now that you did; I wasn't looking at your configuration - sorry about that! Thanks!

megan07 (Contributor) commented Jan 21, 2022

@jamiet-msm Very odd - I was able to repro this issue; however, I made the exact same changes (added the extra columns in the config) and didn't have any problem. Can you tell me the order of operations when you tried that? When you did a terraform plan before adding the 3 columns into the config, did it show a diff before you applied it?

jamiet-msm (Author) commented:

@megan07 did you issue terraform destroy before adding the 3 columns to the config? If you add those 3 columns after the table has already been deployed then I think it works; however, when deploying to a vanilla project (i.e. without the table already existing), that's when you get the error.

(I think that's correct. Give it a try.)

jamiet-msm (Author) commented:

I've fixed a few typos in the original comment on this issue.

megan07 self-assigned this Jan 21, 2022
flovouin commented:

@jamiet-msm Hope I'm not adding too much noise to the original issue here, but I was able to create the table so that Terraform only detects changes when the actual schema changes, using the following configuration:

resource "google_bigquery_table" "events" {
  # Your regular configuration here.

  # Terraform will detect changes to this property made by BigQuery, but we'll ignore them using the `lifecycle` block.
  schema = var.schema

  external_data_configuration {
    autodetect    = false
    source_format = "NEWLINE_DELIMITED_JSON"
    source_uris   = ["..."]

    # Use the exact same schema here. This one won't be changed by BigQuery, however Terraform will still detect the changes you make on purpose to this field.
    schema = var.schema

    hive_partitioning_options {
      mode              = "CUSTOM"
      source_uri_prefix = "..."
    }
  }

  # If the schema does change on purpose, you will need to be able to delete the table.
  # In this case, as the table is only a view on Google Storage data, it should be safe to delete it.
  deletion_protection = false

  lifecycle {
    ignore_changes = [
      # BigQuery will return the effective schema, which contains differences (e.g. the partition column(s) is added to
      # it). Recreation of the table should only be based on `external_data_configuration.schema`, which is only stored
      # in the Terraform state, not BigQuery. This field contains exactly the input schema and can be used for diffs.
      schema,
    ]
  }
}
```

However, I'm curious about your google_storage_bucket_object.fake_message resource. I didn't run into this problem at first, but upon recreation of the table (due to a schema update) I hit the same issue ("Cannot query hive partitioned data for table messages without any associated files"). I haven't investigated much, but I wonder if there is something that can be done on the Terraform/provider side to prevent this? 🤔 For example, couldn't the schema be updated without having to recreate the table? It looks like the GCP console provides this option.

I know that I'm a bit off-topic here, but as the solution could also imply improving schema management in Terraform, I thought I'd chime in.

ScottSuarez (Collaborator) commented:

To unblock your environment while a fix is worked on, you can use ignore_changes on this field if you are not expecting any abnormal divergence.

https://www.terraform.io/language/meta-arguments/lifecycle#ignore_changes
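
flovouin's configuration above shows this in full; as a minimal sketch, assuming the hive_table resource from the original issue (only the lifecycle block is new, everything else is unchanged):

```hcl
resource "google_bigquery_table" "hive_table" {
  # ... the rest of the resource exactly as in the original configuration ...

  lifecycle {
    # BigQuery rewrites the effective schema to include the hive partition
    # columns (dt, hr, min); ignore those server-side changes so they do
    # not force replacement of the table.
    ignore_changes = [
      schema,
    ]
  }
}
```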

jamiet-msm (Author) commented:

Hi @ScottSuarez @flovouin ,
Thanks both, your suggestion (use lifecycle) worked like a charm!

megan07 (Contributor) commented Jan 28, 2022

Hi @jamiet-msm , could you do one quick thing for me and completely move the schema definition inside the external_data_configuration block and let me know if that helps you? So the top-level schema would be left undefined in the configuration. Does that make sense?
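
For reference, a minimal sketch of that suggestion, assuming the resource from the original issue - the top-level schema argument is dropped and the same schema JSON is supplied inside external_data_configuration instead:

```hcl
resource "google_bigquery_table" "hive_table" {
  dataset_id          = google_bigquery_dataset.hive_store.dataset_id
  project             = var.project
  table_id            = "messages"
  deletion_protection = false
  # No top-level schema argument here.
  external_data_configuration {
    autodetect    = false
    source_uris   = ["gs://${google_storage_bucket.bucket.name}/publish/*"]
    source_format = "NEWLINE_DELIMITED_JSON"
    # The schema moves inside this block; the partition columns (dt, hr, min)
    # are still derived from hive_partitioning_options and must not be listed.
    schema = <<EOF
[
  {
    "name": "column1",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
EOF
    hive_partitioning_options {
      mode                     = "CUSTOM"
      source_uri_prefix        = "gs://${google_storage_bucket.bucket.name}/publish/{dt:STRING}/{hr:STRING}/{min:STRING}"
      require_partition_filter = false
    }
  }
  depends_on = [
    google_storage_bucket_object.fake_message
  ]
}
```

With the schema defined only inside external_data_configuration, the top-level schema attribute becomes computed ((known after apply) in the plan output below), so the partition columns BigQuery appends no longer produce a diff.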

EladDolev commented:

Hi @megan07
Your proposed solution indeed solves the issue 💪
Maybe worth adding this to the docs?

jamiet-msm (Author) commented:

> Hi @jamiet-msm , could you do one quick thing for me and completely move the schema definition inside the external_data_configuration block and let me know if that helps you? So the top-level schema would be left undefined in the configuration. Does that make sense?

@megan07 apologies, I missed this somehow, so I didn't attempt it. Agree with @EladDolev that this should be in the docs.

jamiet-msm added a commit to jamiet-msm/hive_partitioned_table_issue that referenced this issue Mar 4, 2022
Why:

* I was recommended to do this by @megan07 at
hashicorp/terraform-provider-google#10919 (comment)
and it works!

This change addresses the need by:

* Moving the schema definition into external_data_configuration. That did
the trick: after doing so I was able to apply, and a subsequent apply did
not cause any changes.

```
terraform apply -var project=project-name -var region=europe-west2

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # google_bigquery_dataset.hive_store will be created
  + resource "google_bigquery_dataset" "hive_store" {
      + creation_time              = (known after apply)
      + dataset_id                 = "hive_store"
      + delete_contents_on_destroy = false
      + etag                       = (known after apply)
      + id                         = (known after apply)
      + last_modified_time         = (known after apply)
      + location                   = "EU"
      + project                    = "project-name"
      + self_link                  = (known after apply)

      + access {
          + domain         = (known after apply)
          + group_by_email = (known after apply)
          + role           = (known after apply)
          + special_group  = (known after apply)
          + user_by_email  = (known after apply)

          + view {
              + dataset_id = (known after apply)
              + project_id = (known after apply)
              + table_id   = (known after apply)
            }
        }
    }

  # google_bigquery_table.hive_table will be created
  + resource "google_bigquery_table" "hive_table" {
      + creation_time       = (known after apply)
      + dataset_id          = "hive_store"
      + deletion_protection = false
      + etag                = (known after apply)
      + expiration_time     = (known after apply)
      + id                  = (known after apply)
      + last_modified_time  = (known after apply)
      + location            = (known after apply)
      + num_bytes           = (known after apply)
      + num_long_term_bytes = (known after apply)
      + num_rows            = (known after apply)
      + project             = "project-name"
      + schema              = (known after apply)
      + self_link           = (known after apply)
      + table_id            = "messages"
      + type                = (known after apply)

      + external_data_configuration {
          + autodetect    = false
          + compression   = "NONE"
          + schema        = jsonencode(
                [
                  + {
                      + mode = "NULLABLE"
                      + name = "column1"
                      + type = "STRING"
                    },
                ]
            )
          + source_format = "NEWLINE_DELIMITED_JSON"
          + source_uris   = [
              + "gs://project-name-bucket/publish/*",
            ]

          + hive_partitioning_options {
              + mode                     = "CUSTOM"
              + require_partition_filter = false
              + source_uri_prefix        = "gs://project-name-bucket/publish/{dt:STRING}/{hr:STRING}/{min:STRING}"
            }
        }
    }

  # google_storage_bucket.bucket will be created
  + resource "google_storage_bucket" "bucket" {
      + force_destroy               = false
      + id                          = (known after apply)
      + location                    = "EU"
      + name                        = "project-name-bucket"
      + project                     = "project-name"
      + self_link                   = (known after apply)
      + storage_class               = "STANDARD"
      + uniform_bucket_level_access = (known after apply)
      + url                         = (known after apply)
    }

  # google_storage_bucket_object.fake_message will be created
  + resource "google_storage_bucket_object" "fake_message" {
      + bucket         = "project-name-bucket"
      + content        = (sensitive value)
      + content_type   = (known after apply)
      + crc32c         = (known after apply)
      + detect_md5hash = "different hash"
      + id             = (known after apply)
      + kms_key_name   = (known after apply)
      + md5hash        = (known after apply)
      + media_link     = (known after apply)
      + name           = "publish/dt=2000-01-01/hr=00/min=00/fake_message.json"
      + output_name    = (known after apply)
      + self_link      = (known after apply)
      + storage_class  = (known after apply)
    }

Plan: 4 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

google_bigquery_dataset.hive_store: Creating...
google_storage_bucket.bucket: Creating...
google_bigquery_dataset.hive_store: Creation complete after 2s [id=projects/project-name/datasets/hive_store]
google_storage_bucket.bucket: Creation complete after 2s [id=project-name-bucket]
google_storage_bucket_object.fake_message: Creating...
google_storage_bucket_object.fake_message: Creation complete after 0s [id=project-name-bucket-publish/dt=2000-01-01/hr=00/min=00/fake_message.json]
google_bigquery_table.hive_table: Creating...
google_bigquery_table.hive_table: Creation complete after 1s [id=projects/project-name/datasets/hive_store/tables/messages]

Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
➜  hive_partitioned_table_issue git:(master) ✗ terraform apply -var project=project-name -var region=europe-west2
google_storage_bucket.bucket: Refreshing state... [id=project-name-bucket]
google_bigquery_dataset.hive_store: Refreshing state... [id=projects/project-name/datasets/hive_store]
google_storage_bucket_object.fake_message: Refreshing state... [id=project-name-bucket-publish/dt=2000-01-01/hr=00/min=00/fake_message.json]
google_bigquery_table.hive_table: Refreshing state... [id=projects/project-name/datasets/hive_store/tables/messages]

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
```

jamiet-msm (Author) commented:

@megan07 happy to report that your fix worked.

Here is the commit with the fix: jamiet-msm/hive_partitioned_table_issue@f7f8e76

and here is the proof, note 0 changes on the subsequent apply:

```
➜  hive_partitioned_table_issue git:(master) ✗ terraform apply -var project=project-name -var region=europe-west2
google_storage_bucket.bucket: Refreshing state... [id=project-name-bucket]
google_bigquery_dataset.hive_store: Refreshing state... [id=projects/project-name/datasets/hive_store]
google_storage_bucket_object.fake_message: Refreshing state... [id=project-name-bucket-publish/dt=2000-01-01/hr=00/min=00/fake_message.json]
google_bigquery_table.hive_table: Refreshing state... [id=projects/project-name/datasets/hive_store/tables/messages]

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
```

github-actions bot commented Apr 4, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Apr 4, 2022