Bigquery model-level unique tests on concat() columns #35

Dimi727 · 2021-10-04T20:15:24Z

Hi everyone.

I did not find this issue anywhere yet which seems odd to me and I dont know if it is on my side.

Using Bigquery and dbt 0.20.2.

Having defined a test for a model like this:

  - name: base_test_model
    tests:
      - unique:
          column_name: "concat(id, field, changed_at)"

Which was working before dbt 0.20.0. Now I get an error:

Database Error in test unique_base_test_id_field_changed_at (models\base\schema.yml)
SELECT list expression references column id which is neither grouped nor aggregated at [12:5] compiled SQL at

I checked in projects where I use dbt 0.19.x and the test above compiled to this:

select count(*) as validation_errors
from (

    select
        concat(id, field, changed_at)

    from base_test
    where concat(id, field, changed_at) is not null
    group by concat(id, field, changed_at)
    having count(*) > 1

) validation_errors

Now it compiles to this:

select
      count(*) as failures
    from (
      
    
select
    concat(id, field, changed_at) as unique_field,
    count(*) as n_records

from base_test
where concat(id, field, changed_at) is not null
group by concat(id, field, changed_at)
having count(*) > 1

    ) dbt_internal_test

which Bigquery seems not to like. Using the query above and removing the alias unique_field works for Bigquery.

With Postgres and MySQL both of the above queries run. Did not test Snowflake and Redshift.

The text was updated successfully, but these errors were encountered:

Dimi727 · 2021-10-05T10:27:53Z

Seems to be a Bigquery thing:
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#having_clause

If a query contains aliases in the SELECT clause, those aliases override names in a FROM clause.

Therefore the compiled test query for Bigquery adapters should look like this to work:

select
      count(*) as failures
    from (
      
    
select
    concat(id, field, changed_at) as unique_field,
    count(*) as n_records

from base_test
where concat(id, field, changed_at) is not null
group by unique_field
having count(*) > 1

    ) dbt_internal_test

DigUpTheHatchet · 2021-10-05T23:54:04Z

I've also experienced a number of tests start failing with the same message:
SELECT list expression references {column} which is neither grouped nor aggregated

After upgrading to version 20.0.2, these errors started appearing. There was no issue when we were running on 20.0.1. We're also running on BigQuery.

It seems to be affecting generic ~~and schema~~ tests where a concat() function is used.

I'm going to quickly check the compiled tests code for both versions and see what has changed.

jtcohen6 · 2021-10-07T12:32:35Z

@Dimi727 @DigUpTheHatchet Thanks for opening and commenting. This one is on me: dbt-labs/dbt-core#3812. I added an alias to the column produced by the unique test query for the sake of users who wanted to use it in conjunction with the --store-failures feature, on databases that require tables be created with named/aliased columns.

I have a few comments:

We really encourage you to test uniqueness on an actual column, rather than passing a column expression into the unique test. (The ability to do this is indeed documented; I wouldn't say it's highly encouraged.) Is there any reason why you wouldn't want concat(id, field, changed_at) as a field in your model, to serve as a unique identifier for the row?
There is a dedicated generic test for testing uniqueness on a combination of columns: dbt_utils.unique_combination_of_columns. If you know you want to test a combination of columns, without storing a surrogate key column in your model, this is the way I'd recommend.
Having said both of the above, you can get this working in the meantime by overriding the unique test in your root project, following the suggestions I'll leave below.

Per @artamoshin's detailed research in dbt-labs/dbt-core#3905 (comment), the vast majority of databases want us to group by {{ column_name }} rather than group by <alias>. It just so happens that BigQuery is happy with group by {{ column_name }} ... until there's an alias. Ugh!

One way to fix this for sure for sure would be to reimplement the unique test query in the dbt-bigquery adapter plugin so that it includes a CTE:

{% macro bigquery__test_unique(model, column_name) %}

with dbt_test__target as (
  
  select {{ column_name }} as unique_field
  from {{ model }}
  where {{ column_name }} is not null
  
)

select
    unique_field,
    count(*) as n_records

from dbt_test__target
group by unique_field
having count(*) > 1

{% endmacro %}

That CTE-based approach would also have the nice side-benefit of fixing another BigQuery-specific issue (https://github.com/dbt-labs/dbt/issues/3489), when the column name and table name match.

Personally, I'd love to make the CTE-based query the default, but I know that some databases don't support CTEs nested inside subqueries—and dbt now wraps generic test queries in subqueries when executing them. We can't win them all, unfortunately. Given that other databases by and large support the existing syntax, I don't think it's unreasonable to make this a BigQuery-specific change.

Would either of you be interested in contributing:

A fix for the above (by adding bigquery__test_unique to the dbt-bigquery plugin)
An integration test case that passes an expression into each built-in generic test, at least for BigQuery for starters. We should run this on all of our supported adapters. If it's not tested functionality, it really shouldn't be documented! There's nothing stopping us from accidentally breaking it (again) in the future.

DigUpTheHatchet · 2021-10-07T22:22:47Z

@jtcohen6 I've been wanting to make a first contribution so I'll give this a go over the weekend.

I can't promise how successful I will be but I will try!

Dimi727 · 2021-10-11T15:15:08Z

@jtcohen6

there any reason why you wouldn't want concat(id, field, changed_at) as a field in your model, to serve as a unique identifier for the row?

Well :-) I think its kind of redundant if there is no use for this column except an dbt-internal unique test or incremental-model wihch needs an unique field (which is not always used).

jtcohen6 · 2021-10-19T08:40:03Z

Resolved by #10

jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 12, 2021

jtcohen6 added type:bug Something isn't working good_first_issue Good for newcomers labels Oct 12, 2021

jtcohen6 closed this as completed Oct 19, 2021

jtcohen6 mentioned this issue Nov 3, 2021

dbt-0.21.0 test issue: Alias getting added in test column which is causing test failure #52

Closed

jtcohen6 mentioned this issue Nov 12, 2021

[Bug] Unique test errors when testing an expression on BigQuery dbt-labs/dbt-core#4264

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bigquery model-level unique tests on concat() columns #35

Bigquery model-level unique tests on concat() columns #35

Dimi727 commented Oct 4, 2021

Dimi727 commented Oct 5, 2021 •

edited

Loading

DigUpTheHatchet commented Oct 5, 2021 •

edited

Loading

jtcohen6 commented Oct 7, 2021 •

edited

Loading

DigUpTheHatchet commented Oct 7, 2021

Dimi727 commented Oct 11, 2021

jtcohen6 commented Oct 19, 2021

Bigquery model-level unique tests on concat() columns #35

Bigquery model-level unique tests on concat() columns #35

Comments

Dimi727 commented Oct 4, 2021

Dimi727 commented Oct 5, 2021 • edited Loading

DigUpTheHatchet commented Oct 5, 2021 • edited Loading

jtcohen6 commented Oct 7, 2021 • edited Loading

DigUpTheHatchet commented Oct 7, 2021

Dimi727 commented Oct 11, 2021

jtcohen6 commented Oct 19, 2021

Dimi727 commented Oct 5, 2021 •

edited

Loading

DigUpTheHatchet commented Oct 5, 2021 •

edited

Loading

jtcohen6 commented Oct 7, 2021 •

edited

Loading