Stream / Field Name Constraints Break Many Sources #1048

cgardens · 2020-11-22T03:30:04Z

Current Behavior

We discussed how to handle special characters in stream and field names. We agreed to go with the approach implemented here: stream/field name validation/filtering #1004 to solve these issues: Sync worker should filter out streams and fields with illegal names #972 and Discovery should fail if the source contains illegal stream or field names #954
This approach has broken several sources including:
- shopify - which includes +s some field names
- hubspot - which uses -
- facebook - which uses - and .
- googlesheets - which is a free for all.
- databases - which use spaces and potentially other special characters
- source file which includes .
- I am sure there are others that we just haven't discovered yet in our testing.
As a hack to get around this we've started slipping special characters back into our allowed characters.
I think we made a bad choice here and broken alot of functionality. We should course correct now.

Solutions

Sources not allowed to emit special characters - This is the course we're on now roughly. We force sources to create a mapping of names with special characters removed to names with special characters. Sources now need to translate back and forth between their names and the names allowed by airbyte.
1. my opinion - this is a bad approach. it will make many sources very hard and potentially impossible to write. examples.
  1. singer - these may be impossible to reconcile. if we use an singer tap that outputs a special character then the wrapper needs to keep track of how to translate from the name with the special characters removed back to the name with special characters. in cases where the schema is relatively fixed, this is merely annoying. in cases where fields are added dynamically (i.e. salesforce), then the tap needs to refetch the singer catalog on each sync and then try to reconcile it against the ConfiguredAirbyteCatalog. This may be possible, though you can certainly imagine colliding cases where the wrong fields / streams are synced.
  2. databases - pretty much the same issue as singer. if a table name has a space in it, it's best guess when trying to select streams and fields.
Sync worker transform stream and field names - This is the original approach we discussed when we were trying to figure this out. There were reasonable concerns about how this would change the contract that the sync worker has.
1. my opinion - if we only have options 1 and 2, then option 2 is better and we should change the contract of the sync worker.
  1. personally, I am not convinced by the narrative that sources and destinations should be able to communicate directly with each other without mediation from airbyte. This is already not true. e.g. sync worker filters out logs and other noise--we do not force destinations to handle that. If this is something we care about then we need to do a better job codifying it (and documenting that destinations need to be able to filter logs).
Destinations accept any names and transform them as needed themselves - I don't think we considered this one too deeply. It is pretty straight forward. The downside is it makes writing destinations harder, because they need to handle these transformations. That being said, you could make a strong argument that they need to do this anyway. There is no guarantee that the names that we, as airbyte, say are valid are going to be valid in every destination we support. e.g. it is idiomatic in snowflake to uppercase table names and it prevents needing to do quoted queries. even if we clean names in the sync worker, the snowflake destination likely will want to transform these into upper case anyway (yes, it is a more in-depth transformation if they need to handle any string then one that is airbyte compliant, but in practice i don't think this is that much harder). Given that destinations may need to do some transformation on names anyway, it makes sense to push this responsibility to them for now).

Based on the paths I can see, 2 and 3 are the only options that seem viable. I think we should go with option 3 because:

destinations are likely going to need to do some of their own name cleaning anyway, having them do all of it, isn't that much more work.
it lets us punt on changing the sync worker contract. if we do decide to change that contract in the future, we haven't closed any doors by picking option 3 now--literally no change will be needed in the destinations if we change our mind.
while slightly more work than 2, it's pretty straight forward.

That being said, I think this needs to be a high priority to fix this week, because we've inadvertently just rendered a ton of sources unusable.

The text was updated successfully, but these errors were encountered:

cgardens · 2020-11-22T03:37:45Z

Opting to skip a release this week because I don't think the version that we would push is likely to work well for most integrations.

michel-tricot · 2020-11-22T18:59:24Z

Agreed about option3. Only the destination knows what its limitations are.

cgardens · 2020-11-23T04:03:01Z

Let's assume we are going with option 3, I think we should do the following. Someone should go through each integration, determine what characters it can reasonably support. Potentially some of the name normalization code that we can write will be reusable across destinations 🤞 .

Once that is done, we can strip out the validation and filter in the discover and sync workers respectively.

cgardens added the type/bug Something isn't working label Nov 22, 2020

cgardens mentioned this issue Nov 23, 2020

Debug issues syncing MSSQL to Snowflake #1056

Closed

cgardens added this to the v0.7.0 milestone Nov 23, 2020

This was referenced Nov 23, 2020

Destination-postgres should write data in a table from the specified schema #1059

Merged

Naming conventions managed in destinations #1060

Merged

cgardens assigned ChristopheDuong Nov 23, 2020

cgardens mentioned this issue Nov 25, 2020

revert stream / field filtering in sources #1095

Merged

cgardens closed this as completed in #1095 Nov 26, 2020

snyk-bot mentioned this issue Mar 11, 2022

[Snyk] Security upgrade yup from 0.28.5 to 0.30.0 yeedster/airbyte#12

Open

snyk-bot mentioned this issue Sep 26, 2022

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#83

Open

yeedster mentioned this issue Sep 30, 2022

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#84

Open

snyk-bot mentioned this issue Oct 4, 2022

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#85

Open

This was referenced Oct 5, 2022

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#86

Open

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#87

Open

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#88

Open

[Snyk] Fix for 5 vulnerabilities yeedster/airbyte#89

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream / Field Name Constraints Break Many Sources #1048

Stream / Field Name Constraints Break Many Sources #1048

cgardens commented Nov 22, 2020

cgardens commented Nov 22, 2020

michel-tricot commented Nov 22, 2020

cgardens commented Nov 23, 2020 •

edited

Loading

Stream / Field Name Constraints Break Many Sources #1048

Stream / Field Name Constraints Break Many Sources #1048

Comments

cgardens commented Nov 22, 2020

Current Behavior

Solutions

cgardens commented Nov 22, 2020

michel-tricot commented Nov 22, 2020

cgardens commented Nov 23, 2020 • edited Loading

cgardens commented Nov 23, 2020 •

edited

Loading