Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream / Field Name Constraints Break Many Sources #1048

Closed
cgardens opened this issue Nov 22, 2020 · 3 comments · Fixed by #1095
Closed

Stream / Field Name Constraints Break Many Sources #1048

cgardens opened this issue Nov 22, 2020 · 3 comments · Fixed by #1095
Assignees
Labels
type/bug Something isn't working
Milestone

Comments

@cgardens
Copy link
Contributor

Current Behavior

Solutions

  1. Sources not allowed to emit special characters - This is the course we're on now roughly. We force sources to create a mapping of names with special characters removed to names with special characters. Sources now need to translate back and forth between their names and the names allowed by airbyte.
    1. my opinion - this is a bad approach. it will make many sources very hard and potentially impossible to write. examples.
      1. singer - these may be impossible to reconcile. if we use an singer tap that outputs a special character then the wrapper needs to keep track of how to translate from the name with the special characters removed back to the name with special characters. in cases where the schema is relatively fixed, this is merely annoying. in cases where fields are added dynamically (i.e. salesforce), then the tap needs to refetch the singer catalog on each sync and then try to reconcile it against the ConfiguredAirbyteCatalog. This may be possible, though you can certainly imagine colliding cases where the wrong fields / streams are synced.
      2. databases - pretty much the same issue as singer. if a table name has a space in it, it's best guess when trying to select streams and fields.
  2. Sync worker transform stream and field names - This is the original approach we discussed when we were trying to figure this out. There were reasonable concerns about how this would change the contract that the sync worker has.
    1. my opinion - if we only have options 1 and 2, then option 2 is better and we should change the contract of the sync worker.
      1. personally, I am not convinced by the narrative that sources and destinations should be able to communicate directly with each other without mediation from airbyte. This is already not true. e.g. sync worker filters out logs and other noise--we do not force destinations to handle that. If this is something we care about then we need to do a better job codifying it (and documenting that destinations need to be able to filter logs).
  3. Destinations accept any names and transform them as needed themselves - I don't think we considered this one too deeply. It is pretty straight forward. The downside is it makes writing destinations harder, because they need to handle these transformations. That being said, you could make a strong argument that they need to do this anyway. There is no guarantee that the names that we, as airbyte, say are valid are going to be valid in every destination we support. e.g. it is idiomatic in snowflake to uppercase table names and it prevents needing to do quoted queries. even if we clean names in the sync worker, the snowflake destination likely will want to transform these into upper case anyway (yes, it is a more in-depth transformation if they need to handle any string then one that is airbyte compliant, but in practice i don't think this is that much harder). Given that destinations may need to do some transformation on names anyway, it makes sense to push this responsibility to them for now).

Based on the paths I can see, 2 and 3 are the only options that seem viable. I think we should go with option 3 because:

  1. destinations are likely going to need to do some of their own name cleaning anyway, having them do all of it, isn't that much more work.
  2. it lets us punt on changing the sync worker contract. if we do decide to change that contract in the future, we haven't closed any doors by picking option 3 now--literally no change will be needed in the destinations if we change our mind.
  3. while slightly more work than 2, it's pretty straight forward.

That being said, I think this needs to be a high priority to fix this week, because we've inadvertently just rendered a ton of sources unusable.

@cgardens cgardens added the type/bug Something isn't working label Nov 22, 2020
@cgardens
Copy link
Contributor Author

Opting to skip a release this week because I don't think the version that we would push is likely to work well for most integrations.

@michel-tricot
Copy link
Contributor

Agreed about option3. Only the destination knows what its limitations are.

@cgardens
Copy link
Contributor Author

cgardens commented Nov 23, 2020

Let's assume we are going with option 3, I think we should do the following. Someone should go through each integration, determine what characters it can reasonably support. Potentially some of the name normalization code that we can write will be reusable across destinations 🤞 .

Once that is done, we can strip out the validation and filter in the discover and sync workers respectively.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants