Destination Azure Blob Storage: Support writing timestamps #6063
Hello, @dsdorazio and @sherifnada. I would like to share my vision of this task with you. I have opened a pull request in which S3 and GCS can be used as staging. The data of each sync is saved in a separate file on the staging area, while in Azure Blob Storage there is always a single blob, named after the Airbyte stream, that holds the current data. Is this solution right for you? If the solution I described is not suitable, please describe the solution you propose in more detail.
Let's move our discussion from #9336 to here.
I don't think this is the right solution. The purpose of the Azure Blob Storage destination is to store objects directly on Azure. Adding S3 or GCS as a staging area unnecessarily copies the data first to S3 or GCS, and then to Azure. What's the current filename outputted by the Azure destination? It seems to me that if the Azure output filename follows a pattern similar to S3 or GCS, a timestamp will be included in the filename. Here is what the S3 destination filename looks like: https://docs.airbyte.com/integrations/destinations/s3#configuration
Same for GCS.
@tuliren, the current filename outputted by the Azure destination is the stream name. Also, all blobs are stored in a single container, which the user specifies when creating the destination.
Are you suggesting changing this to -
Or is the next option better?
Got it. I think keeping the same pattern as S3 should be good.
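For illustration only, here is a minimal Java sketch of what an S3-style timestamped blob name could look like. The layout (upload date, epoch millis, part id, extension) and all names below are assumptions, not the exact pattern agreed on in this thread:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BlobNameExample {

  // Assumed S3-style layout: {stream}/{upload_date}_{epoch_millis}_{part_id}.{extension}
  // Field names and order are illustrative, not the exact Airbyte pattern.
  static String timestampedBlobName(String streamName, Instant uploadTime, int partId, String extension) {
    String uploadDate = DateTimeFormatter.ofPattern("yyyy_MM_dd")
        .withZone(ZoneOffset.UTC)
        .format(uploadTime);
    return String.format("%s/%s_%d_%d.%s",
        streamName, uploadDate, uploadTime.toEpochMilli(), partId, extension);
  }

  public static void main(String[] args) {
    // Prints something like "users/2021_09_15_1631700000000_0.csv"
    System.out.println(timestampedBlobName("users", Instant.now(), 0, "csv"));
  }
}
```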
@andriikorotkov, was a previous comment deleted? Although folders are not supported in Azure, the object path can have `/` characters in the blob name to emulate a folder hierarchy.
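As a hedged illustration of that point: the Azure Storage SDK for Java accepts blob names containing `/`, which the portal and most tools render as a virtual folder hierarchy even though the namespace is flat. The connection string, container name, and blob name below are placeholders:

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class VirtualFolderExample {
  public static void main(String[] args) {
    // Placeholder connection string and container name.
    BlobContainerClient container = new BlobContainerClientBuilder()
        .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
        .containerName("airbyte-output")
        .buildClient();

    // The "/" in the blob name acts as a virtual folder separator.
    BlobClient blob = container.getBlobClient("users/2021_09_15_1631700000000_0.csv");

    byte[] data = "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8);
    blob.upload(new ByteArrayInputStream(data), data.length, true);
  }
}
```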
Tell us about the problem you're trying to solve
I'd like to be able to use Azure Blob Storage (or S3 / GCS) as a durable data lake while also facilitating quick loads into a DW, like Snowflake and BigQuery.
Describe the solution you’d like
The option to append the current timestamp (_airbyte_emitted_at) to the resulting filename in Cloud Storage. This would allow incremental reads to create individual files that can be loaded, queried, and managed efficiently.
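A rough sketch of the idea, assuming the emitted-at value is available as epoch milliseconds and using a hypothetical helper rather than the connector's actual code: the timestamp is folded into the output filename so that each incremental sync produces a distinct object instead of overwriting a single blob.

```java
import java.time.Instant;

public class EmittedAtFilenameExample {

  // Hypothetical helper: suffix the stream's filename with the sync's
  // emitted-at timestamp so each incremental sync lands in its own blob.
  static String filenameFor(String streamName, long emittedAtEpochMillis, String extension) {
    return String.format("%s_%d.%s", streamName, emittedAtEpochMillis, extension);
  }

  public static void main(String[] args) {
    long emittedAt = Instant.now().toEpochMilli();
    // Prints something like "users_1631700000000.csv" rather than a single,
    // constantly overwritten "users.csv".
    System.out.println(filenameFor("users", emittedAt, "csv"));
  }
}
```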
Describe the alternative you’ve considered or used
An alternative would be to manage a larger workflow outside of Airbyte that loads the file, copies it to a durable location, and then removes the original.
Another alternative would be to enhance DW Destinations that leverage Cloud Storage by allowing the user to retain the staged data instead of removing it automatically. I could see value in both enhancements.
Additional context
Similar to #4610
Are you willing to submit a PR?
Perhaps. :)