vdk-datasources: data sources POC #2805
What happens if the stream never ends, e.g. if it's a Kafka topic with a constant influx of data?
What happens depends on how the Kafka data source is implemented.
You can implement it so that it fetches all data until start_timestamp; in that case it should end.
But one can also reuse https://pypi.org/project/pipelinewise-tap-kafka/ with vdk-singer. The way they have handled it is to use max_runtime_ms (the maximum time for the tap to collect new messages from the Kafka topic) to end the ingestion batch.
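The two cut-off strategies mentioned above (a time budget and a "fetch until timestamp" bound) can be sketched roughly like this. All names here are illustrative, not the actual vdk or pipelinewise-tap-kafka API; the endless generator stands in for a real Kafka consumer:

```python
import time

def read_batch(messages, max_runtime_ms, stop_timestamp=None):
    """Drain a potentially unbounded message stream into one finite batch.

    Stops when either the time budget (max_runtime_ms, as in
    pipelinewise-tap-kafka) is exhausted, or a message timestamp reaches
    stop_timestamp (the "fetch all data until start_timestamp" approach).
    `messages` yields (timestamp, payload) tuples; names are hypothetical.
    """
    deadline = time.monotonic() + max_runtime_ms / 1000.0
    batch = []
    for ts, payload in messages:
        if time.monotonic() >= deadline:
            break  # time budget spent: end this ingestion batch
        if stop_timestamp is not None and ts >= stop_timestamp:
            break  # caught up to the requested point in the stream
        batch.append(payload)
    return batch

# Simulated endless topic: without a bound, iterating it would never finish.
def endless_topic():
    ts = 0
    while True:
        ts += 1
        yield ts, f"event-{ts}"

print(read_batch(endless_topic(), max_runtime_ms=60_000, stop_timestamp=5))
# → ['event-1', 'event-2', 'event-3', 'event-4']
```

Either bound alone is enough to terminate the batch; a real implementation would likely combine both, as the tap does.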
What is a TOML file? I could obviously google it - but we configure data jobs with INI files, and now TOML files for something else?
I've been considering replacing INI with TOML for a few reasons.
It would not have been feasible or easy to express the above data flow structure in INI format, so I decided to use TOML now as an experiment to see whether it's going to work for users.
We need to move away from INI for the above reasons, and we've had users who have requested support for a more "modern" format.
The other alternative is YAML, but YAML is pretty ugly for highly nested configurations and would make the migration from INI to YAML more involved.
I generally agree with everything you said; however, it would be great to have a more structured approach, e.g. use TOML everywhere, rather than use it here and something else in a different place.
A data stream that generates data about data streams seems a bit tautological. What about a more common, well-known, relatable use case like employees, or shapes, or animals?
Hmm, yeah, I probably could have come up with a more interesting example. I will leave it for now, though, since there are already lots of tests that expect this data, but I will change it later.