vdk-data-source-git: data source for git POC #2859

Merged (1 commit) on Nov 16, 2023

@antoniivanov antoniivanov commented Nov 1, 2023

Extracts content from Git repositories along with associated file metadata. See the README for more details.

This is needed because most other data sources (basically all vdk-singer data sources) are really relational data sources (JSON / dictionary). I needed a data source that produces blobs of data so that we can test those scenarios and find out the limitations of our ingestion. Git data from internal Git systems is also a natural data source for fine-tuning certain ML models.
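To make "content along with associated file metadata" concrete, here is a minimal sketch of what one record per file could look like. It is not the plugin's actual implementation: the function name `iter_file_records` and the field names are hypothetical, and it simply walks an already checked-out working tree instead of talking to Git.

```python
import os
from typing import Dict, Iterator


def iter_file_records(repo_dir: str) -> Iterator[Dict[str, object]]:
    """Yield one record per file in a checked-out repository directory.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    for root, dirs, files in os.walk(repo_dir):
        # Skip Git's internal metadata directory.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in sorted(files):
            full_path = os.path.join(root, name)
            # Read as text; undecodable bytes become replacement characters.
            with open(full_path, "r", encoding="utf-8", errors="replace") as f:
                content = f.read()
            yield {
                "path": os.path.relpath(full_path, repo_dir),
                "extension": os.path.splitext(name)[1],
                "size_bytes": os.path.getsize(full_path),
                "content": content,
            }
```

Each record is a plain dictionary whose values are strings and integers, so the file content travels as one large text blob next to its metadata.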



@data_source(name="git", config_class=GitDataSourceConfiguration)
class GitDataSource(IDataSource):
Collaborator

Will this data source always have only a single stream?

Collaborator Author

Yes, in this iteration of the data source implementation. All Singer data sources are really relational data sources (JSON / dictionary). I needed a data source that yields blobs of data, which is why I developed this one: to find out the limitations of our ingestion. Eventually it might make sense to let users configure streams, perhaps by branch, by directory, or by something else, but not in this first iteration.

Its development status is currently pre-alpha, or maybe alpha, as per https://martin-thoma.com/software-development-stages/

PS: The main limitation I found is that we want the payloads to be JSON serializable, so we don't really accept "bytes" in the payload.
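For illustration, the limitation can be reproduced with Python's standard `json` module, which refuses `bytes` values. The sketch below also shows one possible workaround, base64-encoding binary values into text before ingestion. The helper name `to_json_safe` and the sample record are hypothetical, and this is not something the PR itself implements.

```python
import base64
import json


def to_json_safe(payload: dict) -> dict:
    """Return a copy of payload with bytes values base64-encoded as text.

    Hypothetical helper; only top-level values are converted.
    """
    safe = {}
    for key, value in payload.items():
        if isinstance(value, (bytes, bytearray)):
            safe[key] = base64.b64encode(bytes(value)).decode("ascii")
        else:
            safe[key] = value
    return safe


# Sample record with a binary blob (made-up data).
record = {"path": "logo.png", "content": b"\x89PNG\r\n"}

try:
    json.dumps(record)
    raw_ok = True
except TypeError:
    # json.dumps raises TypeError for bytes values.
    raw_ok = False

# After encoding, the payload serializes without error.
encoded = json.dumps(to_json_safe(record))
```

A consumer would need to know which fields are base64-encoded in order to decode them again, so this trades a serialization error for a convention that both sides have to agree on.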

@antoniivanov antoniivanov force-pushed the person/aivanov/git branch 2 times, most recently from b557d5a to bfd97e9 Compare November 3, 2023 12:59
@antoniivanov antoniivanov merged commit 99eed5f into main Nov 16, 2023
@antoniivanov antoniivanov deleted the person/aivanov/git branch November 16, 2023 17:02