vdk-data-source-git: data source for git POC #2859
Merged
+301
−0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
image: "python:3.7" | ||
|
||
.build-vdk-data-source-git: | ||
variables: | ||
PLUGIN_NAME: vdk-data-source-git | ||
extends: .build-plugin | ||
|
||
build-py37-vdk-data-source-git: | ||
extends: .build-vdk-data-source-git | ||
image: "python:3.7" | ||
|
||
build-py311-vdk-data-source-git: | ||
extends: .build-vdk-data-source-git | ||
image: "python:3.11" | ||
|
||
release-vdk-data-source-git: | ||
variables: | ||
PLUGIN_NAME: vdk-data-source-git | ||
extends: .release-plugin |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# data-source-git | ||
|
||
Extracts content from Git repositories along with associated file metadata. | ||
|
||
## Usage | ||
|
||
``` | ||
pip install vdk-data-source-git | ||
``` | ||
|
||
### Extracted Data Schema | ||
|
||
The extracted data is returned in a `DataSourcePayload` object with two main components: `content` and `metadata`. | ||
|
||
#### `content` | ||
|
||
The `content` field contains the actual content of the file as a string. | ||
|
||
#### `metadata` | ||
|
||
The `metadata` field contains a dictionary with the following schema: | ||
|
||
| Key | Description | Data Type | Example | | ||
|------------------------|---------------------------------------------------|-----------|---------------| | ||
| `size` | The size of the file in bytes | Integer | 12345 | | ||
| `path` | The path of the file in the repository | String | "src/main.py" | | ||
| `num_lines` | The number of lines in the file | Integer | 678 | | ||
| `file_extension` | The file extension | String | ".py" | | ||
| `programming_language` | The detected programming language of the file | String | "Python" | | ||
| `is_likely_test_file` | Flag indicating if the file is likely a test file | Boolean | false | | ||
|
||
### Configuration | ||
|
||
(`vdk config-help` is a useful command to browse all config options of your installation of vdk) | ||
|
||
| Name | Description | (example) Value | | ||
|---------|------------------------------------------|--------------------------------| | ||
| git_url | URL of the Git repository to be cloned. | "https://github.com/user/repo" | | ||
|
||
|
||
### Build and testing | ||
|
||
``` | ||
pip install -r requirements.txt | ||
pip install -e . | ||
pytest | ||
``` | ||
|
||
In the VDK repo, the [../build-plugin.sh](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/build-plugin.sh) script can also be used. | ||
|
||
|
||
#### Note about the CI/CD: | ||
|
||
`.plugin-ci.yaml` is needed only for plugins that are part of the [Versatile Data Kit Plugin repo](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins). | ||
|
||
The CI/CD is separated into two stages, a build stage and a release stage. | ||
The build stage is made up of a few jobs, all of which inherit from the same | ||
job configuration and differ only in the Python version they use (3.7 and 3.11). | ||
They run according to rules, which are ordered so that changes to a | ||
plugin's directory trigger that plugin's CI, but changes to a different plugin do not. |
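The metadata schema documented in the README can be illustrated with a small, standard-library-only sketch. `build_metadata` is a hypothetical helper, not part of the plugin. It leaves out `programming_language` (which the plugin derives via Pygments) and, as an assumption made to match the README's "size in bytes" wording, measures `size` on the UTF-8 encoded content; the plugin's `read()` takes `len()` of the decoded string instead, so the two values can differ for non-ASCII files.

```python
import os
from typing import Any, Dict


def build_metadata(path: str, content: str) -> Dict[str, Any]:
    """Hypothetical helper mirroring the README's metadata schema."""
    return {
        # Assumption: size is counted in UTF-8 bytes, as the README describes.
        "size": len(content.encode("utf-8")),
        "path": path,
        # Same formula as the plugin: number of newlines plus one.
        "num_lines": content.count("\n") + 1,
        "file_extension": os.path.splitext(path)[1],
        # Same substring heuristic as the plugin's is_test_file().
        "is_likely_test_file": "test" in path,
    }


example = build_metadata("src/main.py", "print('héllo')\n")
```

Note that a file ending in a newline is counted with one extra line by this formula, which is how the plugin computes `num_lines` as well.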
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# this file is used to provide testing requirements | ||
# for requirements (dependencies) needed during and after installation of the plugin see (and update) setup.py install_requires section | ||
|
||
|
||
pytest | ||
vdk-core | ||
vdk-test-utils |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
import pathlib | ||
|
||
import setuptools | ||
|
||
""" | ||
Builds a package with the help of setuptools in order for this package to be imported in other projects | ||
""" | ||
|
||
__version__ = "0.1.0" | ||
|
||
setuptools.setup( | ||
name="vdk-data-source-git", | ||
version=__version__, | ||
url="https://github.com/vmware/versatile-data-kit", | ||
description="Read Git repository data source", | ||
long_description=pathlib.Path("README.md").read_text(), | ||
long_description_content_type="text/markdown", | ||
install_requires=["vdk-core", "vdk-data-sources", "pygments", "dulwich"], | ||
package_dir={"": "src"}, | ||
packages=setuptools.find_namespace_packages(where="src"), | ||
# This is the only vdk plugin specific part | ||
# Define entry point called "vdk.plugin.run" with name of plugin and module to act as entry point. | ||
entry_points={ | ||
"vdk.plugin.run": [ | ||
"vdk-data-source-git = vdk.plugin.data_source_git.plugin_entry" | ||
] | ||
}, | ||
classifiers=[ | ||
"Development Status :: 2 - Pre-Alpha", | ||
"License :: OSI Approved :: Apache Software License", | ||
"Programming Language :: Python :: 3.7", | ||
"Programming Language :: Python :: 3.8", | ||
"Programming Language :: Python :: 3.9", | ||
"Programming Language :: Python :: 3.10", | ||
"Programming Language :: Python :: 3.11", | ||
], | ||
project_urls={ | ||
"Documentation": "https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-data-source-git", | ||
"Source Code": "https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-data-source-git", | ||
"Bug Tracker": "https://github.com/vmware/versatile-data-kit/issues/new/choose", | ||
}, | ||
) |
88 changes: 88 additions & 0 deletions
projects/vdk-plugins/vdk-data-source-git/src/vdk/plugin/data_source_git/git_source.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
import os | ||
import tempfile | ||
from typing import Iterator | ||
from typing import List | ||
from typing import Optional | ||
|
||
from dulwich import porcelain | ||
from vdk.plugin.data_source_git.utils import detect_language | ||
from vdk.plugin.data_source_git.utils import is_test_file | ||
from vdk.plugin.data_sources.config import config_class | ||
from vdk.plugin.data_sources.config import config_field | ||
from vdk.plugin.data_sources.data_source import DataSourcePayload | ||
from vdk.plugin.data_sources.data_source import IDataSource | ||
from vdk.plugin.data_sources.data_source import ( | ||
IDataSourceConfiguration, | ||
) | ||
from vdk.plugin.data_sources.data_source import IDataSourceStream | ||
from vdk.plugin.data_sources.factory import data_source | ||
from vdk.plugin.data_sources.state import IDataSourceState | ||
|
||
DESCRIPTION = """Git data source. | ||
Extract content from Git repositories and associated file metadata. | ||
""" | ||
|
||
|
||
@config_class(name="git", description=DESCRIPTION) | ||
class GitDataSourceConfiguration(IDataSourceConfiguration): | ||
git_url: str = config_field(description="URL of the Git repository to be cloned.") | ||
git_ssh_key: Optional[str] = config_field( | ||
description="SSH key to use when cloning the repo. " | ||
"Leave empty if no authentication is needed", | ||
default="", | ||
) | ||
|
||
|
||
@data_source(name="git", config_class=GitDataSourceConfiguration) | ||
class GitDataSource(IDataSource): | ||
""" | ||
Data source that extracts file content and associated metadata from a Git repository. | ||
|
||
""" | ||
|
||
def __init__(self): | ||
self._config = None | ||
self._streams = [] | ||
|
||
def configure(self, config: GitDataSourceConfiguration): | ||
self._config = config | ||
|
||
def connect(self, state: IDataSourceState): | ||
if not self._streams: | ||
self._streams = [GitDataSourceStream(self._config.git_url)] | ||
|
||
def disconnect(self): | ||
self._streams = [] | ||
|
||
def streams(self) -> List[IDataSourceStream]: | ||
return self._streams | ||
|
||
|
||
class GitDataSourceStream(IDataSourceStream): | ||
"""Stream that clones the repository and yields one payload per file in its index.""" | ||
|
||
def name(self) -> str: | ||
return self._url | ||
|
||
def __init__(self, url: str): | ||
self._url = url | ||
|
||
def read(self) -> Iterator[DataSourcePayload]: | ||
with tempfile.TemporaryDirectory() as tmp_dir: | ||
repo = porcelain.clone(source=self._url, target=tmp_dir, depth=1) | ||
|
||
for path, entry in repo.open_index().items(): | ||
file_path = path.decode("utf-8") | ||
blob = repo.get_object(entry.sha) | ||
# TODO: VDK send_object/tabular_data for ingestion doesn't support bytes so we convert it for now. | ||
data = blob.data.decode("utf-8") | ||
metadata = { | ||
"size": len(data), | ||
"path": file_path, | ||
"num_lines": data.count("\n") + 1, | ||
"file_extension": os.path.splitext(file_path)[1], | ||
"programming_language": detect_language(file_path), | ||
"is_likely_test_file": is_test_file(file_path), | ||
} | ||
yield DataSourcePayload({"content": data}, metadata=metadata) |
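`read()` decodes every blob as UTF-8, so a repository containing binary files (images, archives) would raise `UnicodeDecodeError` partway through the stream. One possible way to handle this, sketched under the assumption that skipping non-text files is acceptable, is a small guard around the decode step. `decode_text_blob` is a hypothetical helper and not part of this PR.

```python
from typing import Dict, Optional


def decode_text_blob(raw: bytes) -> Optional[str]:
    """Return the blob as text, or None when it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Illustrative in-memory stand-in for the blobs of a cloned repository.
blobs = {"README.md": b"# title\n", "logo.png": b"\x89PNG\r\n\x1a\n\xff"}

text_files: Dict[str, str] = {}
for path, raw in blobs.items():
    text = decode_text_blob(raw)
    if text is not None:  # binary blobs are skipped instead of failing the read
        text_files[path] = text
```

Inside the stream, the same guard would amount to a `continue` when the helper returns `None`, at the cost of silently omitting binary files from the output.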
23 changes: 23 additions & 0 deletions
projects/vdk-plugins/vdk-data-source-git/src/vdk/plugin/data_source_git/plugin_entry.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
from typing import List | ||
|
||
from vdk.api.plugin.hook_markers import hookimpl | ||
from vdk.api.plugin.plugin_registry import IPluginRegistry | ||
from vdk.plugin.data_source_git.git_source import GitDataSource | ||
from vdk.plugin.data_sources.factory import IDataSourceFactory | ||
|
||
""" | ||
Plugin entry point: registers the Git data source with the data sources factory. | ||
""" | ||
|
||
|
||
class DataSourceGitPlugin: | ||
@hookimpl | ||
def vdk_data_sources_register(self, data_source_factory: IDataSourceFactory): | ||
data_source_factory.register_data_source_class(GitDataSource) | ||
|
||
|
||
@hookimpl | ||
def vdk_start(plugin_registry: IPluginRegistry, command_line_args: List): | ||
plugin_registry.load_plugin_with_hooks_impl(DataSourceGitPlugin(), "DataSourceGit") |
16 changes: 16 additions & 0 deletions
projects/vdk-plugins/vdk-data-source-git/src/vdk/plugin/data_source_git/utils.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
from pygments.lexers import get_lexer_for_filename | ||
from pygments.util import ClassNotFound | ||
|
||
|
||
def is_test_file(file_path: str) -> bool: | ||
return "test" in file_path | ||
|
||
|
||
def detect_language(file_path: str) -> str: | ||
try: | ||
lexer = get_lexer_for_filename(file_path) | ||
return lexer.name | ||
except ClassNotFound: | ||
return "Unknown" |
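The substring check in `is_test_file` also matches paths such as `src/latest.py` or `src/contest/main.py`. A stricter heuristic could look at directory names and file-name prefixes or suffixes instead. The sketch below is one such alternative, not what the PR implements; `is_test_file_strict` and the chosen naming conventions (`test`/`tests` directories, `test_*` and `*_test` file names) are assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumed conventions; real projects may use others (e.g. "spec", "__tests__").
_TEST_DIRS = {"test", "tests"}
_TEST_NAME = re.compile(r"^(test_.*|.*_test)$")


def is_test_file_strict(file_path: str) -> bool:
    """Flag files inside a test directory or named test_* / *_test."""
    path = PurePosixPath(file_path)
    # Any parent directory named "test" or "tests" marks the file as a test.
    if any(part.lower() in _TEST_DIRS for part in path.parts[:-1]):
        return True
    # Otherwise fall back to the file name without its extension.
    return bool(_TEST_NAME.match(path.stem.lower()))
```

Git index paths use forward slashes, which is why `PurePosixPath` is used here regardless of the host operating system.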
Submodule dsc
added at
2c9988
16 changes: 16 additions & 0 deletions
projects/vdk-plugins/vdk-data-source-git/tests/jobs/ingest-git-job/step.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
from vdk.api.job_input import IJobInput | ||
from vdk.plugin.data_sources.mapping.data_flow import DataFlowInput | ||
from vdk.plugin.data_sources.mapping.definitions import DataFlowMappingDefinition | ||
from vdk.plugin.data_sources.mapping.definitions import DestinationDefinition | ||
from vdk.plugin.data_sources.mapping.definitions import SourceDefinition | ||
|
||
|
||
def run(job_input: IJobInput): | ||
url = "https://github.com/versatile-data-kit-demo/dsc" | ||
source = SourceDefinition(id=url, name="git", config={"git_url": url}) | ||
destination = DestinationDefinition(id="test", method="memory") | ||
|
||
with DataFlowInput(job_input) as flow_input: | ||
flow_input.start(DataFlowMappingDefinition(source, destination)) |
24 changes: 24 additions & 0 deletions
projects/vdk-plugins/vdk-data-source-git/tests/test_run_ingest_git.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
from click.testing import Result | ||
from vdk.plugin.data_source_git import plugin_entry | ||
from vdk.plugin.data_sources import plugin_entry as data_sources_plugin_entry | ||
from vdk.plugin.test_utils.util_funcs import cli_assert_equal | ||
from vdk.plugin.test_utils.util_funcs import CliEntryBasedTestRunner | ||
from vdk.plugin.test_utils.util_funcs import jobs_path_from_caller_directory | ||
from vdk.plugin.test_utils.util_plugins import IngestIntoMemoryPlugin | ||
|
||
|
||
def test_run_ingest_git(): | ||
ingest_plugin = IngestIntoMemoryPlugin() | ||
runner = CliEntryBasedTestRunner( | ||
ingest_plugin, data_sources_plugin_entry, plugin_entry | ||
) | ||
|
||
result: Result = runner.invoke( | ||
["run", jobs_path_from_caller_directory("ingest-git-job")] | ||
) | ||
|
||
cli_assert_equal(0, result) | ||
|
||
assert len(ingest_plugin.payloads) > 0 |
This datasource will always only have a single stream?
At this iteration of the data source implementation, yes. All singer data sources are really relational data sources (json / dictionary). I needed a data source that produces blobs of data, so that's why I developed this one: to find out the limitations of our ingestion. Eventually it might make sense to allow users to configure streams (maybe branches, or directories, or something else), but not in this first iteration.
Its development status is currently pre-alpha, or maybe alpha, as per https://martin-thoma.com/software-development-stages/
PS: The main limitation I found is that we want the payloads to be JSON serializable, so we don't really accept "bytes" in the payload.
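The JSON-serializability limitation mentioned above can be illustrated with a short sketch. One common workaround, which is an assumption here and not something the PR or VDK does, is to base64-encode content that is not valid UTF-8 and record which encoding was used. The helper name `to_json_safe` and the `encoding` field are hypothetical.

```python
import base64
import json
from typing import Dict


def to_json_safe(raw: bytes) -> Dict[str, str]:
    """Wrap raw bytes in a JSON-serializable dict, base64-encoding non-text data."""
    try:
        return {"content": raw.decode("utf-8"), "encoding": "utf-8"}
    except UnicodeDecodeError:
        encoded = base64.b64encode(raw).decode("ascii")
        return {"content": encoded, "encoding": "base64"}


text_payload = to_json_safe(b"hello\n")
binary_payload = to_json_safe(b"\xff\xfe\x00")

# Both payloads now survive a JSON round trip.
serialized = json.dumps([text_payload, binary_payload])
```

A consumer would need to check the `encoding` field and call `base64.b64decode` on binary entries, so this trades payload simplicity for the ability to carry arbitrary file content.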