
vdk-datasources: data sources POC #2805

Merged: 5 commits into main from person/aivanov/data-source, Nov 2, 2023
Conversation

@antoniivanov (Collaborator) commented Oct 17, 2023

This change implements a POC for the Data Sources API.

It is based on some of the requirements and research in https://github.com/vmware/versatile-data-kit/wiki/Ingest-Source-Research

See the concepts page for an explanation of data source related concepts.

So what's implemented is:

  • Data Source APIs handling sources, streams, and state
  • A new data source is implemented by implementing IDataSource, IDataSourceConfiguration, and IDataSourceStream (see the sketch after this list)
  • Partial data source connection management
  • A Data Source Ingester that reads from data sources and writes to the existing IIngester
  • An example data source, AutoGeneratedDataSource
  • An example job in the functional test suite
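
As a rough sketch of what that looks like for an implementer (the interface names are from this PR, but the method names and signatures below are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class HelloConfig:
    # Plays the role of the IDataSourceConfiguration for this source.
    greeting: str = "hello"


class HelloStream:
    # Plays the role of an IDataSourceStream: a named channel that can be read.
    def name(self) -> str:
        return "hello-stream"

    def read(self) -> Iterator[dict]:
        # Each yielded item becomes a payload handed to the ingester.
        for i in range(3):
            yield {"id": i, "message": "hello"}


class HelloDataSource:
    # Plays the role of an IDataSource: connects, exposes streams, disconnects.
    def connect(self, state: Optional[dict]) -> None:
        # `state` would carry markers persisted from previous runs.
        self._streams: List[HelloStream] = [HelloStream()]

    def streams(self) -> List[HelloStream]:
        return self._streams

    def disconnect(self) -> None:
        self._streams = []
```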

@antoniivanov force-pushed the person/aivanov/data-source branch 3 times, most recently from 569a0a4 to ef872f7, October 23, 2023 10:42
@antoniivanov force-pushed the person/aivanov/data-source branch from ef872f7 to 8d39b7d, October 27, 2023 10:02
@antoniivanov changed the title from "vdk-core: data sources POC" to "vdk-datasources: data sources POC", Oct 27, 2023
@antoniivanov force-pushed the person/aivanov/data-source branch 3 times, most recently from 29e80c8 to 3169ca2, October 30, 2023 10:56
#### Data Source
A Data Source is a central component responsible for establishing and managing a connection to a specific set of data. It interacts with a given configuration and maintains a stateful relationship with the data it accesses. This stateful relationship can include information such as authentication tokens, data markers, or any other form of metadata that helps manage the data connection. The Data Source exposes various data streams through which data can be read.

#### Data Source Stream
A Data Source Stream is an abstraction over a subset of data in the Data Source. It can be thought of as a channel through which data flows. Each Data Source Stream has a unique name to identify it and includes methods to read data from the stream. For example, in a database-backed data source, each table could be a separate stream. Streams can potentially be ingested in parallel.
Collaborator
I think a diagram here would be good.

@murphp15 (Collaborator)

I mostly understand the difference between a data source and a data source stream.
However, I would really like a few concrete examples.

Can you give an example if your source was a relational database and also an example if it was blob storage?

```python
data_source_ingester.ingest_data_source("auto", auto_generated, method="memory")

data_source_ingester.terminate_and_wait_to_finish()
data_source_ingester.raise_on_error()
```
Collaborator
Can we not make this the default?



```python
@dataclass
class DataSourceError:
```
Collaborator
Why did you not call it an exception? I would think that an error is more serious, like an out of memory error.

```python
def __init__(self, storage: IDataSourceStateStorage):
    self._storage = storage

def get_data_source_state(self, source: str) -> IDataSourceState:
```
Collaborator
There really needs to be a comment here as users are expected to use this function.

Collaborator Author
Users in the sense of other VDK developers.

But end users are not supposed to use it directly. I will add a comment, though.
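
For illustration, the added comment might look something like this (a sketch; the final wording in the code may differ):

```python
def get_data_source_state(self, source: str) -> IDataSourceState:
    """Return the persisted state for the given data source.

    Intended for VDK plugin developers implementing data sources;
    end users are not expected to call this directly.
    """
```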

@antoniivanov (Collaborator Author)
> Can you give an example if your source was a relational database and also an example if it was blob storage?

I will add examples. But would these work?

In a relational database like MySQL, the Data Source would be the database server itself, with which you establish a connection. Here, each table could serve as a Data Source Stream.

In the context of Amazon S3, your Data Source would be an S3 bucket. Each object (or a group of objects under a common prefix) within that bucket could be considered a Data Source Stream.

In a REST API, the Data Source is the HTTP base URL (http://xxx.com). The Data Source Streams could be the different endpoints (http://xxx.com/users, http://xxx.com/admins).

The concepts are not something new. Singer.io and Airbyte use similar concepts.
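
To make the relational example concrete, a table-per-stream source might read roughly like this (an illustrative sketch using sqlite3; the class and method names are not part of the POC):

```python
import sqlite3
from typing import Iterator


class TableStream:
    # One stream per database table (illustrative only).
    def __init__(self, connection: sqlite3.Connection, table: str):
        self._connection = connection
        self._table = table

    def name(self) -> str:
        return self._table

    def read(self) -> Iterator[dict]:
        # Yield each row as a dict keyed by column name.
        cursor = self._connection.execute(f"SELECT * FROM {self._table}")
        columns = [col[0] for col in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))
```

A blob storage source could follow the same shape, with one stream per object or key prefix.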

@antoniivanov force-pushed the person/aivanov/data-source branch from 3169ca2 to 7f05c82, November 1, 2023 15:06
This change implements a POC for the Data Sources API.

It is based on some of the requirements and research in https://github.com/vmware/versatile-data-kit/wiki/Ingest-Source-Research

See the concepts page for an explanation of data source related concepts.

So what's implemented is:
- Data Source APIs handling sources, streams, and state
- Partial data source connection management
- A Data Source Ingester that reads from data sources and writes to the existing IIngester
- An example data source AutoGeneratedDataSource
- An example job in the functional test suite

Most likely this would be moved to a plugin, vdk-data-sources. For now there doesn't appear to be a need for this to be in vdk-core.
Add support for passing and keeping state in data sources.

The way it works: at the beginning, a data source is called with `data_source.connect(previous_state)` to initialize. Then, if the Payload.state field is not None, that state will be persisted for that stream.

A data source can keep per-stream state and an "others" state for non-stream-specific stateful information if needed.
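
As a sketch of how a stream might use this (the Payload.state behavior is from the commit message; the payload fields and constructor below are assumptions):

```python
from dataclasses import dataclass
from typing import Iterator, Optional


# DataSourcePayload is part of this PR's API; this stub only mirrors the
# two fields used below and is an assumption for the sketch.
@dataclass
class DataSourcePayload:
    data: dict
    state: Optional[dict] = None


class IncrementalStream:
    # Illustrative stream that resumes from the last persisted marker.
    def __init__(self, previous_state: dict):
        self._last_id = previous_state.get("last_id", 0)

    def read(self) -> Iterator[DataSourcePayload]:
        for record_id in range(self._last_id + 1, self._last_id + 4):
            # A payload whose state field is not None causes that state
            # to be persisted for this stream.
            yield DataSourcePayload(
                data={"id": record_id},
                state={"last_id": record_id},
            )
```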

auto
Register a data source and its associated configuration class:
1. First decorate the class with the @data_source decorator
2. Then implement vdk_data_sources_register to register the class as below

```python
@hookimpl
def vdk_data_sources_register(self, data_source_factory: IDataSourceFactory):
    data_source_factory.register_data_source_class(AutoGeneratedDataSource)
```
A data source can be used in this way:

```python
def run(job_input: IJobInput):
    source = SourceDefinition(id="auto", name="auto-generated-data", config={})
    destination = DestinationDefinition(id="auto-dest", method="memory")

    with DataFlowInput(job_input) as flow_input:
        flow_input.start(source, destination)
```

or in a config.toml file:

```toml
[sources.auto]
name = "auto-generated-data"
config = {}

[destinations.auto-dest]
method = "memory"

[[flows]]
from = "auto"
to = "auto-dest"
```

```python
def run(job_input: IJobInput):
    with DataFlowInput(job_input) as flow_input:
        flow_input.start_flow(toml_parser.load_config("config.toml"))
```

flow

flow comments
When defining a data flow mapping, we need to be able to map source to target appropriately (e.g. rename columns, map streams to tables).
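
For example, a mapping section in the flow configuration might eventually look something like this (a purely hypothetical shape, not implemented in this PR):

```toml
[[flows]]
from = "auto"
to = "auto-dest"

# Hypothetical: route a stream to a target table...
[flows.mapping.streams]
stream_0 = "target_table"

# ...and rename columns within that stream.
[flows.mapping.columns.stream_0]
old_column = "new_column"
```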
@antoniivanov force-pushed the person/aivanov/data-source branch from 04c3cba to 9a32f82, November 2, 2023 10:34
@antoniivanov antoniivanov merged commit 0bd3c49 into main Nov 2, 2023
@antoniivanov antoniivanov deleted the person/aivanov/data-source branch November 2, 2023 10:50
#### Data Source Stream
A Data Source Stream is an abstraction over a subset of data in the Data Source. It can be thought of as a channel through which data flows. Each Data Source Stream has a unique name to identify it and includes methods to read data from the stream. For example, in a database-backed data source, each table could be a separate stream. Streams can potentially be ingested in parallel.

Reading from the stream yields a sequence of Data Source Payloads.
Collaborator
what happens if the stream never ends, e.g. if it's a Kafka topic with constant influx of data?

@antoniivanov (Collaborator Author) commented Nov 2, 2023
What could happen depends on how the Kafka data source is implemented. You can implement it so that it fetches all data until start_timestamp; in that case it should end.

But one can reuse https://pypi.org/project/pipelinewise-tap-kafka/ with vdk-singer.

The way they have handled it is to use max_runtime_ms (the maximum time for the tap to collect new messages from a Kafka topic) to end the ingestion batch.

```python
flow_input.start(DataFlowMappingDefinition(source, destination))
```

or in config.toml file
Collaborator
What is a toml file? I could obviously google it - we configure data jobs with INI files, and now TOML files for something else?

@antoniivanov (Collaborator Author) commented Nov 2, 2023
I've been considering replacing INI with TOML because:

  • it supports nested data structures
  • it supports arrays
  • it supports data types!
  • it's somewhat similar to INI in syntax: existing INI files can often be parsed by a TOML parser, so the change can be pretty backward compatible.

It would not have been feasible/easy to express the above data flow structure in INI format. So I decided to use TOML now as an experiment to see if it's going to work for users.

We need to move away from INI for the above reasons, and we've had users who have requested support for a more "modern" format.
The other alternative is YAML, but YAML is pretty ugly for highly nested configurations and would make the migration from INI more involved.
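
For contrast, here is the kind of structure the flow configuration needs (typed values, inline tables, arrays of tables), which plain INI has no syntax for; the config keys here are illustrative:

```toml
[sources.auto]
name = "auto-generated-data"
# Inline table with typed values: an integer and a boolean.
config = { num_records = 10, include_metadata = true }

[[flows]]  # array of tables: multiple flows can be declared
from = "auto"
to = "auto-dest"
```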

Collaborator
I generally agree with everything you said; however, it would be great to have a more structured approach, e.g. use TOML everywhere, rather than use it here and something else in a different place.

```python
def _generate_test_data(self, start_id: int) -> List[DataSourcePayload]:
    generated_data = []
    for i in range(self._config.num_records):
        data = {
```
Collaborator
A data stream, which generates data about data streams... Seems a bit tautological... what about a more common/well known/relatable use case like Employees or shapes ... animals?

Collaborator Author
Hmm, yeah, I probably could have come up with a more interesting example. I will leave it for now, though, since there are already lots of tests that expect this data. But I will change it later.

It's part of the definition of a data source - what configuration it requires.
You need to implement a class decorated with the @config_class decorator like this:

Example::
Collaborator
An example full of "examples" is not very useful. Try to make it more relatable please.



```python
@dataclass
class DataSourceRegistryItem:
```
Collaborator
"Registry Item" - isn't this just a "data source"?

Collaborator Author
It keeps the classes that are needed to create the data source. It's not the data source itself. I am not sure of a better name.

```python
class DataSourceIngester:
    def __init__(self, job_input: IJobInput):
        self.__ingestion_queue = Queue()
        self.__actual_ingester = cast(IIngester, job_input)
```
Collaborator
"actual ingester"? is there another one?

Collaborator Author
Well, the class is called "DataSourceIngester", so the actual ingester that is going to send the data to the target is the __actual_ingester.

Feel free to suggest a better name and I will rename it.

```python
    return queue_item is None

def _ingest_stream(self, ingest_entry: IngestQueueEntry):
    for payload in ingest_entry.stream.read():
```
Collaborator
when do you stop reading?

@antoniivanov (Collaborator Author) commented Nov 2, 2023
When the stream decides to stop, or when the data job times out.

```python
    destinations: List[IngestDestination] = None,
    error_callback: Optional[IDataSourceErrorCallback] = None,
):
    if data_source_id not in self.__ingested_streams_set:
```
Collaborator
"ingested" streams? What if the ingestion fails? I would rename it to something something like "being_ingested" or "ingesting" streams?

Collaborator Author
Sounds good

antoniivanov added a commit that referenced this pull request Nov 2, 2023
This is adding a data source plugin for singer.io.

So now users can specify singer taps as data sources.

They can also list all singer taps that can be found with `vdk singer --list-taps`.

The change depends on #2805
antoniivanov added a commit that referenced this pull request Nov 2, 2023
Addressing review comments from #2805
@antoniivanov (Collaborator Author)
Thanks for the review @dakodakov. I have answered the comments above and addressed the changes in PR #2865.

antoniivanov added a commit that referenced this pull request Nov 2, 2023
Addressing review comments from #2805
antoniivanov added a commit that referenced this pull request Nov 13, 2023