From 2c59b0ac728756b2628a4996fe4772ef4dcc40cd Mon Sep 17 00:00:00 2001 From: Antoni Ivanov Date: Sun, 16 Jul 2023 02:17:56 +0300 Subject: [PATCH 1/7] vdk-plugins: include Ingestion cycle diagram --- projects/vdk-plugins/README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 62d3b39dbe..d4fd462b8c 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -127,6 +127,11 @@ Check out the [connection hook spec](../vdk-core/src/vdk/api/plugin/connection_h [![connection-hooks-activity-diagram](https://user-images.githubusercontent.com/2536458/228570184-4fba653c-dd6a-4a6d-80b3-bee83beb85e6.svg)](https://www.plantuml.com/plantuml/uml/ZPDFQnin4CNl-XG3kTY7-51wgGqb4BiKcaBwf-cXBKQMnbwrMib8sgI1_V0PQOMrcrs7-EBTVK-_DoEDhdpWBZIrPe8VWx86lbVAWrJyu7WDlh8FzCO3tt6FSBkvXJTltu4zekFHxU51XGhkrf_WsXg38Y4-MllFBnZJU40ZsKtwMx8M3WxHG7lYVCPGMGdThoN3JZT8-WGlwaJ9ICRQ7nvTorBv360f7FA084whLgmbJ1krduuVYJTMBevSOofgMUH5PipcP5pdrXDdu-b5Ar_rOwBm5KFZJEyhsDrVUbhbCli5DivRjpeVdlHnTexev0dyvZ-AXlZVljo0i7NDZNmUafOki3DIGjcWEwwLv07BmQQrMXsg48zaANVRqjpsFjkt9tka4MUDmhhNSsIqZpcb6Ug27r2-4fSxAxJnBcPmsI6rXnawPzqSGeK6Pe_ev-Iy_7NXKFwvV4_FUPlE1piKzXxTC7Z8Y3dPXdASbKweSwBs23DZep9ab4hYF715lbJwQiBfWpr6c95gpmfo69RKwJbpw1iTOjdSFAwOvZlKu9AsxRJUx7t082gGX8aB3AB4CyEtZqvhQFf-c_wdcbAUV-DQdxq6oO0oPVPlmRMjQnKWE6uyxsvYgUY52nzNZSF6j46MjhxSvwbkHNICi5cDMsmR9z3JaqOIvPXUXko5wgTJYcCoAGt85HhPrFe9) +### Data Ingestion Cycle + 
+[![data-ingestion-workflow](https://github.com/vmware/versatile-data-kit/assets/2536458/8c27455c-9836-4110-8a8a-9660fba6706d)](https://www.plantuml.com/plantuml/uml/hPDFQzj04CNl-XH3V4aE9iH7avhOCO6cFXXiQWe1iTQEhArMCyh-gMjAltj7KXmJOUf3g-FCRsRVUxjwy46v42kR11Cimbm51PzfXpuO9jYmAtFB-oJnfQ5QELM1nzU8b27yIa2-gNEyVsJB3cPMPMLNp0Ax6JkDhjzQc1mNXl12Lmexnv5qHtn3Ap9QP2c2JMPgHU7CZXxGMxCg3pCRiOyzCOMp5lhpqzUeJjt-sEyaKKqThjeOdtbx1Sh3_1a6xM1zEX6kliw_m9FaYNl9kEMQok2ey0Exj75d27xUjTpoxW8swh3Htx5vSyUacdlk-FM9rw9_gpo-ELce4cytoc71qUFj2wqBu_ImsNe095speT1vNMnWi2bCm5N59IQ9c1zEMcjZy8AclFsEMKXpTgavlhFh2aF1-fC-IRf9Y0C2_q3tDlrOO5Pwa46eMmSUCgRSxA933OPUwFw-svZMwc1PwRHsM3lEqFlq-6edaqJMDPeanZ48aHw7ElBwvXqOdGV-ILhdDDMOgsZ3P08oqzKWZvGrrg7zpp2WUrUobaC-UXCLKXrAKo8Vmmf9GoWGcfi3EFOwUTEi9DvRr3kiaC9_IPPzk1Ij81UoFKiy8EbOsJy0) + + ## Public interfaces Any backwards compatibility guarantees apply only to public interfaces. From cc449577759a9a7e9af7e595d07f41795f35d1ea Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Sun, 16 Jul 2023 10:53:46 +0000 Subject: [PATCH 2/7] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- projects/vdk-plugins/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index d4fd462b8c..2e13942c08 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -127,7 +127,7 @@ Check out the [connection hook spec](../vdk-core/src/vdk/api/plugin/connection_h 
[![connection-hooks-activity-diagram](https://user-images.githubusercontent.com/2536458/228570184-4fba653c-dd6a-4a6d-80b3-bee83beb85e6.svg)](https://www.plantuml.com/plantuml/uml/ZPDFQnin4CNl-XG3kTY7-51wgGqb4BiKcaBwf-cXBKQMnbwrMib8sgI1_V0PQOMrcrs7-EBTVK-_DoEDhdpWBZIrPe8VWx86lbVAWrJyu7WDlh8FzCO3tt6FSBkvXJTltu4zekFHxU51XGhkrf_WsXg38Y4-MllFBnZJU40ZsKtwMx8M3WxHG7lYVCPGMGdThoN3JZT8-WGlwaJ9ICRQ7nvTorBv360f7FA084whLgmbJ1krduuVYJTMBevSOofgMUH5PipcP5pdrXDdu-b5Ar_rOwBm5KFZJEyhsDrVUbhbCli5DivRjpeVdlHnTexev0dyvZ-AXlZVljo0i7NDZNmUafOki3DIGjcWEwwLv07BmQQrMXsg48zaANVRqjpsFjkt9tka4MUDmhhNSsIqZpcb6Ug27r2-4fSxAxJnBcPmsI6rXnawPzqSGeK6Pe_ev-Iy_7NXKFwvV4_FUPlE1piKzXxTC7Z8Y3dPXdASbKweSwBs23DZep9ab4hYF715lbJwQiBfWpr6c95gpmfo69RKwJbpw1iTOjdSFAwOvZlKu9AsxRJUx7t082gGX8aB3AB4CyEtZqvhQFf-c_wdcbAUV-DQdxq6oO0oPVPlmRMjQnKWE6uyxsvYgUY52nzNZSF6j46MjhxSvwbkHNICi5cDMsmR9z3JaqOIvPXUXko5wgTJYcCoAGt85HhPrFe9) -### Data Ingestion Cycle +### Data Ingestion Cycle [![data-ingestion-workflow](https://github.com/vmware/versatile-data-kit/assets/2536458/8c27455c-9836-4110-8a8a-9660fba6706d)](https://www.plantuml.com/plantuml/uml/hPDFQzj04CNl-XH3V4aE9iH7avhOCO6cFXXiQWe1iTQEhArMCyh-gMjAltj7KXmJOUf3g-FCRsRVUxjwy46v42kR11Cimbm51PzfXpuO9jYmAtFB-oJnfQ5QELM1nzU8b27yIa2-gNEyVsJB3cPMPMLNp0Ax6JkDhjzQc1mNXl12Lmexnv5qHtn3Ap9QP2c2JMPgHU7CZXxGMxCg3pCRiOyzCOMp5lhpqzUeJjt-sEyaKKqThjeOdtbx1Sh3_1a6xM1zEX6kliw_m9FaYNl9kEMQok2ey0Exj75d27xUjTpoxW8swh3Htx5vSyUacdlk-FM9rw9_gpo-ELce4cytoc71qUFj2wqBu_ImsNe095speT1vNMnWi2bCm5N59IQ9c1zEMcjZy8AclFsEMKXpTgavlhFh2aF1-fC-IRf9Y0C2_q3tDlrOO5Pwa46eMmSUCgRSxA933OPUwFw-svZMwc1PwRHsM3lEqFlq-6edaqJMDPeanZ48aHw7ElBwvXqOdGV-ILhdDDMOgsZ3P08oqzKWZvGrrg7zpp2WUrUobaC-UXCLKXrAKo8Vmmf9GoWGcfi3EFOwUTEi9DvRr3kiaC9_IPPzk1Ij81UoFKiy8EbOsJy0) From 15e9ea20fb70fb8f2f9d98c533b96f16d67d53aa Mon Sep 17 00:00:00 2001 From: Antoni Ivanov Date: Mon, 17 Jul 2023 13:07:38 +0300 Subject: [PATCH 3/7] Update README.md --- projects/vdk-plugins/README.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 
deletion(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 2e13942c08..6fc6ce4d3a 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -129,8 +129,25 @@ Check out the [connection hook spec](../vdk-core/src/vdk/api/plugin/connection_h ### Data Ingestion Cycle -[![data-ingestion-workflow](https://github.com/vmware/versatile-data-kit/assets/2536458/8c27455c-9836-4110-8a8a-9660fba6706d)](https://www.plantuml.com/plantuml/uml/hPDFQzj04CNl-XH3V4aE9iH7avhOCO6cFXXiQWe1iTQEhArMCyh-gMjAltj7KXmJOUf3g-FCRsRVUxjwy46v42kR11Cimbm51PzfXpuO9jYmAtFB-oJnfQ5QELM1nzU8b27yIa2-gNEyVsJB3cPMPMLNp0Ax6JkDhjzQc1mNXl12Lmexnv5qHtn3Ap9QP2c2JMPgHU7CZXxGMxCg3pCRiOyzCOMp5lhpqzUeJjt-sEyaKKqThjeOdtbx1Sh3_1a6xM1zEX6kliw_m9FaYNl9kEMQok2ey0Exj75d27xUjTpoxW8swh3Htx5vSyUacdlk-FM9rw9_gpo-ELce4cytoc71qUFj2wqBu_ImsNe095speT1vNMnWi2bCm5N59IQ9c1zEMcjZy8AclFsEMKXpTgavlhFh2aF1-fC-IRf9Y0C2_q3tDlrOO5Pwa46eMmSUCgRSxA933OPUwFw-svZMwc1PwRHsM3lEqFlq-6edaqJMDPeanZ48aHw7ElBwvXqOdGV-ILhdDDMOgsZ3P08oqzKWZvGrrg7zpp2WUrUobaC-UXCLKXrAKo8Vmmf9GoWGcfi3EFOwUTEi9DvRr3kiaC9_IPPzk1Ij81UoFKiy8EbOsJy0) +Data engineers use one of the [IIngester](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/job_input.py#L112) methods to send data for ingestion. 
The way data is ingested is controlled by different ingestion plugins which implement one of three possible methods (hooks) +* **pre_ingest_process** - called before data is about to be ingested +* **ingest_payload** - does the actual ingestion (sending the data to remote store) +* **post_ingest_process** - called after data is ingested (or failed to ingest in case of error) + +[![vdk-ingest](https://github.com/vmware/versatile-data-kit/assets/2536458/a74582ef-eaaa-4693-91c4-41745212ad79)](https://www.plantuml.com/plantuml/uml/hPDFRzfC4CRl_XIZS843Yi8HnIWGb4DU3aYW5rMAP2tU0MzjppZxfnHL_UuTsm4KYvP3w-FCRsQUvrbuSbvP7yeYShcXIbbLWiFtW9GY_8X0lgcrV7ZcWYtC2fNcRJ7rR6TiDTfkQs5sk324DxfIs5iEf5lY2nO57nfaAOfCQYf5_igE3j1PiygFio9W5tjXybSjTEUdxq5Tkjsndr6awZhSpPLNyChREr0Evg_GQmQhoqMu-t_-7xn8ddXWcpTSNUcT57vYbqNO6uBl3mstVBY1ZLfiz6TiZiuRKjumjVpwmclHlrKEFvmiL8xt6sKnu-3m_etMcR5wM6yz0fAks91llIusqDjankEgv1oZICmF9usrCJX14zv-nTGdExQ9eJsw-dw_H9-nZlL5qY2AOvYw8wMPPPApq1VDs_DxWCyiAkq64CTHHEmH-1lQZqlF6QQv0pa2LUFMGSgqC_jWKOEXDtfyRAydbJeMh7HIMQmif-XSSlg5JoQHhAlrI-HZ428v3RLaVt06Hhy3_a9QcqgYSQT2uISJa9cs1hj0QHqJDFz9z6ZFIjPovBCtKI7LeJJbUSQmmYO-XFgL0KwzLjuqpOaF1UezbaZ-doJBpj-ALf0RsLubOlcY9_4Jok8N) + +Details about ingestion hooks that can be implemented [can be seen here](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L232). + +You can see an example of [ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) + +Ingestion hooks can be used for the following example use-cases (but not limited to them only): + +* **Data Validation (pre_ingest_process):** Plugin validates incoming data against a predefined schema or rules. For example, it verifies if all necessary fields in sales data are present and correctly formatted. 
+* **Data Transformation (pre_ingest_process):** Plugin transforms the data into the required format. For instance, it might convert product names to uppercase or generate new fields in sales data, or anonymize PII data. +* **Data Ingestion (ingest_payload):** The Plugin Destination pushes data to the final storage, managing connections to systems like Amazon S3, Google Cloud Storage, or SQL databases. +* **Data Auditing (post_ingest_process):** In the post-ingest phase, the plugin serves as a data auditing tool, generating reports detailing data volume, errors, and timestamps of the ingestion process. +* **Metadata Update (post_ingest_process):** A plugin updates a metadata repository with information about the ingested data, like source, time, volume, schema. ## Public interfaces From 2bea1d188a1deb8b2bc10404cfd404fcf0d023e8 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 17 Jul 2023 10:10:27 +0000 Subject: [PATCH 4/7] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- projects/vdk-plugins/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 6fc6ce4d3a..09b23c82cc 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -129,7 +129,7 @@ Check out the [connection hook spec](../vdk-core/src/vdk/api/plugin/connection_h ### Data Ingestion Cycle -Data engineers use one of the [IIngester](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/job_input.py#L112) methods to send data for ingestion. 
The way data is ingested is controlled by different ingestion plugins which implement one of three possible methods (hooks) +Data engineers use one of the [IIngester](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/job_input.py#L112) methods to send data for ingestion. The way data is ingested is controlled by different ingestion plugins which implement one of three possible methods (hooks) * **pre_ingest_process** - called before data is about to be ingested * **ingest_payload** - does the actual ingestion (sending the data to remote store) @@ -141,7 +141,7 @@ Details about ingestion hooks that can be implemented [can be seen here](https:/ You can see an example of [ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) -Ingestion hooks can be used for the following example use-cases (but not limited to them only): +Ingestion hooks can be used for the following example use-cases (but not limited to them only): * **Data Validation (pre_ingest_process):** Plugin validates incoming data against a predefined schema or rules. For example, it verifies if all necessary fields in sales data are present and correctly formatted. * **Data Transformation (pre_ingest_process):** Plugin transforms the data into the required format. For instance, it might convert product names to uppercase or generate new fields in sales data, or anonymize PII data. 
From 606f96d87df92f8c37bf0dbfd8e6d42b9396e126 Mon Sep 17 00:00:00 2001 From: Antoni Ivanov Date: Tue, 18 Jul 2023 11:12:00 +0300 Subject: [PATCH 5/7] Update projects/vdk-plugins/README.md Co-authored-by: Gabriel Georgiev <45939426+gageorgiev@users.noreply.github.com> --- projects/vdk-plugins/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 09b23c82cc..06a5a3836b 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -137,7 +137,7 @@ Data engineers use one of the [IIngester](https://github.com/vmware/versatile-da [![vdk-ingest](https://github.com/vmware/versatile-data-kit/assets/2536458/a74582ef-eaaa-4693-91c4-41745212ad79)](https://www.plantuml.com/plantuml/uml/hPDFRzfC4CRl_XIZS843Yi8HnIWGb4DU3aYW5rMAP2tU0MzjppZxfnHL_UuTsm4KYvP3w-FCRsQUvrbuSbvP7yeYShcXIbbLWiFtW9GY_8X0lgcrV7ZcWYtC2fNcRJ7rR6TiDTfkQs5sk324DxfIs5iEf5lY2nO57nfaAOfCQYf5_igE3j1PiygFio9W5tjXybSjTEUdxq5Tkjsndr6awZhSpPLNyChREr0Evg_GQmQhoqMu-t_-7xn8ddXWcpTSNUcT57vYbqNO6uBl3mstVBY1ZLfiz6TiZiuRKjumjVpwmclHlrKEFvmiL8xt6sKnu-3m_etMcR5wM6yz0fAks91llIusqDjankEgv1oZICmF9usrCJX14zv-nTGdExQ9eJsw-dw_H9-nZlL5qY2AOvYw8wMPPPApq1VDs_DxWCyiAkq64CTHHEmH-1lQZqlF6QQv0pa2LUFMGSgqC_jWKOEXDtfyRAydbJeMh7HIMQmif-XSSlg5JoQHhAlrI-HZ428v3RLaVt06Hhy3_a9QcqgYSQT2uISJa9cs1hj0QHqJDFz9z6ZFIjPovBCtKI7LeJJbUSQmmYO-XFgL0KwzLjuqpOaF1UezbaZ-doJBpj-ALf0RsLubOlcY9_4Jok8N) -Details about ingestion hooks that can be implemented [can be seen here](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L232). +Details about ingestion hooks [can be seen here](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L232). 
You can see an example of [ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) From bf5d112cfbbc39bc2cab9d4048b837d66833fb89 Mon Sep 17 00:00:00 2001 From: Antoni Ivanov Date: Tue, 18 Jul 2023 11:12:07 +0300 Subject: [PATCH 6/7] Update projects/vdk-plugins/README.md Co-authored-by: Gabriel Georgiev <45939426+gageorgiev@users.noreply.github.com> --- projects/vdk-plugins/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 06a5a3836b..70345036d6 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -139,7 +139,7 @@ Data engineers use one of the [IIngester](https://github.com/vmware/versatile-da Details about ingestion hooks [can be seen here](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L232). 
-You can see an example of [ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) +You can see an example of [an ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) Ingestion hooks can be used for the following example use-cases (but not limited to them only): From fb7f69d4995482c8c1ad87644c16055376242f08 Mon Sep 17 00:00:00 2001 From: Antoni Ivanov Date: Tue, 18 Jul 2023 11:12:13 +0300 Subject: [PATCH 7/7] Update projects/vdk-plugins/README.md Co-authored-by: Gabriel Georgiev <45939426+gageorgiev@users.noreply.github.com> --- projects/vdk-plugins/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/projects/vdk-plugins/README.md b/projects/vdk-plugins/README.md index 70345036d6..4a71f91dc2 100644 --- a/projects/vdk-plugins/README.md +++ b/projects/vdk-plugins/README.md @@ -141,7 +141,7 @@ Details about ingestion hooks [can be seen here](https://github.com/vmware/versa You can see an example of [an ingest plugin here](https://github.com/vmware/versatile-data-kit/blob/main/examples/ingest-and-anonymize-plugin/plugins/vdk-poc-anonymize/src/vdk/plugin/anonymize/anonymization_plugin.py) -Ingestion hooks can be used for the following example use-cases (but not limited to them only): +Ingestion hooks can be used for the following example use-cases (this is not an exhaustive list): * **Data Validation (pre_ingest_process):** Plugin validates incoming data against a predefined schema or rules. For example, it verifies if all necessary fields in sales data are present and correctly formatted. * **Data Transformation (pre_ingest_process):** Plugin transforms the data into the required format. 
For instance, it might convert product names to uppercase or generate new fields in sales data, or anonymize PII data.
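
The three hooks described in this patch series (`pre_ingest_process`, `ingest_payload`, `post_ingest_process`) can be illustrated with a small, self-contained sketch. It does not import VDK and is not VDK's implementation: the classes `UppercaseProductNames`, `InMemoryStore`, `AuditLog` and the function `run_ingestion_cycle` are hypothetical names invented for this illustration, and the hook signatures are deliberately simplified. The real signatures are the ones defined in the `plugin_input.py` file linked above.

```python
from typing import Any, Dict, List, Optional

Payload = List[Dict[str, Any]]


class UppercaseProductNames:
    """Hypothetical transformation plugin (runs before ingestion)."""

    def pre_ingest_process(self, payload: Payload) -> Payload:
        # Return a new payload; rows without a "product" field pass through unchanged.
        return [
            {**row, "product": row["product"].upper()} if "product" in row else row
            for row in payload
        ]


class InMemoryStore:
    """Hypothetical destination plugin; stands in for a remote store."""

    def __init__(self) -> None:
        self.rows: Payload = []

    def ingest_payload(self, payload: Payload) -> None:
        self.rows.extend(payload)


class AuditLog:
    """Hypothetical auditing plugin (runs after ingestion, even on failure)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def post_ingest_process(
        self, payload: Payload, exception: Optional[Exception] = None
    ) -> None:
        self.entries.append({"rows": len(payload), "failed": exception is not None})


def run_ingestion_cycle(payload: Payload, plugins: List[Any]) -> Payload:
    """Simplified driver: pre-process, ingest, then post-process."""
    # 1. Each pre_ingest_process hook may validate or transform the payload.
    for plugin in plugins:
        hook = getattr(plugin, "pre_ingest_process", None)
        if hook is not None:
            payload = hook(payload)

    # 2. ingest_payload hooks send the data to its destination.
    error: Optional[Exception] = None
    try:
        for plugin in plugins:
            hook = getattr(plugin, "ingest_payload", None)
            if hook is not None:
                hook(payload)
    except Exception as exc:  # remembered so post-processing can report it
        error = exc

    # 3. post_ingest_process hooks run whether or not ingestion succeeded.
    for plugin in plugins:
        hook = getattr(plugin, "post_ingest_process", None)
        if hook is not None:
            hook(payload, error)

    return payload


store = InMemoryStore()
audit = AuditLog()
run_ingestion_cycle(
    [{"product": "apple", "qty": 2}],
    [UppercaseProductNames(), store, audit],
)
```

Each plugin implements only the hook it cares about, which mirrors the split between validation or transformation, ingestion, and auditing in the use-case list. The driver catches ingestion errors only so that the post-ingest step can observe them; how VDK itself propagates such errors is not shown here.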