From 4450ce69f0e83cd18b5824c81a6c28230f0b54e9 Mon Sep 17 00:00:00 2001 From: Yassine Alouini Date: Mon, 27 Feb 2023 10:42:00 +0100 Subject: [PATCH] Make the SQLQueryDataSet compatible with mssql. (#101) * [kedro-docker] Layers size optimization (#92) * [kedro-docker] Layers size optimization Signed-off-by: Mariusz Strzelecki * Adjust test requirements Signed-off-by: Mariusz Strzelecki * Skip coverage check on tests dir (some do not execute on Windows) Signed-off-by: Mariusz Strzelecki * Update .coveragerc with the setup Signed-off-by: Mariusz Strzelecki * Fix bandit so it does not scan kedro-datasets Signed-off-by: Mariusz Strzelecki * Fixed existence test Signed-off-by: Mariusz Strzelecki * Check why dir is not created Signed-off-by: Mariusz Strzelecki * Kedro starters are fixed now Signed-off-by: Mariusz Strzelecki * Increased no-output-timeout for long spark image build Signed-off-by: Mariusz Strzelecki * Spark image optimized Signed-off-by: Mariusz Strzelecki * Linting Signed-off-by: Mariusz Strzelecki * Switch to slim image always Signed-off-by: Mariusz Strzelecki * Trigger build Signed-off-by: Mariusz Strzelecki * Use textwrap.dedent for nicer indentation Signed-off-by: Mariusz Strzelecki * Revert "Use textwrap.dedent for nicer indentation" This reverts commit 3a1e3f855a29c6a1b118db3e844e5f9b67ade363. Signed-off-by: Mariusz Strzelecki * Revert "Revert "Use textwrap.dedent for nicer indentation"" This reverts commit d322d353b25d414cdfdef8ee12185e5a1d9baa2c. 
Signed-off-by: Mariusz Strzelecki * Make tests read more lines (to skip all deprecation warnings) Signed-off-by: Mariusz Strzelecki Signed-off-by: Mariusz Strzelecki Signed-off-by: Mariusz Strzelecki Signed-off-by: Yassine Alouini * Release Kedro-Docker 0.3.1 (#94) * Add release notes for kedro-docker 0.3.1 Signed-off-by: Jannic Holzer * Update version in kedro_docker module Signed-off-by: Jannic Holzer Signed-off-by: Jannic Holzer Signed-off-by: Yassine Alouini * Bump version and update release notes (#96) Signed-off-by: Merel Theisen Signed-off-by: Yassine Alouini * Make the SQLQueryDataSet compatible with mssql. Signed-off-by: Yassine Alouini * Add one test + update RELEASE.md. Signed-off-by: Yassine Alouini * Add missing pyodbc for tests. Signed-off-by: Yassine Alouini * Mock connection as well. Signed-off-by: Yassine Alouini * Add more dates parsing for mssql backend (thanks to fgaudindelrieu@idmog.com) Signed-off-by: Yassine Alouini * Fix an error in docstring of MetricsDataSet (#98) Signed-off-by: Yassine Alouini * Bump relax pyarrow version to work the same way as Pandas (#100) * Bump relax pyarrow version to work the same way as Pandas We only use PyArrow for `pandas.ParquetDataSet` as such I suggest we keep our versions pinned to the same range as [Pandas does](https://github.com/pandas-dev/pandas/blob/96fc51f5ec678394373e2c779ccff37ddb966e75/pyproject.toml#L100) for the same reason. As such I suggest we remove the upper bound as we have users requesting later versions in [support channels](https://kedro-org.slack.com/archives/C03RKP2LW64/p1674040509133529) * Updated release notes Signed-off-by: Yassine Alouini * Add missing type in catalog example. Signed-off-by: Yassine Alouini * Add one more unit tests for adapt_mssql. Signed-off-by: Yassine Alouini * [FIX] Add missing mocker from date test. Signed-off-by: Yassine Alouini * [TEST] Add a wrong input test. Signed-off-by: Yassine Alouini * Add pyodbc dependency. 
Signed-off-by: Yassine Alouini * [FIX] Remove dict() in tests. Signed-off-by: Yassine Alouini * Change check to check on plugin name (#103) Signed-off-by: Merel Theisen Signed-off-by: Yassine Alouini * Set coverage in pyproject.toml (#105) Signed-off-by: Merel Theisen Signed-off-by: Yassine Alouini * Move coverage settings to pyproject.toml (#106) Signed-off-by: Merel Theisen Signed-off-by: Yassine Alouini * Replace kedro.pipeline with modular_pipeline.pipeline factory (#99) * Add non-spark related test changes Replace kedro.pipeline.Pipeline with kedro.pipeline.modular_pipeline.pipeline factory. This is for symmetry with changes made to the main kedro library. Signed-off-by: Adam Farley Signed-off-by: Yassine Alouini * Fix outdated links in Kedro Datasets (#111) * fix links * fix dill links Signed-off-by: Yassine Alouini * Fix docs formatting and phrasing for some datasets (#107) * Fix docs formatting and phrasing for some datasets Signed-off-by: Deepyaman Datta * Manually fix files not resolved with patch command Signed-off-by: Deepyaman Datta * Apply fix from #98 Signed-off-by: Deepyaman Datta --------- Signed-off-by: Deepyaman Datta Signed-off-by: Yassine Alouini * Release `kedro-datasets` `version 1.0.2` (#112) * bump version and update release notes * fix pylint errors Signed-off-by: Yassine Alouini * Bump pytest to 7.2 (#113) Signed-off-by: Merel Theisen Signed-off-by: Yassine Alouini * Prefix Docker plugin name with "Kedro-" in usage message (#57) * Prefix Docker plugin name with "Kedro-" in usage message Signed-off-by: Deepyaman Datta Signed-off-by: Yassine Alouini * Keep Kedro-Docker plugin docstring from appearing in `kedro -h` (#56) * Keep Kedro-Docker plugin docstring from appearing in `kedro -h` Signed-off-by: Deepyaman Datta Signed-off-by: Yassine Alouini * [kedro-datasets ] Add `Polars.CSVDataSet` (#95) Signed-off-by: wmoreiraa Signed-off-by: Yassine Alouini * Remove deprecated `test_requires` from `setup.py` in Kedro-Docker (#54) Signed-off-by: 
Deepyaman Datta Signed-off-by: Yassine Alouini * [FIX] Fix ds to data_set. Signed-off-by: Yassine Alouini --------- Signed-off-by: Mariusz Strzelecki Signed-off-by: Mariusz Strzelecki Signed-off-by: Yassine Alouini Signed-off-by: Jannic Holzer Signed-off-by: Merel Theisen Signed-off-by: Deepyaman Datta Co-authored-by: Mariusz Strzelecki Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Co-authored-by: OKA Naoya Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com> Co-authored-by: adamfrly <45516720+adamfrly@users.noreply.github.com> Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com> Co-authored-by: Deepyaman Datta Co-authored-by: Walber Moreira <58264877+wmoreiraa@users.noreply.github.com> --- kedro-datasets/RELEASE.md | 2 +- .../kedro_datasets/pandas/sql_dataset.py | 69 +++++++++++++++++++ kedro-datasets/setup.py | 2 +- kedro-datasets/test_requirements.txt | 1 + .../tests/pandas/test_sql_dataset.py | 54 +++++++++++++++ 5 files changed, 126 insertions(+), 2 deletions(-) diff --git a/kedro-datasets/RELEASE.md b/kedro-datasets/RELEASE.md index 3b51df818..412fe9f9c 100644 --- a/kedro-datasets/RELEASE.md +++ b/kedro-datasets/RELEASE.md @@ -11,7 +11,7 @@ | `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lighting fast dataframe package built entirely using Rust. | `kedro_datasets.polars` | ## Bug fixes and other changes - +* Add `mssql` backend to the `SQLQueryDataSet` DataSet using `pyodbc` library. 
# Release 1.0.2: diff --git a/kedro-datasets/kedro_datasets/pandas/sql_dataset.py b/kedro-datasets/kedro_datasets/pandas/sql_dataset.py index 1400e4981..dd5d636a1 100644 --- a/kedro-datasets/kedro_datasets/pandas/sql_dataset.py +++ b/kedro-datasets/kedro_datasets/pandas/sql_dataset.py @@ -1,6 +1,7 @@ """``SQLDataSet`` to load and save data to a SQL backend.""" import copy +import datetime as dt import re from pathlib import PurePosixPath from typing import Any, Dict, NoReturn, Optional @@ -22,6 +23,7 @@ "psycopg2": "psycopg2", "mysqldb": "mysqlclient", "cx_Oracle": "cx_Oracle", + "mssql": "pyodbc", } DRIVER_ERROR_MESSAGE = """ @@ -321,7 +323,49 @@ class SQLQueryDataSet(AbstractDataSet[None, pd.DataFrame]): >>> credentials=credentials) >>> >>> sql_data = data_set.load() + >>> + Example of usage for mssql: + :: + + + >>> credentials = {"server": "localhost", "port": "1433", + >>> "database": "TestDB", "user": "SA", + >>> "password": "StrongPassword"} + >>> def _make_mssql_connection_str( + >>> server: str, port: str, database: str, user: str, password: str + >>> ) -> str: + >>> import pyodbc # noqa + >>> from sqlalchemy.engine import URL # noqa + >>> + >>> driver = pyodbc.drivers()[-1] + >>> connection_str = (f"DRIVER={driver};SERVER={server},{port};DATABASE={database};" + >>> f"ENCRYPT=yes;UID={user};PWD={password};" + >>> "TrustServerCertificate=yes;") + >>> return URL.create("mssql+pyodbc", query={"odbc_connect": connection_str}) + >>> connection_str = _make_mssql_connection_str(**credentials) + >>> data_set = SQLQueryDataSet(credentials={"con": connection_str}, + >>> sql="SELECT TOP 5 * FROM TestTable;") + >>> df = data_set.load() + + In addition, here is an example of a catalog with dates parsing: + :: + + >>> mssql_dataset: + >>> type: kedro_datasets.pandas.SQLQueryDataSet + >>> credentials: mssql_credentials + >>> sql: > + >>> SELECT * + >>> FROM DateTable + >>> WHERE date >= ? AND date <= ? 
+ >>> ORDER BY date + >>> load_args: + >>> params: + >>> - ${begin} + >>> - ${end} + >>> index_col: date + >>> parse_dates: + >>> date: "%Y-%m-%d %H:%M:%S.%f0 %z" """ # using Any because of Sphinx but it should be @@ -413,6 +457,8 @@ def __init__( # pylint: disable=too-many-arguments self._connection_str = credentials["con"] self._execution_options = execution_options or {} self.create_connection(self._connection_str) + if "mssql" in self._connection_str: + self.adapt_mssql_date_params() @classmethod def create_connection(cls, connection_str: str) -> None: @@ -456,3 +502,26 @@ def _load(self) -> pd.DataFrame: def _save(self, data: None) -> NoReturn: raise DataSetError("'save' is not supported on SQLQueryDataSet") + + # For mssql only + def adapt_mssql_date_params(self) -> None: + """We need to change the format of datetime parameters. + MSSQL expects datetime in the exact format %Y-%m-%dT%H:%M:%S. + Here, we also accept plain dates. + `pyodbc` does not accept named parameters; they must be provided as a list.""" + params = self._load_args.get("params", []) + if not isinstance(params, list): + raise DataSetError( + "Unrecognized `params` format. 
It can be only a `list`, " + f"got {type(params)!r}" + ) + new_load_args = [] + for value in params: + try: + as_date = dt.date.fromisoformat(value) + new_val = dt.datetime.combine(as_date, dt.time.min) + new_load_args.append(new_val.strftime("%Y-%m-%dT%H:%M:%S")) + except (TypeError, ValueError): + new_load_args.append(value) + if new_load_args: + self._load_args["params"] = new_load_args diff --git a/kedro-datasets/setup.py b/kedro-datasets/setup.py index 9effe1fca..e054e17e8 100644 --- a/kedro-datasets/setup.py +++ b/kedro-datasets/setup.py @@ -58,7 +58,7 @@ def _collect_requirements(requires): "pandas.JSONDataSet": [PANDAS], "pandas.ParquetDataSet": [PANDAS, "pyarrow>=6.0"], "pandas.SQLTableDataSet": [PANDAS, "SQLAlchemy~=1.2"], - "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2"], + "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2", "pyodbc~=4.0"], "pandas.XMLDataSet": [PANDAS, "lxml~=4.6"], "pandas.GenericDataSet": [PANDAS], } diff --git a/kedro-datasets/test_requirements.txt b/kedro-datasets/test_requirements.txt index 8dec3619b..2b742b751 100644 --- a/kedro-datasets/test_requirements.txt +++ b/kedro-datasets/test_requirements.txt @@ -38,6 +38,7 @@ pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2. 
psutil==5.8.0 pyarrow>=1.0, <7.0 pylint>=2.5.2, <3.0 +pyodbc~=4.0.35 pyproj~=3.0 pyspark>=2.2, <4.0 pytest-cov~=3.0 diff --git a/kedro-datasets/tests/pandas/test_sql_dataset.py b/kedro-datasets/tests/pandas/test_sql_dataset.py index a1c6839d6..aa9fe8d17 100644 --- a/kedro-datasets/tests/pandas/test_sql_dataset.py +++ b/kedro-datasets/tests/pandas/test_sql_dataset.py @@ -11,6 +11,7 @@ TABLE_NAME = "table_a" CONNECTION = "sqlite:///kedro.db" +MSSQL_CONNECTION = "mssql+pyodbc://?odbc_connect=DRIVER%3DODBC+Driver+for+SQL" SQL_QUERY = "SELECT * FROM table_a" EXECUTION_OPTIONS = {"stream_results": True} FAKE_CONN_STR = "some_sql://scott:tiger@localhost/foo" @@ -417,3 +418,56 @@ def test_create_connection_only_once(self, mocker): assert mock_engine.call_count == 2 assert fourth.engines == first.engines assert len(first.engines) == 2 + + def test_adapt_mssql_date_params_called(self, mocker): + """Test that the adapt_mssql_date_params + function is called when mssql backend is used. + """ + mock_adapt_mssql_date_params = mocker.patch( + "kedro_datasets.pandas.sql_dataset.SQLQueryDataSet.adapt_mssql_date_params" + ) + mock_engine = mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine") + ds = SQLQueryDataSet(sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION}) + mock_engine.assert_called_once_with(MSSQL_CONNECTION) + assert mock_adapt_mssql_date_params.call_count == 1 + assert len(ds.engines) == 1 + + def test_adapt_mssql_date_params(self, mocker): + """Test that the adapt_mssql_date_params + function transforms the params as expected, i.e. + making datetime date into the format %Y-%m-%dT%H:%M:%S + and ignoring the other values. 
+ """ + mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine") + load_args = { + "params": ["2023-01-01", "2023-01-01T20:26", "2023", "test", 1.0, 100] + } + ds = SQLQueryDataSet( + sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION}, load_args=load_args + ) + assert ds._load_args["params"] == [ + "2023-01-01T00:00:00", + "2023-01-01T20:26", + "2023", + "test", + 1.0, + 100, + ] + + def test_adapt_mssql_date_params_wrong_input(self, mocker): + """Test that the adapt_mssql_date_params + function fails with the correct error message + when given a wrong input + """ + mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine") + load_args = {"params": {"value": 1000}} + pattern = ( + "Unrecognized `params` format. It can be only a `list`, " + "got <class 'dict'>" + ) + with pytest.raises(DataSetError, match=pattern): + SQLQueryDataSet( + sql=SQL_QUERY, + credentials={"con": MSSQL_CONNECTION}, + load_args=load_args, + )
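
Note on the date handling in `adapt_mssql_date_params`: the conversion rule can be tried without Kedro or a database. The sketch below is a hedged, standalone restatement of that rule and is not part of the patch. The function name `normalize_date_params` is hypothetical, and a plain `TypeError` stands in for `DataSetError` so the snippet has no Kedro dependency. Only strings that `datetime.date.fromisoformat` accepts as plain dates are rewritten to midnight in `%Y-%m-%dT%H:%M:%S`; every other value passes through unchanged.

```python
import datetime as dt
from typing import Any, List


def normalize_date_params(params: List[Any]) -> List[Any]:
    """Hypothetical standalone version of the rule in adapt_mssql_date_params."""
    if not isinstance(params, list):
        # The patch raises DataSetError here; TypeError is an assumption
        # made so that this sketch runs without Kedro installed.
        raise TypeError(f"`params` must be a list, got {type(params)!r}")

    normalized = []
    for value in params:
        try:
            # Succeeds only for plain ISO dates such as "2023-01-01".
            as_date = dt.date.fromisoformat(value)
        except (TypeError, ValueError):
            # Non-strings and strings that are not plain dates are kept as-is.
            normalized.append(value)
            continue
        # Promote the date to midnight and format it as the patch does.
        midnight = dt.datetime.combine(as_date, dt.time.min)
        normalized.append(midnight.strftime("%Y-%m-%dT%H:%M:%S"))
    return normalized


if __name__ == "__main__":
    print(normalize_date_params(["2023-01-01", "2023-01-01T20:26", "test", 100]))
```

The input used here mirrors the one in `test_adapt_mssql_date_params`. A string that already carries a time component, such as `"2023-01-01T20:26"`, is left untouched, so callers remain responsible for passing full datetimes in a form the driver accepts.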