[FEATURE] Add requirements-dev-lite.txt and update tests/docs #4273
Conversation
Also update `get_dataset` and `build_sa_validator_with_data` funcs in `self_check/util.py` to not blow up if various SQL dialects are not installed/available.
"url": "postgresql+psycopg2://username:password@host:65432/database", | ||
}, | ||
) | ||
if is_library_loadable(library_name="psycopg2"): |
The diff looks funky, but it's just moving that chunk of code under this conditional statement
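For readers skimming the diff, the net effect is roughly the following. This is a minimal sketch: the `datasource_config` name and the `great_expectations.util` import path are assumptions for illustration, not the exact code in the PR.

```python
from great_expectations.util import is_library_loadable  # assumed import path

# Only exercise the postgresql+psycopg2 case when the psycopg2 driver can
# actually be imported in the test environment.
if is_library_loadable(library_name="psycopg2"):
    datasource_config = {  # illustrative name for the chunk that moved
        "url": "postgresql+psycopg2://username:password@host:65432/database",
    }
    # ... the rest of the moved block runs unchanged under this guard ...
```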
Thanks for turning this around so quickly, @kenwade4. Something like this will be a huge help for new contributors to get started with far less friction.
I imagine that @cdkini and @donaldheppner will have feedback on the specific implementation.
- I'd suggest we find people to friction log these instructions across OSes (e.g. Mac, Windows, Ubuntu Linux)
- Friction logging in both conda and virtualenv setups would be good, too.
- I'll friction log it myself as soon as I get a chance, but that likely won't be until next week.
- If I read this right, developers who follow these instructions will be able to test against pandas and sqlite, but not other dialects of SQL, and not Spark. Is that correct? If so, we should make that clear in documentation, and provide steps for developers to progressively add Spark and other dialects to their test matrices.
Other comments inline.
@@ -385,7 +385,7 @@ def get_dataset(
return PandasDataset(df, profiler=profiler, caching=caching)

elif dataset_type == "sqlite":
-    if not create_engine:
+    if not create_engine or not SQLITE_TYPES:
It scares me that we need to make these kinds of changes.
This reinforces my view that we should not be relying on imports to determine which backends to test against. Instead, I would strongly advocate for an explicit config file for this.
When imports fail for those dialects, we are setting the `*_TYPES` dicts to `{}`, but later in this function we are just trying to access them with square bracket notation (not `.get()`). There were already bailout points at `if not create_engine`, and I thought it made sense to bail out at that point if those dicts were empty.
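A minimal sketch of the pattern described above, assuming a simplified `SQLITE_TYPES` map and a much-reduced `get_dataset` signature (the real code in `self_check/util.py` is larger):

```python
# Sketch only: dialect type maps fall back to an empty dict when the import
# fails, and get_dataset bails out early instead of hitting a KeyError later.
try:
    import sqlalchemy.dialects.sqlite as sqlitetypes
    from sqlalchemy import create_engine

    SQLITE_TYPES = {
        "VARCHAR": sqlitetypes.VARCHAR,
        "INTEGER": sqlitetypes.INTEGER,
    }
except ImportError:
    sqlitetypes = None
    create_engine = None
    SQLITE_TYPES = {}  # empty when the dialect import fails


def get_dataset(dataset_type: str, schema: dict):
    if dataset_type == "sqlite":
        # Bail out here if sqlalchemy (or the sqlite type map) is unavailable,
        # instead of raising a KeyError on SQLITE_TYPES[...] further down.
        if not create_engine or not SQLITE_TYPES:
            raise ValueError(f"Unable to create {dataset_type} dataset")
        return {col: SQLITE_TYPES[type_name] for col, type_name in schema.items()}
    raise ValueError(f"Unknown dataset_type: {dataset_type}")
```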
- postgresql: `psycopg2-binary>=2.7.6`
- mysql: `PyMySQL>=0.9.3,<0.10`
- mssql: `pyodbc>=4.0.30` (see step 6 for links on getting the odbc driver on your system first)
- athena: `pyathena>=1.11`
- bigquery: `sqlalchemy-bigquery>=1.3.0 google-cloud-secret-manager>=1.0.0 google-cloud-storage>=1.28.0`
- snowflake: `snowflake-connector-python==2.5.0 snowflake-sqlalchemy>=1.2.3 azure-storage-blob>=12.5.0`
- redshift: `sqlalchemy-redshift>=0.7.7`
- teradata: `teradatasqlalchemy==17.0.0.1`
- oracle: `cx_Oracle`
- dremio: `sqlalchemy-dremio>=1.2.1 pyarrow>=0.12.0 pyodbc>=4.0.30`
One small suggestion for this @kenwade4: would it be possible to have a reference here to the appropriate `requirements.txt` file? I imagine this section becoming really quickly out of date otherwise. I'm imagining something similar to what we do with python scripts in our how-to-guides:

```python file=../../../../tests/integration/docusaurus/connecting_to_your_data/database/mssql_yaml_example.py#L3-L6
```
The requirements listed reference different requirements files and combine them in some instances (i.e. stuff from requirements-dev-base.txt and requirements-dev-sqlalchemy.txt for the bigquery line). The oracle requirement isn't in any of the files since we aren't testing on oracle DBs in CI/CD.
I'd be more inclined to not include that list of individual dialects at all and tell them to "try setting up the full dev environment (as mentioned in step 6) when you're ready for more robust testing of your custom Expectations" or something like that.
I'm pretty sure @donaldheppner is opposed to splitting up the requirements-dev-*.txt files into things like `requirements-dev-postgresql.txt` etc. for dialects (although `requirements-dev-spark.txt` only has a single line). If we were to split them up, THEN it would make sense to link to specific lines of files with `python file=../../../blah`.
Ok @kenwade4, that makes sense. I guess I keep coming back to the thought of the list becoming out of date. As another idea, would it be possible to add a (preferably loud) note to the `requirements` files as a reminder to update the `contributing_setup` doc if anything changes?
Really nice work - just a few questions I'd love to talk through before approving.
@@ -132,6 +158,8 @@ Depending on which features of Great Expectations you want to work on, you may w
* Caution: If another service is using port 3306, Docker may start the container but silently fail to set up the port.

> If you have a Silicon Mac (M1) this Docker image does not work
Nitpick - maybe we should use a `:::warning` tag here?
Not familiar with that syntax in markdown. With whatever theme we are using with docusaurus, inline comments like this have a bold yellow background that stands out pretty well.
Would it just become `::warning If you have a Silicon...`??
try:
    assert (
        PasswordMasker.mask_db_url(
            f"postgresql://scott:tiger@{db_hostname}:65432/mydatabase"
        )
        == f"postgresql://scott:***@{db_hostname}:65432/mydatabase"
Instead of this try/except, should we use pytest decorators to skip?
That's following the already-existing logic that was there. I'm a fan of `@pytest.mark.skipif`, which I added in other places.
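For comparison, the decorator-based approach mentioned here would look roughly like this. It's a sketch: the test name, reason string, and import paths are assumptions, though `PasswordMasker.mask_db_url` and `is_library_loadable` both appear in the diff and discussion above.

```python
import pytest

from great_expectations.core.util import PasswordMasker  # assumed location
from great_expectations.util import is_library_loadable  # assumed location


@pytest.mark.skipif(
    not is_library_loadable(library_name="psycopg2"),
    reason="psycopg2 is not installed in this test environment",
)
def test_mask_db_url_postgresql_psycopg2():
    db_hostname = "localhost"  # the real test derives this from the environment
    assert (
        PasswordMasker.mask_db_url(
            f"postgresql+psycopg2://scott:tiger@{db_hostname}:65432/mydatabase"
        )
        == f"postgresql+psycopg2://scott:***@{db_hostname}:65432/mydatabase"
    )
```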
# psycopg2 (if installed in test environment)
try:
    assert (
        PasswordMasker.mask_db_url(
            f"postgresql+psycopg2://scott:tiger@{db_hostname}:65432/mydatabase"
        )
        == f"postgresql+psycopg2://scott:***@{db_hostname}:65432/mydatabase"
Same here and throughout
@@ -15,6 +15,7 @@
    list_gcs_keys,
    map_batch_definition_to_data_reference_string_using_regex,
    map_data_reference_string_to_batch_definition_list_using_regex,
    storage,
Should we be importing this import from the source code, or should we try to import GCS directly here?
I purposely tried to import `storage` from the 3 different source files since those were the places where they were failing. The test skip reasons also reflect where `storage` failed to be imported.
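A sketch of the approach being described: `storage` is re-exported from the module under test (where it is set to `None` if `google-cloud-storage` fails to import), and the test skips based on that. The module path, test name, and reason string here are illustrative.

```python
import pytest

# `storage` comes from the module under test; it is None there when
# `from google.cloud import storage` raised ImportError at import time.
from great_expectations.datasource.data_connector.util import (  # illustrative path
    list_gcs_keys,
    storage,
)


@pytest.mark.skipif(
    storage is None,
    reason="google-cloud-storage could not be imported in data_connector.util",
)
def test_list_gcs_keys_smoke():
    # ... exercise list_gcs_keys against a mocked GCS client ...
    ...
```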
try:
    dialect_classes["sqlite"] = sqlitetypes.dialect
    dialect_types["sqlite"] = SQLITE_TYPES
except AttributeError:
    pass
try:
    dialect_classes["postgresql"] = postgresqltypes.dialect
    dialect_types["postgresql"] = POSTGRESQL_TYPES
except AttributeError:
    pass
try:
    dialect_classes["mysql"] = mysqltypes.dialect
    dialect_types["mysql"] = MYSQL_TYPES
except AttributeError:
    pass
try:
    dialect_classes["mssql"] = mssqltypes.dialect
    dialect_types["mssql"] = MSSQL_TYPES
except AttributeError:
    pass
try:
    dialect_classes["bigquery"] = sqla_bigquery.BigQueryDialect
    dialect_types["bigquery"] = BIGQUERY_TYPES
except AttributeError:
    pass
Could we maybe for-loop this? And perhaps add a logging statement?
dialects: List[Tuple[str, Any, dict]] = [
    ("sqlite", sqlitetypes.dialect, SQLITE_TYPES),
    ...
]
for name, class_, type_ in dialects:
    try:
        dialect_classes[name] = class_
        dialect_types[name] = type_
    except AttributeError:
        logging.debug(...)
What's the `AttributeError` here? I'm not immediately seeing how that can be triggered.
`NoneType` has no attribute whatever. A lot of the try/except ImportError blocks just end up setting whatever failed to import to None.
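In other words, the failure mode is roughly this (a sketch using the sqlite dialect as the example):

```python
try:
    import sqlalchemy.dialects.sqlite as sqlitetypes

    SQLITE_TYPES = {"VARCHAR": sqlitetypes.VARCHAR}
except ImportError:
    # The import guard leaves the module name bound to None...
    sqlitetypes = None
    SQLITE_TYPES = {}

dialect_classes = {}
dialect_types = {}

try:
    # ...so when sqlite isn't importable, this attribute access raises
    # AttributeError: 'NoneType' object has no attribute 'dialect'.
    dialect_classes["sqlite"] = sqlitetypes.dialect
    dialect_types["sqlite"] = SQLITE_TYPES
except AttributeError:
    pass
```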
To make it as easy as possible for potential contributors to get a dev environment set up, provide a `requirements-dev-lite.txt` file with the fewest dependencies needed.

Changes proposed in this pull request:

- Update `get_dataset` and `build_sa_validator_with_data` funcs to not blow up if various SQL dialects are not installed/available
Definition of Done