Drop Python 3.8, add Python 3.11 and 3.12, and use dd.from_map
#81
Changes from all commits: 238eab7, 18fd1f3, e37232a, aeef698, 4efec2b, faf4686, 4fddc4b, 8ddaa43
@@ -0,0 +1,17 @@
+name: test-environment
+channels:
+  - conda-forge
+  - nodefaults
+dependencies:
+  - python=3.12
+  - dask >=2024.3.0
+  - distributed
+  - pandas
+  - pyarrow
+  - pytest
+  - gcsfs
+  - google-cloud-bigquery>=2.11.0
+  - google-cloud-bigquery-storage
+  - pip
+  - pip:
+      - git+https://github.com/dask/dask
@@ -5,13 +5,11 @@
 from contextlib import contextmanager
 from functools import partial

+import dask.dataframe as dd
 import gcsfs
 import pandas as pd
 import pyarrow
-from dask.base import tokenize
-from dask.dataframe.core import new_dd_object
-from dask.highlevelgraph import HighLevelGraph
-from dask.layers import DataFrameIOLayer
 from google.api_core import client_info as rest_client_info
 from google.api_core import exceptions
 from google.api_core.gapic_v1 import client_info as grpc_client_info
@@ -206,19 +204,7 @@ def make_create_read_session_request():
     )
     meta = schema.empty_table().to_pandas(**arrow_options)

-    label = "read-gbq-"
-    output_name = label + tokenize(
-        project_id,
-        dataset_id,
-        table_id,
-        row_filter,
-        read_kwargs,
-    )
-
-    layer = DataFrameIOLayer(
-        output_name,
-        meta.columns,
-        [stream.name for stream in session.streams],
+    return dd.from_map(
Comment on the `dd.from_map` call: Could you pass the label into `from_map` to make the task prefix more descriptive?

Reply: Done
         partial(
             bigquery_read,
             make_create_read_session_request=make_create_read_session_request,
@@ -227,12 +213,10 @@ def make_create_read_session_request():
             arrow_options=arrow_options,
             credentials=credentials,
         ),
-        label=label,
-    )
-    divisions = tuple([None] * (len(session.streams) + 1))
-
-    graph = HighLevelGraph({output_name: layer}, {output_name: set()})
-    return new_dd_object(graph, output_name, meta, divisions)
+        [stream.name for stream in session.streams],
+        meta=meta,
+        label="read-bigquery",
+    )


 def to_gbq(
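For context on the change above: `dd.from_map` builds a Dask DataFrame by mapping a function over a sequence of inputs (here, one BigQuery read stream per partition), replacing the manual `DataFrameIOLayer` / `HighLevelGraph` / `new_dd_object` construction being removed. Below is a minimal, self-contained sketch of the pattern with toy inputs standing in for BigQuery streams; the reader function, column names, and values are invented for illustration.

```python
# Sketch of the dd.from_map pattern adopted above, using toy inputs instead of
# BigQuery read streams. The PR pins dask>=2024.3.0, which supports this API.
from functools import partial

import dask.dataframe as dd
import pandas as pd


def read_one(stream_name, scale=1):
    # Stand-in for bigquery_read: turns one input into one pandas partition.
    return pd.DataFrame({"stream": [stream_name], "value": [len(stream_name) * scale]})


streams = ["stream-a", "stream-b", "stream-c"]
meta = pd.DataFrame({"stream": pd.Series(dtype=object), "value": pd.Series(dtype="int64")})

ddf = dd.from_map(
    partial(read_one, scale=2),  # per-partition reader with fixed keyword arguments
    streams,                     # one task (partition) per input
    meta=meta,                   # expected schema of each partition
    label="read-bigquery",       # task-name prefix, matching the diff above
)

assert ddf.npartitions == len(streams)
print(ddf.compute())
```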
@@ -35,14 +35,13 @@ def df():
         for i in range(10)
     ]

-    yield pd.DataFrame(records)
+    df = pd.DataFrame(records)
+    df["timestamp"] = df["timestamp"].astype("datetime64[us, UTC]")
+    yield df
Comment on lines +38 to +40:

- I've added a […] I'll admit I'm a bit stumped here. Clearly this used to work in the past somehow.
- @tswast maybe you have a sense for any recent changes, or if I'm just wrong here.
- I started hitting similar issues at some point when I bumped pandas versions; could you be running a different version here than in the past?
- We noticed some weirdness around pandas 2.0 where we started getting microsecond precision back. BigQuery itself hasn't changed AFAIK. We should always respond with […]
- Yeah, pandas started supporting non-nanosecond resolution with 2.0.
- Hmm okay, thanks all for clarifying. I'm inclined to just go with the small test change here. We can always handle things in a follow-up PR as needed.
 @pytest.fixture(scope="module")
-def dataset():
-    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
-    if not project_id:
-        credentials, project_id = google.auth.default()
+def dataset(project_id):
     dataset_id = f"{sys.platform}_{uuid.uuid4().hex}"

     with bigquery.Client() as bq_client:
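A small illustration of the resolution issue discussed above (illustrative only, not part of the PR): with pandas 2.x, datetime columns can carry microsecond rather than nanosecond resolution, and BigQuery stores microsecond-precision timestamps, so the fixture now normalizes to `datetime64[us, UTC]` up front.

```python
# Illustrative only: pandas 2.x supports non-nanosecond datetime resolutions,
# so values that round-trip through a microsecond-precision store can come
# back with a different dtype than a default nanosecond-resolution frame.
import pandas as pd

s_ns = pd.Series(pd.to_datetime(["2024-01-01 12:00:00"], utc=True))  # datetime64[ns, UTC]
s_us = s_ns.astype("datetime64[us, UTC]")                            # datetime64[us, UTC]

print(s_ns.dtype, s_us.dtype)  # dtypes differ...
assert (s_ns == s_us).all()    # ...even though the values compare equal
```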
@@ -110,25 +109,30 @@ def required_partition_filter_table(dataset, df):
     yield project_id, dataset_id, table_id


+@pytest.fixture(scope="module")
+def project_id():
+    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
+    if not project_id:
+        _, project_id = google.auth.default()
+
+    yield project_id
+
+
 @pytest.fixture
 def google_creds():
-    env_creds_file = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
-    if env_creds_file:
-        credentials = json.load(open(env_creds_file))
-    else:
+    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
+        credentials = json.load(open(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")))
+    elif os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"):
+        credentials = json.loads(os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS"))
+    else:
         credentials, _ = google.auth.default()

     yield credentials
Comment on lines +112 to 130:

These changes keep the current testing setup behavior (i.e. support […]). This made it more straightforward for me to run things locally, while also not rocking the boat too much with our current CI setup.
 @pytest.fixture
-def bucket(google_creds):
-    project_id = google_creds["project_id"]
-    env_project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
-    if env_project_id:
-        project_id = env_project_id
-
+def bucket(google_creds, project_id):
     bucket = f"dask-bigquery-tmp-{uuid.uuid4().hex}"

     fs = gcsfs.GCSFileSystem(
         project=project_id, access="read_write", token=google_creds
     )
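As a usage note on the fixture changes above, here is a rough sketch of the same fallback order when calling `read_gbq` directly with explicit credentials. The helper name is hypothetical, and whether `read_gbq` expects a raw service-account mapping or a `google.auth` credentials object in a given case is an assumption here; the environment variable names come from the test setup above.

```python
# Hypothetical helper mirroring the google_creds fixture's fallback order:
# service-account file, then inline JSON, then application-default credentials.
import json
import os

import google.auth

from dask_bigquery import read_gbq


def load_test_credentials():
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path:
        with open(path) as f:
            return json.load(f)
    inline = os.environ.get("DASK_BIGQUERY_GCP_CREDENTIALS")
    if inline:
        return json.loads(inline)
    credentials, _ = google.auth.default()
    return credentials


# read_gbq accepts a credentials argument, as exercised in
# test_read_gbq_credentials further down; names here are placeholders.
ddf = read_gbq(
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="my_table",
    credentials=load_test_credentials(),
)
```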
@@ -140,12 +144,7 @@ def bucket(google_creds):
 @pytest.fixture
-def write_dataset(google_creds):
-    project_id = google_creds["project_id"]
-    env_project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID")
-    if env_project_id:
-        project_id = env_project_id
-
+def write_dataset(google_creds, project_id):
     dataset_id = f"{sys.platform}_{uuid.uuid4().hex}"

     yield google_creds, project_id, dataset_id, None
@@ -158,8 +157,7 @@ def write_dataset(google_creds):
 @pytest.fixture
-def write_existing_dataset(google_creds):
-    project_id = os.environ.get("DASK_BIGQUERY_PROJECT_ID", google_creds["project_id"])
+def write_existing_dataset(google_creds, project_id):
     dataset_id = "persistent_dataset"
     table_id = f"table_to_write_{sys.platform}_{uuid.uuid4().hex}"
@@ -181,7 +179,7 @@ def write_existing_dataset(google_creds):
     [
         ("name", pa.string()),
         ("number", pa.uint8()),
-        ("timestamp", pa.timestamp("ns")),
+        ("timestamp", pa.timestamp("us")),
Comment: Corresponding change given the […]
         ("idx", pa.uint8()),
     ]
 ),
@@ -285,14 +283,14 @@ def test_roundtrip(df, dataset_fixture, request):

     ddf_out = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)
     # bigquery does not guarantee ordering, so let's reindex
-    assert_eq(ddf.set_index("idx"), ddf_out.set_index("idx"))
+    assert_eq(ddf.set_index("idx"), ddf_out.set_index("idx"), check_divisions=False)


 def test_read_gbq(df, table, client):
     project_id, dataset_id, table_id = table
     ddf = read_gbq(project_id=project_id, dataset_id=dataset_id, table_id=table_id)

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)
Comment: Just generalizing a bit to make things more resilient to changing column names in the test DataFrame.
     assert assert_eq(ddf.set_index("idx"), df.set_index("idx"))
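On the `check_divisions=False` added in `test_roundtrip` above: a DataFrame produced via `dd.from_map` has unknown divisions, so a strict divisions comparison in `assert_eq` can fail even when the data itself round-trips correctly. A minimal sketch (the partition function and column names are invented):

```python
# Sketch only: from_map-built frames report unknown divisions.
import dask.dataframe as dd
import pandas as pd


def make_part(i):
    return pd.DataFrame({"idx": [i], "value": [i * 10]})


ddf = dd.from_map(make_part, range(4), label="example")
print(ddf.divisions)  # (None, None, None, None, None): divisions are unknown
```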
@@ -305,7 +303,7 @@ def test_read_row_filter(df, table, client):
         row_filter="idx < 5",
     )

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)
     assert assert_eq(ddf.set_index("idx").loc[:4], df.set_index("idx").loc[:4])
@@ -361,7 +359,7 @@ def test_read_gbq_credentials(df, dataset_fixture, request, monkeypatch):
         credentials=credentials,
     )

-    assert list(ddf.columns) == ["name", "number", "timestamp", "idx"]
+    assert list(df.columns) == list(ddf.columns)
     assert assert_eq(ddf.set_index("idx"), df.set_index("idx"))
@@ -1,4 +1,4 @@
-dask
+dask>=2024.3.0
 gcsfs
 google-cloud-bigquery >= 2.11.0
 google-cloud-bigquery-storage
Comment: This is because we need dask/dask#11233 for some tests to pass. A little unfortunate. Maybe we can change tests to avoid the timezone issue.

Reply: We can also remove it again after the release tomorrow, so it shouldn't be an issue.