Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make databricks sql hook return a namedtuple #36205

Merged

Conversation

Joffreybvn
Copy link
Contributor

@Joffreybvn Joffreybvn commented Dec 13, 2023

This is a fix proposal for #36161 (review)

The ADR for Airflow's DB API specifies that DbApiHooks needs to return a tuple / list of tuples. The databricks hook by defaults returns a dict-like databricks.Row object which we now transform into a new nametuple - which is compliant with the 'common data structure' specification.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@potiuk
Copy link
Member

potiuk commented Dec 13, 2023

@Joffreybvn -> do you have more things to add to it or you think it's ready to review ?

@Joffreybvn Joffreybvn marked this pull request as ready for review December 13, 2023 13:12
Copy link
Contributor

@phanikumv phanikumv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to address #36205 (comment)

Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the discussions around this earlier I'm a bit picky on the naming. We are not making something JSON serializable - but according to the earlier discussion we make the SQL hook return the specified NamedTuple format.

I wonder btw if we should not have a typedef for it if this is part the the DbAPI. That would make things conceptually stronger and less ambiguous.

If you insist on naming it something serializable please leave out JSON. It has nothing to do with it.

Can we further be clear on what this PR wants to achieve in the commit message? The title and message are now confusing.

My suggestion:

Return SerializableRow in databricks sql hook

The ADR for Airflow' s DB API specifies it needs to return a 
named tuple SerialzableRow or a list of them.  The databricks
hook by defaults returns a dict which we now transform to a
SerializableRow.

@bolkedebruin bolkedebruin changed the title Databricks sql hook returns json-serializable namedtuple Make databricks sql hook return a serializable namedtuple Dec 13, 2023
@Joffreybvn Joffreybvn requested a review from eladkal as a code owner December 13, 2023 20:39
@Joffreybvn Joffreybvn force-pushed the fix/make-databricks-returns-namedtuples branch from cffce3e to cde35bb Compare December 13, 2023 21:38
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way it is now.

@potiuk potiuk requested a review from bolkedebruin December 14, 2023 14:27
@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

@bolkedebruin @phanikumv -> please take a look and see what you think. I am all for this direction

@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

And I created #36224 and assigned to myself and will shortly take a close look at this - to make sure the common.sql API is not only defined but also enforced by the pre-commit.

When I started it last year and defined .pyi for common.sql to experiment with it, there was not much interest but seems that it's a good idea to finish the work as it causes contentions and issues like this one.

Also I hope it will become a golden standard for other common interfaces exposed to our providers where ADRs and static check enforecement on backwards compatibility will be followed.

@Joffreybvn
Copy link
Contributor Author

The databricks provider was already released with the breaking change-> 5.0.1. Maybe this version should be removed / replaced by a new "5.0.1" one ?

@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

The databricks provider was already released with the breaking change-> 5.0.1. Maybe this version should be removed / replaced by a new "5.0.1" one ?

Nope. We can at most yank it. There is no way to replace existing released version. Luckily it has not been yet in constraints of any Airflow version, so we can do it without any serious consequences. Let me actually yank it now.

@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

Yanked

@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

Also cc: @ephraimbuddy -> since we yanked 5.0.1, I will rerun the last build for v2-8-stable to refresh constraints and move the tag once done (currentlly 5.0.1 was in the constraints planned to release for 2.8.0rc3)

Job running here: https://github.com/apache/airflow/actions/runs/7199187385

@potiuk
Copy link
Member

potiuk commented Dec 14, 2023

Constraints with databricks 5.0.0 updated 3312193 and tag for 2.8.0rc3 moved @ephraimbuddy

@Joffreybvn Joffreybvn force-pushed the fix/make-databricks-returns-namedtuples branch from f3619e2 to 8b1b0e6 Compare December 14, 2023 20:26
@utkarsharma2
Copy link
Contributor

utkarsharma2 commented Dec 16, 2023

@bolkedebruin For a long-term solution would it make sense to have a way for a provider to register a custom serializer/deserializer for objects its operator might produce? We can extend this logic to include providers.

@Joffreybvn Joffreybvn marked this pull request as draft December 16, 2023 15:38
@Joffreybvn Joffreybvn marked this pull request as draft December 16, 2023 15:38
@Joffreybvn Joffreybvn force-pushed the fix/make-databricks-returns-namedtuples branch from 9255d9a to 2bfb8ad Compare December 16, 2023 18:42
@Joffreybvn Joffreybvn force-pushed the fix/make-databricks-returns-namedtuples branch from 49b8089 to affd2f4 Compare December 21, 2023 18:00
@potiuk potiuk requested a review from bolkedebruin December 21, 2023 19:11
@@ -167,7 +195,7 @@ def run(
handler: Callable[[Any], T] = ...,
split_statements: bool = ...,
return_last: bool = ...,
) -> T | list[T]:
) -> tuple | list[tuple] | list[list[tuple] | tuple] | None:
Copy link
Contributor Author

@Joffreybvn Joffreybvn Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy was not happy with -> tuple | list[tuple] | list[list[tuple] | tuple] as output - which is correct. I got error: "Overloaded function signatures 1 and 2 overlap with incompatible return types".

Following mypy docs, I added a None for it to be stop complaining. But that's not correct, this method cannot return None when a handler is provided...

Implementing definitive typing is out of the scope of this PR, and I admit mypy beat me on that case. Thus I let that for #36224 - but I'm whiling to help on that issue too of course.

@Joffreybvn
Copy link
Contributor Author

To be honest, I found quite interesting to be in controversial PR. Thus, totally fine for me :)

Current status:

Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much much better! I understand what this is about now. The follow up would make things even clearer so absolutely appreciated if you want to help out there.

@bolkedebruin bolkedebruin merged commit 5fe5d31 into apache:main Dec 22, 2023
78 checks passed
@potiuk
Copy link
Member

potiuk commented Dec 23, 2023

Hey @bolkedebruin -> next time when you merge PR - can you please make sure you keep "squash and merge" for PRs and do not remove the number - this one has been merged without it and it has no (#36205) in the name - which made it disappear from our release notes generation. I noticed it accidentally luckily.

@potiuk
Copy link
Member

potiuk commented Dec 23, 2023

Screenshot 2023-12-23 at 01 49 34

@bolkedebruin
Copy link
Contributor

Sorry @potiuk i did actually squash and merge but adjusted the commit message to cover what was done. That removed the ref. I'll be more careful

@potiuk
Copy link
Member

potiuk commented Dec 23, 2023

No worries :). I caught it luckily

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants