Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pagination to HttpOperator and make it more modular #34669

Merged

Conversation

Joffreybvn
Copy link
Contributor

@Joffreybvn Joffreybvn commented Sep 28, 2023

Hello, this PR is an alternative to #34606.

This PR,

  • Implements a pagination_function to the SimpleHttpOperator: The operators supports a customizable pagination feature. User injects a callable as pagination_function. As long as this callable returns the parameters for a next call, the Operator keep calling the API (with the new parameters).
  • Refactors the SimpleHttpOperator:
    • Rename it to HttpOperator and add a deprecated SimpleHttpOperator class
    • Make it more modular, to facilitate the creation of other Operators on top of it
    • Load the Hook based on the connection_id (like the BaseSQLOperator)

Use case

I do a lot of data pulling from APIs. The SimpleHttpOperator is a great tool for that, but:

  • I've faced an API which returns data in a paginated way, do not provide the number of pages, and send a cursor (random id) to paginate over. Dynamic Tasks? I need to know in advance the number of pages. Which means calling the API two times, which is inefficient, and do not guarantee all pages are traversed during the second call.
  • For some exotic APIs, a custom HttpHook is nice to have (e.g to customize the Connection form, to pass multiple hidden string into custom auth_class, ...). Just like the SQLExecuteQueryOperator, I'd like to use the same HttpOperators which would load the right Hook based on the connection id.

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI still fails but as I had a review / comments on the first PR I want to also provide some early feedback on the follow-up - knowing it is in DRAFT.

  • I assume there are some typing-glitches, but I'll leave this feedback to the CI. MyPy makes a better job than me hunting for problems. Would make a code level review later. Some signatures seem for me not matching. Can review later. Otherwise you might save time in calling breeze static-checks locally.
  • Thanks for separating-out this extension to a dedicated extended operator. This is better than adding more complexity to the SimpleHttpOperator.
  • The current implementation names the operator being "extended" which is kind of abstract, the extension nevertheless is very specific to your "pagination" use case. I understand that the pagination use case is something that is your use case. But this is for me a bit of a naming conflict to just name it "extended". I would propose either naming the operator to be PaginatedHttpOperator (as it is specific for this case with the function signatures as extension) or call the pagination_function rather be a post_call_hook as there might be other extended/generic stuff possibly be implemented after the call as generic hook.
  • Otherwise talking about "generic" and the post_call_hook - I have not tried and therefore don't know if and what is the gap: You know that the existing SimpleHttpOperator already carries a function called response_filter which can be used to inject any callable function already - in this case to post-process the call. Have you tried and thought about why not using the response_filter as a generic hook to call further HTTP pages and filter the context as merge into the overall response? Is for the pagination use case a separate operator required?

@Joffreybvn
Copy link
Contributor Author

Joffreybvn commented Sep 28, 2023

Thanks for early review ! Mypy is indeed still complaining, and the DiscordWebHook too... Will fix that tomorrow :)

About your last point: Yes, i considered the response_filter. Initially, I wrote a simple procedure embedded into a PythonOperator; which was using the hook directly. I'd not put that into the response_filter function because:

  • It's simpler to just use a PythonOperator and the hook (instead of bothering with the Operator for the first call)
  • The response filter is very convenient for filtering responses. I feel hacky when I inject things where its not intended for. "It would be great to receive a list of Responses in the response_filter" -> This is the motivation for this PR.

On the name: It's totally up to you (and other reviewers) how that should be named. Reasoning is the following: We have the SimpleHttpOperator, which is a nice toolset for simple calls. Let's have another toolset for complex calls. It has to start somewhere, so, for now, it has only a pagination feature. With a "PaginatedOperator", we don't have a toolset anymore. But it's indeed more clear what the intention is. Will rename !

About renaming to post_call_hook, I'd say no. For now, the pagination_function is indeed permissive and could be (mis)used. But this function has to define parameters for a next call, and it is used in a loop to repeat calls; thus until there another use-case/PR to modify it, I believe "pagination" is a good name for its behavior.

On the other side, I agree with adding another generic post_call_hook function right after the call to the hook. It could have been a solution for the SimpleHttpOperator. I can add it ? But so far, my use-case is already covered better by the current state of the PR.

@Joffreybvn Joffreybvn changed the title Feature: Implement pagination in new ExtendedHttpOperator + refactor SimpleHttpOperator Feature: Implement pagination in new PaginatedHttpOperator + refactor SimpleHttpOperator Sep 29, 2023
@Joffreybvn Joffreybvn force-pushed the feature/create-extended-operator-with-pagination branch 4 times, most recently from ce9c051 to 7e6ad62 Compare October 2, 2023 18:44
@Joffreybvn Joffreybvn marked this pull request as ready for review October 2, 2023 18:44
@Joffreybvn Joffreybvn force-pushed the feature/create-extended-operator-with-pagination branch 2 times, most recently from 6af80dd to 2741eda Compare October 11, 2023 07:32
@Joffreybvn
Copy link
Contributor Author

@hussein-awala Could you review this one ? (as you reviewed #34606 )
By 'making generic', you mean putting a generic name on it ?

Copy link
Contributor

@eladkal eladkal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasnt part of the discussion but not sure why we need 2 operators. It has the potential of confusing users and no where in the code we have something like this

Not blocking the idea just want to make sure this idea is because we want it to be like it and not because some old desicion bind us. If this is what we want then lets proceed if this is not what we want and we were driven into this by some old convention/desicion then my recommendation is to challange it first.

@Joffreybvn
Copy link
Contributor Author

Everything in one single Http Operator is indeed simpler ! From a user point of view, I get the docstring on my IDE and can opt-in to a pagination feature just by providing the required parameter. This was the idea of the first PR.

But shouldn't the SimpleHttpOperator be renamed then ? Because looping over pages is not a "simple http call" anymore. Maybe just calling it HttpOperator ?

Or BaseHttpOperator ? In this PR, the SimpleHttpOperator is rewritten to be more modular, following what was done for the BaseSQLOperator. It will be easier to create other Operators on top of it (typically to do data ingestion).

@Joffreybvn Joffreybvn force-pushed the feature/create-extended-operator-with-pagination branch from 2741eda to d94478c Compare October 21, 2023 10:50
Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow this PR seems to be in an intermediate state. Docs, tests and operator in core is not matching together. Is is WIP or do I have a mis-understanding?

@Joffreybvn Joffreybvn changed the title Feature: Implement pagination in new PaginatedHttpOperator + refactor SimpleHttpOperator Add pagination to SimpleHttpOperator and make it more modular Oct 24, 2023
@Joffreybvn Joffreybvn force-pushed the feature/create-extended-operator-with-pagination branch from 16b9540 to 2cde718 Compare October 24, 2023 20:04
@Joffreybvn
Copy link
Contributor Author

Joffreybvn commented Oct 24, 2023

I implemented back everything into the SimpleHttpOperator, and changed the title. This PR is not about an extra operator anymore, but extending the SimpleHttpOperator. Tests should be ok too.

@hussein-awala
Copy link
Member

If we decide to implement the feature in the existing operator, I suggest renaming it to HttpOperator and creating a deprecated subclass SimpleHttpOperator (without any override, just for b/c), which we can remove in the next major version.

@Joffreybvn Joffreybvn requested a review from potiuk as a code owner October 25, 2023 19:57
@Joffreybvn Joffreybvn changed the title Add pagination to SimpleHttpOperator and make it more modular Add pagination to HttpOperator and make it more modular Oct 25, 2023
Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me this contribution looks good, but as I am not a committer, some more eyes are needed for review. Propose to make this into 2.8.0

@jscheffl jscheffl added this to the Airflow 2.8.0 milestone Oct 26, 2023
@Joffreybvn Joffreybvn force-pushed the feature/create-extended-operator-with-pagination branch from b2e61ac to d787257 Compare November 3, 2023 12:53
@potiuk
Copy link
Member

potiuk commented Nov 3, 2023

Anyone else ? It looks pretty good for merging :)

@eladkal eladkal merged commit 70b3bd3 into apache:main Nov 3, 2023
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
* feat: Make SimpleHttpOperator extendable

* feat: Implement ExtendedHttpOperator

* feat: Add sync and async tests for `pagination_function`

* feat: Add example and documentation

* fix: Add missing return statements

* fix: typo in class docstring

Co-authored-by: Jens Scheffler <[email protected]>

* fix: make use of hook property in DiscordWebhookHook

* fix: rename to PaginatedHttpOperator

* fix: Correctly route reference link to PaginatedHttpOperator docs

* fix: makes SimpleHttpOperator types customizable for mypy

* fix: add missing dashes in docs + add missing reference to paginated operator

* fix: add missing reference to `PaginatedHttpOperator`

* feat: implement hook retrieval based on connection id

* feat: Merge PaginatedOperator to SimpleHttpOperator

* fix: Removes mention of PaginatedHttpOperator

* fix: Apply static checks code quality

* fix: Reformulate docs

* feat: Deprecate `SimpleHttpOperator` and rename to `HttpOperator`

* fix: Remove 'HttpOperator' from `__deprecated_classes`

---------

Co-authored-by: Jens Scheffler <[email protected]>
@ephraimbuddy ephraimbuddy added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) and removed type:new-feature Changelog: New Features labels Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers area:system-tests changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) kind:documentation provider:http
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants