Add Retry Mechanism for Failed Polls in HexRunProjectOperator #20

Open
7 tasks
jacobcbeaudin opened this issue Sep 7, 2024 · 0 comments
Summary

Implement a retry mechanism for the polling process in the HexRunProjectOperator to handle temporary API failures.

Description

Currently, when the HexRunProjectOperator is set to run synchronously, it polls the Hex API at regular intervals to check the status of a project run. If a single API call fails during this polling process, the entire task is marked as failed in Airflow, even if the project run itself is healthy. This can lead to unnecessary task failures, especially with a high polling frequency, since more API calls mean more chances of hitting a transient error.

I propose adding a retry mechanism for these API calls to improve the robustness of the operator and reduce false failure reports.

Proposed Changes

  1. Add new parameters to the HexRunProjectOperator:

    • max_poll_retries: Maximum number of retries for a failed poll before the task fails (default: 3)
    • poll_retry_delay: Initial delay between retries in seconds; with exponential backoff, each subsequent retry waits longer (default: 5)
  2. Modify the run_and_poll method in the HexHook class to implement the retry logic:

    • Wrap the run_status call in a retry loop
    • Use exponential backoff for retry delays
    • Only raise an AirflowException if all retries are exhausted
  3. Update the operator's documentation to reflect these new parameters and behavior
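
The retry loop described in step 2 could look roughly like the sketch below. It is not the actual HexHook code: poll_with_retries, fetch_status, and PollRetriesExhausted are hypothetical names, the exception class stands in for AirflowException so the example runs without Airflow installed, and the doubling factor is an assumption (the proposal only says "exponential backoff").

```python
import logging
import time

log = logging.getLogger(__name__)


class PollRetriesExhausted(RuntimeError):
    """Stand-in for AirflowException so the sketch runs without Airflow."""


def poll_with_retries(fetch_status, max_poll_retries=3, poll_retry_delay=5,
                      sleep=time.sleep):
    """Call fetch_status(), retrying failures with exponential backoff.

    fetch_status is a hypothetical zero-argument callable that wraps the
    hook's status request and raises on an API error.
    """
    attempt = 0  # number of retries already performed
    while True:
        try:
            return fetch_status()
        except Exception as exc:  # a real hook would catch narrower errors
            if attempt >= max_poll_retries:
                raise PollRetriesExhausted(
                    f"poll failed after {max_poll_retries} retries"
                ) from exc
            # Assumed backoff schedule: delay doubles on every retry.
            delay = poll_retry_delay * 2 ** attempt
            log.warning("Poll failed (%s); retry %d/%d in %ss",
                        exc, attempt + 1, max_poll_retries, delay)
            sleep(delay)
            attempt += 1
```

With the proposed defaults this waits 5, 10, and 20 seconds before giving up. Passing sleep as an argument is a design choice for testability, not something the issue requires.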

Implementation Details

  • Use Airflow's built-in retry utilities if available, or implement a custom retry decorator
  • Ensure that the total time spent on retries counts towards the overall timeout parameter
  • Log each retry attempt for observability
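
One way to make retry time count towards the overall timeout is to compute a single deadline up front and cap every sleep, whether a normal poll interval or a retry delay, at the time remaining. The sketch below assumes this approach. poll_until_done, fetch_status, and the TERMINAL_STATUSES set are illustrative names and values, not taken from the provider; the built-in RuntimeError and TimeoutError stand in for AirflowException.

```python
import logging
import time

log = logging.getLogger(__name__)

# Assumed set of terminal run states, for illustration only.
TERMINAL_STATUSES = {"COMPLETED", "ERRORED", "KILLED"}


def poll_until_done(fetch_status, wait_seconds=3, timeout=3600,
                    max_poll_retries=3, poll_retry_delay=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until a terminal status, charging retry delays to the timeout."""
    deadline = clock() + timeout
    failures = 0  # consecutive failed polls
    while True:
        try:
            status = fetch_status()
            failures = 0  # a successful poll resets the retry budget
            if status in TERMINAL_STATUSES:
                return status
            pause = wait_seconds
        except Exception as exc:  # a real hook would catch narrower errors
            failures += 1
            if failures > max_poll_retries:
                raise RuntimeError(
                    f"poll failed after {max_poll_retries} retries"
                ) from exc
            pause = poll_retry_delay * 2 ** (failures - 1)
            log.warning("Poll failed (%s); retry %d/%d in %ss",
                        exc, failures, max_poll_retries, pause)
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"run did not finish within {timeout}s")
        # Never sleep past the deadline, whether waiting or backing off.
        sleep(min(pause, remaining))
```

Resetting the failure counter after each successful poll is one possible reading of "retries for a failed poll"; a per-task retry budget would be an equally valid choice and should be decided explicitly.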

Example Usage

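hex_task = HexRunProjectOperator(
    task_id='run_hex_project',
    project_id='your_project_id',
    hex_conn_id='hex_default',
    synchronous=True,     # poll until the project run finishes
    wait_seconds=3,       # interval between status polls
    timeout=3600,         # overall limit, including time spent on retries
    max_poll_retries=3,   # proposed: retries per failed poll
    poll_retry_delay=5,   # proposed: initial retry delay in seconds
    dag=dag,
)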

Acceptance Criteria

  • The HexRunProjectOperator accepts max_poll_retries and poll_retry_delay parameters
  • Failed API calls during polling are retried according to the specified parameters
  • Retries use exponential backoff
  • A failed poll only causes the task to fail after all retry attempts are exhausted
  • Retry attempts are logged for debugging purposes
  • The operator's documentation is updated to reflect the new functionality
  • Unit tests are added to verify the retry behavior
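
The unit-test criterion could be covered along the lines of the sketch below. It uses unittest.mock to simulate a flaky status call. retry_call is a hypothetical, minimal helper included only so the tests have something to run against; the real tests would target the hook's polling method instead.

```python
import unittest
from unittest import mock


def retry_call(func, max_poll_retries, poll_retry_delay, sleep):
    """Hypothetical helper under test: retry func() with doubling delays."""
    for attempt in range(max_poll_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_poll_retries:
                raise  # retries exhausted, surface the last error
            sleep(poll_retry_delay * 2 ** attempt)


class PollRetryTest(unittest.TestCase):
    def test_recovers_after_transient_failures(self):
        # Exception instances in side_effect are raised, other items returned.
        status = mock.Mock(
            side_effect=[ConnectionError(), ConnectionError(), "COMPLETED"]
        )
        sleep = mock.Mock()
        self.assertEqual(retry_call(status, 3, 5, sleep), "COMPLETED")
        self.assertEqual(status.call_count, 3)
        self.assertEqual(sleep.call_args_list, [mock.call(5), mock.call(10)])

    def test_raises_once_retries_are_exhausted(self):
        status = mock.Mock(side_effect=ConnectionError("api down"))
        sleep = mock.Mock()
        with self.assertRaises(ConnectionError):
            retry_call(status, 2, 1, sleep)
        self.assertEqual(status.call_count, 3)  # 1 initial call + 2 retries
        self.assertEqual(sleep.call_count, 2)
```

Saved as a module, these can be run with python -m unittest. Mocking the sleep function keeps the tests fast and lets them assert on the exact backoff schedule.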

Additional Notes

  • Consider adding a configurable jitter to the retry delay to prevent thundering herd problems
  • Evaluate if this retry mechanism should be applied to other API calls in the HexHook
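
For the jitter idea, a minimal sketch is shown below, assuming a downward-only jitter so the delay never exceeds the un-jittered backoff. backoff_delay and its cap and jitter parameters are hypothetical and not part of the proposal above.

```python
import random


def backoff_delay(attempt, base=5.0, cap=60.0, jitter=0.5, rng=random):
    """Delay in seconds before retry number `attempt` (0-based).

    The exponential delay is capped at `cap`, then reduced by a random
    amount of up to `jitter` (a fraction between 0 and 1) of itself.
    """
    raw = min(cap, base * 2 ** attempt)
    return rng.uniform(raw * (1.0 - jitter), raw)
```

With jitter=0 the schedule is the plain capped backoff; with jitter=1 the delay is drawn from anywhere between zero and the backoff value, which spreads out retries from many tasks that failed at the same moment.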