AIP-72: Handling task retries in task SDK + execution API #45106
base: main
Conversation
If we agree on the approach, I will work on the tests.
It might also be a good idea to decouple the entire payload construction out of that method. Somewhat like:
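A rough sketch of what that decoupling could look like; the helper name and payload fields below are illustrative guesses, not the actual PR code:

```python
# Hypothetical helper: pull terminal-state payload construction into one place
# so callers only decide *what* to report, not how the payload is shaped.
from datetime import datetime


def build_terminal_state_payload(state: str, end_date: datetime, should_retry: bool = False) -> dict:
    """Build the JSON body sent to the execution API for a terminal TI state."""
    return {
        "state": state,
        "end_date": end_date.isoformat(),
        "should_retry": should_retry,
    }
```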
Remove `task_retries` from the payloads.
task_id="test_ti_update_state_to_retry_when_restarting",
state=State.RESTARTING,
)
session.commit()
This does nothing I think, or at the very least you should pass `session` on to `create_task_instance` too.
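Something along these lines, assuming the `create_task_instance` fixture accepts a `session` argument (a sketch, not the exact test code):

```python
# Sketch: create the TI on the same session that is committed afterwards,
# otherwise the commit has nothing pending to flush.
ti = create_task_instance(
    task_id="test_ti_update_state_to_retry_when_restarting",
    state=State.RESTARTING,
    session=session,
)
session.commit()
```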
json={
    "state": State.FAILED,
    "end_date": DEFAULT_END_DATE.isoformat(),
    "task_retries": retries,
Yeah, this feels like a very leaky abstraction; let's remove it, and if we want a "fail and don't retry" option, let's have that be something else rather than overload `task_retries` in a non-obvious manner.
Yeah, this has been reworked now. I am now communicating whether a retry should be attempted via `should_retry`.
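Roughly, the terminal-state payload now carries an explicit flag instead of piggybacking on `task_retries`; an illustrative example (field names approximate, not the final schema):

```python
from datetime import datetime, timezone

# Illustrative payload: the client reports the failure and separately states
# whether a retry should be attempted, rather than sending task_retries and
# letting the server infer intent from it.
payload = {
    "state": "failed",
    "end_date": datetime.now(timezone.utc).isoformat(),
    "should_retry": True,  # the task runner determined a retry is warranted
}
```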
closes: #44351
"Retries" are majorly handled in airflow 2.x in here: https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L3082-L3101.
The idea is that if a task is retryable, as defined by https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1054-L1073, it is marked as "up_for_retry". The rest is handled by the scheduler loop as usual, provided the TI state is set correctly.
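The linked 2.x check essentially boils down to comparing the try count against the allowed tries; a simplified sketch (not the exact implementation):

```python
def is_eligible_to_retry(try_number: int, max_tries: int, retries: int) -> bool:
    """Simplified: a TI may retry only if retries are configured and it has
    not yet exhausted its allowed tries."""
    return retries > 0 and try_number <= max_tries
```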
Coming to the Task SDK, we cannot perform validations such as https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1054-L1073 on the task runner / SDK side, because we do not have (and should not have) access to the database.
We can use the above state-change diagram and handle the retry state while handling the failed state. Instead of having a dedicated API handler and states for "up_for_retry", we can deal with it when handling failures, which we already do by calling the https://github.com/apache/airflow/blob/main/airflow/api_fastapi/execution_api/routes/task_instances.py#L160-L212 endpoint. If we send enough data to that handler in the execution API, we should be able to handle the retry cases well.
What needs to be done for porting this to `task_sdk`?
Defining "try_number" and "max_retries" for task instances ---> not needed, because this is already handled on the scheduler / parsing side rather than at execution time. It is handled here https://github.com/apache/airflow/blob/main/airflow/models/dagrun.py#L1445-L1471 when a DAG run is created and the TI is initialised with the initial values: max_tries (https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1809) and try_number (https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1808). See the simplified illustration below.
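In other words, by the time the task runs, each TI already carries its counters; a simplified illustration (not the actual ORM code):

```python
# At DAG-run creation the scheduler seeds the TI counters, so the execution
# side never has to compute them.
task_retries = 3                 # example: the task's `retries` setting
initial_ti_values = {
    "try_number": 0,             # incremented by the scheduler per attempt
    "max_tries": task_retries,   # derived from the task's retries
}
```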
We need a mechanism for the task runner to signal whether retries are defined. We will send this in the following fashion:
the task runner informs the supervisor on failure that it needs to retry -> the supervisor sends a normal request to the client (but with `task_retries` defined) -> the client sends a normal API request (`TITerminalStatePayload`) to the execution API, but with `task_retries`.
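A hedged sketch of that hand-off from the client's side (the function, URL path, and field names are illustrative, not the actual Task SDK API):

```python
import requests


def report_failure(execution_api_url: str, ti_id: str, end_date_iso: str, task_retries: int | None) -> None:
    # The client folds the supervisor's "retries are defined" signal into the
    # usual terminal-state request instead of using a separate endpoint.
    payload = {
        "state": "failed",
        "end_date": end_date_iso,
        "task_retries": task_retries,  # only meaningful when the task defines retries
    }
    requests.patch(f"{execution_api_url}/task-instances/{ti_id}/state", json=payload)
```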
At the execution API, we receive the request and check whether the TI is eligible for retry; if it is, we mark it as "up_for_retry", and the scheduler takes care of the rest.
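On the execution API side the decision then reduces to something like this (a sketch of the logic, not the actual FastAPI route):

```python
def resolve_terminal_state(requested_state: str, try_number: int, max_tries: int, task_retries: int | None) -> str:
    # If the client reports a failure and the TI still has tries left,
    # store "up_for_retry" instead of "failed"; the scheduler then re-queues
    # it exactly as it does in Airflow 2.x.
    if requested_state == "failed" and task_retries and try_number <= max_tries:
        return "up_for_retry"
    return requested_state
```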
Testing results
Right now the PR is meant to handle `BaseException`; it will be extended to all other eligible TI exceptions in follow-ups.
Scenario 1: With retries = 3 defined.
DAG:
Rightly marked as "up_for_retry"
TI details with max_tries
Try number in grid view
Scenario 2: With retries not defined.
DAG:
Rightly marked as "failed"
TI details with 0 max_tries:
Try number in grid view
============
Pending:
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.