Better integration between datasets and data intervals #45187

Open
1 of 2 tasks
casperhart opened this issue Dec 23, 2024 · 4 comments
Labels
area:datasets Issues related to the datasets feature kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet

Comments

@casperhart

Description

Currently, a DAG can be triggered by a dataset, a time schedule, or a DatasetOrTimeSchedule. It would be good if the dataset itself (or each dataset event) could also be associated with a schedule or logical_date. For example, a monthly dataset could have an event emitted by a DAG at most once for a given month, in a way that respects the catchup argument of a downstream DAG.

For example, take a DAG with two dataset dependencies. If dataset1 has been produced for month1 and dataset2 then gets produced for month2, the DAG will be triggered even though the two dataset events relate to separate intervals. I'd like the DAG to be triggered only if the dataset events were emitted for the same interval.
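To make the mismatch concrete, here is a minimal pure-Python sketch (not Airflow code) contrasting the two trigger rules. `DatasetEvent`, `ready_any_interval`, and `ready_same_interval` are hypothetical names, and the interval is modelled as a plain month string. The first function only approximates my understanding of the current rule (every required dataset has some pending event); the second is the behaviour I'm asking for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEvent:
    """Hypothetical event record: which dataset was updated, and for which interval."""
    dataset: str
    interval: str  # e.g. "2024-01" for a monthly dataset (assumed representation)


def ready_any_interval(required, events):
    """Rough model of today's rule: every required dataset has some pending event."""
    return required <= {e.dataset for e in events}


def ready_same_interval(required, events):
    """Requested rule: return the intervals for which every required dataset has an event."""
    by_interval = {}
    for e in events:
        by_interval.setdefault(e.interval, set()).add(e.dataset)
    return {interval for interval, seen in by_interval.items() if required <= seen}


required = {"dataset1", "dataset2"}
events = [
    DatasetEvent("dataset1", "2024-01"),
    DatasetEvent("dataset2", "2024-02"),
]

# The events belong to different months, so only the first rule fires.
fires_today = ready_any_interval(required, events)
aligned_intervals = ready_same_interval(required, events)
```

With these two events, `fires_today` is true while `aligned_intervals` is empty; the consumer would only run once dataset2 is also produced for `2024-01`.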

I'm fairly new to using datasets so apologies if my issue already has a solution or workaround.

Use case/motivation

I have a few issues with datasets that I'm having trouble solving:

  • A dataset producer DAG gets re-run, but we don't want downstream DAGs to be re-triggered for the same data interval.
  • Out-of-sync issues where a DAG is triggered based on a stale event in cases where multiple dataset triggers are defined: Dataset aware scheduling - is there a way to reset DAG? #36618
  • If a producer DAG gets run with catchup=True and we don't want consumer DAGs to be backfilled, can we restrict backfill on consumer DAGs?
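
The first and third points could be handled by a per-consumer filter keyed on the data interval. The sketch below is plain Python rather than anything Airflow provides today: `ConsumerGate`, `earliest_interval`, and `should_trigger` are hypothetical names, and intervals are assumed to be sortable `YYYY-MM` strings.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ConsumerGate:
    """Hypothetical per-consumer filter applied to incoming dataset events."""
    # Analogue of catchup=False on the consumer: ignore intervals older than this.
    earliest_interval: Optional[str] = None
    # Intervals this consumer has already been triggered for.
    seen: Set[str] = field(default_factory=set)

    def should_trigger(self, interval: str) -> bool:
        if self.earliest_interval is not None and interval < self.earliest_interval:
            return False  # backfilled interval the consumer opted out of
        if interval in self.seen:
            return False  # producer was re-run for an interval already consumed
        self.seen.add(interval)
        return True


gate = ConsumerGate(earliest_interval="2024-03")
# A producer backfills January, emits March twice (a re-run), then April.
decisions = [gate.should_trigger(i) for i in ["2024-01", "2024-03", "2024-03", "2024-04"]]
```

Here only the first March event and the April event would trigger the consumer. None of this is possible unless the dataset event carries its interval, which is the point of this request.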

Technically, this could be accomplished with TriggerDagRunOperator/ExternalTaskSensor, but these have other issues that datasets solve quite nicely. The benefit of decoupling DAGs using datasets is huge. However, by using datasets, some of the benefits of time schedules are lost.

Related issues

#36618

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@casperhart casperhart added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Dec 23, 2024

boring-cyborg bot commented Dec 23, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot dosubot bot added the area:datasets Issues related to the datasets feature label Dec 23, 2024
@tirkarthi
Contributor

cc: @Lee-W @uranusjr

@potiuk
Member

potiuk commented Dec 24, 2024

cc: @dstandish

@potiuk
Member

potiuk commented Dec 24, 2024

This is another case where I think "data interval" is such an established term that we should embrace it, not move away from it (re: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-83+amendment+to+support+classic+Airflow+authoring+style).

cc: @casperhart -> I think it would be great if you also incorporated your points into the discussion on that AIP-83 amendment. I think they're pretty relevant, and it would be valuable to hear from others as well about the cases they have in mind.
