Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AIP-72 | "unsafe" level #45200

Closed
1 of 2 tasks
raphaelauv opened this issue Dec 24, 2024 · 9 comments
Closed
1 of 2 tasks

AIP-72 | "unsafe" level #45200

raphaelauv opened this issue Dec 24, 2024 · 9 comments
Labels
area:core kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet

Comments

@raphaelauv
Copy link
Contributor

raphaelauv commented Dec 24, 2024

Description

I would like to still be able to write dag with a low level access to airflow database for operational purposes in airflow v3

it's convenient to be able to programmatically access the database session in a pythonoperator for specific cleaning / operational purposes

example

    def clear_dag_runs(dag_id, status_to_clear):
        context = get_current_context()
        session = settings.Session()

        query = session.query(DagRun).filter(
            DagRun.state == status_to_clear, DagRun.dag_id == dag_id)
        rst = query.all()

        dag_bag = DagBag(dag_folder=path.join(SRC_FOLDER, 'dags'), include_examples=False)
        dag: DAG = dag_bag.get_dag(dag_id, session)

        for dag_run in rst:
            dag.clear(
                start_date=dag_run.logical_date,
                end_date=dag_run.logical_date,
                task_ids=None,
                include_subdags=True,
                include_parentdag=True,
                only_failed=False,
            )
        session.close()

    def delete_dag(dag_id):
        context = get_current_context()
        session = settings.Session()
        keep_records_in_log = False

        for model in get_sqla_model_classes():
            if hasattr(model, "dag_id") and (not keep_records_in_log or model.__name__ != "Log"):
                session.execute(
                    delete(model)
                    .where(model.dag_id.__eq__(dag_id))
                    .execution_options(synchronize_session="fetch")
                )

    PythonOperator(
        task_id="delete",
        python_callable=delete_dag,
        op_kwargs={"dag_id": "{{params.dag_name}}"}
    )

wdyt ?

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@raphaelauv raphaelauv added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Dec 24, 2024
@dosubot dosubot bot added the area:core label Dec 24, 2024
@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

Not going to happen IMHO. I would definitely be voting NO if it came to a vote. First of all airflow DB is an internal detail. Secondly you have REST APi and Python client and we are going to have built in authentication to work automatically from worker so that you can use Python. Client'Pythonic APi to interact with Airflow DB over REST API and this is the way it is going to be the 'official'way of interacting with Airflow DB.

If you think you miss some ways of interacting with Airflow DB (do r example bulk updates) - the right approach is to add appropriate API endpoints to do it NOT use the SQL queries on the Database that might and will change any time without any warning.

What you are asking for is a step back in security and architecture and I highly doubt any maintainer would support it .

BTW. In many cases using. python Client will produce more readable and simpler code than using sqlqchemy

@raphaelauv
Copy link
Contributor Author

Thank you for the answer 👍

Without security compromise, would it be possible to create a pluggable SDK extension system ?

So at run time an airflow "custom_sdk_extension" or "sdk_custom_backend" that would let the user enrich the API of the SDK (so no need to maintain a custom fork airflow project for the user ) ?

Thanks

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

Why Python client is not enough ?

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

What precisely features you miss and why they cannot be added to standard API as pull request ? The 'extendable sdk' is anti-thesis of having an API. The REST API is designed to make a maintainable, public way of interacting with airflow. What your are asking for is maintainable API to be able to extend rhe API which is 'meta API API'

The thing is that you are NOT supposed to be interacting with the DB of Airflow because it might and will change. So if you do it (which you still can do if you really want by configuring your own way of accessing the same DB and writing your own code to do so. Then you are doing it on absolutely your own risk and cost. The whole thing about exposing the API by maintainers and not giving ANY official way to interacting with the DB that aims for exactly this - to make you solely and ultimately responsible if you access the things we made 'internal' deliberately .

So what you are asking for REALLY is for maintainers and community to take responsibility and maintenance cost and burden for accessing airflow db in the ways we are not able to foresee.

IMHO - this is very clear ask from maintainers:

a) if you see some access patterns that can be reused by others - contribute it back as REST API

b) if you see something useful only for you - do what you need In your but without impacting community in ways that it will make it more difficult to maintain and make changes. You need to take full responsibility for adapting the changes to future airflow version and it should be clear to you that you are commiting to it

Simply 'with great powers come great responsibilities' and if you want more power you should take full responsibility for what it means

@raphaelauv
Copy link
Contributor Author

Yes you are right, the main pattern that I have in mind is the bulk maintenance of the airflow database ( there are few users, doing >100k task_run by day ).

Thanks again for your detailed answer, it will help many users realize that building directly on top of Airflow a very custom platform ( I seen many cases of users using XCOM as almost a product database 😱, and also querying the airflow database for very very custom orchestration decisioning ) without separation of concern is not a good idea.

I guess that users will surely have intriguing and subtle questions when airflow V3 will be release 😀

Thanks again

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

I think it's not a problem to build custom platform - the REST API of ours is precisely what's making it possible and it was the main motivator for the API implementation.

And again - if some features are missing there, adding an API endpoint via PR is precisely how you can achieve what you need.

You can also export the data somewhere (even using Airflow DAG !) using the API and do whatever you need outside of airflow DB. This is a typical pattern that internal DB needs to be performant and maintainable for the Application using it and if you need to do any other mechanism to access the data and aggregate or process it in other ways you export the data and so it 'outside'.

And yes - you can query and retrieve data using XCOM REST API. And again - interacting with them over the XCOM part of the Python client is even easier than using sqlalchemy or SQL.

And the fact that you should not be using db access had not only been mentioned every time someone brought the subject in slack and GitHub but also very clearly explained in https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html#what-is-not-part-of-the-public-interface-of-apache-airflow

When we implemented REST API 4 years ago for Airflow 2 we made it very clear that it is the only maintained and supported interface and we gave our users 4 years to adapt.

So I think it is not unexpected, it's a very deliberate, very well announced move and we went a great length to add tools and explanations how to use it.

If anyone is still not ready for that, well, there is still time to prepare.

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

I guess that users will surely have intriguing and subtle questions when airflow V3 will be release 😀

I hope so. And I hope that it will lead that very users to contribute back whatever they need and miss in Airflow 3 API in a reusable way for others

That is one of the intentions we had when we added the API

@raphaelauv
Copy link
Contributor Author

raphaelauv commented Dec 26, 2024

With airflow 2.11 we will be able to use the sdk , right ?

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

With airflow 2.11 we will be able to use the sdk , right ?

I think it is planned as one of the rules . Certainly we discussed it as an option and we already did something similar for db isolation AIP-44 work - so we already have the right code implemented in v2-10-test branch, we just need to turn the AIP-44 error into deprecation warning

If it is not already there #41641 - it's.a.good idea to add it there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet
Projects
None yet
Development

No branches or pull requests

2 participants