
SQLExecuteQueryOperator not timing out within expected timeframe #45930

Open
1 of 2 tasks
bentorb opened this issue Jan 22, 2025 · 7 comments


@bentorb

bentorb commented Jan 22, 2025

Apache Airflow Provider(s)

common-sql

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.0.0
apache-airflow-providers-common-sql==1.19.0
apache-airflow-providers-mysql==5.7.3

Apache Airflow version

2.10.3

Operating System

Amazon Linux 2023

Deployment

Amazon (AWS) MWAA

Deployment details

The issue happens on a production Amazon (AWS) MWAA environment, and is also reproducible locally using the MWAA local runner (deployed on a Mac).

What happened

The SQLExecuteQueryOperator fails with a timeout, but only after the query has completed.

This is the workflow currently observed: T0: Start -> T1: Run query -> T2: Expected timeout failure -> T3: Query finishes -> T4: Operator fails on timeout

What you think should happen instead

Once the execution timeout is reached, the SQLExecuteQueryOperator should kill the query and then fail.

This is the expected workflow: T0: Start -> T1: Run query -> T2: Expected timeout, operator kills query and fails

How to reproduce

The issue is reproducible both on a production Amazon (AWS) MWAA environment, and locally using the MWAA local runner (deployed on a Mac).

The DAG below contains two tasks:

  1. The first one executes a long-running query on a MySQL database using the SQLExecuteQueryOperator.
  2. The second one executes a long-running process using the PythonOperator.

The PythonOperator fails after exactly one minute (as expected); the SQLExecuteQueryOperator keeps running for a few minutes until the query completes (not as expected).

import os
import time

from datetime import timedelta

from airflow.decorators import dag
from airflow.models.connection import Connection
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(schedule=None)
def test_sql_operator_timeout_dag():

    # SQL operator test
    connection = Connection(
        conn_id="mysql_connection",
        conn_type="mysql",
        host="xyz",
        port=3306,
        schema="test",
        login="xyz",
        password="xyz",
    )

    env_key = f"AIRFLOW_CONN_{connection.conn_id.upper()}"
    connection_uri = connection.get_uri()
    os.environ[env_key] = connection_uri
    
    SQLExecuteQueryOperator(
        task_id="mysql_long_query",
        conn_id=connection.conn_id,
        sql="select benchmark(1200000000, md5('when will it end?'));",
        execution_timeout=timedelta(minutes=1),
    )


    # Python operator test
    def sleeping_function():
        time.sleep(120)

    PythonOperator(
        task_id="python_long_task",
        python_callable=sleeping_function,
        execution_timeout=timedelta(minutes=1),
    )

test_sql_operator_timeout_dag = test_sql_operator_timeout_dag()

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@bentorb added the area:providers, kind:bug (this is clearly a bug), and needs-triage (new issue not yet triaged) labels on Jan 22, 2025

boring-cyborg bot commented Jan 22, 2025

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@eladkal added the good first issue label and removed the needs-triage label on Jan 22, 2025
@eladkal
Contributor

eladkal commented Jan 22, 2025

cc @dabla, maybe you will have time to look into this? Seems like inconsistent behavior of the SQLExecuteQueryOperator.

@dabla
Contributor

dabla commented Jan 22, 2025

I've also encountered the same behaviour: the connection blocks the thread, so the SQLExecuteQueryOperator task only stops once the connection call returns, at least that's what I experienced.

To avoid that issue when using the JdbcHook, I specified a query timeout on the JDBC statement. That way, when the task has to time out, the statement does too (even though the query may continue to run on the database) and thus doesn't block the Airflow worker's thread. This was, however, a solution specific to JDBC connections.

Below is the code:

from contextlib import closing

from airflow.providers.common.sql.hooks.sql import DbApiHook, fetch_one_handler

# `timeout` and `sql` come from the surrounding task code
hook = DbApiHook.get_hook(conn_id="conn_id")
with hook.get_conn() as conn:
    with closing(conn.cursor()) as cur:
        stmt = conn.jconn.createStatement()
        if timeout:
            # Set the query timeout in seconds on the JDBC statement
            stmt.setQueryTimeout(int(timeout.total_seconds()))
        stmt.executeQuery(sql)  # Execute the SQL query
        return fetch_one_handler(cur)

So maybe in the hook some kind of timeout should also be specified on the connection being used; or, if you want a more fine-grained approach per operator, you could specify the connection timeout through the hook_params of the SQLExecuteQueryOperator. Still, this all depends on whether the underlying connection supports it, of course.
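That driver-level approach can be illustrated without MySQL or JDBC. A minimal sketch using the stdlib sqlite3 module (an assumption for illustration only, not the hook discussed above): its progress handler can abort a running statement once a deadline passes, much like setQueryTimeout does for JDBC. The MySQL path in this issue has no such cancellation hook wired in by default, which is why the worker stays blocked until the query finishes.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
deadline = time.monotonic() + 0.2          # a 200 ms "execution_timeout"

def abort_if_past_deadline():
    # A non-zero return value tells SQLite to abort the running statement.
    return 1 if time.monotonic() > deadline else 0

# SQLite calls the handler every ~10,000 virtual-machine instructions.
conn.set_progress_handler(abort_if_past_deadline, 10_000)

aborted = False
try:
    # A deliberately slow query: count the rows of a large recursive CTE.
    conn.execute(
        "WITH RECURSIVE c(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM c LIMIT 50000000) "
        "SELECT count(*) FROM c"
    ).fetchone()
except sqlite3.OperationalError:           # raised as 'interrupted'
    aborted = True

print(aborted)
```

The key point is that the cancellation happens inside the driver, so control returns to Python promptly instead of waiting for the query to run to completion.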

@eladkal
Contributor

eladkal commented Jan 22, 2025

Should we apply this logic to the underlying get-hook function that we are using? While not generic to all connections, if we can provide a workaround for the ones that support it, I think we should.
Yet the underlying issue is not clear to me. The kill signal from the timeout comes from the scheduler. Why would the task not respond to it? The execution_timeout is for Airflow, not for the underlying SQL engine.

@dabla
Contributor

dabla commented Jan 22, 2025

Should we apply this logic to the underlying get-hook function that we are using? While not generic to all connections, if we can provide a workaround for the ones that support it, I think we should. Yet the underlying issue is not clear to me. The kill signal from the timeout comes from the scheduler. Why would the task not respond to it? The execution_timeout is for Airflow, not for the underlying SQL engine.

We could, but that would possibly not work for all cases and would be a workaround for the real issue.
It's indeed a weird issue, and I was also surprised by that behaviour. Back then I thought it was related to the JDBC connection, and that maybe there was some glitch between the Python and Java processes, but now we can assume this is a generic issue related to database connections. So yes, it's weird that the worker doesn't get killed.

@eladkal
Contributor

eladkal commented Jan 22, 2025

but now we can assume this is a generic issue related to database connections. So yes it's weird that worker doesn't get killed.

I am not sure that explains it. Airflow runs a process (that happens to hold a database connection); Airflow should be able to kill that process. The fact that the issue happens for SQL but not for Python is really odd. From the description it sounds like something is blocking the scheduler from killing the task (because it killed it only after it finished). I think the issue is with the scheduler itself.
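For context on the mechanism, my understanding (a sketch, not a confirmed diagnosis of this issue): inside the worker process, execution_timeout is enforced by a SIGALRM-based handler (see airflow.utils.timeout). A Python signal handler only runs between bytecode instructions, so if the task is blocked inside a C extension call, such as a DB driver waiting on a socket, the handler fires only once that call returns, which would match the behaviour reported here. A minimal sketch of the mechanism (Unix only):

```python
import signal
import time

class TaskTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    # Raised in the main thread the next time Python checks for signals.
    raise TaskTimeout("execution_timeout reached")

signal.signal(signal.SIGALRM, _on_alarm)
signal.alarm(1)          # arm a 1-second "execution_timeout"

caught = None
try:
    # time.sleep IS interruptible, so the alarm fires on time here;
    # a blocking call inside a DB driver's C extension may not be,
    # delaying the handler until the call (i.e. the query) completes.
    time.sleep(10)
except TaskTimeout as exc:
    caught = str(exc)
finally:
    signal.alarm(0)      # disarm

print(caught)            # -> execution_timeout reached
```

Under this reading, the PythonOperator times out punctually because time.sleep cooperates with signal delivery, while the MySQL driver call does not.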

@dabla
Contributor

dabla commented Jan 22, 2025

but now we can assume this is a generic issue related to database connections. So yes it's weird that worker doesn't get killed.

I am not sure that explains it. Airflow runs a process (that happens to hold a database connection); Airflow should be able to kill that process. The fact that the issue happens for SQL but not for Python is really odd. From the description it sounds like something is blocking the scheduler from killing the task (because it killed it only after it finished). I think the issue is with the scheduler itself.

The question, at least for me, is: how do you debug something like this?
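One low-friction way to see where a "stuck" worker is actually blocked is to dump every thread's stack while it hangs, either from outside the process with py-spy dump, or from inside with the stdlib faulthandler module. A minimal sketch of the in-process variant (the sleep is a stand-in for the blocking DB call):

```python
import faulthandler
import tempfile
import time

# Arm a watchdog: if the process is still running after 1 second,
# every thread's stack is dumped to the given file, showing exactly
# which call the worker is blocked in.
report_file = tempfile.NamedTemporaryFile("w+", delete=False)
faulthandler.dump_traceback_later(1, file=report_file)

time.sleep(2)                      # stand-in for the blocking DB call
faulthandler.cancel_dump_traceback_later()
report_file.flush()

report_file.seek(0)
report = report_file.read()
print("Timeout" in report)         # the dump header starts with "Timeout"
```

In a real task you would point the dump at the task log (or just use py-spy, which needs no code changes) and look for the frame that never moves between two dumps.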
