
HttpToS3Operator OOM when downloading large file #46008

Closed
1 of 2 tasks
rogalski opened this issue Jan 24, 2025 · 1 comment

Comments

@rogalski

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

main is affected

Apache Airflow version

2.10

Operating System

Linux

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened

The whole file is loaded into memory at once, and the task is OOM-killed:
Task exited with return code -9

def execute(self, context: Context):
    self.log.info("Calling HTTP method")
    response = self.http_hook.run(self.endpoint, self.data, self.headers, self.extra_options)
    self.s3_hook.load_bytes(
        response.content,
        self.s3_key,
        self.s3_bucket,
        self.replace,
        self.encrypt,
        self.acl_policy,
    )

In this code, response.content is the in-memory representation of the whole file content (as bytes).

What you think should happen instead

Lazy-load via stream=True, response.raw, and S3Hook.load_file_obj().
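
A minimal sketch of what that could look like, shown here with the hooks used directly rather than as the operator's actual implementation; the connection IDs, endpoint, bucket, and key are placeholders:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.http.hooks.http import HttpHook

http_hook = HttpHook(method="GET", http_conn_id="http_default")
s3_hook = S3Hook(aws_conn_id="aws_default")

# stream=True keeps requests from reading the whole body eagerly;
# response.raw is a file-like object that is consumed lazily.
response = http_hook.run("huge/file.bin", extra_options={"stream": True})
response.raw.decode_content = True  # let urllib3 undo any Content-Encoding

# boto3 reads the file-like object in chunks (multipart upload under the hood),
# so peak memory stays bounded instead of growing with the file size.
s3_hook.load_file_obj(
    response.raw,
    key="downloads/huge-file.bin",
    bucket_name="my-bucket",
    replace=True,
)

Inside the operator this would roughly mean passing stream=True through extra_options and swapping load_bytes for load_file_obj.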

How to reproduce

Run HttpToS3Operator with a file larger than the available RAM (~20 GB was enough in my setup).
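
A hypothetical reproduction DAG; the endpoint, connection IDs, bucket, and key below are placeholders, and any HTTP endpoint serving a file larger than the worker's RAM will do:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.http_to_s3 import HttpToS3Operator

with DAG(dag_id="repro_http_to_s3_oom", start_date=datetime(2025, 1, 1), schedule=None):
    HttpToS3Operator(
        task_id="download_huge_file",
        http_conn_id="http_default",
        endpoint="huge/file.bin",  # ~20 GB, larger than the worker's RAM
        s3_bucket="my-bucket",
        s3_key="downloads/huge-file.bin",
        aws_conn_id="aws_default",
    )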

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rogalski added the area:providers, kind:bug, and needs-triage labels on Jan 24, 2025
@raphaelauv
Contributor

hi, HttpToS3Operator is an Airflow operator that performs the data transfer via the code you linked.

So the data flows through the airflow_worker (if you are using the Celery executor).

You have multiple options:

  1. use the Kubernetes executor (or any other executor that lets you customize the RAM available to the task) for this task; the operator code (which still does not stream the transfer) will then not be constrained by the airflow_worker hardware limits

or

  2. use the KubernetesPodOperator and trigger an efficient specialist tool to execute the transfer, like RCLONE -> https://rclone.org/docs/ (a sketch follows after this list)
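
For option 2, a rough sketch of what that could look like; the image, source URL, and destination are placeholders, and it assumes an rclone remote named s3remote is configured (e.g. via a mounted rclone.conf):

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# The transfer runs inside a dedicated pod, so the Airflow worker only
# schedules the task and never touches the payload itself.
copy_with_rclone = KubernetesPodOperator(
    task_id="http_to_s3_via_rclone",
    name="http-to-s3-via-rclone",
    image="rclone/rclone:latest",
    cmds=["rclone"],
    arguments=[
        "copyurl",
        "https://example.com/huge/file.bin",       # source URL (placeholder)
        "s3remote:my-bucket/downloads/file.bin",   # rclone remote + destination path
    ],
)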

@potiuk removed the kind:bug and needs-triage labels on Jan 26, 2025
@apache locked and limited conversation to collaborators on Jan 26, 2025
@potiuk converted this issue into discussion #46066 on Jan 26, 2025
