
HttpToS3Operator OOM when downloading large file #46008

Closed
1 of 2 tasks
rogalski opened this issue Jan 24, 2025 · 1 comment

Comments

@rogalski

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

main is affected

Apache Airflow version

2.10

Operating System

Linux

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened

The whole file is loaded into memory at once, and the task is OOM-killed:
Task exited with return code -9

def execute(self, context: Context):
    self.log.info("Calling HTTP method")
    response = self.http_hook.run(self.endpoint, self.data, self.headers, self.extra_options)
    self.s3_hook.load_bytes(
        response.content,
        self.s3_key,
        self.s3_bucket,
        self.replace,
        self.encrypt,
        self.acl_policy,
    )

In this code, response.content is the in-memory representation of the whole file content (as bytes).

What you think should happen instead

Lazy-load via stream=True, response.raw, and S3Hook.load_file_obj().
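
A minimal sketch of what that could look like, shown here with the hooks used directly rather than as the operator's actual implementation; the connection IDs, endpoint, bucket, and key are placeholders:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.http.hooks.http import HttpHook

http_hook = HttpHook(method="GET", http_conn_id="http_default")
s3_hook = S3Hook(aws_conn_id="aws_default")

# stream=True keeps requests from reading the whole body eagerly;
# response.raw is a file-like object that is consumed lazily.
response = http_hook.run("huge/file.bin", extra_options={"stream": True})
response.raw.decode_content = True  # let urllib3 undo any Content-Encoding

# boto3 reads the file-like object in chunks (multipart upload under the hood),
# so peak memory stays bounded instead of growing with the file size.
s3_hook.load_file_obj(
    response.raw,
    key="downloads/huge-file.bin",
    bucket_name="my-bucket",
    replace=True,
)

Inside the operator this would roughly mean passing stream=True through extra_options and swapping load_bytes for load_file_obj.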

How to reproduce

Run HttpToS3Operator with a file larger than the available RAM (~20 GB was enough in my setup).
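
A hypothetical reproduction DAG; the endpoint, connection IDs, bucket, and key below are placeholders, and any HTTP endpoint serving a file larger than the worker's RAM will do:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.http_to_s3 import HttpToS3Operator

with DAG(dag_id="repro_http_to_s3_oom", start_date=datetime(2025, 1, 1), schedule=None):
    HttpToS3Operator(
        task_id="download_huge_file",
        http_conn_id="http_default",
        endpoint="huge/file.bin",  # ~20 GB, larger than the worker's RAM
        s3_bucket="my-bucket",
        s3_key="downloads/huge-file.bin",
        aws_conn_id="aws_default",
    )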

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rogalski added the area:providers, kind:bug, and needs-triage labels on Jan 24, 2025
@raphaelauv
Contributor

hi, HttpToS3Operator is an Airflow operator that performs the data transfer via the code you linked.

So the data flows through the airflow_worker (if you are using the Celery executor).

You have multiple options:

  1. use the Kubernetes executor (or any other executor that lets you customize the RAM available to the task) for this task; the operator code (which still does not stream the transfer) will then not be constrained by the airflow_worker hardware limits

or

  2. use the KubernetesPodOperator and trigger an efficient specialist tool to execute the transfer, like RCLONE -> https://rclone.org/docs/ (a sketch follows after this list)
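
For option 2, a rough sketch of what that could look like; the image, source URL, and destination are placeholders, and it assumes an rclone remote named s3remote is configured (e.g. via a mounted rclone.conf):

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# The transfer runs inside a dedicated pod, so the Airflow worker only
# schedules the task and never touches the payload itself.
copy_with_rclone = KubernetesPodOperator(
    task_id="http_to_s3_via_rclone",
    name="http-to-s3-via-rclone",
    image="rclone/rclone:latest",
    cmds=["rclone"],
    arguments=[
        "copyurl",
        "https://example.com/huge/file.bin",       # source URL (placeholder)
        "s3remote:my-bucket/downloads/file.bin",   # rclone remote + destination path
    ],
)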

@potiuk removed the kind:bug and needs-triage labels on Jan 26, 2025
@apache locked and limited conversation to collaborators on Jan 26, 2025
@potiuk converted this issue into discussion #46066 on Jan 26, 2025
