
Debug Memory Leak in Autogen #4893

Open
Leon0402 opened this issue Jan 4, 2025 · 5 comments

@Leon0402
Contributor

Leon0402 commented Jan 4, 2025

@Leon0402 Can you show where your runtime is created? This might be because the runtime is not removing references to the agents it created.

To mitigate, you might want to create a new runtime instance for each task.

I think we should handle it in a separate PR.

_Originally posted by @ekzhu in https://github.com/microsoft/autogen/issues/4885#issuecomment-2571434115_

Thanks @ekzhu, you could be right about that. Possibly there is some interplay with gather(), as I read something along those lines. I am currently trying to reproduce this in a smaller setup.

What do you mean by runtime? My task runner? It is basically just:

class TaskRunner:
    def __init__(self, cfg: Config):
        self._cfg = cfg

    async def run_agent(self, sample: TaskSample, output_dir: Path):
        # define agents here
        # run chat
        # save results to some file
        ...

I do not store anything in the object itself, so my assumption was that the agents should get cleaned up once run_agent returns.
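A quick, generic way to check that assumption could look like this (just a sketch, not the actual autogen code; Agent stands in for whatever objects run_agent creates):

import asyncio
import gc
import weakref

class Agent:
    """Stand-in for whatever objects run_agent creates."""

async def run_agent() -> "weakref.ref[Agent]":
    agent = Agent()
    # ... define agents, run the chat, save results ...
    return weakref.ref(agent)

async def main() -> None:
    ref = await run_agent()
    gc.collect()  # break any reference cycles first
    # None means the agent was freed; anything else means something
    # still holds a reference to it after run_agent returned.
    print(ref())

asyncio.run(main())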

@ekzhu
Collaborator

ekzhu commented Jan 4, 2025

Thanks for creating the issue. To isolate the cause, you could try a simple setup without the Jupyter code executor first, then add the Jupyter executor to run a simple piece of code and compare the difference.
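A rough way to compare the two setups could look like this (just a sketch; run_once is a placeholder for the minimal agent run, once without and once with the Jupyter executor):

import asyncio
import tracemalloc

async def run_once() -> None:
    # Placeholder: run one minimal agent conversation here,
    # first without the Jupyter code executor, then with it.
    await asyncio.sleep(0)

async def measure(iterations: int = 20) -> None:
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for _ in range(iterations):
        await run_once()
    snapshot = tracemalloc.take_snapshot()
    # Show which allocations grew the most over the iterations.
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)

asyncio.run(measure())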

ekzhu added this to the 0.4.1 milestone Jan 4, 2025
@Leon0402
Contributor Author

Leon0402 commented Jan 6, 2025

Here is a memory chart of my long-running task:
[image: memory usage over time]

So yeah, not great :D Debugging and isolating the cause is not easy, but I think I was able to get something useful.

[image: debug output]

This is after one full iteration of run_task, i.e. at the bottom of the for loop below, where everything should already have been cleaned up after the gather.

async def run_task(cfg: Config, task: TaskType):
    ...

    semaphore = asyncio.Semaphore(cfg.concurrency_limit)

    async def run_single_sample(task_runner: TaskRunner, task_sample: TaskSample):
        async with semaphore:
            await task_runner.run_agent(task_sample, cfg.output_dir / task.value)

    task_runner = TaskRunner(cfg)
    samples = [run_single_sample(task_runner, task_sample) for task_sample in sliced_samples]
    await tqdm.gather(*samples, desc=f"Task: {task.value}")


async def run_tasks(cfg: Config):
    for task in cfg.tasks:
        await run_task(cfg, task)
        # Added an asyncio.sleep(60) here to be sure
        # --> HERE: this is where the snapshot above was taken

Looking at my debug output, it seems:

I am not too familiar with the whole async stuff yet, but that line looks a little shady to me. I know __del__ has some heavy caveats in Python.

Any thoughts on this?

Edit: I cannot reliably reproduce this, so maybe my theory is wrong here :(
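For context, the main __del__ caveat that comes to mind here (a generic illustration, not autogen code): a finalizer only runs when the object is actually collected, and for objects caught in a reference cycle that only happens at a gc pass, not when the last name goes away:

import gc

class Holder:
    def __del__(self):
        print("__del__ called")

a = Holder()
b = Holder()
a.other = b
b.other = a  # reference cycle: refcounts never drop to zero
del a, b
print("after del")   # no __del__ yet, the cycle keeps both objects alive
gc.collect()         # the cycle collector frees them and runs the finalizers
print("after gc.collect()")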

@ekzhu
Collaborator

ekzhu commented Jan 10, 2025

Ah! In 10+ years of using Python I have never been a fan of its performance. If performance is really important for you, you might want to wait for our .NET release. cc @rysweet

@Leon0402
Contributor Author

Leon0402 commented Jan 10, 2025

Ah! In 10+ years of using Python I have never been a fan of its performance. If performance is really important for you, you might want to wait for our .NET release. cc @rysweet

Well no, it is not really about performance here. My code (or Autogen's code) just leaks memory somewhere (not in the classical sense as in C++, but in the Python sense that something is not garbage collected until the very end of the program).
Unfortunately, I do not have any more time to debug this (I already invested quite some effort), but as a workaround I manually call gc.collect() in my main loop. While it does not completely fix the issue (there is still a leak for some time), it seems to at least bound the memory usage to something manageable on a laptop with 16 GB.
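For reference, the workaround looks roughly like this (a sketch based on the run_tasks loop above, not the exact code):

import gc

async def run_tasks(cfg: Config):
    for task in cfg.tasks:
        await run_task(cfg, task)
        # Workaround: force a collection between tasks so objects stuck
        # in reference cycles are freed before the next task starts.
        gc.collect()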

@ekzhu
Collaborator

ekzhu commented Jan 10, 2025

Okay. Let's keep this thread open and come back to it when we have some more clarity.
