
Debug Memory Leak in Autogen #4893

Open
Leon0402 opened this issue Jan 4, 2025 · 5 comments

@Leon0402
Contributor

Leon0402 commented Jan 4, 2025

@Leon0402 Can you show where your runtime is created? This might be because the runtime is not removing references to the agents it created.

To mitigate, you might want to create a new runtime instance for each task.

I think we should handle it in a separate PR.

_Originally posted by @ekzhu in https://github.com/microsoft/autogen/issues/4885#issuecomment-2571434115_

Thanks @ekzhu, you could be right about that. Possibly there is some interplay with gather(), as I read something along those lines. I am currently trying to reproduce this in a smaller setup.

What do you mean by runtime? My task runner? It is basically just:

class TaskRunner:
    def __init__(self, cfg: Config):
        self._cfg = cfg

    async def run_agent(self, sample: TaskSample, output_dir: Path):
        # define agents here
        # run chat
        # save results to some file
        ...

I do not store anything in the object itself, so my assumption was that the agents should get cleaned up once run_agent returns.
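A quick, generic way to check that assumption could look like this (just a sketch, not the actual autogen code; Agent stands in for whatever objects run_agent creates):

import asyncio
import gc
import weakref

class Agent:
    """Stand-in for whatever objects run_agent creates."""

async def run_agent() -> "weakref.ref[Agent]":
    agent = Agent()
    # ... define agents, run the chat, save results ...
    return weakref.ref(agent)

async def main() -> None:
    ref = await run_agent()
    gc.collect()  # break any reference cycles first
    # None means the agent was freed; anything else means something
    # still holds a reference to it after run_agent returned.
    print(ref())

asyncio.run(main())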

@ekzhu
Collaborator

ekzhu commented Jan 4, 2025

Thanks for creating the issue. To isolate the cause, you could try a simple setup without the Jupyter code executor first, then add the Jupyter executor to run a simple piece of code and compare the difference.
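A rough way to compare the two setups could look like this (just a sketch; run_once is a placeholder for the minimal agent run, once without and once with the Jupyter executor):

import asyncio
import tracemalloc

async def run_once() -> None:
    # Placeholder: run one minimal agent conversation here,
    # first without the Jupyter code executor, then with it.
    await asyncio.sleep(0)

async def measure(iterations: int = 20) -> None:
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for _ in range(iterations):
        await run_once()
    snapshot = tracemalloc.take_snapshot()
    # Show which allocations grew the most over the iterations.
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)

asyncio.run(measure())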

ekzhu added this to the 0.4.1 milestone Jan 4, 2025
@Leon0402
Contributor Author

Leon0402 commented Jan 6, 2025

Here is a memory chart of my long-running task:
[image: memory usage over time]

So yeah, not great :D Debugging and isolating the cause is not easy, but I think I was able to get something useful.

[image: debug output]

This is after one full iteration of run_task, i.e. at the bottom of the for loop below, where everything should already have been cleaned up after the gather.

async def run_task(cfg: Config, task: TaskType):
    ...

    semaphore = asyncio.Semaphore(cfg.concurrency_limit)

    async def run_single_sample(task_runner: TaskRunner, task_sample: TaskSample):
        async with semaphore:
            await task_runner.run_agent(task_sample, cfg.output_dir / task.value)

    task_runner = TaskRunner(cfg)
    samples = [run_single_sample(task_runner, task_sample) for task_sample in sliced_samples]
    await tqdm.gather(*samples, desc=f"Task: {task.value}")


async def run_tasks(cfg: Config):
    for task in cfg.tasks:
        await run_task(cfg, task)
        # Added an asyncio.sleep(60) here to be sure
        # --> HERE: this is where the snapshot above was taken

Looking at my debug output, it seems:

I am not too familiar with the whole async stuff yet, but that line looks a little shady to me. I know __del__ has some heavy caveats in Python.

Any thoughts on this?

Edit: I cannot reliably reproduce this, so maybe my theory is wrong here :(
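For context, the main __del__ caveat that comes to mind here (a generic illustration, not autogen code): a finalizer only runs when the object is actually collected, and for objects caught in a reference cycle that only happens at a gc pass, not when the last name goes away:

import gc

class Holder:
    def __del__(self):
        print("__del__ called")

a = Holder()
b = Holder()
a.other = b
b.other = a  # reference cycle: refcounts never drop to zero
del a, b
print("after del")   # no __del__ yet, the cycle keeps both objects alive
gc.collect()         # the cycle collector frees them and runs the finalizers
print("after gc.collect()")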

@ekzhu
Collaborator

ekzhu commented Jan 10, 2025

Ah! In 10+ years of using Python I have never been a fan of its performance. If performance is really important for you, you might want to wait for our .NET release. cc @rysweet

@Leon0402
Contributor Author

Leon0402 commented Jan 10, 2025

Ah! In 10+ years of using Python I have never been a fan of its performance. If performance is really important for you, you might want to wait for our .NET release. cc @rysweet

Well no, it is not really about performance here. My code (or Autogen's code) just leaks memory somewhere (not in the classical sense as in C++, but in the Python sense that something is not garbage collected until the very end of the program).
Unfortunately, I do not have any more time to debug this (I already invested quite some effort), but as a workaround I manually call gc.collect() in my main loop. While it does not completely fix the issue (there is still a leak for some time), it seems to at least bound the memory usage to something manageable on a laptop with 16 GB.
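For reference, the workaround looks roughly like this (a sketch based on the run_tasks loop above, not the exact code):

import gc

async def run_tasks(cfg: Config):
    for task in cfg.tasks:
        await run_task(cfg, task)
        # Workaround: force a collection between tasks so objects stuck
        # in reference cycles are freed before the next task starts.
        gc.collect()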

@ekzhu
Collaborator

ekzhu commented Jan 10, 2025

Okay. Let's keep this thread open and come back to it when we have some more clarity.
