
[ECS] [Container OOM]: Containers OOM with Amazon Linux 2023 ECS AMI #240

rixwan-sharif opened this issue Apr 23, 2024 · 18 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
ECS Containers are getting killed due to Out of Memory with new Amazon Linux 2023 ECS AMI.

Which service(s) is this request for?
ECS - with EC2 (Autoscaling and Capacity Provider setup)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We are deploying our workloads on AWS ECS (EC2 based). We recently migrated our underlying cluster instances AMI to Amazon Linux 2023 (previously using Amazon Linux 2). After the migration, we are facing a lot of "OOM Container Killed" for our services without any change on the service side.

Are you currently working around this issue?

  • Increasing the hard memory limit for now (we have more than 600 services, so it's hard to do that for each of them; see the sketch after this list)
  • Rolling back to Amazon Linux 2 AMI
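
For anyone else stuck on the bulk-update workaround, here is a rough sketch of how the hard memory limit could be raised across many services with boto3 by registering a new task definition revision per service. This is not the reporter's actual tooling; the cluster/service names and the new limit are placeholders.

```python
# Sketch: bump the container-level hard memory limit for one service by
# registering a new task definition revision and pointing the service at it.
# Assumes default AWS credentials/region; names below are placeholders.
import boto3

ecs = boto3.client("ecs")

# Fields returned by describe_task_definition that register_task_definition
# will not accept back.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy",
)

def bump_memory(cluster: str, service: str, new_memory_mib: int) -> None:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    td = ecs.describe_task_definition(taskDefinition=svc["taskDefinition"])["taskDefinition"]

    for container in td["containerDefinitions"]:
        container["memory"] = new_memory_mib  # hard limit, in MiB

    params = {k: v for k, v in td.items() if k not in READ_ONLY_FIELDS}
    new_td = ecs.register_task_definition(**params)["taskDefinition"]
    ecs.update_service(cluster=cluster, service=service,
                       taskDefinition=new_td["taskDefinitionArn"])

if __name__ == "__main__":
    bump_memory("prod-cluster", "example-service", 2048)  # placeholder values
```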
@rgoltz

rgoltz commented May 2, 2024

Hey @rixwan-sharif + all others with 👍

Thanks for the info and for sharing your experience!
We are a regular ECS-on-EC2 user as well; we are currently still in the planning phase for the switch from Amazon Linux 2 to Amazon Linux 2023, and we also have a large number of containers. Would it be possible for you to add further details about your setup here? (ECS: limits set at the TaskDefinition and/or container level // Images: container OS used (like Alpine, Distroless, ...), application framework in the container (e.g. Java/Spring Boot or NodeJS) // some metrics comparing AL2 vs. AL2023, etc.)

We asked the AWS ECS team via a support case whether there is a known error or behavior like you described here; they said no.

> Please note that the issue raised is not a known issue internally. Also there are no known issues related to Amazon Linux 2023 for out of memory behaviour.

Did you also forward the problem as a support case?

Thanks, Robert

@sparrc sparrc transferred this issue from aws/containers-roadmap May 2, 2024
@sparrc
Contributor

sparrc commented May 2, 2024

Hello, I have transferred this issue to the ECS/EC2 AMI repo from containers-roadmap, since this sounds more like it could be a bug or change in behavior in the AMI, rather than a feature request.

@rixwan-sharif could you let us know which AL2023 AMI version you used? Was it the latest available? Could you also provide the task and container limits that you have in your task definition(s)?

Two differences that come to mind as potentially relevant: the latest AL2023 AMI uses Docker 25.0 and cgroups v2, whereas the latest AL2 AMI is currently on Docker 20.10 and cgroups v1.
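
For anyone comparing instances, a minimal sketch (assuming shell access to the container instance and the docker CLI on PATH) for recording which cgroup version and Docker server version a host is running:

```python
# Sketch: report cgroup version and Docker server version on an ECS container
# instance, to compare AL2 (cgroups v1 / Docker 20.10) with AL2023 (cgroups v2 / Docker 25.0).
import json
import subprocess
from pathlib import Path

def cgroup_version() -> str:
    # The unified (v2) hierarchy exposes cgroup.controllers at the cgroup root;
    # a v1 host does not.
    return "v2" if Path("/sys/fs/cgroup/cgroup.controllers").exists() else "v1"

def docker_server_version() -> str:
    out = subprocess.run(
        ["docker", "version", "--format", "{{json .Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

if __name__ == "__main__":
    print(f"cgroups: {cgroup_version()}, docker: {docker_server_version()}")
```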

@sparrc
Contributor

sparrc commented May 2, 2024

If you were not using the latest AL2023 AMI, one thing to note is that the Amazon Linux team released a systemd fix in late-September 2023 for a bug in the cgroup OOM-kill behavior. (Search "systemd" in release notes here: https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.2.20230920.html)

@sparrc
Contributor

sparrc commented May 2, 2024

If anyone has data to provide to the engineering team that can't be shared here, please feel free to email it to ecs-agent-external (at) amazon (dot) com, thank you :)

@rixwan-sharif
Author

rixwan-sharif commented May 6, 2024

Hi, this is the AMI version we are using:

AMI Version : al2023-ami-ecs-hvm-2023.0.20240409-kernel-6.1-x86_64

Task/Container details

Base Docker Image: adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12
Application Framework: Java/Spring Boot

Resources:

CPU : 0.125
Memory(hard): 1GB
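
For reference, those limits map onto the container definition roughly like this (a minimal fragment, not the reporter's full task definition; the container name is a placeholder):

```python
# ECS expresses CPU in CPU units (1024 = 1 vCPU) and memory in MiB; the hard
# limit ("memory") is what triggers the OOM kill.
container_definition = {
    "name": "app",  # placeholder
    "image": "adoptopenjdk/openjdk14:x86_64-debian-jdk-14.0.2_12",
    "cpu": 128,       # 0.125 vCPU
    "memory": 1024,   # hard limit: 1 GB
}
```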

Docker stats on Amazon Linux 2 AMI.

[screenshot: docker stats output]

Docker stats on Amazon Linux 2023 AMI. (Increased Memory hard limit to 3GB as container was OOMing with 1GB of memory)

[screenshot: docker stats output]

And yes, we already opened a support case too (Case ID 171387184800518). This is what we got from support:

[+] Upon further troubleshooting, we found that there seems to be an issue with the AL2023 AMI which our internal team is already working on, and below is the wording shared by them:

We are following up on your inquiry regarding increased container OOM (out-of-memory) kills when using ECS-Optimized AL2023 AMI, comparing to AL2 AMI. The team is investigating to identify what has caused the OOM kill behavior change. We suspect a combination of Cgroupv2 and container runtime updates. As a workaround, we recommend that customer adjust container and task memory hard limit to have a larger buffer, based on their container memory usage patterns, using ECS Container Insights metrics.
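
For what it's worth, a sketch of the kind of Container Insights query that sizing exercise implies (cluster and service names are placeholders; Container Insights must be enabled on the cluster):

```python
# Sketch: pull a week of per-service memory usage from the Container Insights
# namespace to decide how much buffer to add to the hard limit.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="MemoryUtilized",                     # per-service memory used, in MB
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod-cluster"},     # placeholder
        {"Name": "ServiceName", "Value": "example-service"},  # placeholder
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"]), round(point["Maximum"]))
```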

@egemen-touchstay

Hi @sparrc, after switching to AL2023 from AL2 we faced a similar issue as well. We haven't hit OOM yet, but memory consumption has nearly doubled, and it keeps increasing as the application runs, which looks like a memory leak.
AMI Version: al2023-ami-ecs-hvm-2023.0.20240515-kernel-6.1-x86_64
Base Docker Image: python:3.8-slim
Application Framework: Python / Django / uwsgi

@watkinsmike

watkinsmike commented Jul 29, 2024

Also experiencing the same behavior with the latest AMI:
AMI al2023-ami-ecs-hvm-2023.0.20240712-kernel-6.1-x86_64

Going from AL2 to AL2023 results in a significant memory consumption increase that seems to keep growing over time (this appears to be generalized, regardless of language/framework). This is especially troublesome since AWS recommends AL2023 over AL2.

If the ECS internal team is aware of this, is there somewhere we can track it? This thread doesn't really indicate that anything is being done to investigate or fix it.

It seems like this has been an issue for months, and I'd like to stay current on any progress or updates.

@yrral86

yrral86 commented Oct 11, 2024

We have memory limits set at the TaskDefinition level, and the JVM is now allocating heap based on the total physical RAM of the host system, which is blowing up our RAM usage. We were able to set memory limits on the individual ContainerDefinitions within the TaskDefinition, and that seems to have fixed it for us.
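
To illustrate why that matters (back-of-the-envelope numbers only, assuming a 32 GiB host and the JVM's default -XX:MaxRAMPercentage of 25): if the JVM cannot see a container-level limit, it sizes the heap from whatever RAM it does see, which with a task-level-only limit can be the whole host.

```python
# Illustration only: how default JVM heap sizing changes depending on the RAM
# the JVM can actually see. Host size and limits below are assumptions.
def default_max_heap_mib(visible_ram_mib: float, max_ram_percentage: float = 25.0) -> float:
    return visible_ram_mib * max_ram_percentage / 100.0

host_ram_mib = 32 * 1024        # assumed host size
container_limit_mib = 1024      # container-level hard limit

print(default_max_heap_mib(host_ram_mib))         # ~8192 MiB if only a task-level limit is set
print(default_max_heap_mib(container_limit_mib))  # ~256 MiB once the container limit is visible
```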

@dduvnjak

dduvnjak commented Dec 10, 2024

We ran into the same issues; our Java services were either freezing or hitting OOM errors after the upgrade to AL2023. It turned out we were running a JDK version that did not have support for cgroups v2, meaning the JVMs inside the containers were not aware of the memory limits imposed by the ECS tasks. Upgrading the JDK to a recent version with cgroups v2 support fixed the issue.

@chadlwilson

@dduvnjak Interesting. I've had issues too; however, this is running Java 21.0.5, so it can't be the same root cause as yours.

Since I don't have good metrics from before/after the change, I can't really assume memory usage is greater, so I'd naively conclude that the OOM killer is just far more aggressive with the AL2023 ECS AMI and cgroups v2, and that perhaps instantaneous blips over the max (which come back down shortly after) are punished more severely and more quickly than before. I can't find good articles, but my understanding was that improving isolation (punishing containers that exceed their maximums more strictly) was one of the design goals of cgroups v2.

I suppose it's possible that even on a JDK with cgroups v2 support, the ergonomics and the values it uses differ slightly, causing it to consume more, though...

Strangely enough, https://docs.aws.amazon.com/linux/al2023/ug/ecs.html notes:

> All processes in the cgroup were always killed instead of the OOM-Killer choosing one process at a time, which is the intended behavior.

Actually, I thought that killing everything in the cgroup/container was one of the intentions/goals for containers with cgroups v2... (but I can confirm that on the AL2023 ECS AMIs it is just picking a process inside my container to kill, which leads to unpredictable behaviour).
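
A small sketch for checking that from inside a container on a cgroup v2 host (assuming the container sees its own cgroup mounted at /sys/fs/cgroup, i.e. a private cgroup namespace):

```python
# cgroup v2 exposes memory.oom.group per cgroup: 1 means the kernel kills every
# process in the cgroup together on OOM, 0 means the OOM killer picks victims
# one at a time (the behaviour described above).
from pathlib import Path

def oom_group_enabled() -> bool:
    return Path("/sys/fs/cgroup/memory.oom.group").read_text().strip() == "1"

if __name__ == "__main__":
    print("memory.oom.group:", "whole cgroup is killed" if oom_group_enabled()
          else "OOM killer picks individual processes")
```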

@Navapon

Navapon commented Dec 12, 2024

I have an issue running datadog-agent daemon sets with a 512 MB hard memory limit.

We got a lot of OOM alerts all the time.

Before migrating from AL2 to AL2023 this problem never happened, so we migrated back to AL2.

@chadlwilson

I wonder how the experience differs between Fargate ECS users and old-school EC2 ones. Mine is non-fargate for what it's worth.

@beth-soptim

We have the same issue with Fargate. AWS doesn't seem to care since the "solution" is to increase the (paid) memory via Fargate/EC2.

@Navapon

Navapon commented Dec 12, 2024

> I wonder how the experience differs between Fargate ECS users and old-school EC2 ones. Mine is non-fargate for what it's worth.

In my experience, Fargate itself is quite convenient for those who don't want to worry about the infrastructure, but it is costly compared to EC2.

That said, there is Fargate Spot, which is very cheap; if your workload is fault tolerant, it is a good candidate.

By the way, there are limitations when using Fargate, such as not being able to adjust or use host capabilities.

@chadlwilson

chadlwilson commented Dec 12, 2024

Yes, I understand Fargate; I meant whether Fargate users saw an increase in container OOM kills similar to what EC2 users see with AL2023, not the trade-offs. This ticket was opened with respect to EC2 originally and is on the AMI repo, so it doesn't really relate to Fargate by definition.

Actually, after looking more closely (since I don't use Fargate right now personally), it seems only an AL2-based platform version is supported for Fargate (?), so this problem (the increase in memory usage or OOM kills going from AL2 to AL2023) cannot apply, by definition.

> Based on Amazon Linux 2.

Same per aws/containers-roadmap#2285

@Navapon

Navapon commented Dec 12, 2024

Oh sorry, I misunderstood.

I have used Fargate platform version 1.4.0 with prod workloads and have not experienced any OOM kills like this.


@chadlwilson

> We were able to set memory limits on the individual ContainerDefinitions within the TaskDefinition and that seems to have fixed it for us

There's some commentary on cgroups v2 at https://github.com/cockroachdb/cockroach/issues/114774 that notes a few relevant things which might explain why it's necessary to set task-specific limits when relying on cgroup kernel stats to auto-configure software (memory.max in cgroups v2 gives no way to determine the effective memory limit of a child cgroup, unlike memory.limit_in_bytes in cgroups v1).
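
A sketch of what that difference means in practice on the host: with cgroups v2 the effective limit has to be computed by walking up the hierarchy and taking the smallest memory.max, since no single file reports it for a child cgroup. The paths below are illustrative, not the exact slice names the ECS agent uses.

```python
# Sketch: compute the effective cgroup v2 memory limit for a cgroup by taking
# the minimum memory.max along the path to the root of the unified hierarchy.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def effective_memory_limit(cgroup: Path) -> int | None:
    """Smallest memory.max between `cgroup` and the root, or None if unlimited."""
    limit = None
    current = cgroup
    while True:
        max_file = current / "memory.max"
        if max_file.exists():                     # the root cgroup has no memory.max
            raw = max_file.read_text().strip()
            if raw != "max":                      # "max" means no limit at this level
                limit = int(raw) if limit is None else min(limit, int(raw))
        if current == CGROUP_ROOT or current == current.parent:
            break
        current = current.parent
    return limit

if __name__ == "__main__":
    # Hypothetical task cgroup path; the real path depends on how the agent
    # names task slices on the host.
    print(effective_memory_limit(CGROUP_ROOT / "ecstasks.slice" / "example-task.scope"))
```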

@dduvnjak

> @dduvnjak Interesting. I've had issues too; however, this is running Java 21.0.5, so it can't be the same root cause as yours.

@chadlwilson fwiw, our ECS task containers have Cpu, Memory and MemoryReservation limits set, and the JVM is configured with -XX:MaxRAMPercentage=85.0
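
For context on how little headroom that leaves, some simple arithmetic with an assumed 1 GiB hard limit (not dduvnjak's actual numbers):

```python
# With -XX:MaxRAMPercentage=85.0 and a 1 GiB hard limit, roughly 85% of the
# cgroup limit can become Java heap, leaving the rest for metaspace, thread
# stacks, code cache, direct buffers and other native allocations charged to
# the cgroup.
hard_limit_mib = 1024                       # assumed container hard limit
max_heap_mib = hard_limit_mib * 0.85        # ~870 MiB
headroom_mib = hard_limit_mib - max_heap_mib
print(f"heap up to ~{max_heap_mib:.0f} MiB, ~{headroom_mib:.0f} MiB left for everything else")
```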
