EKS v1.30.4 Upgrade Caused Instant Service CrashbackLoop #3370

Open
RareBodhi opened this issue Oct 24, 2024 · 4 comments

Comments

@RareBodhi

RareBodhi commented Oct 24, 2024

What happened:

Our AWS infrastructure upgraded our EKS nodes from v1.30.1-eks-e564799 to v1.30.4-eks-16b398d. The instant this happened, all of our production services crashed at once. The only exceptions were single pods on a few services where minimum node tolerations were configured.

Every one of our services (NodeJS and Java) fell into a crash loop at the same time.

If the upgrade is the cause, it may be traceable to this release: https://github.com/aws/eks-distro/releases/tag/v1-30-eks-15
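For anyone comparing notes, it may help to confirm which nodes actually picked up the new kubelet version. The sketch below groups node names by the `status.nodeInfo.kubeletVersion` field found in the output of `kubectl get nodes -o json`. The function name `nodes_by_kubelet_version` and the node names in the usage note are hypothetical, chosen only for illustration.

```python
import json
from collections import defaultdict


def nodes_by_kubelet_version(node_list: dict) -> dict[str, list[str]]:
    """Group node names by kubelet version.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for item in node_list.get("items", []):
        version = item["status"]["nodeInfo"]["kubeletVersion"]
        groups[version].append(item["metadata"]["name"])
    return dict(groups)


def load_node_list(path: str) -> dict:
    """Read a saved `kubectl get nodes -o json` dump from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A possible usage: save the output with `kubectl get nodes -o json > nodes.json`, then call `nodes_by_kubelet_version(load_node_list("nodes.json"))`. A node named, say, `node-a` (hypothetical) running the upgraded image would then appear under the key `v1.30.4-eks-16b398d`.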

NodeJS Error

#
# Fatal error in , line 0
# Check failed: 12 == (*__errno_location ()).
#
#FailureMessage Object: 0xffffee1205a0
 1: 0xceb064  [node]
 2: 0x1f43eb0 V8_Fatal(char const*, ...) [node]
 3: 0x1f4e5e8 v8::base::OS::SetPermissions(void*, unsigned long, v8::base::OS::MemoryPermission) [node]
 4: 0x10e1974 v8::internal::MemoryAllocator::SetPermissionsOnExecutableMemoryChunk(v8::internal::VirtualMemory*, unsigned long, unsigned long, unsigned long) [node]
 5: 0x10e1cb4 v8::internal::MemoryAllocator::AllocateAlignedMemory(unsigned long, unsigned long, unsigned long, v8::internal::AllocationSpace, v8::internal::Executability, void*, v8::internal::VirtualMemory*) [node]
 6: 0x10e1eb8 v8::internal::MemoryAllocator::AllocateUninitializedChunkAt(v8::internal::BaseSpace*, unsigned long, v8::internal::Executability, unsigned long, v8::internal::PageSize) [node]
 7: 0x10e2488 v8::internal::MemoryAllocator::AllocatePage(v8::internal::MemoryAllocator::AllocationMode, v8::internal::Space*, v8::internal::Executability) [node]
 8: 0x10f6e78 v8::internal::PagedSpaceBase::TryExpandImpl() [node]
 9: 0x10f98c0  [node]
10: 0x10f9e54 v8::internal::PagedSpaceBase::RefillLabMain(int, v8::internal::AllocationOrigin) [node]
11: 0x1070988 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
12: 0x1050230 v8::internal::Factory::CodeBuilder::AllocateInstructionStream(bool) [node]
13: 0x1050604 v8::internal::Factory::CodeBuilder::BuildInternal(bool) [node]
14: 0xed068c v8::internal::baseline::BaselineCompiler::Build(v8::internal::LocalIsolate*) [node]
15: 0xee2a04 v8::internal::GenerateBaselineCode(v8::internal::Isolate*, v8::internal::Handle<v8::internal::SharedFunctionInfo>) [node]
16: 0xf3c1b0 v8::internal::Compiler::CompileSharedWithBaseline(v8::internal::Isolate*, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [node]
17: 0xf3c734 v8::internal::Compiler::CompileBaseline(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [node]
18: 0xece3cc v8::internal::baseline::BaselineBatchCompiler::CompileBatch(v8::internal::Handle<v8::internal::JSFunction>) [node]
19: 0xf4708c v8::internal::Compiler::Compile(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [node]
20: 0x14587a8 v8::internal::Runtime_CompileLazy(int, unsigned long*, v8::internal::Isolate*) [node]
21: 0x1862a84  [node]

The Java error was less verbose, but looks like this:

Error occurred during initialization of VM
Failed to mark memory page as executable - check if grsecurity/PaX is enabled
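Both errors appear to involve a failure to mark memory as executable. In the NodeJS trace, the failing frame is `v8::base::OS::SetPermissions`, and the check expects `errno` to equal 12, which is `ENOMEM` on Linux. The check failing suggests the permission change failed with some other `errno`. To see what a container on an affected node reports, a small probe like the one below could be run inside a pod. It is only a sketch: it assumes a Linux container with Python 3.9+ available, and it approximates what a JIT does (map a read/write page, then switch it to read/execute) rather than reproducing V8's or the JVM's exact calls. The function name `probe_exec_mprotect` is hypothetical.

```python
import ctypes
import errno
import mmap
import os


def probe_exec_mprotect(size: int = mmap.PAGESIZE) -> tuple[bool, int]:
    """Map an anonymous RW page, then try to make it R+X via mprotect.

    Returns (succeeded, errno). errno is 0 on success.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.mprotect.restype = ctypes.c_int
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.munmap.restype = ctypes.c_int
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    addr = libc.mmap(None, size, mmap.PROT_READ | mmap.PROT_WRITE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    # MAP_FAILED is (void *)-1; ctypes returns it as an unsigned integer.
    if addr is None or addr == ctypes.c_void_p(-1).value:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        if libc.mprotect(addr, size, mmap.PROT_READ | mmap.PROT_EXEC) == 0:
            return True, 0
        return False, ctypes.get_errno()
    finally:
        libc.munmap(addr, size)


if __name__ == "__main__":
    ok, err = probe_exec_mprotect()
    if ok:
        print("mprotect(PROT_READ | PROT_EXEC) succeeded")
    else:
        name = errno.errorcode.get(err, "unknown")
        print(f"mprotect failed: errno={err} ({name}): {os.strerror(err)}")
```

If the probe fails on v1.30.4 nodes but succeeds on v1.30.1 nodes, the reported `errno` might help narrow down whether a node-level security policy is involved. That is a guess, not something confirmed in this thread.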

What you expected to happen:

An automatically scheduled patch-version upgrade to 1.30.4 should not cause all services to crash with internal memory errors.

How to reproduce it (as minimally and precisely as possible):

I don't currently have a reproduction. I want this bug report to serve as a central place where others who may have experienced the same issue can discuss it. I also cannot say with certainty that it's an issue with the EKS image, but it does seem very suspicious that the upgrade to 1.30.4 caused it and the downgrade fixed it instantly.

Anything else we need to know?:

Environment:

Related

@hiradkariminlx

We had the same issue with version v1.30.4-eks-16b398d. We downgraded to v1.30.1-eks-e564799, and it seems to be working for now.

@RareBodhi
Author

The solution that has just worked for us:

  1. Force downgrade back to our previous image version v1.30.1-eks-e564799
  2. Use AWS AMI ami-0e7912f782e850202
  3. Change from Bottlerocket to AL2

@RareBodhi
Author

Potentially related... not sure: projectcalico/calico#7886

@RareBodhi
Author

We've had ongoing networking issues with Redis ElastiCache, our DB, and other k8s networked services since attempting this downgrade.
