pods with attached security groups cannot reach Pod Identity Agent link local address #2797
Not sure why I thought the node security group would come into play here... traffic should never leave the node! However, doing a bit more digging, I've found that traffic isn't being routed correctly on pods with security groups attached. Failing pod:
Compared to a pod with no security groups:
So something about security group attachment also alters the route table for the pod.
Having now re-read the documentation and a few issues that seemed related, I have a functioning setup and a better understanding of CNI behavior. I've added the following options to my CNI config:
The key insight came from pulling on the thread in #1384, which seems to be more or less the same issue as this one: pods with SGPP configured have their traffic routed over a branch ENI and thus entirely bypass the primary interface on the node, and can't be routed to the link-local address used by the Pod Identity Agent without using standard mode and external SNAT.

I think the documentation should be updated to make clear that external SNAT and standard mode are required when using Pod Identities with SGPP. It would also be helpful to clarify the reasons why external SNAT is necessary. I suspect inbound communication to your pods from external VPNs, direct connections, and external VPCs will become a far less common use case than simply expecting Pod Identities to work. I think this should also be explained in the Pod Identities documentation so I've left feedback there as well.

It might also help to consider a more general solution for routing/forwarding traffic to link-local addresses. However, I'm admittedly quite naive on the practical challenges and implications of such a solution.

These suggestions aside, thank you for the good documentation and engagement in issues. I was able to figure this out without much effort. Hopefully this will help someone else running into the same problem. 😄
@tmehlinger thank you for reporting this and for the impressive debugging! I went looking and it does not appear that the Pod Identity agent was covered with Security Groups for Pods, which is depressing to hear, so you are likely the first person to have tried this. Your assessment is accurate: when a pod matches a Security Group Policy, it is associated with a branch ENI and all traffic from the pod is routed through the trunk ENI on the node to ensure that EC2 security group rules are enforced. I will bring this up with our project manager to make sure we have this documented and a plan to come up with a more generalized solution. As a side note, I see that you have enabled Network Policy support. Do you have plans to move away from Security Groups in favor of Kubernetes network policies? That may simplify this networking setup significantly. |
Yes, this will be the first order that will provide clarity. Thanks for the detailed report and debugging. |
No problem, happy to help. :)
For intra-cluster communication, yes. However, I have pods that need to communicate with other AWS resources (RDS instances, for example) in peered/isolated VPC subnets so I'll still need SGPP. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Issue closed due to inactivity. |
I ran into a very similar issue with EKS Pod Identity when enabling pod security groups, all while using Istio. First, my app stopped spawning. The containerCredentialProvider was complaining about not being able to access instance metadata. The Istio proxy was complaining a lot about not being able to reach its certificate backend. The solution above didn't work. I'm not sure what the SNAT parameter is supposed to do here, but all my network traffic from the pod was on the new ENI, and thus was not part of the node's security group. So it couldn't talk to
I added the node's security group to the pod's SecurityGroupPolicy, and that worked. On this page: it says this This doesn't seem to be true. My pod was not able to hit the EKS Pod Identity service URL when the security group was applied without adding the node security group to the pod. I ended up here with the aws-node init params, although I'm not sure which of them did anything useful
Version info:
@orsenthil could we get this re-opened? This use-case is 100% optional now, but might become more important soon for us. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Our users have encountered this (we believe) in Istio ambient as well
We SNAT kubelet health probes to link-local addresses, and if POD_SECURITY_GROUP_ENFORCING_MODE=strict (current default) those probes begin to fail. Setting
We had a similar issue with Calico, and the resolution there was to change Calico to ignore link-local addresses. Link-local addresses are by-RFC not routable outside the local link, so blocking them via pod-level security group enforcement seems wrong/unnecessary - the link-local CIDR should probably be completely ignored by AWS VPC CNI here. Or at least, AWS VPC CNI should provide a flag or other override that allows pod security group enforcement to categorically ignore all link-local addresses. |
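To make the "categorically ignore all link-local addresses" idea concrete, here is a minimal Python sketch of the classification step. It is only an illustration using the standard library `ipaddress` module, not code from AWS VPC CNI or Calico; the helper name `is_link_local` is made up for this example.

```python
import ipaddress


def is_link_local(addr: str) -> bool:
    """Return True if addr is in 169.254.0.0/16 (IPv4) or fe80::/10 (IPv6)."""
    return ipaddress.ip_address(addr).is_link_local


# The Pod Identity Agent address from this issue, and a node address for contrast.
for addr in ("169.254.170.23", "10.0.128.192", "fe80::1"):
    print(addr, is_link_local(addr))
```

A check like this is where an "exempt link-local from pod security group enforcement" override would branch, assuming such an override existed.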
When |
Yes, I notice that the code removes the default route rule to force all packets through the trunked ENI via a lower-ordered rule. Adding back a higher-priority route rule that only handles link-local (similar to what that code already does for IPv6 gateway ICMP packets) would fix it. Something like this: bleggett@dfbc3fb If there's a strong case for capturing link-local traffic with SecurityGroups, a new flag is fine. I'm not sure there is a point in pushing link-local traffic through SG enforcement though, by definition. Given that there is already code there that excludes some kinds of traffic if
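The effect of adding back a higher-priority link-local rule can be sketched with a toy model of Linux policy routing, where rules are evaluated in ascending priority order and the first match selects the routing table. This is a simplified illustration, not the CNI's code or the linked commit; the priority numbers and table labels (`via-trunk-eni`, `node-local`) are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number is evaluated first, as with `ip rule`
    dest: str      # destination CIDR the rule matches
    table: str     # label for the routing table the rule selects


def select_table(rules, dst):
    """Return the table of the first matching rule in priority order, else None."""
    addr = ipaddress.ip_address(dst)
    for rule in sorted(rules, key=lambda r: r.priority):
        if addr in ipaddress.ip_network(rule.dest):
            return rule.table
    return None


# Hypothetical rule set for a pod with a security group attached:
# everything is forced through the trunk ENI.
sgp_rules = [Rule(10, "0.0.0.0/0", "via-trunk-eni")]

# Same rule set plus a higher-priority (lower-numbered) link-local exception.
patched_rules = sgp_rules + [Rule(5, "169.254.0.0/16", "node-local")]

print(select_table(sgp_rules, "169.254.170.23"))      # via-trunk-eni
print(select_table(patched_rules, "169.254.170.23"))  # node-local
print(select_table(patched_rules, "10.0.1.5"))        # via-trunk-eni
```

In this model only link-local destinations change tables; all other traffic still goes through the trunk ENI and stays subject to security group enforcement.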
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Actually, it was suggested that instead of fixing this bug, a workaround would be adding a link-local CIDR to the SG to let link-local traffic through. But that's not a valid workaround if the packets can't make it that far to begin with in
When using security groups for pods, pods with security groups attached cannot reach the Pod Identity agent on 169.254.170.23. Any pods without security groups can reach the agent without issue. The agent pods have no security groups associated and I've ensured that the security groups on failing pods permit egress traffic on TCP port 80, and my node security group permits ingress traffic on port 80 from cluster subnets. I've tried various combinations of egress/ingress from pod/node security groups, and even a blanket policy that permits traffic to/from 0/0, with no success.

I'm using the EKS Addon with the following configuration:
I've tried running the CNI with POD_SECURITY_GROUP_ENFORCING_MODE set to standard, but this causes traffic to peered VPCs to be denied in addition to pod identity traffic being dropped (and I want strict enforcement, regardless).

Reading the documentation for standard mode behavior:
My totally wild guess about what's happening is strict mode requires enforcement of security group rules and the node security group is dropping traffic destined for a link local address as invalid.
Could someone point me in the right direction? Thanks!
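For anyone trying to reproduce the symptom, a minimal Python sketch of a reachability check follows. It only tests whether a TCP connection to the agent address and port mentioned above (169.254.170.23, port 80) can be opened from inside a pod; it does not speak the agent's protocol. The helper name `can_connect` and the 2-second timeout are assumptions for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable routes, and timeouts.
        return False


if __name__ == "__main__":
    print("agent reachable:", can_connect("169.254.170.23", 80))
```

Running it in a pod without security groups and again in a pod with a SecurityGroupPolicy applied should show the difference described in this report.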
Environment:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"28+", GitVersion:"v1.28.5-eks-5e0fdde", GitCommit:"e78a4be9da4c375a87a109e0f4a5f4a8d2bc17c0", GitTreeState:"clean", BuildDate:"2024-01-02T20:34:46Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}
CNI Version: v1.16.2-eksbuild.1
OS (e.g. cat /etc/os-release): AWS EKS 1.28.5 AMI.
Kernel (e.g. uname -a): Linux ip-10-0-128-192.us-west-2.compute.internal 5.10.205-195.807.amzn2.aarch64 #1 SMP Tue Jan 16 18:29:00 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux