
ServiceExportReconciler keeps re-registering instances in AWS Cloud Map #312

Open
mukshe01 opened this issue Dec 6, 2023 · 10 comments

@mukshe01

mukshe01 commented Dec 6, 2023

Hi,

We are running v0.3.1 of aws-cloud-map-mcs-controller to register services to AWS Cloud Map. We have another application that looks up Cloud Map for service discovery, and we have a requirement to populate some custom attributes on the Cloud Map instances.

Is there a way to define custom attributes in the ServiceExport definition of the Kubernetes object? Currently it is defined like this:

kind: ServiceExport
apiVersion: multicluster.x-k8s.io/v1alpha1
metadata:
  namespace: namespace1
  name: service-name

Currently we are using a Lambda trigger in AWS to populate these custom attributes. This is causing an issue with ServiceExportReconciler: it keeps re-registering instances in AWS Cloud Map because the endpoint/instance definition in Cloud Map differs from what the controller expects. Is there a way to stop this re-registration from happening?

Thank you.

@runakash
Member

runakash commented Dec 6, 2023

Hello @mukshe01,

We don't have support for custom attributes export/import.

Currently we are using a Lambda trigger in AWS to populate these custom attributes. This is causing an issue with ServiceExportReconciler: it keeps re-registering instances in AWS Cloud Map because the endpoint/instance definition in Cloud Map differs from what the controller expects. Is there a way to stop this re-registration from happening?

The code compares the endpoints IP/Port with CloudMap instances. I don't believe changing the attributes will trigger re-registration.
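As a rough illustration of that comparison, here is a self-contained sketch (not the controller's actual code; the `Endpoint` type and `changed` helper are simplified stand-ins) showing how an IP/port-based equality check deliberately ignores attributes:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for the controller's endpoint model.
type Endpoint struct {
	IP         string
	Port       int32
	Attributes map[string]string
}

// changed reports whether two endpoint sets differ on IP/port identity.
// Attributes are deliberately not compared, which is the behavior described
// above: attribute edits alone should not force a re-registration.
func changed(desired, actual []Endpoint) bool {
	key := func(e Endpoint) string { return fmt.Sprintf("%s:%d", e.IP, e.Port) }
	if len(desired) != len(actual) {
		return true
	}
	seen := make(map[string]bool, len(actual))
	for _, e := range actual {
		seen[key(e)] = true
	}
	for _, e := range desired {
		if !seen[key(e)] {
			return true
		}
	}
	return false
}

func main() {
	a := []Endpoint{{IP: "172.24.126.219", Port: 8080, Attributes: map[string]string{"K8S_CONTROLLER": "mcs"}}}
	b := []Endpoint{{IP: "172.24.126.219", Port: 8080, Attributes: map[string]string{"custom": "val"}}}
	fmt.Println(changed(a, b)) // prints "false": attributes differ, but IP/port match
}
```

Under this logic, only an IP or port difference between the desired and actual endpoint sets would trigger registration.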

@mukshe01
Author

mukshe01 commented Dec 7, 2023

Hi Akash,

Many thanks for your response,

We are trying to understand what change is triggering the re-registrations. In Kubernetes, the app exposed by the service hasn't restarted for the last 16 hours, which means the IP shouldn't change; the port, 8080, also stays constant.

$ date
Thu Dec 7 09:55:51 UTC 2023

$ kubectl get po rest-api-develop-fc4d7778c-pqss5 -n participant1-develop -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rest-api-develop-fc4d7778c-pqss5 1/1 Running 0 16h 172.24.126.219 ip-172-24-126-165.ec2.internal

This is the Kubernetes service config:

$ kubectl describe svc rest-api-develop -n participant1-develop
Name: rest-api-develop
Namespace: participant1-develop
Labels: app.kubernetes.io/instance=rest-api-develop
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=rest-api
app.kubernetes.io/version=1.16.0
helm.sh/chart=rest-api-0.1.0
Annotations: meta.helm.sh/release-name: rest-api-develop
meta.helm.sh/release-namespace: participant1-develop
Selector: app.kubernetes.io/instance=rest-api-develop,app.kubernetes.io/name=rest-api
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.100.234.225
IPs: 10.100.234.225
Port: ra-main 8080/TCP
TargetPort: 8080/TCP
NodePort: ra-main 30679/TCP
Endpoints: 172.24.126.219:8080
Session Affinity: None
External Traffic Policy: Cluster
Events:

Below are logs of two occurrences of endpoint re-registration for namespace participant1-develop, service rest-api-develop. As you can see, it is registering instances/endpoints to Cloud Map with the same IP (the IP didn't change); ideally we would expect it not to update Cloud Map as there is no change. I have attached the full controller logs for your reference.

--
{"level":"info","ts":1701931995.9447143,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701931995.9447424,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701931995.9447896,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-develop","serviceName":"rest-api-develop","endpoints":[{"Id":"tcp-172_24_126_219-8080","IP":"172.24.126.219","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}

{"level":"info","ts":1701937389.3627253,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701937389.3627617,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-develop","name":"rest-api-develop"}
{"level":"info","ts":1701937389.3628213,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-develop","serviceName":"rest-api-develop","endpoints":[{"Id":"tcp-172_24_126_219-8080","IP":"172.24.126.219","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}

Would you help us understand what change is triggering the update?

cloud-map-mcs-controller.log

Is there a way to increase the logging level and to change the timestamp format to something human-readable? We are running the controller using the command kubectl apply -k "mcs-controller/controller_install_release" after downloading everything from https://github.com/aws/aws-cloud-map-mcs-controller-for-k8s/tree/main/config locally. We can change the file config/default/manager_auth_proxy_patch.yaml to add an additional arg here if you tell us what it should be to increase tracing and change the time format.

- name: manager
  args:
  - "--health-probe-bind-address=:8081"
  - "--metrics-bind-address=127.0.0.1:8080"
  - "--leader-elect"
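For reference, here is a sketch of that patch with verbosity and timestamp flags added. These are standard controller-runtime zap flags (the debug level is later confirmed working in this thread); the exact values are assumptions to verify against your controller version:

```yaml
# config/default/manager_auth_proxy_patch.yaml (excerpt, sketch)
- name: manager
  args:
  - "--health-probe-bind-address=:8081"
  - "--metrics-bind-address=127.0.0.1:8080"
  - "--leader-elect"
  - "--zap-log-level=debug"       # raise verbosity (controller-runtime zap flag)
  - "--zap-time-encoding=iso8601" # human-readable timestamps instead of epoch floats
```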

Regards
Shekhar

@runakash
Member

runakash commented Dec 7, 2023

You can use this reference to customize the logging: https://sdk.operatorframework.io/docs/building-operators/golang/references/logging/

@mukshe01
Author

mukshe01 commented Dec 8, 2023

Hi,

Thank you for that. We changed the log level to debug (--zap-log-level debug) and captured the logs when re-registration of the endpoints happened. Looking at the logs, it is not clear why this happened, as the endpoint app did not restart and the endpoint IPs did not change.

Here is the full log:

mcs-controlller-debug.log

Log snippet when re-registering endpoints:

sterSetId":"1source"}
2023-12-08T16:35:56.058213089Z stderr F {"level":"debug","ts":1702053356.0581286,"logger":"controllers.ServiceExportReconciler","msg":"reconciling ServiceExport","Namespace":"participant1-develop","Name":{"namespace":"participant1-develop","name":"rest-api-develop"}}
2023-12-08T16:35:56.058251185Z stderr F {"level":"info","ts":1702053356.0581875,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-develop","name":"rest-api-develop"}
2023-12-08T16:35:56.05827025Z stderr F {"level":"info","ts":1702053356.058198,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-develop","name":"rest-api-develop"}
2023-12-08T16:35:56.058335948Z stderr F {"level":"info","ts":1702053356.0582664,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-develop","serviceName":"rest-api-develop","endpoints":[{"Id":"tcp-172_24_126_204-8080","IP":"172.24.126.204","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":false,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}},{"Id":"tcp-172_24_126_219-8080","IP":"172.24.126.219","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-develop","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1694697956000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-126-165.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}
2023-12-08T16:35:57.981590837Z stderr F {"level":"debug","ts":1702053357.9814687,"logger":"controllers.CloudmapReconciler","msg":"clusterProperties found","ClusterId":"1source-develop","ClusterSetId":"1source"}
2023-12-08T16:35:57.981632268Z stderr F {"level":"debug","ts":1702053357.9815843,"logger":"controllers.CloudmapReconciler","msg":"syncing namespace","namespace":"participant1-develop"}
2023-12-08T16:35:57.988809874Z stderr F {"level":"debug","ts":1702053357.98873,"logger":"controllers.CloudmapReconciler","msg":"syncing service","namespace":"participant1-develop","service":"rest-api-develop"}
2023-12-08T16:35:57.98900082Z stderr F {"level":"debug","ts":1702053357.9889314,"logger":"controllers.CloudmapReconciler","msg":"updating EndpointSlice","namespace":"participant1-develop","name":"imported-vonhveonrv-m6dsh"}
2023-12-08T16:35:57.997397891Z stderr F {"level":"debug","ts":1702053357.9972944,"logger":"controllers.ServiceExportReconciler","msg":"clusterProperties found","ClusterId":"1source-develop","ClusterSetId":"1source"}
2023-12-08T16:35:57.998135175Z stderr F {"level":"debug","ts":1702053357.998055,"logger":"controllers.CloudmapReconciler","msg":"syncing namespace","namespace":"participant2-develop"}
2023-12-08T16:35:58.002388689Z stderr F {"level":"debug","ts":1702053358.0023232,"logger":"controllers.CloudmapReconciler","msg":"syncing service","namespace":"participant2-develop","service":"rest-api-develop"}

We have 3 ServiceExport objects in Kubernetes; when re-registration happens, it happens for all 3 services.

We are seeing this re-registration behavior every few hours and we are unclear what is triggering it. Each occurrence of this issue results in a few seconds of service outage.

Would you help us prevent this from happening?


Thank you.

Regards,
Shekhar

@runakash
Member

runakash commented Dec 9, 2023

On reviewing the ServiceExport controller code, it seems we export the endpoints based on the k8s events on the Service and EndpointSlice. Updating metadata will trigger k8s events, and thus the attributes are getting overridden.

@mukshe01
Author

Hi Akash,

We compared the Service and EndpointSlice k8s object definitions for the affected endpoints before and after the occurrence of the problem and they are exactly identical; we do not see any metadata difference (command run: kubectl describe endpointslices rest-api-staging-xkqrz -n participant1-staging).

Here is an example EndpointSlice:

endpointslices_p1_10-12.txt

Also, we captured the Kubernetes events for the last 24 hours in the namespace the service is in, and we do not see any k8s event at the exact time the problem occurs. Are we missing anything?

On a side note, we created ServiceExport objects for 2 new services (just to test) within the same cluster yesterday. We did not see the problem occur for the new services; however, for the original service it occurred 8 to 10 times in the last 24 hours. One of the test services runs on the same k8s node (EC2) and the other runs on a different node. Interestingly, the only difference between the test services and the live service is the custom attributes we populate in Cloud Map for the instances using a Lambda.

Would you help us get to the bottom of this issue? Please let us know if you need any information from our side.

Regards
Shekhar

@runakash
Member

The controller watches the Kubernetes Service and EndpointSlice resources. The code for that is abstracted out by the Operator SDK. Perhaps you can subscribe to the events and debug which attribute change is triggering reconciliation.
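As a starting point for that debugging, here is a self-contained sketch of the kind of diff logging one could wire into a controller-runtime update-event predicate. The `diffMaps` helper is illustrative, not part of the controller's code; in practice it would be called from an `UpdateFunc` with the old and new object's annotations or labels:

```go
package main

import (
	"fmt"
	"sort"
)

// diffMaps returns the sorted keys whose values differ between two string
// maps, e.g. the annotations of the old and new object in an update event.
func diffMaps(oldM, newM map[string]string) []string {
	var changed []string
	for k, v := range newM {
		if ov, ok := oldM[k]; !ok || ov != v {
			changed = append(changed, k) // added or modified
		}
	}
	for k := range oldM {
		if _, ok := newM[k]; !ok {
			changed = append(changed, k) // removed
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	oldAnn := map[string]string{"a": "1", "b": "2"}
	newAnn := map[string]string{"a": "1", "b": "3", "c": "4"}
	// In a real predicate this would be logged from UpdateFunc before
	// deciding whether to enqueue a reconcile request.
	fmt.Println(diffMaps(oldAnn, newAnn)) // prints "[b c]"
}
```

Logging such a diff for Service and EndpointSlice update events should reveal which field changes on each re-registration.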

@mukshe01
Author

Hi Akash,

We captured all k8s events in the cluster for the last few hours; the events dump file is attached. We are also attaching the mcs controller log.

mcs_controller_log_20231212.txt
events_log_20231212.txt

As seen in the mcs controller log, the controller is re-registering instance tcp-172_24_125_249-8080 at 2023-12-12T11:45:55.

Name":{"namespace":"participant1-qar","name":"rest-api-qar"}}
2023-12-12T11:45:55.611488108Z stderr F {"level":"info","ts":1702381555.6114109,"logger":"controllers.ServiceExportReconciler","msg":"updating Cloud Map service","namespace":"participant1-qar","name":"rest-api-qar"}
2023-12-12T11:45:55.611498026Z stderr F {"level":"info","ts":1702381555.6114464,"logger":"cloudmap.client","msg":"fetching a service","namespace":"participant1-qar","name":"rest-api-qar"}
2023-12-12T11:45:55.617349345Z stderr F {"level":"info","ts":1702381555.6172726,"logger":"cloudmap.client","msg":"registering endpoints","namespaceName":"participant1-qar","serviceName":"rest-api-qar","endpoints":[{"Id":"tcp-172_24_125_249-8080","IP":"172.24.125.249","EndpointPort":{"Name":"ra-main","Port":8080,"TargetPort":"","Protocol":"TCP"},"ServicePort":{"Name":"ra-main","Port":8080,"TargetPort":"8080","Protocol":"TCP"},"ClusterId":"1source-qar","ClusterSetId":"1source","ServiceType":"ClusterSetIP","ServiceExportCreationTimestamp":1700579083000,"Ready":true,"Hostname":"","Nodename":"ip-172-24-125-8.ec2.internal","Attributes":{"K8S_CONTROLLER":"aws-cloud-map-mcs-controller-for-k8s 887737e (887737e)"}}]}


We do not see any k8s events at this time. We do see some events at 2023-12-12T11:40:42 (5 minutes before the issue) of type SuccessfullyReconciled for targetGroupBinding:

46","resourceVersion":"32067978","creationTimestamp":"2023-12-12T11:40:42Z","managedFields":[{"manager":"controller","operation":"Update","apiVersion":"v1","time":"2023-12-12T11:40:42Z"}]},"involvedObject":{"kind":"TargetGroupBinding","namespace":"participant3-qar","name":"rest-api-qar","uid":"ac1dd5d8-ae3c-4023-a087-8b7ce140f3ad","apiVersion":"elbv2.k8s.aws/v1beta1","resourceVersion":"24208797"},"reason":"SuccessfullyReconciled","message":"Successfully reconciled","source":{"component":"targetGroupBinding"},"firstTimestamp":"2023-12-11T06:47:40Z","lastTimestamp":"2023-12-12T11:40:42Z","count":28,"type":"Normal","eventTime":null,"reportingComponent":"","reportingInstance":""}}

We can confirm these events won't update the Service/EndpointSlice metadata, as we captured and compared the Service/EndpointSlice k8s objects before and after the issue and they are identical. Also, these k8s events get generated exactly once every hour, and we are not seeing the issue happen every hour.

Would you be able to help us identify which event is causing the controller to re-register the endpoints?

@mukshe01
Author

Hi,

Just to update on this:

We are still not sure what is triggering this change; however, we are sure that the issue only occurs for instances to which we add custom attributes in AWS Cloud Map.

We have 4 similar applications running on the same k8s node. The issue happens to 3 instances (around the same time); for 1 it doesn't happen, that 1 instance being the one for which we are not altering the AWS Cloud Map attributes. Once we added a custom attribute to this instance in Cloud Map, the issue started happening.

Regards
Shekhar.

@runakash
Member

runakash commented Jan 9, 2024

Hello,

There is no code logic within the controller that triggers the re-registration. The controller subscribes to Kubernetes update events for the kinds ServiceExport, Service, EndpointSlice and ClusterProperty. Unfortunately we don't have enough observability to say what kind of update is triggering an event.

You can choose to fork the controller and implement the attributes reconciliation. The way to approach this would be to add the instance attributes as ServiceExport object annotations:

kind: ServiceExport
apiVersion: multicluster.x-k8s.io/v1alpha1
metadata:
  namespace: example
  name: my-service
  annotations:
    attr1: val1
    attr2: val2

And then update the serviceexport_controller.go to export the annotations as attributes.
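A minimal self-contained sketch of what that export step could look like. The `annotationsToAttributes` helper and the annotation-prefix filter are assumptions for illustration, not existing code in serviceexport_controller.go:

```go
package main

import (
	"fmt"
	"strings"
)

// annotationsToAttributes merges ServiceExport annotations into a Cloud Map
// instance attribute map. Tool-managed annotations (kubectl/Helm) are skipped;
// the prefix filter here is an assumption and would need tuning.
func annotationsToAttributes(annotations, attributes map[string]string) map[string]string {
	out := make(map[string]string, len(attributes)+len(annotations))
	for k, v := range attributes {
		out[k] = v // keep controller-managed attributes like K8S_CONTROLLER
	}
	for k, v := range annotations {
		// Skip annotations set by tooling rather than the user.
		if strings.Contains(k, "kubernetes.io/") || strings.HasPrefix(k, "meta.helm.sh/") {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	ann := map[string]string{"attr1": "val1", "meta.helm.sh/release-name": "rest-api"}
	attrs := map[string]string{"K8S_CONTROLLER": "aws-cloud-map-mcs-controller-for-k8s"}
	fmt.Println(annotationsToAttributes(ann, attrs)["attr1"]) // prints "val1"
}
```

Since the controller would then own these attributes as part of its desired state, the reconciler would stop fighting the Lambda-written values.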

You can push the change as a PR, and we can review the code.


And a disclaimer: the mcs controller is in the alpha stage. It should be used in production systems with caution.
