AWS Pacemaker awsvip failing with different errors #1876

Open
jayan800 opened this issue Jun 23, 2023 · 4 comments

Comments

@jayan800

Hi All,

We are running a two-node Pacemaker cluster in AWS and use the "awsvip" resource type to configure the VIP. The configuration is below:

pcs resource show privip_node1

Resource: privip_node1 (class=ocf provider=heartbeat type=awsvip)
Attributes: secondary_private_ip=10.x.x.x
Operations: migrate_from interval=0s timeout=30s (privip_node1-migrate_from-interval-0s)
migrate_to interval=0s timeout=30s (privip_node1-migrate_to-interval-0s)
monitor interval=20s timeout=30s (privip_node1-monitor-interval-20s)
start interval=0s timeout=30s (privip_node1-start-interval-0s)
stop interval=0s timeout=30s (privip_node1-stop-interval-0s)
validate interval=0s timeout=10s (privip_node1-validate-interval-0s)

pcs resource show node1_vip

Resource: node1_vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.x.x.x
Operations: monitor interval=10s timeout=20s (node1_vip-monitor-interval-10s)
start interval=0s timeout=20s (node1_vip-start-interval-0s)
stop interval=0s timeout=20s (node1_vip-stop-interval-0s)

The EC2 instance is configured to use IMDSv2. The fence_aws agent and the resource-agents package have also been upgraded to the most recent versions, which support IMDSv2. Additionally, the resource is set up to use the IAM profile credentials.
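
One way to confirm from the node itself that IMDSv2 and the instance profile are reachable is to query the metadata service by hand. The following is a minimal sketch, not the agent's own code: the function names and the `RUN_IMDS_CHECK` switch are made up for illustration, and the 5-second timeout and 60-second token lifetime are arbitrary choices.

```shell
#!/bin/sh
# Sketch: query IMDSv2 manually. Function names are illustrative only.
IMDS="http://169.254.169.254/latest"

imds_token() {
    # IMDSv2 requires a session token, obtained with a PUT request.
    curl -sS -f --max-time 5 -X PUT "$IMDS/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

imds_get() {
    # $1 = session token, $2 = metadata path such as instance-id
    curl -sS -f --max-time 5 -H "X-aws-ec2-metadata-token: $1" \
        "$IMDS/meta-data/$2"
}

# Only contact the metadata service when explicitly asked to,
# e.g. RUN_IMDS_CHECK=1 sh imds-check.sh
if [ "${RUN_IMDS_CHECK:-0}" = 1 ]; then
    token=$(imds_token) || exit 1
    echo "instance-id: $(imds_get "$token" instance-id)"
    # Prints the name of the attached IAM role, if there is one.
    echo "IAM role:    $(imds_get "$token" iam/security-credentials/)"
fi
```

If the token request itself is refused or times out, the problem is likely at the network or instance-metadata level rather than in the resource agent.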

fence-agents-aws-4.2.1-41.el7_9.3.x86_64
python-s3transfer-0.1.13-1.0.1.el7.noarch
resource-agents-4.1.1-61.el7_9.15.x86_64

pip list | grep -i boto
boto3 (1.10.0)
botocore (1.13.50)

aws --version
aws-cli/2.9.4 Python/3.9.11 Linux/3.10.0-1160.80.1.0.1.el7.x86_64 exe/x86_64.oracle.7 prompt/off

pip3 list | grep -i boto
boto3 1.23.10
botocore 1.26.10

The privip resource consistently fails with different errors:

pengine: warning: unpack_rsc_op_failure: Processing failed monitor of privip_node2 on node2: unknown error | rc=1
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000:109357 - timed out after 30000ms

Jun 16 10:01:43 node2 lrmd[36967]: notice: privip_node2_monitor_20000:13042:stderr [ Unable to locate credentials. You can configure credentials by running "aws configure". ]
Jun 16 10:01:43 node2 crmd[36970]: notice: privip_node2_monitor_20000:91 [ % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 359 100 359 0 0 37513 0 --:--:-- --:--:-- --:--:-- 39888\n\nUnable to locate credentials. You can configure credentials by running "aws configure".\n ]

Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #15 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ An error occurred (MissingParameter) when calling the DescribeInstances operation: The request must contain the parameter InstanceId ]

Failed Resource Actions:

  • privip_node1_start_0 on node1 'not running' (7): call=250, status=complete, exitreason='instance_id not found. Is this a EC2 instance?',
    last-rc-change='Fri May 26 07:27:46 2023', queued=0ms, exec=6597ms

Any advice would be great.

@oalbrigt
Contributor

Try running pcs resource debug-start --full <resource>. That should show you all the commands it's running, and hopefully some pointers to what's wrong.
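
As a rough sketch of that suggestion, the output can be saved to a file and searched for the messages already seen in the cluster logs. The helper name, the log path under /tmp, and the search strings are assumptions for illustration; only the `pcs resource debug-start --full` command comes from the comment above.

```shell
#!/bin/sh
# Sketch: run debug-start for a resource and save everything it prints.
# The function name and log location are illustrative assumptions.

debug_start_to_log() {
    resource="$1"
    logfile="$2"
    # --full prints the commands the agent runs; 2>&1 also keeps stderr,
    # which is where curl and aws CLI errors are written.
    pcs resource debug-start --full "$resource" > "$logfile" 2>&1
}

# Usage: sh debug-start.sh privip_node1
if [ -n "${1:-}" ]; then
    log="/tmp/$1-debug-start.log"
    debug_start_to_log "$1" "$log"
    # Search for the error strings reported in the issue.
    grep -n -i -E 'credentials|connection refused|instance_id' "$log" \
        || echo "no known error strings in $log"
fi
```

Because the failures in the logs are intermittent, a single clean run does not rule them out; repeating the command a few times may be needed to catch one.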

@jayan800
Author

jayan800 commented Jun 26, 2023

Thank you.

The debug command completed without any errors.

Is there anything else to check?

@jayan800 jayan800 changed the title AWS Pacemaker awsvip faling with different errors AWS Pacemaker awsvip failing with different errors Jun 26, 2023
@oalbrigt
Contributor

You can run pcs resource update <resource> trace_ra=1 and then disable/enable or restart the resource.

The trace files will be available in /var/lib/heartbeat/trace_ra/.
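
The steps above can be sketched as a small script. The function names and the `TRACE_DIR` variable are hypothetical, the default trace directory is an assumption that may differ per installation, and clearing `trace_ra` with an empty value is the usual pcs way to remove an attribute rather than something stated in this thread.

```shell
#!/bin/sh
# Sketch of the tracing steps; names here are illustrative, not from pcs.
TRACE_DIR="${TRACE_DIR:-/var/lib/heartbeat/trace_ra}"

enable_trace_and_restart() {
    # Turn on agent tracing, then restart so new operations are traced.
    pcs resource update "$1" trace_ra=1
    pcs resource restart "$1"
}

disable_trace() {
    # Assumption: setting an attribute to an empty value removes it.
    pcs resource update "$1" trace_ra=
}

list_traces() {
    # Print every file under the given trace directory, sorted by name.
    find "$1" -type f | sort
}

# Usage: sh trace.sh privip_node1
# Restarting a resource is disruptive, so nothing runs without an argument.
if [ -n "${1:-}" ]; then
    enable_trace_and_restart "$1"
    list_traces "$TRACE_DIR"
fi
```

Tracing writes a file for each traced operation, including the recurring monitor, so it is worth calling `disable_trace` once a failing run has been captured.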

@jayan800
Author

Thank you. I will enable the trace.
fingers crossed
