Move SSH key generation script from pam.d to /etc/profile (3.x) #1545
Description of changes
When FSx Lustre is configured with the new root_squash feature, and ParallelCluster is configured with Active Directory with home folders inside the FSx mount, pam_exec.so cannot run the SSH key generation script properly. pam_exec.so runs the script as root, but root is treated as nobody/nogroup within the root_squash'd FSx mount point, so it cannot access or modify files in any home folder.
Using su in the generation script to impersonate the user does not work around the problem: su itself would trigger pam_exec.so and cause a loop, which does not look trivial to avoid.
Instead, I suggest moving the key generation to /etc/profile, which is executed by default for every interactive login shell, as the connecting user, and so serves the purpose.
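To illustrate the proposal above, here is a minimal sketch of a key generation snippet that could be sourced from /etc/profile. It is not the script shipped in the cookbook: the function name generate_ssh_key, the RSA key type, and the id_rsa / authorized_keys paths are assumptions made for the example. Because the snippet runs as the connecting user rather than as root, writes to a home folder on a root_squash'd mount are subject to that user's permissions.

```shell
# Hypothetical sketch (not the cookbook's actual script): create a passwordless
# SSH key pair for the current user if none exists yet.
# The optional first argument overrides the target directory (default: ~/.ssh).
generate_ssh_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    key_file="$ssh_dir/id_rsa"   # assumed key name and type

    # Nothing to do if the key was generated on a previous login.
    if [ -f "$key_file" ]; then
        return 0
    fi

    # Create the directory with the permissions sshd expects.
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1

    # Generate the key pair quietly, with an empty passphrase.
    ssh-keygen -q -t rsa -N "" -f "$key_file" || return 1

    # Authorize the new key so the user can hop between cluster nodes.
    cat "$key_file.pub" >> "$ssh_dir/authorized_keys" || return 1
    chmod 600 "$ssh_dir/authorized_keys"
}

# In /etc/profile, the function would then be called with no arguments:
# generate_ssh_key
```

The early return makes the snippet idempotent, so running it on every login only costs a file existence check once the key is in place.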
Tests
I have performed the ParallelCluster initialization and successfully logged in with an AD user, whose SSH key material was properly generated upon login.
References
Checklist
This is my first interaction with this repository, and my first few days with ParallelCluster as a whole. Testing has been a bit of a journey between building and uploading a new pcluster, node tool, image, and cookbooks. I kept bumping into the version checks in various places for quite a while, as the versions feeding into the checks appear to come from different places: the AMI-baked "bootstrap" version, the userdata generated by pcluster, b1 not being tolerated by Berks, ... But at last, the solution works for my cluster. I actually wrote the change initially purely in Ansible, but I was hitting another provisioning issue that broke CloudFormation's initial rollout (see Other issue below), so I decided to start editing the cookbook anyway.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Other issue
With ParallelCluster v3.2, root_squash was also broken during provisioning due to something else (see case / see logs below), but that issue seems to have already been resolved, although I am not 100% sure why from just glancing over the code.