Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

Q-Hilbtec
Copy link

@Q-Hilbtec Q-Hilbtec commented Sep 28, 2022

Description of changes

When FSx Lustre is configured with the new root_squash feature, and ParallelCluster is configured with Active Directory with home folders within the FSx mount, pam_exec.so is unable to properly run the SSH key generation script. This is because pam_exec.so runs the script as root, but root does not have access to any home folders to manipulate the files due to the fact that root is regarded as nobody/nogroup within the root_squash'd FSx mount point.

Using su in the generation script to impersonate the user does not work around the problem, as su itself would trigger pam_exec.so, and trigger a loop, which doesn't look trivial to avoid to me.

Instead, I suggest moving the key generation to /etc/profile, which is executed by default for every interactive shells, by the connecting user, and serves the purpose.

Tests

I have performed the Parallel Cluster initialization & successfully logged in with an AD user, with its SSH key material properly generated upon login.

References

  • Case ID 10861640911

Checklist

  • Make sure you are pointing to the right branch and add a label in the PR title (i.e. 2.x vs 3.x)
  • Check all commits' messages are clear, describing what and why vs how.
  • Check if documentation is impacted by this change.

It is my first interaction with this repository, and my first few days with ParallelCluster as a whole - testing has been a bit of a journey between building/uploading a new pcluster, node tool, image, and cookbooks - been bumping into the version checks in various places for quite a while as the versions feeding into the checks appear to be coming from different places, between the AMI baked "bootstrap" version, the userdata generated by pcluster, b1 not being tolerated by Berks, ... But at last, the solution works for my cluster - and actually initially wrote the change purely in Ansible, but I was bumping another provisioning issue breaking Cloudformation's initial rollout (see Other issue below) so decided to start editing the cookbook anyways.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Other issue

With Parallel Cluster v3.2, root_squash was also broken during provisioning due to something else (see case / see logs below), but that issue seems to have been already resolved, although I am not 100% sure why by just glancing over the code.

    - mount fs-XXXX.fsx.us-east-1.amazonaws.com@tcp:/jwpx5bev to /data
  * mount[/data] action enable[2022-09-26T20:46:46+00:00] INFO: Processing mount[/data] 
action enable (aws-parallelcluster-config::fsx_mount line 57)
[2022-09-26T20:46:46+00:00] INFO: mount[/data] enabled

    - enable fs-XXXX.fsx.us-east-1.amazonaws.com@tcp:/jwpx5bev
  * directory[/data] action create[2022-09-26T20:46:46+00:00] INFO: Processing 
directory[/data] action create (aws-parallelcluster-config::fsx_mount line 94)

    ================================================================================
    Error executing action `create` on resource 'directory[/data]'
    ================================================================================

    Errno::EPERM
    ------------
    Operation not permitted @ apply2files - /data

    Resource Declaration:
    ---------------------
    # In 
/etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-config/recipes/fsx_mount.rb

     94:   directory fsx_shared_dir do
     95:     owner 'root'
     96:     group 'root'
     97:     mode '1777'
     98:   end
     99: end

    Compiled Resource:
    ------------------
    # Declared in 
/etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-config/recipes/fsx_mount.rb:94:in 
`block in from_file'

    directory("/data") do
      action [:create]
      default_guard_interpreter :default
      declared_type :directory
      cookbook_name "aws-parallelcluster-config"
      recipe_name "fsx_mount"
      mode "1777"
      owner "root"
      group "root"
      path "/data"
    end

@Q-Hilbtec Q-Hilbtec requested review from a team as code owners September 28, 2022 04:11
When FSx Lustre is configured with the new root_squash feature,
and ParallelCluster is configured with Active Directory with
home folders within the FSx mount, pam_exec.so is unable to
properly run the SSH key generation script. This is because
pam_exec.so runs the script as root, but root does not have
access to any home folders to manipulate the files due to the
fact that root is regarded as nobody/nogroup within the
root_squash'd FSx mount point.

Using su in the generation script to impersonate the user does
not work around the problem, as su itself would trigger
pam_exec.so, and trigger a loop, which doesn't look trivial to
avoid to me.

Instead, I suggest moving the key generation to /etc/profile,
which is executed by default for every interactive shells, by
the connecting user, and serves the purpose.
@demartinofra
Copy link
Contributor

demartinofra commented Oct 3, 2022

Hi @QuentinM-Hilbtec thanks for the contribution. You brought up a valid point and proposed an interesting solution. My concern is that by moving the logic to the profile script you only trigger it when opening a login shell. What if we keep the trigger at the pam level and perform a switch user before creating the necessary directories and files? Would it work?

@Q-Hilbtec
Copy link
Author

Dear @demartinofra,

Thanks for your review.

My concern is that by moving the logic to the profile script you only trigger it when opening a login shell.

Your point makes sense, I expected it might be a requirement - we could always move it to /etc/bashrc (for Red Hat) and /etc/bash.bashrc (for Ubuntu), and it would get executed for both interactive and non-interactive shells. Happy to try that out if that would be an acceptable answer.

What if we keep the trigger at the pam level and perform a switch user before creating the necessary directories and files? Would it work?

I imagined doing that initially, and tried implementing this solution for about an hour before giving up. Switching from root to PAM_USER within the SSH key generation script with su ends up triggering /etc/pam.d/su's pam_exec.so and creating a recursive loop. I've tried adding some conditions within the script based on the env vars provided by pam to prevent the recursion, and while it worked to some extent, so still happened to hit recursions in some situations. Those recursions run uncontrolled and just completely blow up the server. So it did not feel safe to pursue this idea, which would add a slew of complexity and maintenance anyways. Someone more experienced than me (not hard) could figure out an elegant way, although Google didn't look very promising in that direction either.,

@demartinofra
Copy link
Contributor

demartinofra commented Oct 11, 2022

Thanks @QuentinM-Hilbtec for the additional info. We are looking into what it takes to support this scenario without breaking the use cases that rely on key being generated as part of the pam routine.

To get you unblocked in the short term my suggestion (if you haven't done it already) is to use an OnNodeConfigured custom script to disable the pam action and move it into the shell profile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants