Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Q-Hilbtec · 2022-09-28T04:11:04Z

Description of changes

When FSx Lustre is configured with the new root_squash feature, and ParallelCluster is configured with Active Directory with home folders within the FSx mount, pam_exec.so is unable to properly run the SSH key generation script. This is because pam_exec.so runs the script as root, but root does not have access to any home folders to manipulate the files due to the fact that root is regarded as nobody/nogroup within the root_squash'd FSx mount point.

Using su in the generation script to impersonate the user does not work around the problem, as su itself would trigger pam_exec.so, and trigger a loop, which doesn't look trivial to avoid to me.

Instead, I suggest moving the key generation to /etc/profile, which is executed by default for every interactive shells, by the connecting user, and serves the purpose.

Tests

I have performed the Parallel Cluster initialization & successfully logged in with an AD user, with its SSH key material properly generated upon login.

References

Case ID 10861640911

Checklist

Make sure you are pointing to the right branch and add a label in the PR title (i.e. 2.x vs 3.x)
Check all commits' messages are clear, describing what and why vs how.
Check if documentation is impacted by this change.

It is my first interaction with this repository, and my first few days with ParallelCluster as a whole - testing has been a bit of a journey between building/uploading a new pcluster, node tool, image, and cookbooks - been bumping into the version checks in various places for quite a while as the versions feeding into the checks appear to be coming from different places, between the AMI baked "bootstrap" version, the userdata generated by pcluster, b1 not being tolerated by Berks, ... But at last, the solution works for my cluster - and actually initially wrote the change purely in Ansible, but I was bumping another provisioning issue breaking Cloudformation's initial rollout (see Other issue below) so decided to start editing the cookbook anyways.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Other issue

With Parallel Cluster v3.2, root_squash was also broken during provisioning due to something else (see case / see logs below), but that issue seems to have been already resolved, although I am not 100% sure why by just glancing over the code.

    - mount fs-XXXX.fsx.us-east-1.amazonaws.com@tcp:/jwpx5bev to /data
  * mount[/data] action enable[2022-09-26T20:46:46+00:00] INFO: Processing mount[/data] 
action enable (aws-parallelcluster-config::fsx_mount line 57)
[2022-09-26T20:46:46+00:00] INFO: mount[/data] enabled

    - enable fs-XXXX.fsx.us-east-1.amazonaws.com@tcp:/jwpx5bev
  * directory[/data] action create[2022-09-26T20:46:46+00:00] INFO: Processing 
directory[/data] action create (aws-parallelcluster-config::fsx_mount line 94)

    ================================================================================
    Error executing action `create` on resource 'directory[/data]'
    ================================================================================

    Errno::EPERM
    ------------
    Operation not permitted @ apply2files - /data

    Resource Declaration:
    ---------------------
    # In 
/etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-config/recipes/fsx_mount.rb

     94:   directory fsx_shared_dir do
     95:     owner 'root'
     96:     group 'root'
     97:     mode '1777'
     98:   end
     99: end

    Compiled Resource:
    ------------------
    # Declared in 
/etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster-config/recipes/fsx_mount.rb:94:in 
`block in from_file'

    directory("/data") do
      action [:create]
      default_guard_interpreter :default
      declared_type :directory
      cookbook_name "aws-parallelcluster-config"
      recipe_name "fsx_mount"
      mode "1777"
      owner "root"
      group "root"
      path "/data"
    end

When FSx Lustre is configured with the new root_squash feature, and ParallelCluster is configured with Active Directory with home folders within the FSx mount, pam_exec.so is unable to properly run the SSH key generation script. This is because pam_exec.so runs the script as root, but root does not have access to any home folders to manipulate the files due to the fact that root is regarded as nobody/nogroup within the root_squash'd FSx mount point. Using su in the generation script to impersonate the user does not work around the problem, as su itself would trigger pam_exec.so, and trigger a loop, which doesn't look trivial to avoid to me. Instead, I suggest moving the key generation to /etc/profile, which is executed by default for every interactive shells, by the connecting user, and serves the purpose.

demartinofra · 2022-10-03T16:38:41Z

Hi @QuentinM-Hilbtec thanks for the contribution. You brought up a valid point and proposed an interesting solution. My concern is that by moving the logic to the profile script you only trigger it when opening a login shell. What if we keep the trigger at the pam level and perform a switch user before creating the necessary directories and files? Would it work?

Q-Hilbtec · 2022-10-04T16:54:06Z

Dear @demartinofra,

Thanks for your review.

My concern is that by moving the logic to the profile script you only trigger it when opening a login shell.

Your point makes sense, I expected it might be a requirement - we could always move it to /etc/bashrc (for Red Hat) and /etc/bash.bashrc (for Ubuntu), and it would get executed for both interactive and non-interactive shells. Happy to try that out if that would be an acceptable answer.

What if we keep the trigger at the pam level and perform a switch user before creating the necessary directories and files? Would it work?

I imagined doing that initially, and tried implementing this solution for about an hour before giving up. Switching from root to PAM_USER within the SSH key generation script with su ends up triggering /etc/pam.d/su's pam_exec.so and creating a recursive loop. I've tried adding some conditions within the script based on the env vars provided by pam to prevent the recursion, and while it worked to some extent, so still happened to hit recursions in some situations. Those recursions run uncontrolled and just completely blow up the server. So it did not feel safe to pursue this idea, which would add a slew of complexity and maintenance anyways. Someone more experienced than me (not hard) could figure out an elegant way, although Google didn't look very promising in that direction either.,

demartinofra · 2022-10-11T17:16:54Z

Thanks @QuentinM-Hilbtec for the additional info. We are looking into what it takes to support this scenario without breaking the use cases that rely on key being generated as part of the pam routine.

To get you unblocked in the short term my suggestion (if you haven't done it already) is to use an OnNodeConfigured custom script to disable the pam action and move it into the shell profile.

Q-Hilbtec requested review from a team as code owners September 28, 2022 04:11

Q-Hilbtec force-pushed the ad_sshgen_nopam branch from 554f8df to 20cc762 Compare September 28, 2022 18:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Q-Hilbtec commented Sep 28, 2022 •

edited

Loading

demartinofra commented Oct 3, 2022 •

edited

Loading

Q-Hilbtec commented Oct 4, 2022

demartinofra commented Oct 11, 2022 •

edited

Loading

Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Are you sure you want to change the base?

Move SSH key generation script from pam.d to /etc/profile (3.x) #1545

Conversation

Q-Hilbtec commented Sep 28, 2022 • edited Loading

Description of changes

Tests

References

Checklist

Other issue

demartinofra commented Oct 3, 2022 • edited Loading

Q-Hilbtec commented Oct 4, 2022

demartinofra commented Oct 11, 2022 • edited Loading

Q-Hilbtec commented Sep 28, 2022 •

edited

Loading

demartinofra commented Oct 3, 2022 •

edited

Loading

demartinofra commented Oct 11, 2022 •

edited

Loading