Leaked paramiko ssh connection threads prevent rho from exiting #15

Open

rrauenza opened this issue Dec 17, 2015 · 16 comments

@rrauenza

rho seems to get stuck scanning some iLOs that are on our network. If I watch lsof I can see the connections, and then they eventually disappear. I get these messages as they do:

2015-12-17 10:43:20,575 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:21,670 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:22,949 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:43:24,369 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:44:24,480 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)
2015-12-17 10:44:25,398 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)

And then nothing else happens. rho hangs indefinitely after that -- it doesn't seem to recognize that the ssh jobs have expired.

strace doesn't help -- it just shows all threads hanging on futex().

I think this is happening while rho is trying to log in with different auths.

@rrauenza
Author

What's odd is that I think it gave up authenticating long before those errors:

2015-12-17 10:35:14,544 rho WARNING - Found ssh on 10.139.28.10:22, but no auths worked. (ssh_jobs/244)

@jmrodri
Contributor

jmrodri commented Dec 17, 2015

Thanks for reporting the issue. So do you expect rho to be able to connect to these iLOs? I want to make sure I can somehow recreate a scenario where rho tries but can't log in and hangs.

  1. What OS are you running rho from?
  2. What version of paramiko are you using?
  3. What version of rho are you using?
  4. Can you run rho dumpconfig? Don't paste it here, as sensitive information like passwords will be printed. I'm more interested in how many auths and profiles it needed to scan. If you could sanitize it, that would be helpful.

@jmrodri jmrodri self-assigned this Dec 17, 2015
@rrauenza
Author

As a baseline, I'm running CentOS 7. I've pulled the latest rho from GitHub.

At the moment I'm stepping through a remote debugger with the network isolated to the single ilo. I've put a break point here:

     def run(self):
         while not self.quitting:
             try:
+                import rpdb; rpdb.set_trace()
                 # grab a "ssh_job" off the q
                 ssh_job = self.ssh_queue.get()
                 self.get_transport(ssh_job)

Right now it is looping through about 12 auths... (which will all likely fail, and that is fine/expected).

$ rpm -qi python-paramiko
Name : python-paramiko
Version : 1.15.1
Release : 1.el7

The config is basically a single ip at the moment (it was a /24) and there are about 15 auths to try, including an ssh key one as the final attempt. The others are just root/password.

@rrauenza
Author

I lost connection with the debugger. I'm running it alone (no debugger) as a single ip to see what happens.

@rrauenza
Author

While running, each connection uses two sockets / fds. I assume this is expected.

Every 2.0s: sudo lsof -p `pgrep rho` | grep TCP                                                                                                                          
    rho     9496 rich    3u  IPv4 471098       0t0     TCP dhcp113:48175->xxxxx-ilo:ssh (ESTABLISHED)
    rho     9496 rich    4u  IPv4 469734       0t0     TCP dhcp113:48172->xxxxx-ilo:ssh (ESTABLISHED)

Looks like a new pair is created for each auth request.
It's taking a long time for each auth request, btw. I think the ILO is just slow to respond.

Ah, and it just finished.

2015-12-17 11:44:38,043 paramiko.transport DEBUG - starting thread (client mode): 0xe4b990L (transport/1428)
2015-12-17 11:44:38,185 paramiko.transport INFO - Connected (version 2.0, client RomSShell_4.62) (transport/1428)
2015-12-17 11:44:38,186 paramiko.transport DEBUG - kex algos:[u'diffie-hellman-group14-sha1', u'diffie-hellman-group1-sha1'] server key:[u'ssh-dss'] client encrypt:[u'aes256-cbc', u'aes192-cbc', u'aes128-cbc', u'3des-cbc'] server encrypt:[u'aes256-cbc', u'aes192-cbc', u'aes128-cbc', u'3des-cbc'] client mac:[u'hmac-sha1'] server mac:[u'hmac-sha1'] client compress:[u'none'] server compress:[u'none'] client lang:[u''] server lang:[u''] kex follows?False (transport/1428)
2015-12-17 11:44:38,186 paramiko.transport DEBUG - Ciphers agreed: local=aes128-cbc, remote=aes128-cbc (transport/1428)
2015-12-17 11:44:38,186 paramiko.transport DEBUG - using kex diffie-hellman-group14-sha1; server key type ssh-dss; cipher: local aes128-cbc, remote aes128-cbc; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none (transport/1428)
2015-12-17 11:44:47,578 paramiko.transport DEBUG - Switch to new keys ... (transport/1428)
2015-12-17 11:44:47,588 paramiko.transport DEBUG - Adding ssh-dss host key for 10.139.28.10: 16f7a2dad98817eb046b7c482172d4d6 (transport/1428)
2015-12-17 11:44:47,624 paramiko.transport DEBUG - userauth is OK (transport/1428)
2015-12-17 11:45:36,877 paramiko.transport INFO - Authentication (password) failed. (transport/1428)
2015-12-17 11:45:36,925 paramiko.transport DEBUG - EOF in transport thread (transport/1428)
2015-12-17 11:45:36,944 rho ERROR - login failed (ssh_jobs/295)
2015-12-17 11:45:36,944 rho WARNING - Found ssh on 10.139.28.10:22, but no auths worked. (ssh_jobs/244)
2015-12-17 11:45:36,978 paramiko.transport DEBUG - EOF in transport thread (transport/1428)

and exited "successfully" -- meaning, no hang. So the hang only happens when we have more connections.

I'm going to try this http://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application and run the full subnet again, to see if I can get a stack trace when it hangs.
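
For reference, the approach in that answer boils down to installing a signal handler that walks sys._current_frames(). A minimal Python 2 sketch of that idea (my own code, not rho's; it assumes nothing else in the process claims SIGUSR1):

# Dump every thread's stack to stderr when the process receives SIGUSR1.
import signal
import sys
import threading
import traceback

def dump_stacks(signum, frame):
    id_to_name = dict((t.ident, t.name) for t in threading.enumerate())
    for thread_id, stack in sys._current_frames().items():
        print >> sys.stderr, "\n# Thread: %s (%s)" % (
            id_to_name.get(thread_id, "?"), thread_id)
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_stacks)

Then "kill -USR1 <pid>" from another shell should print the traces.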

@rrauenza
Author

I end up entering a state where I get this output:

2015-12-17 12:23:42,335 rho WARNING - Found ssh on 10.139.28.7:22, but no auths worked. (ssh_jobs/244)

But I still have one socket connected per lsof -- same ip:

rho     13507 rich    5u  IPv4 585630       0t0     TCP dhcp113:51586->10.139.28.7:ssh (ESTABLISHED)

...which eventually disappears after a

2015-12-17 12:29:10,422 paramiko.transport INFO - Disconnect (code 2): Protocol Timeout (transport/1428)

But then rho doesn't exit.

I can't break in on SIGUSR1 to get the stack trace. If I use pstack on the pid, all threads are blocked on sem_wait().

@rrauenza
Author

I've added this to get periodic aggregated stack traces:

import collections
import sys
import threading
import traceback

def stacktraces():
    print >> sys.stderr, "\n*** STACKTRACE - START ***\n"
    traces = collections.defaultdict(list)
    for threadId, stack in sys._current_frames().items():
        code = []
        # code.append("\n# ThreadID: %s" % threadId)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
        traces[tuple(code)].append("%s" % threadId)

    for k, v in traces.iteritems():
        print >> sys.stderr, "\n# ThreadIDs: %s" % v
        for line in k:
            print >> sys.stderr, line
    print >> sys.stderr, "\n*** STACKTRACE - END ***\n"


def stacktrace_loop():
    import time
    while True:
        time.sleep(30)
        stacktraces()


t = threading.Thread(target=stacktrace_loop)
t.setDaemon(True)
t.start()

And upped self.max_threads to 256 to shorten the run time.

@rrauenza
Author

Ok, as it finishes, I do have threads blocked on sockets:

# ThreadIDs: ['139730493523712', '139730644592384', '139730527094528', '139730451560192']
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib/python2.7/site-packages/paramiko/transport.py", line 1590, in run
  ptype, m = self.packetizer.read_message()
File: "/usr/lib/python2.7/site-packages/paramiko/packet.py", line 341, in read_message
  header = self.read_all(self.__block_size_in, check_rekey=True)
File: "/usr/lib/python2.7/site-packages/paramiko/packet.py", line 204, in read_all
  x = self.__socket.recv(n)

And then when I'm in the "finished but hung" state, I get the following. We're blocked on Queues... is a task_done() call missing somewhere, so self.all_tasks_done.wait() never wakes up?

*** STACKTRACE - START ***


# ThreadIDs: ['139731685287680', '139730795661056', '139731425113856', ... (long list of worker thread IDs, truncated in the original paste)]
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 176, in run
  ssh_job = self.out_queue.get()
File: "/usr/lib64/python2.7/Queue.py", line 168, in get
  self.not_empty.wait()
File: "/usr/lib64/python2.7/threading.py", line 339, in wait
  waiter.acquire()

# ThreadIDs: ['139733503039232']
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 764, in run
  self.__target(*self.__args, **self.__kwargs)
File: "rho.git/bin/rho", line 63, in stacktrace_loop
  stacktraces()
File: "rho.git/bin/rho", line 45, in stacktraces
  for filename, lineno, name, line in traceback.extract_stack(stack):

# ThreadIDs: ['139733494646528']
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 204, in run
  prog_buf = self.prog_queue.get()
File: "/usr/lib64/python2.7/Queue.py", line 168, in get
  self.not_empty.wait()
File: "/usr/lib64/python2.7/threading.py", line 339, in wait
  waiter.acquire()

# ThreadIDs: ['139733737269056']
File: "rho.git/bin/rho", line 71, in <module>
  CLI().main()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/cli.py", line 97, in main
  cmd.main()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/clicommands.py", line 327, in main
  self._do_command()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/clicommands.py", line 568, in _do_command
  missing = self.scanner.scan_profiles(self.options.profiles)
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/scanner.py", line 97, in scan_profiles
  self._run_scan()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/scanner.py", line 110, in _run_scan
  self.ssh_jobs.run_jobs(callback=self._callback)
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 415, in run_jobs
  self.ssh_queue.join()
File: "/usr/lib64/python2.7/Queue.py", line 82, in join
  self.all_tasks_done.wait()
File: "/usr/lib64/python2.7/threading.py", line 339, in wait
  waiter.acquire()

*** STACKTRACE - END ***

@rrauenza
Author

Ah, Queue24 is written as a compatibility class... but I'm using Python 2.7.x. I'm going to try running this with Queue24 as just a direct derivation of Queue, with no overrides.

If that doesn't fix it, then I wonder if an exception is causing a missing task_done() call on one of the queues.

edit: Changing the Queue class didn't seem to fix it, but I've left it as a direct derivation for now.
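
For anyone following the task_done() theory: it doesn't take much for Queue.join() to hang forever. A standalone sketch (not rho code) of that failure mode:

# If a worker skips task_done() for even one item, Queue.join() blocks
# forever, because the queue's unfinished-task count never reaches zero.
import Queue
import threading

q = Queue.Queue()

def worker():
    while True:
        item = q.get()
        try:
            if item == "bad":
                raise RuntimeError("boom")
        except Exception:
            continue                 # bug: task_done() skipped on this path
        q.task_done()

t = threading.Thread(target=worker)
t.setDaemon(True)
t.start()

q.put("good")
q.put("bad")
q.join()                             # never returns: one task never marked done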

@rrauenza
Author

I've found an exception I think isn't caught. Do you realize PkgInfoParseException is derived from BaseException?

exception BaseException
The base class for all built-in exceptions. It is not meant to be directly inherited
by user-defined classes (for that, use Exception). If str() or unicode() is called
on an instance of this class, the representation of the argument(s) to the instance
are returned, or the empty string when there were no arguments.

This causes the exception to propagate all the way out of the thread, since you only catch Exception:

Exception in thread rho_ssh_thread-131:                                         
Traceback (most recent call last):                                              
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner      
    self.run()                                                                  
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 357, in run
    self.get_transport(ssh_job)                                                 
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 342, in get_transport
    self.run_cmds(ssh_job)                                                      
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 328, in run_cmds
    rho_cmd.populate_data(output)                                               
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/rho_cmds.py", line 62, in populate_data
    self.parse_data()                                                           
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/rho_cmds.py", line 147, in parse_data
    installed_packages = [PkgInfo(line, "|") for line in self.cmd_results[0][0].splitlines()]
  File "/mts/home5/rich/redhat-scan/rho.git/src/rho/rho_cmds.py", line 548, in __init__
    raise PkgInfoParseException()                                               
PkgInfoParseException            
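
To spell out why that traceback escapes the worker: "except Exception" does not catch classes that derive only from BaseException. A tiny standalone demo (the class name here is a stand-in, not rho's actual class):

# An exception derived from BaseException slips straight past "except Exception".
class ParseExceptionDemo(BaseException):
    pass

try:
    raise ParseExceptionDemo("unparseable package line")
except Exception as e:
    print "caught:", e               # never reached
# The exception propagates out of the handler and, inside a worker thread,
# kills that thread instead of being logged.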

@rrauenza
Author

Changing the parent class of PkgInfoParseException from BaseException to Exception seems to have cleared it up.

edit: nope. More info coming.

@rrauenza rrauenza changed the title rho hangs PkgInfoParseException derived from BaseException causes rho to hang when PkgInfoParseException occurs. Dec 17, 2015
@rrauenza
Author

There still seems to be a problem with the ssh threads hanging on recv(). I think they aren't getting shut down after the exception.

I've tried adding a close:

     def run(self):
         while not self.quitting:
             try:
                 # grab a "ssh_job" off the q
                 ssh_job = self.ssh_queue.get()
                 self.get_transport(ssh_job)
                 self.out_queue.put(ssh_job)
                 self.ssh_queue.task_done()
             except Exception as e:
                 log.error("Exception: %s" % e)
                 log.error(traceback.print_tb(sys.exc_info()[2]))
+                if self.ssh:
+                    self.ssh.close()
                 self.ssh_queue.task_done()

and


     def get_transport(self, ssh_job):
         if ssh_job.ip is "":
             return None

         try:
             self.connect(ssh_job)
             if not self.ssh:
                 return

             # there was a connection/auth failure
             if ssh_job.error:
                 return
             self.run_cmds(ssh_job)
             self.ssh.close()

         except Exception as e:
             log.error("Exception on %s: %s" % (ssh_job.ip, e))
             log.error(sys.exc_info())
             log.error(traceback.print_tb(sys.exc_info()[2]))
             ssh_job.connection_result = False
             ssh_job.command_output = e
+            if self.ssh:
+                self.ssh.close()

There might be a smarter way to deal with it, like always closing at the end of run(), or in a "finally:" clause after the exception handling, maybe with the close itself also wrapped in a try block.

I think what is happening is that SshThread isn't recycling self.ssh -- so it never gets GC'ed on the last iteration if an error happens.

This causes the underlying paramiko thread to never finish, which prevents the script from exiting.

Another alternative might be to make the paramiko threads daemon threads... but I think it would be better to ensure self.ssh is always closed at the end of every iteration.

Those additional changes above, including the s/BaseException/Exception/ change, seem to have improved reliability further.
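
For illustration, here is roughly what the "always close at the end of every iteration" structure could look like, as a sketch only (it reuses the names from the snippets in this thread and is not an actual patch):

def run(self):
    while not self.quitting:
        ssh_job = self.ssh_queue.get()
        try:
            self.get_transport(ssh_job)
            self.out_queue.put(ssh_job)
        except Exception as e:
            log.error("Exception: %s" % e)
            log.error(traceback.print_tb(sys.exc_info()[2]))
        finally:
            try:
                if self.ssh:
                    self.ssh.close()      # always tear down the paramiko transport thread
            except Exception as e:
                log.error("Error closing ssh connection: %s" % e)
            self.ssh = None               # don't carry a dead client into the next iteration
            self.ssh_queue.task_done()    # exactly once per get()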

@rrauenza rrauenza changed the title PkgInfoParseException derived from BaseException causes rho to hang when PkgInfoParseException occurs. Exceptions in SshThread() leave zombie Paramiko ssh connection threads, causing rho to not exit when done Dec 18, 2015
@rrauenza rrauenza changed the title Exceptions in SshThread() leave zombie Paramiko ssh connection threads, causing rho to not exit when done Zombie Paramiko ssh connection threads prevent rho from exiting Dec 18, 2015
@rrauenza
Author

A couple more places that might need the connection closed:


                 # Implies we've found an SSH server listening:
                 except paramiko.AuthenticationException as e:
                     # Because we stop checking ports once we find one where ssh
                     # is listening, we can report the error message here and it
                     # will end up in the final report correctly:
                     err = _("login failed")
                     log.error(err)
                     ssh_job.error = err
                     found_port = port
+                    if self.ssh:
+                        self.ssh.close()
                     continue

                 # No route to host:
                 except socket.error as e:
                     log.warn("No route to host, skipping port: %s" % debug_str)
                     ssh_job.error = str(e)
                     break

                 # TODO: Hitting a live port that isn't ssh will result in
                 # paramiko.SSHException, do we need to handle this explicitly?

                 # Something else happened:
                 except Exception as detail:
                     log.warn("Connection error: %s - %s" % (debug_str,
                                                             str(detail)))
                     ssh_job.error = str(detail)
+                    if self.ssh:
+                        self.ssh.close()
                     continue

@rrauenza
Author

Still getting hung/orphaned threads stuck on recv(). run() might also need to clean up at the end:

     def run(self):
         while not self.quitting:
             try:
+                try:
+                    if self.ssh:
+                        self.ssh.close()
+                except Exception as e:
+                    log.error("Exceptions: %s" % e)
+                    log.error(sys.exc_info())
+                    log.error(traceback.print_tb(sys.exc_info()[2]))
                 # grab a "ssh_job" off the q
                 ssh_job = self.ssh_queue.get()
                 self.get_transport(ssh_job)
                 self.out_queue.put(ssh_job)
                 self.ssh_queue.task_done()
             except Exception as e:
                 log.error("Exception: %s" % e)
                 log.error(traceback.print_tb(sys.exc_info()[2]))
+                if self.ssh:
+                    self.ssh.close()
                 self.ssh_queue.task_done()
+        if self.ssh:  # Cleanup before exiting.
+            self.ssh.close()

@rrauenza
Author

The tool still isn't completing because some systems are still stuck in auth -- after 12 hours:

# ThreadIDs: ['139837625333504', '139838028183296', '139838811657984', '139836517496576', '139837642118912', '139837482657536', '139837516228352']
File: "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
  self.run()
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 374, in run
  self.get_transport(ssh_job)
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 343, in get_transport
  self.connect(ssh_job)
File: "/mts/home5/rich/redhat-scan/rho.git/src/rho/ssh_jobs.py", line 288, in connect
  timeout=ssh_job.timeout)
File: "/usr/lib/python2.7/site-packages/paramiko/client.py", line 307, in connect
  look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
File: "/usr/lib/python2.7/site-packages/paramiko/client.py", line 510, in _auth
  self._transport.auth_password(username, password)
File: "/usr/lib/python2.7/site-packages/paramiko/transport.py", line 1166, in auth_password
  return self.auth_handler.wait_for_response(my_event)
File: "/usr/lib/python2.7/site-packages/paramiko/auth_handler.py", line 192, in wait_for_response
  event.wait(0.1)
File: "/usr/lib64/python2.7/threading.py", line 621, in wait
  self.__cond.wait(timeout, balancing)
File: "/usr/lib64/python2.7/threading.py", line 361, in wait
  _sleep(delay)

@rrauenza rrauenza changed the title Zombie Paramiko ssh connection threads prevent rho from exiting Leaked paramiko ssh connection threads prevent rho from exiting Dec 18, 2015
@rrauenza
Author

As I said, I found that some connections hung overnight on authentication. So as a workaround, I'm running rho against a single IP at a time (hosts found via nmap), with rho wrapped in the Linux 'timeout' tool. I wrote a bash script to parallelize the runs.
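
Roughly what the wrapper does, sketched in Python rather than bash (the exact rho arguments and the live_hosts.txt file are placeholders from my setup, not necessarily rho's real CLI):

# Run rho against one IP at a time and kill it if it exceeds a hard deadline.
import subprocess
import threading

def run_with_timeout(cmd, timeout_secs):
    proc = subprocess.Popen(cmd)
    timer = threading.Timer(timeout_secs, proc.kill)
    timer.start()
    try:
        return proc.wait()
    finally:
        timer.cancel()

for ip in open("live_hosts.txt"):        # host list produced by nmap
    run_with_timeout(["rho", "scan", ip.strip()], 600)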

This tool really needs some testing on more diverse/realistic networks/environments.
