[c-s][3.x] driver stuck on the Caused by: java.net.NoRouteToHostException: No route to host error #285

Open
2 tasks
vponomaryov opened this issue Apr 22, 2024 · 4 comments
vponomaryov commented Apr 22, 2024

Packages

Scylla version: 5.5.0~dev-20240329.885cb2af07b8 with build-id 4d1fc3fe8868b3d00a42f2b0b51f9953e9fa7346
Kernel Version: 5.15.0-1056-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

While running a 24h CI job, one of the stress commands hung with the following logs:

READ,     2301350655,    2451,    2451,    2451,     1.7,     1.6,     3.0,     4.1,     8.6,    10.2,86405.0,  0.00115,      0,      0,       0,       0,       0,       0
WRITE,    2301334980,    2410,    2410,    2410,     1.6,     1.4,     2.8,     4.1,     8.8,    11.0,86405.0,  0.00115,      0,      0,       0,       0,       0,       0
total,    4602685635,    4861,    4861,    4861,     1.7,     1.5,     2.9,     4.1,     8.8,    11.0,86405.0,  0.00115,      0,      0,       0,       0,       0,       0
...
WARN  04:53:18,482 Error creating netty channel to /10.4.10.17:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.4.10.17:9042
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
...
WARN  05:13:43,800 Error creating netty channel to /10.4.9.59:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.9.59:9042
Caused by: java.net.NoRouteToHostException: No route to host
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
...
WARN  06:41:36,297 Error creating netty channel to /10.4.10.17:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.4.10.17:9042
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
...
WARN  06:50:40,248 Error creating netty channel to /10.4.9.92:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.4.9.92:9042
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
...
WARN  07:06:41,336 Error creating netty channel to /10.4.10.28:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.10.28:9042
Caused by: java.net.NoRouteToHostException: No route to host
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
WARN  07:06:44,408 Error creating netty channel to /10.4.10.224:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.10.224:9042
Caused by: java.net.NoRouteToHostException: No route to host
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
WARN  07:07:08,664 Error creating netty channel to /10.4.9.59:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.9.59:9042
Caused by: java.net.NoRouteToHostException: No route to host
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
WARN  07:08:13,472 Error creating netty channel to /10.4.11.208:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.4.11.208:9042
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at com.datastax.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

All other stress commands that were running in parallel finished their work on time.

Eventually, the hung stress command was killed by the SCT timeout.
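
To see at a glance which nodes produced which kind of failure, the WARN blocks in the log above can be tallied per address and root-cause class. The following is a minimal sketch in Python, not part of SCT or the driver; the function name `tally_connection_errors` and the regular expressions are assumptions based only on the log format shown in this issue.

```python
import re
from collections import Counter

# Matches the WARN line that names the target address, e.g.
# "WARN  04:53:18,482 Error creating netty channel to /10.4.10.17:9042"
WARN_RE = re.compile(
    r"^WARN\s+(\d{2}:\d{2}:\d{2}),\d{3} Error creating netty channel to /([\d.]+):(\d+)"
)
# Matches the root-cause line, e.g.
# "Caused by: java.net.ConnectException: Connection refused"
CAUSE_RE = re.compile(r"^Caused by: ([\w.$]+): (.*)$")


def tally_connection_errors(lines):
    """Count (host, exception class) pairs found in driver WARN blocks."""
    counts = Counter()
    current_host = None
    for line in lines:
        warn = WARN_RE.match(line)
        if warn:
            current_host = warn.group(2)
            continue
        cause = CAUSE_RE.match(line)
        if cause and current_host is not None:
            counts[(current_host, cause.group(1))] += 1
            current_host = None  # wait for the next WARN header
    return counts


# A few blocks copied from the log in this issue (stack frames shortened).
sample = """\
WARN  04:53:18,482 Error creating netty channel to /10.4.10.17:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.4.10.17:9042
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
WARN  05:13:43,800 Error creating netty channel to /10.4.9.59:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.9.59:9042
Caused by: java.net.NoRouteToHostException: No route to host
WARN  07:07:08,664 Error creating netty channel to /10.4.9.59:9042
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.4.9.59:9042
Caused by: java.net.NoRouteToHostException: No route to host
""".splitlines()

counts = tally_connection_errors(sample)
for (host, exc_class), n in sorted(counts.items()):
    print(host, exc_class, n)
```

Running the same function over the full loader log would show whether the two error types cluster on particular nodes or time windows.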

Impact

Stress command never ends.

How frequently does it reproduce?

Observed first time?

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-8c5076b3-9 (34.244.99.104 | 10.4.10.17) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-8 (54.216.183.7 | 10.4.10.224) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-7 (3.253.80.116 | 10.4.10.28) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-6 (54.170.25.116 | 10.4.8.193) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-5 (54.78.171.43 | 10.4.10.34) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-4 (34.244.144.124 | 10.4.9.92) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-3 (54.154.226.224 | 10.4.10.46) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-20 (3.254.97.23 | 10.4.11.68) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-2 (52.214.239.220 | 10.4.8.159) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-19 (54.154.138.172 | 10.4.10.38) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-18 (18.201.36.232 | 10.4.9.59) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-17 (34.245.55.193 | 10.4.10.215) (shards: -1)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-16 (3.250.113.107 | 10.4.9.117) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-15 (3.250.205.70 | 10.4.11.6) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-14 (3.249.160.41 | 10.4.11.73) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-13 (3.255.223.247 | 10.4.11.137) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-12 (34.245.143.1 | 10.4.9.140) (shards: -1)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-11 (54.194.84.26 | 10.4.10.143) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-10 (3.250.99.126 | 10.4.11.208) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-8c5076b3-1 (3.249.100.47 | 10.4.11.119) (shards: 14)
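
The private IPs in the driver warnings can be matched back to the node names in the list above. Below is a small Python sketch of that lookup; the helper name `private_ip_to_node` and the regular expression are assumptions derived from the list format in this issue, and only three of the listed nodes are included as sample input.

```python
import re

# Matches entries such as
# "longevity-...-db-node-8c5076b3-9 (34.244.99.104 | 10.4.10.17) (shards: 14)"
NODE_RE = re.compile(
    r"(\S+-db-node-\S+) \(([\d.]+) \| ([\d.]+)\) \(shards: (-?\d+)\)"
)


def private_ip_to_node(lines):
    """Build a {private IP: node name} mapping from the node list lines."""
    mapping = {}
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            name, _public_ip, private_ip, _shards = match.groups()
            mapping[private_ip] = name
    return mapping


# Three entries copied from the node list in this issue.
nodes = [
    "longevity-tls-50gb-3d-master-db-node-8c5076b3-9 (34.244.99.104 | 10.4.10.17) (shards: 14)",
    "longevity-tls-50gb-3d-master-db-node-8c5076b3-18 (18.201.36.232 | 10.4.9.59) (shards: 14)",
    "longevity-tls-50gb-3d-master-db-node-8c5076b3-7 (3.253.80.116 | 10.4.10.28) (shards: 14)",
]

by_ip = private_ip_to_node(nodes)
for ip in ("10.4.10.17", "10.4.9.59", "10.4.10.28"):
    print(ip, "->", by_ip[ip])
```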

OS / Image: ami-0f3a5d434ac91ab52 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 8c5076b3-319b-461f-9c2c-1262a411f00a
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 8c5076b3-319b-461f-9c2c-1262a411f00a
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 8c5076b3-319b-461f-9c2c-1262a411f00a

Logs:

Jenkins job URL
Argus

mykaul commented Apr 24, 2024

@vponomaryov - I assume you've verified no issues on the client side? There's a difference between connection refused and no route to host. Any nemesis that took place at the time? Which loader was that?
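
For readers less familiar with the distinction raised here: at the socket level the two failures typically correspond to different OS error codes, `ECONNREFUSED` and `EHOSTUNREACH`. The Python sketch below illustrates the usual interpretation of each; the function name and the wording of the descriptions are illustrative assumptions, and firewalls or other network policy can produce either error for different reasons.

```python
import errno


def describe_connect_failure(exc: OSError) -> str:
    """Give the usual interpretation of a failed TCP connect."""
    if exc.errno == errno.ECONNREFUSED:
        # The peer (or something on the path) actively rejected the
        # connection: the host answered, but nothing accepted on that port.
        return "connection refused: host answered, port not accepting"
    if exc.errno == errno.EHOSTUNREACH:
        # The network could not deliver packets to the address at all,
        # for example because the instance is gone or not yet routable.
        return "no route to host: address not reachable"
    return "other connection failure"


refused = OSError(errno.ECONNREFUSED, "Connection refused")
unreachable = OSError(errno.EHOSTUNREACH, "No route to host")

print(describe_connect_failure(refused))
print(describe_connect_failure(unreachable))
```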

@vponomaryov
Author

> @vponomaryov - I assume you've verified no issues on the client side? There's a difference between connection refused and no route to host. Any nemesis that took place at the time? Which loader was that?

The first error message in the chain of errors was at 04:35:55,256.
It overlaps with the disrupt_abort_repair nemesis, which succeeded.
It was loader-2.
Note that the tear-down started at 07:07:56.

What client issues do you mean? C-S and java-driver are the clients of the Scylla cluster.

Anyway, the stress command had a concrete time limit (Duration: 1,440 MINUTES), so it should have ended on time even with connection issues.
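
The expectation stated here, that a duration-limited workload stops at its deadline regardless of connection failures, can be illustrated with a generic sketch. This is not how cassandra-stress or the java-driver is implemented; `run_until_deadline` and `always_refused` are hypothetical names, and the loop only shows the property in question: the deadline is checked on every iteration, so a permanently failing connection cannot extend the run.

```python
import time


def run_until_deadline(operation, duration_s, retry_delay_s=0.01):
    """Call operation repeatedly until duration_s has elapsed.

    Connection errors are counted and retried, but never allowed to
    push the loop past its deadline.
    """
    deadline = time.monotonic() + duration_s
    succeeded = failed = 0
    while time.monotonic() < deadline:
        try:
            operation()
            succeeded += 1
        except OSError:
            failed += 1
            # Back off briefly, but never sleep beyond the deadline.
            remaining = deadline - time.monotonic()
            time.sleep(max(0.0, min(retry_delay_s, remaining)))
    return succeeded, failed


def always_refused():
    # Stand-in for a node that never accepts connections.
    raise ConnectionRefusedError("Connection refused")


started = time.monotonic()
succeeded, failed = run_until_deadline(always_refused, duration_s=0.05)
elapsed = time.monotonic() - started
print(f"succeeded={succeeded} failed={failed} elapsed={elapsed:.3f}s")
```

In the hung run the opposite happened: the command was still alive when the external timeout fired, which suggests that some wait in the client was not bounded by the configured duration.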

The final error is the following:

Command did not complete within 95700 seconds!

The configured 1440 minutes is 86400 seconds, which is 9300 seconds less than the timeout.
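
The arithmetic behind these numbers, written out as a short Python check (variable names are illustrative):

```python
configured_minutes = 1440      # "Duration: 1,440 MINUTES"
timeout_seconds = 95700        # "Command did not complete within 95700 seconds!"

configured_seconds = configured_minutes * 60
margin_seconds = timeout_seconds - configured_seconds

print(configured_seconds)      # 86400
print(margin_seconds)          # 9300
print(margin_seconds / 60)     # 155.0
```

So the command had 155 minutes beyond its configured duration to wind down and still did not exit.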

Bouncheck self-assigned this Jul 26, 2024
@Bouncheck
Collaborator

@vponomaryov Does it reproduce consistently or did it reoccur?

@vponomaryov
Author

> @vponomaryov Does it reproduce consistently or did it reoccur?

It doesn't reproduce consistently.
I don't know whether it has reoccurred somewhere or not.

I suspect it is some rare race condition.
