Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-promoting a peer with an active connection fails. #2572

Closed
karknu opened this issue Sep 2, 2020 · 4 comments
Closed

Re-promoting a peer with an active connection fails. #2572

karknu opened this issue Sep 2, 2020 · 4 comments
Assignees
Labels
bug Something isn't working diffusion Issues / PRs related to diffusion layer

Comments

@karknu
Copy link
Contributor

karknu commented Sep 2, 2020

A connection that was initiated locally and went from Cold -> Warm -> Hot -> Cold (other side still use us so it is Warm) gets stuck in an un-promotable state.
Subsequent attempts to promote the peer happens every 10s and will fail with the error above.

[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:19:34.65 UTC] [String "TracePublicRootsRequest 5 1",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PublicRootPeers:Info:97] [2020-09-02 12:19:34.65 UTC] [String "TracePublicRootDomains []",String "TracePublicRootPeers"]
[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:19:34.65 UTC] [String "TracePublicRootsResults (fromList []) 8 256s",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Warning:56] [2020-09-02 12:20:40.67 UTC] [String "TraceGovernorWakeup",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:20:40.67 UTC] [String "TracePromoteColdPeers 2 0 (fromList [192.168.0.4:47776])",String "TracePeerSelection"]
[Karls-Ma:cardano.node.ConnectionManager:Warning:99] [2020-09-02 12:20:40.67 UTC] [String "ConnectionExists 192.168.0.4:47776 Outbound",String "ConnectionManager"]
[Karls-Ma:cardano.node.PeerSelection:Error:56] [2020-09-02 12:20:40.67 UTC] [String "TracePromoteColdFailed 192.168.0.4:47776 320s (ConnectionExistsError 192.168.0.4:47776 Outbound)",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Warning:56] [2020-09-02 12:23:50.66 UTC] [String "TraceGovernorWakeup",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:23:50.66 UTC] [String "TracePublicRootsRequest 5 1",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PublicRootPeers:Info:101] [2020-09-02 12:23:50.66 UTC] [String "TracePublicRootDomains []",String "TracePublicRootPeers"]
[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:23:50.66 UTC] [String "TracePublicRootsResults (fromList []) 9 512s",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Warning:56] [2020-09-02 12:26:00.67 UTC] [String "TraceGovernorWakeup",String "TracePeerSelection"]
[Karls-Ma:cardano.node.PeerSelection:Info:56] [2020-09-02 12:26:00.67 UTC] [String "TracePromoteColdPeers 2 0 (fromList [192.168.0.4:47776])",String "TracePeerSelection"]
[Karls-Ma:cardano.node.ConnectionManager:Warning:103] [2020-09-02 12:26:00.67 UTC] [String "ConnectionExists 192.168.0.4:47776 Outbound",String "ConnectionManager"]
[Karls-Ma:cardano.node.PeerSelection:Error:56] [2020-09-02 12:26:00.67 UTC] [String "TracePromoteColdFailed 192.168.0.4:47776 640s (ConnectionExistsError 192.168.0.4:47776 Outbound)",String "TracePeerSelection"]
@coot coot added networking bug Something isn't working labels Sep 2, 2020
@coot coot changed the title Re-promoting a peer with a an active connection fails. Re-promoting a peer with an active connection fails. Mar 5, 2022
@coot
Copy link
Contributor

coot commented Nov 22, 2022

Let's try to reproduce it in our e2e simulation test.

@bolt12
Copy link
Contributor

bolt12 commented Mar 7, 2023

I have encountered an instance of this issue, today when looking at the logs of the deployed nodes running PeerSharing. I noticed that after a miniprotocol errors I see the node getting shutdown by a DemoteAsynchronous trace, but I do not see a clean exit message anywhere, before the node attempting to be promoted again which in turn leads to a ConnectionExists error

@bolt12
Copy link
Contributor

bolt12 commented Mar 8, 2023

Regarding this I analysed the logs and extracted the following trace sequence since node start to the first connection exists error:

  1. PromoteColdLocalPeers
  2. ConnectionHandler HandshakeSuccess
  3. PeerStateChanged ColdToWarm
  4. PromoteColdDone
  5. PromoteWarmLocalPeers
  6. PeerStatusChanged WarmToHot
  7. PromoteWarmDone
  8. PromoteToWarmRemote
  9. InboundGovernor ResponderError (miniprotocol exception)
  10. Mux Remote ExceptionExit
  11. PeerStatusChanged HotToCold
  12. DemoteLocalAsynchronous
  13. ConnectionHandler ShutdownPeer OutboundError
  14. PromoteColdLocalPeers
  15. ConnectionHandler HandshakeSuccess
  16. PeerStatusChanged ColdToWarm
  17. InboundGov PromoteToWarmRemote
  18. PromoteColdDone
  19. PromoteWarmColdPeers
  20. PeerStatusChanged WarmToHot
  21. PromoteWarmDone
  22. Mux Remote ExceptionExit
  23. PeerStatusChanged HotToCold
  24. InboundGovernor Remote MuxErrored
  25. DemoteLocalAsynchronous
  26. ConnectionHandler ShutdownPeer Outbound Error
  27. PromoteColdLocalPeers
  28. ConnectionHandler HandshakeSuccess
  29. PeerStatusChanged ColdToWarm
  30. PromotedToWarmRemote
  31. PromotedColdDone
  32. InboundGovernor Remote MuxErrored
  33. PeerStatusChanged WarmToCold
  34. DemoteLocalAsynchrounous
  35. PromoteColdLocalPeers
  36. ConnectionExists <-----
  37. ConnectionHandler (resource vanished)
  38. PromoteColdFailed
  39. PromoteColdLocalPeers
  40. ConnectionHandler Handshake success
  41. PeerStatusChanged ColdToWarm
  42. DemotedToCooldRemote OperationSuccessTr
  43. PromoteColdDone
  44. PromotedToWarmRemote
  45. InboundGovernor Remote MuxErrored
  46. PeerStatusChanged WarmToCold
  47. DemoteLocalAsync
  48. PromoteColdLocalPeers
  49. ConnectionExists <-----
  50. HandhsakeError (resource vanished)
  51. PromoteColdDailed
    ...

@coot coot added diffusion Issues / PRs related to diffusion layer and removed networking labels Dec 18, 2023
@bolt12
Copy link
Contributor

bolt12 commented Jan 31, 2024

I believe this is fixed with #4734 and #4555. Hot demotions failing are a bit more frequent than what we expected and the node wasn't handling these cases very well, in particular it wasn't waiting for the connection to effectively terminate before removing the peers from the appropriate sets. I believe TrConnectionExists traces in these situations have been mitigated.

@bolt12 bolt12 closed this as completed Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working diffusion Issues / PRs related to diffusion layer
Projects
Status: Done
Development

No branches or pull requests

3 participants