Messages go into infinite immediate retries when there are many messages (over 100) with a persistent processing issue #1259

Open
bonieckimarcin opened this issue Jun 16, 2023 · 5 comments

bonieckimarcin commented Jun 16, 2023

Describe the bug

Description

We have encountered a problem with an infinite loop of immediate retries when the queue contains a large number of consecutive messages (between 100 and 1000 in our tests) that have a persistent issue during processing. After the configured number of immediate retries (2 in the test), all messages go to the first delayed retry. After the configured delay, the messages come back for processing. At this point they undergo infinite immediate retries and never go to a delayed retry again.
We have noticed that this problem occurs when LimitMessageProcessingConcurrencyTo is set to more than 1.

Expected behavior

Messages go through immediate retries, then delayed retries, and finally land in the error queue.

Actual behavior

After the first round of immediate retries, messages go to the first delayed retry. After that, they undergo infinite immediate retries.

Versions

Affected package:

  • NServiceBus.RabbitMQ 7.0.2, 7.0.3, 7.0.4, 7.0.5

Tested with:

  • NServiceBus 7.8.2
  • .NET Framework 4.7.2 and .NET 6
  • RabbitMQ Server 3.10.20

Steps to reproduce

  1. Create a classic queue endpoint on .NET Framework 4.7.2 or .NET 6 that handles one type of message
  2. Use delayed message infrastructure v2
  3. Set the following endpoint parameters (see the configuration sketch after this list):
  • Number of immediate retries: 2
  • Number of delayed retries: 10
  • Delayed retry time increase: 60s
  • LimitMessageProcessingConcurrencyTo: 3
  • Outbox enabled
  • TransportTransaction: ReceiveOnly
  • PrefetchMultiplier: 10
  4. In the message handler, throw an exception to simulate a permanent processing problem
  5. Keep the endpoint stopped
  6. Add more than 100 messages to the queue
  7. Start the endpoint
  8. Observe the endpoint log
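
For reference, a minimal sketch of how the parameters above can be expressed, assuming the NServiceBus 7 / NServiceBus.RabbitMQ 7 APIs; the endpoint name, connection string, persistence choice, and message/handler names are placeholders, not taken from the actual solution:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("ClassicQueueRetryRepro");

var transport = endpointConfiguration.UseTransport<RabbitMQTransport>();
transport.ConnectionString("host=localhost");
transport.UseConventionalRoutingTopology(QueueType.Classic); // classic queues
transport.Transactions(TransportTransactionMode.ReceiveOnly);
transport.PrefetchMultiplier(10);

var recoverability = endpointConfiguration.Recoverability();
recoverability.Immediate(immediate => immediate.NumberOfRetries(2));
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(10);
    delayed.TimeIncrease(TimeSpan.FromSeconds(60));
});

endpointConfiguration.LimitMessageProcessingConcurrencyTo(3);
endpointConfiguration.UsePersistence<LearningPersistence>(); // any outbox-capable persistence
endpointConfiguration.EnableOutbox();

var endpointInstance = await Endpoint.Start(endpointConfiguration);

// Handler that simulates the permanent processing failure from step 4.
public class FailingHandler : IHandleMessages<TestMessage>
{
    public Task Handle(TestMessage message, IMessageHandlerContext context)
    {
        throw new Exception("Simulated permanent processing failure");
    }
}

public class TestMessage : IMessage
{
}
```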

Relevant log output

2023-06-13 00:00:18,437 [282] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:00:18,445 [282] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:00:19,022 [193] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:00:19,037 [193] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:00:19,584 [278] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:00:19,591 [278] WARN  NServiceBus.RecoverabilityExecutor - Delayed Retry will reschedule message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' after a delay of 00:01:00 because of an exception:
2023-06-13 00:01:25,815 [261] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:25,826 [261] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:26,136 [252] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:26,148 [252] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:26,420 [242] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:26,430 [242] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:26,628 [242] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:26,634 [242] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:26,842 [256] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:26,847 [256] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:27,050 [276] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:27,058 [276] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:27,311 [276] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:27,318 [276] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:27,561 [276] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:27,568 [276] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:27,734 [274] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:27,744 [274] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:27,935 [274] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:27,953 [274] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:28,154 [238] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:28,159 [238] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:28,323 [256] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:28,333 [256] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:28,475 [196] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:28,483 [196] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
2023-06-13 00:01:28,756 [258] INFO  TransportMutator - incoming msg id: '2eedbb99-fdae-482d-a17f-b01f016aa0ee'
2023-06-13 00:01:28,763 [258] INFO  NServiceBus.RecoverabilityExecutor - Immediate Retry is going to retry message '2eedbb99-fdae-482d-a17f-b01f016aa0ee' because of an exception:
and so on ...

Additional Information

Workarounds

  • Turn off immediate retries
    or
  • Set LimitMessageProcessingConcurrencyTo: 1
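
Expressed against the variables in the configuration sketch above, either workaround amounts to:

```csharp
// Workaround 1: turn off immediate retries.
recoverability.Immediate(immediate => immediate.NumberOfRetries(0));

// Workaround 2: process one message at a time.
endpointConfiguration.LimitMessageProcessingConcurrencyTo(1);
```
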
@bonieckimarcin
Author

In addition, we have noticed that setting PrefetchMultiplier to 1 makes immediate retries work properly.

@bsmykowskiinf

We have also prepared a solution with a minimal configuration that reproduces the error.
To reproduce the issue, please follow the instructions in the readme file of the following repository:
https://github.com/bsmykowskiinf/RabbitRetriesTest

@bording
Member

bording commented Jun 16, 2023

@bonieckimarcin @bsmykowskiinf Thanks for including a repro project. That will make it much easier to investigate.

One question. I see you're using classic queues. Have you tried quorum queues? It would be good to know if it works properly with quorum queues or still fails in the same way.

@bording bording added the bug label Jun 16, 2023
@bording
Member

bording commented Jun 16, 2023

I can reproduce the issue, and as I suspected, it only occurs when classic queues are used.

The broker provides retry count information for us via the x-delivery-count header, but only when consuming messages from a quorum queue. We have to fall back to tracking the data locally when classic queues are being used.

For this reason, and many others, I highly recommend converting to quorum queues as soon as possible! We have a migration guide and a tool to help you do this.

Regarding a fix for when running on classic queues, locally tracked information is never going to be as reliable as information provided by the broker, but we could increase the capacity of the collection. It would still be possible to hit this problem if there are enough failing messages in the queue regardless of how large the collection is, but it's clear that the current value of 100 is too low!
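
To illustrate the difference between the two sources of retry information, here is a simplified sketch of the general pattern, not the actual NServiceBus.RabbitMQ code; the class name, method name, and eviction policy are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class DeliveryCountSketch
{
    // Capacity of the locally tracked collection; 100 is the current value mentioned above.
    const int MaxTrackedMessages = 100;

    readonly ConcurrentDictionary<string, int> localCounts = new();
    readonly ConcurrentQueue<string> evictionOrder = new();

    public int GetDeliveryCount(string messageId, IDictionary<string, object> headers)
    {
        // Quorum queues: the broker stamps each redelivery with an x-delivery-count header,
        // so the count survives any number of concurrently failing messages.
        if (headers != null && headers.TryGetValue("x-delivery-count", out var raw))
        {
            return Convert.ToInt32(raw);
        }

        // Classic queues: no broker-supplied count, so fall back to a bounded in-memory cache.
        // Once more than MaxTrackedMessages distinct failing messages cycle through the queue,
        // older entries are evicted and their counts effectively reset, which is what keeps
        // messages bouncing through immediate retries instead of escalating to delayed retries.
        var count = localCounts.AddOrUpdate(messageId, 1, (_, existing) => existing + 1);
        evictionOrder.Enqueue(messageId);
        while (evictionOrder.Count > MaxTrackedMessages && evictionOrder.TryDequeue(out var evicted))
        {
            localCounts.TryRemove(evicted, out _);
        }
        return count;
    }
}
```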

@bonieckimarcin
Author

Upgrading to quorum queues is on our to-do list, but given the size and complexity of our project, it requires a lot of testing and we are not able to do that immediately. We still have to work with classic queues for some time.
