[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. #2956
How about using ThreadLocal?
To store the contents of the HashMap, or something else?
Make the …
With ThreadLocal, only a small code change is needed.
Looks good. It can replace the lock. And change the …
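The thread does not show what would be kept in the ThreadLocal, so the following is only a minimal sketch of the general idea: each thread gets its own scratch buffer, so building the export string needs no lock around that buffer. The class `ThreadLocalMetricsSketch` and its methods `inc` and `export` are hypothetical names for illustration and are not taken from `AbstractSource`.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for a metrics source; NOT Celeborn's AbstractSource.
class ThreadLocalMetricsSketch {
  // Shared counters; ConcurrentHashMap tolerates concurrent readers and writers.
  private val counters = new ConcurrentHashMap[String, AtomicLong]()

  // One StringBuilder per thread, so no lock is needed to protect the buffer.
  private val buffer = new ThreadLocal[java.lang.StringBuilder] {
    override def initialValue(): java.lang.StringBuilder = new java.lang.StringBuilder
  }

  def inc(name: String, delta: Long = 1L): Unit = {
    var counter = counters.get(name)
    if (counter == null) {
      val fresh = new AtomicLong(0L)
      val prev = counters.putIfAbsent(name, fresh)
      counter = if (prev != null) prev else fresh
    }
    counter.addAndGet(delta)
  }

  // Builds "name value" lines in the calling thread's private buffer.
  def export(): String = {
    val sb = buffer.get()
    sb.setLength(0) // reuse the buffer across calls on the same thread
    val it = counters.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      sb.append(e.getKey).append(' ').append(e.getValue.get()).append('\n')
    }
    sb.toString
  }
}
```

Because the buffer is never shared, two threads exporting at the same time cannot interleave their output, and a thread recording a metric is never blocked by an export in progress.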
If we want the metrics to be exported in the same order in which they were added, we must sort all the metrics globally.
Do we need to sort the metrics globally? It seems the issue in the ticket is caused only by the lock.
There are many application metrics, and I want to limit them as they approach capacity. That is unrelated to the current Jira issue, so I removed the sorting code and will open a separate pull request to handle it without sorting.
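As a rough illustration of limiting exported metrics without a global sort, the sketch below simply stops emitting once a capacity is reached. It assumes the limit is a plain count; the object `CapacitySketch`, the function `exportWithCapacity`, and the `capacity` parameter are hypothetical and do not reflect the follow-up PR's actual design.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper; names and the count-based limit are assumptions.
object CapacitySketch {
  // Returns at most `capacity` lines. Entries come out in the map's
  // iteration order, which is unspecified: no sorting is performed.
  def exportWithCapacity(
      metrics: ConcurrentHashMap[String, java.lang.Long],
      capacity: Int): Seq[String] = {
    val out = Seq.newBuilder[String]
    var count = 0
    val it = metrics.entrySet().iterator()
    while (it.hasNext && count < capacity) {
      val e = it.next()
      out += s"${e.getKey} ${e.getValue}"
      count += 1
    }
    out.result()
  }
}
```

The trade-off is that which metrics get dropped at the limit is arbitrary, whereas a sorted export could choose them deterministically at the cost of the sort.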
@zaynt4606 Thank you for your code contribution. Will there be any further code changes? Can I cherry-pick this commit into my company's build now?
The code will change in the follow-up. To solve the Jira problem you can cherry-pick this PR (you don't need to cherry-pick the follow-up).
@zaynt4606 Thank you very much, your responses and bug fixes are so fast. Thumbs up to you 👍
@zaynt4606 After I cherry-pick the code into our company's version, do I just need to replace the client on the worker node and restart it to solve the problem? Do I also need to replace the client on the master node and restart it?
You don't need to replace the client, but you do have to redeploy the server.
LGTM. Could you add a unit test to check whether the capacity limit is effective?
More unit tests are in the follow-up.
OK, merge to main (v0.6.0).
@zaynt4606 @turboFei @RexXiong @FMX Thank you for your code contribution. I cherry-picked the submitted code into our company's internal version, rebuilt it, replaced celeborn-common_2.12-0.5.0.jar and celeborn-client-flink-1.16-shaded_2.12-0.5.0.jar on the worker node, and restarted the Celeborn worker node. After a week of observation, the problem has not recurred. Thank you very much 🙏
Glad to help!
…lure caused by locked resources

Remove the ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Current problems: [jira CELEBORN-1743](https://issues.apache.org/jira/browse/CELEBORN-1743)
the lock in [[CELEBORN-1453]](apache#2548) might block the thread.

No

Manual test
same result with CELEBORN-1453
![image](https://github.com/user-attachments/assets/3e3a4c53-1cf6-48f6-8c37-67d875d675af)

Closes apache#2956 from zaynt4606/clb1743.

Authored-by: zhengtao <[email protected]>
Signed-off-by: Shuang <[email protected]>
…lure caused by locked resources

Remove the ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Current problems: [jira CELEBORN-1743](https://issues.apache.org/jira/browse/CELEBORN-1743)
the lock in [[CELEBORN-1453]](#2548) might block the thread.

No

Manual test
same result with CELEBORN-1453
![image](https://github.com/user-attachments/assets/3e3a4c53-1cf6-48f6-8c37-67d875d675af)

Closes #2956 from zaynt4606/clb1743.

Authored-by: zhengtao <shuaizhentao.sztalibaba-inc.com>

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

Closes #3005 from zaynt4606/branch-0.5-dev.

Authored-by: zhengtao <[email protected]>
Signed-off-by: mingji <[email protected]>
What changes were proposed in this pull request?
Remove the ConcurrentLinkedQueue and the lock in AbstractSource, which might cause metrics data interruption and job failure.
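For readers unfamiliar with the pattern, here is a minimal sketch of recording metrics from several threads without any shared lock, using only `java.util.concurrent` classes. It is not the code from this PR: `LockFreeRecorder`, `LockFreeRecorderDemo`, `runDemo`, and the metric name `example_metric` are hypothetical and chosen for illustration.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder

// Hypothetical recorder; NOT the implementation in AbstractSource.
class LockFreeRecorder {
  private val totals = new ConcurrentHashMap[String, LongAdder]()

  // Adds `value` to the named metric without taking an explicit lock.
  def record(name: String, value: Long): Unit = {
    var adder = totals.get(name)
    if (adder == null) {
      val fresh = new LongAdder
      val prev = totals.putIfAbsent(name, fresh) // first writer wins
      adder = if (prev != null) prev else fresh
    }
    adder.add(value)
  }

  def total(name: String): Long = {
    val adder = totals.get(name)
    if (adder == null) 0L else adder.sum()
  }
}

object LockFreeRecorderDemo {
  // Starts `threads` workers that each record `perThread` values of 1.
  def runDemo(threads: Int, perThread: Int): Long = {
    val recorder = new LockFreeRecorder
    val workers = (0 until threads).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = {
          var i = 0
          while (i < perThread) {
            recorder.record("example_metric", 1L)
            i += 1
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    recorder.total("example_metric")
  }
}
```

Calling `LockFreeRecorderDemo.runDemo(4, 1000)` should return 4000, since every increment lands in the same `LongAdder` and no thread waits on another to record.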
Why are the changes needed?
Current problems: Jira CELEBORN-1743.
The lock introduced in [CELEBORN-1453] might block the thread.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual test.
Same result as with CELEBORN-1453.