Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. #2956

Closed
wants to merge 9 commits into from

Conversation

zaynt4606
Copy link
Contributor

@zaynt4606 zaynt4606 commented Nov 27, 2024

What changes were proposed in this pull request?

Remove the ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Why are the changes needed?

Current problems:jira CELEBORN-1743
the lock in [CELEBORN-1453] might block the thread.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test
same result with CELEBORN-1453
image

@zaynt4606 zaynt4606 changed the title [CELEBORN-1743] fix metrics data interruption and job fail due to the bolck of the lock [CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources. Nov 27, 2024
@turboFei
Copy link
Member

How about using threadLocal?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Nov 27, 2024

How about using threadLocal?

To store the contents of the hashmap or else ?
These hashMaps have just been moved together from under the code, not created.

@turboFei
Copy link
Member

How about using threadLocal?

To store the contents of the hashmap or else ? These hashMaps have just been moved together from under the code, not created.

Make the innerMetrics be ThreadLocal.

@zaynt4606
Copy link
Contributor Author

How about using threadLocal?

To store the contents of the hashmap or else ? These hashMaps have just been moved together from under the code, not created.

Make the innerMetrics be ThreadLocal.

innerMetrics appears to only be used to limit capacity in the code and I think it can be removed directly. The original code retrieves data from the hashmap and puts it into innerMetrics, which affects the intended order of the queue.

@turboFei
Copy link
Member

turboFei commented Nov 27, 2024

With threadLocal, just need little code change.

turboFei@a208931

@zaynt4606
Copy link
Contributor Author

With threadLocal, just need little code change.

turboFei@7384ca6

Looks good. It can replace the lock. And change the innerMetrics.remove() to return can avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code.
Does innerMetrics need to be preserved?
cc @FMX @RexXiong

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want the order of the metrics when export to be the same as when they were added, we must globally sort all the metrics.

@turboFei
Copy link
Member

Do we need to global sort the metrics?

Seems the issue mentioned in the ticket is just caused by lock?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Nov 28, 2024

Do we need to global sort the metrics?

Seems the issue mentioned in the ticket is just caused by lock?

There are a significant number of application metrics, and I want to minimize them as they approach capacity.

This issue is unrelated to the current Jira task, so I removed the sorting code and will create another pull request to address it without sorting.

@zaynt4606
Copy link
Contributor Author

With threadLocal, just need little code change.
turboFei@7384ca6

Looks good. It can replace the lock. And change the innerMetrics.remove() to return can avoids ineffective repeated queue updates when capacity is exceeded. Ordering can also be added to this part of the code. Does innerMetrics need to be preserved? cc @FMX @RexXiong

updateInnerMetrics will be called by recordTimer <- doStopTimer <- FetchHandler: workerSource.stopTimer
not only used in getMetrics so threadLocal can't be used here.

@zhisheng17
Copy link

@zaynt4606 Thank you for your code contribution. will there be any code changes? Can I cherry pick this submission to my company now?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Dec 2, 2024

@zaynt4606 Thank you for your code contribution. will there be any code changes? Can I cherry pick this submission to my company now?

The code will change in followUp . To solve the jira problem you can cherry pick this pr (dont need to cherry pick the followUp).

@zhisheng17
Copy link

@zaynt4606 Thank you very much, your response and code bug fixes are so fast, thumbs up to you 👍

@zhisheng17
Copy link

@zaynt4606 After I cherry-pick the code to our company, do I just need to replace the client of the worker node and restart it to solve the problem? Do I need to replace the client of the master node and restart it?

@zaynt4606
Copy link
Contributor Author

@zaynt4606 After I cherry-pick the code to our company, do I just need to replace the client of the worker node and restart it to solve the problem? Do I need to replace the client of the master node and restart it?

u don't need to replace the client and have to redeploy the server.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Could you add ut to check whether the capacity limit is effective?

@zaynt4606
Copy link
Contributor Author

zaynt4606 commented Dec 4, 2024

LGTM, Could you add ut to check whether the capacity limit is effective?

More ut tests are in the followup

@RexXiong RexXiong closed this in cc04d13 Dec 4, 2024
@RexXiong
Copy link
Contributor

RexXiong commented Dec 4, 2024

Ok, merge to main(v0.6.0)

@zhisheng17
Copy link

@zaynt4606 @turboFei @RexXiong @FMX Thank you for your code contribution. I cherry-picked the submitted code to our company's internal version, rebuilt and compiled the code, replaced the celeborn-common_2.12-0.5.0.jar and celeborn-client-flink-1.16-shaded_2.12-0.5.0.jar of the worker node, and restarted the celeborn worker node. After observing for a week, this problem no longer occurs. Thank you very much🙏

@zaynt4606
Copy link
Contributor Author

@zaynt4606 @turboFei @RexXiong @FMX Thank you for your code contribution. I cherry-picked the submitted code to our company's internal version, rebuilt and compiled the code, replaced the celeborn-common_2.12-0.5.0.jar and celeborn-client-flink-1.16-shaded_2.12-0.5.0.jar of the worker node, and restarted the celeborn worker node. After observing for a week, this problem no longer occurs. Thank you very much🙏

Glad to help!
Recording the problem and giving feedback is also an important part of building Celeborn! Thank you, too~

zaynt4606 added a commit to zaynt4606/celeborn that referenced this pull request Dec 18, 2024
…lure caused by locked resources

Remove the  ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Current problems:[jira CELEBORN-1743](https://issues.apache.org/jira/browse/CELEBORN-1743)
the lock in [[CELEBORN-1453]](apache#2548) might block the thread.

No

Manual test
same result with CELEBORN-1453
![image](https://github.com/user-attachments/assets/3e3a4c53-1cf6-48f6-8c37-67d875d675af)

Closes apache#2956 from zaynt4606/clb1743.

Authored-by: zhengtao <[email protected]>
Signed-off-by: Shuang <[email protected]>
FMX pushed a commit that referenced this pull request Dec 18, 2024
…lure caused by locked resources

…lure caused by locked resources

Remove the  ConcurrentLinkedQueue and lock in AbstractSource which might cause the metrics data interruption and job fail.

Current problems:[jira CELEBORN-1743](https://issues.apache.org/jira/browse/CELEBORN-1743) the lock in [[CELEBORN-1453]](#2548) might block the thread.

No

Manual test
same result with CELEBORN-1453
![image](https://github.com/user-attachments/assets/3e3a4c53-1cf6-48f6-8c37-67d875d675af)

Closes #2956 from zaynt4606/clb1743.

Authored-by: zhengtao <shuaizhentao.sztalibaba-inc.com>

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #3005 from zaynt4606/branch-0.5-dev.

Authored-by: zhengtao <[email protected]>
Signed-off-by: mingji <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants