
fix: erase persistent terms on app stop #670

Open
wants to merge 1 commit into base: main
Conversation

SergeTupchiy
Contributor

No description provided.


codecov bot commented Dec 26, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (977dcb1) 73.02% compared to head (bf08e05) 73.08%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   73.02%   73.08%   +0.05%     
==========================================
  Files          61       61              
  Lines        1924     1932       +8     
==========================================
+ Hits         1405     1412       +7     
- Misses        519      520       +1     
Flag       Coverage Δ
api        69.71% <100.00%> (+0.07%) ⬆️
elixir     17.34% <0.00%> (-0.13%) ⬇️
erlang     74.40% <100.00%> (+0.05%) ⬆️
exporter   67.47% <ø> (ø)
sdk        78.76% <100.00%> (+0.07%) ⬆️
zipkin     54.16% <ø> (ø)


@bryannaegele
Contributor

This would trigger a global GC. What is the purpose of deleting the terms?

@SergeTupchiy
Contributor Author

> This would trigger a global GC. What is the purpose of deleting the terms?

If the app is completely stopped, I think it should clean up its data, even if that has some extra cost like a global GC.
Additionally, this code doesn't look entirely safe: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry_api/src/opentelemetry.erl#L248
A previous tracer can still be present and contain stale config, resulting in various bugs and issues.
This is a good example: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/test/opentelemetry_SUITE.erl#L593
In another PR of mine (#671), the second run of this test case (the batch_processor group after the simple_processor group) would fail without cleaning up persistent terms, since it would use the older config.
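
A minimal sketch of what such a cleanup could look like in the application `stop/1` callback; the key shapes below are assumptions for illustration, not the SDK's actual keys (those live in opentelemetry.erl):

```erlang
%% Erase all SDK-owned persistent terms when the application stops.
stop(_State) ->
    [persistent_term:erase(Key)
     || {Key, _Value} <- persistent_term:get(),
        is_otel_key(Key)],
    ok.

%% Hypothetical predicate over the assumed key shapes: treat any key
%% tagged with the `opentelemetry' atom as SDK-owned.
is_otel_key({opentelemetry, _}) -> true;
is_otel_key({opentelemetry, _, _}) -> true;
is_otel_key(_) -> false.
```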

@tsloughter
Member

So we actually used to do this, but I removed it. I need to dig up the commit/PR to remember why :)

@tsloughter
Member

Or no, maybe we never deleted all the terms, just set the tracer to noop, which I realized was unnecessary and removed in #569.

@bryannaegele
Contributor

I agree in principle with "clean up after yourself", but under our guideline of minimizing application impact during runtime, I'm reluctant to trigger a global GC. And we actively discourage attempting to restart the SDK.

@SergeTupchiy
Contributor Author

SergeTupchiy commented Jan 3, 2024

> I agree in principle with "clean up after yourself", but under our guideline of minimizing application impact during runtime, I'm reluctant to trigger a global GC. And we actively discourage attempting to restart the SDK.

If SDK restart is discouraged, it's probably not that bad to clean up persistent terms when it is stopped (let's assume it really does need to be stopped in that case). It would also probably make sense to change the way tracers are stored in persistent terms: currently there are too many terms, e.g. one tracer per application, even though they are essentially all the same. Erasing fewer persistent terms should put less stress on the system.
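
For illustration, a sketch of what consolidating the tracers into a single term could look like; the `{otel_tracers, all}` key, the map layout, and the `otel_tracer_noop` default are assumptions, not the SDK's actual structures:

```erlang
%% Hypothetical consolidation: one persistent term holding a map of
%% per-application tracers instead of one term per application.
set_tracers(TracersMap) when is_map(TracersMap) ->
    persistent_term:put({otel_tracers, all}, TracersMap).

get_tracer(AppName) ->
    TracersMap = persistent_term:get({otel_tracers, all}, #{}),
    maps:get(AppName, TracersMap, otel_tracer_noop).

%% Stopping the SDK then only needs to erase a single term.
erase_tracers() ->
    persistent_term:erase({otel_tracers, all}).
```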

Side note: in my personal opinion, if SDK restart is discouraged, two things must be accomplished:

  • the SDK must have a decent API to change its configuration on the fly;
  • the SDK must have some API to control its activity, so that a user can enable/disable it and be sure that only a minimal amount of resources is occupied when the SDK is not in use.

Finally, the global GC impact during the cleanup may not be dramatic enough to justify leaving the terms behind even after a full SDK stop. 🤔
I did some tests (on basic hardware):

  • erasing persistent terms with 1M processes, each holding an active tracer, took ~60-80 ms (though the processes were sleeping almost all the time and their heaps were small/empty);
  • erasing terms with 200K processes, each holding a tracer, doing some work, and allocating some other garbage terms took ~300 ms, but the system was really busy and approaching OOM.

Unless a user actively stops/starts the SDK, I think the overhead of the persistent_term cleanup is quite tolerable.
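
For reference, a rough sketch of the kind of measurement described above; the key, the tracer value, and the process bodies are assumptions, not the actual benchmark code:

```erlang
%% Spawn N processes that each hold a reference to a persistent term,
%% then time how long erasing that term takes.
bench_erase(NumProcs) ->
    Key = {otel_tracer, bench},
    persistent_term:put(Key, {tracer, #{sampler => always_on}}),
    Pids = [spawn(fun() ->
                      Tracer = persistent_term:get(Key),
                      receive stop -> Tracer end
                  end)
            || _ <- lists:seq(1, NumProcs)],
    %% erase/1 schedules a global GC so processes still referencing the
    %% erased term get their heaps scanned.
    {MicroSecs, true} = timer:tc(persistent_term, erase, [Key]),
    [Pid ! stop || Pid <- Pids],
    MicroSecs / 1000.   %% milliseconds
```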

@tsloughter
Member

Maybe the overhead is acceptable. It's a tough one to be sure about, so we erred on the side of just not causing the GC.

I like it being cleaner and helping with testing. I'd say I'm torn on this issue.

@ferd
Member

ferd commented Jan 7, 2024

Given that the intent is to clean up the persistent terms on restart, because otherwise there's a bug from straggling data, there are two ways to fix it:

  1. clean-up on shutdown
  2. aggressively overwrite on config change/startup

Both the clean-up and the overwrite should result in a global GC, but the overwrite should only GC the terms that changed. This may be less impactful?
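
A sketch of what option 2 could look like, only touching terms whose value actually changed; the key/value shapes and the sentinel default are assumptions:

```erlang
%% Overwrite a persistent term only when its value differs from what is
%% already stored, so unchanged terms never trigger the copy-and-GC cost.
put_if_changed(Key, NewValue) ->
    case persistent_term:get(Key, '$not_set') of
        NewValue -> ok;                            %% unchanged: skip
        _Other   -> persistent_term:put(Key, NewValue)
    end.
```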

@tsloughter
Member

Overwriting doesn't work in the case of the Tracer being cached -- unless we expire the cache on restart somehow.

For tests, I realized the immediate solution is to use a unique TracerProvider for each test suite.

The term is keyed in part on the TracerProvider's name, and by default the global TracerProvider is used when a user requests a Tracer -- its name is otel_tracer_provider_global. Tests not explicitly testing use of the global provider should create and use a new one.

This does not work for SDK configurations like default propagators.
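
To illustrate the keying described above (the exact tuple shape is an assumption, not the SDK's actual key format):

```erlang
%% Assumed shape of a tracer key: it includes the provider name, so a
%% per-suite provider never collides with the entry cached under the
%% default `otel_tracer_provider_global' name.
tracer_key(ProviderName, InstrumentationScope) ->
    {otel_tracer, ProviderName, InstrumentationScope}.

%% e.g. a suite-local provider:
%%   tracer_key(my_SUITE_provider, my_app)
%% vs. the global default:
%%   tracer_key(otel_tracer_provider_global, my_app)
```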

@tsloughter
Member

Technically we are supposed to provide a way to reconfigure the Tracers. So we actually SHOULD add this, with a big warning in the docs not to do it unless you really, really need to and understand the consequences.

@bryannaegele
Contributor

One more thing to call out around Fred's comments: AFAIK we've never heard of otel itself crashing, nor of use cases where folks are stopping and starting the app outside of tests. I think Tristan's comments address a few folks' issues where specific tests are being affected, but those aren't production issues. I'd rather see that get addressed than rework these particular internals.
