Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated values need to be sorted according to weight as well as mean in MergingDigest #154

Closed
tdunning opened this issue Jan 20, 2021 · 1 comment

Comments

@tdunning
Copy link
Owner

tdunning commented Jan 20, 2021

If there are multiple centroids with the same mean we should prefer to sort the ones with higher weight towards the median.

The reason for doing this is that without this tweak on the sort, centroids with the same mean that previously for which the size constraint was satisfied could be inverted during the sort and, as a result, now violate the constraint. That leads to reduced accuracy and apparent non-deterministic behavior. It can also cause problems because the first and last centroid can wind up with weights greater than one. This can adversely affect accuracy since the interpolation algorithms assume that the extreme centroids have unit weight.

This problem is known to occur in the Merging Digest, but it is likely to be true of the AVLTreeDigest as well. This issue only has to do with the MergingDigest. Issue #155 deals with the corresponding problem in the AVLTreeDigest.

@tdunning
Copy link
Owner Author

This was closed by eca1125
Instead of fancy sorting by weight, I changed the sort to be stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant