
Hypergeometric incorrect samples #1508

Open
benjamin-lieser opened this issue Oct 10, 2024 · 6 comments
Labels
A-degenerate (Algorithm fails on some input), X-bug (Type: bug report)

Comments

@benjamin-lieser
Collaborator

Hypergeometric::new(100,50,49) produces samples which are very likely not from this distribution.

The distribution is not very extreme, so I would expect this to sample correctly.

One piece of evidence is the failed KS test (see #1504).

I also did a chi-squared test, which gives a p-value of 0.0 for a million samples:

The frequencies I sample:

array([0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 4.00000e+00, 3.70000e+01,
       1.99000e+02, 8.04000e+02, 2.90700e+03, 8.51200e+03, 1.99080e+04,
       3.89010e+04, 7.11980e+04, 1.12618e+05, 1.42632e+05, 1.42973e+05,
       1.42408e+05, 1.21861e+05, 8.93170e+04, 5.53630e+04, 2.98190e+04,
       1.32620e+04, 5.03400e+03, 1.67100e+03, 4.55000e+02, 9.80000e+01,
       1.60000e+01, 3.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00])

The theoretical frequencies:

array([5.05494304e-22, 6.19230523e-19, 2.42738365e-16, 4.56348126e-14,
       4.93312324e-12, 3.40385504e-10, 1.60467452e-08, 5.42150748e-07,
       1.35989479e-05, 2.60193203e-04, 3.87924412e-03, 4.58456124e-02,
       4.35533318e-01, 3.36461453e+00, 2.13412693e+01, 1.12041664e+02,
       4.90182279e+02, 1.79733502e+03, 5.54966604e+03, 1.44875492e+04,
       3.20795733e+04, 6.04095861e+04, 9.69418655e+04, 1.32768207e+05,
       1.55338802e+05, 1.55338802e+05, 1.32768207e+05, 9.69418655e+04,
       6.04095861e+04, 3.20795733e+04, 1.44875492e+04, 5.54966604e+03,
       1.79733502e+03, 4.90182279e+02, 1.12041664e+02, 2.13412693e+01,
       3.36461453e+00, 4.35533318e-01, 4.58456124e-02, 3.87924412e-03,
       2.60193203e-04, 1.35989479e-05, 5.42150748e-07, 1.60467452e-08,
       3.40385504e-10, 4.93312324e-12, 4.56348126e-14, 2.42738365e-16,
       6.19230523e-19, 5.05494304e-22])
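For what it's worth, the theoretical frequencies above can be reproduced with scipy (a sketch, assuming scipy's `hypergeom(M, n, N)` = (population, successes, draws) parameterization corresponds to `Hypergeometric::new(100, 50, 49)`):

```python
import numpy as np
from scipy.stats import hypergeom

N_SAMPLES = 1_000_000

# scipy's hypergeom(M, n, N): M = population size, n = success states, N = draws.
dist = hypergeom(100, 50, 49)
expected = dist.pmf(np.arange(50)) * N_SAMPLES

# The peak matches the 1.55338802e+05 entries in the array above;
# scipy.stats.chisquare(observed, f_exp=expected) then gives the p-value.
print(expected.max())
```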
@benjamin-lieser benjamin-lieser added the X-bug Type: bug report label Oct 10, 2024
@WarrenWeckesser
Collaborator

It looks like there is something wrong with the rejection-acceptance sampling method. For various input parameters, I generated 10000000 samples, and compared the frequencies of the output values to the theoretical frequencies. ("Compare" means I just printed the observed and expected frequencies, and looked for big discrepancies. I used a Python script for that, and used the pmf method of scipy.stats.hypergeom to calculate the expected frequencies.) I get incorrect results with inputs such as (65, 30, 28), (48, 25, 20), and (40, 20, 19), which all use the rejection-acceptance method. I haven't seen any unexpected discrepancies when the inverse-transform sampling method is used.
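The comparison described above can be sketched as follows, with numpy's independent hypergeometric sampler standing in for the rand implementation (so this illustrates the methodology, not the bug), using the (65, 30, 28) case:

```python
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(0)
n1, n2, k = 30, 35, 28            # the (65, 30, 28) case: n = n1 + n2 = 65
n_samples = 1_000_000             # the comment above used 10_000_000

samples = rng.hypergeometric(ngood=n1, nbad=n2, nsample=k, size=n_samples)
observed = np.bincount(samples, minlength=k + 1)
expected = hypergeom(n1 + n2, n1, k).pmf(np.arange(k + 1)) * n_samples

# Print observed vs expected side by side and eyeball big discrepancies.
for x, (o, e) in enumerate(zip(observed, expected)):
    if e >= 1.0:
        print(f"{x:2d}  observed={o:8d}  expected={e:10.1f}")
```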

@benjamin-lieser benjamin-lieser added the A-degenerate Algorithm fails on some input label Oct 10, 2024
@WarrenWeckesser
Collaborator

> The (100,50,49) case uses the inverse sampling code

Interesting. That's not what I observe.

With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. The code that determines the method is

const HIN_THRESHOLD: f64 = 10.0;
let m = ((k + 1) as f64 * (n1 + 1) as f64 / (n + 2) as f64).floor();
let sampling_method = if m - f64::max(0.0, k as f64 - n2 as f64) < HIN_THRESHOLD {

With those inputs, n1 = n2 = 50, n = 100, k = 49, and the mode m = 25. The value that determines the sampling method is m - max(0, k - n2) = 25, which is greater than the threshold HIN_THRESHOLD = 10, so RejectionAcceptance is used.
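For anyone double-checking, here is a small Python mirror of the same selection arithmetic (a hypothetical sketch of the Rust snippet above, not the rand code itself):

```python
from math import floor

HIN_THRESHOLD = 10.0

def uses_inverse_method(n1: int, n2: int, k: int) -> bool:
    """Mirror of the method-selection logic quoted above."""
    n = n1 + n2
    m = floor((k + 1) * (n1 + 1) / (n + 2))
    return m - max(0.0, k - n2) < HIN_THRESHOLD

# Hypergeometric::new(100, 50, 49): m = 25, selector value 25 >= 10,
# so RejectionAcceptance is chosen.
print(uses_inverse_method(50, 50, 49))   # False
```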

@benjamin-lieser
Collaborator Author

> With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. [...]

I was mistaken: I had it in a debugger from the KS tests, but I think I forgot to comment out the other hyperparameter.

@benjamin-lieser
Collaborator Author

I would wait a bit if someone with experience with the algorithm (maybe @teryror ?) wants to investigate. Otherwise I will try myself.

@WarrenWeckesser
Collaborator

It turns out the problem is a bug in the original algorithm. R discovered this years ago: https://bugs.r-project.org/show_bug.cgi?id=7314

The fix is to change this line

f /= (n1 - i) as f64 * (k - i) as f64;

to

f /= (n1 - i + 1) as f64 * (k - i + 1) as f64; 
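The "+ 1" terms follow from the hypergeometric pmf recurrence p(i)/p(i-1) = (n1-i+1)(k-i+1) / (i * (n2-k+i)); a quick scipy check of that identity for the (40, 20, 19) case (a sketch, independent of the rand code):

```python
import numpy as np
from scipy.stats import hypergeom

n1, n2, k = 20, 20, 19            # the (40, 20, 19) case mentioned earlier
dist = hypergeom(n1 + n2, n1, k)

# pmf recurrence: p(i) / p(i-1) = (n1 - i + 1)(k - i + 1) / (i * (n2 - k + i));
# the "+ 1" factors are exactly what the fix restores.
for i in range(1, k + 1):
    lhs = dist.pmf(i) / dist.pmf(i - 1)
    rhs = (n1 - i + 1) * (k - i + 1) / (i * (n2 - k + i))
    assert np.isclose(lhs, rhs)
print("recurrence holds on the whole support")
```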

There is a separate (but apparently not so significant) bug: in ln_of_factorial(v), Stirling's approximation is used instead of actually computing ln(v!) = ln(gamma(v + 1)) accurately. That approximation is not very good for small to moderate values of v. For example, with Hypergeometric::new(40, 20, 19), the function ln_of_factorial(v) is called with values ranging from 7 to 13. For the input 7, it returns 6.621371043387192, but the correct value of ln(7!) is 8.525161361065415. I haven't tried to figure out how this affects the correctness of the algorithm.
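The accurate value is available as the log-gamma function; for comparison, the crudest Stirling form v*ln(v) - v agrees with the quoted ln_of_factorial(7) value to about seven significant digits (an assumption about what the implementation computes, inferred only from that value):

```python
from math import lgamma, log

# Accurate ln(v!) via log-gamma: ln(v!) = lgamma(v + 1).
def ln_factorial(v: float) -> float:
    return lgamma(v + 1.0)

# Crudest Stirling form; close to the quoted 6.621371... for v = 7
# (an assumption about the implementation, inferred from that value).
crude = 7 * log(7) - 7
print(crude, ln_factorial(7))   # ~6.62137 vs 8.525161361065415
```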

@benjamin-lieser
Collaborator Author

Really good catch :)

I actually tried to see if fixing the Stirling approximation helps, but it had no measurable effect on the KS statistic (though this was with the recurrence bug still present). It would be good to know the minimal values it can be called with.

@benjamin-lieser benjamin-lieser mentioned this issue Oct 12, 2024