
Hypergeometric incorrect samples #1508

Open
benjamin-lieser opened this issue Oct 10, 2024 · 6 comments
Labels
A-degenerate (Algorithm fails on some input), X-bug (Type: bug report)

Comments

@benjamin-lieser
Collaborator

Hypergeometric::new(100,50,49) produces samples which are very likely not from this distribution.

The distribution is not very extreme, so I would expect this to sample correctly.

One piece of evidence is the failed KS test (see #1504).

I also did a chi-squared test, which gives a p-value of 0.0 for a million samples:

The frequencies I sample:

array([0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 4.00000e+00, 3.70000e+01,
       1.99000e+02, 8.04000e+02, 2.90700e+03, 8.51200e+03, 1.99080e+04,
       3.89010e+04, 7.11980e+04, 1.12618e+05, 1.42632e+05, 1.42973e+05,
       1.42408e+05, 1.21861e+05, 8.93170e+04, 5.53630e+04, 2.98190e+04,
       1.32620e+04, 5.03400e+03, 1.67100e+03, 4.55000e+02, 9.80000e+01,
       1.60000e+01, 3.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00])

The theoretical frequencies:

array([5.05494304e-22, 6.19230523e-19, 2.42738365e-16, 4.56348126e-14,
       4.93312324e-12, 3.40385504e-10, 1.60467452e-08, 5.42150748e-07,
       1.35989479e-05, 2.60193203e-04, 3.87924412e-03, 4.58456124e-02,
       4.35533318e-01, 3.36461453e+00, 2.13412693e+01, 1.12041664e+02,
       4.90182279e+02, 1.79733502e+03, 5.54966604e+03, 1.44875492e+04,
       3.20795733e+04, 6.04095861e+04, 9.69418655e+04, 1.32768207e+05,
       1.55338802e+05, 1.55338802e+05, 1.32768207e+05, 9.69418655e+04,
       6.04095861e+04, 3.20795733e+04, 1.44875492e+04, 5.54966604e+03,
       1.79733502e+03, 4.90182279e+02, 1.12041664e+02, 2.13412693e+01,
       3.36461453e+00, 4.35533318e-01, 4.58456124e-02, 3.87924412e-03,
       2.60193203e-04, 1.35989479e-05, 5.42150748e-07, 1.60467452e-08,
       3.40385504e-10, 4.93312324e-12, 4.56348126e-14, 2.42738365e-16,
       6.19230523e-19, 5.05494304e-22])
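For what it's worth, the theoretical frequencies above can be reproduced with scipy (a sketch, assuming scipy's `hypergeom(M, n, N)` = (population, successes, draws) parameterization corresponds to `Hypergeometric::new(100, 50, 49)`):

```python
import numpy as np
from scipy.stats import hypergeom

N_SAMPLES = 1_000_000

# scipy's hypergeom(M, n, N): M = population size, n = success states, N = draws.
dist = hypergeom(100, 50, 49)
expected = dist.pmf(np.arange(50)) * N_SAMPLES

# The peak matches the 1.55338802e+05 entries in the array above;
# scipy.stats.chisquare(observed, f_exp=expected) then gives the p-value.
print(expected.max())
```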
@benjamin-lieser benjamin-lieser added the X-bug Type: bug report label Oct 10, 2024
@WarrenWeckesser
Collaborator

It looks like there is something wrong with the rejection-acceptance sampling method. For various input parameters, I generated 10000000 samples, and compared the frequencies of the output values to the theoretical frequencies. ("Compare" means I just printed the observed and expected frequencies, and looked for big discrepancies. I used a Python script for that, and used the pmf method of scipy.stats.hypergeom to calculate the expected frequencies.) I get incorrect results with inputs such as (65, 30, 28), (48, 25, 20), and (40, 20, 19), which all use the rejection-acceptance method. I haven't seen any unexpected discrepancies when the inverse-transform sampling method is used.
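The comparison described above can be sketched as follows, with numpy's independent hypergeometric sampler standing in for the rand implementation (so this illustrates the methodology, not the bug), using the (65, 30, 28) case:

```python
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(0)
n1, n2, k = 30, 35, 28            # the (65, 30, 28) case: n = n1 + n2 = 65
n_samples = 1_000_000             # the comment above used 10_000_000

samples = rng.hypergeometric(ngood=n1, nbad=n2, nsample=k, size=n_samples)
observed = np.bincount(samples, minlength=k + 1)
expected = hypergeom(n1 + n2, n1, k).pmf(np.arange(k + 1)) * n_samples

# Print observed vs expected side by side and eyeball big discrepancies.
for x, (o, e) in enumerate(zip(observed, expected)):
    if e >= 1.0:
        print(f"{x:2d}  observed={o:8d}  expected={e:10.1f}")
```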

@benjamin-lieser benjamin-lieser added the A-degenerate Algorithm fails on some input label Oct 10, 2024
@WarrenWeckesser
Collaborator

> The (100,50,49) case uses the inverse sampling code

Interesting. That's not what I observe.

With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. The code that determines the method is

const HIN_THRESHOLD: f64 = 10.0;
let m = ((k + 1) as f64 * (n1 + 1) as f64 / (n + 2) as f64).floor();
let sampling_method = if m - f64::max(0.0, k as f64 - n2 as f64) < HIN_THRESHOLD {

With those inputs, n1 = n2 = 50, n = 100, k = 49, and the mode m = 25. The value that determines the sampling method is m - max(0, k - n2) = 25, which is greater than the threshold HIN_THRESHOLD = 10, so RejectionAcceptance is used.
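For anyone double-checking, here is a small Python mirror of the same selection arithmetic (a hypothetical sketch of the Rust snippet above, not the rand code itself):

```python
from math import floor

HIN_THRESHOLD = 10.0

def uses_inverse_method(n1: int, n2: int, k: int) -> bool:
    """Mirror of the method-selection logic quoted above."""
    n = n1 + n2
    m = floor((k + 1) * (n1 + 1) / (n + 2))
    return m - max(0.0, k - n2) < HIN_THRESHOLD

# Hypergeometric::new(100, 50, 49): m = 25, selector value 25 >= 10,
# so RejectionAcceptance is chosen.
print(uses_inverse_method(50, 50, 49))   # False
```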

@benjamin-lieser
Collaborator Author

> With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. [...]

I was mistaken: I had it in a debugger from the KS tests, but I think I forgot to comment out the other hyperparameter.

@benjamin-lieser
Collaborator Author

I would wait a bit if someone with experience with the algorithm (maybe @teryror ?) wants to investigate. Otherwise I will try myself.

@WarrenWeckesser
Collaborator

It turns out the problem is a bug in the original algorithm. R discovered this years ago: https://bugs.r-project.org/show_bug.cgi?id=7314

The fix is to change this line

f /= (n1 - i) as f64 * (k - i) as f64;

to

f /= (n1 - i + 1) as f64 * (k - i + 1) as f64; 
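The "+ 1" terms follow from the hypergeometric pmf recurrence p(i)/p(i-1) = (n1-i+1)(k-i+1) / (i * (n2-k+i)); a quick scipy check of that identity for the (40, 20, 19) case (a sketch, independent of the rand code):

```python
import numpy as np
from scipy.stats import hypergeom

n1, n2, k = 20, 20, 19            # the (40, 20, 19) case mentioned earlier
dist = hypergeom(n1 + n2, n1, k)

# pmf recurrence: p(i) / p(i-1) = (n1 - i + 1)(k - i + 1) / (i * (n2 - k + i));
# the "+ 1" factors are exactly what the fix restores.
for i in range(1, k + 1):
    lhs = dist.pmf(i) / dist.pmf(i - 1)
    rhs = (n1 - i + 1) * (k - i + 1) / (i * (n2 - k + i))
    assert np.isclose(lhs, rhs)
print("recurrence holds on the whole support")
```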

There is a separate (but apparently not so significant) bug: in ln_of_factorial(v), Stirling's approximation is used instead of actually computing ln(v!) = ln(gamma(v + 1)) accurately. That approximation is not very good for small to moderate values of v. For example, with Hypergeometric::new(40, 20, 19), the function ln_of_factorial(v) is called with values ranging from 7 to 13. For the input 7, it returns 6.621371043387192, but the correct value of ln(7!) is 8.525161361065415. I haven't tried to figure out how this affects the correctness of the algorithm.
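The accurate value is available as the log-gamma function; for comparison, the crudest Stirling form v*ln(v) - v agrees with the quoted ln_of_factorial(7) value to about seven significant digits (an assumption about what the implementation computes, inferred only from that value):

```python
from math import lgamma, log

# Accurate ln(v!) via log-gamma: ln(v!) = lgamma(v + 1).
def ln_factorial(v: float) -> float:
    return lgamma(v + 1.0)

# Crudest Stirling form; close to the quoted 6.621371... for v = 7
# (an assumption about the implementation, inferred from that value).
crude = 7 * log(7) - 7
print(crude, ln_factorial(7))   # ~6.62137 vs 8.525161361065415
```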

@benjamin-lieser
Collaborator Author

Really good catch :)

I actually tried to see if fixing the Stirling approximation helps, but it had no measurable effect on the KS statistic (though this was with the recurrence bug still present). It would be good to know the minimal values it can be called with.

@benjamin-lieser benjamin-lieser mentioned this issue Oct 12, 2024