std::rand::distributions::Exp1 is biased #10084

Closed
huonw opened this issue Oct 26, 2013 · 1 comment · Fixed by #10196

Comments

huonw commented Oct 26, 2013

The following program generates 10,000 samples of the sample mean of 100,000 Exponential(1) random variates.

use std::rand::{StdRng, Rng};
use std::rand::distributions::Exp1;

static SAMPLES_PER_MEAN: uint = 100_000;
static NUM_MEANS: uint = 10_000;

fn main() {
    let mut rng = StdRng::new();
    let n = SAMPLES_PER_MEAN as f64;

    for _ in range(0, NUM_MEANS) {
        let mut s = 0.;
        for _ in range(0, SAMPLES_PER_MEAN) {
            s += *rng.gen::<Exp1>();
        }
        println!("{:.20}", s / n)
    }
}
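(For readers trying to reproduce this today: `std::rand` no longer exists in the standard library. The sketch below reruns the same experiment in modern, dependency-free Rust, substituting a small xorshift64 PRNG and inverse-CDF sampling, `X = -ln(U)`, for the old `Exp1`; the PRNG choice and seed are arbitrary, not what `StdRng` used.)

```rust
// Self-contained sketch: xorshift64 PRNG (Marsaglia's 13/7/17 variant) plus
// inverse-CDF Exp(1) sampling, standing in for the old std::rand Exp1.
struct XorShift64 {
    state: u64, // must be nonzero
}

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    // Uniform in (0, 1]: +1 on the 53-bit mantissa range avoids ln(0).
    fn next_f64(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    // Exp(1) via the inverse CDF: X = -ln(U).
    fn next_exp1(&mut self) -> f64 {
        -self.next_f64().ln()
    }
}

fn main() {
    let mut rng = XorShift64 { state: 0x9E37_79B9_7F4A_7C15 };
    let n = 100_000;
    let mean: f64 = (0..n).map(|_| rng.next_exp1()).sum::<f64>() / n as f64;
    // An unbiased Exp(1) sampler should give a mean near 1; the standard
    // error of the mean over 100_000 samples is about 1/sqrt(n) ≈ 0.003.
    println!("sample mean: {:.4}", mean);
    assert!((mean - 1.0).abs() < 0.05);
}
```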

The following plot shows the kernel density estimate of the above (solid line) against the expected density based on the central limit theorem (dashed line). They should match fairly closely with samples of this size. (The vertical line marks the expected mean of the samples, i.e. the peak should lie there.)

[Plot: exp-density — sample kernel density estimate vs. expected normal density]

Note that the scale is slightly deceptive and the bias isn't huge: the sample mean is converging to approximately 0.9964 rather than 1. However, it is highly unlikely that this is due to chance with such large samples (1 billion in total), and my experimentation has shown it to be stable: running it several times with many different values of NUM_MEANS and SAMPLES_PER_MEAN gives plots similar to the above, with overall means around 0.9964.
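(A quick back-of-the-envelope check of the "highly unlikely by chance" claim: Exp(1) has standard deviation 1, so over roughly 10^9 samples the standard error of the overall mean is 1/sqrt(10^9), and an observed mean of 0.9964 sits over a hundred standard errors from 1. The figures below are from this issue; the arithmetic is just the standard z-score.)

```rust
fn main() {
    // Observed overall mean vs. the true Exp(1) mean of 1, over ~1e9 samples.
    let observed = 0.9964_f64;
    let n = 1.0e9_f64;
    // Exp(1) has standard deviation 1, so the sample mean's standard error
    // is 1/sqrt(n) ≈ 3.16e-5.
    let se = 1.0 / n.sqrt();
    let z = (observed - 1.0).abs() / se;
    println!("z ≈ {:.0} standard errors from 1", z);
    assert!(z > 100.0);
}
```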

(Plot created with R via:

dat <- read.table('out.csv')$V1
x <- seq(0.98, 1.02, by=0.0001)
png('exp-density.png', width=600, height=600)
plot(density(dat), main='Density of 10000 means of 100000 samples of Exp(1)', lty=1)
lines(x, dnorm(x, 1, 1/sqrt(100000)), lty=2)
lines(c(1.0, 1.0), c(-1000, 1000), lty=3)
legend('topright', c('Sample density', 'True density'), lty=c(1,2))
dev.off()

FWIW, generating a similar sample in R matches the expected density almost exactly.)


Having done a very small amount of investigation, I think this might be an error in the Ziggurat tables for exp, or an error in the implementation of Ziggurat for non-symmetric random variables. (Reasons: the f64 generator seems to match a uniform(0,1) closely, and the normal generator (which also uses Ziggurat) matches its expected distribution closely too. That said, I didn't check either of these super closely.)

ghost assigned huonw Oct 26, 2013
huonw commented Oct 26, 2013

I'm working on this (as actively as I can).

huonw added a commit to huonw/rust that referenced this issue Oct 27, 2013
The Ziggurat implementation of Exp1 is incorrect: it appears to have an
expected value of around 0.994 rather than 1 and the density of a large
number of sample means does not match the normal distribution expected
by the central limit theorem (specifically, it's a normal distribution
centred at 0.994).

This replaces Ziggurat with a "plain" inverse-cdf implementation for
now. This implementation has the expected properties, although it is
approximately 4 times slower.

cc rust-lang#10084 (this does *not* fix it).
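(The "plain" inverse-CDF implementation that commit describes can be sketched as follows; this is an illustration of the technique, not the actual patch. For Exp(1), F(x) = 1 - e^(-x), so F⁻¹(p) = -ln(1 - p), and passing a Uniform(0,1) variate through F⁻¹ yields an Exp(1) variate.)

```rust
fn main() {
    // Inverse CDF of Exp(1): F^{-1}(p) = -ln(1 - p).
    let inv_cdf = |p: f64| -(1.0 - p).ln();

    // Spot-check the inverse against the CDF: F(F^{-1}(p)) == p.
    for &p in &[0.1, 0.5, 0.9] {
        let x = inv_cdf(p);
        let f = 1.0 - (-x).exp();
        assert!((f - p).abs() < 1e-12);
    }

    // The median of Exp(1) is ln 2 ≈ 0.693147.
    println!("median of Exp(1): {:.6}", inv_cdf(0.5));
}
```

This trades the Ziggurat table lookups for one `ln` call per sample, which is consistent with the roughly 4x slowdown the commit message reports.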
bors added a commit that referenced this issue Nov 1, 2013
The code was using (in the notation of Doornik 2005) `f(x_{i+1}) -
f(x_{i+2})` rather than `f(x_i) - f(x_{i+1})`. This corrects that, and
removes the F_DIFF tables which caused this problem in the first place.

The `F_DIFF` tables are a micro-optimisation (in theory, they could
easily be a micro-pessimisation): that `if` gets hit about 1% of the
time for Exp/Normal, and the rest of the condition involves RNG calls
and a floating point `exp`, so it is unlikely that saving a single FP
subtraction will be very useful (especially as more tables means more
memory reads and higher cache pressure, as well as taking up space in
the binary (although only ~2k in this case)).

Closes #10084. Notably, unlike what that issue suggests, this wasn't a
problem with the Exp tables. It affected Normal too, but since it is
symmetric, there was no bias in the mean (as the bias was equal on the
positive and negative sides and so cancelled out) but it was visible as
a variance slightly lower than it should be.
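(The mean-cancels-but-variance-doesn't point can be illustrated with a toy distortion, unrelated to the actual ziggurat code: apply a symmetric per-sample shrink to a symmetric set of values, and the mean stays put while the variance moves.)

```rust
fn main() {
    // A symmetric grid of values standing in for draws from a symmetric
    // distribution centred at 0.
    let xs: Vec<f64> = (-1000..=1000).map(|i| i as f64 / 100.0).collect();
    // Shrink each sample toward zero by 1% -- a symmetric distortion,
    // loosely analogous to a bias that is equal on both tails.
    let distorted: Vec<f64> = xs.iter().map(|&x| 0.99 * x).collect();

    let mean = |v: &[f64]| v.iter().sum::<f64>() / v.len() as f64;
    let var = |v: &[f64]| {
        let m = mean(v);
        v.iter().map(|x| (x - m).powi(2)).sum::<f64>() / v.len() as f64
    };

    // The mean is unchanged (both ~0)...
    println!("mean: {:.6} -> {:.6}", mean(&xs), mean(&distorted));
    // ...but the variance shrinks by a factor of 0.99^2.
    println!("var:  {:.4} -> {:.4}", var(&xs), var(&distorted));
    assert!(mean(&distorted).abs() < 1e-9);
    assert!(var(&distorted) < var(&xs));
}
```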

New plot:

![exp-density](https://f.cloud.github.com/assets/1203825/1445796/42218dfe-422a-11e3-9f98-2cd146b82b46.png)

I've started writing some tests in [huonw/random-tests](https://github.com/huonw/random-tests) (not in the main repo because they can and do fail occasionally, due to randomness, but the repo is on Travis and Rust-CI, so it will hopefully track the language). Unsurprisingly, they're [currently failing](https://travis-ci.org/huonw/random-tests/builds/13313987): both exp and norm fail, the former on both mean and variance, the latter on variance only. With this change they pass reliably at the 0.01 level.

(Currently the only test is essentially a quantitative version of the plots I've been showing, which is run on the `f64` `Rand` instance (uniform 0 to 1), and the Normal and Exp distributions.)
@huonw huonw closed this as completed in 1e2283d Nov 1, 2013
@huonw huonw removed their assignment Jun 16, 2014