Discrepancy between TFP and Stan's CholeskyLKJ log prob #694
Comments
Anudhyan/Srinivas have a candidate for the discrepancy.
Thanks for addressing this @brianwa84 ! Just wanted to report that I'm still getting the discrepancy. For matrices of dimension 2 I'm getting a correlation of 1 between Stan's and TFP's log_probs, but not the same values (difference in mean and standard deviation); for matrices of dimensions 3, 4, 5 I get very high correlation (>0.98) but not the same values. Perhaps this might be related to differences in the implementation of the bijector?
You are testing against TFP nightly? Can you double-check the version?
AFAIK, yes. Also, regarding my previous comment - for matrices of size 2, Stan and TFP actually fully agree (after subtracting the CorrelationCholesky inverse_log_det_jacobian from TFP's prediction). For dim > 2 I'm still getting the discrepancy.
You shouldn't need to subtract the CorrelationCholesky inverse_log_det_jacobian. It should match exactly - does it not?
If I remember correctly, Stan might drop some normalizing constants depending on the way you write the model, i.e., you might get a different result with the sampling-statement (`~`) form compared to the explicit `target +=` form.
Yep, see https://mc-stan.org/docs/2_21/functions-reference/cholesky-lkj-correlation-distribution.html
I'm pretty sure the tilde notation drops constant additive terms, while the `target +=` form keeps them. For other distributions we've tried (normal, Poisson, exponential, etc.) we were able to reproduce Stan's lp__ exactly.
Hmmm, but that would contradict the Stan doc, which I think is a better source of truth: https://mc-stan.org/docs/2_21/functions-reference/cholesky-lkj-correlation-distribution.html
Also - the discrepancy might come from the dtype; try rerunning it with float64.
I'm not sure I follow - the Stan doc says that if you use the sampling-statement (`~`) form, constant additive terms are dropped.
Tried changing dtype to float64 and got the same result.
(Sorry about the many questions; so far my attempts to run rstan have been futile, so I can't reproduce the Stan sampling code above.) Ah, does Stan automatically add the unconstraining Jacobian terms? Then it is not expected to match up with the TFP CholeskyLKJ log prob. With the bijector correction it may or may not match up, because as you said the bijector implementations could be different! However, I think I just realized that the tfb.CorrelationCholesky bijector is now incompatible with the CholeskyLKJ distribution. The CholeskyLKJ distribution is parameterized with respect to dL_ij (1 <= j < i <= n), the strictly lower triangular entries; or the area element of the (n-1)-ball if you look at each row independently. The CorrelationCholesky bijector, on the other hand, assumes a parameterization over the surface element of the n-sphere. I'll work on a fix...
Latest from @bloops: the distribution is fixed but the accompanying bijector needs to be updated.
Thanks!
@adamhaber / @junpenglao Do you mind running another check against Stan's log prob? We can compare the difference of two log_probs to get rid of any difference due to the normalizer.
Hmm, if that's still the case I think it's possible it's just a different bijector (but both Stan's and TFP's might be correct).
FTR, the code for lkj_corr_cholesky_lpdf in Stan math is available at the link below: https://mc-stan.org/math/d7/d74/lkj__corr__cholesky__lpdf_8hpp_source.html I like the math docs because you can click on functions and go over to their definitions.
This is indeed due to differences in the bijectors; Stan's cholesky_corr_constrain applies an element-wise tanh to the unconstrained vector, which TFP's CorrelationCholesky does not.
Awesome! Thanks Adam for verifying the fix -- and reporting the original bug. I've added a more stringent numerical test that the CorrelationCholesky bijector + CholeskyLKJ distribution sample the correct marginals, as reported in the literature. So hopefully the TFP bijector is also correct.
Will the difference in the implementations of the bijectors have any effect on sampling? I think it should affect the geometry of the posterior distribution, and therefore the behaviour of the sampler... right?
Yes, that's correct: having a different bijector does affect the geometry and the behavior of the sampler, but it's possible to still have the samples converge to the correct distribution. To elaborate, to sample from a distribution X, we can think of drawing samples from a distribution Y = f^{-1}(X) under the hood, where f is the bijector. Once we draw y1, y2, ..., yn ~ Y, we can compute the corresponding samples from X as x1 = f(y1), x2 = f(y2), ..., xn = f(yn). It's possible to have multiple different f's while still correctly sampling from X -- which is the thing we care about. In HMC, the bijector can affect the log prob calculation as well as its gradients. Presumably some bijectors can be better than others, leading to a different effective sample rate. In this case I don't think there is a reported "good" bijector, so we can just choose any reasonable one. It might be a nice research question though.
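To make the X / Y = f^{-1}(X) relationship concrete, here is a minimal sketch (my own illustration, not code from this thread) of how a bijector reparameterizes a TFP distribution; in practice tfp.mcmc.TransformedTransitionKernel does this bookkeeping for you:

```python
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

dist = tfd.CholeskyLKJ(dimension=3, concentration=1.5)  # the target X
f = tfb.CorrelationCholesky()  # f: unconstrained vector -> Cholesky factor

# Y = f^{-1}(X): the same distribution pushed onto the unconstrained space.
# Its log_prob(y) is log p_X(f(y)) plus the forward log-det-Jacobian of f at y.
unconstrained = tfd.TransformedDistribution(distribution=dist, bijector=tfb.Invert(f))

y = unconstrained.sample(5)  # draws of Y, shape [5, 3]
x = f.forward(y)             # corresponding draws of X, shape [5, 3, 3]
```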
@adamhaber is right. In the language, the sampling statement (`~`) notation drops constant terms, whereas `target +=` keeps them.
The doc just writes out the whole density. The implementation has a template parameter (`propto`) that controls whether constants not depending on parameters are dropped.
In Stan, all transforms affect both the log density and its gradients, other than the affine (linear) transform, which doesn't affect gradients. Maximum likelihood estimation turns off the Jacobian adjustment.
The Jacobian gets applied when mapping from unconstrained parameters to constrained parameters.
This is the question we want to answer. There are also lots of options for simplex transforms and for positive-constrained and interval-constrained transforms.
I've played around with this by changing Stan's cholesky_corr_constrain.hpp to match TFP's implementation:
I'm not sure if this is enough to conclude whether either implementation should be preferred; it does seem like Stan's implementation is somewhat more robust, in the sense that ESS/sec is less sensitive to the concentration parameter. I can wrap this in a notebook and share it if anyone's interested in the precise numbers.
Hi, trying to finish and publish this comparison; so far I've altered Stan's implementation to match TFP's, but I also want to see what happens when I change TFP's bijector to match Stan's. Is there a recommended way to get reliable timings for TFP sampling?
We don't have built-in ways to get good timings for TFP. Current best practice is to (i) wrap all your code in a toplevel @tf.function (if you think TFP is fast, you're presumably already doing this), then (ii) execute that function several times on the same parameters, timing each execution separately. The first run is expected to be much slower than the others because it triggers tracing, graph construction, and (if you used @tf.function(experimental_compile=True)) XLA compilation. Whether the duration you ultimately want is the cold one or the warm one is another matter. Be sure enough of your parameters are Tensors (not numpy arrays) when you call the function---the tracing partially evaluates the body with respect to Python values, which can easily take forever if it accidentally unrolls a loop that shouldn't have been unrolled. BTW, you might also wish to set autograph=False in your tf.function. By default, tf.function will walk your Python AST looking for Python if statements and while loops to turn into tf.cond and friends. Our code doesn't rely on this, so if yours doesn't either you can skip the AST walk. IIRC, Autograph blacklists TFP so it doesn't try to AST-walk our code even if you leave it on, so it shouldn't matter very much, but I thought I would tell you anyway.
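A compact sketch of that cold-vs-warm timing pattern (my own illustration; run_chain is a hypothetical stand-in for whatever sampling function you are benchmarking):

```python
import time
import tensorflow as tf

@tf.function(autograph=False, experimental_compile=True)
def run_chain(scale):
    # Placeholder body; in practice this would call tfp.mcmc.sample_chain.
    return tf.reduce_sum(scale * tf.random.normal([1000]))

scale = tf.constant(1.0)   # pass Tensors, not Python scalars, to avoid retracing

cold_start = time.time()
run_chain(scale)           # first call: tracing + graph build + XLA compilation
cold = time.time() - cold_start

warm = []
for _ in range(10):
    start = time.time()
    run_chain(scale)       # subsequent calls reuse the compiled graph
    warm.append(time.time() - start)
```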
We usually think of the transform as a function f mapping from the constrained parameters to the unconstrained parameters. So the target log density is incremented by the log absolute determinant of the Jacobian of the inverse of f (that is, the function f^{-1} mapping from the unconstrained parameters to the constrained parameters is the one whose Jacobian is calculated). There are details in the Stan reference manual chapter on constrained parameter transforms.
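In symbols (my own summary of the convention described above, with x constrained, y unconstrained, and q the density the sampler actually explores):

$$
y = f(x), \qquad
\log q(y) = \log p\bigl(f^{-1}(y)\bigr) + \log\bigl|\det J_{f^{-1}}(y)\bigr|.
$$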
Thanks @axch ! My current attempt looks like this:

```python
import tensorflow as tf
import tensorflow_probability as tfp
from time import time

tfd = tfp.distributions
tfb = tfp.bijectors
dtype = tf.float32

@tf.function(experimental_compile=True)
def sample(bij, log_prob_fn, nchain=4, num_main_iters=1000, num_warmup_iters=1000):
    initial_state = tf.random.uniform((N*(N-1)//2,), -2, 2, dtype, name="initializer")
    step_sizes = 1e-2 * tf.ones_like(initial_state)
    # NUTS in the unconstrained space, transformed back through the bijector.
    kernel = tfp.mcmc.TransformedTransitionKernel(
        tfp.mcmc.nuts.NoUTurnSampler(
            target_log_prob_fn=lambda x: log_prob_fn(x),
            step_size=step_sizes,
        ),
        bijector=bij,
    )
    # Adapt the step size during warmup.
    kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        kernel,
        target_accept_prob=tf.cast(0.8, dtype=dtype),
        num_adaptation_steps=num_warmup_iters,
        step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(
            inner_results=pkr.inner_results._replace(step_size=new_step_size)
        ),
        step_size_getter_fn=lambda pkr: pkr.inner_results.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.inner_results.log_accept_ratio,
    )
    # Sampling from the chain.
    mcmc_trace, pkr = tfp.mcmc.sample_chain(
        num_results=num_main_iters,
        num_burnin_steps=num_warmup_iters,
        current_state=[bij.forward(initial_state)],
        kernel=kernel,
    )
    return mcmc_trace, pkr

N = 5
nu = 1.5
n_rep = 100
m = tfd.CholeskyLKJ(N, nu)
bijector = tfb.CorrelationCholesky()

# First call triggers tracing/compilation; subsequent calls are timed.
start = time()
trace, pkr = sample(bijector, m.log_prob)
times = []
for i in range(n_rep):
    start = time()
    trace, pkr = sample(bijector, m.log_prob)
    times.append(time() - start)
```

Is this a reasonable way to do this? Is there a way to further optimise sampling/profiling?
Looks like you're excluding the first run time. If I were you, I would capture it and plot all of them, and only decide what to exclude after seeing what that plot looks like. (Also, I'd be curious to see the plot myself, if you wouldn't mind posting?) Other than that, it's just nits:
Otherwise, looks good to me!
Thanks! Regarding the timings - sure, I hope to publish an initial report in the next couple of days.
@adamhaber depending on what you are trying to benchmark, an end-to-end pipeline benchmark might not be too informative (depending on your goal), given there are many implementation differences between Stan and TFP. If your concern is mostly the speed differences of the bijector, using HMC with the same step size for both Stan and TFP might be better (notice, however, that step size is defined differently in Stan and TFP: Stan uses a mass matrix and a scalar step size, whereas TFP essentially folds the mass matrix into a per-dimension step size).
By itself, the TFP implementation is faster; Stan applies tanh element-wise, and TFP does not. What I'm trying to understand is whether the more "costly" bijector turns out to be more efficient once you take into account the effect of the bijector on the geometry of the posterior (maybe tanh "smooths" the posterior and makes life easier for HMC, or whatnot). In order to test this, I'm planning to do the following:
In addition, I want to change Stan's implementation to match TFP's and repeat 1-3, hoping to get consistent results. :-) Besides sampling CholeskyLKJ "alone", I also want to see what happens when it is part of a larger model with some observations. I hope this makes sense; any suggestions on how to improve this would be much appreciated!
I see, I think that makes sense - the within-framework comparison will likely be quite informative and free of the other differences between frameworks :-)
What we really care about in most applied stats problems is time to an effective sample size of 100 or thereabouts. It's probably enough to evaluate effective sample size per second after warmup, as things that sample well tend to warm up well. Problems can be caused by divergences from the Hamiltonian in the more extreme case where the step size isn't low enough.
CmdStan reports ESS/sec after warmup by default?
Stan's constraining transformation (forward in TFP terms, inverse in Stan terms, see Bob's response above) is as follows (x is the Cholesky factor, z is the unconstrained vector after element-wise tanh):

```cpp
x(0, 0) = 1;
int k = 0;
for (int i = 1; i < K; ++i) {
  x(i, 0) = z(k++);
  T sum_sqs = square(x(i, 0));
  for (int j = 1; j < i; ++j) {
    lp += 0.5 * log1m(sum_sqs);
    x(i, j) = z(k++) * sqrt(1.0 - sum_sqs);
    sum_sqs += square(x(i, j));
  }
  x(i, i) = sqrt(1.0 - sum_sqs);
}
```
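For reference, here is a rough non-batched Python translation of the loop above (my own sketch, not code from Stan or TFP); z holds the tanh-transformed entries in (-1, 1), and the function returns the Cholesky factor together with the accumulated log-Jacobian term:

```python
import numpy as np

def cholesky_corr_constrain(z, K):
    # z: length K*(K-1)//2 vector with entries in (-1, 1); returns (x, lp).
    x = np.zeros((K, K))
    x[0, 0] = 1.0
    lp = 0.0
    k = 0
    for i in range(1, K):
        x[i, 0] = z[k]
        k += 1
        sum_sqs = x[i, 0] ** 2
        for j in range(1, i):
            lp += 0.5 * np.log1p(-sum_sqs)          # Stan's log1m(sum_sqs)
            x[i, j] = z[k] * np.sqrt(1.0 - sum_sqs)
            k += 1
            sum_sqs += x[i, j] ** 2
        x[i, i] = np.sqrt(1.0 - sum_sqs)
    return x, lp
```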
Yes, CmdStan reports estimated ESS/second after warmup. That's because we think it's the single best measure of the speed of a model. Everything else depends on the number of draws or the number of warmup draws.
A Python for loop would definitely not be very performant - maybe @bloops could help in terms of implementation here.
Any way to measure this in TFP?
I don't remember whether the ESS computation is up to date. For consistency, maybe use ArviZ to compute ESS for both Stan and TFP?
For the time-taken (number of seconds) estimate, you can wrap the sampling call with Python-level timing, as in the snippet above. There is also an effective_sample_size function in TFP (tfp.mcmc.effective_sample_size), but as Junpeng says it might be better to use the same implementation of ESS for both TFP and Stan.
ESS is just tfp.mcmc.effective_sample_size. The slightly tricky bit is leaving out warmup time. You can do this by making two separate calls to sample_chain - one for the warmup phase and one for the post-warmup draws.
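A rough sketch of what those two calls might look like (my own illustration, not the snippet originally posted; the kernel and initial state are assumed to be constructed elsewhere):

```python
import tensorflow_probability as tfp

def run_warmup(kernel, init_state, num_warmup):
    # Warmup draws only; keep the final state and kernel results so sampling can resume from them.
    return tfp.mcmc.sample_chain(
        num_results=num_warmup,
        current_state=init_state,
        kernel=kernel,
        trace_fn=None,
        return_final_kernel_results=True)

def run_sampling(kernel, warm_state, warm_kernel_results, num_draws):
    # Post-warmup draws only; this is the call to time for ESS/sec.
    return tfp.mcmc.sample_chain(
        num_results=num_draws,
        current_state=warm_state,
        kernel=kernel,
        previous_kernel_results=warm_kernel_results)
```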
(Note the above also needs tf.function decoration to be performant, and if you want to account for compilation time, follow @bloops' suggestion.)
Thanks for the helpful replies! I guess that because the comparison is within-framework, getting exactly the same ESS/sec measurement as CmdStan isn't critical; it just means that comparing TFP's and Stan's results would be tricky. If anyone has any tips on how to implement Stan's bijector in a vectorised way (one that respects TF's batching semantics), I'd be very interested. Meanwhile I'll keep wasting paper on trying to compute Jacobian corrections manually. :-)
Do you have a version of the implementation in TF without batching right now? Happy to take a stab if so :-)
I'm currently trying a "hybrid" approach - I want to see what happens if, instead of implementing Stan's bijector exactly, I simply apply an element-wise tanh to the unconstrained vector before plugging it into the lower triangular matrix. If I'm doing the math correctly, the Jacobian correction should be fairly simple: the Jacobian of the row normalisation is already there (in the existing CorrelationCholesky bijector), so only the element-wise tanh term needs to be added. Sounds reasonable?
In that case, you might be able to use the Tanh bijector chained with CorrelationCholesky.
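Presumably something along these lines (my own sketch; tfb.Chain applies its bijectors right to left, so Tanh hits the unconstrained vector first):

```python
import tensorflow_probability as tfp
tfb = tfp.bijectors

# Forward map: unconstrained vector -> element-wise tanh into (-1, 1) -> Cholesky factor.
chained_bijector = tfb.Chain([tfb.CorrelationCholesky(), tfb.Tanh()])
```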
Thanks - that's certainly simpler (I also verified it comes out the same; the math seems to work). Here's a short description of what I did, and the results:
Ratios below the horizontal line mean the current implementation is faster (element-wise tanh is hurting performance); ratios above the horizontal line mean the chained bijector is faster. It seems like for concentrations larger than 1, TFP's current implementation is preferred, pretty much for all matrix sizes. This is also consistent with what I got when I changed Stan's bijector implementation - that is, performance improved for large concentrations but became worse for very small concentrations. This is probably not enough to say which one "is better" (to the extent that this is a meaningful question), but I hope it provides an interesting first step. Questions/suggestions are welcome. Current plan: repeat this for a simple model with observations.
Thank you!! This is very interesting. Given this result (and the similar observation in Stan), together with the comment from @bob-carpenter that in general a concentration > 1 is preferred, I think it is a safe choice to keep using the current bijector.
Small update - with help from @junpenglao, I've implemented a simple model that uses CholeskyLKJ to correlate slopes and intercepts between linear regression models (McElreath's cafés model):

```python
model = tfd.JointDistributionSequentialAutoBatched(
    [
        # rho, the prior for the correlation matrix between intercepts and slopes
        tfd.CholeskyLKJ(2, 2),
        # sigma, prior std for the waiting time
        tfd.Exponential(rate=1),
        # sigma_café, prior of stds for intercepts and slopes (vector of 2)
        tfd.Exponential(rate=tf.ones(2, dtype=tf.float32)),
        # beta, the prior mean for the intercept and slopes
        tfd.Normal(loc=[5, -1], scale=[2, 0.5]),
        # per-café intercepts and slopes
        lambda betas, sigma_café, sigma, chol_rho: tfd.Sample(
            tfd.MultivariateNormalTriL(
                loc=betas,
                scale_tril=tf.linalg.LinearOperatorDiag(sigma_café).matmul(chol_rho)
            ),
            sample_shape=n_cafés
        ),
        # per-café waiting times
        lambda mvn, betas, sigma_café, sigma: tfd.Independent(
            tfd.Normal(
                loc=tf.gather(mvn[..., 0], café_id, axis=-1)
                    + tf.gather(mvn[..., 1], café_id, axis=-1) * afternoon,
                scale=sigma[..., tf.newaxis]
            ),
            reinterpreted_batch_ndims=1
        )
    ]
)
```

Note that the correlation matrix is small (dim=2) and the concentration is slightly above 1. The model is fitted to generated data; I've generated 100 different datasets of different sizes (to get a feeling for whether there's a likelihood vs. prior effect going on), and measured the sampling time with and without the chained Tanh bijector.
I really like the speedup plot for visualizing what's going on. If we could add the Stan version on the same scale, that'd be ideal. Either Stan or TFP's current version can be the baseline. The simulated data needs to go beyond two dimensions - ideally as many as 100, or even more if the compute budget's there. [edit: I somehow conflated @adamhaber's posts---it looks like Step 1's already done with just the LKJ generating data for fitting correlations and Step 2 is in the works.] Step 1 is probably to just fit multivariate normal data to a multivariate normal model where you control the level of correlation in the data. One way to do that is to generate a time-series correlation matrix:
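(The construction itself wasn't preserved above; I assume the standard AR(1)-style correlation matrix is meant, something like:)

$$
\Sigma_{ij} = \rho^{|i-j|}, \qquad \rho \in (-1, 1).
$$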
Step 2 would be to generate correlated data, then use it as the error in a seemingly unrelated regression (SUR) model, which is just what the econ folks call a multivariate outcome regression model with correlated error. Step 3 would be to use it as a hierarchical prior in a varying effects model where each item has more than two random effects (Gelman's red-state/blue-state model has two, for the intercept and slope of vote share based on income by state, but this'd need to be extended). @mitzimorris is about to send us data for a big election study that has that structure, though I don't know if it's currently using an LKJ prior on the random effects. The thing to do for a paper would also be to simulate data. This is really cool, by the way. I've been meaning to do this for years for a lot of these transforms. There's obviously also the simplex itself to consider. But even something as simple as positive-constrained parameters can be put together on an arbitrary scale for most of their domain. Right now the inverse transform is exp(y) in Stan, but it could be flattened out so that
The extra offset is just so that our initialization of
I'm trying to understand a discrepancy in the log_prob computation between Stan and TFP. I don't know if this is a bug, and if it is, where; I'm reporting it here in case anyone thinks this may be of interest.
I generated Cholesky factors of LKJ matrices using the following Stan code:
I extracted samples and log_prob:
Finally, I computed the log prob of these matrices using TFP:
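(The snippet itself wasn't preserved here; a minimal version presumably looks something like the following, where K, eta, and the file path are hypothetical stand-ins for the values used above:)

```python
import numpy as np
import tensorflow_probability as tfp

tfd = tfp.distributions

K, eta = 5, 1.5  # hypothetical dimension and concentration matching the Stan model
# Cholesky factors extracted from the Stan fit, shape [num_draws, K, K] (placeholder path).
L = np.load("stan_cholesky_samples.npy").astype(np.float32)

tfp_lps = tfd.CholeskyLKJ(dimension=K, concentration=eta).log_prob(L).numpy()
```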
When I compare tfp_lps to Stan's lp__, I get a high correlation, but they're not the same. For comparison, running this check with something like a normal distribution passes np.allclose. I've also suspected that this might be related to different handling of the log_det_jacobian, but subtracting tfb.CorrelationCholesky().inverse_log_det_jacobian(L, 2) didn't help. Is there something wrong in the way I'm running the comparison, or in the expectation that they would be the same in the first place?
Thanks in advance!