Discrepancy between TFP and Stan's CholeskyLKJ log prob #694

Closed
adamhaber opened this issue Dec 17, 2019 · 53 comments
@adamhaber

adamhaber commented Dec 17, 2019

I'm trying to understand a discrepancy in the log_prob computation between Stan and TFP. I don't know if this is a bug, and if it is, where; I'm reporting it here in case anyone thinks it may be of interest.

I generated Cholesky factors of LKJ matrices using the following Stan code:

code <- "
data {
  int K;
}
parameters {
  cholesky_factor_corr[K] L;
}
model {
   target += lkj_corr_cholesky_lpdf(L|2.0);
}
"

I extracted samples and log_prob:

library(rstan)
data = list(K=5)
fit <- stan(model_code = code, data=data)
e <- as.data.frame(extract(fit))

Finally, I computed the log prob of these matrices using TFP:

import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

m = tfd.CholeskyLKJ(dimension=5, concentration=2.)
b = tfb.CorrelationCholesky()

# `data` is assumed to be the Stan draws as a NumPy array (columns of `e`,
# lp__ last); `k` is the number of draws to check.
tfp_lps = []
for datum in data[:k, :-1]:
    L = datum.reshape(5, 5).T
    tfp_lps.append(m.log_prob(L).numpy())

When I compare tfp_lps to Stan's lp__, I get a high correlation, but they're not the same. For comparison, running this check with something like a normal distribution passes np.allclose. I also suspected that this might be related to different handling of log_det_jacobian, but subtracting tfb.CorrelationCholesky().inverse_log_det_jacobian(L, 2) didn't help.

Is there something wrong in the way I'm running the comparison, or in the expectation that they would be the same in the first place?

Thanks in advance!

@brianwa84
Contributor

Anudhyan/Srinivas have a candidate for the discrepancy.

@adamhaber
Author

Thanks for addressing this @brianwa84 !

Just wanted to report that I'm still getting the discrepancy. For matrices of dimension 2 I'm getting a correlation of 1 between Stan's and TFP's log_probs, but not the same values (the mean and standard deviation differ); for matrices of dimension 3, 4, 5 I get very high correlation (>0.98) but not the same values. Perhaps this might be related to differences in the implementation of the bijector?

@brianwa84 brianwa84 reopened this Jan 10, 2020
@brianwa84
Contributor

You are testing against TFP nightly? Can you double check tfp.__version__?

@adamhaber
Author

AFAIK, yes - version is '0.9.0-dev20200110'.

Also, regarding my previous comment - for matrices of size 2, Stan and TFP actually fully agree (after subtracting CorrelationCholesky inv_log_det from TFP's prediction). For dim>2 I'm still getting the discrepancy.

@bloops
Contributor

bloops commented Jan 11, 2020

You shouldn't need to subtract CorrelationCholesky inv_log_det. It should match exactly, does it not?

@junpenglao
Contributor

If I remember correctly, Stan might drop some normalizing constants depending on the way you write the model, i.e., you might get a different result with

target += lkj_corr_cholesky_lpdf(L|2.0);

compare to

L ~ lkj_corr_cholesky(2.0);

@junpenglao
Contributor

Yep, see https://mc-stan.org/docs/2_21/functions-reference/cholesky-lkj-correlation-distribution.html

Increment target log probability density with lkj_corr_cholesky_lpdf( L | eta) dropping constant additive terms.

@adamhaber
Author

adamhaber commented Jan 11, 2020

I'm pretty sure the tilde notation drops constant additive terms, while the target += notation does not; see here. Unless TFP also drops these terms, it seemed to me that it would be better to use the target += notation for comparison. Would be happy to hear your thoughts on this.

For other distributions we've tried (normal, poisson, exponential, etc) we were able to reproduce Stan's lp__ exactly (np.allclose-exact). Regarding the bijectors - AFAIK, Stan's lp__ contains jacobian corrections, while simply calling CholeskyLKJ.log_prob does not... perhaps the bijector itself is different?

@junpenglao
Contributor

junpenglao commented Jan 11, 2020

I'm pretty sure the tilde notation drops constant additive terms, while the target += notation does not;

Hmmm but that would contradict the Stan doc, which I think it is a better source of truth https://mc-stan.org/docs/2_21/functions-reference/cholesky-lkj-correlation-distribution.html

@junpenglao
Contributor

Also - the discrepancy might come from the dtype - try rerunning it with dtype=tf.float64

@adamhaber
Author

I'm not sure I follow - the Stan doc says that if you use L ~ lkj_corr_cholesky(eta), you increment lp__ without adding constant additive terms. My current understanding is that the difference between L ~ lkj_corr_cholesky(eta) and target += lkj_corr_cholesky_lpdf(L|eta) is exactly those additive terms, which is why I chose the latter. What am I missing? :-)

@adamhaber
Author

Tried changing dtype to float64 and got the same result.

@bloops
Contributor

bloops commented Jan 11, 2020

(Sorry about the many questions, so far my attempts to run rstan have been futile, so I can't reproduce the stan sampling code above.)

Ah, does Stan automatically add the unconstraining jacobian terms? Then it is not expected to match up with the TFP CholeskyLKJ log prob. With the bijector correction it may or may not match up, because as you said the bijector implementations could be different!

However, I think I just realized that the tfb.CorrelationCholesky bijector is now incompatible with the CholeskyLKJ distribution. The CholeskyLKJ distribution is parameterized by the strictly lower triangular entries dL_ij (1 <= i < j <= n), or the area element of the (n-1)-ball if you look at each row independently. The CorrelationCholesky bijector, on the other hand, assumes a parameterization over the surface element of the n-sphere. I'll work on a fix...

@brianwa84
Contributor

Latest from @bloops : The distribution is fixed but the accompanying bijector needs to be updated.

@adamhaber
Author

Thanks!

@bloops
Contributor

bloops commented Feb 5, 2020

@adamhaber / @junpenglao Do you mind running another check against Stan's log prob? We can compare the difference of two log_probs to get rid of the difference due to the normalizer (if any).

@adamhaber
Author

With tfp 0.10.0-dev20200206, I still get the deviation (line is y=x):
[figure: scatter of Stan vs TFP log probs, still deviating from the y=x line]
Would be happy to give more details/run different tests if that helps.

@bloops
Contributor

bloops commented Feb 6, 2020

Hmm, if that's still the case I think it's possible it's just a different bijector (but both Stan's and TFP's might be correct.)

@SteveBronder

FTR, the code for lkj_corr_cholesky_lpdf in Stan math is available at the link below

https://mc-stan.org/math/d7/d74/lkj__corr__cholesky__lpdf_8hpp_source.html

I like the math docs because you can click on functions and go over to their definitions

@adamhaber
Author

adamhaber commented Feb 10, 2020

This is indeed due to differences in the bijectors; Stan's lkj_corr_cholesky_log and TFP's CholeskyLKJ.log_prob give exactly the same results.

@bloops
Contributor

bloops commented Feb 10, 2020

Awesome! Thanks Adam for verifying the fix -- and reporting the original bug.

I've added a more stringent numerical test that the CorrelationCholesky bijector + CholeskyLKJ distribution sample the correct marginals as reported in the literature. So hopefully the TFP bijector is also correct.

@adamhaber
Author

Will the difference in the implementations of the bijectors have any effect on sampling? I think it should affect the geometry of the posterior distribution, and therefore the behaviour of the sampler... right?

@bloops
Contributor

bloops commented Feb 11, 2020

Yes, that's correct, having a different bijector does affect the geometry and the behavior of the sampler. But it's possible to still have the samples converge to the correct distribution.

To elaborate, to sample from a distribution X, we can think of drawing samples from a distribution Y = f^{-1}(X) under the hood, where "f" is the bijector. Once we draw y1, y2, ... yn ~ Y, we can compute the corresponding samples from X as x1 = f(y1), x2 = f(y2).... xn = f(yn). It's possible to have multiple different 'f's while still correctly sampling from X -- which is the thing we care about.

In HMC, the bijector can affect the log prob calculation as well as gradients. Presumably there can be some bijectors better than others, leading to a different effective sample rate. In this case I don't think there is a reported "good" bijector so we can just choose any reasonable one. It might be a nice research question though.
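A minimal sketch of that idea in TFP terms (assuming the same 5x5 CholeskyLKJ target as above; the function name is illustrative): the sampler moves in the unconstrained space of y, and the pulled-back log density adds the bijector's forward log det Jacobian.

import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

x_dist = tfd.CholeskyLKJ(dimension=5, concentration=2.)
f = tfb.CorrelationCholesky()   # the bijector "f" above

def unconstrained_log_prob(y):
    # log p_Y(y) = log p_X(f(y)) + log |det J_f(y)|
    x = f.forward(y)
    return x_dist.log_prob(x) + f.forward_log_det_jacobian(y, event_ndims=1)

This is essentially what tfp.mcmc.TransformedTransitionKernel does when it's given a bijector.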

@bob-carpenter

bob-carpenter commented Feb 18, 2020

the Stan doc says that if you use L ~ lkj_corr_cholesky(eta), you increment lp__ without adding constant additive terms

@adamhaber is right. In the language, the ~ sampling statement drops constant terms. We're going to refine this soon so that you can decide to keep constants with ~ and drop them with target += --- we just haven't worked out the syntax yet.

Hmmm but that would contradict the Stan doc, which I think it is a better source of truth

The doc just writes out the whole density. The implementation has a template parameter propto that if set to true drops all the constant terms in the density.

In HMC, the bijector can affect the log prob calculation as well as gradients.

In Stan, all transforms affect both the log prob and the gradients, other than the affine (linear) transform, which doesn't affect gradients.

Maximum likelihood estimation turns off the Jacobian adjustment.

does Stan automatically add the unconstraining jacobian terms

The Jacobian gets applied when mapping from unconstrained parameters to constrained parameters.

Presumably there can be some bijectors better than others, leading to a different effective sample rate. In this case I don't think there is a reported "good" bijector so we can just choose any reasonable one. It might be a nice research question though.

This is the question we want to answer. There are also lots of options for simplex transforms and for positive constrained and interval constrained transforms.

@adamhaber
Author

In HMC, the bijector can affect the log prob calculation as well as gradients. Presumably there can be some bijectors better than others, leading to a different effective sample rate. In this case I don't think there is a reported "good" bijector so we can just choose any reasonable one. It might be a nice research question though.

I've played around with this by changing Stan's cholesky_corr_constrain.hpp to match TFP's implementation:

  • Benchmarked "by itself", the TFP version is faster (both function evaluation and gradient computation), since there's no need to compute (and take the gradient of) tanh(x).
  • However, when sampling from a lkj_corr_cholesky distribution, ESS/sec was significantly lower for low concentration values (1 and below), and slightly higher for high concentration values (>10). This varied a little between different matrix sizes (I've tried dim=2,4,10,30,60).

I'm not sure if this is enough to conclude whether either implementation should be preferred; it does seem like Stan's implementation is somewhat more robust, in the sense that ESS/sec is less sensitive to the concentration parameter. I can wrap this in a notebook and share it if anyone's interested in the precise numbers.

@adamhaber
Author

Hi,

Trying to finish and publish this comparison; so far I've altered Stan's implementation to match TFP's, but I also want to see what happens when I change TFP's _forward and _inverse_log_det_jacobian to match Stan's (_inverse shouldn't be called while sampling, right?). My question is - is there a way to get ESS/sec in TFP? I can simply divide effective_sample_size by the execution time of sample_chain, but I'm not sure if this will introduce overhead times (XLA compilation?) that will bias the estimate. Is this a reasonable way to do this? If not - is there any other way to extract ESS/sec like in Cmdstan?

@axch
Contributor

axch commented Mar 12, 2020

We don't have built-in ways to get good timings for TFP. Current best practice is to (i) wrap all your code in a toplevel @tf.function (if you think TFP is fast, you're presumably already doing this), then (ii) execute that function several times on the same parameters, timing each execution separately. The first run is expected to be much slower than the others because it triggers tracing, graph construction, and (if you used @tf.function(experimental_compile=True)) XLA compilation. Whether the duration you ultimately want is the cold or warm is another matter. Be sure enough of your parameters are Tensors (not numpy arrays) when you call the function---the tracing partially evaluates the body with respect to Python values, which can easily take forever if it accidentally unrolls a loop that shouldn't have been unrolled.

BTW, you might also wish to set autograph=False in your tf.function. By default, tf.function will walk your Python AST looking for Python if statements and while loops to turn into tf.cond and friends. Our code doesn't rely on this, so if yours doesn't either you can skip the AST walk. IIRC, Autograph blacklists TFP so it doesn't try to AST-walk our code even if you leave it on, so it shouldn't matter very much, but I thought I would tell you anyway.

@bob-carpenter

We usually think of the transform as a function f mapping from the constrained parameters to the unconstrained parameters. So the target log density is incremented by the log absolute determinant of the Jacobian of the inverse of f (that is, the function f^{-1} mapping from the unconstrained parameters to the constrained parameters is the one whose Jacobian is calculated). There are details in the Stan reference manual chapter on constrained parameter transforms.
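In TFP bijector terms, that convention can be sketched as follows (using the 5x5 example from the top of the thread; note this uses TFP's CorrelationCholesky, which, as discussed above, is not the same transform Stan uses - the sketch only shows where the Jacobian term enters):

import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

dist = tfd.CholeskyLKJ(dimension=5, concentration=2.)
b = tfb.CorrelationCholesky()        # Stan's f^{-1} corresponds to b.forward

L = dist.sample()                    # a constrained draw (5x5 Cholesky factor)
y = b.inverse(L)                     # the unconstrained parameters

# log density on the unconstrained scale = constrained log prob
# + log |det Jacobian| of the unconstrained -> constrained map
lp_unconstrained = dist.log_prob(L) + b.forward_log_det_jacobian(y, event_ndims=1)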

@adamhaber
Author

Thanks @axch !

My current attempt looks like this:

import tensorflow as tf
import tensorflow_probability as tfp
from time import time
tfd = tfp.distributions
tfb = tfp.bijectors

dtype = tf.float32

@tf.function(experimental_compile=True)
def sample(bij, log_prob_fn, nchain=4, num_main_iters=1000, num_warmup_iters=1000):
    initial_state = tf.random.uniform((N*(N-1)//2,), -2, 2, dtype, name="initializer")
    step_sizes = 1e-2 * tf.ones_like(initial_state)
    kernel = tfp.mcmc.TransformedTransitionKernel(
        tfp.mcmc.nuts.NoUTurnSampler(
            target_log_prob_fn=lambda x:log_prob_fn(x),
            step_size=step_sizes,
        ),
        bijector=bij,
    )

    kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        kernel,
        target_accept_prob=tf.cast(0.8, dtype=dtype),
        num_adaptation_steps=num_warmup_iters,
        step_size_setter_fn=lambda pkr, new_step_size : pkr._replace(
            inner_results=pkr.inner_results._replace(step_size=new_step_size)
        ),
        step_size_getter_fn=lambda pkr: pkr.inner_results.step_size,
        log_accept_prob_getter_fn=lambda pkr: pkr.inner_results.log_accept_ratio,
    )

    # Sampling from the chain.
    mcmc_trace, pkr = tfp.mcmc.sample_chain(
        num_results=num_main_iters,
        num_burnin_steps=num_warmup_iters,
        current_state=[bij.forward(initial_state)],
        kernel = kernel
    )

    return mcmc_trace, pkr

N = 5
nu = 1.5
n_rep = 100
m = tfd.CholeskyLKJ(N, nu)
bijector = tfb.CorrelationCholesky()

# warm-up call: triggers tracing and XLA compilation, excluded from `times`
trace, pkr = sample(bijector, m.log_prob)

times = []
for i in range(n_rep):
    start = time()
    trace, pkr = sample(bijector, m.log_prob)
    times.append(time()-start)

Is this a reasonable way to do this? Is there a way to further optimise sampling/profiling?

@axch
Contributor

axch commented Mar 13, 2020

Looks like you're excluding the first run time. If I were you, I would capture it and plot all of them, and only decide what to exclude after seeing what that plot looks like. (Also, I'd be curious to see the plot myself, if you wouldn't mind posting?)

Other than that, it's just nits:

  • Does setting autograph=False in the tf.function make a difference?
  • IIRC, you don't need to make the step size the same shape as the event space, because the sampler and the adaptation will broadcast correctly; a scalar will do. (Or shape [num_chains, 1] for separate per-chain adaptation.) But that shouldn't make much difference to the timings either.

Otherwise, looks good to me!

@adamhaber
Author

Thanks! Regarding the timings - sure, I hope to publish an initial report in the next couple of days.

@junpenglao
Contributor

@adamhaber depending on what you are trying to benchmark - an end-to-end pipeline benchmark might not be too informative (depending on your goal), given that there are many implementation differences between Stan and TFP.

If your concern is mostly the speed difference of the bijector, using HMC with the same step size for both Stan and TFP might be better (notice however that step size is defined differently in Stan and TFP: Stan uses a mass matrix and a scalar step size, whereas TFP essentially uses diag(mass_matrix) * scalar_step_size).

@adamhaber
Author

By itself, the TFP implementation is faster; Stan applies tanh element wise, and TFP does not. What I'm trying to understand is whether the more "costly" bijector turns out to be more efficient when taking into account the effect of the bijector on the geometry of the posterior (maybe tanh "smoothes" the posterior and makes life easier for HMC or whatnot). In order to test this, I'm planning to do the following:

  1. Measure sampling times of CholeskyLKJ with different dimension and concentration values; estimate ESS/sec.
  2. Change TFP's bijector implementation to match Stan's, including Jacobian corrections.
  3. Repeat 1 and compare.

In addition, I want to change Stan's implementation to match TFP's and repeat 1-3, hoping to get consistent results. :-) Beyond sampling CholeskyLKJ "alone", I also want to see what happens when it is part of a larger model with some observations.

I hope this makes sense; any suggestions on how to improve this would be much appreciated!

@junpenglao
Contributor

I see, I think that makes sense - the within-framework comparison will likely be quite informative and free of the other differences between frameworks :-)
Since you are tracing all the kernel results, I think it will also be useful to look at how many leapfrogs it takes - if the hypothesis of different geometry is true, then we might see a decreased step size (and, as a consequence, an increased number of leapfrog steps) for some combinations of parameters.
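A sketch of what that tracing could look like, assuming the kernel stack from the sample() function above (DualAveragingStepSizeAdaptation wrapping TransformedTransitionKernel wrapping NoUTurnSampler) - the attribute path below is specific to that stack:

def trace_fn(_, pkr):
    # unwrap the step-size adaptation, then the transformed kernel
    nuts_results = pkr.inner_results.inner_results
    return nuts_results.step_size, nuts_results.leapfrogs_taken

Passing trace_fn=trace_fn to tfp.mcmc.sample_chain would then return the per-iteration step sizes and leapfrog counts alongside the states.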

@bob-carpenter

What we really care about in most applied stats problems is time to an effective sample size of 100 or thereabouts. It's probably enough to evaluate effective sample size per second after warmup, as things that sample well tend to warm up well. Problems can be caused by divergences from the Hamiltonian in the more extreme case where the step size isn't low enough.

@adamhaber
Author

Cmdstan reports ESS/sec after warmup by default?

@adamhaber
Author

Stan's constraining transformation (forward in TFP terms, inverse in Stan terms, see Bob's response above) is as follows (x is the Cholesky factor, z is the unconstrained vector after element-wise tanh):

x(0, 0) = 1;
int k = 0;
for (int i = 1; i < K; ++i) {
  x(i, 0) = z(k++);
  T sum_sqs = square(x(i, 0));
  for (int j = 1; j < i; ++j) {
    lp += 0.5 * log1m(sum_sqs);
    x(i, j) = z(k++) * sqrt(1.0 - sum_sqs);
    sum_sqs += square(x(i, j));
  }
  x(i, i) = sqrt(1.0 - sum_sqs);
}

The lp += part accumulates the jacobian correction along the way. I've tried to think of a tf-like, vectorised way to do this but didn't succeed (the sqrt(1.0 - sum_sqs) factor gets ugly pretty fast); a simpler normalisation of the rows (y /= tf.norm(y, axis=-1)[..., tf.newaxis]) would give me a proper Cholesky factor, but the jacobian correction would be different (I think). I can imagine writing a similar loop within CorrelationCholesky._forward (a rough non-batched sketch follows the questions below); however,

  1. I'm not sure how to handle batch dimensions (are there additional shape issues?).
  2. Efficiency-wise, how bad would such a loop be? Would running the sampler with experimental_compile=True help?
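For reference, here's a rough non-batched sketch of the quoted transform in TF (Python loops over a single unconstrained vector z of length K*(K-1)//2 that has already been squashed by tanh); it only mirrors the C++ above and is not meant as a drop-in _forward:

import tensorflow as tf

def stan_style_forward(z, K):
    zero = tf.zeros([], z.dtype)
    one = tf.ones([], z.dtype)
    rows = [tf.stack([one] + [zero] * (K - 1))]   # x(0, 0) = 1
    lp = zero                                     # accumulated log-Jacobian term
    k = 0
    for i in range(1, K):
        row = [z[k]]                              # x(i, 0) = z(k++)
        k += 1
        sum_sqs = tf.square(row[0])
        for j in range(1, i):
            lp += 0.5 * tf.math.log1p(-sum_sqs)   # lp += 0.5 * log1m(sum_sqs)
            x_ij = z[k] * tf.sqrt(one - sum_sqs)
            k += 1
            row.append(x_ij)
            sum_sqs += tf.square(x_ij)
        row.append(tf.sqrt(one - sum_sqs))        # x(i, i)
        row.extend([zero] * (K - 1 - i))
        rows.append(tf.stack(row))
    return tf.stack(rows), lp

As noted, unrolling this inside a tf.function would be slow for larger K; it's only meant to pin down what a batched/vectorised version needs to compute.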

@bob-carpenter

Yes, cmdstan reports estimated ESS/second after warmup. That's because we think it's the single best measure of the speed of a model. Everything else depends on number of draws or number of warmup draws.

@junpenglao
Contributor

A Python for loop would definitely not be very performant - maybe @bloops could help in terms of implementation here.

@adamhaber
Author

Yes, cmdstan reports estimated ESS/second after warmup. That's because we think it's the single best measure of the speed of a model. Everything else depends on number of draws or number of warmup draws.

Any way to measure this in TFP?

@junpenglao
Contributor

I don't remember whether the ESS computation is up to date. For consistency, maybe use ArviZ to compute ESS for both Stan and TFP?
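For example, something along these lines (a sketch; `tfp_states` is a placeholder for the draws from the sampler above, and the Stan draws would go through the same estimator):

import numpy as np
import arviz as az

# assume `tfp_states` holds draws with shape [num_draws, num_chains, ...];
# ArviZ expects [chain, draw, ...], so swap the leading axes
tfp_ess = az.ess(az.convert_to_dataset(np.swapaxes(np.asarray(tfp_states), 0, 1)))

That way both ESS numbers come from the same implementation.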

@bloops
Contributor

bloops commented Mar 13, 2020

Yes, cmdstan reports estimated ESS/second after warmup. That's because we think it's the single best measure of the speed of a model. Everything else depends on number of draws or number of warmup draws.

Any way to measure this in TFP?

For the wall-clock (seconds) estimate, you can wrap the run_chain function in a tf.function(experimental_compile=True), as done here:
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Modeling_with_JointDistribution.ipynb
Then execute this multiple times with timeit. experimental_compile=True turns on XLA compilation, which sometimes provides speedups; it's also worth checking with experimental_compile=False. Note that the first time run_chain is called, the compilation time will also be included.

There is also an effective_sample_size in TFP here but as Junpeng says it might be better to use the same implementation of ESS for both TFP and Stan.
https://www.tensorflow.org/probability/api_docs/python/tfp/mcmc/effective_sample_size?hl=en

@csuter
Member

csuter commented Mar 13, 2020

ESS is just tfp.mcmc.diagnostic.effective_sample_size

The slightly tricky bit is leaving out warmup time. You can do this by making two separate calls to tfp.mcmc.sample_chain, starting the second one off with the final results from the first one. Something like

kernel = ...
initial_state = ...

states, _, final_kernel_results = tfp.mcmc.sample_chain(
   kernel=kernel,
   current_state=initial_state,
   num_results=1,
   num_burnin_steps=...,
   return_final_kernel_results=True)

start = time.time()
states, _ = tfp.mcmc.sample_chain(
   kernel=kernel,
   current_state=states[-1, ...],
   # continue from the warmed-up kernel state captured above
   previous_kernel_results=final_kernel_results,
   num_results=...,
   num_burnin_steps=0)
end = time.time()

ess = tfp.mcmc.diagnostic.effective_sample_size(
    states, cross_chain_dims=...)  # cross_chain_dims, following Vehtari et al 2019

print('ess/s:', ess / (end - start))

@csuter
Member

csuter commented Mar 13, 2020

(note above also needs tf.function decoration to be performant, and if you want to account for compilation time, follow @bloops)

@adamhaber
Author

Thanks for the helpful replies! I guess that because the comparison is within-framework, getting exactly the same ESS/sec measurement as CmdStan isn't critical; it just means that comparing the TFP and Stan results would be tricky.

If anyone has any tips on how to implement Stan's bijector in a vectorised way (that respects TF's batching semantics), I'd be very interested. Meanwhile I'll keep wasting paper on trying to compute jacobian corrections manually. :-)

@junpenglao
Contributor

Do you have a version of the implementation in TF without batching right now? Happy to take a stab if so :-)

@adamhaber
Author

I'm currently trying a "hybrid" approach - I want to see what happens if instead of implementing Stan's bijector exactly, I simply apply an element-wise tanh to the unconstrained vector before plugging it into the lower triangular matrix.

If I'm doing the math correctly, the jacobian correction should be fairly simple: the jacobian of the rows-normalisation is already there (-tf.reduce_sum(tf.range(2, n + 2, dtype=y.dtype) * tf.math.log(tf.linalg.diag_part(y)),axis=-1)), and each element in this sum should be multiplied by something like tf.reduce_sum(tf.log(1-np.tanh(xs)**2)) where xs are the k elements in the k-th row; the jacobian of the element-wise tanh is diagonal so its log determinant is the sum of the logs of the diagonal.

Sounds reasonable?

@junpenglao
Contributor

I simply apply an element-wise tanh to the unconstrained vector before plugging it into the lower triangular matrix.

In that case, you might be able to use tfb.Chain to chain the Tanh bijector with CorrelationCholesky (please verify it, as I am not at all familiar with the Stan implementation, especially how the jacobian is implemented)
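A quick sanity check along those lines (a sketch; the 10-vector corresponds to a 5x5 factor) is to compare the chained bijector's log det Jacobian against the manual tanh-then-CorrelationCholesky decomposition:

import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors

chained = tfb.Chain([tfb.CorrelationCholesky(), tfb.Tanh()])  # Tanh is applied first
y = tf.random.normal([10])            # unconstrained vector for a 5x5 factor

manual = (tfb.Tanh().forward_log_det_jacobian(y, event_ndims=1)
          + tfb.CorrelationCholesky().forward_log_det_jacobian(
                tf.tanh(y), event_ndims=1))
chained_ldj = chained.forward_log_det_jacobian(y, event_ndims=1)
# manual and chained_ldj should agree up to floating-point error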

@adamhaber
Author

Thanks - that's certainly simpler (I also verified it and it comes out the same; the math seems to work).

Here's a short description of what I did:
For each dimension in [3, 5, 10, 15, 50, 100], and concentration in [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100], I called the sampler (as described above) 50 times and measured runtimes. I repeated this twice - first with the usual tfb.CorrelationCholesky() bijector, and second with the tfb.Chain([tfb.CorrelationCholesky(), tfb.Tanh()]) chained bijector. I then compared the runtimes across different dimension and concentration values.

Here are the results:

[figure: ratio of mean runtimes (plain CorrelationCholesky vs Tanh-chained bijector) for each dimension/concentration configuration]

  • Each dot represents the ratio of the means over trials in each parameters configuration.
  • Black horizontal line is ratio = 1, meaning equal sampling times.
  • Black vertical line is nu = 1.
  • Standard deviations in runtimes weren't very interesting, especially if you exclude the first point (which is sampling + compilation time).

Ratios below the horizontal line mean the current implementation is faster (element wise tanh is hurting performance); ratios above the horizontal line mean the chained bijector is faster.

It seems like for concentrations larger than 1, TFP's current implementation is preferred, pretty much for all matrix sizes. This is also consistent with what I got when I changed Stan's bijector implementation - that is, performance improved for large concentrations but became worse for very small concentrations.

This is probably not enough to say which one "is better" (to the extent that this is a meaningful question), but I hope it provides an interesting first step. Questions/suggestions are welcome. Current plan - repeat this for a simple model with observations.

@junpenglao
Contributor

Thank you!! This is very interesting; I am glad that tfb.Chain([tfb.CorrelationCholesky(), tfb.Tanh()]) works out of the box.

Given this result (and the similar observation in Stan), together with the comment from @bob-carpenter that in general a concentration >1 is preferred, I think it is safe to use tfb.CorrelationCholesky() as the default bijector for LKJ-like correlations, with tfb.Chain([tfb.CorrelationCholesky(), tfb.Tanh()]) as an alternative when concentration < 1 is desired.
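In code, that suggestion might look something like the helper below (purely illustrative, not an existing TFP API):

import tensorflow_probability as tfp
tfb = tfp.bijectors

def lkj_cholesky_bijector(concentration):
    # default to CorrelationCholesky; chain a Tanh squashing for small
    # concentrations, following the benchmark above
    bij = tfb.CorrelationCholesky()
    if concentration < 1.:
        bij = tfb.Chain([bij, tfb.Tanh()])
    return bij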

@adamhaber
Author

Small update - with help from @junpenglao, I've implemented a simple model that uses CholeskyLKJ to correlate slopes and intercepts between linear regression models (McElreath's cafes model):

# n_cafés, café_id and afternoon below come from the (generated) data
model = tfd.JointDistributionSequentialAutoBatched(
    [
        # chol_rho, the prior for the Cholesky factor of the correlation
        # matrix between intercepts and slopes
        tfd.CholeskyLKJ(2,2),     
        # sigma, prior std for the waiting time 
        tfd.Exponential(rate = 1), 
        # sigma_café, prior of stds for intercepts and slopes (vector of 2)
        tfd.Exponential(rate = tf.ones(2, dtype=tf.float32)), 
        # beta, the prior mean for the intercept and slopes
        tfd.Normal(loc = [5, -1], scale = [2, 0.5]),   
        # per-café intercepts and slopes
        lambda betas,sigma_café,sigma,chol_rho : tfd.Sample( 
            tfd.MultivariateNormalTriL(
                loc = betas,
                scale_tril = tf.linalg.LinearOperatorDiag(sigma_café).matmul(chol_rho)
            ),
            sample_shape=n_cafés
        ),
        # per-café waiting times
        lambda mvn, betas, sigma_café, sigma : tfd.Independent(  
            tfd.Normal(
                loc = tf.gather(mvn[...,0],café_id,axis=-1) + tf.gather(mvn[...,1],café_id,axis=-1)*afternoon,
                scale = sigma[..., tf.newaxis]
            ),
            reinterpreted_batch_ndims=1
        )
    ]
)

Note that the correlation matrix is small (dim=2) and concentration is slightly above 1. The model is fitted to generated data; I've generated 100 different datasets of different sizes (to get a feeling whether there's a likelihood vs. prior effect going on), and measured the sampling time with and without the chained tfb.Tanh. For the vast majority of the generated datasets (>90%), it seems like the chained bijector is slightly slower. There are obvious caveats - it's not measuring exactly what I want it to measure (end-to-end sampling time and not ESS/sec after warmup), the data is coming from a concentration>1 LKJ so the previous result already suggests Tanh wouldn't help, etc; still, it's nice to see that this behaviour is consistent with a somewhat more realistic use-case.

@bob-carpenter

bob-carpenter commented Mar 16, 2020

This is probably not enough to say which one "is better" (to the extent that this is a meaningful question)

I really like the speedup plot for visualizing what's going on. If we could add the Stan version on the same scale, that'd be ideal. Either Stan or TFP's current version can be the baseline.

The simulated data needs to go beyond two dimensions. Ideally as many as 100 or even more if the compute budget's there.

[edit: I somehow conflated @adamhaber's posts---it looks like Step 1's already done with just the LKJ generating data for fitting correlations and Step 2 is in the works.]

Step 1 is probably to just fit multivariate normal data to a multivariate normal model where you control the level of correlation in the data. One way to do that is to generate a time-series correlation matrix:

Omega[i, j] = rho^abs(i - j)

for rho in (-1, 1); that'll control the correlation of adjacent predictors in alpha[1:N]. Then generate a scale matrix with something like lognormal(0, 0.5) draws. That'll let you generate data, and then you can fit it with the LKJ prior on the correlation (and another prior on the scales).
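A sketch of that data-generating step (NumPy, with illustrative sizes):

import numpy as np

rng = np.random.default_rng(0)
N, rho, num_obs = 10, 0.7, 500

idx = np.arange(N)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])    # Omega[i, j] = rho^|i - j|
scales = rng.lognormal(mean=0.0, sigma=0.5, size=N)   # lognormal(0, 0.5) scales
Sigma = np.diag(scales) @ Omega @ np.diag(scales)

y = rng.multivariate_normal(np.zeros(N), Sigma, size=num_obs)

The simulated y can then be fit with the LKJ prior on the correlation and another prior on the scales, as described above.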

Step 2 would be to generate correlated data, then use it as the error in a seemingly unrelated regression (SUR) model, which is just what the econ folks call a multivariate outcome regression model with correlated error.

Step 3 would be to use it as a hierarchical prior in a varying effects model where each item has more than two random effects (Gelman's red-state/blue-state model has two for the intercept and slope of vote share based on income by state, but this'd need to be extended). @mitzimorris is about to send us data for a big election study that has that structure, though I don't know if it's currently using an LKJ prior on the random effects. The thing to do for a paper would also be to simulate data.

This is really cool, by the way. I've been meaning to do this for years for a lot of these transforms. There's obviously also the simplex itself to consider. But even something as simple as positive constrained parameters can be put together on an arbitrary scale for most of their domain. Right now the inverse transform is exp(y) in Stan but it could be flattened out so that

finv(y) = y + 3   for y in (-3 + epsilon, infinity)
finv(y) = g(y)    for y in (-infinity, -3 + epsilon)

The extra offset is just so that our initialization of y ~ uniform(-2, 2) is centered on finv(y) = 1, which works well for scale parameters.
