Runtime of the cost computation #37
Comments
By the way, see Sections 3.1 and 3.2 in @pavlin-policar's write-up here: https://github.com/pavlin-policar/fastTSNE/blob/master/notes/notes.pdf. If I understood Section 3.1 correctly, it should make the cost computation about 2 times faster (even though I would argue that 12% of the runtime spent on the cost computation is still quite a lot). Section 3.2 is about something else, but it would be a cool thing to implement as well.
Also, the cost is computed once before the iterations start.
Wow, this is a great catch. I had forgotten to parallelize the cost computation. It is now parallelized in my fork; I will PR it hopefully tomorrow (I just want to test on Linux and fix the Windows problems from the other issue). I agree it would be nice to accelerate it further using a faster computation as in Pavlin's write-up, but for now this should make it negligible relative to the rest of the code. If you don't see that, then we can work on accelerating it or making it optional as you suggested. Thanks for catching this!
Cool, looking forward to all the changes!
The trick in Section 3.1 is pretty useful, since it reduces the number of passes needed over the data from 2 to 1 and can be done entirely during the gradient computation. Actually evaluating the error also turns out to take a long time - I guess computing the log in the KL divergence so many times can be quite slow. I like the correction in Section 3.2, because with exaggeration the p_{ij}s are not actual probabilities and the KL divergence is just plain wrong. However, if we apply the correction, the true KL divergence can go up during the early exaggeration phase, which could be unsettling if you don't know what's going on (I imagine most people expect optimization to reduce the error, not increase it!).
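A minimal sketch of that one-pass idea as described above (not the actual fastTSNE or FIt-SNE code; the function name and signature are made up for illustration, and it assumes a sparse symmetric P in COO form and that the normalization Z is already available from the repulsive/interpolation step):

```python
import numpy as np

def kl_divergence_one_pass(P_rows, P_cols, P_vals, Y, Z):
    """Accumulate KL(p || q) while looping over the sparse P, i.e. during the
    same pass that computes the attractive forces.

    Uses KL(p || q) = sum_ij p_ij * log(p_ij / w_ij) + log(Z), which holds
    because q_ij = w_ij / Z and sum_ij p_ij = 1, with
    w_ij = 1 / (1 + ||y_i - y_j||^2).
    """
    kl = 0.0
    for p, i, j in zip(P_vals, P_rows, P_cols):
        w = 1.0 / (1.0 + np.sum((Y[i] - Y[j]) ** 2))  # same w_ij the attractive force needs
        kl += p * np.log(p / w)
    return kl + np.log(Z)
```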
@pavlin-policar Let me see if I understand eq (29) in your PDF correctly: it seems it can be further simplified as

KL(p || q) = KL(\alpha p || q) / \alpha - log(\alpha),

right? So the adjustment boils down to: take the "raw" KL divergence KL(\alpha p || q) (computed with the exaggerated p's), divide by alpha, and subtract log(alpha).
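A quick numerical sanity check of that identity, with toy normalized vectors (random values, not actual t-SNE output):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 12.0

p = rng.random(1000); p /= p.sum()   # toy "input" probabilities
q = rng.random(1000); q /= q.sum()   # toy "output" probabilities

kl_true = np.sum(p * np.log(p / q))                    # KL(p || q)
kl_raw = np.sum(alpha * p * np.log(alpha * p / q))     # KL computed with exaggerated p's
print(np.allclose(kl_true, kl_raw / alpha - np.log(alpha)))  # True
```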
@dkobak Yes, you're exactly right! And now I feel very silly for not seeing this before. It is also slightly more efficient, since we get rid of the sum (although this is evaluated so rarely that it won't really matter), but the expression is much nicer. I've also tested this numerically in my code and, indeed, it is the same - just as the maths says.
@pavlin-policar I agree it's a nice expression! Feel free to add it to your Section 3.2. I've just implemented it and added it to the pending #39.
Not sure what you mean here exactly: if KL_raw decreases, then KL_raw/alpha - log(alpha) will also decrease. Do you mean that the adjusted KL can jump up after the exaggeration is switched off?
Yes, sorry for the confusion. This happens when we use late exaggeration, not in general. When switching from the normal regime to the late exaggeration regime, the corrected KL divergence does indeed go up. I've copied below a run on the MNIST data set - ignore the timings, I've used my implementation, but the effect should be the same. The horizontal lines indicate the switches of regime: from early exaggeration to no exaggeration to late exaggeration. During the early exaggeration and no-exaggeration phases the KL divergence drops down to 3.3663, and it then rises to 3.4769 during late exaggeration.
This is what I had in mind, and it doesn't apply at all if we don't use late exaggeration, which is, I believe, the default.
@pavlin-policar Hmm. Looking at your output, I don't quite understand why the KL divergence keeps growing during the last 250 iterations. The "raw" KL divergence should be decreasing, because that's what gradient descent is doing, and the adjusted KL divergence is a linear function of it, as we discussed above. What am I missing?
@pavlin-policar Actually, by now I have got myself completely confused. We seem to have concluded that the loss (KL divergence) with exaggeration is given by a linear function of the loss without exaggeration. If so, then how would any exaggeration have any effect? This does not make any sense; I should probably go to bed and get some sleep.
@dkobak You're right, this makes no sense - I have no idea what's going on. I'll have to look into this a bit more - it seems like it could be a bug in my code, but that seems unlikely since I use the same code for all the optimization regimes. I've tried to run a simple example using this implementation via the R wrapper, but I don't think the costs are being transferred into R properly and I just get an array of zeros (or I may be using it wrong). I'll look at this again tomorrow; hopefully a bit of sleep does us both some good.
Of course, I had forgotten to compile the pulled C++ code... But at least now I know I'm not completely crazy. Running the R wrapper on iris with the following commands:

```r
source('fast_tsne.R')
tmp = as.matrix(iris[, 1:4])
fftRtsne(tmp, start_late_exag_iter=750, perplexity=12, late_exag_coeff=5, get_costs=T)
```

I get the following output (using the latest master):
So in this case it's not going up consistently - probably because the data set is different - but it does go up. I still have no idea what's going on, but it seems to be the "correct" behaviour. Maybe @linqiaozhi can shed some light on this?
@dkobak @pavlin-policar I like the adjusted KL divergence, if only because people don't get freaked out by the fact that it changes so wildly when exaggeration turns on/off. @pavlin-policar I think the reason that the cost is bouncing around in the last 250 iterations is that the gradient descent is overshooting. This dataset has 150 points, so after a few iterations it has essentially converged. You are then at the minimum, bouncing around with every iteration, so sometimes the cost goes up and sometimes it goes down. This behavior is much more apparent with exaggeration, because exaggeration effectively increases the step size. That's not strictly true, because the exaggeration changes not only the magnitude of each step but also its direction (whereas the step size only increases the magnitude). Still, when you are already at the minimum, increasing the magnitude of each step by late exaggeration means you will overshoot more easily. Note that you see the same behavior without late exaggeration (from iteration 550 to 600 it increases), but it is more obvious in the late exaggeration phase for the above reason. That is my interpretation, at least.
@linqiaozhi @pavlin-policar George, I don't think you have fully appreciated Pavlin's and my confusion! Not sure if it's because you did not notice the apparent "paradox" or because its resolution was obvious to you :) Here is the paradox:
1. Exaggeration with coefficient \alpha means that the code uses \alpha*p_ij instead of p_ij.
2. The "raw" KL divergence reported during exaggeration is therefore computed with \alpha*p_ij.
3. So the loss being optimized during exaggeration is KL(\alpha p || q).
4. As derived above, KL(\alpha p || q) = \alpha*KL(p || q) + \alpha*log(\alpha).
5. That is a monotonically increasing linear function of KL(p || q).
6. Hence gradient descent on KL(\alpha p || q) is equivalent to gradient descent on KL(p || q), and exaggeration should have no effect at all - yet it clearly does.
To be honest, yesterday I was stupefied by this. 1-2 are true by definition, 4-6 are really straightforward. Where is the mistake?! Well, I think I got it now. The mistake is in 3. The t-SNE loss with exaggeration is not KL(\alpha p || q). When one uses the standard loss KL(p || q) to arrive at the expression for the gradient as a sum of attractive and repulsive forces, along the way one uses the fact that \sum p_ij = 1; this is used for the repulsive term. When doing exaggeration, the coefficient \alpha is inserted into the attractive term only, meaning that the resulting gradient is not the gradient of KL(\alpha p || q), because \sum \alpha p_ij is not 1. Does it make sense? It's a bit cumbersome to tell without writing all the math out. This resolves the paradox. As far as the computation of KL(p || q) is concerned, I think Pavlin's adjustment still works fine, because the KL computation is not influenced by the gradient and remains correct. Now it also isn't surprising that KL(p || q) can go up during late exaggeration. What I am wondering now is what actually is the form of the loss that is (implicitly) optimized when using exaggeration!
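For reference, writing the gradient out in the standard t-SNE notation, with w_ij = (1 + ||y_i - y_j||^2)^{-1}, Z = \sum_{k \neq l} w_kl and q_ij = w_ij / Z (this is the gradient from the original t-SNE derivation, with \alpha inserted into the attractive term the way exaggeration is implemented):

$$
\frac{\partial C}{\partial y_i}
= 4\Big[\alpha \sum_j p_{ij} w_{ij} (y_i - y_j) \;-\; \sum_j \frac{w_{ij}^2}{Z} (y_i - y_j)\Big],
$$

which for \alpha \neq 1 is not the gradient of KL(\alpha p || q), since deriving that gradient would require \sum_{ij} \alpha p_{ij} = 1.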
@dkobak Oh, I see what you are saying! Yes, I totally missed the point with my first response, and you are absolutely right. I think part of the confusion comes from the fact that in the code we implement early/late exaggeration by multiplying the P_ij by a constant alpha to give \tilde{P}_ij. This suggests that we are minimizing the KL divergence between \tilde{P}_ij and Q_ij. However, this is not the case. As you know, when you work out the gradient of the KL divergence between P_ij and Q_ij, you end up with two terms, corresponding to the attractive force and the repulsive force. We multiply this attractive force by alpha to make it stronger. This changes everything: the result is that we are no longer optimizing the KL divergence between P_ij and Q_ij, but something totally different. To figure out what we are actually optimizing (as you asked at the end of your message), one would essentially have to integrate this new "gradient" and see what it looks like. That might actually be quite interesting; I'm not sure how hard it would be. But in any case, it will not look like KL(\tilde{P}_ij, Q_ij). As you pointed out, if you go to Appendix 1 of the original t-SNE paper, where van der Maaten and Hinton derive the gradient of KL(P_ij, Q_ij), you'll see that they use the fact that P_ij sums to 1. If you plug in \tilde{P}_ij instead of P_ij and follow that derivation, I expect that the resulting gradient would not be alpha*(attractive term) + (repulsive term). In other words, I do not believe the "adjusted KL divergence" that we are now reporting is actually what is being minimized - but probably something close to it.
@linqiaozhi Yes. I think this might be relevant to the stuff we've been discussing via email about exaggeration vs. fatter tails.
One actually does not need to integrate. I looked at the derivation of the gradient, and the whole thing can be followed backwards very straightforwardly. The standard loss is KL(p || q) = const - \sum p_ij log q_ij = const - \sum p_ij log w_ij + log Z, where q = w/Z (and using \sum p_ij = 1). Introducing \alpha into the gradient is equivalent to introducing \alpha into the first term of this loss. So this gives L = -\alpha \sum p_ij log w_ij + log Z = -\sum (\alpha p_ij) log w_ij + \sum (\alpha p_ij) (log Z)/\alpha = -\sum (\alpha p_ij) log (w_ij / Z^{1/\alpha}), meaning that the new loss is, up to an additive constant, KL(\alpha p || w/Z^{1/\alpha}). I like it! The input probabilities stay the same, but the output probabilities (not really probabilities anymore, as they do not sum to 1) are changed by effectively decreasing Z. In the context of our email discussion, the question now becomes: is there a different output kernel for...
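A quick numerical check of this on toy data: the sketch below builds a small random symmetric P, defines L = -\alpha \sum p_ij log w_ij + log Z, and verifies by finite differences that its gradient is exactly the usual t-SNE gradient with \alpha multiplying only the attractive term (the sizes and names are made up for illustration, not taken from any of the implementations discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, eps = 6, 12.0, 1e-6

# toy symmetric affinities summing to 1, and a random 2D embedding
P = rng.random((n, n)); np.fill_diagonal(P, 0); P = P + P.T; P /= P.sum()
Y = rng.normal(size=(n, 2))

def pairwise_w(Y):
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0)
    return W

def loss(Y):
    # L = -alpha * sum_ij p_ij log w_ij + log Z   (embedding-independent constants dropped)
    W = pairwise_w(Y)
    mask = ~np.eye(n, dtype=bool)
    return -alpha * np.sum(P[mask] * np.log(W[mask])) + np.log(W.sum())

def gradient(Y):
    # 4 * [ alpha * sum_j p_ij w_ij (y_i - y_j)  -  sum_j (w_ij^2 / Z) (y_i - y_j) ]
    W = pairwise_w(Y)
    Z = W.sum()
    diff = Y[:, None, :] - Y[None, :, :]
    attr = (alpha * P * W)[:, :, None] * diff
    rep = ((W ** 2) / Z)[:, :, None] * diff
    return 4 * (attr.sum(axis=1) - rep.sum(axis=1))

# central finite differences of the loss should match the analytic gradient
numerical = np.zeros_like(Y)
for i in range(n):
    for d in range(2):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, d] += eps; Ym[i, d] -= eps
        numerical[i, d] = (loss(Yp) - loss(Ym)) / (2 * eps)

print(np.allclose(gradient(Y), numerical, atol=1e-5))  # True
```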
@dkobak Wow, I like that very much... this seems to be a really fruitful perspective on the early/late exaggeration phases. I'll have to think more about what this means in terms of the output kernel and late exaggeration - I agree it could lead to a more principled approach. Very cool.
@linqiaozhi, @dkobak Awesome that you've figured it out. I too was wrong about the third point in @dkobak's list, but looking at the derivation, it makes sense. Intuitively, it seemed to me that if we want to pull the points closer together, this should change the objective, but I was really confused about how the correction could be linear for the exaggeration phase if the cost function was changed. Anyways, really cool. So one implication of this expression, I guess, is that piling on exaggeration won't really help much, because of the 1/\alpha relation. So switching from no exaggeration to 12 will have a much larger effect than switching from 12 to 24. Does that sound about right?
@pavlin-policar Well, going from 1 to 12 is a 12x increase, so I'd guess a comparable increase would be going from 12 to 144 rather than from 12 to 24. But it's hard to say; I think our understanding of exaggeration is still very poor. In general, one should really distinguish "early" exaggeration from "late" exaggeration (by the way, I haven't seen any benefit in using a phase with exaggeration=1 between early and late; I prefer to switch from early to late directly). Early exaggeration is an optimization trick. Late exaggeration is a meaningful change of the loss function. For MNIST with n=70k, early alpha=12 and late alpha=4 work quite well, see https://github.com/KlugerLab/FIt-SNE/blob/master/examples/test.ipynb. For the 10x dataset with n=1.3 mln, Linderman & Steinerberger seem to suggest something like alpha = 1.3e+6/10/200 = 650 for the early phase (given learning rate h=200) - right, @linqiaozhi? - whereas I find that alpha=8 works well for the late phase. I am doing some additional experiments with this right now...
@linqiaozhi Running the current master on the same data as in the first post in this issue (n=1.3 mln), I see that the cost computation still happens on a single thread; I think this is due to #40. More worryingly, however, I now get 160s runtime per 50 iterations, instead of the 130s I was getting previously (before the big PR over the weekend) :-(
@dkobak Wow, that's not the right direction! I compared the current master with 61e4a2b, for N=1 million, and the times are about the same on my machine. Until we address #40, this is what I would have expected, since, as I explained with the "negative result" in that PR over the weekend, we didn't really get much out of multithreading the repulsive forces. However, it should definitely not be slower! I wonder if the multithreading in "Step 3" of that function is causing overhead on your architecture. Would you mind comparing the times with and without multithreading in the n_body_fft_2d() function? That is, you could add a line there to disable the multithreading for that call. Anyways, this is hugely helpful. I am finding that small improvements from multithreading don't generalize well, and can even backfire in very bad ways on different machines (which is hard to know unless other people compare the times on their machines). If we confirm that the problem is overhead from multithreading, then we can just remove it for that function. The real boost from multithreading comes from the attractive forces anyway.
I recompiled the source and restarted my Python scripts. I get 168s with the current master, and 174s if I disable the multithreading in n_body_fft_2d(). The machine is an Intel Xeon CPU E3-1230 v5 @ 3.40GHz, 4 cores with 2 threads each.
@dkobak I also tried comparing on my Linux server, and I still can't replicate the difference in your timings. In my fork, I added timings to 61e4a2b in a branch called profiled_legacy. Would you mind compiling that branch with -DTIME_CODE, and also compiling the master of that fork with -DTIME_CODE, and posting the results? You could just copy the resulting binaries into whatever test directory structure you have already been using, if that is easier. Hopefully this will tell us what is slowing things down. Sorry to bother you with this... For example, on my server, a typical iteration has the following timings (and I added in the timing of the error computation too):
Everything should be the same except for Step 3, which is a bit faster due to the parallelization.
Sure. The timings jump around a bit (it would be good to see averages over a number of iterations...), especially for Attractive Forces (which is also by far the longest step here), but from quickly eyeballing the output for a couple of minutes it seems that maybe Step 1 got consistently slower. However, the difference is tiny, and I don't quite understand why the difference per 50 iterations is 160s vs 130s. I have to leave now; let me double-check it tomorrow. Profiled_legacy:
Master:
Okay, thanks. When you get the chance, could you also report the times for the computation of the error? That's essentially all that's left... In the times you just sent, the interpolation in master is faster in total (because of the speedup of Step 3).
Yes, something does not add up here. I'm pretty sure that the error computation took the same time (around 30s). I'll try to analyse the output for 50 consecutive iterations.
Good morning! I spent a fair amount of time profiling the two versions of the code and preparing a plot of the timing difference for each step over 50 iterations... before realizing that in the current master the time spent on the error computation is included in the reported time per 50 iterations, whereas in the old version it was not. That's exactly the 30s difference I was observing 😱 Apart from that, the interpolation did get a little faster and the rest is unchanged. We should both be embarrassed now! :)
@dkobak Alas... oh man, that is rough. So sorry you had to spend all that time on it. Anyways, thanks for figuring it out. Of course, that explains it perfectly. That should have been the first thing I thought of, but I guess I did not realize that I had made that change when I switched timing libraries (which was just so that it would run on Windows). Ouch, man.
No problem, that's how these things are... I am relieved now. Closing this issue.
While running the code on the n=1.3 mln dataset, I noticed that the CPU usage looks like this:
There are exactly 50 short valleys (each taking ~1s) and short periods of 100% CPU on all cores (also ~1s); I assume these correspond to the attractive force computation (parallel on all cores) and the repulsive force computation (currently not parallel). Together they take ~2s per iteration, i.e. ~100s per 50 iterations. But after every 50 iterations there is a long ~30s period when something happens on one core only. WHAT IS IT?!
The only thing that is done every 50 iterations is the cost computation. Does the cost computation really take so long? Can it be accelerated? If not, and if it really takes ~25% of the computation time, then I think it should be optional and switched off by default (at least for large datasets).