
Use expansions of log-density to fit once and not break shape=0 and shape \neq 0 into cases #15

Merged (7 commits, Mar 16, 2021)

Conversation

@cgeoga (Contributor) commented Feb 22, 2021

This small adjustment uses series expansions of parts of the log density for both the GEV and GPD for stability near and at \xi=0. This not only saves on compute time because you now only need to do one optimization, but now the resulting estimation behavior agrees more with existing packages in other languages, like ismev.
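The idea can be sketched as follows (a hypothetical illustration, not the PR's actual code; the cutoff value and truncation order are assumptions). For the GEV with standardized argument z = (x - μ)/σ, the log of t(x) = (1 + ξz)^(-1/ξ) is replaced by its Taylor expansion in ξ when |ξ| is small, so the log density is smooth across ξ = 0 and a single optimization suffices:

```julia
# Hypothetical sketch of the expansion trick (not the PR's actual code).
# GEV log-density: log f = -log σ + (ξ+1)·log t(x) - t(x),
# with log t(x) = -log(1+ξz)/ξ and z = (x-μ)/σ.
function gev_logpdf(x, μ, σ, ξ; cutoff=1e-4)
    z = (x - μ) / σ
    logt = if abs(ξ) < cutoff
        # Taylor expansion: log(1+ξz)/ξ = z - ξz²/2 + ξ²z³/3 + O(ξ³)
        -(z - ξ*z^2/2 + ξ^2*z^3/3)
    else
        -log1p(ξ*z) / ξ
    end
    -log(σ) + (ξ + 1)*logt - exp(logt)
end
```

At ξ = 0 exactly the expansion branch reduces to the Gumbel log-density -log σ - z - e^(-z), so the fitting objective needs no case split.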

@juliohm (Member) commented Feb 22, 2021

That looks like an amazing PR @cgeoga , thanks for putting the time into improving the package.

I will review it carefully soon.

@juliohm (Member) left a review comment

Amazing PR. To the point and very clean.

Just a few code style suggestions here and there and a few questions. After that we can see if the tests pass. Also feel free to add more tests with data you already know well.

@codecov-io commented Feb 23, 2021

Codecov Report

Merging #15 (f89f5e4) into master (af61fbb) will increase coverage by 47.75%.
The diff coverage is 91.89%.


@@             Coverage Diff             @@
##           master      #15       +/-   ##
===========================================
+ Coverage   37.28%   85.04%   +47.75%     
===========================================
  Files           7        7               
  Lines         118      107       -11     
===========================================
+ Hits           44       91       +47     
+ Misses         74       16       -58     
Impacted Files Coverage Δ
src/fitting.jl 93.61% <91.89%> (+93.61%) ⬆️
src/maxima.jl 92.30% <0.00%> (+0.64%) ⬆️
src/stats.jl 64.28% <0.00%> (+14.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@juliohm (Member) commented Feb 25, 2021

It seems that the tests are failing for an unrelated reason. I will try to check what is happening today.

@juliohm (Member) commented Feb 25, 2021

I've added fixes to the master branch to use the new ReferenceTests.jl package in place of VisualRegressionTests.jl. I have also re-triggered GitHub CI here to see if the tests pass now. You may need to rebase the PR; see https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request

After that we only need some additional tests to show that the fits look great :) Looking forward to merging this nice work.

@juliohm (Member) commented Feb 26, 2021

@cgeoga could you rebase the PR so that tests pass?

cgeoga and others added 4 commits February 27, 2021 14:40
series expansions for stability near and at \xi=0. Bump version.
Co-authored-by: Júlio Hoffimann <[email protected]>
@cgeoga (Contributor, Author) commented Feb 27, 2021

@juliohm Apologies for the delay. I think I've followed the instructions and rebased, although this is my first time doing this and I may have made some mistake. Crossing fingers.

@juliohm (Member) commented Feb 27, 2021

That is awesome @cgeoga, you did it perfectly. Do you think you could write some tests with data you are already familiar with?

Also, is there any reference I can read to learn more about the implemented expansion?

@cgeoga (Contributor, Author) commented Feb 27, 2021

You're too kind! I wouldn't say there is any data I'm particularly familiar with, but I could certainly write some tests that simulate data of various distributions and check the terminal likelihood and MLEs with R's ismev or extRemes or something. Would something like that be agreeable?

With regard to the expansion, I wouldn't say I have a reference. I'm sure one exists, but I just computed basic Taylor expansions at \xi=0. For the GEV distribution, for example, the variable tx for $\xi$ near zero is just a Taylor expansion of what Wikipedia calls $t(x)$, and similarly for the GPD. I'm not a numerical analyst, so my cutoffs and truncations might not be ideal, but from some ad-hoc testing I did with BigFloats the agreement seems pretty good, and it is still exact at $\xi = 0$.
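The kind of ad-hoc check described above might look like this (a sketch under my own assumptions, not the PR's actual test code): evaluate the truncated expansion of log(1+ξz)/ξ in Float64 and compare it against the same quantity computed in BigFloat near a candidate cutoff.

```julia
# Hypothetical accuracy check of the truncated expansion near ξ = 0,
# using BigFloat as the high-precision reference (a sketch, not the PR's code).
expansion(z, ξ) = z - ξ*z^2/2 + ξ^2*z^3/3   # truncation of log(1+ξz)/ξ

z, ξ = 0.7, 1e-4
ref = Float64(log1p(big(ξ)*big(z)) / big(ξ))  # BigFloat reference, rounded back

# Truncation error is O(|ξ|³ z⁴), far below Float64 noise at this cutoff.
err = abs(expansion(z, ξ) - ref)
```

A sweep of `err` over z and ξ values near the cutoff is one way to decide whether a threshold like 1e-4 is conservative enough.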

@juliohm (Member) commented Feb 28, 2021

Some tests would be great :) That way we can avoid introducing other bugs in the future. If you can double check with R's packages that would be even nicer. 💯

trouble figuring out how to set up the testing environment to install R in the first place.
@cgeoga (Contributor, Author) commented Feb 28, 2021

Hey @juliohm, I've pushed up some tests that compare both fits with R. I was able to work this out for bitbucket in the past, but I'm having a little trouble figuring out how to ask github to install R for these tests, so they will probably fail here. Could you help me out with that bit?

@juliohm (Member) left a review comment

Thanks for the tests! It would be much easier in the long run to simply compute the expected results for the parameters and then save them in the test suite. No need to install R here.

@@ -2,6 +2,8 @@
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
ImageIO = "82e4d734-157c-48bb-816b-45c225c6df19"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
RCall = "6f49c342-dc21-5d91-9882-a32aef131414"
@juliohm (Member) commented on the diff

Oh we don't need RCall here, can you just save the results to a text file for comparison? Adding the R language as a dependency isn't a good idea.

@cgeoga (Contributor, Author) commented Mar 1, 2021

Ah, storing in a text file makes sense, apologies. I'll push up a revision in the next 24 hours.

-- Tweaks to the cutoffs for expansions (log_gev_pdf had a typo with 1e-4)
-- Change box constraints on parameters to be closer to zero (primarily
a debugging experiment for tests)
-- Remove RCall dep in tests, add Serialization dep
-- Add serialized R mle results from ismev for testing
-- Update the JuMP for specifying Ipopt, per a warning message

The current testing fails several times, however, for one specific
distribution: GEV(1, 1, 1/2). And the failure is JuMP or Ipopt related.
All estimates that return with a success code pass the test comparing
with the R library ismev. So the only thing left to do is figure out the
GEV(1,1,1/2) failure issue.
@cgeoga (Contributor, Author) commented Mar 2, 2021

This is mostly done, and I have a commit here just to mark the almost-solution. One of the parts of the test---fitting some simulated GEV(1,1,1/2) data---seems to fail pretty often in JuMP/Ipopt, and I'm having a little trouble figuring out why. But all estimates that return with a success code match R's pretty well, so at least there's that.
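One way to make such a test robust to intermittent solver failures (a sketch under the assumption that the fit is posed as a JuMP model with Ipopt, which the commit message suggests; the toy objective and skip-on-failure policy are my own, not necessarily the PR's approach) is to check the termination status before comparing estimates:

```julia
using JuMP, Ipopt

# Hypothetical guard: only compare estimates when Ipopt reports success.
model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, x >= 0.1, start = 1.0)
@objective(model, Min, (log(x) - 1)^2)  # toy stand-in for a negative log-likelihood

optimize!(model)

if termination_status(model) == MOI.LOCALLY_SOLVED
    # Safe to compare value(x) against the stored R estimate here;
    # otherwise skip (or retry from another starting point).
    value(x)
end
```

Skipping failed solves keeps the comparison meaningful without masking a genuine bug, since successful solves are still checked against the reference.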

@juliohm (Member) commented Mar 2, 2021

That is awesome @cgeoga , thanks. I know that fitting extreme distributions with MLE can be quite challenging for some parameter sets, but I don't have much experience either. There is an open issue to implement the method of moments instead as people say it is more stable.

Regarding the PR, do you think you could simplify the data even further? I think we don't need serialization, just a simple text or CSV file with the data and the expected parameters in the name (e.g. mu_1_sigma_2_eps_0.csv). We could then add a comment, as you did, explaining how these files were generated in R, using raw R code. For maintainers this is much easier to update: they just need an installation of R, copy/paste the script, generate more data if needed, and commit more text files to the test suite.

@juliohm (Member) commented Mar 7, 2021

@cgeoga did you have a chance to take a look into this? We are almost there :)

@cgeoga (Contributor, Author) commented Mar 8, 2021

@juliohm, sorry for dropping off a bit; I've suddenly gotten pretty busy. I'm not sure I understand why a pure R script is simpler, though. We do ultimately need the same collection of random numbers to be passed to R and Julia, and so if we use a pure R script then don't we need to save R's random numbers to then pass to Julia? That is much more information to store than R's point estimates from samples produced in Julia with StableRNGs.

I will defer to your judgment here, but considering that I'm missing the point I'd appreciate understanding what you're getting at better before committing something that isn't agreeable again. Could you also comment on the issue with serialization? It seems to me like they both get read with one line, be it readdlm("file.csv", ',') or deserialize("file"), and the serialization makes it a little easier to store a Vector{NTuple{3, Vector{Float64}}} directly.

@juliohm (Member) commented Mar 8, 2021

Maybe I misinterpreted the code in the PR, but I understood it is relying on calling R to produce some random numbers from the distributions, and then using the suggested Julia fit implementation to recover the parameters from the data. Is that accurate? If so, I don't see a reason to rely on another language; that is more difficult to maintain in the long run. Ideally we could simply save the samples in a text file, and perform fitting with Julia without any dependency on R. I hope it makes sense.

I feel that adding R as a dependency in a test suite just to fit a bunch of random numbers generated there is overkill. But please let me know if I am misinterpreting something. I am so busy with other fronts that I can barely stop to read the code carefully.

@juliohm (Member) commented Mar 8, 2021

So I am assuming the experiment you are proposing in the tests goes as follows:

  1. Draw xs ~ GEV(mu, sigma, eps) using R
  2. Save xs, mu, sigma, eps to disk using some serialization
  3. Load xs in a Julia session and try to recover mu, sigma, eps

Is that correct? Why not:

  1. Draw xs ~ GEV(mu, sigma, eps) in Julia
  2. Fit distribution in Julia using your PR
  3. Compare known mu, sigma, eps with results

@cgeoga (Contributor, Author) commented Mar 8, 2021

Ah, okay, it's good we're talking about this. Because the code is actually doing (close to) your alternative already. It is using Distributions and StableRNGs to sample from the distributions, not R. There is no application of R's randomness in the script, unless the fitting in ismev does something behind the scenes. To lay it out in a similar format:

  1. Draw data using Distributions + StableRNGs in Julia
  2. Fit that data in R with ismev, passing it in via RCall
  3. Store the parameter estimates provided by R in a serialized file

Then the actual Julia testing generates the same random values (thanks to StableRNGs, this is consistent even across Julia versions), fits them with this Julia code, and compares the parameter estimates to the ones obtained in R.

I personally think it is more meaningful to compare the point estimates with what another established software provides rather than with the true values used for the simulation. If R were finding parameter values with a meaningfully different terminal likelihood, for example, I would still say this software isn't working, even if its estimators fall within whatever our testing tolerance is of the true values used to generate the data. Extracting the terminal objective value from JuMP problems is strangely complicated, so the code currently just compares the parameter estimates. But that still seems pretty good to me.
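The generation step described above might be sketched like this (seed, sample size, and parameter values are hypothetical; this is not the PR's exact script):

```julia
using Distributions, StableRNGs

# Reproducible draws: StableRNG guarantees the same stream across Julia
# versions, so stored R estimates stay comparable to what the tests regenerate.
rng = StableRNG(12345)
xs  = rand(rng, GeneralizedExtremeValue(1.0, 1.0, 0.5), 1000)

# These xs would be handed to R's ismev once (via RCall, outside the test
# suite) to produce stored reference estimates, then regenerated identically
# inside the Julia tests for fitting.
```

Because the data are regenerated deterministically, only R's point estimates need to be stored, not the samples themselves.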

@cgeoga (Contributor, Author) commented Mar 8, 2021

The code for this generation is in a block comment at the bottom of tests/fitting.jl, by the way. I am imagining that people who want to test the package can just play with the comments, run it once, and then revert the file or something.

@juliohm (Member) commented Mar 8, 2021 via email

@cgeoga (Contributor, Author) commented Mar 8, 2021

If I may ask, have you taken a look at the newest test file that's pushed up? Dumping the results of the R code is precisely what the code is doing. The results are 30 Vector{Float64} for each of the GPD and GEV, so I think that's a bit cumbersome to literally inline into the file.

Also, again respectfully, Serialization is in the stdlib. Maybe it was incorrect to add it to the Project.toml file, but there are no new dependencies. I do see the value in storing the parameter estimates in a plain file, but then the very small additional complexity of the serialization just moves into the code: the code that generates the R fits will need to reformat things, and the testing code itself will do something like iterating over Iterators.partition(eachrow("written_r_results_dlm"), 3), which in my opinion isn't a great deal either. Is there any downside to serializing besides the fact that you can't open the serialized file in a text editor?

@juliohm (Member) commented Mar 8, 2021 via email

@juliohm (Member) commented Mar 9, 2021 via email

@cgeoga (Contributor, Author) commented Mar 9, 2021

That's right, but it's doing that ten times each. So that's a lot of numbers to store inline in a file even to a few significant digits, I would think. Honestly, I think the fastest way to work this out would be if you ran the code and perhaps suggested how you'd like it to be stored. I do not have strong feelings, but I think there's still some miscommunication about what exactly needs to be stored. Maybe you could just run that block-commented code, inspect the objects gev_R_results and gpd_R_results, and let me know how you'd like them to be stored.

@juliohm (Member) commented Mar 9, 2021

The results are great, I was just making sure that there is no over-engineering here. Just the fact that we are no longer loading an R session in the test suites is a plus. We can at least deserialize the files in a Julia session when needed to inspect the values. The way I would have organized this is much simpler: one CSV file per distribution case. So dist1 would be a CSV where each row has mu, sigma, eps for the different samples of dist1. As simple as that: a dataframe with nsamples rows and 3 columns.
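The layout suggested here can be written and read with the DelimitedFiles stdlib alone (a sketch; the file name and values are hypothetical):

```julia
using DelimitedFiles

# Each row: (μ̂, σ̂, ξ̂) estimates for one simulated sample of this
# distribution case, so the file is nsamples rows by 3 columns.
estimates = [1.02 0.98 0.49;
             0.97 1.05 0.52]
writedlm("gev_mu_1_sigma_1_eps_0.5.csv", estimates, ',')

# Reading back in the test suite is a one-liner as well:
stored = readdlm("gev_mu_1_sigma_1_eps_0.5.csv", ',')
```

Encoding the true parameters in the file name keeps each file self-describing without any extra metadata format.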

@juliohm (Member) commented Mar 9, 2021

That way we could at least see that all estimates are approximately the same, and approximately equal to the true parameters. Do you think it would be too much work to follow this format? I can merge the PR as is, no problem.

@cgeoga (Contributor, Author) commented Mar 9, 2021

Ah, okay, sure. That will be six total CSV files because we're testing GPD and GEV, but maybe I'll just make a subdirectory for them. I'll get that done today.

@cgeoga (Contributor, Author) commented Mar 13, 2021

Sorry to be a bit slow here. But I've just pushed up those changes.

@juliohm (Member) commented Mar 13, 2021

Nice! Much cleaner, and we can now actually get an idea of what these results look like just by inspecting the text files 💯

As a final improvement, can you please remove the try-catch blocks where you throw an error message with the index of the failure? This should simplify the test by a lot:

for (JL, R) in zip((dist1, dist2, dist3), (r_dist1, r_dist2, r_dist3))
  # compare
end

We don't need the complex logic with enumerate.
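Filled out, the suggested loop might look like this (a sketch with hypothetical stand-in data; the real vectors would come from the fits and the stored CSV files):

```julia
using Test

# Hypothetical stand-ins: Julia estimates and the corresponding stored R
# estimates for two distribution cases.
jl_dist1, jl_dist2 = [1.02, 0.98, 0.49], [0.99, 2.01, 0.0]
r_dist1,  r_dist2  = [1.02, 0.98, 0.49], [0.99, 2.01, 0.0]

for (JL, R) in zip((jl_dist1, jl_dist2), (r_dist1, r_dist2))
    @test isapprox(JL, R; rtol=1e-3)  # no enumerate or try-catch needed
end
```

On a test failure, `@test` already reports the offending values, so the index bookkeeping adds little.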

@juliohm (Member) commented Mar 16, 2021

I will merge the PR, it is already amazing! Thanks for the nice improvement. I will fix the minor details locally.

@juliohm juliohm merged commit 1041a7d into JuliaEarth:master Mar 16, 2021