Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the Poisson distribution #15814

Merged
merged 1 commit into from
May 13, 2021
Merged

Conversation

talgalili
Copy link
Contributor

@talgalili talgalili commented Mar 11, 2021

Adding the Poisson distribution, which is central to many statistical procedures (https://en.wikipedia.org/wiki/poisson_distribution) (#15798)

Test plan (adding unit-tests)

Following the diff template of: https://github.com/prestodb/presto/pull/11981/files

== RELEASE NOTES ==

General Changes

  • Add poisson_cdf and inverse_poisson_cdf functions.

(like the beta_cds: https://prestodb.io/docs/current/release/release-0.215.html?highlight=beta_cds)

Copy link
Contributor

@leepface leepface left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe these are the maven-checks errors you're seeing. You may also need to rebase for the other errors that look unrelated.

@talgalili talgalili force-pushed the dist_poisson branch 7 times, most recently from 4a26de4 to 9213486 Compare April 27, 2021 12:20
Copy link
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some nits. Please fix the checkstyle error.

@talgalili talgalili force-pushed the dist_poisson branch 2 times, most recently from 0ab7b2a to 3311a86 Compare April 28, 2021 15:06
@talgalili
Copy link
Contributor Author

A point to notice:
The poisson distribution has an infinite support on the right hand side. Because the function returns an integer, it would not return an Inf object, but rather some MAXINT style number (a very large integer).
One option was to put that large number in the test. But instead, what we decided to do (Eden and me), is to just forbid the use of p=1. The reason is because that's also the current default that is used in inverse_normal_cdf, which has inf support, and it just forbids the use of p=1 (or p=0).
Also, in this recent update to the diff, I've changed inverse_poisson_cdf to return BIGINT instead of DOUBLE.

A followup to this discussion is that I should also change inverse_chisquare_cdf to not allow p=1 (right now it allows it, and returns Inf as expected, but that cases an inconsistent behavior between it and inverse_normal_cdf)

@talgalili
Copy link
Contributor Author

(I may wish to move this to int from bigint, need to verify further. https://commons.apache.org/proper/commons-math/javadocs/api-3.5/org/apache/commons/math3/distribution/IntegerDistribution.html )

@rongrong
Copy link
Contributor

Both checkstyle error and test failures are related. Please fix

@talgalili
Copy link
Contributor Author

talgalili commented Apr 28, 2021 via email

@talgalili talgalili force-pushed the dist_poisson branch 5 times, most recently from 9b5c9c9 to c1f94d9 Compare April 29, 2021 10:49
@talgalili
Copy link
Contributor Author

@rongrong - diff was fixed, and all tests now came back green. It's back to you for review :)

@talgalili talgalili force-pushed the dist_poisson branch 2 times, most recently from 929026e to 01c3bac Compare April 29, 2021 19:49
@talgalili
Copy link
Contributor Author

Thanks @rongrong , I've now updated the text in the diff. I've also expanded a bit the description of inverse_poisson_cdf so it's clearer which value it return.

@talgalili
Copy link
Contributor Author

Hey @rongrong - this diff includes all the fixes we've discussed. Could you please review and let me know if it's good to merge, or if there are other steps to take?
Thanks :)

@talgalili talgalili requested a review from rongrong May 10, 2021 19:17
Copy link
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nits. Thanks!

@talgalili
Copy link
Contributor Author

Thanks @rongrong - made the fixes, diff is now ready for merging :)

@mbasmanova mbasmanova merged commit 39c66e3 into prestodb:master May 13, 2021
@mbasmanova
Copy link
Contributor

Thank you, @talgalili, for the contribution.

@talgalili
Copy link
Contributor Author

Thanks @mbasmanova for the help in merging this. :)
And @rongrong - for a great review.

@SqlType(StandardTypes.INTEGER) long value)
{
checkCondition(value >= 0, INVALID_FUNCTION_ARGUMENT, "value must be a non-negative integer");
checkCondition(lambda > 0, INVALID_FUNCTION_ARGUMENT, "lambda must be greater than 0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the library API is already doing this check, I say just do a try/catch and throw user_error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Since this is the method used by all the distribution functions (i.e.: normal, beta, chi-square, binomial), do you think it should be changed there as well?
If so - could you please help with this change? (I'm not experienced in Java, so want to make sure I understand what you're proposing and which error will be thrown)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do something like:

try { ... } catch(NotStrictlyPositiveException notStrictlyPositiveException) { throw new PrestoException(GENERIC_USER_ERROR, ...)

Look at StandardErrorCodes.java and other files to see the pattern what they do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah it will be good to do for all of them.

{
checkCondition(value >= 0, INVALID_FUNCTION_ARGUMENT, "value must be a non-negative integer");
checkCondition(lambda > 0, INVALID_FUNCTION_ARGUMENT, "lambda must be greater than 0");
PoissonDistribution distribution = new PoissonDistribution(lambda);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the lambda going to be generally fixed in a query? If so, you should find a way to avoid new object creation to improve memory perf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also interesting. As I wrote to your other comment - this is the method used by all the distribution functions (i.e.: normal, beta, chi-square, binomial), do you think it should be changed there as well?
If so - could you please help with this change / propose how to do it?

Thanks upfront.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good suggestion here :( I looked in the code but can't find other examples. I will look further and comment here if I find something.

@sujay-jain sujay-jain mentioned this pull request May 21, 2021
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants