-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sample rows no replace #180
Conversation
This looks to be failing because of something in the dcm. Not sure why the previous builds were OK and not this one. |
Hi, So overall I like it, I like that you split it into smaller functions. I do not have time to look over in detail. So detailed comments incoming... as soon as I have time. |
…enes in without replacement calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seam not to have gotten the "review" tool. so I'll do it the old fashioned way:
line 28-9 if p.sum() == 0: p = np.ones(len(p))
Why is a zero sum an even distribution? Not sure what is better but that is surprising.
line 30 round(p.sum(), 0) != 1
Why are you rounding? This matches from 0.4 to 1.5. did you mean abs(p.sum() - 1.0) < 0.0001
More incoming as I think through the permutations of prob_column/accounting_column
Per the zero sum: this is something which seems to come up from time to time, especially when doing segmented models, so I was trying to account for this here. Alternatively, this could just return None. This probably makes more sense the more I think about it. Or I guess could we could just have this raise a value error. The inclusion of the round is an over site. This should probably just be |
I don't think a direct compere is correct either http://floating-point-gui.de/errors/comparison/ Ok lets start looking at accounting_sample_replace. Generally looks ok. It is distorting the probabilities by retrying, but it was doing that before as well. As I have said before, I have not yet used this code, so perhaps this distortion is what is intended. Also if we overshot, we could use the cumsum trick, but that would lead to different distortion. So forget I mentioned it. Looks good. looking at accounting_sample_no_replace. line 144:
|
General comment: I think that in the most cases the overwhelming majority of the samples are satisfied in the initial sample. The refinement is simply playing with the margins to try to get the counts to match exactly. We primarily use this method for our household transition model where we have household population targets but are sampling households. I think that in most non-trivial cases if you compare the resulting probabilities you will not find any distortion beyond the random variance you would see across runs. But I can test that further. With respect to line 144,
|
Sorry for the slow reply. I had meetings all day yesterday. Ok, so the distortion is intended. Objection withdrawn. :-) I think, the error is this one exactly the cases where there are not enough items to satisfy the request. Our new strategy in those cases picks randomly from the 0 probability items, not a bad solution. Thanks for the link to the article! I was surprised it did not include a proof of the theorems. The approach seems to be called "Reservoir Sampling" and discussed in many places. Thanks for putting up with my nitpiks. Looks good to me. |
No worries, I appreciate the thorough review. With respect to the floating point issue, I was going to change this to:
This is based on looking at the numpy comparison: Does that make sense? |
Yes, that looks good. and thanks for the links to the source! I am +1 for merge. Someone with the rights want to take a look? @waddell @fscottfoti |
Looks good to me. Thanks @bridwell! |
This addresses feedback when the accounting total is not met (#178) and also providing probability distributions to influence the sample (#149).
To obtain feedback, set the return_status argument to True, and this will return both the sampled rows and the status. If False, then only the sampled rows will be returned. This is still the default, so as to not break any existing calls.
To provide sampling probabilities, set the prob_column argument to the name of the column in the data frame containing the weights or probabilities. These values will be normalized so they sum to 1, if they do not already.
The logic for sampling with accounting but without replacement has been changed to hopefully be more robust.