Fix sample rows no replace #180

Merged: 8 commits from AZMAG:fix_sample_rows_no_replace into UDST:master on Dec 6, 2016
Conversation

@bridwell (Contributor) commented Nov 14, 2016:

This addresses two requests: providing feedback when the accounting total is not met (#178) and providing probability distributions to influence the sample (#149).

To obtain feedback, set the return_status argument to True; this will return both the sampled rows and the status. If False, only the sampled rows are returned. False is still the default, so as not to break any existing calls.

To provide sampling probabilities, set the prob_column argument to the name of the column in the data frame containing the weights or probabilities. These values will be normalized so they sum to 1 if they do not already.

The logic for sampling with accounting but without replacement has been changed and should now be more robust.
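For readers landing here later, a minimal usage sketch of the new arguments. The import path, the positional order (total, data), and the column names are assumptions based on this description, not confirmed against the code.

```python
import pandas as pd

# Assumed import path; the PR touches the row-sampling utility but the
# exact module location is not shown in this thread.
from urbansim.utils.sampling import sample_rows

hh = pd.DataFrame({
    'persons': [1, 2, 3, 4, 5],            # accounting column
    'weight': [0.5, 1.0, 1.0, 2.0, 0.5],   # sampling weights
})

# Old behaviour: only the sampled rows come back.
rows = sample_rows(10, hh, accounting_column='persons')

# New behaviour: weight the draw by 'weight' (normalized internally) and
# also return a status describing whether the accounting total was met.
rows, status = sample_rows(
    10, hh,
    accounting_column='persons',
    prob_column='weight',
    return_status=True,
)
```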

@coveralls commented: Coverage increased (+0.1%) to 94.29% when pulling 2643fcf on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@coveralls commented: Coverage increased (+0.1%) to 94.284% when pulling 921bc5e on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@coveralls commented: Coverage increased (+0.1%) to 94.284% when pulling 1e58e3a on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@bridwell (Contributor, Author) commented:

This looks to be failing because of something in the dcm. Not sure why the previous builds were OK but this one is not.

@coveralls commented: Coverage increased (+0.1%) to 94.287% when pulling 7eae801 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@Eh2406 (Contributor) commented Nov 14, 2016:

Hi,

Overall I like it, and I like that you split it into smaller functions. I do not have time to look it over in detail right now, so detailed comments are incoming as soon as I have time.

@coveralls commented: Coverage increased (+0.1%) to 94.293% when pulling 0c429a0 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@Eh2406 (Contributor) left a review comment:

I seem not to have gotten the "review" tool, so I'll do it the old-fashioned way:

Lines 28-29: if p.sum() == 0: p = np.ones(len(p))
Why is a zero sum treated as an even distribution? Not sure what is better, but that is surprising.

Line 30: round(p.sum(), 0) != 1
Why are you rounding? This accepts anything from 0.5 up to 1.5. Did you mean abs(p.sum() - 1.0) < 0.0001?

More incoming as I think through the permutations of prob_column/accounting_column

@bridwell (Contributor, Author) commented:

Per the zero sum: this is something which seems to come up from time to time, especially when doing segmented models, so I was trying to account for it here. Alternatively, this could just return None, which probably makes more sense the more I think about it. Or I guess we could just have it raise a ValueError.

The inclusion of the round is an oversight. This should probably just be p.sum() != 1. Thanks for catching that.
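To make the options in this exchange concrete, here is a small sketch of the three ways an all-zero weight vector could be handled; the helper name and structure are illustrative, not the code in the PR.

```python
import numpy as np

def prepare_weights(p, on_zero_sum='raise'):
    """Illustrative handling of an all-zero weight vector."""
    p = np.asarray(p, dtype=float)
    if p.sum() == 0:
        if on_zero_sum == 'uniform':
            # behaviour questioned above: fall back to an even distribution
            return np.ones(len(p)) / len(p)
        if on_zero_sum == 'none':
            # signal "nothing to sample" and let the caller decide
            return None
        # fail loudly so an empty segment surfaces immediately
        raise ValueError('sampling weights sum to zero')
    return p / p.sum()
```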

@Eh2406 (Contributor) commented Nov 15, 2016:

I don't think a direct comparison is correct either: http://floating-point-gui.de/errors/comparison/

OK, let's start looking at accounting_sample_replace. Generally looks OK. It is distorting the probabilities by retrying, but it was doing that before as well. As I have said before, I have not yet used this code, so perhaps this distortion is what is intended. Also, if we overshot, we could use the cumsum trick, but that would lead to a different distortion. So forget I mentioned it. Looks good.

Looking at accounting_sample_no_replace:

Line 144: ran_p = pd.Series(np.power(np.random.rand(len(p)), 1.0 / p), index=data.index)
How does this work? I am rusty on Gibbs sampling, so it is probably just me. Could we use numpy.random.choice(data.index, len(data.index), False, p)?

The sample refinement is coded well. It does distort the probabilities. Pro: more like the replace variant. Con: not what we did before, nor what the docstrings say we do.
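For anyone else puzzling over line 144: drawing u ** (1 / p) with u uniform on (0, 1) and keeping the rows with the largest keys is the weighted-reservoir (Efraimidis-Spirakis) trick, which is equivalent to weighted sampling without replacement. A minimal sketch of both approaches mentioned here, assuming p already holds normalized weights:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

data = pd.DataFrame({'x': range(6)}, index=list('abcdef'))
p = pd.Series([0.05, 0.05, 0.1, 0.2, 0.3, 0.3], index=data.index)
n = 3

# Exponential-key trick (what line 144 does): key_i = u_i ** (1 / p_i),
# then keep the n rows with the largest keys.
keys = pd.Series(np.power(np.random.rand(len(p)), 1.0 / p.values),
                 index=data.index)
sample_keys = data.loc[keys.nlargest(n).index]

# The alternative suggested above: let numpy draw directly,
# without replacement, using the weights.
sample_choice = data.loc[np.random.choice(data.index, n, False, p.values)]
```

A side note on zero weights: the key trick gives them a key of 0.0 (after a divide-by-zero warning from 1.0 / p), so they are drawn only as a last resort, which lines up with the under-supplied case discussed later in the thread.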

@bridwell (Contributor, Author) commented:

General comment: I think that in most cases the overwhelming majority of the samples are satisfied in the initial sample. The refinement is simply playing with the margins to try to get the counts to match exactly. We primarily use this method for our household transition model, where we have household population targets but are sampling households. I think that in most non-trivial cases, if you compare the resulting probabilities, you will not find any distortion beyond the random variance you would see across runs. But I can test that further.

With respect to line 144,

@Eh2406 (Contributor) commented Nov 17, 2016:

Sorry for the slow reply. I had meetings all day yesterday.

OK, so the distortion is intended. Objection withdrawn. :-)

I think the tricky case is exactly the one where there are not enough items to satisfy the request. Our new strategy in those cases picks randomly from the zero-probability items, which is not a bad solution.

Thanks for the link to the article! I was surprised it did not include a proof of the theorems. The approach seems to be called "Reservoir Sampling" and is discussed in many places.

Thanks for putting up with my nitpicks. Looks good to me.

@bridwell (Contributor, Author) commented:

No worries, I appreciate the thorough review. With respect to the floating point issue, I was going to change this to:

if abs(p.sum() - 1.0) > 1e-8:

This is based on looking at the numpy comparison:
https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/random/mtrand/mtrand.pyx#L1082
https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/random/mtrand/mtrand.pyx#L551

Does that make sense?
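A quick plain-Python illustration (not project code) of why the rounded and exact comparisons misbehave while the tolerance check does what is wanted:

```python
# Ten weights of 0.1 should sum to exactly 1, but floating point disagrees.
s = sum([0.1] * 10)       # 0.9999999999999999

round(s, 0) != 1          # False here, but also False for sums like 0.6 or 1.4
s != 1                    # True: the exact comparison rejects valid weights
abs(s - 1.0) > 1e-8       # False: the tolerance check accepts them
```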

@Eh2406 (Contributor) commented Nov 17, 2016:

Yes, that looks good, and thanks for the links to the source!

I am +1 for merge. Someone with the rights want to take a look? @waddell @fscottfoti

@coveralls commented: Coverage increased (+0.1%) to 94.293% when pulling 7858ea3 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

@janowicz (Contributor) commented Dec 6, 2016:

Looks good to me. Thanks @bridwell!

@janowicz merged commit fe90b0b into UDST:master on Dec 6, 2016.
@bridwell deleted the fix_sample_rows_no_replace branch on March 28, 2017.