Fix sample rows no replace #180

bridwell · 2016-11-14T18:49:49Z

This addresses feedback when the accounting total is not met (#178) and also providing probability distributions to influence the sample (#149).

To obtain feedback, set the return_status argument to True, and this will return both the sampled rows and the status. If False, then only the sampled rows will be returned. This is still the default, so as to not break any existing calls.

To provide sampling probabilities, set the prob_column argument to the name of the column in the data frame containing the weights or probabilities. These values will be normalized so they sum to 1, if they do not already.

The logic for sampling with accounting but without replacement has been changed to hopefully be more robust.

coveralls · 2016-11-14T18:52:35Z

Coverage increased (+0.1%) to 94.29% when pulling 2643fcf on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

coveralls · 2016-11-14T19:04:44Z

Coverage increased (+0.1%) to 94.284% when pulling 921bc5e on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

coveralls · 2016-11-14T19:24:40Z

Coverage increased (+0.1%) to 94.284% when pulling 1e58e3a on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

bridwell · 2016-11-14T19:38:18Z

This looks to be failing because of something in the dcm. Not sure why the previous builds were OK and not this one.

coveralls · 2016-11-14T20:08:24Z

Coverage increased (+0.1%) to 94.287% when pulling 7eae801 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

Eh2406 · 2016-11-14T20:42:41Z

Hi,

So overall I like it, I like that you split it into smaller functions. I do not have time to look over in detail. So detailed comments incoming... as soon as I have time.

…enes in without replacement calls

coveralls · 2016-11-14T21:42:15Z

Coverage increased (+0.1%) to 94.293% when pulling 0c429a0 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

Eh2406

I seam not to have gotten the "review" tool. so I'll do it the old fashioned way:

line 28-9 if p.sum() == 0: p = np.ones(len(p))
Why is a zero sum an even distribution? Not sure what is better but that is surprising.
line 30 round(p.sum(), 0) != 1
Why are you rounding? This matches from 0.4 to 1.5. did you mean abs(p.sum() - 1.0) < 0.0001

More incoming as I think through the permutations of prob_column/accounting_column

bridwell · 2016-11-14T22:39:25Z

Per the zero sum: this is something which seems to come up from time to time, especially when doing segmented models, so I was trying to account for this here. Alternatively, this could just return None. This probably makes more sense the more I think about it. Or I guess could we could just have this raise a value error.

The inclusion of the round is an over site. This should probably just be p.sum() != 1 . Thanks for catching that.

Eh2406 · 2016-11-15T16:09:13Z

I don't think a direct compere is correct either http://floating-point-gui.de/errors/comparison/

Ok lets start looking at accounting_sample_replace. Generally looks ok. It is distorting the probabilities by retrying, but it was doing that before as well. As I have said before, I have not yet used this code, so perhaps this distortion is what is intended. Also if we overshot, we could use the cumsum trick, but that would lead to different distortion. So forget I mentioned it. Looks good.

looking at accounting_sample_no_replace.

line 144: ran_p = pd.Series(np.power(np.random.rand(len(p)), 1.0 / p), index=data.index)
How does this work? I am rusty on gibbs sampling, so it is probably just me. Could we use numpy.random.choice(data.index, len(data.index), False, p)?

refine the sample is coded well. It does distort the probabilities. Pro, more like the replace variant. Con, not what we did before nor what the docstrings say we do.

bridwell · 2016-11-15T23:27:02Z

General comment: I think that in the most cases the overwhelming majority of the samples are satisfied in the initial sample. The refinement is simply playing with the margins to try to get the counts to match exactly. We primarily use this method for our household transition model where we have household population targets but are sampling households. I think that in most non-trivial cases if you compare the resulting probabilities you will not find any distortion beyond the random variance you would see across runs. But I can test that further.

With respect to line 144,

The problem with calling numpy.random.choice, is that if there are 0 valued items then an error is raised stating something like there are not enough non 0 items. So doing a weighted permutation with this would require filtering out the non-zero item first. This is possible, but there could still be cases where there are not enough items to satisfy the request.
This is based on the following paper:
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
From my testing, this seems to produce similar results to numpy.random.choice:
https://github.com/AZMAG/smartpy_core/blob/master/smartpy_core/notebooks/compare_weighted_sampling_approaches.ipynb

Eh2406 · 2016-11-17T21:59:19Z

Sorry for the slow reply. I had meetings all day yesterday.

Ok, so the distortion is intended. Objection withdrawn. :-)

I think, the error is this one exactly the cases where there are not enough items to satisfy the request. Our new strategy in those cases picks randomly from the 0 probability items, not a bad solution.

Thanks for the link to the article! I was surprised it did not include a proof of the theorems. The approach seems to be called "Reservoir Sampling" and discussed in many places.

Thanks for putting up with my nitpiks. Looks good to me.

bridwell · 2016-11-17T22:33:17Z

No worries, I appreciate the thorough review. With respect to the floating point issue, I was going to change this to:

if abs(p.sum() - 1.0) > 1e-8:

This is based on looking at the numpy comparison:
https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/random/mtrand/mtrand.pyx#L1082
https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/random/mtrand/mtrand.pyx#L551

Does that make sense?

Eh2406 · 2016-11-17T23:12:45Z

Yes, that looks good. and thanks for the links to the source!

I am +1 for merge. Someone with the rights want to take a look? @waddell @fscottfoti

coveralls · 2016-11-17T23:25:41Z

Coverage increased (+0.1%) to 94.293% when pulling 7858ea3 on AZMAG:fix_sample_rows_no_replace into caa6ab8 on UDST:master.

janowicz · 2016-12-06T17:09:23Z

Looks good to me. Thanks @bridwell!

bridwell added 2 commits November 10, 2016 16:04

update methods for sampling with accounting

330063f

update tests, fix probs, add matching status

2643fcf

clean up print statements

921bc5e

clean up docstrings

1e58e3a

one more docstring change

62106d4

allow probabilities when sampling without accounting

7eae801

require input data to have unique index, add tests checking for uniqu…

0c429a0

…enes in without replacement calls

Eh2406 reviewed Nov 14, 2016

View reviewed changes

update sample rows

7858ea3

janowicz merged commit fe90b0b into UDST:master Dec 6, 2016

janowicz mentioned this pull request Dec 6, 2016

No feedback if iteration does not reach exact result in utils.sampling.sample_rows #178

Closed

bridwell deleted the fix_sample_rows_no_replace branch March 28, 2017 21:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sample rows no replace #180

Fix sample rows no replace #180

bridwell commented Nov 14, 2016 •

edited

Loading

coveralls commented Nov 14, 2016

coveralls commented Nov 14, 2016

coveralls commented Nov 14, 2016

bridwell commented Nov 14, 2016

coveralls commented Nov 14, 2016

Eh2406 commented Nov 14, 2016

coveralls commented Nov 14, 2016

Eh2406 left a comment •

edited

Loading

bridwell commented Nov 14, 2016

Eh2406 commented Nov 15, 2016

bridwell commented Nov 15, 2016

Eh2406 commented Nov 17, 2016

bridwell commented Nov 17, 2016

Eh2406 commented Nov 17, 2016

coveralls commented Nov 17, 2016

janowicz commented Dec 6, 2016

Fix sample rows no replace #180

Fix sample rows no replace #180

Conversation

bridwell commented Nov 14, 2016 • edited Loading

coveralls commented Nov 14, 2016

coveralls commented Nov 14, 2016

coveralls commented Nov 14, 2016

bridwell commented Nov 14, 2016

coveralls commented Nov 14, 2016

Eh2406 commented Nov 14, 2016

coveralls commented Nov 14, 2016

Eh2406 left a comment • edited Loading

Choose a reason for hiding this comment

bridwell commented Nov 14, 2016

Eh2406 commented Nov 15, 2016

bridwell commented Nov 15, 2016

Eh2406 commented Nov 17, 2016

bridwell commented Nov 17, 2016

Eh2406 commented Nov 17, 2016

coveralls commented Nov 17, 2016

janowicz commented Dec 6, 2016

bridwell commented Nov 14, 2016 •

edited

Loading

Eh2406 left a comment •

edited

Loading