Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Handle categorical dtype to/from R #9187

Closed
wants to merge 6 commits into from

Conversation

jseabold
Copy link
Contributor

@jseabold jseabold commented Jan 2, 2015

No description provided.

@jseabold
Copy link
Contributor Author

jseabold commented Jan 2, 2015

Failure is unrelated and why didn't skip-ci work?

@shoyer
Copy link
Member

shoyer commented Jan 4, 2015

Is using the pandas interface the preferred way to do pandas/R dataframe conversion, or should we leave this to rpy2 now that it has built in support for pandas? I've had pretty good luck with using pandas2ri.activate() from rpy2, which I believe entirely skips the conversion machinery in pandas itself. At the very least, I suspect a documentation update is due.

@lgautier any input?

For reference, here is where you'll find the pandas conversion code in rpy2: https://bitbucket.org/rpy2/rpy2/src/f1f15fe7eb6d1d65e8b3de4ee7fd697b800f001e/rpy/robjects/pandas2ri.py?at=default

BTW, I restarted the failed travis job -- we'll see if it work this time.

@lgautier
Copy link
Contributor

lgautier commented Jan 4, 2015

I am adding the following comment so you (pandas) can make the your decision with data.

rpy2 has integrated a two-way conversion of pandas DataFrame to rpy2 DataFrame (proxy of data.frame in the embedded R) since version 2.4.0 (current release is 2.5.4) and there is currently no plan to drop support for it: rmagic, originally in ipython and now part of rpy2, appears to rely on pandas for a prime representation of data frames in Python (@davclark can comment or add anything I'd have omitted about this).

While the conversion in rpy2 is covered by unit tests and is thought to be practical and well behaved, rpy2 is modular and relatively easy to customize by design. Alternative conversions, built from scratch or built on the top of the existing ones, can be added (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/robjects_convert.html?highlight=register#customizing-the-conversion)

@jseabold
Copy link
Contributor Author

jseabold commented Jan 5, 2015

I don't have a strong opinion on this, though I suspect any "fixes" should definitely make it in to rpy2. As a prior though, I assume pandas development moves a bit faster, but that's mostly a guess. My mercurial/bitbucket-fu is also super weak (so far), so it's easier for me to make a pandas PR :)

@jorisvandenbossche
Copy link
Member

Related discussions in #7309 and #7385

@jreback jreback added the Categorical Categorical Data Type label Jan 5, 2015
@jreback
Copy link
Contributor

jreback commented Jan 5, 2015

@jseabold it does not appear to me that any of this is actually tested on travis (it is all skipped), though it doesn't even print the skip message.

So ok with putting this in if you can confirm that it works locally for you?

cc @unutbu can you give a check?

@lgautier
Copy link
Contributor

lgautier commented Jan 6, 2015

@jseabold

As a prior though, I assume pandas development moves a bit faster, but that's mostly a guess.

Ya ya pandas is more and better in any possible and conceivable way [yawn]
;-)

My mercurial/bitbucket-fu is also super weak (so far), so it's easier for me to make a pandas PR :)

1- Go to https://bitbucket.org/rpy2/rpy2
2- Click "Fork" (left side, 5th icon)
3- Edit in the web browser
4- Submit a pull request

levels = np.asarray(obj.levels)
values = np.asarray(obj)
if com.is_float_dtype(values):
mask = np.isnan(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jseabold: I don't see a test exercising this code path (when is_float_dtype(values) is True). Can you explain its purpose and/or perhaps add a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was copy-pasted from the existing code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jseabold: Sorry, my bad. I see now that that code was added due to
GH #1615. Unfortunately, there appears to be a regression when
factors_as_strings=False:

from pandas.rpy.common import load_data
prestige = load_data('Prestige', 'car', factors_as_strings=False)

raises ValueError: codes need to be between -1 and len(categories)-1.

And the issue also affects factors_as_strings=True, where

prestige = load_data('Prestige', 'car', factors_as_strings=True)

raises IndexError: index 2147483647 is out of bounds for axis 0 with size 3

@unutbu
Copy link
Contributor

unutbu commented Feb 5, 2015

A use case for this PR: http://stackoverflow.com/q/28353937/190597

@jreback jreback mentioned this pull request Mar 6, 2015
5 tasks
@jreback
Copy link
Contributor

jreback commented Mar 8, 2015

let's redirect to rpy2 for these types of conversions, see #9602

@jreback jreback closed this Mar 8, 2015
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Mar 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants