[Clarification Chapter 3] FutureWarning: elementwise comparison failed; returning scalar instead #583

Open
vedanthv opened this issue Aug 3, 2022 · 3 comments

vedanthv commented Aug 3, 2022

I got this warning while running the Binary Classifier (5 Detector) code from Chapter 3, specifically when creating the target vectors for the 5s on the train and test sets.

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

This warning prevents me from running the SGDClassifier in the next code block of the book/Jupyter notebook, since y is not a 1D array.

I also noticed that the same warning is still open as an issue in the NumPy and pandas repositories.

I'm using the versions mentioned in the readme of this repository.

Any help regarding this is appreciated. If a similar issue exists, please leave a comment and I'll close this.

Thanks!


ian-coccimiglio commented Aug 8, 2022

Hi Vedanthv,

I noticed this too. The problem seems to occur because "y_train" is created with dtype "object" and holds the digits as strings. The condition "y_train == 5" then compares those strings with an integer, which never matches, so every element comes back False. Below, we can see that the first element ought to be True.

>>> X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
>>> y_train
array(['5', '0', '4', ..., '5', '6', '8'], dtype=object)

>>> y_train == 5
array([False, False, False, ..., False, False, False])

My solution was to cast y_train to integers before the comparison. In the prediction step I also reshape some_digit, since predict() expects a 2D array of shape (n_samples, n_features).

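>>> from sklearn.linear_model import SGDClassifier
>>> X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
>>> # Cast the string labels to integers so that "== 5" can match
>>> y_train, y_test = y_train.astype(int), y_test.astype(int)
>>> y_train
array([5, 0, 4, ..., 5, 6, 8])

>>> y_train_5 = (y_train == 5)
>>> y_test_5 = (y_test == 5)
>>> y_train_5
array([ True, False, False, ...,  True, False, False])

>>> sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
>>> sgd_clf.fit(X_train, y_train_5)
>>> # predict() expects a 2D array, hence the reshape of the single sample
>>> sgd_clf.predict(some_digit.reshape(1,-1))
array([ True])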
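
For anyone who wants to reproduce the mismatch without downloading MNIST, here is a minimal sketch on a tiny array. The array y_raw and its four values are made up for illustration and only stand in for the real labels; the point is just that string labels never equal the integer 5, while the cast labels do.

```python
import numpy as np

# Hypothetical stand-in for the MNIST labels: digits stored as strings
# in an object array, as in the y_train output shown above.
y_raw = np.array(["5", "0", "4", "5"], dtype=object)

# Comparing strings with the integer 5 never matches.
mask_wrong = (y_raw == 5)

# Casting to an integer dtype first makes the comparison meaningful.
y_int = y_raw.astype(np.uint8)
mask_right = (y_int == 5)

print(np.any(mask_wrong))   # no label matches before the cast
print(mask_right)           # boolean mask after the cast
```

The same cast applies to y_test before building y_test_5.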

Author

vedanthv commented Aug 8, 2022

Hi Ian,
Thanks for the clarification! This fixed the problem.

Owner

ageron commented Sep 26, 2022

Thanks for your question @vedanthv, and thanks for the solution @ian-coccimiglio!
It's indeed important to cast the labels to integers. The book includes this line at the bottom of page 86: y = y.astype(np.uint8).
Also, since the book was published, fetch_openml() changed: it used to return NumPy arrays, but now it returns Pandas DataFrames. This breaks some of the code in the notebooks. Luckily there's an easy fix: just set as_frame=False when calling fetch_openml() and everything should work fine.
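
Putting the two fixes together, a loading step could look roughly like the sketch below. The helper names prepare_labels and load_mnist are hypothetical and not from the book or the notebooks; the dataset name "mnist_784" and version=1 are assumed to match what the notebook uses. The download only happens when load_mnist() is actually called.

```python
import numpy as np


def prepare_labels(y):
    """Cast digit labels (possibly strings) to uint8, as the book does."""
    return np.asarray(y).astype(np.uint8)


def load_mnist():
    """Fetch MNIST as NumPy arrays and return (X, y) with integer labels.

    Assumption: the dataset is published on OpenML as "mnist_784", version 1.
    """
    from sklearn.datasets import fetch_openml

    # as_frame=False asks for NumPy arrays instead of pandas objects.
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X, y = mnist["data"], mnist["target"]
    return X, prepare_labels(y)


# Quick check of the cast on a tiny made-up label array (no download needed).
sample = prepare_labels(np.array(["5", "0", "4"], dtype=object))
print(sample == 5)
```

After X, y = load_mnist(), the split and the y_train == 5 comparison from the chapter should behave as described in the book.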

Btw, the third edition of the book will come out in October 2022, and the updated notebooks are available at https://github.com/ageron/handson-ml3

Hope this helps!
