Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-package] Change default missing value to NaN for better alignment. #11225

Merged
merged 1 commit into from
Feb 12, 2025

Conversation

ayoub317
Copy link
Contributor

@ayoub317 ayoub317 commented Feb 9, 2025

#11221
This PR updates the default missing value to NaN for consistency in the JVM package and across bindings.

In the JVM package, the DMatrix is sometimes instantiated with the missing value set to 0 (reference) and sometimes as NaN (reference).
This inconsistency occurs only in the Java part. In Scala, the missing value is always set to NaN (reference).

@trivialfis
Copy link
Member

cc @wbo4958 .

@trivialfis
Copy link
Member

Hi, thank you for the PR. I'm curious about the impact on the spark package and its sparse vector.

@wbo4958
Copy link
Contributor

wbo4958 commented Feb 11, 2025

LGTM.

@ayoub317
Copy link
Contributor Author

ayoub317 commented Feb 11, 2025

@trivialfis Very interesting point ! That was also the first thing I considered when making the PR.
In my opinion, there is no impact, users will continue to get the same results as long as they follow the same processing steps for both training and inference.
By the way, I think many people are unaware of how sparse vectors are handled. They often use the Vector Assembler and sparse vectors naively, without realizing that zeros are treated as missing values (even outside of Spark like a scipy sparse matrix).
The JVM package addressed this before with the allowNonZeroForMissing parameter (introduced here). I believe the same protection is now in place by requiring users to explicitly set the missing value when using sparse vectors.
While reviewing the doc update regarding this, it was quite clear. However I couldn’t find any specific documentation on PySpark XGBoost.
What do you think about implementing a similar safeguard in PySpark XGBoost like a warning or error for such cases? Also would you be interested in starting some documentation for PySpark XGBoost? I’d be happy to draft an initial version.

@trivialfis
Copy link
Member

What do you think about implementing a similar safeguard in PySpark XGBoost like a warning or error for such cases

Sounds good!

Also would you be interested in starting some documentation for PySpark XGBoost?

Yes, feel free to ping me if there's anything I can help.

@trivialfis trivialfis merged commit 8fc48d0 into dmlc:master Feb 12, 2025
55 of 57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants