[jvm-package] Change default missing value to NaN for better alignment. #11225

ayoub317 · 2025-02-09T18:17:39Z

#11221
This PR updates the default missing value to NaN for consistency in the JVM package and across bindings.

In the JVM package, the DMatrix is sometimes instantiated with the missing value set to 0 (reference) and sometimes with it set to NaN (reference).
This inconsistency occurs only in the Java part. In Scala, the missing value is always set to NaN (reference).
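
To illustrate why the two defaults can produce different results, here is a minimal, library-free Python sketch. It is not XGBoost code: `present_entries` is a hypothetical helper, and the rule that NaN is always dropped regardless of the sentinel is an assumption of this toy model.

```python
import math


def present_entries(row, missing):
    """Return {column index: value} for entries that are not missing.

    Toy model (assumption): a value is missing if it is NaN or equal to
    the `missing` sentinel. When `missing` is NaN, `v == missing` is
    always False, so only the NaN check applies.
    """
    def is_missing(v):
        return math.isnan(v) or v == missing

    return {i: v for i, v in enumerate(row) if not is_missing(v)}


row = [1.0, 0.0, float("nan"), 3.5]

# With missing=0, the genuine zero in column 1 is dropped.
with_zero_missing = present_entries(row, 0.0)

# With missing=NaN, the zero is kept as a real feature value.
with_nan_missing = present_entries(row, float("nan"))
```

Under this model the same row exposes a different set of features to the model depending on the sentinel, which is the kind of mismatch reported in #11221.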

trivialfis · 2025-02-10T07:34:08Z

cc @wbo4958 .

trivialfis · 2025-02-11T08:54:17Z

Hi, thank you for the PR. I'm curious about the impact on the spark package and its sparse vector.

wbo4958 · 2025-02-11T11:46:47Z

LGTM.

ayoub317 · 2025-02-11T20:00:05Z

@trivialfis Very interesting point! That was also the first thing I considered when making the PR.
In my opinion there is no impact: users will continue to get the same results as long as they follow the same processing steps for both training and inference.
By the way, I think many people are unaware of how sparse vectors are handled. They often use the VectorAssembler and sparse vectors naively, without realizing that zeros are treated as missing values (this also applies outside of Spark, for example with a SciPy sparse matrix).
The JVM package addressed this before with the allowNonZeroForMissing parameter (introduced here). I believe the same protection is now in place by requiring users to explicitly set the missing value when using sparse vectors.
The doc update regarding this was quite clear when I reviewed it. However, I couldn't find any specific documentation on PySpark XGBoost.
What do you think about implementing a similar safeguard in PySpark XGBoost like a warning or error for such cases? Also would you be interested in starting some documentation for PySpark XGBoost? I’d be happy to draft an initial version.
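
As a rough idea of what such a safeguard could look like, here is a minimal Python sketch. It is only an illustration under stated assumptions: `check_sparse_missing` and its parameters are hypothetical names, not part of PySpark XGBoost, and the policy shown (warn when sparse input is used without an explicit `missing`) is one possible design, not existing behaviour.

```python
import warnings


def check_sparse_missing(has_sparse_rows, missing=None):
    """Hypothetical guard for sparse input.

    Warns when the input contains sparse vectors and the caller did not
    set `missing` explicitly (modelled here as `missing is None`).
    Returns True if a warning was issued, False otherwise.
    """
    if has_sparse_rows and missing is None:
        warnings.warn(
            "Sparse vectors omit zeros, so zeros will be treated as "
            "missing; set `missing` explicitly to confirm this is intended.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Collect warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_default = check_sparse_missing(True)        # sparse, no explicit missing
    warned_explicit = check_sparse_missing(True, 0.0)  # sparse, explicit missing
    warned_dense = check_sparse_missing(False)         # dense input
```

Raising an error instead of warning would be stricter; which of the two fits better is part of the question above.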

trivialfis · 2025-02-12T11:18:14Z

> What do you think about implementing a similar safeguard in PySpark XGBoost like a warning or error for such cases?

Sounds good!

> Also would you be interested in starting some documentation for PySpark XGBoost?

Yes, feel free to ping me if there's anything I can help with.

Change default missing value to NaN for better alignment

f8e8233

ayoub317 mentioned this pull request Feb 9, 2025

java predict results are different from python predict results by loading the same model #11221

Closed

wbo4958 approved these changes Feb 11, 2025

trivialfis approved these changes Feb 12, 2025

trivialfis merged commit 8fc48d0 into dmlc:master Feb 12, 2025
55 of 57 checks passed
