TL;DR

I am seeing that, due to rounding errors, the score calculated by xgboost for a tree ensemble does not match the one expected from manually inspecting the tree model.
Background
We have our own implementation of tree scoring. While comparing the score that our library generates for a given tree model with the score (prediction) that xgboost produces for the same tree model, we find that, due to rounding errors somewhere in the tree traversal, the scores do not match.
Using some training data we trained a tree ensemble with xgboost, which outputs the following model:

TREE MODEL

For this test data:

TEST DATA
Xgboost score = 0.401647
Score with our own library = 0.4295020650820223
(NOTE: the scores of the individual trees are summed, and score = sigmoid(sum of scores from each tree).)
For the line marked "different branching" one can deduce that our library evaluates the condition as false, and hence ends up with -0.0398532 as the score of the third tree.

Based on the score generated by xgboost one can deduce that xgboost evaluates this same condition as true, and ends up with -0.154575 as the score of the third tree.
which usually means that the float split values are represented with 6 significant digits, as you can see in your example. And the default rounding, if I remember correctly, is towards zero.
You might try the following hack in order to see more digits in the split value: add fo.precision(18); after that line and rebuild.
Thanks @khotilov, that helped. Would it be a good idea to set the precision to the highest precision of float in xgboost, so as to avoid discrepancies between what xgboost uses as the split value/score and what other libraries consuming xgboost's output use during scoring? If so, I can create a pull request.
The same would be required for the prediction value as well.