
MAPE for data with multiple dims may be inconsistent with splitting the data by axis=0 and calculating MAPE separately. #297

Closed
dding3 opened this issue Apr 23, 2021 · 2 comments · Fixed by intel/BigDL#3858

Comments

@dding3

dding3 commented Apr 23, 2021

It can be reproduced by:

import numpy as np
from zoo.automl.common.metrics import MAPE  # assuming the MAPE from zoo.automl.common.metrics

# per-dimension MAPE: slice along axis 1 and evaluate each slice separately
loss_sum_mape = []
pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")
for i in range(3):
    pred_i_test = pred_res_test[:, i, :]
    real_i_test = real_res_test[:, i, :]
    mape_test = MAPE(pred_i_test, real_i_test)
    loss_sum_mape.append(mape_test)

# MAPE over the full array in a single call
loss_sum_mape_2 = MAPE(pred_res_test, real_res_test)
print("MAPE test: ", loss_sum_mape)
print("MAPE test together: ", loss_sum_mape_2)

loss_sum_mape is [array([4.8204937], dtype=float32), array([8.793998], dtype=float32), array([9.444209], dtype=float32)]
while loss_sum_mape_2 is [2.8503518 5.209334 5.52606 ]

pred.npy and real.npy have been uploaded to
arda@clx_gateway:~/ding/nokia/baidu_traffic$ ls *.npy
pred.npy real.npy

We use the numpy built-in np.mean function to calculate MAPE. By default np.mean accumulates in the input's dtype (float32 here); if we force np.mean to use np.float64, loss_sum_mape_2 becomes very close to loss_sum_mape (a sketch of this check follows the result below).
[4.82048401 8.79399453 9.44423784]
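
A minimal sketch of that float64 check, assuming a MAPE of the usual form mean(|y_true - y_pred| / y_true) * 100 (the helper below is illustrative, not the zoo implementation):

import numpy as np

pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")

def mape_f64(y_true, y_pred):
    # Illustrative helper: force np.mean to accumulate in float64 so the
    # per-slice and whole-array results agree.
    return np.mean(np.abs((y_true - y_pred) / y_true),
                   axis=0, dtype=np.float64) * 100

per_dim = [mape_f64(real_res_test[:, i, :], pred_res_test[:, i, :]) for i in range(3)]
together = mape_f64(real_res_test, pred_res_test)
print("per dim:  ", per_dim)
print("together: ", together)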

@TheaperDeng

TheaperDeng commented Apr 25, 2021

This is an alarming case.

In short

  1. This problem is caused by the inconsistent behavior of numpy's sum algorithm for contiguous vs. non-contiguous memory. A similar issue has already been opened against numpy (numpy.mean along multiple axis gives wrong result for large arrays numpy/numpy#8869)
  2. If you are calling zoo.automl.common.metrics.MAPE, the argument order should be MAPE(y_true, y_pred) (see the snippet after this list)
  3. Maybe we should force metric calculation to Float64 to alleviate this issue? @yushan111
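
For reference, with the argument order from point 2, the reproduction above would be called like this (only the call order changes; the precision behavior from point 1 is unaffected):

import numpy as np
from zoo.automl.common.metrics import MAPE

pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")

# y_true (real) comes first, y_pred second
per_dim = [MAPE(real_res_test[:, i, :], pred_res_test[:, i, :]) for i in range(3)]
all_dims = MAPE(real_res_test, pred_res_test)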

In detail

  1. IEEE float32 covers a large range, but it cannot represent every real number within that range. The representable values are densest near 0 and become sparser as the magnitude grows.
  2. For non-contiguous memory, numpy's sum falls back to a naive accumulation, essentially sum_val += a[i] for i in range(length), whereas contiguous reductions use pairwise summation. The naive loop loses more and more precision as the running sum grows large relative to the individual terms (see the comparison after the example below).
  3. Here is a clearer case:
>>> test_array = np.ones((20000000, 4, 1), dtype=np.float32)
>>> np.mean(test_array[:,0,:], axis=0)
array([1.], dtype=float32)
>>> np.mean(test_array, axis=0)
array([[0.8388608],
       [0.8388608],
       [0.8388608],
       [0.8388608]], dtype=float32)
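
For comparison, the same array gives the expected result once the reduction accumulates in float64, or once it runs over the contiguous last axis and therefore takes numpy's pairwise-summation path (a quick check, not a proposed fix):

>>> np.mean(test_array, axis=0, dtype=np.float64)
array([[1.],
       [1.],
       [1.],
       [1.]])
>>> # reducing along the contiguous last axis uses pairwise summation
>>> np.mean(test_array.transpose(1, 2, 0).copy(), axis=-1)
array([[1.],
       [1.],
       [1.],
       [1.]], dtype=float32)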

@shanyu-sys
Contributor

For "MAPE" or other metrics including percentage, moving the *100 out of np.mean() might help with the issue.
And sure we can convert the data to float64.
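
A minimal sketch of what that could look like, assuming the current implementation is essentially np.mean(np.abs((y_true - y_pred) / y_true) * 100) (illustrative only, not the actual zoo.automl code):

import numpy as np

def mape(y_true, y_pred):
    # Illustrative sketch: cast to float64 and keep the summed values small
    # by applying the *100 scaling only after the mean.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.mean(np.abs((y_true - y_pred) / y_true), axis=0) * 100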
