
MAPE for data with multiple dims may be inconsistent with splitting the data by axis=0 and calculating MAPE separately. #297

Closed
dding3 opened this issue Apr 23, 2021 · 2 comments · Fixed by intel/BigDL#3858

Comments

@dding3

dding3 commented Apr 23, 2021

It can be reproduced by:

import numpy as np
from zoo.automl.common.metrics import MAPE  # assuming the MAPE from zoo.automl.common.metrics

# per-dimension MAPE: slice along axis 1 and evaluate each slice separately
loss_sum_mape = []
pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")
for i in range(3):
    pred_i_test = pred_res_test[:, i, :]
    real_i_test = real_res_test[:, i, :]
    mape_test = MAPE(pred_i_test, real_i_test)
    loss_sum_mape.append(mape_test)

# MAPE over the full array in a single call
loss_sum_mape_2 = MAPE(pred_res_test, real_res_test)
print("MAPE test: ", loss_sum_mape)
print("MAPE test together: ", loss_sum_mape_2)

loss_sum_mape is [array([4.8204937], dtype=float32), array([8.793998], dtype=float32), array([9.444209], dtype=float32)]
while loss_sum_mape_2 is [2.8503518 5.209334 5.52606 ]

pred.npy and real.npy have been uploaded to
arda@clx_gateway:~/ding/nokia/baidu_traffic$ ls *.npy
pred.npy real.npy

We use the numpy built-in np.mean function to calculate MAPE. By default np.mean accumulates in the input's dtype (float32 here); if we force np.mean to use np.float64, loss_sum_mape_2 becomes very close to loss_sum_mape (a sketch of this check follows the result below).
[4.82048401 8.79399453 9.44423784]
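
A minimal sketch of that float64 check, assuming a MAPE of the usual form mean(|y_true - y_pred| / y_true) * 100 (the helper below is illustrative, not the zoo implementation):

import numpy as np

pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")

def mape_f64(y_true, y_pred):
    # Illustrative helper: force np.mean to accumulate in float64 so the
    # per-slice and whole-array results agree.
    return np.mean(np.abs((y_true - y_pred) / y_true),
                   axis=0, dtype=np.float64) * 100

per_dim = [mape_f64(real_res_test[:, i, :], pred_res_test[:, i, :]) for i in range(3)]
together = mape_f64(real_res_test, pred_res_test)
print("per dim:  ", per_dim)
print("together: ", together)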

@TheaperDeng

TheaperDeng commented Apr 25, 2021

This is an alarming case.

In short

  1. This problem is caused by the inconsistent behavior of numpy's sum algorithm for contiguous vs. non-contiguous memory. A similar issue has already been opened against numpy (numpy.mean along multiple axis gives wrong result for large arrays numpy/numpy#8869)
  2. If you are calling zoo.automl.common.metrics.MAPE, the argument order should be MAPE(y_true, y_pred) (see the snippet after this list)
  3. Maybe we should force metric calculation to Float64 to alleviate this issue? @yushan111
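
For reference, with the argument order from point 2, the reproduction above would be called like this (only the call order changes; the precision behavior from point 1 is unaffected):

import numpy as np
from zoo.automl.common.metrics import MAPE

pred_res_test = np.load("pred.npy")
real_res_test = np.load("real.npy")

# y_true (real) comes first, y_pred second
per_dim = [MAPE(real_res_test[:, i, :], pred_res_test[:, i, :]) for i in range(3)]
all_dims = MAPE(real_res_test, pred_res_test)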

In detail

  1. IEEE float32 covers a large range, but it cannot represent every real number within that range. The representable values are densest near 0 and become sparser as the magnitude grows.
  2. For non-contiguous memory, numpy's sum falls back to a naive accumulation, essentially sum_val += a[i] for i in range(length), whereas contiguous reductions use pairwise summation. The naive loop loses more and more precision as the running sum grows large relative to the individual terms (see the comparison after the example below).
  3. Here is a clearer case:
>>> test_array = np.ones((20000000, 4, 1), dtype=np.float32)
>>> np.mean(test_array[:,0,:], axis=0)
array([1.], dtype=float32)
>>> np.mean(test_array, axis=0)
array([[0.8388608],
       [0.8388608],
       [0.8388608],
       [0.8388608]], dtype=float32)
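
For comparison, the same array gives the expected result once the reduction accumulates in float64, or once it runs over the contiguous last axis and therefore takes numpy's pairwise-summation path (a quick check, not a proposed fix):

>>> np.mean(test_array, axis=0, dtype=np.float64)
array([[1.],
       [1.],
       [1.],
       [1.]])
>>> # reducing along the contiguous last axis uses pairwise summation
>>> np.mean(test_array.transpose(1, 2, 0).copy(), axis=-1)
array([[1.],
       [1.],
       [1.],
       [1.]], dtype=float32)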

@shanyu-sys
Contributor

For "MAPE" or other metrics including percentage, moving the *100 out of np.mean() might help with the issue.
And sure we can convert the data to float64.
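
A minimal sketch of what that could look like, assuming the current implementation is essentially np.mean(np.abs((y_true - y_pred) / y_true) * 100) (illustrative only, not the actual zoo.automl code):

import numpy as np

def mape(y_true, y_pred):
    # Illustrative sketch: cast to float64 and keep the summed values small
    # by applying the *100 scaling only after the mean.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.mean(np.abs((y_true - y_pred) / y_true), axis=0) * 100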
