
pandas.dataframe.values floating number changes automatically #21885

Open
snowflake01986 opened this issue Jul 13, 2018 · 7 comments
Labels
Bug Internals Related to non-user accessible pandas implementation

Comments


snowflake01986 commented Jul 13, 2018

Code Sample, a copy-pastable example if possible

test.xlsx

# Your code here
import pandas as pd
df = pd.read_excel('test.xlsx', 'test1', header=0, index_col=None)
print(df.values)

Problem description

I loaded a pandas dataframe from the attached test.xlsx, whose content is as follows:

  name     c1           c2
0   r1  0.014  0.000-0.054
1   r2  0.984  0.025-1.785

As you can see, the c1 column is nicely rounded. For my purposes I needed only the values as a numpy.ndarray, but the floating-point precision expands undesirably and the values change slightly:

array([['r1', 0.013999999999999999, '0.000-0.054'],
       ['r2', 0.9840000000000001, '0.025-1.785']], dtype=object)
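For context (not part of the original report): the two c1 values in the array above differ from the correctly rounded doubles by exactly one ulp, which a few lines of Python can show. The literals below are copied from the output above.

```python
bad = 0.013999999999999999    # the value the parser produced
good = float("0.014")         # the correctly rounded double for "0.014"

print(bad == good)   # False: these are two distinct doubles
print(good - bad)    # one ulp at this magnitude, about 1.7e-18
print(repr(good))    # '0.014' -- Python's repr round-trips exactly
```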

What is odd is that I have some other similar tables which produce the expected results, so this is really beyond me.

Expected Output

What I wanted was output corresponding exactly to the dataframe:

array([['r1', 0.014, '0.000-0.054'],
       ['r2', 0.984, '0.025-1.785']], dtype=object)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: zh_CN.utf8
LANG: en_US.UTF-8
LOCALE: zh_CN.UTF-8

pandas: 0.19.1
nose: None
pip: 10.0.1
setuptools: 26.1.1
Cython: None
numpy: 1.13.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.7.3
lxml: None
bs4: 4.5.1
html5lib: 1.0b10
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@gfyoung gfyoung added Numeric Operations Arithmetic, Comparison, and Logical operations IO Excel read_excel, to_excel labels Jul 13, 2018
@gfyoung
Member

gfyoung commented Jul 13, 2018

I suspect the discrepancies are rooted in the Excel engine. Reading the same data from a CSV does not exhibit the rounding problem you describe.

@gfyoung
Member

gfyoung commented Nov 20, 2018

Actually, that assessment was incorrect. The issue persists even with CSV. The crux of the problem lies in our internal converter to float (or numeric in general):

import pandas.util.testing as tm
import pandas._libs.lib as lib
import numpy as np

inp = np.array(["0.014", "0.984"], dtype=object)
exp = np.array([0.014, 0.984])

tm.assert_numpy_array_equal(lib.maybe_convert_numeric(inp, set(), False), exp)
...
AssertionError: numpy array are different

numpy array values are different (100.0 %)
[left]:  [0.013999999999999999, 0.9840000000000001]
[right]: [0.014, 0.984]
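A note for readers (not from the thread): the same converter should be reachable through the public API, since pd.to_numeric on object-dtype strings goes through maybe_convert_numeric internally. On versions affected by this bug it shows the same one-ulp drift; on fixed versions it returns the correctly rounded values.

```python
import pandas as pd

# Object-dtype strings force to_numeric through the internal
# maybe_convert_numeric path exercised in the repro above.
s = pd.Series(["0.014", "0.984"], dtype=object)
out = pd.to_numeric(s)
print(out.tolist())  # affected versions: [0.013999999999999999, 0.9840000000000001]
```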

@gfyoung gfyoung added the Internals Related to non-user accessible pandas implementation label Nov 20, 2018
@WillAyd WillAyd added IO Data IO issues that don't fit into a more specific label and removed IO Excel read_excel, to_excel labels Jan 21, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Jan 21, 2019
@david-liu-brattle-1
Contributor

david-liu-brattle-1 commented Feb 2, 2019

@gfyoung
The string parsing functions seem to call a custom-built xstrtod function:

double xstrtod(const char *str, char **endptr, char decimal, char sci,
               char tsep, int skip_trailing) {
    double number;

which does a fine job of evaluating the string, but it does not evaluate it exactly as Python (or numpy) does: float('0.014') == 0.014 == np.fromstring(b'0.014', sep=' ')[0], yet xstrtod('0.014') != 0.014. For consistency's sake, I think a number read in by pandas as a string should be evaluated and written back out as the same number (currently 0.014 is written back out as 0.0139999999... after being evaluated). It's a fluke that this issue isn't being picked up by any of the tests. For example, "0.014" would make the following test fail if it were in the array:

def test_maybe_convert_numeric_post_floatify_nan(self, coerce):
    # see gh-13314
    data = np.array(['1.200', '-999.000', '4.500'], dtype=object)
    expected = np.array([1.2, np.nan, 4.5], dtype=np.float64)
    nan_values = {-999, -999.0}
    out = lib.maybe_convert_numeric(data, nan_values, coerce)
    tm.assert_numpy_array_equal(out, expected)

This seems like a fairly straightforward fix: replace xstrtod with the Python float evaluator PyOS_string_to_double from Python.h, unless there is a good reason to stick with the original xstrtod?
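To make the double rounding concrete (not from the thread): below is a hypothetical Python re-creation of xstrtod's strategy, i.e. accumulate all digits into one float, then undo the decimal exponent with power-of-ten factors built by repeated squaring. Every division can round, so the error compounds, whereas float() parses the whole string in one correctly rounded step.

```python
def xstrtod_like(s: str) -> float:
    """Illustrative sketch of xstrtod's digit-accumulate-then-scale approach."""
    int_part, _, frac_part = s.partition(".")
    number = 0.0
    for ch in int_part + frac_part:   # accumulate every digit into one float
        number = number * 10.0 + (ord(ch) - ord("0"))
    n = len(frac_part)                # decimal exponent to undo
    p10 = 10.0
    while n:                          # scale down by 10**n via repeated
        if n & 1:                     # squaring; each division may round,
            number /= p10             # so roundings can compound
        p10 *= p10
        n >>= 1
    return number

print(xstrtod_like("0.014") == float("0.014"))   # False: two roundings (/10, /100)
print(xstrtod_like("1.200") == float("1.200"))   # True: both divisions happen to be exact enough
```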

As a side note, the almost identical xstrtod is defined again in parser_helper.h, which seems like an oversight.

static double xstrtod(const char *str, char **endptr, char decimal, char sci,
                      int skip_trailing, int *maybe_int) {
    double number;
    int exponent;

@gfyoung
Member

gfyoung commented Feb 2, 2019

As a side note, the almost identical xstrtod is defined again in parser_helper.h, which seems like an oversight.

Weird...I think it would be good to see if we could unify the two.

This seems like a fairly straightforward fix of replacing xstrtod with the Python float evaluator PyOS_string_to_double from Python.h, unless there is a good reason to stick with the original xstrtod

I'm not sure what the reason was for implementing our own. However, I would encourage you to investigate the consequences of doing so, from both an accuracy and a performance perspective.

@mroeschke mroeschke added Bug and removed IO Data IO issues that don't fit into a more specific label labels Apr 27, 2020
@seb-emmot

Has there been any progress on this issue?
I experience the same problem when importing floats from CSV data. This makes it very difficult to verify that my data transform operations work as expected.

I'm using Pandas 1.1.2

@seb-emmot


It seems I have found a solution for this, at least when using read_csv: pass float_precision='round_trip' as an argument to read_csv. Based on https://stackoverflow.com/questions/36909368/precision-lost-while-using-read-csv-in-pandas

Not sure if this solves the original issue that @snowflake01986 had though.
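For reference (not from the thread), a minimal sketch of that workaround, using made-up data:

```python
from io import StringIO
import pandas as pd

csv = StringIO("a,b\n0.984,1.05153\n0.0134,1.05152\n")

# float_precision='round_trip' makes the C parser use Python's own
# string-to-float conversion, so every parsed value round-trips exactly.
df = pd.read_csv(csv, float_precision="round_trip")
print(df["a"].tolist())   # [0.984, 0.0134]
```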

@mroeschke mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 20, 2021
@ikramersh

ikramersh commented Oct 17, 2021

from io import StringIO
import pandas

input_csv = StringIO('''
  0.984, 1.05153
  0.0134,  1.05152
''')

df = pandas.read_csv(input_csv, header=None)
print(df)
for index, row in df.iterrows():
    print(row[0], row[1])
print('pandas version: ', pandas.__version__)

The above code, running on https://colab.research.google.com/, shows some of the original numbers being represented with a large number of decimal places after loading into a dataframe. The output is:

        0        1
0  0.9840  1.05153
1  0.0134  1.05152
0.9840000000000001 1.0515299999999999
0.0134 1.05152
pandas version:  1.1.5

When the code is executed using a forked version of the pandas source at version 1.3.3, the problem does not appear. The output is:

        0        1
0  0.9840  1.05153
1  0.0134  1.05152
0.984 1.05153
0.0134 1.05152
pandas version: 1.3.3

The same result is produced when reading the four numbers from a .xlsx file.

It appears that the issue has been fixed.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022