-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pandas.dataframe.values floating number changes automatically #21885
Comments
I suspect the discrepancies are rooted in the Excel engine. Reading this as a CSV does not have that rounding problem as you describe. |
Actually, that assessment was incorrect. The issue persists even with CSV. The crux of the problem actually lies with our internal converter to float (or numeric in general): import pandas.util.testing as tm
import pandas._libs.lib as lib
import numpy as np
inp = np.array(["0.014", "0.984"], dtype=object)
exp = np.array([0.014, 0.984])
tm.assert_numpy_array_equal(lib.maybe_convert_numeric(inp, set(), False), exp)
...
AssertionError: numpy array are different
numpy array values are different (100.0 %)
[left]: [0.013999999999999999, 0.9840000000000001]
[right]: [0.014, 0.984] |
@gfyoung pandas/pandas/_libs/src/parser/tokenizer.c Lines 1532 to 1534 in bb43726
which does a fine job of evaluating the string but the issue here is it's not evaluating it exactly as python (or numpy) is evaluating it. pandas/pandas/tests/dtypes/test_inference.py Lines 398 to 405 in bb43726
This seems like a fairly straightforward fix of replacing As a side note, the almost identical pandas/pandas/_libs/src/parse_helper.h Lines 148 to 151 in bb43726
|
Weird...I think it would be good to see if we could unify the two.
I'm not sure what the reason was implementing our own. However, I would encourage you to investigate the consequences of doing so, both from an accuracy and performance perspective. |
Have there been any progress on this issue? I'm using Pandas 1.1.2 |
It seems like I found a solution for this, at least when using the read_csv functionality. Not sure if this solves the original issue that @snowflake01986 had though. |
input_csv = StringIO('''
0.984, 1.05153
0.0134, 1.05152
''')
df = pandas.read_csv(input_csv, header=None)
print(df)
for index, row in df.iterrows():
print(row[0], row[1])
print('pandas version: ', pandas.__version__) The above code running on https://colab.research.google.com/ shows some of the the original numbers being represented by a large number of decimal places after loading into a dataframe. The output is: 0 1
0 0.9840 1.05153
1 0.0134 1.05152
0.9840000000000001 1.0515299999999999
0.0134 1.05152
pandas version: 1.1.5 When the code is executed using a forked version of pandas source version 1.3.3 the problem does not appear. The output is: 0 1
0 0.9840 1.05153
1 0.0134 1.05152
0.984 1.05153
0.0134 1.05152
pandas version: 1.3.3 The same result is produced when reading the four numbers from a .xlsx file. It appears that the issue has been fixed. |
Code Sample, a copy-pastable example if possible
test.xlsx
Problem description
I loaded a pandas dataframe from the attached test.xlsx, of which the content is as follows:
name c1 c2
0 r1 0.014 0.000-0.054
1 r2 0.984 0.025-1.785
As we can see, the c1 columns has been well rounded. For some reasons, I needed only the values numpy.darray, but the floating precision expands undesirably and changes a little as follows:
array([['r1', 0.013999999999999999, '0.000-0.054'],
['r2', 0.9840000000000001, '0.025-1.785']], dtype=object)
what is odd is that I have some other similar tables which resulted in the expected results. So this really beyond me.
Expected Output
What I wanted was the perfectly correspondance of dataframe:
array([['r1', 0.0134, '0.000-0.054'],
['r2', 0.984, '0.025-1.785']], dtype=object)
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: zh_CN.utf8
LANG: en_US.UTF-8
LOCALE: zh_CN.UTF-8
pandas: 0.19.1
nose: None
pip: 10.0.1
setuptools: 26.1.1
Cython: None
numpy: 1.13.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.7.3
lxml: None
bs4: 4.5.1
html5lib: 1.0b10
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: