PERF: Construction of a DatetimeIndex from a list of Timestamp with timezone #49048
Comments
I'm seeing about 2.5ms on main. |
I installed pandas from source (version 1.6.0.dev0+319.g98323eec0e) and the issue still persists (almost 2 seconds per loop).
INSTALLED VERSIONS
commit : 98323ee
pandas : 1.6.0.dev0+319.g98323eec0e |
This has a very high impact on our code. Some of the code we develop has gone from 100 seconds to 4000 seconds just by changing from pandas 1.4.4 to > 1.5. Our profiling shows that the function \pandas\core\arrays\datetimes.py:1980(_sequence_to_dt64ns) spends 18 seconds on average in 1.5, but only 0.003 seconds in 1.4.4. I too can confirm that this is not an issue on Linux (Debian) but is on Windows 11. The issue also seems to be present in non-virtual environments. |
Do the profiling results point to anything more specific? |
After looking more closely at the results, it looks like nt.stat is called roughly 10 000 times more in 1.5 than in 1.4 for one of our tests. Below is an excerpt of cProfile's output sorted on total time. This is the same run where I got the 18 seconds on average per call for _sequence_to_dt64ns. I am still a newbie at reading cProfile's results, so let me know if there is anything I could do to provide additional information.
edit: tried to make the table pretty |
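For readers who want to produce a comparable report, here is a minimal sketch of collecting cProfile statistics sorted by total time. The helper name `profile_by_tottime` is hypothetical and not something used in this thread; it relies only on the standard library `cProfile` and `pstats` modules.

```python
import cProfile
import io
import pstats


def profile_by_tottime(func, limit=10):
    """Run func() under cProfile and return a report sorted by total time.

    Hypothetical helper for illustration; not part of pandas.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is time spent in the function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with the code being investigated.
    print(profile_by_tottime(lambda: sum(range(100_000))))
```

To profile the case from this issue, the callable could build the index, for example `lambda: pd.DatetimeIndex([pd.Timestamp.now(tz="Europe/Paris")] * 1000)`, assuming pandas is imported as `pd`.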
can you use |
Attached is the entire output from two different runs with 1.4.4 and 1.5.1 respectively. |
@pganssle it looks like zoneinfo load_tzdata is taking something like 93% of the runtime here. any idea what's going on? |
@jbrockmendel Are you doing something weird to load the ZoneInfo objects? I don't really understand why load_tzdata is being called 118k times, since you should be hitting the cache after the first one. Do you see the same issue if you add Edit Looking at this more closely, I don't actually see why @bernardkaplan What happens if you install the |
@pganssle, I installed tzdata and now the run times are normal. See attached cProfile output. |
python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
The code above speeds up from 2 seconds to 3 milliseconds after installing tzdata. |
OK, I think the problem is probably this. If you don't have zoneinfo data, It tries to construct You can probably get closer to ideal by doing something like this (the line starting
cdef inline bint is_utc_zoneinfo(tzinfo tz):
    # Workaround for cases with missing tzdata
    # https://github.com/pandas-dev/pandas/pull/46425#discussion_r830633025
    if tz is None or zoneinfo is None or not isinstance(tz, ZoneInfo):
        return False
    global utc_zoneinfo
    if utc_zoneinfo is None:
        try:
            utc_zoneinfo = ZoneInfo("UTC")
        except zoneinfo.ZoneInfoNotFoundError:
            return False
        # Warn if tzdata is too old, even if there is a system tzdata to alert
        # users about the mismatch between local/system tzdata
        import_optional_dependency("tzdata", errors="warn", min_version="2022.1")
    return tz is utc_zoneinfo
You can also make this a bit more targeted by moving the Current users can install |
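The idea in the snippet above can be approximated in pure Python. The sketch below is not the pandas implementation: the class `UtcZoneInfoCheck` and its `loader` and `zone_type` parameters are hypothetical, introduced only so the behaviour can be exercised without touching real time zone data. It keeps the `isinstance` short-circuit from the suggestion and, as an additional assumption that goes beyond what the thread proposes, remembers a failed `ZoneInfo("UTC")` lookup so it is not retried on every call.

```python
from datetime import timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

_UNSET = object()    # no lookup attempted yet
_MISSING = object()  # lookup attempted and failed


class UtcZoneInfoCheck:
    """Hypothetical helper, not part of pandas."""

    def __init__(self, loader=ZoneInfo, zone_type=ZoneInfo):
        self._loader = loader        # callable taking a key such as "UTC"
        self._zone_type = zone_type  # injectable so the check can be tested
        self._utc = _UNSET

    def __call__(self, tz):
        # Short-circuit: anything that is not a ZoneInfo cannot be ZoneInfo("UTC").
        if tz is None or not isinstance(tz, self._zone_type):
            return False
        if self._utc is _UNSET:
            try:
                self._utc = self._loader("UTC")
            except ZoneInfoNotFoundError:
                # Assumption: remember the failure instead of retrying each call.
                self._utc = _MISSING
        return tz is self._utc


is_utc_zoneinfo = UtcZoneInfoCheck()
# datetime.timezone.utc is not a ZoneInfo, so no zoneinfo lookup happens here.
print(is_utc_zoneinfo(timezone.utc))  # False
```

One trade-off of remembering the failure: if tzdata becomes available later in the same process, this sketch would not notice until the process restarts.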
I confirm that installing tzdata solved the performance issue. |
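To check whether an environment is affected, a small diagnostic along these lines may help. It is only a sketch: `tzdata_status` is a hypothetical helper name, and it uses the standard library `importlib.util.find_spec` and `zoneinfo.ZoneInfo` to report whether the `tzdata` package is importable and whether a given key can be loaded at all (from either system data or the package).

```python
import importlib.util
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def tzdata_status(key="UTC"):
    """Report tzdata availability. Hypothetical helper for illustration."""
    # True if the first-party tzdata package from PyPI is importable.
    has_package = importlib.util.find_spec("tzdata") is not None
    try:
        ZoneInfo(key)
        can_load = True
    except ZoneInfoNotFoundError:
        # Neither system time zone data nor the tzdata package provided the key.
        can_load = False
    return {"tzdata_package": has_package, "can_load": can_load}


if __name__ == "__main__":
    print(tzdata_status())
```

If `can_load` is false, the environment matches the situation described in this thread, where installing tzdata restored normal run times.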
Anyone know how to reproduce this on Windows on main? I followed the contributing guide for Windows (which I hadn't done before, I was always using Linux), then did:
(.venv) PS C:\Users\User\pandas-dev> python -m timeit -n 5 "import pandas as pd; x = pd.DatetimeIndex([pd.Timestamp.now(tz='Europe/Paris')] * 1000)"
5 loops, best of 5: 1.83 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (123 msec) was more than four times slower than the best time (1.83 msec). |
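Because timeit reports only the best run and warns when the worst run is much slower, it can be useful to look at every repeat. The following is a minimal sketch using the standard library `timeit.repeat`; the helper name `per_loop_times` is hypothetical, and the workload shown is a stand-in rather than the pandas call from this thread.

```python
import timeit


def per_loop_times(func, number=5, repeat=5):
    """Return the per-loop time of each repeat, sorted from best to worst.

    Hypothetical helper for illustration.
    """
    # timeit.repeat returns one total per repeat, each covering `number` calls.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return sorted(total / number for total in totals)


if __name__ == "__main__":
    times = per_loop_times(lambda: sum(range(100_000)))
    print(f"best {times[0] * 1e3:.3f} msec, worst {times[-1] * 1e3:.3f} msec")
```

A large gap between the best and worst values usually points to one-off costs such as imports or cache warm-up in the first repeat.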
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The construction of a DatetimeIndex from a list of Timestamp with timezone appears to be slow when pandas 1.5 is installed in a virtual environment (venv) on Windows.
Here is a minimal script that runs slowly on my machine:
Remarks:
Installed Versions
INSTALLED VERSIONS
commit : 87cfe4e
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
Prior Performance