Support pip binary wheel manylinux installations #3060
Comments
See #2166 (comment). BTW, for questions, GDAL uses the mailing list https://lists.osgeo.org/mailman/listinfo/gdal-dev |
This issue is more of a feature request for packaging than a question. |
This request for optimized, shared libs is also related to stripping binaries for a smaller footprint, see also |
Packaging GDAL is a significant effort. For Python users, while this is not equivalent to the pip experience, Conda is probably the best way to go, since it provides PROJ and GDAL binaries shared among several packages. |
This feature request is specific to pip installations using binary wheels; although the conda option is useful, it's irrelevant to resolving this specific feature request, unless something about that solution can be applied in some way to a common, shared pip installation. A pointer to the details of that solution might be helpful here. Clearly, Fiona and rasterio have already solved some of the problem of packaging libgdal for python-pip installations, so this feature request is only about providing a common denominator for any packages that consume a common binary library. A possible solution is to fork and adapt https://github.com/rasterio/rasterio-wheels for the gdal project; rasterio/Fiona might then depend on a common pip-gdal dependency that provides a binary libgdal built with https://github.com/matthew-brett/multibuild (?). (Similarly for |
I'm not sure there's an appetite on the rasterio team to collaborate on this, since they see the wheels as a competitive advantage over the current situation of gdal-python, which doesn't provide binary wheels. If there's no collaboration, that could result in some Fiona/rasterio version using one GDAL version and the gdal-python one using another, and people loading both at runtime would get clashes/crashes. |
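To make the version-mismatch concern concrete, here is a minimal Python sketch that reports which GDAL release each installed binding says it was built against. The attribute names in `CANDIDATES` are assumptions about what each package exposes and should be checked against the installed release; the helper names are hypothetical. Comparing the strings is only a rough heuristic, since packages may format versions differently.

```python
import importlib

# Assumed (module, attribute) pairs; verify against each package's documentation.
CANDIDATES = [
    ("rasterio", "__gdal_version__"),
    ("fiona", "__gdal_version__"),
    ("osgeo.gdal", "__version__"),
]


def linked_version(module_name, attr):
    """Return str(module.attr), or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    value = getattr(module, attr, None)
    return None if value is None else str(value)


def gdal_versions(candidates=CANDIDATES):
    """Collect the reported GDAL version for every candidate that is importable."""
    found = {}
    for name, attr in candidates:
        version = linked_version(name, attr)
        if version is not None:
            found[name] = version
    return found


def versions_disagree(found):
    """True if the collected packages report more than one distinct version string."""
    return len(set(found.values())) > 1


if __name__ == "__main__":
    report = gdal_versions()
    for name, version in sorted(report.items()):
        print(name, version)
    if versions_disagree(report):
        print("warning: bindings report different GDAL versions")
```

A disagreement here does not prove a crash will happen; it only flags an environment where two different libgdal builds may end up in the same process.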
My experiments on this currently lead to a post-pip install hack like (don't do this at home):
hack_shared_libs () {
    site=$1
    export GDAL_DATA="${site}/share/gdal_data"
    export PROJ_DATA="${site}/share/proj_data"
    mkdir -p "${GDAL_DATA}"
    mkdir -p "${PROJ_DATA}"
    export SHARED_LIBS="${site}/share/libs"
    mkdir -p "${SHARED_LIBS}"
    find "${site}" -type d -name 'gdal_data' | while read -r data_path; do
        if [ "$data_path" != "$GDAL_DATA" ]; then
            rsync -auq "$data_path"/ "$GDAL_DATA"/
            rm -rf "$data_path"
            ln -s "$GDAL_DATA" "$data_path"
        fi
    done
    find "${site}" -type d -name 'proj_data' | while read -r data_path; do
        if [ "$data_path" != "$PROJ_DATA" ]; then
            rsync -auq "$data_path"/ "$PROJ_DATA"/
            rm -rf "$data_path"
            ln -s "$PROJ_DATA" "$data_path"
        fi
    done
    # Updating the LD_LIBRARY_PATH can fix symbol resolution
    export LD_LIBRARY_PATH="$SHARED_LIBS:$LD_LIBRARY_PATH"
    move_to_shared_libs () {
        lib_path=$1
        if [ -d "$lib_path" ]; then
            rsync -auq "$lib_path"/ "$SHARED_LIBS"/
            rm -rf "$lib_path"
            ln -s "$SHARED_LIBS" "$lib_path"
        fi
    }
    move_to_shared_libs "$site"/rasterio.libs
    move_to_shared_libs "$site"/Fiona.libs
    move_to_shared_libs "$site"/numpy.libs
    move_to_shared_libs "$site"/pyproj/.libs
    move_to_shared_libs "$site"/shapely/.libs
    # TODO: remove this hack on shapely/geos.py
    # due to https://github.com/Toblerity/Shapely/issues/1013
    # try a hack to patch shapely/geos.py
    patch "$site"/shapely/geos.py "$SCRIPT_PATH"/patches/shapely/geos.patch
    # # To check for missing symbols, use:
    # find "$SHARED_LIBS"/ -name "*.so*" | while read lib_name; do
    #     ldd -r "$lib_name" 2>&1
    # done
}
Using it requires setting some env-vars like:
This is subsequently tested by running a project pytest suite on the modified venv site-packages that has the share/libs modifications. I'm not proposing any general use of such a hack, just noting that the experiment works (but requires a patch on shapely/geos.py to find libgeos OK). I fully get that there's possible reluctance to work on it, preferences for conda vs. pure-pip, and various competition among packages, but we all stand on the shoulders of giants in one way or another and all the packaging solutions get better in various ways, so an open mind to all the evolution is useful. I don't expect this to be resolved anytime soon and if there is some kind of perceived pressure to close issues promptly, so be it, but otherwise it might help to leave it open/unresolved. I can't go out on a limb to try to do a bunch of work myself on it unless there is support for it. While a solid solution would be ideal, all I can do is hack something for now. I don't know enough about the details of both the conda-packaging and the pip-packaging/multibuild/wheels to see a simple solution right away - open to useful pointers. While this hack is nasty, there is something appealing about the shared-libs experiment:
|
https://github.com/conda-incubator/conda-press could be interesting (although work on it seems to have stalled)
The reference to conda is not that irrelevant, in the sense that conda has been specifically designed to overcome this limitation of wheels in handling (non-python) shared libraries. As @rouault already mentioned, packaging is a big effort. Moreover, this doesn't only need someone stepping up to do the work in the actual geospatial packages (gdal, (py)proj, geos, rasterio, shapely, etc), but also in the general python packaging ecosystem. Currently, wheels are not designed to link to a specific build of another wheel to share libraries like this (you can pin versions, but not exact builds), and there is no way to know if another wheel is compatible or not. For example, the numpy and scipy wheels each include their own copy of openblas and don't share it (it's a similar problem to the one you raise here, and they haven't solved it yet). |
I've spent some time reviewing some of the links above and still need time to further review details of conda packaging, auditwheel and conda-press. I want to better understand some of the ABI issues and the so-called "thin" vs. "fat" library installs and paths, e.g. in my hacked share/libs there appear to be the same .so versions with various prefixes and no obvious way to prune them using some kind of sem-ver dep-tree solution:
As I understand it at present, the argument is that conda properly solves the shared-libs problems and pip wheels use auditwheel to avoid library conflicts. The build complexity to support all platforms and releases is massive. I guess my hope in opening this issue on this project was that it might be a common denominator for other pythonic libraries to depend on for a binary libgdal (and similarly for proj, libgeos, etc.). I don't know, but it seems like this would require something like the [1] http://manpages.ubuntu.com/manpages/trusty/man8/update-alternatives.8.html |
If libgdal-foo.so and libgdal-bar.so are loaded in the same process, this is going to crash at runtime due to -foo using some symbols of -bar, or the reverse. With DLLs, such symbol mixing seems to be less likely due to how symbol resolution works, but on Linux, such .so hell happens in practice. |
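One way to see whether a process has actually loaded more than one libgdal is to inspect its memory map. The sketch below is Linux-only and assumes the usual six-column layout of `/proc/<pid>/maps`; the function name and the `libgdal` prefix filter are illustrative choices, not part of any package's API.

```python
import os


def mapped_libraries(maps_text, stem):
    """Return the sorted, distinct shared-object paths in a /proc/<pid>/maps
    dump whose file name starts with `stem` (for example "libgdal")."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        if name.startswith(stem) and ".so" in name:
            paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # Import the packages of interest first (e.g. rasterio, fiona), then inspect.
    try:
        with open("/proc/self/maps") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for path in mapped_libraries(text, "libgdal"):
        print(path)
```

Seeing two different paths only shows that two copies are mapped; whether their symbols actually collide depends on how each copy was built and loaded.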
The current practice in the rasterio and fiona wheels is to package .so libs within subdirectories of those site-packages, with some filename identifiers for those .so files that attempt to "isolate" the libs from each other. Is it true that loading rasterio and fiona wheels in the same process, which would try to load symbols from separate .so files, could lead to in-memory symbol mixing? (If so, is that an important argument or a necessary practice for using common shared libs? What happens with conda, Debian or other OS installations that install multiple versions of the gdal lib, or is that not possible without symbol conflicts?). Part of the motivation for this issue is to reduce package sizes, but symbol resolution is paramount and AFAIK the current wheel builds (auditwheel) patch a few things to avoid conflicts. I have not yet done a closer study of auditwheel to understand the requirements and patches it applies (I can only assume it works). |
It would be great to see this issue revived. Installing downstream packages like fiona (in order to install geopandas) that do not supply gdal binaries is a slightly frustrating experience when one is used to working in python virtual environments using pip-compile for dependency/version resolution. I'm aware of the conda package, but can't use it in our workflow since we don't rely on conda environments. What is the advantage conda has here for packaging that can't be reproduced in pip? I was surprised to see that rasterio provides the gdal binary as part of their installation. Considering that there will undoubtedly be more packages in the future that rely on gdal, wouldn't it be good practice to provide a pip-installable installation for them? |
@thomasaarholt having python gdal wheels will not provide the GDAL library for fiona.
rasterio provides gdal libs, but for its own internal use, not the binary. |
Conda has the fundamental advantage of being a general package manager, not just a python package manager. So with conda you can install gdal itself, and install python packages that depend on this gdal package. See my comment above #3060 (comment) for some more details on that.
Note that fiona does supply binaries for linux and mac, only not for windows (which is of course an important missing piece, but just to point out). |
Oh my god. I'm an idiot-ish. First off, thanks for your prompt replies, they make sense! On where I was an idiot: I recently switched to an M1 mac, on which I've been trying to install geopandas in a docker image running debian linux. Now this debian image runs on "native" arm architecture. And there (currently) aren't fiona binaries for arm architecture linux... (I'll write an issue on their github). Now that itself isn't particularly idiot-ish, except that this is the second time in a week I have had (and had diagnosed for me) this problem. The other one was polars (pandas alternative). |
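The situation described above (no binary wheel for the container's architecture, so pip falls back to a source build) can be checked quickly by comparing a wheel's platform tag with the machine type. This is a rough sketch under stated assumptions: it relies on the standard wheel filename layout, it compares only the architecture suffix and ignores the OS and libc parts of the tag, and the wheel filenames in the usage note are hypothetical. pip's real compatibility logic is more involved.

```python
import platform


def wheel_platform_tags(filename):
    """Return the platform tags encoded in a wheel filename.

    Wheel names follow name-version[-build]-python-abi-platform.whl, and the
    platform field may hold several tags joined with dots."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename layout: %r" % filename)
    return parts[-1].split(".")


def wheel_matches_arch(filename, machine):
    """Heuristic: True if any platform tag is 'any' or ends with the machine name."""
    for tag in wheel_platform_tags(filename):
        if tag == "any" or tag.endswith("_" + machine):
            return True
    return False


if __name__ == "__main__":
    # On an ARM Linux container platform.machine() typically reports "aarch64".
    print("this machine:", platform.machine())
```

For example, `wheel_matches_arch("example_pkg-1.0-cp39-cp39-manylinux2014_x86_64.whl", "aarch64")` is false, which is the case where pip has to build from source and needs GDAL headers on the system.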
@rouault on your point of using conda:
Especially the last point goes against the very spirit of the open source community. For the distribution of OSS we should not rely on a pay to use package repository - especially not as the go-to solution. |
A few responses specifically on the conda topic:
While the state of pip/wheels certainly has improved a lot, the case being discussed here (being able to depend on a GDAL python wheel for other packages such as rasterio or fiona to build against) is, whether you like it or not, something that currently cannot be done with pip, and is specifically solved by conda.
Small clarification: There are two discussions getting mixed here in this issue:
For the first issue (I am not a GDAL maintainer, but interpreting what was said before, e.g. #2166 (comment), #3060 (comment)), I think the idea of providing wheels is generally welcomed, but requires someone stepping up to do the work ("a champion to lead the effort"). The second issue is much more complex, and AFAIK not something that can be solved by GDAL but requires changes in the broader python packaging space (see my comment above #3060 (comment)). I suppose many are mostly interested in the first issue (being able to more easily install the GDAL python bindings with pip). It might be worth opening a separate issue for that to distinguish both discussions. |
a champion for bootstrapping, and then an automated process. The process of manually building binary wheels for each release wouldn't be sustainable in the long term. The nice thing with conda-forge is the automation and transparency in build recipes. As far as I know for pip binary wheels, everyone has their own custom, somewhat opaque way of generating wheels, which is tricky when you don't have direct access to the various build OSes. GDAL and its ecosystem (the fact that there are several python packages that share the same binary dependencies with GDAL, things like fiona, rasterio, pyproj, pygeos) are probably among the worst candidates to work nicely with pip as it is currently. Another difficulty with GDAL is that doing binary builds of it is a trade-off of multiple factors and there's no good answer: do you want a minimal GDAL build with just a few popular drivers (good luck getting people to agree on which few popular drivers should be included)? Or a large one with ~ all (open source) dependencies? Or one with only permissive and LGPL-like dependencies? Or one with GPL dependencies? ... |
Regarding automation, in pyogrio we are quite happy with our current solution of using Github Actions with cibuildwheel (and in our case also using vcpkg to build GDAL, this might be different for GDAL itself if you already have build scripts for each platform): https://github.com/geopandas/pyogrio/blob/f16009e26bc9982a531bc4f8570fdf6d6dfa6829/.github/workflows/release.yml#L81-L182 (and fiona recently adopted this as well for providing windows wheels) |
Regarding the dependencies, to some extent, this can be solved by offering plugins, see my efforts here: Which can be easily installed alongside the unofficial windows binary wheels: |
I have configured pipelines for building binary wheels at https://gitlab.com/mentaljam/gdal-wheels. Wheels are published to GitLab's Python package registry. Readme contains some usage examples. |
Do you have more details on how you set up the runners? I see that they're referenced by the self-hosted-linux and self-hosted-windows tags. |
@DruidNx, first I tried GitLab's shared Windows runner. However, it has a fixed timeout of 2 hours, which is not enough to compile all the dependencies. So, I had to configure a self-hosted shell runner on my PC. Later I configured the building of manylinux wheels and set up a self-hosted Docker runner right away. To be honest, I did not try a shared Linux runner. Maybe it will be sufficient for building GDAL. UPD: Two hours may sound like a long time, but shared runners are limited in CPU resources, which dramatically increases compilation times |
Thanks, I was able to get the wheels, but they're missing the binary tools like ogr2ogr, ogrinfo, etc. |
rasterio, shapely, pyproj etc. now provide pure-pip installations with binary wheels that provide manylinux [1] binaries for gdal, proj, geos etc.; can this project also provide binary wheel installations? e.g. see
Potentially Shared Libs
Any way it could work would be great, but it might be optimal if some common binary libs for gdal/ogr could be shared among various python libraries that require them. (The same could apply to proj/pyproj.) That is, not an OS installed shared lib, but a python manylinux [1] binary shared lib (in e.g. {project-venv-path}/lib/, alongside of {project-venv-path}/lib/python3.6).
[1] https://github.com/pypa/manylinux
What appears to be happening is that several related but independent pypi projects will each install their own copies of various possibly-shared libs, e.g. rasterio and fiona both install their own copies of gdal_data into e.g.
.../lib/python3.6/site-packages/fiona/gdal_data/*
.../lib/python3.6/site-packages/rasterio/gdal_data/*
The same applies to proj_data, i.e.
.../lib/python3.6/site-packages/fiona/proj_data/*
.../lib/python3.6/site-packages/rasterio/proj_data/*
For rasterio alone, the binary packages are substantial, e.g.
When combined with a few more related projects, there is some potential duplication of the libs and closer inspection of the lib-versions suggests there could be some inconsistency in the versions installed (without any explicit pip options to control those binary lib versions packaged), e.g.
The request is made on this project because it is very close to the C-source used by some common python wrappers and it might provide a "source of truth" about how to package some common libs using pip/manylinux wheels. (The same request could apply to the C-source for proj perhaps.) Obviously the manylinux builds would need to provide version specific builds. (I don't have a clear idea on how they would be provided as shared-libs, only that the duplications and inconsistencies observed above might benefit from some kind of shared-libs solutions -- that are not OS package solutions, despite how much work goes into those.)
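The duplication described above can be surveyed with a short script. The following Python sketch walks a site-packages directory, looks inside directories whose name ends in `.libs` (the layout mentioned in this thread, e.g. rasterio.libs or pyproj/.libs), and groups shared objects by a library stem. The stem rule (everything before the first `-` or `.`) is a simplifying assumption about how bundled libraries are named, and the function names are hypothetical.

```python
import os
import re
import sysconfig
from collections import defaultdict

# Assumed naming: "libgdal-<hash>.so.<version>" or "libgdal.so.<version>".
_STEM_RE = re.compile(r"(lib[^-.]+)")


def bundled_libs(site_packages):
    """Map library stem -> sorted paths of shared objects found directly in
    directories named '*.libs' or '.libs' below site_packages."""
    found = defaultdict(list)
    for root, _dirs, files in os.walk(site_packages):
        if not os.path.basename(root).endswith(".libs"):
            continue
        for name in files:
            match = _STEM_RE.match(name)
            if match and ".so" in name:
                found[match.group(1)].append(os.path.join(root, name))
    return {stem: sorted(paths) for stem, paths in found.items()}


def duplicated(found):
    """Keep only the stems that are bundled more than once."""
    return {stem: paths for stem, paths in found.items() if len(paths) > 1}


if __name__ == "__main__":
    site_dir = sysconfig.get_paths()["purelib"]
    for stem, paths in sorted(duplicated(bundled_libs(site_dir)).items()):
        total = sum(os.path.getsize(p) for p in paths)
        print("%s: %d copies, %d bytes" % (stem, len(paths), total))
        for path in paths:
            print("   ", path)
```

Run inside a virtual environment, this prints each library that appears in more than one bundle together with the combined size, which gives a rough measure of how much a shared installation could save.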