Is latest 3.6 compiled with parquet / arrow enabled? #57

ncgl-syngenta opened this issue May 4, 2023 · 4 comments

@ncgl-syngenta

Not proficient in C++, but running into this problem when trying to run ogr2ogr with a parquet output:

import subprocess

# run() waits for ogr2ogr to exit; Popen(...).poll() returns before it has finished
cmd = ["ogr2ogr", "-f", "Parquet", "somedestination", "somelocation", "--debug", "ON"]
return_code = subprocess.run(cmd, stdout=subprocess.PIPE).returncode

This produces the following output:



b'GDAL 3.6.4, released 2023/04/17\n'
--
ERROR 1: Unable to find driver `Parquet'.
[ERROR] FileNotFoundError: somedestination
Traceback (most recent call last):
  File "/var/task/epsagon/wrappers/aws_lambda.py", line 137, in _lambda_wrapper
    result = func(*args, **kwargs)
  File "/var/task/application/v1/controller/console/test_gdal.py", line 26, in test
    df = gpd.read_parquet("somedestination")
  File "/mnt/efs/lib/geopandas/io/arrow.py", line 560, in _read_parquet
    table = parquet.read_table(path, columns=columns, filesystem=filesystem, **kwargs)
  File "/mnt/efs/lib/pyarrow/parquet/core.py", line 2926, in read_table
    dataset = _ParquetDatasetV2(
  File "/mnt/efs/lib/pyarrow/parquet/core.py", line 2477, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "/mnt/efs/lib/pyarrow/dataset.py", line 762, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/mnt/efs/lib/pyarrow/dataset.py", line 445, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/mnt/efs/lib/pyarrow/dataset.py", line 421, in _ensure_single_source
    raise FileNotFoundError(path)

File paths have been replaced with placeholders.
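
Note that the `FileNotFoundError` in the traceback is only a downstream symptom: `ogr2ogr` stopped at `ERROR 1` and never wrote the output, so `read_parquet` had nothing to open. A minimal sketch of surfacing the conversion failure first, before anything tries to read the result. `convert_or_raise` is a hypothetical helper name, not part of GDAL or geopandas, and the command and paths are placeholders:

```python
import os
import subprocess


def convert_or_raise(cmd, destination):
    """Run a conversion command and fail loudly if it produced no output.

    `cmd` is the full argument list (for example an ogr2ogr invocation) and
    `destination` is the path the command is expected to create.
    """
    # run() blocks until the process exits, so returncode is always set.
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr = result.stderr.decode(errors="replace").strip()
    if result.returncode != 0:
        raise RuntimeError(f"command failed ({result.returncode}): {stderr}")
    if not os.path.exists(destination):
        # Guard against a tool that reports an error but still exits with 0.
        raise RuntimeError(f"command wrote no output at {destination}: {stderr}")
    return destination
```

It could be called as `convert_or_raise(["ogr2ogr", "-f", "Parquet", "somedestination", "somelocation"], "somedestination")`, so that a missing driver shows up as the reported error rather than as a missing file later on.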

So, digging into the CMake flags, I see this: https://github.com/OSGeo/gdal/blob/634f60a4181c9db067a64dbfdd9f2872e4992927/ogr/ogrsf_frmts/generic/ogrregisterall.cpp#L251

But I don't see anything specifically disabling it in the build. Can anyone who reads C++ tell me whether writing Parquet output is possible with the GDAL version built for this image?
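
One way to answer this without reading the C++ is to ask the installed binaries which drivers they were compiled with: the GDAL command-line tools accept a `--formats` flag that lists them. The sketch below is a rough check under the assumption that each driver line starts with the short name followed by a ` -vector-` or ` -raster-` marker; `driver_listed` and `ogr_has_driver` are hypothetical helper names.

```python
import subprocess


def driver_listed(formats_output, driver_name):
    """Return True if `driver_name` is a short name in `--formats` output."""
    wanted = driver_name.strip().lower()
    for line in formats_output.splitlines():
        # Assumed line shape: "  Parquet -vector- (rw+v): (Geo)Parquet".
        # Everything before the first " -" is taken as the driver short name.
        short_name = line.split(" -", 1)[0].strip().lower()
        if short_name == wanted:
            return True
    return False


def ogr_has_driver(driver_name, executable="ogrinfo"):
    """Ask the installed GDAL build whether it lists the given vector driver."""
    result = subprocess.run(
        [executable, "--formats"], stdout=subprocess.PIPE, check=True
    )
    return driver_listed(result.stdout.decode(errors="replace"), driver_name)
```

Calling `ogr_has_driver("Parquet")` inside the Lambda environment would then show whether the driver is present before any conversion is attempted.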

@vincentsarago
Contributor

vincentsarago commented May 4, 2023

We're not adding libarrow, so I guess that is why the Parquet driver is not available. It would be a nice addition, though.

I'm not sure I'll have time right now sadly but I'll be happy to review any PR 🙏

ref: https://gdal.org/development/building_from_source.html#arrow https://gdal.org/development/building_from_source.html#parquet

@ncgl-syngenta
Author

ncgl-syngenta commented May 9, 2023

Ok - I've been able to build my own version with arrow. I removed a lot of other installs since our specific use case only needs parquet support, but the zip size still shoots up, mainly because of an arrow dependency file (40MB on its own)
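
To see which files dominate the package, something like the following could be run against the install prefix before zipping. This is only a sketch: `largest_files` is a hypothetical helper, and `/opt/lib` in the usage note is assumed from the `PREFIX=/opt` setting in the Dockerfile posted in this thread.

```python
import os


def largest_files(root, limit=10):
    """Return (size_in_bytes, path) pairs for the biggest files under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so versioned .so aliases are not counted twice.
            if os.path.islink(path):
                continue
            sizes.append((os.path.getsize(path), path))
    # Largest first; ties fall back to path order.
    return sorted(sizes, reverse=True)[:limit]
```

For example, `for size, path in largest_files("/opt/lib"): print(size, path)` prints the heaviest shared libraries first, which makes it easier to decide which Arrow components are worth disabling.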

I'm going to paste the full Dockerfile that we're using, in case anyone comes across this:

# modified from https://github.com/lambgeo/docker-lambda/blob/master/dockerfiles/Dockerfile.gdal3.6

FROM public.ecr.aws/lambda/provided:al2 as builder

RUN yum makecache fast
RUN yum install -y autoconf libtool flex bison cmake make tar gzip gcc gcc-c++ automake16 nasm readline-devel openssl-devel curl-devel cmake3

ENV PREFIX=/opt
WORKDIR /opt

ENV LD_LIBRARY_PATH $PREFIX/lib:$LD_LIBRARY_PATH

# pkg-config
ENV PKGCONFIG_VERSION=0.29.2
RUN mkdir /tmp/pkg-config \
  && curl -sfL https://pkg-config.freedesktop.org/releases/pkg-config-${PKGCONFIG_VERSION}.tar.gz | tar zxf - -C /tmp/pkg-config --strip-components=1 \
  && cd /tmp/pkg-config \
  && CFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --with-internal-glib \
  && make -j $(nproc) --silent && make install && make clean \
  && rm -rf /tmp/pkg-config

ENV PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig/

# sqlite
RUN mkdir /tmp/sqlite \
  && curl -sfL https://www.sqlite.org/2020/sqlite-autoconf-3330000.tar.gz | tar zxf - -C /tmp/sqlite --strip-components=1 \
  && cd /tmp/sqlite \
  && CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --disable-static \
  && make -j $(nproc) --silent && make install && make clean \
  && rm -rf /tmp/sqlite

ENV \
  SQLITE3_LIBS="-L${PREFIX}/lib -lsqlite3" \
  SQLITE3_INCLUDE_DIR="${PREFIX}/include" \
  SQLITE3_CFLAGS="$CFLAGS -I${PREFIX}/include" \
  PATH=${PREFIX}/bin/:$PATH

# nghttp2
ENV NGHTTP2_VERSION=1.42.0
RUN mkdir /tmp/nghttp2 \
  && curl -sfL https://github.com/nghttp2/nghttp2/releases/download/v${NGHTTP2_VERSION}/nghttp2-${NGHTTP2_VERSION}.tar.gz | tar zxf - -C /tmp/nghttp2 --strip-components=1 \
  && cd /tmp/nghttp2 \
  && ./configure --enable-lib-only --prefix=$PREFIX \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/nghttp2

# libcurl
ENV CURL_VERSION=7.73.0
RUN mkdir /tmp/libcurl \
  && curl -sfL https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz | tar zxf - -C /tmp/libcurl --strip-components=1 \
  && cd /tmp/libcurl \
  && ./configure --disable-manual --disable-cookies --with-nghttp2=$PREFIX --prefix=$PREFIX \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/libcurl

# libtiff
ENV LIBTIFF_VERSION=4.5.0
RUN mkdir /tmp/libtiff \
  && curl -sfL https://download.osgeo.org/libtiff/tiff-${LIBTIFF_VERSION}.tar.gz | tar zxf - -C /tmp/libtiff --strip-components=1 \
  && cd /tmp/libtiff \
  && LDFLAGS="-Wl,-rpath,'\$\$ORIGIN'" CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure \
    --prefix=$PREFIX \
    --disable-static \
    --enable-rpath \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/libtiff

# geos
ENV GEOS_VERSION=3.11.2
RUN mkdir /tmp/geos \
  && curl -sfL https://github.com/libgeos/geos/archive/refs/tags/${GEOS_VERSION}.tar.gz | tar zxf - -C /tmp/geos --strip-components=1 \
  && cd /tmp/geos \
  && mkdir build && cd build \
  && cmake3 .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=NO \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/geos

# proj
ENV PROJ_VERSION=9.2.0
RUN mkdir /tmp/proj && mkdir /tmp/proj/data \
  && curl -sfL https://github.com/OSGeo/proj/archive/${PROJ_VERSION}.tar.gz | tar zxf - -C /tmp/proj --strip-components=1 \
  && cd /tmp/proj \
  && mkdir build && cd build \
  && cmake3 .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_INSTALL_INCLUDEDIR:PATH=include \
    -DBUILD_TESTING=OFF \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/proj

# arrow (built with Parquet support, required by the GDAL Arrow/Parquet drivers)
ENV ARROW_VERSION=12.0.0
RUN mkdir /tmp/arrow \
    && curl -sfL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz" | tar zxf - -C /tmp/arrow --strip-components=1 \
    && cd /tmp/arrow/cpp \
    && mkdir build && cd build \
    && cmake3 .. \
    -DCMAKE_INSTALL_PREFIX=$PREFIX \
    -DCMAKE_PREFIX_PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR=lib \
    -Dxsimd_SOURCE=BUNDLED \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_PARQUET=ON \
    && make -j $(nproc) --silent && make install \
    && rm -rf /tmp/arrow

# We use commit sha to make sure we are not using `cache` when building the docker image
# "7ca88116f5a46d429251361634eb24629f315076" is the latest commit on release/3.6 branch

# gdal
RUN mkdir /tmp/gdal \
  && curl -sfL https://github.com/OSGeo/gdal/archive/7ca88116f5a46d429251361634eb24629f315076.tar.gz | tar zxf - -C /tmp/gdal --strip-components=1 \
  && cd /tmp/gdal \
  && mkdir build && cd build \
  && cmake3 .. \
    -DGDAL_USE_EXTERNAL_LIBS=ON \
    -DCMAKE_BUILD_TYPE=MinSizeRel \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_PREFIX_PATH=lib \
    -DGDAL_SET_INSTALL_RELATIVE_RPATH=ON \
    -DBUILD_PYTHON_BINDINGS=OFF \
    -DBUILD_TESTING=OFF \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
    -DGDAL_BUILD_OPTIONAL_DRIVERS=OFF \
    -DOGR_BUILD_OPTIONAL_DRIVERS=OFF \
    -DGDAL_USE_PARQUET=ON \
    -DGDAL_USE_ARROW=ON \
    -DOGR_ENABLE_DRIVER_ARROW=ON \
    -DOGR_ENABLE_DRIVER_ARROW_PLUGIN=ON \
    -DOGR_ENABLE_DRIVER_PARQUET=ON \
    -DOGR_ENABLE_DRIVER_PARQUET_PLUGIN=ON \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/gdal

# from https://github.com/pypa/manylinux/blob/d8ef5d47433ba771fa4403fd48f352c586e06e43/docker/build_scripts/build.sh#L133-L138
# Install patchelf (latest with unreleased bug fixes)
ENV PATCHELF_VERSION=0.10
RUN mkdir /tmp/patchelf \
  && curl -sfL https://github.com/NixOS/patchelf/archive/${PATCHELF_VERSION}.tar.gz | tar zxf - -C /tmp/patchelf --strip-components=1 \
  && cd /tmp/patchelf \
  && ./bootstrap.sh \
  && ./configure \
  && make -j $(nproc) --silent && make install \
  && cd / && rm -rf /tmp/patchelf

# FIX
RUN for i in $PREFIX/bin/*; do patchelf --force-rpath --set-rpath '$ORIGIN/../lib' $i; done

# Build final image
FROM public.ecr.aws/lambda/provided:al2 as runner

ENV PREFIX=/opt

COPY --from=builder $PREFIX/lib/ $PREFIX/lib/
COPY --from=builder $PREFIX/include/ $PREFIX/include/
COPY --from=builder $PREFIX/share/ $PREFIX/share/
COPY --from=builder $PREFIX/bin/ $PREFIX/bin/

RUN export GDAL_VERSION=$(gdal-config --version)

RUN yum install -y zip binutils

# remove any unneeded files
RUN rm -rdf $PREFIX/share/doc \
    && rm -rdf $PREFIX/share/man \
    && rm -rdf $PREFIX/share/cryptopp \
    && rm -rdf $PREFIX/share/hdf*

RUN cd $PREFIX \
    && find lib/ -type f -name \*.so\* -exec strip {} \; \
    && zip -r9q --symlinks /tmp/package.zip lib \
    && zip -r9q --symlinks /tmp/package.zip share \
    && zip -r9q --symlinks /tmp/package.zip bin/gdal* bin/ogr* bin/geos* bin/arrow* bin/proj* \
    && mv /tmp/package.zip /package.zip

FROM scratch AS exporter
COPY --from=runner /package.zip .

@vincentsarago
Contributor

It seems a lot of Arrow compilation options are set to ON by default and could be changed: https://arrow.apache.org/docs/developers/cpp/building.html#optional-components

@vincentsarago
Contributor

Some related issues:

apache/arrow#33126
aws/aws-sdk-pandas#1977
