Install most packages from conda-forge, instead of pypi #2934

Open
6 of 8 tasks
yuvipanda opened this issue Oct 27, 2021 · 16 comments

@yuvipanda
Contributor

yuvipanda commented Oct 27, 2021

Currently, we get most of our python packages from pypi.org, installed via pip. A lot of scientific python packages have C extensions, and installing them from pypi has been simple enough thanks to manylinux wheels. However, there are some packages - particularly in the geo sciences - that are a pain in the ass to install this way still.

#2824 is one such case. The cartopy project does not ship manylinux wheels, so we need to install its C dependencies - proj, geos, gdal, etc - from apt. This also has a knock-on effect for other packages that depend on proj, like shapely. Shapely does ship binary wheels, but because cartopy and shapely must link to the same proj library, shapely must be built from source too - or you run into the problems described in #1796.

This becomes even more complicated when we add R to the mix. The sf R package also needs proj, and since we're installing it from packagemanager.rstudio.com, it's linked against the version of proj that is available in apt.

So to recap, the following package managers are involved:

  1. apt to get C libraries (proj)
  2. pip to get Python packages that link against the C libraries (shapely, cartopy)
  3. R packages coming in via packagemanager.rstudio.com, that also link against C libraries specifically coming from apt (sf)

This was a bit of a tenuous situation, but the need to upgrade cartopy for #2824 totally made this unworkable. Cartopy 0.20 needed a newer version of proj than what was available in apt. With #2826, we tried to install a newer version of proj from conda (adding yet another package manager to the mix!), but this required we remove proj installed via apt - as otherwise pip was still trying to link to that, and that doesn't work. And once we removed proj from apt, this broke the R sf package, as it required proj from apt!
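To see which copy of a C library the system linker search would pick up on a given image, something like the following can help. This is only a rough sketch: it uses the standard library's `ctypes.util.find_library`, which reports what the platform's library search finds, not what an already-built wheel or R package was actually linked against. The library names passed in (`proj`, `geos_c`, `gdal`) are assumptions about how those libraries are named on the image.

```python
import ctypes.util
from typing import Optional


def resolve_shared_library(name: str) -> Optional[str]:
    """Return the library file name the platform search finds, or None.

    On Linux this typically yields something like 'libproj.so.<N>' when
    a system-wide copy (for example, one installed from apt) is visible.
    """
    return ctypes.util.find_library(name)


if __name__ == "__main__":
    # Library names are assumptions; adjust them for the image in question.
    for lib in ("proj", "geos_c", "gdal"):
        print(f"{lib} -> {resolve_shared_library(lib)}")
```

If this prints a system `libproj` while a conda-provided proj is also installed, that is a hint that two copies are visible and the mix described above may apply.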

I think the core of the problem is that both pip and R are dependent on apt for some C libraries, and this can conflict. I propose instead that we:

  1. Use conda to get most scientific python packages, especially any that have C dependencies. This removes most of the need for C packages from apt.
  2. Use apt to get C packages needed by R packages.

The scientific python ecosystem has a lot of good support for conda, so I think this will also simplify our lives a bit. We'll still be getting some python packages from pip, but as long as we're getting most packages that link against C libraries from conda, I think we're ok.

Let's move these one hub image at a time, starting with the easiest.

  • Move infra-requirements.txt to install from conda, by specifying it in environment.yml
  • Biology
  • data8
  • data8x
  • dlab
  • eecs
  • julia
  • datahub

If we get similar versions from conda to the ones we get from pip right now, I think this would work out ok. Builds should also be faster.
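One way to check the "similar versions" condition is to diff the output of `pip freeze` from the current image against `conda list --export` from the migrated one. The sketch below assumes those two text formats (`name==version` and `name=version=build`) and only compares packages whose normalized names match in both. Packages that are named differently on PyPI and conda-forge would need a manual mapping, which is not attempted here. The sample data at the bottom is made up for illustration.

```python
import re
from typing import Dict, Tuple


def normalize(name: str) -> str:
    """Lowercase a package name and collapse runs of '-', '_' and '.'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pip_freeze(text: str) -> Dict[str, str]:
    """Parse 'name==version' lines; other lines are skipped."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[normalize(name)] = version
    return versions


def parse_conda_export(text: str) -> Dict[str, str]:
    """Parse 'name=version=build' lines; comment lines are skipped."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("=")
        if len(parts) >= 2:
            versions[normalize(parts[0])] = parts[1]
    return versions


def version_differences(
    pip_versions: Dict[str, str], conda_versions: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Map each shared package to (pip_version, conda_version) if they differ."""
    shared = sorted(pip_versions.keys() & conda_versions.keys())
    return {
        name: (pip_versions[name], conda_versions[name])
        for name in shared
        if pip_versions[name] != conda_versions[name]
    }


if __name__ == "__main__":
    # Illustrative sample data, not taken from any real image.
    pip_text = "numpy==1.21.3\npandas==1.3.4\n"
    conda_text = "# exported list\nnumpy=1.21.3=build0\npandas=1.3.3=build0\n"
    print(version_differences(parse_pip_freeze(pip_text),
                              parse_conda_export(conda_text)))
```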

@agoose77

@yuvipanda Hi, just dropping in here! Is there a reason that you can't get the R packages from conda-forge? e.g. r-sf?

@yuvipanda
Contributor Author

Basically, for python packages, we should install them with conda via environment.yml if it exists in conda-forge, and use pip otherwise.
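The rule above can be sketched as a small script that splits an existing requirements list into conda and pip sections of an `environment.yml`. This is an illustration under stated assumptions, not the tooling actually used in the repo: the set of names available on conda-forge is passed in by hand (a real check would have to query the channel), names are assumed to be identical on PyPI and conda-forge, and `example-pip-only-pkg` is a hypothetical package name.

```python
from __future__ import annotations

import re
from typing import Iterable


def split_requirements(
    requirements: Iterable[str], on_conda_forge: set[str]
) -> tuple[list[str], list[str]]:
    """Split requirement lines into (conda_deps, pip_deps)."""
    conda_deps: list[str] = []
    pip_deps: list[str] = []
    for line in requirements:
        spec = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not spec:
            continue
        # The package name is everything before the first version operator.
        name = re.split(r"[=<>!~\[; ]", spec, maxsplit=1)[0].lower()
        if name in on_conda_forge:
            # pip pins with '==', conda specs are usually written with '='.
            conda_deps.append(spec.replace("==", "="))
        else:
            pip_deps.append(spec)
    return conda_deps, pip_deps


def render_environment_yml(conda_deps: list[str], pip_deps: list[str]) -> str:
    """Render a minimal environment.yml as text (no YAML library needed)."""
    lines = ["channels:", "  - conda-forge", "dependencies:"]
    lines += [f"  - {dep}" for dep in conda_deps]
    if pip_deps:
        lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"      - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reqs = ["numpy==1.21.3", "# a comment", "example-pip-only-pkg==1.0"]
    conda, pip = split_requirements(reqs, on_conda_forge={"numpy"})
    print(render_environment_yml(conda, pip))
```

Keeping the pip-only packages under the `pip:` key means a single `environment.yml` still describes the whole Python environment.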

@yuvipanda
Contributor Author

@agoose77 Most of the R community I know of would like to use install.packages or devtools to install packages and manage them from CRAN, and I don't want to redirect them to a different method instead. From an optics perspective, conda is often (fairly or unfairly!) seen as python centric, and given we're already fighting the perception that JupyterHub is python centric (even though we offer RStudio in our hubs), I want to do everything I can to not have R users learn a different package management solution.

@agoose77

agoose77 commented Oct 27, 2021

I see what you're saying. I'm not familiar with the R toolchain - is it possible to use different conda environments for RStudio vs the Python kernels?

@yuvipanda
Contributor Author

yuvipanda commented Oct 27, 2021

@agoose77 most of our R users use R via RStudio, so conda and Jupyter kernels are completely uninvolved there.

@agoose77

agoose77 commented Oct 27, 2021

@yuvipanda sure, let me clarify!

My understanding of your situation is:

  • Both Python and R are installed in the same environment
  • RStudio is accessed via JupyterHub
  • without manylinux wheels, both R and Python are compiling against the same libraries e.g. proj from the system environment.

I am wondering whether it makes sense to drop the need for apt packages entirely by installing RStudio itself in a separate environment to Python, and then have your entry-point such that this is invisible to the user. This is just so that there is a clearer isolation / separation between the system and the application environments (RStudio, Python). Functionally this wouldn't be much different from using apt for the R dependencies, except that it keeps everything on Conda.

@yuvipanda
Contributor Author

> Both Python and R are installed in the same environment

Ah, so they're installed in the same Docker image, but R doesn't know anything about conda at all, so they aren't in the same 'conda environment'. The proposal in this issue uses conda for all Python, and R's native package installation (from CRAN) + apt for R. The scripts that R users distribute often have install.packages() commands in them, and I don't want them to have to do something special instead. Hence avoiding getting anything R from conda for now.

@agoose77

Right! My suggestion was only that putting R inside a separate Conda env would allow you to avoid using APT for R, because install.packages should still work within a Conda environment. The only difference is that you provide the necessary dependencies e.g. proj via Conda inside that environment instead of from APT. It's mainly an idea to simplify the ergonomics so that your "environments" are distinct from the host :)

@yuvipanda
Contributor Author

@agoose77 ah, ok - I'll consider that :) I'm somewhat reluctant to use conda for R, as I feel the general R community is much more focused on CRAN and apt than on conda. packagemanager.rstudio.com offers prebuilt binary packages for all of CRAN, while only a subset of packages is available on conda-forge. In my ideal world, I'd not use conda for python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.

@agoose77

> In my ideal world, I'd not use conda for python packages either (so I don't have to mix them!) - and at least for now it looks like I can do that (avoid mixing!) with R.

Yeah, I don't like mixing my pip with conda-forge packages (and therefore tend to rely solely on PyPI). It would be nice if there were an abstraction layer for tools like poetry such that PyPI + conda-forge could be used by the tool.

@felder self-assigned this Oct 28, 2021
@felder added the enhancement label Oct 28, 2021
@felder
Contributor

felder commented Nov 5, 2021

Submitted PR for julia and asked @yuvipanda to review just to make sure we're on the same page.

felder added a commit that referenced this issue Nov 15, 2021
Moving requirements.txt to environment.yml for julia issue #2934
felder added a commit that referenced this issue Nov 16, 2021
felder added a commit that referenced this issue Nov 17, 2021
Moving requirements.txt to environment.yml for biology issue #2934
felder added a commit that referenced this issue Nov 18, 2021
Moving requirements.txt to environment.yml for data8 issue #2934
@felder
Contributor

felder commented Nov 19, 2021

dlab uses the datahub user image.

@felder
Contributor

felder commented Nov 20, 2021

@yuvipanda looking at eecs hub, would https://anaconda.org/conda-forge/py-opencv be the same as opencv-python?
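A quick way to test a swap like this on a built image is to check the import name rather than the distribution name. The sketch below assumes both packages expose the `cv2` module: that is the case for opencv-python, and treated here as an unverified assumption for py-opencv. The helper names are made up for illustration; the calls themselves are from the standard library (`importlib.util.find_spec`, `importlib.metadata.version`).

```python
import importlib.metadata
import importlib.util
from typing import Optional


def module_available(module_name: str) -> bool:
    """True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


def installed_distribution_version(dist_name: str) -> Optional[str]:
    """Version of an installed distribution, or None if pip metadata is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # 'cv2' as the shared import name is an assumption to verify on the image.
    print("cv2 importable:", module_available("cv2"))
    print("opencv-python metadata:", installed_distribution_version("opencv-python"))
```

If `cv2` is importable but no `opencv-python` metadata is found, the module is probably coming from a differently named distribution, and any package that declares a dependency on `opencv-python` by name may still try to pull it from pip.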

@balajialg
Contributor

balajialg commented Aug 4, 2022

@felder Is this issue in scope for Fall 22, should it be moved to Spring 23, or is it irrelevant at this juncture?

@felder
Contributor

felder commented Aug 10, 2022

@balajialg could be in scope for Fall. My understanding is we're also going to update the base image and do some package management. Could be this gets done as part of that work.

@balajialg
Contributor

@felder Sounds good. Thanks!
