This repository has been archived by the owner on May 19, 2021. It is now read-only.

Beyond CRAN: modern dependency management; including older/archived versions & alternative repositories #7

Open
cboettig opened this issue Feb 10, 2015 · 28 comments


@cboettig

CRAN is a favorite topic at any R gathering; so perhaps we can channel it into something productive here.

Perhaps I'm grouping different issues under the same tent here, so feel free to propose breaking these into separate issues or pursuing only certain aspects of this. I'm really not sure how best to describe these issues either, so if this sparks any interest feel free to hijack this issue and re-frame the discussion however you see fit.

  • There's a handful of very interesting packages for giving a stricter approach to dependency management: including RStudio's packrat, @gmbecker 's gRAN & switchr, and some of the related tools to which these connect (mran, @gaborcsardi 's crandb, etc). It might be nice to take stock of what issues these address and what challenges remain in managing dependencies.
  • The above approaches attempt to deal with the reality of installing packages whose dependencies come from locations other than CRAN (or CRAN/bioconductor/omegahat; though the latter two are clearly rather different from CRAN, & in rather different ways). Does this suggest that the future will be more distributed, or that a different approach to coordinating package dependencies is needed?
  • EDIT: @eddelbuettel's new drat takes the approach of making it simple to create and use alternative repositories, building around base R install functions. How does this approach (and the general ability to add arbitrary repositories into dependencies) interact with the approaches in the first bullet? (There's also a question here about the relative roles of install_github vs. an install.packages() + drat style approach, also seen in @yihui's xran, etc.)
@jennybc

jennybc commented Feb 10, 2015

@eddelbuettel Isn't drat also relevant here?

@cboettig

drat! How could I forget that! Thanks, editing original post now...

@gaborcsardi

Not very surprisingly :), I would very much like to participate in this. There is also https://github.com/rpkg which I am planning to work into a usable state before the unconf. The plan is to have some sort of database/repository and minimal package manager for packages (initially) on Github, to facilitate 1) package discovery and 2) dependencies. The DB will be here: https://github.com/rpkg/rpkgdb.app, and the package manager is not yet in the works.

Alternatively/additionally, we can convince @hadley to put support for Github dependencies into devtools::install_github. :)

@eddelbuettel

For R source packages, I prefer relying on R tools. Hence drat which relies on R's (existing and working) tools to build a package index, resolve R dependencies and simply permit working repositories now. I see no point in replicating this / reinventing wheels here. It works for me.
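As a rough illustration of the "existing and working" tools mentioned above, the sketch below hand-writes a `PACKAGES` index for a throwaway local repository and lets base R resolve dependencies from it. The package names (`appA`, `libB`, `libC`) and the directory are made up for the example, and drat itself is not used; a real repository would normally generate the index with `tools::write_PACKAGES()` from actual package tarballs.

```r
# Throwaway repository layout: <repo>/src/contrib/PACKAGES
repo <- file.path(tempdir(), "toyrepo")
contrib <- file.path(repo, "src", "contrib")
dir.create(contrib, recursive = TRUE, showWarnings = FALSE)

# Hypothetical index entries in DCF format (one blank line between records).
writeLines(c(
  "Package: appA", "Version: 1.0.0", "Imports: libB", "",
  "Package: libB", "Version: 0.2.1", "Depends: R (>= 3.0.0), libC", "",
  "Package: libC", "Version: 0.1.0"
), file.path(contrib, "PACKAGES"))

# Build a file:// URL that works on both Unix-alikes and Windows.
path <- normalizePath(repo, winslash = "/")
repo_url <- paste0(if (startsWith(path, "/")) "file://" else "file:///", path)

# Base R reads the index the same way it would read CRAN's.
db <- available.packages(contriburl = contrib.url(repo_url, type = "source"))

# Recursive dependency resolution against that index.
deps <- tools::package_dependencies("appA", db = db, recursive = TRUE)
print(deps)
```

With a real repository, passing the same URL in `repos` to `install.packages()` would use this index for installation as well.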

For R binary packages, though, I was wondering if I could hash things out with @gaborcsardi about revisiting what we did with cran2deb (and which lives on in Don's debian-r). That would e.g. feed into Rocker, and could be used for other binary builds on other OSs etc.

@eddelbuettel

@cboettig: Thanks for the edit above. I think this ticket may need to get split into two or three related ones. Read your headline v e r y s l o w l y: that is really several problems at once, no?

So looking forward to these two days...

@gmbecker

@cboettig A subject close to my heart :). Excited to talk to other
interested parties about it!

@gaborcsardi This is one of the things that switchr provides
(github-to-github dependencies), given a manifest of all necessary github
packages and the ability to build the packages from source. We do this by
building a just-in-time repository by recursively traversing the stated
dependencies.

Given a complete-enough manifest of packages within github repositories,
this allows us to treat github as an untested "CRAN-devel", as well as a
supplementary archive for older pkg versions.
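The recursive traversal described above can be sketched in a few lines of base R. This is not switchr's API or implementation: the manifest layout, the `example` GitHub URLs, and the function `resolve_from_manifest()` are all hypothetical, and the sketch only computes the build order, without fetching or building anything.

```r
# Hypothetical manifest: one row per package, with its source location and
# the packages it declares as dependencies (comma-separated).
manifest <- data.frame(
  name = c("pkgA", "pkgB", "pkgC", "pkgD"),
  url  = c("https://github.com/example/pkgA",
           "https://github.com/example/pkgB",
           "https://github.com/example/pkgC",
           "https://github.com/example/pkgD"),
  deps = c("pkgB, pkgC", "pkgC", "", ""),
  stringsAsFactors = FALSE
)

# Walk the stated dependencies depth-first and return the needed packages in
# an order where every dependency precedes the packages that require it.
resolve_from_manifest <- function(pkgs, manifest) {
  ordered <- character(0)
  visit <- function(pkg, path = character(0)) {
    if (pkg %in% ordered) return(invisible(NULL))
    if (pkg %in% path) stop("dependency cycle involving ", pkg)
    row <- manifest[manifest$name == pkg, ]
    if (nrow(row) == 0L) stop("package not in manifest: ", pkg)
    deps <- trimws(strsplit(row$deps, ",", fixed = TRUE)[[1]])
    for (dep in deps[nzchar(deps)]) visit(dep, c(path, pkg))
    ordered <<- c(ordered, pkg)
  }
  for (pkg in pkgs) visit(pkg)
  manifest[match(ordered, manifest$name), c("name", "url")]
}

build_plan <- resolve_from_manifest("pkgA", manifest)
print(build_plan)
```

Only the packages reachable from the request end up in the plan (`pkgD` is never touched), so a just-in-time repository built this way would contain exactly what the install needs.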

@eddelbuettel Re multiple issues. They can be separate, and that approach
is not without benefits, but they don't need to be, I think...

This is going to be fun.
~G


Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.

@gaborcsardi

@eddelbuettel Well, existing and working is a good point. However, I would really like to have a repository that provides an API to query, submit, etc., plus a modern package manager that speaks this API (and handles CRAN as well, maybe through crandb).

That does not mean that we cannot start with the existing tools, and just provide an API and maybe a package manager on top of them. In fact, using the existing and working crandb with the existing and working drat is probably something we can put together in two days.

However, I also think that sometimes it is better to start from scratch. :) We would need the machinery to submit, publish, query, etc. packages, anyway.

@sckott

sckott commented Feb 11, 2015

This may seem crazy, but the conda package manager from Continuum does or soon will support R http://continuum.io/blog/preliminary-support-R-conda - and has virtual environments, and is cross platform - A possible place to start on the pkg manager front?

@hafen

hafen commented Feb 11, 2015

I was going to mention conda-R as well. Interesting idea with Binstar and all.

@gaborcsardi

@gmbecker Nice, I didn't know switchr did that. That's a great first step! It would also be nice to have 1) proper versions, and 2) some DB of packages, for discovery. I would leave out virtual environments from this, I think they are great, but they can be done independently.

Re conda, I am not sure what it does exactly with R packages. Isn't conda itself written in Python? That's somewhat suboptimal, to have to install Python to manage R packages..... but conda and binstar are definitely projects to learn from, just like other package managers, e.g. the ones listed at https://github.com/showcases/package-managers

@eddelbuettel As for binary builds, I guess you realize that for that you need a farm, or SaaS? Or maybe I don't get what you mean here.

Btw. I also think that we should break up this issue.

@gmbecker

@gaborcsardi Depending on what you mean by true versions, switchr does
support that as well. You can tell it to install an exact (even
non-current) version of a package from a github repository. This becomes
murkier when there are dependencies, as it is difficult for it to know what
versions of the dependencies to grab.

Re: a database, I agree. The way I am currently envisioning it is a large
manifest hosted on github with the ability for the community to make pull
requests to add-to or update it. This could be wrapped in a convenience R
function.

For the versions-of-dependencies issue I agree that some sort of database
is probably the only way to manage that. switchr can talk to your crandb
service to solve this problem for packages that lived on CRAN. As a
side-note I'd like to talk to you about a) making that particular query
easier in the crandb API, and b) how hard it would be to have a similar
service for Bioconductor packages.

~G


@gaborcsardi

@gmbecker

Re versions, that's great, again, I did not know you did that. For the DB, I want people to be able to make releases, i.e. to decide which versions they want to include in the DB. And then the package manager would handle this, e.g. update packages. That's all I mean. Versioned dependencies would be great, but too difficult right now imho. I investigated this a bit here: https://github.com/metacran/camo but I don't think it is worth doing before we actually have a package manager.

I was thinking about the Github solution, too. However, I really don't want to handle pull requests by hand, and it is also really hard to automate them, so that people can't mess up the DB by chance. That's why I am thinking about a noSQL DB with an API, and authentication (which can be provided by Github, actually). Something very similar to crandb. All automated, with human intervention only in exceptional cases.

As for your crandb query, and BioC, why don't you open an issue in the crandb repo, and then we can discuss it.

@eddelbuettel

FWIW I started to cobble together a package (for Debian-based systems including of course Ubuntu and what is used at Travis) to interface the package manager backend: RcppAPT. This may (or may not) help with gathering information about what has been built (in the "take source from CRAN and build a binary" sense) and which build-dependencies are or are not available.

It may be useful for all the cloud-based things implemented with a Debian/Ubuntu backing such as Docker (where at least @cboettig and I use a Debian backing), Travis, ... but it won't buy you lunch on Windoze, OS X, Fedora, and lots of other lovely places. Which I rarely visit :)

@hafen

hafen commented Mar 20, 2015

A late +1 for this - I'm particularly interested in the managed github database of packages / github dependency management, and would like to participate.

@eddelbuettel

Have you looked at drat ?

One view is that we don't actually have to reinvent anything but rather create more repositories.

@hafen

hafen commented Mar 21, 2015

Yes - it looks great! Sorry I didn't mean to imply that these aren't solved - I have a lot to catch up on in this area. Just interested in helping push it forward.

@gaborcsardi

I think the DB has different goals than drat. The main goals are

  1. Discoverability. There are thousands of R packages on Github, and it is not easy to find them, even if Github is searchable.
  2. Easy installation and dependency management. R's built-in package management (used by drat, too, afaik) does not really scale to hundreds of repos, imo. Specifically, every time you want to install something or even just list packages, it downloads PACKAGES files from all repos. This is not feasible for hundreds of repos, and as this is hardwired into R (utils), it is unlikely to change. (OK, the packages are cached within a session, that's actually good.)
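To make the scaling concern in the second point concrete, here is a small sketch of how the set of index files grows with the number of configured repositories. The two `example.org` URLs are placeholders, not real repositories, and nothing is downloaded; the code only derives the locations base R would consult.

```r
# Hypothetical repository list; only the CRAN mirror URL is real.
repos <- c(
  CRAN  = "https://cloud.r-project.org",
  drat1 = "https://example.org/drat-one",
  drat2 = "https://example.org/drat-two"
)

# One contrib directory per repository...
contrib <- unname(contrib.url(repos, type = "source"))

# ...and therefore one PACKAGES index per repository that
# available.packages() / install.packages() has to read
# (R may fetch a compressed variant of the same file).
index_urls <- paste0(contrib, "/PACKAGES")
print(index_urls)
```

The number of index fetches is linear in the number of repositories, which is harmless for a handful of repos and is the cost being questioned for hundreds.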

@eddelbuettel

I am not aware that anybody has ever tried "hundreds of repos" but do agree that scaling to that size would be an issue.

I foresee drat as being useful for the range of, say, two to five repos as I don't really see people keeping track of that many more.

And I also see the MetaCRAN DB as both extremely useful and also complementary to what drat tries to do here and now.

This is not either or, and I hope I didn't portray it as such. One can use drat today to solve actual problems -- and I and some other early adopters do.

@gaborcsardi

I agree completely. I think that drat is great for your own local repo management, and maybe a couple of other repos. (But we don't know how many until we try it, actually. Maybe it scales up to 50, maybe not.) It probably does not scale up to hundreds. This is not drat's fault, R's package management functions were simply not written with having many repos in mind.

Btw. I started writing the package manager: https://github.com/metacran/rpkg (Pre-alpha, so beware! :) )
It uses ./r_pkgs by default for installing packages, and you need to supply global = TRUE to most functions to use the standard R library directories in .libPaths(). Things like

pkg_list()
pkg_list(global = TRUE)
pkg_outdated()
pkg_outdated(global = TRUE)
pkg_tree("devtools")
pkg_install("httr")
pkg_upgrade()
pkg_info("httr")
pkg_bug("httr")
pkg_browse("Rcpp")

work, at least on OSX. There are a lot of corner cases to handle, and a lot to do in general. The good thing is, with crandb, every operation is pretty simple.

It uses CRAN packages now, the idea is to add Github packages later.
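For readers unfamiliar with the local-versus-global library distinction, it can be approximated with base R alone. The sketch below does not use rpkg; a temporary directory stands in for the project root, and the `r_pkgs` name is borrowed from the description above purely for illustration.

```r
# Project-local library, analogous in spirit to the ./r_pkgs default.
project <- file.path(tempdir(), "myproject")
local_lib <- file.path(project, "r_pkgs")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)

old_paths <- .libPaths()
.libPaths(c(local_lib, old_paths))  # the local library is now searched first
first_path <- .libPaths()[1]

# "Local" view: only what is installed in the project library (empty here).
local_pkgs <- rownames(installed.packages(lib.loc = local_lib))

# "Global" view: everything in the standard library directories.
global_pkgs <- rownames(installed.packages(lib.loc = old_paths))

print(first_path)
print(length(local_pkgs))

.libPaths(old_paths)  # restore the previous search path
```

Installing with `install.packages(..., lib = local_lib)` would then populate only the project library and leave the global one untouched.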

@gmbecker

@gaborcsardi @eddelbuettel - switchr approaches this via non-centralized manifests. When installing from a manifest, it creates a local repository containing only the necessary packages, and installs from that by calling down to R's built-in installation mechanisms. AFAICS, this should scale indefinitely, as the complexity goes up with the number of packages that will be installed, not the number of packages available.

At that point, all we need is a manifest of where R packages live on github, and we are good to go. For example, it is easy to generate a manifest of all ROpenSci packages, so that installing any single or combination of those packages just works.

@gaborcsardi

@gmbecker Yes, the DB and the accompanying web-service have essentially the same role as your Github manifest.

@viking

viking commented Feb 20, 2018

@gaborcsardi Sorry to resurrect a dead thread, but did you end up getting anywhere in your efforts to create a package database?

@eddelbuettel

eddelbuettel commented Feb 20, 2018

@viking : Yes, it underlies R Hub and is used. But I tend to lose myself among the different repos that @gaborcsardi has and cannot immediately point you to one.

@jeroen

jeroen commented Feb 20, 2018

@viking the r-hub dependency db is https://sysreqs.r-hub.io which is backed by this data: https://github.com/r-hub/sysreqsdb

@viking

viking commented Feb 20, 2018

Hmm, so what is the primary function of R Hub? Will it be a CRAN replacement eventually or is it meant to supplement CRAN in some way?

@gaborcsardi

It is a multi-platform package check service, in its current form.

@eddelbuettel

Will it be a CRAN replacement eventually

That was never planned. It is independent of CRAN.

@viking

viking commented Feb 20, 2018

Aha, I see. I was just curious about the state of some of the ideas discussed in this thread. Thanks for the info.
