From 99f0a0de9062c44cc8cdb9cf3a7efd270dbdfd57 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Fri, 6 Jan 2017 12:44:23 +0100 Subject: [PATCH] Add write up on packagedcode module in README. * following discussion with @sschuberth in https://github.com/nexB/scancode-toolkit/pull/421#issuecomment-270344304 Signed-off-by: Philippe Ombredanne --- src/packagedcode/README.rst | 116 ++++++++++++++++++++++++++++++++++++ 1 file changed, 116 insertions(+) create mode 100644 src/packagedcode/README.rst diff --git a/src/packagedcode/README.rst b/src/packagedcode/README.rst new file mode 100644 index 00000000000..22f1a039936 --- /dev/null +++ b/src/packagedcode/README.rst @@ -0,0 +1,116 @@ +The purpose of `packagedcode` is to: + +- detect a package, +- determine its dependencies, +- detect its asserted license (at the metadata level) vs. its actual licensing (as scanned). + + +1. **detect the presence of a package** in a codebase based on its manifest, its file +or archive type. Typically it is a third party package but it may be your own too. +Taking Python as a main example a package can exist in multiple forms: + + 1.1. as a **source checkout** (or some source archive such as a source + distribution or an `sdist`) where the presence of a `setup.py` or some + `requirements.txt` file is the key marker for Python. For Maven it would be a + `pom.xml` or a `build.gradle` file, for Ruby a `Gemfile` or `Gemfile.lock`, the + presence of autotools files, and so on, with the goal to eventually covering all + the packages formats/types that are out there and commonly used. + + 1.2. as an **installable archive or binary** such as a Pypi wheel `.whl` or + `.egg`, a Maven `.jar`, a Ruby `.gem`, a `.nupkg` for a Nuget, a `.rpm` or `.deb` + Linux package, etc... Here the type, shape and name structure of an archive as + well as some its files content are the key markers for detection. The metadata + may also be included in that archive as a file or as some headers (e.g. RPMs) + + 1.3. as an **installed packaged** such as when you `pip install` a Python package + or `bundle install` Ruby gems or `npm install` node modules. Here the key markers + may be some combo of a typical or conventional directory layout and presence of + specific files such as the metadata installed with a Python `wheel`, a `vendor` + directory for Ruby, some `node_modules` directory tree for npms, or a certain + file type with metadata such as Windows DLLs. Additional markers may also include + "namespaces" such as Java or Python imports, C/C++ namespace declarations. + +2. **parse and collect the package manifest(s)** metadata. For Python, this means +extracting name, version, authorship, asserted licensing and declared dependencies as +found in the any of the package descriptor files (e.g. a `setup.py` file, +`requirements` file(s) or any of the `*-dist-info` or `*-egg-info` dir files such as +a `metadata.json`). Other package formats have their own metatada that may be more or +less comprehensive in the breadth and depth of information they offer (e.g. +`.nuspec`, `package.json`, `bower.json`, Godeps, etc...). These metadata include the +declared dependencies (and in some cases the fully resolved dependencies too such as +with Gemfile.lock). Finally, all the different packages formats and data are +normalized and stored in a common data structure abstracting the small differences of +naming and semantics that may exists between all the different package formats. + +Once collected, these data are then injected in a `packages` section of the scan. + +What code in `packagedcode` is not meant to do: + +A. **download packages** from a thirdparty repository: there is code upcomming code in +another tool that will be specifically dealing with this and also handles collecting +the metadata as served by a package repository (which are in most cases --but not +always-- the same as what is declared in the manifests). + +B. **resolve dependencies**: the focus here is on a purely static analysis that does not +rely on any network access at runtime by design. To scan for actually used +dependencies the process is to instead scan for an as-built or as-installed or as- +deployed codebase where the dependencies have already been provisioned and installed +and there ScanCode would detect these. +There are also some upcomming prototype for a dynamic multi-package dependencies +resolver that actually runs live the proper tool to resolve and collect dependencies +(e.g. effectively running Maven, bundler, pip, npm, gradle, bower, go get/dep, etc). +This will be a tool separate from ScanCode as this requires having several/all +package managers installed (and possibly multiple versions of each) and may run code +from the codebase (e.g. a setup.py) and access the network for fetching or resolving +dependencies. It could be also exposed as a web service that can take in a manifest +and package and run safely the dep resolution in an isolated environment (e.g. a +chroot jail or docker container) and return the collected deps. + +C. **match packages** (and files) to actual repositories or registries, e.g. given a +scan detecting packages matching would be looking them up in a remote package +repository or a local index and possibly using A. and/or B. additionally if needed. +Here again there is some upcomming code and tool that will deal specifically with +this aspect and would handle also building an index of actual registries/repositories +and matching using hashes and fingerprints. + +An now some answer to questions originally by @sschuberth: + +> More concretely, this does not download the source code of a Python package to run +ScanCode over it. + +Correct. The assumption with ScanCode proper (aside of the other in progress tools +that I mentioned above) is that the deps have been fetched in the code you scan if +you want to scan for deps. Packages will be detected with their declared deps but the +deps will neither be resolved nor fetched. Though, as a second step we could also +verify that all the declared deps are also present in the scanned code as detected +packages. + +> This should be made very clear as this means cases where the license from the +metadata is wrong compared to the LICENSE file in the source code will not get +detected. + +Both the metadata and the file level licenses (such as a header comment or a +`LICENSE` file of sorts) are detected by ScanCode here: the license scan detect the +licenses while the package scan collect the asserted licensing in the metadata. The +interesting thing thanks to this combo is that eventual conflicts (or incomplete +data) can then be analyzed and a deduction should be doable automatically: given a +scan for packages and licenses and copyrights, do the package metadata +asserted/declared license match the actual detected licenses? If not this could be +reported as some "error" condition... Furthermore, this could be refined based on +classification of the files: a package may assert a top level `MIT` license and use a +GPL-licensed build script. By knowing that the build script is indeed a build script, +we could then report that the GPL detected in such script is not conflicting with the +overall asserted MIT license of the package. The same could be done with test +scripts/code, or documentation code (such as doxygen-generated docs) + +> Moreover, licenses from transitive dependencies are not taking into account. + +If the transitive dependencies have been resolved and their code present in the +codebase, then they would be caught by a static ScanCode scan and eventually scanned +both for package metadata and/or license detection. There are some caveats that would +need to be dealt with of course as some tools (e.g. Maven) may not store locally +(e.g. side-by-side with a given checkout) the corresponding artifacts/Jars and use +instead a `~/user` "global" dot directory to store a cache. + +Beyond this, actual dependency resolution of a single package or a complete manifest +would the topic of another tool as mentioned above.