#253 python packages #421

Merged
merged 8 commits into from
Jan 5, 2017
pombredanne
Member

Add initial support to detect Python packages for #253

  • handle some level of detection using setup.py, wheels and eggs metadata

rakesh balusa and others added 6 commits December 6, 2016 12:47
@sschuberth
Collaborator

From reading the code, my understanding is that this relies on the meta-data specified as part of setup.py and reports whatever license is declared there. More concretely, this does not download the source code of a Python package to run ScanCode over it. This should be made very clear, as it means that cases where the license from the meta-data is wrong compared to the LICENSE file in the source code will not get detected. Moreover, licenses from transitive dependencies are not taken into account.

@pombredanne Are my statements correct?

@pombredanne
Member Author

pombredanne commented Jan 4, 2017

@sschuberth the goal of this PR (and of all the code in the packagedcode package, which is invoked with the -p or --package scan option) is to:

  1. detect the presence of a package in a codebase based on its manifest, its file or archive type. Typically it would be a third-party package, but it may be yours. Here this is for Python, and packages can exist in multiple forms:
    1.a) a source checkout (or some source archive such as what is called a source distribution or an sdist) where the presence of a setup.py or some requirements file is the key marker. For Maven it would be a pom.xml or a .gradle file, for Ruby a Gemfile or Gemfile.lock, and so on, eventually covering ALL the package formats/types that are out there.
    1.b) an installable archive such as a PyPI wheel or egg, a Maven jar, a Ruby gem, a .nupkg for NuGet, etc. Here the type, shape and name structure of an archive, as well as possibly some of its file content, is the key marker for detection.
    1.c) the package as-installed, such as when you pip install or bundle install or npm install one or more packages. Here the key markers may be some combo of a directory layout and specific files (such as the metadata installed with a Python wheel, a vendor directory for Ruby, some node_modules tree of sorts for npm, a .rpm or .deb Linux package, etc.).

  2. parse and collect the package manifest(s) metadata. For Python, this means extracting name, version, authorship, asserted licensing and declared dependencies as found in any of the package files (setup.py and/or requirements file(s) and/or any of the *-dist-info dir files such as a metadata.json). Other package formats have their own metadata, more or less comprehensive (e.g. .nuspec, package.json, bower.json, Godeps, etc.).

These two will then be injected into a packages scan section.
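To make the two steps above concrete, here is a minimal sketch, not the actual packagedcode implementation. The function names, marker tables and returned keys are hypothetical and chosen only for illustration. It assumes that a wheel's METADATA file (or an egg's PKG-INFO) uses the usual email-header style "Key: value" layout, which the standard library email parser can read.

```python
import os
from email.parser import Parser

# Hypothetical marker tables for illustration; not ScanCode's real tables.
SOURCE_MARKERS = ("setup.py", "requirements.txt")
ARCHIVE_EXTENSIONS = (".whl", ".egg")
INSTALLED_SUFFIXES = (".dist-info", ".egg-info")


def classify_python_package(path):
    """Return 'source', 'archive', 'installed' or None for a file or directory."""
    name = os.path.basename(path.rstrip("/\\"))
    if os.path.isdir(path):
        # 1.c) as-installed: a metadata directory left behind by the installer
        if name.endswith(INSTALLED_SUFFIXES):
            return "installed"
        # 1.a) source checkout or unpacked sdist: a manifest file is the marker
        if any(os.path.isfile(os.path.join(path, m)) for m in SOURCE_MARKERS):
            return "source"
        return None
    # 1.b) installable archive: recognized here by its file extension only
    if name.endswith(ARCHIVE_EXTENSIONS):
        return "archive"
    return None


def parse_pkg_info(text):
    """Step 2: collect a few fields from PKG-INFO/METADATA style text."""
    msg = Parser().parsestr(text)
    return {
        "name": msg.get("Name"),
        "version": msg.get("Version"),
        "author": msg.get("Author"),
        "asserted_license": msg.get("License"),
        "declared_dependencies": msg.get_all("Requires-Dist") or [],
    }
```

A real detector would also look inside archives and handle setup.py itself, which is Python code rather than declarative metadata; this sketch only checks names and one metadata layout.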

What code in packagedcode is not meant to do:

A. download packages from a third-party repository: there is code I have in an unpushed repo that deals specifically with this and also handles collecting the metadata as served by a package repository (which is in most cases --but not always-- the same as what is declared in the manifest). I will publish this sometime in Q1 hopefully.

B. resolve dependencies: by design, the focus of scancode is on a purely static analysis that does not rely on any network access to run. To scan for the dependencies actually used, the process is instead to scan an as-built, as-installed or as-deployed application where the dependencies have already been provisioned and installed; there scancode would detect them. I also have an unpushed prototype for a dynamic multi-package dependency resolver that actually runs the proper tool live to resolve and collect dependencies (e.g. effectively running Maven, bundler, pip, npm, gradle, bower, go get/dep, etc). This will be a tool separate from scancode, as it requires having several/all package managers installed (and possibly multiple versions), and it does run things and access the network. It may be best exposed as a web service that can take in a manifest and package, safely run the dep resolution in an isolated environment (e.g. a chroot jail or docker container) and return the collected deps.

C. match packages (and files) to actual repositories or registries, e.g. given a scan that detects packages, actually looking them up in a remote package repository and then additionally using A. and/or B. if needed. I have unpushed code for this too that will eventually land on GitHub; it also handles building an index of actual registries/repositories and matching using hashes and fingerprints.
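The matching code mentioned in C. is unpublished, so the following is only a guess at the simplest form that hash-based matching could take: compute a file checksum and look it up in a prebuilt index. The `index` dict (SHA1 hex digest to package identifier) and both function names are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_file(path, index):
    """Look up a file in a hypothetical {sha1: package identifier} index.

    Returns the package identifier, or None when the file is unknown.
    """
    return index.get(sha1_of_file(path))
```

Exact hashes only match byte-identical files; the fingerprints mentioned above would presumably be needed for modified or repackaged content.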

Now the goal of all this is simple: detect a package, determine its deps, detect its asserted license (at the metadata level) and its actual licenses (at the scan level; these may differ and at times conflict).

So to finally answer your questions:

More concretely, this does not download the source code of a Python package to run ScanCode over it.

Correct. The assumption with ScanCode proper (aside from the other not-yet-published tools that I mentioned above) is that the deps have been fetched in the code you scan if you want to scan for deps. Packages will be detected with their declared deps, but the deps will be neither resolved nor fetched (as a second step, I also plan to add a check that verifies that all the declared deps are present in the scanned code as detected packages too). Actual fetching will be handled by the upcoming tool I mentioned above.
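The planned "second step" check could be sketched roughly as follows. This is an illustration, not existing ScanCode code: the function name and the shape of the package records are hypothetical, and it assumes declared dependencies are plain names without version specifiers.

```python
def missing_dependencies(detected_packages):
    """Map each package name to its declared deps that were not detected.

    `detected_packages` is a hypothetical list of dicts with a 'name' and
    an optional 'declared_dependencies' list of plain package names.
    """
    # Names are compared case-insensitively in this sketch.
    present = {pkg["name"].lower() for pkg in detected_packages}
    missing = {}
    for pkg in detected_packages:
        declared = pkg.get("declared_dependencies", [])
        absent = sorted(dep for dep in declared if dep.lower() not in present)
        if absent:
            missing[pkg["name"]] = absent
    return missing
```

For example, if a scan detects `app` declaring `requests` and `six`, and also detects `Requests` but no `six`, the result would flag only `six` as missing for `app`.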

This should be made very clear as this means cases where the license from the meta-data is wrong compared to the LICENSE file in the source code will not get detected.

Both the metadata and the file-level licenses (such as a header comment or a LICENSE file of sorts) are detected by ScanCode here. The interesting thing is that, thanks to this, possible conflicts can then be analyzed, and a deduction should be doable automatically: given a scan for packages, licenses and copyrights, does the license asserted/declared in the package metadata match the licenses actually detected? If not, this could be reported as some "error" condition. Furthermore, this could be refined based on a classification of the files: a package may assert a top-level MIT license and use a GPL-licensed build script. By knowing that the build script is indeed a build script, we could then report that the GPL detected in such a script does not conflict with the overall asserted MIT license of the package. The same could be done with test scripts/code, or documentation code (such as doxygen-generated docs).
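As a rough illustration of that deduction, and assuming a scan result shaped as below (the record keys, category names and function name are all hypothetical), the comparison could look like this. License names are compared by naive string equality here; a real check would need proper license expression handling.

```python
# Hypothetical file categories whose licenses are assumed not to apply
# to the package as shipped (build scripts, tests, generated docs).
NON_CONFLICTING_CATEGORIES = {"build", "test", "docs"}


def find_license_conflicts(asserted_license, file_scans):
    """Return (path, license) pairs that differ from the asserted license.

    `file_scans` is a hypothetical list of dicts with 'path', 'category'
    and 'licenses' keys. Files in NON_CONFLICTING_CATEGORIES are skipped.
    """
    conflicts = []
    for scan in file_scans:
        if scan.get("category") in NON_CONFLICTING_CATEGORIES:
            continue
        for detected in scan["licenses"]:
            if detected != asserted_license:
                conflicts.append((scan["path"], detected))
    return conflicts
```

With an asserted MIT license, a GPL-licensed file classified as `build` would be ignored, while a GPL-licensed file under the package sources would be reported.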

Moreover, licenses from transitive dependencies are not taken into account.

If the transitive dependencies have been resolved and their code is present in the codebase, then they would be caught by a static ScanCode scan and eventually scanned for both package metadata and actual license detection. There are some caveats that would need to be dealt with, of course, as some tools (e.g. Maven) may not store the corresponding artifacts/Jars locally (e.g. side-by-side with a given checkout) and instead use a "global" dot directory in the user's home directory to store a cache.

Beyond this, the goal is to use the other tools that I mentioned above in the future for actual dependency resolution of a single package or a complete manifest.

@sschuberth
Collaborator

Thanks @pombredanne for the sophisticated write-up. I believe it's too valuable to just sit here, so would you mind adding it to a README.md in the packagedcode directory?

@pombredanne
Member Author

@sschuberth good point! Will do today.

@pombredanne
Member Author

@linexb I wonder if you could help there. The code needs some love and has many tests failing.

@pombredanne
Member Author

@linexb Thanks.... Looking good!

@pombredanne pombredanne merged commit 088e86a into develop Jan 5, 2017
@pombredanne pombredanne deleted the 253-python-packages branch January 5, 2017 21:58
pombredanne added a commit that referenced this pull request Jan 6, 2017
 * following discussion with @sschuberth in
 #421 (comment)

Signed-off-by: Philippe Ombredanne <[email protected]>