Modify pipelines to get purls from package_data #904
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
See various comments.
Also:
Disables the package assembly for the scan_codebase_packages pipeline. So we only have the codebase scanned for package_data and these are added to the specific resources without creating DiscoveredPackage and DiscoveredDependency objects out of these package data, by assembling.
I'm not sure about changing the behavior of package detection in scan_codebase_packages. It's unclear to me, from looking at the code, what the consequences of this would be.
If the main objective of this pipeline is to get purls from the codebase and use this as a previous step for the populate_purldb pipeline, then we don't need to create DiscoveredPackage/DiscoveredDependency instances; having resource-level package detections in each resource is enough. (We can go a step further by passing an argument to scancode to detect only purl-related information and ignore everything else, which will be the next step.) The package assembly step uses these resource-level detections to create instances and assign files, but this is not required in this context.
@tdruez thanks for the review and feedback, I've addressed these, ready for your review again 😄
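As a rough illustration of what "resource-level detections are enough" could look like, here is a minimal sketch that reads purls straight from per-resource package_data entries without building any package or dependency objects. The dict layout (a `package_data` list whose entries carry a `purl` and a `dependencies` list) is assumed from ScanCode-style scan output, and `purls_from_package_data` is a hypothetical helper, not a function from this PR.

```python
from typing import Iterable


def purls_from_package_data(resources: Iterable[dict]) -> list[str]:
    """Collect purls from resource-level package_data, skipping assembly.

    Hypothetical helper: each resource is assumed to be a plain dict with
    an optional "package_data" list, as in ScanCode-style JSON output.
    """
    purls = []
    for resource in resources:
        for package_data in resource.get("package_data", []):
            # The purl of the package described by this manifest, if any.
            purl = package_data.get("purl")
            if purl:
                purls.append(purl)
            # Purls of the dependencies declared in the same manifest.
            for dependency in package_data.get("dependencies", []):
                dependency_purl = dependency.get("purl")
                if dependency_purl:
                    purls.append(dependency_purl)
    return purls


# Illustrative input only; paths and purls are made up for the example.
resources = [
    {
        "path": "setup.py",
        "package_data": [
            {
                "purl": "pkg:pypi/asgiref@3.3.0",
                "dependencies": [{"purl": "pkg:pypi/pytest"}],
            }
        ],
    },
    {"path": "README.rst", "package_data": []},
]
collected = purls_from_package_data(resources)
```

Nothing here assigns files to packages or merges manifests, which is the work the assembly step would otherwise do.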
@AyanSinhaMahapatra It would be nice to add some details in the docstring of ScanCodebasePackages to describe its purpose and its limitations.
@@ -70,7 +70,7 @@ def feed_purldb(self, packages, package_type):
         if not purldb.is_available():
             raise Exception("PurlDB is not available.")

-        package_urls = [package.purl for package in packages]
+        package_urls = list(set([package.purl for package in packages]))
Note that when we get purls from package_data and not from package/dependency instances, we might have duplicate purls (for example, from several manifests of the same package instance, or from different manifests that declare the same dependencies), so it makes sense to send only the unique purls for indexing in purldb.
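To make the duplicate case concrete, here is a small sketch with made-up purls. It shows the `list(set(...))` approach from the diff next to an order-preserving alternative using `dict.fromkeys`; the alternative is only a suggestion for cases where stable ordering matters (for example in test expectations), not something this PR does.

```python
# Illustrative purls as they might be collected from several manifests.
purls = [
    "pkg:pypi/asgiref@3.3.0",  # from one manifest of the package
    "pkg:pypi/asgiref@3.3.0",  # same package, seen in another manifest
    "pkg:pypi/pytest",         # dependency declared in one manifest
    "pkg:pypi/pytest",         # same dependency declared elsewhere
]

# Approach used in the diff: unique values, but iteration order of a set
# is not guaranteed to match the input order.
unique_unordered = list(set(purls))

# Alternative: dict keys are unique and keep first-seen insertion order.
unique_ordered = list(dict.fromkeys(purls))
```

Both lists hold the same two purls; only `unique_ordered` is guaranteed to list them in first-seen order.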
Had to update expectations at 98fc128 for the same reason: both packages had the same purl, at https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L148 and https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L258. (SCTK should also not have reported 2 packages in this case.)
Updates test expectations after modifying the populate purldb pipeline to only send unique purls. Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
2 more minor suggestions and we'll be ready ;)
@tdruez latest comments addressed, thanks for the suggestions. 😄
* Modify pipelines to get purls from package_data
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Only get purls from package data if no packages
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Add docstrings and tests
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Only submit unique purls
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Add docstrings and comments from feedback
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Update test expectations
  Updates test expectations after modifying the populate purldb pipeline to only send unique purls.
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Address review comments
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
---------
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
The scan_codebase_packages pipeline currently takes a lot of time for large projects with ~700k resources. It gets stuck in the scan_for_application_packages step, so presumably it's the package assembly in there that takes the longest.

The main objective of this pipeline was to serve as a precursor to the populate_purldb pipeline: get purls from the codebase and submit them to purldb for indexing. So this PR does the following:

- Disables the package assembly for the scan_codebase_packages pipeline. So we only have the codebase scanned for package_data, and these are added to the specific resources without creating DiscoveredPackage and DiscoveredDependency objects out of these package data by assembling.
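One commit in this PR is titled "Only get purls from package data if no packages". A minimal sketch of that kind of fallback, under the assumption that it means "prefer purls from assembled packages and fall back to resource-level package_data only when there are none", could look like the following. `select_purls` and its arguments are hypothetical names used for illustration; the actual pipeline code may be structured differently.

```python
def select_purls(package_purls, package_data_purls):
    """Pick the purls to submit, preferring assembled packages.

    Hypothetical helper: both arguments are plain lists of purl strings.
    The result is deduplicated and sorted so it is stable across runs.
    """
    if package_purls:
        # Assembled packages exist, so use them and ignore package_data.
        return sorted(set(package_purls))
    # No assembled packages (e.g. assembly was disabled): fall back to the
    # purls found in resource-level package_data.
    return sorted(set(package_data_purls))


# With assembly disabled there are no package purls, so the fallback is used.
from_fallback = select_purls(
    [],
    ["pkg:pypi/pytest", "pkg:pypi/asgiref@3.3.0", "pkg:pypi/pytest"],
)

# When packages were assembled, their purls take precedence.
from_packages = select_purls(
    ["pkg:pypi/asgiref@3.3.0"],
    ["pkg:pypi/pytest"],
)
```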