Modify pipelines to get purls from package_data #904

Merged
merged 8 commits into from
Sep 11, 2023

Conversation

AyanSinhaMahapatra
Member

The scan_codebase_packages pipeline currently takes a long time on large projects (~700k resources). It gets stuck in the scan_for_application_packages step, so presumably the package assembly performed there is what takes the longest.
The main objective of this pipeline was to serve as a precursor to the populate_purldb pipeline: get purls from the codebase and submit them to purldb for indexing. So this PR does the following:

  1. Disables package assembly for the scan_codebase_packages pipeline. The codebase is only scanned for package_data, which is added to the specific resources, without assembling DiscoveredPackage and DiscoveredDependency objects from that package data.
  2. If no packages/dependencies were added to the project, further looks for package data detected in the resources and adds purls from those. This is skipped when packages/dependencies exist, since otherwise the purls from these packages/deps would be added twice.
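
As a rough illustration of the fallback in point 2, here is a minimal sketch using plain dicts instead of the real models. The function name `collect_purls` and the dict layout (a `package_data` list per resource, each entry with a `purl` and a `dependencies` list) are assumptions made for this example, not the code in this PR.

```python
def collect_purls(package_purls, dependency_purls, resources):
    """Return purls from assembled packages/dependencies if any exist,
    otherwise fall back to the package_data found on resources.

    Hypothetical helper: names and data layout are illustrative only.
    """
    if package_purls or dependency_purls:
        # Assembled objects exist: use only those, so that the same purl
        # is not collected a second time from resource-level package_data.
        return list(package_purls) + list(dependency_purls)

    purls = []
    for resource in resources:
        for package_data in resource.get("package_data", []):
            purl = package_data.get("purl")
            if purl:
                purls.append(purl)
            # Dependencies declared in a manifest also carry purls.
            for dependency in package_data.get("dependencies", []):
                dependency_purl = dependency.get("purl")
                if dependency_purl:
                    purls.append(dependency_purl)
    return purls


# Example data: one manifest with one declared dependency.
example_resources = [
    {
        "package_data": [
            {
                "purl": "pkg:pypi/asgiref@3.3.0",
                "dependencies": [{"purl": "pkg:pypi/pytest"}],
            }
        ]
    },
    {"package_data": []},
]
fallback_purls = collect_purls([], [], example_resources)
```

With no assembled packages or dependencies, the purls come from the resources; as soon as either list is non-empty, the resource-level data is ignored.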

Contributor

@tdruez tdruez left a comment

See various comments.

Also:

Disables the package assembly for the scan_codebase_packages pipeline. So we only have the codebase scanned for package_data and these are added to the specific resources without creating DiscoveredPackage and DiscoveredDependency objects out of these package data, by assembling.

I'm not sure about changing the behavior of package detection in scan_codebase_packages. It's unclear to me looking at the code the consequence of this.

scanpipe/pipelines/scan_codebase_packages.py
scanpipe/pipes/scancode.py (outdated)
scanpipe/pipes/scancode.py
scanpipe/pipelines/populate_purldb.py (outdated)
scanpipe/pipelines/populate_purldb.py
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
@AyanSinhaMahapatra
Member Author

I'm not sure about changing the behavior of package detection in scan_codebase_packages. It's unclear to me looking at the code the consequence of this.

If the main objective of this pipeline is to get purls from the codebase as a preliminary step for the populate_purldb pipeline, then we don't need to create DiscoveredPackage/DiscoveredDependency instances; having resource-level package detections in each resource is enough. (We can go a step further by passing an argument to scancode to detect only purl-related information and ignore everything else, which will be the next step.) The package assembly step uses these resource-level detections to create instances and assign files, but this is not required in this context.

@tdruez thanks for the review and feedback, I've addressed these, ready for your review again 😄

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Contributor

@tdruez tdruez left a comment

@AyanSinhaMahapatra It would be nice to add some details in the docstring of ScanCodebasePackages to describe its purpose and its limitations.

scanpipe/pipelines/populate_purldb.py
@@ -70,7 +70,7 @@ def feed_purldb(self, packages, package_type):
         if not purldb.is_available():
             raise Exception("PurlDB is not available.")

-        package_urls = [package.purl for package in packages]
+        package_urls = list(set([package.purl for package in packages]))
Member Author

Note that when we get purls from package_data rather than from package/dependency instances, we might have duplicate purls (for example, from several manifests of the same package instance, or from different manifests that declare the same dependencies), so it makes sense to send only the unique purls for indexing in purldb.
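
To make the duplication concrete, here is a small sketch with made-up purls standing in for what two manifests of the same package might yield. The `list(set(...))` form matches the change in this diff; the `dict.fromkeys` variant is an alternative shown only for comparison, for cases where the original order of purls should be kept.

```python
# Made-up input: the same purl detected from two manifests of one package.
detected_purls = [
    "pkg:pypi/asgiref@3.3.0",
    "pkg:pypi/asgiref@3.3.0",
    "pkg:pypi/pytest",
]

# Same approach as the diff: unique values, but order is not guaranteed.
unique_purls = list(set(detected_purls))

# Alternative (not what this PR does): unique values in first-seen order.
unique_purls_ordered = list(dict.fromkeys(detected_purls))
```

Since purldb only needs the set of purls to index, losing the order is harmless here; the ordered variant mainly helps when test expectations compare lists directly.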

Member Author

Had to update expectations at 98fc128 for the same reason: both packages had the same purl, at https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L148 and https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L258. (SCTK also should not have reported 2 packages in this case.)

Updates test expectations after modifying the populate
purldb pipeline to only send unique purls.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Contributor

@tdruez tdruez left a comment

2 more minor suggestions and we'll be ready ;)

scanpipe/pipes/scancode.py (outdated)
scanpipe/pipes/scancode.py (outdated)
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
@AyanSinhaMahapatra
Member Author

@tdruez latest comments addressed, thanks for the suggestions. 😄

@tdruez tdruez merged commit ade2953 into main Sep 11, 2023
@tdruez tdruez deleted the update-scan-codebase-packages branch September 11, 2023 10:31
Hritik14 pushed a commit to Hritik14/scancode.io that referenced this pull request Oct 16, 2023
* Modify pipelines to get purls from package_data

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Only get purls from package data if no packages

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Add docstrings and tests

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Only submit unique purls

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Add docstrings and comments from feedback

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Update test expectations

Updates test expectations after modifying the populate
purldb pipeline to only send unique purls.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Address review comments

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

---------

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>