Modify pipelines to get purls from package_data #904

AyanSinhaMahapatra · 2023-09-01T05:30:46Z

The scan_codebase_packages pipeline currently takes a long time on large projects (~700k resources). It gets stuck in the scan_for_application_packages step, so presumably the package assembly in that step is what takes the longest.
The main objective of this pipeline is to serve as a precursor to the populate_purldb pipeline: it gets purls from the codebase so they can be submitted to purldb for indexing. So this PR does the following:

Disables the package assembly for the scan_codebase_packages pipeline, so the codebase is only scanned for package_data, which is added to the specific resources without assembling it into DiscoveredPackage and DiscoveredDependency objects.
If no packages/dependencies have been added to the project, further looks for package data detected in the resources and adds purls from those. (When packages/dependencies do exist this is skipped, otherwise the purls from these packages/deps would be added twice.)
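The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Project` class and the `select_purls` function are hypothetical stand-ins, and the real scanpipe models and pipeline steps look different.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical stand-in for a project; not the scanpipe model."""

    package_purls: list = field(default_factory=list)
    dependency_purls: list = field(default_factory=list)
    # One list of package_data purls per scanned resource.
    resource_package_data_purls: list = field(default_factory=list)


def select_purls(project):
    """Prefer purls from assembled packages/dependencies, else fall back."""
    assembled = project.package_purls + project.dependency_purls
    if assembled:
        # Also reading package_data here would add the same purls twice.
        return assembled
    # No packages/dependencies: use the resource-level package data.
    return [
        purl
        for resource_purls in project.resource_package_data_purls
        for purl in resource_purls
    ]
```

With assembled packages present, only those purls are returned; with none, the purls come from the per-resource package data.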

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez

See various comments.

Also:

> Disables the package assembly for the scan_codebase_packages pipeline, so the codebase is only scanned for package_data, which is added to the specific resources without assembling it into DiscoveredPackage and DiscoveredDependency objects.

I'm not sure about changing the behavior of package detection in scan_codebase_packages. Looking at the code, the consequences of this are unclear to me.

scanpipe/pipelines/scan_codebase_packages.py

scanpipe/pipes/scancode.py

scanpipe/pipelines/populate_purldb.py

AyanSinhaMahapatra · 2023-09-01T18:01:34Z

> I'm not sure about changing the behavior of package detection in scan_codebase_packages. Looking at the code, the consequences of this are unclear to me.

If the main objective of this pipeline is to get purls from the codebase and to serve as a previous step for the populate_purldb pipeline, then we do not need to create DiscoveredPackage/DiscoveredDependency instances: having resource-level package detections in each resource is enough. (We can go a step further by passing an argument to scancode to detect only purl-related information and ignore everything else, which will be the next step.) The package assembly step uses these resource-level detections to create instances and assign files, but this is not required in this context.
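To illustrate why resource-level detections are enough for collecting purls, here is a rough sketch that walks per-resource package data without any assembly. It assumes each resource is a plain dict with a `package_data` list whose entries carry a `purl` and an optional `dependencies` list of dicts with their own `purl`; the function name `purls_from_package_data` and the sample data are hypothetical and not part of this PR.

```python
def purls_from_package_data(resources):
    """Yield purls found in each resource's package_data, without assembly."""
    for resource in resources:
        for package_data in resource.get("package_data", []):
            purl = package_data.get("purl")
            if purl:
                yield purl
            # Dependencies declared in the same manifest carry purls too.
            for dependency in package_data.get("dependencies", []):
                dependency_purl = dependency.get("purl")
                if dependency_purl:
                    yield dependency_purl


# Hypothetical sample data, shaped like the assumption stated above.
sample_resources = [
    {
        "path": "setup.py",
        "package_data": [
            {
                "purl": "pkg:pypi/asgiref@3.3.0",
                "dependencies": [{"purl": "pkg:pypi/pytest"}],
            }
        ],
    },
    {"path": "README.rst", "package_data": []},
]

collected = list(purls_from_package_data(sample_resources))
```

No package or dependency instances are created and no files are assigned; the purls are read straight from what the scan stored on each resource.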

@tdruez thanks for the review and feedback, I've addressed these, ready for your review again 😄

tdruez

@AyanSinhaMahapatra It would be nice to add some details in the docstring of ScanCodebasePackages to describe its purpose and its limitations.

scanpipe/pipelines/populate_purldb.py

AyanSinhaMahapatra · 2023-09-11T07:49:52Z

scanpipe/pipelines/populate_purldb.py

@@ -70,7 +70,7 @@ def feed_purldb(self, packages, package_type):
        if not purldb.is_available():
            raise Exception("PurlDB is not available.")

-        package_urls = [package.purl for package in packages]
+        package_urls = list(set([package.purl for package in packages]))


Note that when we get purls from package_data rather than from package/dependency instances, we might have duplicate purls (for example from several manifests of the same package instance, or from different manifests declaring the same dependencies), so it makes sense to send only the unique purls for indexing in purldb.
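As a side note on the deduplication above: `list(set(...))` removes duplicates but does not keep the original order of the purls. If a stable order were wanted, one possible variant is sketched below. The helper name `unique_purls` is hypothetical and is not what this PR uses.

```python
def unique_purls(purls):
    """Return the purls without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(purls))


deduplicated = unique_purls(
    [
        "pkg:pypi/asgiref@3.3.0",
        "pkg:pypi/pytest",
        "pkg:pypi/asgiref@3.3.0",
    ]
)
```

Either approach sends each purl only once; the difference is only whether the order of the submitted list is predictable.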

Had to update expectations at 98fc128 for the same reason: both packages had the same purl, at https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L148 and https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/data/asgiref-3.3.0_toolkit_scan.json#L258. (SCTK should also not have reported 2 packages in this case.)

Updates test expectations after modifying the populate purldb pipeline to only send unique purls.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez

2 more minor suggestions and we'll be ready ;)

scanpipe/pipes/scancode.py

AyanSinhaMahapatra · 2023-09-11T10:16:01Z

@tdruez latest comments addressed, thanks for the suggestions. 😄

* Modify pipelines to get purls from package_data
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Only get purls from package data if no packages
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Add docstrings and tests
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Only submit unique purls
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Add docstrings and comments from feedback
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Update test expectations
  Updates test expectations after modifying the populate purldb pipeline to only send unique purls.
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
* Address review comments
  Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
---------
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

Modify pipelines to get purls from package_data

23fa79f

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

AyanSinhaMahapatra requested a review from tdruez September 1, 2023 05:30

Only get purls from package data if no packages

4c9f36f

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez requested changes Sep 1, 2023

Add docstrings and tests

e47e85e

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

Only submit unique purls

e628d6a

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez requested changes Sep 5, 2023

scanpipe/pipelines/populate_purldb.py

AyanSinhaMahapatra added 2 commits September 11, 2023 13:15

Add docstrings and comments from feedback

4e87c70

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

Merge branch 'main' into update-scan-codebase-packages

645e25c

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

AyanSinhaMahapatra commented Sep 11, 2023

Update test expectations

98fc128

Updates test expectations after modifying the populate purldb pipeline to only send unique purls.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez requested changes Sep 11, 2023

scanpipe/pipes/scancode.py (outdated)

scanpipe/pipes/scancode.py (outdated)

Address review comments

7b4925b

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

tdruez merged commit ade2953 into main Sep 11, 2023

tdruez deleted the update-scan-codebase-packages branch September 11, 2023 10:31
