Add osv-scalibr as a dependency extraction method for PR remediation #4688

evankanderson · 2024-10-08T20:57:02Z

Summary

Switches from a home-rolled dependency parser to the (in-development) OSV-Scalibr library from Google, which already has parsers for our current set of dependencies.†

† Except that OSV-Scalibr doesn't seem to parse dependencies from package.json, and only parses package-lock.json. We'd prefer to get both dependency locations, so we can patch both of them.

I have not documented the new-dep PR ingestion type, as I'd like to replace the existing dep type (via experiment) over time, but wasn't quite ready to wire that up in the same PR.

This also moves us to development branches of both osv-scalibr and go-billy to take advantage of the in-flight code. osv-scalibr's development branch has a new API that the Google team recommended, and go-billy has an adapter from the billy.Filesystem type to io/fs.FS native Golang (read-only) filesystem.

Change Type

Mark the type of change your PR introduces:

Bug fix (resolves an issue without affecting existing features)
Feature (adds new functionality without breaking changes)
Breaking change (may impact existing functionalities or require documentation updates)
Documentation (updates or additions to documentation)
Refactoring or test improvements (no bug fixes or new functionality)

Testing

I did a bunch of manual testing with cmd/dev on an actual in-flight PR, and then added unit tests to cover the languages we currently support.

Review Checklist:

Reviewed my own code for quality and clarity.
Added comments to complex or tricky code sections.
Updated any affected documentation.
Included tests that validate the fix or feature.
Checked that related changes are merged.

stacklok-cloud-staging · 2024-10-08T20:57:06Z

Minder Vulnerability Report ✅

Minder analyzed this PR and found it does not add any new vulnerable dependencies.

Vulnerability scan of fcd79ea5:

🐞 vulnerable packages: 0

🛠 fixes available for: 0

evankanderson · 2024-10-08T21:02:25Z

@oliverchang -- I'd appreciate your review of the usage of github.com/google/osv-scalibr in this PR.

Thanks!

oliverchang · 2024-10-09T05:13:28Z

@erikvarga, @another-rex are the actual OSV-Scalibr experts -- would you be able to help here?

erikvarga · 2024-10-09T12:21:15Z

internal/engine/ingester/diff/diff.go

+}
+
+func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
+	return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))


Can you get packages whose name and version are the same? (e.g. two identical lockfiles scanned from different files)
We have an existing comparison function for Inventory in SCALIBR that looks at all the relevant fields of the inventory: https://github.com/google/osv-scalibr/blob/main/scalibr.go#L293
If you think it'd beneficial to use for you we can make it public.

Oh! Good point. Yes, that would be a useful function to expose, though you might benefit from using cmp.Or and cmp.Compare over the current implementation. 😁

I'm also wondering if you want to guarantee that extractor.Inventory.Locations is sorted; as it is, it's possible to construct two Inventories which are equivalent but sort differently.

erikvarga · 2024-10-09T12:23:46Z

internal/engine/ingester/diff/diff.go

+	}
+
+	desiredCaps := scalibr_plugin.Capabilities{
+		OS:            scalibr_plugin.OSAny,


OSAny is supposed to be used for specifying plugin requirements only. When specifying the scan environment's capabilities you should pick a specific OS.
You can use the SCALIBR-prodivded platform.OS() function to determine the OS at runtime: https://github.com/google/osv-scalibr/blob/main/binary/cli/cli.go#L418

So, this is kinda funny... we actually want to use the platform that the developers would be using for their builds (which is not easy to extract from a repo), but we're probably running on Linux, so I guess we can use that for now.

erikvarga · 2024-10-09T12:25:24Z

internal/engine/ingester/diff/diff.go

+		// All includes Ruby, Dotnet which we're not ready to test yet, so use the more limited Default set.
+		FilesystemExtractors: list.FilterByCapabilities(list.Default, &desiredCaps),
+		Capabilities: &desiredCaps,
+		Stats: scalibr_stats.NoopCollector{},


This can also be left empty

Thanks, done

erikvarga · 2024-10-09T12:27:25Z

internal/engine/ingester/diff/diff.go

+		zerolog.Ctx(context.Background()).Warn().Msg("nil ecosystem scanning diffs")
+		return pbinternal.DepEcosystem_DEP_ECOSYSTEM_UNSPECIFIED
+	}
+	// This should be inventory.GetEcosystem()... but there isn't a convenience wrapper yet


What about inventory.Ecosystem()?

https://github.com/google/osv-scalibr/blob/main/extractor/extractor.go#L86

Oop, I somehow missed these when I was reading the code (since godoc isn't published yet). Thanks!

erikvarga · 2024-10-09T12:30:10Z

internal/engine/ingester/diff/diff.go

+	}
+
+	// Sometimes Scalibr uses the string "PyPI" instead of "pypi" when reporting the ecosystem.
+	switch strings.ToLower(ecosystem) {


If you're interested in the PURL-defined package type you should use converter.ToPURL(inventory).Type or inventory.Extractor.ToPURL(inventory).Type (https://github.com/google/osv-scalibr/blob/main/converter/converter.go#L45)

Ecosystem is specific to the OSV schema and sometimes use different strings than PURL package types.

Hm, we should probably provide an Inventory.PURL() member function similar to Inventory.Ecosystem() so people don't have to import converter.go.

I couldn't find constants for the OSV Ecosystem types -- it would be great if there were some type hints in the interface that would point me towards the correct constants (and then enforced the use of those constants in the rest of the codebase). I'll switch to using ToPURL for now -- is there a "preferred" set of values? (It seems like a shame to assemble an entire PURL just to extract the Type parameter.)

I'm also wondering if it makes sense to remove the error parameter from the Ecosystem, ToPURL and ToCPEs interface methods -- the only place where it is used in is in the extractor_dummy.go files, where the Extract method that returns []*extractor.Inventory already returns an error and a nil list. Removing the error parameter would simplify a lot of consumer code and coverage checks.

Right, at the moment we don't really have any cases where we'd need to return an error. I'll discuss internally if we ever would want to do that or if we can just remove it.

@oliverchang are you aware of any documentation about the list of OSV Ecosystem values apart from https://ossf.github.io/osv-schema/#affectedpackage-field?

I'll discuss internally if we ever would want to do that or if we can just remove it.

Update: We ended up removing it: google/osv-scalibr@9f86538

So if you pin SCALIBR to after this commit you should be able to get rid of the error handling in your code too.

evankanderson · 2024-10-08T21:05:47Z

internal/engine/ingester/diff/diff.go

 	case "", pb.DiffTypeDep:
-		allDiffs := make([]*pbinternal.PrDependencies_ContextualDependency, 0)
-		for {
-			if pr.Number > math.MaxInt {
-				return nil, fmt.Errorf("pr number is too large")
-			}
-			prFiles, resp, err := di.cli.ListFiles(ctx, pr.RepoOwner, pr.RepoName, int(pr.Number), prFilesPerPage, page)
-			if err != nil {
-				return nil, fmt.Errorf("error getting pull request files: %w", err)
-			}
+		return di.getDepTypeDiff(ctx, prNumber, pr)

-			for _, file := range prFiles {
-				fileDiffs, err := di.ingestFileForDepDiff(file.GetFilename(), file.GetPatch(), file.GetRawURL(), logger)
-				if err != nil {
-					return nil, fmt.Errorf("error ingesting file %s: %w", file.GetFilename(), err)
-				}
-				allDiffs = append(allDiffs, fileDiffs...)
-			}
+	case pb.DiffTypeNewDeps:
+		// TODO: once we've tested some, convert DiffTypeDep to use this algorithm.
+		return di.getScalibrTypeDiff(ctx, prNumber, pr)

-			if resp.NextPage == 0 {
-				break
-			}
+	case pb.DiffTypeFull:
+		return di.getFullTypeDiff(ctx, prNumber, pr)

-			page = resp.NextPage
-		}
+	default:
+		return nil, fmt.Errorf("unknown diff type")
+	}
+}

-		return &engif.Result{
-			Object: &pbinternal.PrDependencies{
-				Pr:   pr,
-				Deps: allDiffs,
-			},
-			// NOTE: At this point we're only retrieving the timestamp as the checkpoint.
-			Checkpoint: checkpoints.NewCheckpointV1(time.Now()),
-		}, nil
+func (di *Diff) getDepTypeDiff(ctx context.Context, prNumber int, pr *pb.PullRequest) (*engif.Result, error) {
+	deps := pbinternal.PrDependencies{Pr: pr}
+	page := 0

-	case pb.DiffTypeFull:


This method was getting "big" from a cyclomatic complexity point of view, so I moved the code for each switch case into a separate function.

(I can pull this out into a separate PR if it would help for reviewing)

evankanderson · 2024-10-08T21:06:48Z

internal/engine/ingester/diff/diff.go

+func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
+	return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))
+}


It would be nice to have this in the extractor module, but it's not too hard to have it here.

evankanderson · 2024-10-09T15:56:55Z

internal/engine/ingester/diff/diff.go

+	}
+
+	// Sometimes Scalibr uses the string "PyPI" instead of "pypi" when reporting the ecosystem.
+	switch strings.ToLower(ecosystem) {


I couldn't find constants for the OSV Ecosystem types -- it would be great if there were some type hints in the interface that would point me towards the correct constants (and then enforced the use of those constants in the rest of the codebase). I'll switch to using ToPURL for now -- is there a "preferred" set of values? (It seems like a shame to assemble an entire PURL just to extract the Type parameter.)

I'm also wondering if it makes sense to remove the error parameter from the Ecosystem, ToPURL and ToCPEs interface methods -- the only place where it is used in is in the extractor_dummy.go files, where the Extract method that returns []*extractor.Inventory already returns an error and a nil list. Removing the error parameter would simplify a lot of consumer code and coverage checks.

evankanderson · 2024-10-09T15:57:25Z

internal/engine/ingester/diff/diff.go

+		zerolog.Ctx(context.Background()).Warn().Msg("nil ecosystem scanning diffs")
+		return pbinternal.DepEcosystem_DEP_ECOSYSTEM_UNSPECIFIED
+	}
+	// This should be inventory.GetEcosystem()... but there isn't a convenience wrapper yet


Oop, I somehow missed these when I was reading the code (since godoc isn't published yet). Thanks!

evankanderson · 2024-10-09T23:47:47Z

internal/engine/ingester/diff/diff.go

+}
+
+func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
+	return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))


Oh! Good point. Yes, that would be a useful function to expose, though you might benefit from using cmp.Or and cmp.Compare over the current implementation. 😁

I'm also wondering if you want to guarantee that extractor.Inventory.Locations is sorted; as it is, it's possible to construct two Inventories which are equivalent but sort differently.

evankanderson · 2024-10-09T23:52:57Z

internal/engine/ingester/diff/diff.go

+	}
+
+	desiredCaps := scalibr_plugin.Capabilities{
+		OS:            scalibr_plugin.OSAny,


So, this is kinda funny... we actually want to use the platform that the developers would be using for their builds (which is not easy to extract from a repo), but we're probably running on Linux, so I guess we can use that for now.

evankanderson · 2024-10-09T23:54:16Z

internal/engine/ingester/diff/diff.go

+		// All includes Ruby, Dotnet which we're not ready to test yet, so use the more limited Default set.
+		FilesystemExtractors: list.FilterByCapabilities(list.Default, &desiredCaps),
+		Capabilities: &desiredCaps,
+		Stats: scalibr_stats.NoopCollector{},


Thanks, done

alowayed · 2024-10-10T00:56:49Z

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

evankanderson · 2024-10-10T13:00:34Z

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

evankanderson · 2024-10-10T13:57:51Z

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

Similarly, there is an extractor for go.mod, but it does not also cover go.sum. If the goal is to find dependencies, one file location is sufficient, but if the goal is to locate the files which need fixing, it would be nice to have all the dependency-implicated files (using multiple Location values in one Inventory for the related files).

erikvarga · 2024-10-10T14:25:24Z

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

Adding or modifying extractors in SCALIBR is relatively simple (https://github.com/google/osv-scalibr/blob/main/docs/new_extractor.md) so we're happy to merge in PRs for this functionality.

… with v1.7.22

coveralls · 2024-10-10T19:30:25Z

coverage: 53.082% (+0.2%) from 52.865%
when pulling d92e0ef on evankanderson:pr-scalibr
into 2fa96c9 on stacklok:main.

evankanderson · 2024-10-10T19:41:47Z

BTW, not Scalibr's fault, but it turns out that github.com/containerd/containerd depends on github.com/containerd/errgroup at v0.1.0, and the containerd folks did a bad versioning thing, and released a v0.3.0 that breaks the interfaces used by github.com/containerd/containerd at v1.7.22.

jhrozek · 2024-10-10T20:07:40Z

internal/engine/ingester/diff/diff.go

+				},
+				File: &pbinternal.PrDependencies_ContextualDependency_FilePatch{
+					Name:     filename,
+					PatchUrl: "", // TODO: do we need this?


Not to diff the deps, but we'd need this in case we wanted to provide the inline reviews. (Which in their current form where we just propose the version bump w/o proposing also any possible transitive deps is...demo quality)

Yeah, I want to think through what the abstract interface is for proposing patch comments upstream, but I need to take a look at the GitHub PR and GitLab MR comment APIs before I propose something concrete. (In any case, I think this is a place where we want to be able to leverage the providers so this can be re-used across multiple repo providers.)

Add osv-scalibr as a dependency extraction method for PR remediation

66b7bd6

Merge remote-tracking branch 'upstream/main' into pr-scalibr

6d8807a

Remove unused constant

ba4923e

erikvarga reviewed Oct 9, 2024

View reviewed changes

evankanderson added 2 commits October 9, 2024 16:46

Merge remote-tracking branch 'upstream/main' into pr-scalibr

51cf1d3

Address feedback from @erikvarga

fcd79ea

evankanderson commented Oct 10, 2024

View reviewed changes

erikvarga approved these changes Oct 10, 2024

View reviewed changes

evankanderson added 2 commits October 10, 2024 11:47

Merge remote-tracking branch 'upstream/main' into pr-scalibr

fb12756

Revert to older github.com/containerd/errgroup, as v0.3.0 is breaking…

171fcb7

… with v1.7.22

Fix lint errors

d92e0ef

jhrozek reviewed Oct 10, 2024

View reviewed changes

jhrozek approved these changes Oct 10, 2024

View reviewed changes

evankanderson merged commit f7814a9 into mindersec:main Oct 10, 2024
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add osv-scalibr as a dependency extraction method for PR remediation #4688

Add osv-scalibr as a dependency extraction method for PR remediation #4688

evankanderson commented Oct 8, 2024

stacklok-cloud-staging bot commented Oct 8, 2024 •

edited

Loading

Vulnerability scan of `fcd79ea5:`

evankanderson commented Oct 8, 2024

oliverchang commented Oct 9, 2024

erikvarga Oct 9, 2024

evankanderson Oct 9, 2024

erikvarga Oct 9, 2024

evankanderson Oct 9, 2024

erikvarga Oct 9, 2024

evankanderson Oct 9, 2024

erikvarga Oct 9, 2024

evankanderson Oct 9, 2024

erikvarga Oct 9, 2024

erikvarga Oct 9, 2024

evankanderson Oct 9, 2024

erikvarga Oct 10, 2024

erikvarga Oct 10, 2024

erikvarga Oct 14, 2024

evankanderson Oct 8, 2024

evankanderson Oct 8, 2024

evankanderson Oct 8, 2024

evankanderson Oct 9, 2024

evankanderson Oct 9, 2024

evankanderson Oct 9, 2024

evankanderson Oct 9, 2024

evankanderson Oct 9, 2024

alowayed commented Oct 10, 2024

evankanderson commented Oct 10, 2024

evankanderson commented Oct 10, 2024

erikvarga commented Oct 10, 2024

coveralls commented Oct 10, 2024 •

edited

Loading

evankanderson commented Oct 10, 2024

jhrozek Oct 10, 2024

evankanderson Oct 10, 2024

Add osv-scalibr as a dependency extraction method for PR remediation #4688

Add osv-scalibr as a dependency extraction method for PR remediation #4688

Conversation

evankanderson commented Oct 8, 2024

Summary

Change Type

Testing

Review Checklist:

stacklok-cloud-staging bot commented Oct 8, 2024 • edited Loading

Minder Vulnerability Report ✅

Vulnerability scan of fcd79ea5:

evankanderson commented Oct 8, 2024

oliverchang commented Oct 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alowayed commented Oct 10, 2024

evankanderson commented Oct 10, 2024

evankanderson commented Oct 10, 2024

erikvarga commented Oct 10, 2024

coveralls commented Oct 10, 2024 • edited Loading

evankanderson commented Oct 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stacklok-cloud-staging bot commented Oct 8, 2024 •

edited

Loading

Vulnerability scan of `fcd79ea5:`

coveralls commented Oct 10, 2024 •

edited

Loading