Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add osv-scalibr as a dependency extraction method for PR remediation #4688

Merged
merged 8 commits into from
Oct 10, 2024

Conversation

evankanderson
Copy link
Member

Summary

Switches from a home-rolled dependency parser to the (in-development) OSV-Scalibr library from Google, which already has parsers for our current set of dependencies.†

† Except that OSV-Scalibr doesn't seem to parse dependencies from package.json, and only parses package-lock.json. We'd prefer to get both dependency locations, so we can patch both of them.

I have not documented the new-dep PR ingestion type, as I'd like to replace the existing dep type (via experiment) over time, but wasn't quite ready to wire that up in the same PR.

This also moves us to development branches of both osv-scalibr and go-billy to take advantage of the in-flight code. osv-scalibr's development branch has a new API that the Google team recommended, and go-billy has an adapter from the billy.Filesystem type to io/fs.FS native Golang (read-only) filesystem.

Change Type

Mark the type of change your PR introduces:

  • Bug fix (resolves an issue without affecting existing features)
  • Feature (adds new functionality without breaking changes)
  • Breaking change (may impact existing functionalities or require documentation updates)
  • Documentation (updates or additions to documentation)
  • Refactoring or test improvements (no bug fixes or new functionality)

Testing

I did a bunch of manual testing with cmd/dev on an actual in-flight PR, and then added unit tests to cover the languages we currently support.

Review Checklist:

  • Reviewed my own code for quality and clarity.
  • Added comments to complex or tricky code sections.
  • Updated any affected documentation.
  • Included tests that validate the fix or feature.
  • Checked that related changes are merged.

Copy link

stacklok-cloud-staging bot commented Oct 8, 2024

Minder Vulnerability Report ✅

Minder analyzed this PR and found it does not add any new vulnerable dependencies.

Vulnerability scan of fcd79ea5:

  • 🐞 vulnerable packages: 0
  • 🛠 fixes available for: 0

@evankanderson
Copy link
Member Author

@oliverchang -- I'd appreciate your review of the usage of github.com/google/osv-scalibr in this PR.

Thanks!

@oliverchang
Copy link

@erikvarga, @another-rex are the actual OSV-Scalibr experts -- would you be able to help here?

}

func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you get packages whose name and version are the same? (e.g. two identical lockfiles scanned from different files)
We have an existing comparison function for Inventory in SCALIBR that looks at all the relevant fields of the inventory: https://github.com/google/osv-scalibr/blob/main/scalibr.go#L293
If you think it'd beneficial to use for you we can make it public.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Good point. Yes, that would be a useful function to expose, though you might benefit from using cmp.Or and cmp.Compare over the current implementation. 😁

I'm also wondering if you want to guarantee that extractor.Inventory.Locations is sorted; as it is, it's possible to construct two Inventories which are equivalent but sort differently.

}

desiredCaps := scalibr_plugin.Capabilities{
OS: scalibr_plugin.OSAny,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OSAny is supposed to be used for specifying plugin requirements only. When specifying the scan environment's capabilities you should pick a specific OS.
You can use the SCALIBR-prodivded platform.OS() function to determine the OS at runtime: https://github.com/google/osv-scalibr/blob/main/binary/cli/cli.go#L418

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is kinda funny... we actually want to use the platform that the developers would be using for their builds (which is not easy to extract from a repo), but we're probably running on Linux, so I guess we can use that for now.

// All includes Ruby, Dotnet which we're not ready to test yet, so use the more limited Default set.
FilesystemExtractors: list.FilterByCapabilities(list.Default, &desiredCaps),
Capabilities: &desiredCaps,
Stats: scalibr_stats.NoopCollector{},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be left empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

zerolog.Ctx(context.Background()).Warn().Msg("nil ecosystem scanning diffs")
return pbinternal.DepEcosystem_DEP_ECOSYSTEM_UNSPECIFIED
}
// This should be inventory.GetEcosystem()... but there isn't a convenience wrapper yet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oop, I somehow missed these when I was reading the code (since godoc isn't published yet). Thanks!

}

// Sometimes Scalibr uses the string "PyPI" instead of "pypi" when reporting the ecosystem.
switch strings.ToLower(ecosystem) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're interested in the PURL-defined package type you should use converter.ToPURL(inventory).Type or inventory.Extractor.ToPURL(inventory).Type (https://github.com/google/osv-scalibr/blob/main/converter/converter.go#L45)

Ecosystem is specific to the OSV schema and sometimes use different strings than PURL package types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we should probably provide an Inventory.PURL() member function similar to Inventory.Ecosystem() so people don't have to import converter.go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find constants for the OSV Ecosystem types -- it would be great if there were some type hints in the interface that would point me towards the correct constants (and then enforced the use of those constants in the rest of the codebase). I'll switch to using ToPURL for now -- is there a "preferred" set of values? (It seems like a shame to assemble an entire PURL just to extract the Type parameter.)

I'm also wondering if it makes sense to remove the error parameter from the Ecosystem, ToPURL and ToCPEs interface methods -- the only place where it is used in is in the extractor_dummy.go files, where the Extract method that returns []*extractor.Inventory already returns an error and a nil list. Removing the error parameter would simplify a lot of consumer code and coverage checks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, at the moment we don't really have any cases where we'd need to return an error. I'll discuss internally if we ever would want to do that or if we can just remove it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oliverchang are you aware of any documentation about the list of OSV Ecosystem values apart from https://ossf.github.io/osv-schema/#affectedpackage-field?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll discuss internally if we ever would want to do that or if we can just remove it.

Update: We ended up removing it: google/osv-scalibr@9f86538

So if you pin SCALIBR to after this commit you should be able to get rid of the error handling in your code too.

Comment on lines 112 to -137
case "", pb.DiffTypeDep:
allDiffs := make([]*pbinternal.PrDependencies_ContextualDependency, 0)
for {
if pr.Number > math.MaxInt {
return nil, fmt.Errorf("pr number is too large")
}
prFiles, resp, err := di.cli.ListFiles(ctx, pr.RepoOwner, pr.RepoName, int(pr.Number), prFilesPerPage, page)
if err != nil {
return nil, fmt.Errorf("error getting pull request files: %w", err)
}
return di.getDepTypeDiff(ctx, prNumber, pr)

for _, file := range prFiles {
fileDiffs, err := di.ingestFileForDepDiff(file.GetFilename(), file.GetPatch(), file.GetRawURL(), logger)
if err != nil {
return nil, fmt.Errorf("error ingesting file %s: %w", file.GetFilename(), err)
}
allDiffs = append(allDiffs, fileDiffs...)
}
case pb.DiffTypeNewDeps:
// TODO: once we've tested some, convert DiffTypeDep to use this algorithm.
return di.getScalibrTypeDiff(ctx, prNumber, pr)

if resp.NextPage == 0 {
break
}
case pb.DiffTypeFull:
return di.getFullTypeDiff(ctx, prNumber, pr)

page = resp.NextPage
}
default:
return nil, fmt.Errorf("unknown diff type")
}
}

return &engif.Result{
Object: &pbinternal.PrDependencies{
Pr: pr,
Deps: allDiffs,
},
// NOTE: At this point we're only retrieving the timestamp as the checkpoint.
Checkpoint: checkpoints.NewCheckpointV1(time.Now()),
}, nil
func (di *Diff) getDepTypeDiff(ctx context.Context, prNumber int, pr *pb.PullRequest) (*engif.Result, error) {
deps := pbinternal.PrDependencies{Pr: pr}
page := 0

case pb.DiffTypeFull:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was getting "big" from a cyclomatic complexity point of view, so I moved the code for each switch case into a separate function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I can pull this out into a separate PR if it would help for reviewing)

Comment on lines 249 to 251
func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have this in the extractor module, but it's not too hard to have it here.

}

// Sometimes Scalibr uses the string "PyPI" instead of "pypi" when reporting the ecosystem.
switch strings.ToLower(ecosystem) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find constants for the OSV Ecosystem types -- it would be great if there were some type hints in the interface that would point me towards the correct constants (and then enforced the use of those constants in the rest of the codebase). I'll switch to using ToPURL for now -- is there a "preferred" set of values? (It seems like a shame to assemble an entire PURL just to extract the Type parameter.)

I'm also wondering if it makes sense to remove the error parameter from the Ecosystem, ToPURL and ToCPEs interface methods -- the only place where it is used in is in the extractor_dummy.go files, where the Extract method that returns []*extractor.Inventory already returns an error and a nil list. Removing the error parameter would simplify a lot of consumer code and coverage checks.

zerolog.Ctx(context.Background()).Warn().Msg("nil ecosystem scanning diffs")
return pbinternal.DepEcosystem_DEP_ECOSYSTEM_UNSPECIFIED
}
// This should be inventory.GetEcosystem()... but there isn't a convenience wrapper yet
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oop, I somehow missed these when I was reading the code (since godoc isn't published yet). Thanks!

}

func inventorySorter(a *extractor.Inventory, b *extractor.Inventory) int {
return cmp.Or(cmp.Compare(a.Name, b.Name), cmp.Compare(a.Version, b.Version))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Good point. Yes, that would be a useful function to expose, though you might benefit from using cmp.Or and cmp.Compare over the current implementation. 😁

I'm also wondering if you want to guarantee that extractor.Inventory.Locations is sorted; as it is, it's possible to construct two Inventories which are equivalent but sort differently.

}

desiredCaps := scalibr_plugin.Capabilities{
OS: scalibr_plugin.OSAny,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is kinda funny... we actually want to use the platform that the developers would be using for their builds (which is not easy to extract from a repo), but we're probably running on Linux, so I guess we can use that for now.

// All includes Ruby, Dotnet which we're not ready to test yet, so use the more limited Default set.
FilesystemExtractors: list.FilterByCapabilities(list.Default, &desiredCaps),
Capabilities: &desiredCaps,
Stats: scalibr_stats.NoopCollector{},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

@alowayed
Copy link

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

@evankanderson
Copy link
Member Author

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

@evankanderson
Copy link
Member Author

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

Similarly, there is an extractor for go.mod, but it does not also cover go.sum. If the goal is to find dependencies, one file location is sufficient, but if the goal is to locate the files which need fixing, it would be nice to have all the dependency-implicated files (using multiple Location values in one Inventory for the related files).

@erikvarga
Copy link

osv-scalibr does have a 'package.json' extractor at: extractor/filesystem/language/javascript/packagejson/extractor.go

Yes, but that extractor does not unpack the dependencies or devDependencies values (at the moment).

Adding or modifying extractors in SCALIBR is relatively simple (https://github.com/google/osv-scalibr/blob/main/docs/new_extractor.md) so we're happy to merge in PRs for this functionality.

@coveralls
Copy link

coveralls commented Oct 10, 2024

Coverage Status

coverage: 53.082% (+0.2%) from 52.865%
when pulling d92e0ef on evankanderson:pr-scalibr
into 2fa96c9 on stacklok:main.

@evankanderson
Copy link
Member Author

BTW, not Scalibr's fault, but it turns out that github.com/containerd/containerd depends on github.com/containerd/errgroup at v0.1.0, and the containerd folks did a bad versioning thing, and released a v0.3.0 that breaks the interfaces used by github.com/containerd/containerd at v1.7.22.

},
File: &pbinternal.PrDependencies_ContextualDependency_FilePatch{
Name: filename,
PatchUrl: "", // TODO: do we need this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to diff the deps, but we'd need this in case we wanted to provide the inline reviews. (Which in their current form where we just propose the version bump w/o proposing also any possible transitive deps is...demo quality)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I want to think through what the abstract interface is for proposing patch comments upstream, but I need to take a look at the GitHub PR and GitLab MR comment APIs before I propose something concrete. (In any case, I think this is a place where we want to be able to leverage the providers so this can be re-used across multiple repo providers.)

@evankanderson evankanderson merged commit f7814a9 into mindersec:main Oct 10, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants