Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow license rules to require the presence of certain defining keywords #2637

Closed
2 of 7 tasks
petergardfjall opened this issue Aug 13, 2021 · 10 comments · Fixed by #2773
Closed
2 of 7 tasks

allow license rules to require the presence of certain defining keywords #2637

petergardfjall opened this issue Aug 13, 2021 · 10 comments · Fixed by #2773

Comments

@petergardfjall
Copy link
Contributor

petergardfjall commented Aug 13, 2021

Short Description

In many license expression rules, certain words or phrases carry a much higher significance than others. For example, This software is distributed under a seems a lot less important than MIT License for a rule that describes an MIT license notice. I think it would be valuable if these "defining words" could optionally be marked and, if declared, would prevent matches from happening unless those defining phrases are present in the matched text.

Possible Labels

  • new feature
  • license scan

Select Category

  • Enhancement
  • Add License/Copyright
  • Scan Feature
  • Packaging
  • Documentation
  • Expand Support
  • Other

Describe the Update

This is a feature proposal which I haven't been able to find any discussions of when searching through the issues (it's quite possible that I have just missed it!). Anyhow, I think it would be really valueable if it was possible in a license expression rule to mark certain "defining words" (or phrases) as required to be present in the scanned text in order for a match to be reported.

As an example, I've seen quite a lot of false positives that I believe could be eliminated if these crucial phrases were enforced. For example, the following text

## License

This SDK is distributed under the Apache License, Version 2.0, see LICENSE for more information.

was matched by mit_923.RULE:

License
Distributed under the MIT License. See LICENSE for more information.

even though it's pretty clear to a human reader that there are certain crucial aspects of this rule that doesn't match. That is to say, some words in the rule are more significant in others, and are essentially defining for the license expression. In this case, of course it is MIT that would be that defining phrase.

As can be clearly seen from a scan match:

      "matched_rule": {
        "identifier": "mit_923.RULE",
        ..
      },
      "matched_text": "License\n\n[This] [SDK] [is] distributed under the [Apache] License, [Version] [2].[0], see LICENSE for more information."

the defining aspect of the license expression (MIT) is missing from the match.

Now, I think it would be really cool and, I imagine, would reduce false positive by quite a significant number, if it was possible to mark these defining words/phrases (mandatory words, keywords, or whatever one migth want to call them) in a rule.

How This Feature will help you/your organization

I believe it would reduce the number of false positives quite substantially although that remains to be seen.

Possible Solution/Implementation Details

Although I'm unsure of the feasibility of implementing such a solution, it would be nice if it was possible to highlight these words directly in the license RULE file, for example surrounded by some markup:

License
Distributed under the {{MIT}} License. See LICENSE for more information.

The semantics of this would be that the rule would never match (irrespective of score) if MIT wasn't present in the matched text.

I suppose an alternative realization, although not as appealing, would be to include these mandatory keywords/phrases as an attribute in the .yml file:

license_expression: mit
is_license_notice: yes
relevance: 100
referenced_filenames:
    - LICENSE
key_phrases:
    - MIT

or something to that effect.

Example/Links if Any

Can you help with this Feature

I have spent zero time in the codebase, but if given a proper introduction I suppose I might be able to help out.

@pombredanne
Copy link
Member

pombredanne commented Aug 13, 2021

@petergardfjall This makes sense but within reason! Why? early versions of scancode were using a {{ markup }} to tag parts that could possibly not be matched... which is related but different from key phrases, and that markup was abandonned as being too complex and too inneficient with the engine used at that time (see https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/textcode/analysis.py for the code that tokenized the rule text and an example of rules with {{ template }} markup at https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_2.RULE#L13 and https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5 or https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_14.RULE#L11 )

The syntax was more or less {{<number> some text that was ignored}} where the number meant that this number of words could be skipped before deciding to give up on strings alignment.
https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5
This was dropped as it proved hard to maintain and was making the code super complex and slow.
But that was using an entirely different license matching engine back then.

That said, requiring the presence of some key, essential words or phrases would make a lot of sense and would be reasonably easy to get working. It might make some license not detected at all in some odd cases such as the one you fixed with #2636 but these will be caught alright by @akugarg new feature for unknown unknown license detection, so there is no harm I can foresee.

I would go about it this way:

Now we could use either markup or explicit data attribute in the YAML data file. Either can work. Using an explicit data attribute may be much simpler to process than markup yet may be not as appealing, as you wrote, in particular as this could not take into account the relative positions of these phrases... But markup would still be much more involved IMHO.

BTW, closely related feature could be to reinstate the ability to consider some parts of a rule as "variable" as in https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5 and not score-impacting if not present in a match (e.g. the opposite of a key phrase in a way).

@pombredanne
Copy link
Member

pombredanne commented Aug 14, 2021

@petergardfjall After thinking a bit more about this, I think that markup may be best and then what should be kept from parsing the markup would be the rule word positions for the key sentences (typically a list of integers that we handle usually as a Span backed by an integer bit set using intbitset. Checking the presence of theses positions in the actually matched positions of a LicenseMatch (which is a Span too, see https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/match.py#L137) would be as simple as a Span containment check available here https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/spans.py#L177

This would be a bit slower at indexing time (since every rule would need to be parsed in a more sophisticated way) but super simple code for the match filtering at runtime, literally something like

for match in matches:
    if (
        matched_rule.key_phrase_spans
        and not all(key_phrase_span in match.qspan for key_phrase_span in match.matched_rule.key_phrase_spans)
    ):
        ... discard this match

@petergardfjall
Copy link
Contributor Author

Thanks for your thoughtful response @pombredanne!
Glad to hear that you sound positive to the proposal. Your guidance towards an implementation is most appreciated. I'll need to see if I can get some time to work on this (I take it that you would prefer to see a PR rather than implementing this yourself).

@pombredanne
Copy link
Member

Glad to hear that you sound positive to the proposal. Your guidance towards an implementation is most appreciated. I'll need to see if I can get some time to work on this (I take it that you would prefer to see a PR rather than implementing this yourself).

If you can chip in to help that would be much appreciated. In any case this is a good proposal.

@pombredanne
Copy link
Member

After a review, there is a surprising large number of detection issues that would be solved with this proposal! Also I reckon this is closely related to #1838 from @furuholm and we may merge these in one.

@pombredanne
Copy link
Member

Another one #2605 by @soimkim

@petergardfjall
Copy link
Contributor Author

After a review, there is a surprising large number of detection issues that would be solved with this proposal!

Nicely researched! Good to hear that the proposal has potential to resolve a lot of false positive scenarios. I'm currently burdened with a lot of other work but I will try to come back to this when time permits.

pombredanne added a commit that referenced this issue Sep 14, 2021
The difference between GPL 2 and GPL 3 may be only one digit and this
is what this rule and minimum-coverage update attempts to deal with
in a specific case. Eventually teh solution is to implement #2637

Reported-by: John Horan <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
This was referenced Nov 17, 2021
@pombredanne
Copy link
Member

For reference this is being worked on at https://github.com/softsense/scancode-toolkit/pull/1

@pombredanne
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants