allow license rules to require the presence of certain defining keywords #2637

petergardfjall · 2021-08-13T11:11:40Z

Short Description

In many license expression rules, certain words or phrases carry much higher significance than others. For example, "This software is distributed under a" seems a lot less important than "MIT License" for a rule that describes an MIT license notice. I think it would be valuable if these "defining words" could optionally be marked and, if declared, would prevent matches from happening unless those defining phrases are present in the matched text.

Possible Labels

new feature
license scan

Select Category

Describe the Update

This is a feature proposal that I haven't been able to find any discussion of when searching through the issues (it's quite possible that I have just missed it!). Anyhow, I think it would be really valuable if a license expression rule could mark certain "defining words" (or phrases) as required to be present in the scanned text in order for a match to be reported.

As an example, I've seen quite a lot of false positives that I believe could be eliminated if these crucial phrases were enforced. For example, the following text

## License

This SDK is distributed under the Apache License, Version 2.0, see LICENSE for more information.

was matched by mit_923.RULE:

License
Distributed under the MIT License. See LICENSE for more information.

even though it's pretty clear to a human reader that certain crucial parts of this rule don't match. That is to say, some words in the rule are more significant than others, and are essentially defining for the license expression. In this case, of course, MIT is that defining phrase.

As can be clearly seen from a scan match:

      "matched_rule": {
        "identifier": "mit_923.RULE",
        ..
      },
      "matched_text": "License\n\n[This] [SDK] [is] distributed under the [Apache] License, [Version] [2].[0], see LICENSE for more information."

the defining aspect of the license expression (MIT) is missing from the match.

Now, I think it would be really cool, and I imagine it would reduce false positives by quite a significant number, if it was possible to mark these defining words/phrases (mandatory words, keywords, or whatever one might want to call them) in a rule.

How This Feature will help you/your organization

I believe it would reduce the number of false positives quite substantially although that remains to be seen.

Possible Solution/Implementation Details

Although I'm unsure of the feasibility of implementing such a solution, it would be nice if it was possible to highlight these words directly in the license RULE file, for example surrounded by some markup:

License
Distributed under the {{MIT}} License. See LICENSE for more information.

The semantics of this would be that the rule would never match (irrespective of score) if MIT wasn't present in the matched text.
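To make the proposed semantics concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea: the helper names (`split_markup`, `has_all_key_phrases`) are hypothetical and are not part of ScanCode, and the word-level comparison is an assumption about how "present in the matched text" could be decided.

```python
import re

# Hypothetical helpers for illustration only; not ScanCode's API.
KEY_PHRASE = re.compile(r"\{\{(.+?)\}\}")


def split_markup(rule_text):
    """Return (plain_text, key_phrases) for a rule text using {{...}} markup."""
    key_phrases = [phrase.strip() for phrase in KEY_PHRASE.findall(rule_text)]
    plain_text = KEY_PHRASE.sub(lambda m: m.group(1).strip(), rule_text)
    return plain_text, key_phrases


def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text_words, phrase_words):
    """Return True if phrase_words occurs as a contiguous run in text_words."""
    n = len(phrase_words)
    return any(
        text_words[i:i + n] == phrase_words
        for i in range(len(text_words) - n + 1)
    )


def has_all_key_phrases(matched_text, key_phrases):
    """Return True only if every key phrase is present in the matched text."""
    text_words = words(matched_text)
    return all(contains_phrase(text_words, words(p)) for p in key_phrases)
```

With the rule above, `split_markup` would yield the key phrase `MIT`, and the Apache notice from the false positive would fail `has_all_key_phrases`, so the match would be dropped regardless of its score.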

I suppose an alternative realization, although not as appealing, would be to include these mandatory keywords/phrases as an attribute in the .yml file:

license_expression: mit
is_license_notice: yes
relevance: 100
referenced_filenames:
    - LICENSE
key_phrases:
    - MIT

or something to that effect.

Example/Links if Any

Can you help with this Feature

I have spent zero time in the codebase, but if given a proper introduction I suppose I might be able to help out.


pombredanne · 2021-08-13T21:55:28Z

@petergardfjall This makes sense, but within reason! Why? Early versions of scancode used a {{ markup }} to tag parts that could possibly not be matched... which is related but different from key phrases, and that markup was abandoned as being too complex and too inefficient with the engine used at that time (see https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/textcode/analysis.py for the code that tokenized the rule text, and examples of rules with {{ template }} markup at https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_2.RULE#L13 and https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5 or https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_14.RULE#L11 )

The syntax was more or less {{<number> some text that was ignored}} where the number meant that this number of words could be skipped before deciding to give up on strings alignment.
https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5
This was dropped as it proved hard to maintain and was making the code super complex and slow.
But that was using an entirely different license matching engine back then.

That said, requiring the presence of some key, essential words or phrases would make a lot of sense and would be reasonably easy to get working. It might cause some licenses not to be detected at all in some odd cases such as the one you fixed with #2636, but these will be caught alright by @akugarg's new feature for unknown unknown license detection, so there is no harm I can foresee.

I would go about it this way:

During indexing in licensedcode/index.py, I would keep a new data structure in each Rule that would hold the [**sequences** of token ids that must be present in the match], populated somewhere around here https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/index.py#L346
This is an optimization to avoid doing this at match filtering time.
In licensedcode/match.py, I would add a new filtering function that would receive matches and remove those missing their rule key phrases in their matched tokens... something similar to https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/match.py#L1283 ... this is likely going to be the code that requires the most thinking. Then invoke it here https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/match.py#L1430 early in the filters chain.
Add some unit and integration tests. Done :)
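The steps above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual licensedcode code: the `Rule` and `Match` classes, the `key_phrase_token_ids` attribute, the whitespace tokenizer and the `-1` id for unknown tokens are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    identifier: str
    key_phrases: list                     # key phrases as plain strings
    key_phrase_token_ids: list = field(default_factory=list)


@dataclass
class Match:
    rule: Rule
    matched_token_ids: list               # ids of the matched query tokens


def tokenize(text):
    # Simplistic tokenizer, assumed for this sketch only.
    return text.lower().split()


def token_ids(text, dictionary):
    # Tokens unknown to the index are given the id -1.
    return [dictionary.get(tok, -1) for tok in tokenize(text)]


def index_rule(rule, dictionary):
    """At indexing time, store one sequence of token ids per key phrase."""
    rule.key_phrase_token_ids = [
        [dictionary.setdefault(tok, len(dictionary)) for tok in tokenize(phrase)]
        for phrase in rule.key_phrases
    ]


def contains_sequence(ids, sequence):
    """Return True if sequence occurs as a contiguous run in ids."""
    n = len(sequence)
    return any(ids[i:i + n] == sequence for i in range(len(ids) - n + 1))


def filter_matches_missing_key_phrases(matches):
    """Split matches into (kept, discarded) based on their rule key phrases."""
    kept, discarded = [], []
    for match in matches:
        required = match.rule.key_phrase_token_ids
        if all(contains_sequence(match.matched_token_ids, seq) for seq in required):
            kept.append(match)
        else:
            discarded.append(match)
    return kept, discarded
```

Precomputing the id sequences once per rule keeps the filter itself to a plain sequence search, which is the point of doing the work at indexing time rather than at match filtering time.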

Now we could use either markup or an explicit data attribute in the YAML data file. Either can work. Using an explicit data attribute may be much simpler to process than markup yet may not be as appealing, as you wrote, in particular because it could not take into account the relative positions of these phrases... But markup would still be much more involved IMHO.

BTW, a closely related feature could be to reinstate the ability to treat some parts of a rule as "variable", as in https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5 , and not score-impacting if not present in a match (i.e. the opposite of a key phrase in a way).

pombredanne · 2021-08-14T09:18:32Z

@petergardfjall After thinking a bit more about this, I think that markup may be best, and what should be kept from parsing the markup would be the rule word positions for the key sentences (typically a list of integers that we usually handle as a Span backed by an integer bit set using intbitset). Checking the presence of these positions in the actually matched positions of a LicenseMatch (which is a Span too, see https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/match.py#L137) would be as simple as a Span containment check available here https://github.com/nexB/scancode-toolkit/blob/6bf2fd98653f376779ddfa71b567f92366bb4277/src/licensedcode/spans.py#L177

This would be a bit slower at indexing time (since every rule would need to be parsed in a more sophisticated way) but super simple code for the match filtering at runtime, literally something like

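kept_matches = []
for match in matches:
    key_phrase_spans = match.matched_rule.key_phrase_spans
    # discard this match if any key phrase span is missing from the matched positions
    if key_phrase_spans and not all(
        key_phrase_span in match.qspan for key_phrase_span in key_phrase_spans
    ):
        continue
    kept_matches.append(match)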
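As a rough, self-contained illustration of the position-based approach, the sketch below records the word positions covered by each {{...}} marker and then checks containment. Plain Python sets stand in for the intbitset-backed Span here, and the function names and tokenization are assumptions made for the example, not the real licensedcode code.

```python
import re

# Matches an opening marker, a closing marker, or a single word.
TOKEN = re.compile(r"\{\{|\}\}|[A-Za-z0-9]+")


def key_phrase_positions(rule_text):
    """
    Return (words, spans) where words is the list of rule words without the
    markup and spans holds one frozenset of word positions per key phrase.
    """
    words, spans = [], []
    current = None  # positions of the key phrase being read, if any
    for token in TOKEN.findall(rule_text):
        if token == "{{":
            current = set()
        elif token == "}}":
            if current:
                spans.append(frozenset(current))
            current = None
        else:
            if current is not None:
                current.add(len(words))
            words.append(token.lower())
    return words, spans


def has_key_phrases(matched_rule_positions, spans):
    """Return True if every key phrase span is contained in the matched positions."""
    matched = set(matched_rule_positions)
    return all(span <= matched for span in spans)
```

For the mit_923.RULE example, the only key phrase span would be the position of the word MIT in the rule, and a match covering every other rule word but not that one would be discarded.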

petergardfjall · 2021-08-16T06:04:38Z

Thanks for your thoughtful response @pombredanne!
Glad to hear that you sound positive to the proposal. Your guidance towards an implementation is most appreciated. I'll need to see if I can get some time to work on this (I take it that you would prefer to see a PR rather than implementing this yourself).

pombredanne · 2021-08-18T12:30:54Z

> Glad to hear that you sound positive to the proposal. Your guidance towards an implementation is most appreciated. I'll need to see if I can get some time to work on this (I take it that you would prefer to see a PR rather than implementing this yourself).

If you can chip in to help that would be much appreciated. In any case this is a good proposal.

pombredanne · 2021-08-19T16:04:26Z

Some issues that would benefit from this approach:

pombredanne · 2021-08-20T08:25:34Z

After a review, there is a surprisingly large number of detection issues that would be solved with this proposal! Also, I reckon this is closely related to #1838 from @furuholm, and we may merge these into one.

pombredanne · 2021-08-20T08:37:48Z

Another one #2605 by @soimkim

petergardfjall · 2021-08-20T16:19:08Z

> After a review, there is a surprisingly large number of detection issues that would be solved with this proposal!

Nicely researched! Good to hear that the proposal has potential to resolve a lot of false positive scenarios. I'm currently burdened with a lot of other work but I will try to come back to this when time permits.

The difference between GPL 2 and GPL 3 may be only one digit, and this is what this rule and minimum-coverage update attempts to deal with in a specific case. Eventually the solution is to implement #2637
Reported-by: John Horan <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne · 2021-11-26T09:14:41Z

For reference this is being worked on at https://github.com/softsense/scancode-toolkit/pull/1

pombredanne · 2021-11-26T09:15:01Z

And https://github.com/softsense/scancode-toolkit/pull/2 by @dd-jy

petergardfjall added the new feature label Aug 13, 2021

pombredanne mentioned this issue Aug 16, 2021

Apparent MIT license gets discovered as Apache-2.0 #2635

Open

pombredanne added improve-license-detection license scan labels Aug 20, 2021

This was referenced Aug 20, 2021

License is detected by "bootloader-exception" for "BSD-3-Clause" file. #2605

Closed

test licensedcode/data/rules/openldap-2.7.RULE contains spurious word "date" in header #2633

Open

This was referenced Aug 23, 2021

Duplicated record [potential filter for low score] for license_expression #2139

Open

False positive: MIT and Boost, should be MIT #2675

Closed

This was referenced Nov 17, 2021

GPL-2.0 false alarm #2757

Open

LGPL 2.1 detected as LGPL 3 #2760

Open

mrombout mentioned this issue Nov 30, 2021

Allow license rules to require the presence of certain defining keywords #2773

Merged

4 tasks

pombredanne closed this as completed in #2773 Dec 25, 2021

pombredanne mentioned this issue Jan 4, 2023

Proposal for avoiding false positives #1838

Closed

AyanSinhaMahapatra mentioned this issue Feb 15, 2023

Add required phrase rules automatically #3254

Closed

4 tasks

AyanSinhaMahapatra mentioned this issue Aug 27, 2023

MISDETECTION: AGPL detected when it isn't there #3498

Closed

AyanSinhaMahapatra mentioned this issue Sep 15, 2023

Wrong license detection in oauthlib #3512

Closed

AyanSinhaMahapatra mentioned this issue Sep 17, 2024

Update rules with required phrases automatically #3924

Open

4 tasks
