allow license rules to require the presence of certain defining keywords #2637
Comments
@petergardfjall This makes sense, but within reason! Why? Early versions of scancode used a {{ markup }} to tag parts that could possibly not be matched, which is related to but different from key phrases. That markup was abandoned as being too complex and too inefficient with the engine used at that time. See https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/textcode/analysis.py for the code that tokenized the rule text, and examples of rules with {{ template }} markup at https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_2.RULE#L13 and https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5 or https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_14.RULE#L11
The syntax was more or less
That said, requiring the presence of some key, essential words or phrases would make a lot of sense and would be reasonably easy to get working. It might cause some licenses not to be detected at all in odd cases such as the one you fixed with #2636, but these will be caught by @akugarg's new feature for unknown license detection, so there is no harm I can foresee. I would go about it this way:
Now we could use either markup or an explicit data attribute in the YAML data file. Either can work. Using an explicit data attribute may be much simpler to process than markup yet may be
BTW, a closely related feature could be to reinstate the ability to consider some parts of a rule as "variable", as in https://github.com/nexB/scancode-toolkit/blob/v1.0.0/src/licensedcode/data/rules/bsd-new_3.RULE#L5, and not score-impacting if not present in a match (i.e. the opposite of a key phrase, in a way).
@petergardfjall After thinking a bit more about this, I think that markup may be best, and then what should be kept from parsing the markup would be the rule word positions for the key sentences (typically a list of integers that we handle usually as a
This would be a bit slower at indexing time (since every rule would need to be parsed in a more sophisticated way) but the code for the match filtering at runtime would be super simple, literally something like:

    for match in matches:
        key_phrase_spans = match.matched_rule.key_phrase_spans
        if key_phrase_spans and not all(
            key_phrase_span in match.qspan for key_phrase_span in key_phrase_spans
        ):
            ...  # discard this match
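To make the filtering idea above concrete, here is a minimal, self-contained sketch. It does not use the real ScanCode classes: `Rule`, `Match`, `matched_positions` and `filter_key_phrase_matches` are hypothetical stand-ins, and each key phrase is assumed to be stored as a plain set of rule word positions.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    # Hypothetical stand-in for a rule; each key phrase is a set of
    # rule word positions that must all be present in a match.
    identifier: str
    key_phrase_spans: list = field(default_factory=list)


@dataclass
class Match:
    # Hypothetical stand-in for a match; matched_positions holds the
    # rule word positions that were actually found in the scanned text.
    matched_rule: Rule
    matched_positions: set


def filter_key_phrase_matches(matches):
    """Keep only the matches that contain every key phrase of their rule."""
    kept = []
    for match in matches:
        spans = match.matched_rule.key_phrase_spans
        # A rule without key phrases is never filtered out.
        if spans and not all(span <= match.matched_positions for span in spans):
            continue  # discard this match
        kept.append(match)
    return kept


# Assumed example: words 6 and 7 of the rule form its only key phrase.
rule = Rule("mit_example", key_phrase_spans=[{6, 7}])
full = Match(rule, matched_positions=set(range(8)))
partial = Match(rule, matched_positions=set(range(6)))
kept = filter_key_phrase_matches([full, partial])
```

In this sketch `kept` holds only the `full` match, since `partial` lacks positions 6 and 7. Rules that declare no key phrases pass through unchanged, which would keep the feature optional.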
Thanks for your thoughtful response @pombredanne!
If you can chip in to help that would be much appreciated. In any case this is a good proposal.
Nicely researched! Good to hear that the proposal has potential to resolve a lot of false positive scenarios. I'm currently burdened with a lot of other work but I will try to come back to this when time permits.
The difference between GPL 2 and GPL 3 may be only one digit, and this is what this rule and minimum-coverage update attempts to deal with in a specific case. Eventually the solution is to implement #2637
Reported-by: John Horan <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
For reference, this is being worked on at https://github.com/softsense/scancode-toolkit/pull/1
Short Description
In many license expression rules, certain words or phrases carry a much higher significance than others. For example, "This software is distributed under a" seems a lot less important than "MIT License" for a rule that describes an MIT license notice. I think it would be valuable if these "defining words" could optionally be marked and, if declared, would prevent matches from happening unless those defining phrases are present in the matched text.
Possible Labels
Select Category
Describe the Update
This is a feature proposal which I haven't been able to find any discussion of when searching through the issues (it's quite possible that I have just missed it!). Anyhow, I think it would be really valuable if it were possible in a license expression rule to mark certain "defining words" (or phrases) as required to be present in the scanned text in order for a match to be reported.
As an example, I've seen quite a lot of false positives that I believe could be eliminated if these crucial phrases were enforced. For instance, the following text
was matched by mit_923.RULE:
even though it's pretty clear to a human reader that certain crucial aspects of this rule don't match. That is to say, some words in the rule are more significant than others, and are essentially defining for the license expression. In this case, of course, it is MIT that would be that defining phrase.
As can be clearly seen from a scan match:
the defining aspect of the license expression (MIT) is missing from the match.
Now, I think it would be really cool and, I imagine, would reduce false positives by quite a significant number, if it were possible to mark these defining words/phrases (mandatory words, keywords, or whatever one might want to call them) in a rule.
How This Feature will help you/your organization
I believe it would reduce the number of false positives quite substantially although that remains to be seen.
Possible Solution/Implementation Details
Although I'm unsure of the feasibility of implementing such a solution, it would be nice if it were possible to highlight these words directly in the license RULE file, for example surrounded by some markup:
The semantics of this would be that the rule would never match (irrespective of score) if MIT wasn't present in the matched text.
I suppose an alternative realization, although not as appealing, would be to include these mandatory keywords/phrases as an attribute in the .yml file:
or something to that effect.
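As a rough illustration of the markup variant described above, the sketch below extracts key phrase word positions from a rule text. The `{{ ... }}` delimiters are an assumption borrowed from the template markup of early scancode versions mentioned in the comments, and `parse_key_phrases` is a hypothetical helper, not an existing ScanCode function.

```python
import re


def parse_key_phrases(rule_text):
    """
    Return a (words, key_phrase_spans) tuple for a rule text.

    words is the list of lowercased rule words with the markup removed.
    key_phrase_spans is a list of sets of word positions, one set for
    each phrase wrapped in the assumed {{ ... }} markup.
    """
    words = []
    spans = []
    current = None  # positions of the key phrase being read, if any
    for token in re.findall(r"\{\{|\}\}|\w+", rule_text):
        if token == "{{":
            current = set()
        elif token == "}}":
            if current:
                spans.append(current)
            current = None
        else:
            if current is not None:
                current.add(len(words))
            words.append(token.lower())
    return words, spans


words, spans = parse_key_phrases(
    "This software is distributed under a {{MIT License}}"
)
```

With this input, the last two words of the rule form its only key phrase. A matcher could then refuse any match that does not cover those word positions, whatever its score.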
Example/Links if Any
Can you help with this Feature
I have spent zero time in the codebase, but if given a proper introduction I suppose I might be able to help out.