Issue #89 Base64 and hexadecimal entropy score override #223

namithasind · 2021-10-04T22:19:43Z

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

Tests (if applicable)
Documentation (if applicable)
Changelog entry
A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

Yes (backward compatible)
No (breaking changes)

Issue Linking

#89

What's new?

Added 2 new options -b64/--b64-entropy-score and -hex/--hex-entropy-score which can be used to override the hard coded base64 and hexadecimal entropy scores
Modified the unit tests
Updated Readme with the new options and descriptions

…ore override Updated readme

rbailey-godaddy

LGTM

rbailey-godaddy · 2021-10-05T13:16:45Z

I'm curious -- what's the use case for this enhancement?

I've always treated the original (now default) values as more or less magic, although thinking about it I guess I can see that a hex digit 0-f has 16 unique values and represents 4 bits of data. And base64 represents 24 bits of input in 4 encoded output characters, so each character we see represents 6 bits of data. The "magic" part to me is "why do we use thresholds that are 75% of the raw bit rate?" Is this "if the data looks at least 75% random it's a problem?"

That's the lead-up to wondering why somebody would want to override these numbers and how they would make intelligent choices and what the implications of those choices would be. It also makes me think (although not strongly enough to withdraw my approval) that these are the wrong knobs for this problem space.

The potential entropy of a hex digit string or a base64 character string is just math (see above) -- and changing those properties makes about as much sense as trying to legislate the value of pi. ;-) Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.

This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.

rbailey-godaddy · 2021-10-05T13:34:34Z

I'm curious -- what's the use case for this enhancement?

I've always treated the original (now default) values as more or less magic, although thinking about it I guess I can see that a hex digit 0-f has 16 unique values and represents 4 bits of data. And base64 represents 24 bits of input in 4 encoded output characters, so each character we see represents 6 bits of data. The "magic" part to me is "why do we use thresholds that are 75% of the raw bit rate?" Is this "if the data looks at least 75% random it's a problem?"

That's the lead-up to wondering why somebody would want to override these numbers and how they would make intelligent choices and what the implications of those choices would be. It also makes me think (although not strongly enough to withdraw my approval) that these are the wrong knobs for this problem space.

The potential entropy of a hex digit string or a base64 character string is just math (see above) -- and changing those properties makes about as much sense as trying to legislate the value of pi. ;-) Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.

This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.

FYI. I edited this after sending, and I think subscribers only saw the original version. So I'm spamming you all again. Apologies.

rbailey-godaddy · 2021-10-05T14:06:52Z

Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.
This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.

It occurs to me that "sensitivity" is a bad name, since I think higher numbers are less sensitive than lower numbers, which is counterintuitive. My interpretation is that this number (0-100) is the probability that the string in question is totally random (aka encrypted, at a fuzzy level), and therefore higher numbers (above 75) make tartufo more reluctant to report a given string, while lower numbers make tartufo more likely to report that string.

BTW, one reason this is nicer than tweaking scores -- it's easier to reason about. The range is 0-100 because it's a probability, and you can't talk about probabilities greater than 1.0 (aka 100%) or less than 0.0 (aka 0%). I still have a curious urge to understand where the original 75% came from, but both trufflehog and the original blog it references seem silent on the subject.

namithasind · 2021-10-05T16:21:36Z

I'm curious -- what's the use case for this enhancement?

I've always treated the original (now default) values as more or less magic, although thinking about it I guess I can see that a hex digit 0-f has 16 unique values and represents 4 bits of data. And base64 represents 24 bits of input in 4 encoded output characters, so each character we see represents 6 bits of data. The "magic" part to me is "why do we use thresholds that are 75% of the raw bit rate?" Is this "if the data looks at least 75% random it's a problem?"

That's the lead-up to wondering why somebody would want to override these numbers and how they would make intelligent choices and what the implications of those choices would be. It also makes me think (although not strongly enough to withdraw my approval) that these are the wrong knobs for this problem space.

The potential entropy of a hex digit string or a base64 character string is just math (see above) -- and changing those properties makes about as much sense as trying to legislate the value of pi. ;-) Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.

This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.

We have repositories that have secrets that needs to be moved out of GitHub before we migrated it to GHEC. Tartufo could not catch all of the secrets. For example we had a password like password: <randomAlphanumericStringOfLength16> which tartufo could not catch. I added Regex to catch these but I came across a few more Regex couldn't catch. I was also pointed to the code that sets the string length to 20. I was playing around with this value to see if that can be made configurable. I was not able to catch secret strings of lower length just by altering it. Then I realized that might be because shorter strings have lower entropy value. That's when I started looking into making entropy score configurable and came across the issue already reported in the repo. This feature is useful when you want to catch secrets like the one I mentioned above and Regex isn't able to catch it.
I agree that it is not possible to guess the entropy values that will work, right away. Need to play around the default to see if the results are more effective. A downside for altering entropy value to a lower value would be that you have more number of false positives that are reported by tartufo. But I think this feature will enable user to do a more thorough scan.

tarkatronic

I do really like the idea that @rbailey-godaddy presented of working with a percentage, rather than a mysterious number, and I would love to see that enhancement eventually. That said, this should be a very useful change for experimentation and learning more about effective values! Great work on this @namithasind!

Issue godaddy#89 Initial commit for base64 and hexadecimal entropy sc…

001c2c3

…ore override Updated readme

namithasind requested a review from a team as a code owner October 4, 2021 22:19

namithasind requested review from beardedeagle, jmink-godaddy and rbailey-godaddy October 4, 2021 22:19

rbailey-godaddy approved these changes Oct 5, 2021

View reviewed changes

tarkatronic approved these changes Oct 18, 2021

View reviewed changes

Merge branch 'main' into Issue#89-b64-hex-score-override

8cde83d

tarkatronic merged commit 53f544f into godaddy:main Oct 18, 2021

tarkatronic linked an issue Oct 19, 2021 that may be closed by this pull request

Allow Entropy b64/hex Score Override #89

Closed

tarkatronic mentioned this pull request Oct 19, 2021

Allow Entropy b64/hex Score Override #89

Closed

tarkatronic mentioned this pull request Nov 4, 2021

Change --b64-entropy-score and --hex-entropy-score to --entropy-sensitivity #265

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue #89 Base64 and hexadecimal entropy score override #223

Issue #89 Base64 and hexadecimal entropy score override #223

namithasind commented Oct 4, 2021

rbailey-godaddy left a comment

rbailey-godaddy commented Oct 5, 2021 •

edited

Loading

rbailey-godaddy commented Oct 5, 2021

rbailey-godaddy commented Oct 5, 2021

namithasind commented Oct 5, 2021

tarkatronic left a comment

Issue #89 Base64 and hexadecimal entropy score override #223

Issue #89 Base64 and hexadecimal entropy score override #223

Conversation

namithasind commented Oct 4, 2021

PR Type

Backward Compatibility

Issue Linking

What's new?

rbailey-godaddy left a comment

Choose a reason for hiding this comment

rbailey-godaddy commented Oct 5, 2021 • edited Loading

rbailey-godaddy commented Oct 5, 2021

rbailey-godaddy commented Oct 5, 2021

namithasind commented Oct 5, 2021

tarkatronic left a comment

Choose a reason for hiding this comment

rbailey-godaddy commented Oct 5, 2021 •

edited

Loading