DLP: Added sample for inspect string with custom regex #3107

Merged
145 changes: 145 additions & 0 deletions dlp/inspectWithCustomRegex.js
@@ -0,0 +1,145 @@
// Copyright 2023 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

'use strict';

// sample-metadata:
// title: Inspects strings
// description: Inspects a string using custom regex pattern.
// usage: node inspectWithCustomRegex.js my-project string minLikelihood maxFindings infoTypes customInfoTypes includeQuote


Can we trim this down?

  • string: required
  • minLikelihood: we should be able to remove this, because the likelihood for a custom infoType (which is how regex is used) is controlled by the request anyway.
  • maxFindings: does not make sense in the context of this sample. We're inspecting a small-ish string, not megabytes of file content, so there will only be so many findings. If you want to limit it anyway, just hardcode a limit of ~1000 in the sample. Users can change it if they need to.
  • infoTypes: can be omitted, since we're demonstrating regex matching only. It's possible we may want to show side-by-side detection of custom and built-in infoTypes, but if so, move this to the end and make it optional. (In that case, let's also make the example string actually demonstrate that.)
  • customInfoTypes: since the whole point of this sample is to demonstrate regex, we should ask for the regex directly and construct the custom infoType in code.
  • includeQuote: as with maxFindings, let's just set this to true for demo purposes. If users want to change it, they can edit the code.

At a high level, we should make the sample as easy as possible to run. Adding a lot of parameters and using obscure syntax (such as the ',' and regex/dict hybrid for customInfoTypes) will lead to confusion and frustration.

As a user of this sample, I should be able to say node inspectWithCustomRegex.js 'this is my serial number aab-bcdd-eef' '[a-f\-]10' and see results.
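
The trimmed-down shape suggested above could be sketched roughly as follows. This is only an illustration, not the code that was merged: the helper name `buildInspectRequest`, the infoType name `CUSTOM_REGEX_0`, the hardcoded cap of 1000 findings, the plain `item.value` string, and the use of the string enum name `'POSSIBLE'` are all assumptions made for this sketch.

```javascript
// Sketch only: builds an inspectContent request from a string and a regex.
// Everything besides the request field names is an assumption for illustration.
function buildInspectRequest(projectId, string, customRegex) {
  // Construct the custom infoType in code from the regex given on the CLI.
  const customInfoTypes = [
    {
      infoType: {name: 'CUSTOM_REGEX_0'}, // hypothetical name
      regex: {pattern: customRegex},
      likelihood: 'POSSIBLE', // assumes string enum names are accepted
    },
  ];

  return {
    parent: `projects/${projectId}/locations/global`,
    inspectConfig: {
      customInfoTypes: customInfoTypes,
      includeQuote: true, // hardcoded for demo purposes
      limits: {maxFindingsPerRequest: 1000}, // hardcoded cap, per the review
    },
    item: {value: string},
  };
}

const request = buildInspectRequest(
  'my-project',
  'Patients MRN 444-5-22222',
  '[1-9]{3}-[1-9]{1}-[1-9]{5}'
);
console.log(JSON.stringify(request, null, 2));
```

The resulting object would then be passed to `dlp.inspectContent(request)`, as the sample in this diff already does.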

Contributor Author


@soumya92 I had the same thought when I started the implementation, but I noticed that all the inspect samples follow a particular structure. I couldn't figure out the exact reason, but it seemed to be mostly about keeping the sample code consistent. In any case, your suggestions look reasonable, so I have updated this sample. Would it be okay if I made the same changes in my other PRs?


Yeah, go for it! The easier we make our samples to use, the better.


function main(
projectId,
string,
minLikelihood,
maxFindings,
infoTypes,
customInfoTypes,
includeQuote
) {
[infoTypes, customInfoTypes] = transformCLI(infoTypes, customInfoTypes);
// [START dlp_inspect_custom_regex]
// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The project ID to run the API call under
// const projectId = 'my-project';

// The string to inspect
// const string = 'Patients MRN 444-5-22222';

// The minimum likelihood required before returning a match
// const minLikelihood = DLP.protos.google.privacy.dlp.v2.Likelihood.POSSIBLE;

// The maximum number of findings to report per request (0 = server maximum)
// const maxFindings = 0;

// The infoTypes of information to match
// See https://cloud.google.com/dlp/docs/concepts-infotypes for more information
// about supported infoTypes.
// const infoTypes = [{ name: 'EMAIL_ADDRESS' }];

// The customInfoTypes of information to match
// const customInfoTypes = [{ infoType: { name: 'DICT_TYPE' }, dictionary: { wordList: { words: ['foo', 'bar', 'baz']}}},
// { infoType: { name: 'REGEX_TYPE' }, regex: {pattern: '\\(\\d{3}\\) \\d{3}-\\d{4}'}}];

// Whether to include the matching string
// const includeQuote = true;

async function inspectWithCustomRegex() {
// Construct item to inspect
const item = {
byteItem: {
type: DLP.protos.google.privacy.dlp.v2.ByteContentItem.BytesType
.TEXT_UTF8,
data: Buffer.from(string, 'utf-8'),
},
};

// Assigns likelihood to each match
customInfoTypes = customInfoTypes.map(customInfoType => {
customInfoType.likelihood =
DLP.protos.google.privacy.dlp.v2.Likelihood.POSSIBLE;
return customInfoType;
});

// Construct request
const request = {
parent: `projects/${projectId}/locations/global`,
inspectConfig: {
infoTypes: infoTypes,
customInfoTypes: customInfoTypes,
minLikelihood: minLikelihood,
includeQuote: includeQuote,
limits: {
maxFindingsPerRequest: maxFindings,
},
},
item: item,
};

// Run request
const [response] = await dlp.inspectContent(request);
const findings = response.result.findings;
if (findings.length > 0) {
console.log('Findings:');
findings.forEach(finding => {
if (includeQuote) {
console.log(`\tQuote: ${finding.quote}`);
}
console.log(`\tInfo type: ${finding.infoType.name}`);
console.log(`\tLikelihood: ${finding.likelihood}`);
});
} else {
console.log('No findings.');
}
}
inspectWithCustomRegex();
// [END dlp_inspect_custom_regex]
}

main(...process.argv.slice(2));


This should be after process.on to avoid missing synchronous promise rejections.

Contributor Author


Is there any use case where this can happen? I have updated the code as you suggested, but during testing I got the same results as before.


I don't think it happens right now, but I could imagine in the future there might be synchronous validation of requests (e.g. bad enum values might throw before even making a network request).
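
The ordering discussed in this thread can be sketched as below: the handler is registered first, and `main` is invoked only afterwards. The `main` shown here is a hypothetical stand-in that does not call the DLP API; it exists only to make the ordering visible.

```javascript
// Register the handler first so it is already in place when main() runs.
process.on('unhandledRejection', err => {
  console.error(err.message);
  process.exitCode = 1;
});

// Hypothetical stand-in for the sample's main(); it makes no API calls.
async function main(...args) {
  console.log(`main() called with ${args.length} argument(s)`);
}

// Invoke main() only after the handler has been registered.
main(...process.argv.slice(2));
```

As noted in the thread, the two orderings may behave identically today; placing the handler first is a defensive choice.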

process.on('unhandledRejection', err => {
console.error(err.message);
process.exitCode = 1;
});

function transformCLI(infoTypes, customInfoTypes) {
infoTypes = infoTypes
? infoTypes.split(',').map(type => {
return {name: type};
})
: undefined;

if (customInfoTypes) {
customInfoTypes = customInfoTypes.includes(',')
? customInfoTypes.split(',').map((dict, idx) => {
return {
infoType: {name: 'CUSTOM_DICT_'.concat(idx.toString())},
dictionary: {wordList: {words: dict.split(',')}},
};
})
: customInfoTypes.split(',').map((rgx, idx) => {
return {
infoType: {name: 'CUSTOM_REGEX_'.concat(idx.toString())},
regex: {pattern: rgx},
};
});
}

return [infoTypes, customInfoTypes];
}
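
To illustrate why the comma-based syntax was flagged as confusing in review, the customInfoTypes branch of `transformCLI` above can be isolated and run on its own. The wrapper name `parseCustomInfoTypes` is hypothetical; the branching logic is copied from the diff as written. Any input containing a comma is treated as dictionary entries, so a regex that itself contains a comma (for example a `{1,3}` quantifier) is split apart rather than being used as a pattern.

```javascript
// Copy of the customInfoTypes branch of transformCLI from the diff above,
// isolated so its behavior can be checked without the DLP client.
// The function name is hypothetical.
function parseCustomInfoTypes(customInfoTypes) {
  return customInfoTypes.includes(',')
    ? customInfoTypes.split(',').map((dict, idx) => ({
        infoType: {name: 'CUSTOM_DICT_'.concat(idx.toString())},
        dictionary: {wordList: {words: dict.split(',')}},
      }))
    : customInfoTypes.split(',').map((rgx, idx) => ({
        infoType: {name: 'CUSTOM_REGEX_'.concat(idx.toString())},
        regex: {pattern: rgx},
      }));
}

// No comma: becomes a single regex infoType named CUSTOM_REGEX_0.
const plainRegex = parseCustomInfoTypes('[1-9]{3}-[1-9]{1}-[1-9]{5}');
console.log(plainRegex[0].infoType.name);

// Comma inside a quantifier: becomes two one-word dictionaries instead.
const regexWithComma = parseCustomInfoTypes('[0-9]{1,3}');
console.log(regexWithComma.map(t => t.dictionary.wordList.words));
```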
34 changes: 34 additions & 0 deletions dlp/system-test/inspect.test.js
@@ -309,4 +309,38 @@ describe('inspect', () => {
assert.notMatch(outputB, /EMAIL_ADDRESS/);
assert.match(outputB, /PHONE_NUMBER/);
});

// dlp_inspect_custom_regex
it('should inspect a string using custom regex', () => {
const string = 'Patients MRN 444-5-22222';
const custRegex = '[1-9]{3}-[1-9]{1}-[1-9]{5}';
const output = execSync(
`node inspectWithCustomRegex.js ${projectId} "${string}" UNLIKELY 0 "" ${custRegex} true`
);
assert.match(output, /Info type: CUSTOM_REGEX_0/);
assert.match(output, /Likelihood: POSSIBLE/);
});

it('should handle string with no match', () => {
const string = 'Patients MRN 444-5-22222';
const custRegex = '[1-9]{3}-[1-9]{2}-[1-9]{5}';
const output = execSync(
`node inspectWithCustomRegex.js ${projectId} "${string}" UNLIKELY 0 "" ${custRegex} true`
);
assert.include(output, 'No findings');
});

it('should report any errors while inspecting a string', () => {
let output;
const string = 'Patients MRN 444-5-22222';
const custRegex = '[1-9]{3}-[1-9]{2}-[1-9]{5}';
try {
output = execSync(
`node inspectWithCustomRegex.js ${projectId} "${string}" UNLIKELY 0 BAD_TYPE ${custRegex} true`
);
} catch (err) {
output = err.message;
}
assert.include(output, 'INVALID_ARGUMENT');
});
});
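
As a quick local sanity check of the two patterns used in these tests, they can be run against the test string with JavaScript's built-in `RegExp`. This assumes JavaScript's regex semantics coincide with the DLP service's for patterns this simple; it does not call the API and is not part of the PR.

```javascript
// Local check of the test patterns using JavaScript's RegExp (no API call).
const string = 'Patients MRN 444-5-22222';

// Pattern from the first test: a single digit in the middle group.
const matching = new RegExp('[1-9]{3}-[1-9]{1}-[1-9]{5}');
// Pattern from the "no match" test: two digits required in the middle group.
const nonMatching = new RegExp('[1-9]{3}-[1-9]{2}-[1-9]{5}');

const match = string.match(matching);
console.log(match ? match[0] : null); // 444-5-22222
console.log(nonMatching.test(string)); // false
```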