This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Privacy 2021 custom metrics #211

Merged
merged 26 commits into from
Jun 30, 2021

Conversation

max-ostapenko
Contributor

@max-ostapenko max-ostapenko commented Jun 20, 2021

Progress on HTTPArchive/almanac.httparchive.org#2149

@max-ostapenko
Contributor Author

@VictorLeP could you add to-do items to the description of this PR?
I've added some that you've completed, but how much is left?

@VictorLeP
Contributor

> @VictorLeP could you add to-do items to the description of this PR?
> I've added some that you've completed, but how much is left?

I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?

@VictorLeP
Contributor

> > @VictorLeP could you add to-do items to the description of this PR?
> > I've added some that you've completed, but how much is left?
>
> I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?

I'm guessing this is due to the same reason as me not being able to mark the pull request as ready for review: "Only those with write access to this repository can mark a draft pull request as ready for review."

@tunetheweb
Member

You should have full access now, @VictorLeP. Can you try again?

@VictorLeP
Contributor

Thanks, I don't see an edit option for #211 (comment) yet though. (Just to check I'm looking in the right place: I would expect an 'Edit' item in the menu under the three dots top right of that comment.)

@tunetheweb
Member

Ah looks like this repo is set up differently to the main Web Almanac one so you can’t edit each other’s comments unfortunately.

@VictorLeP
Contributor

OK, thanks for checking!

@VictorLeP
Contributor

@max-ostapenko These are the changes to the to-do list I think should be made:

  • IAB TCF v1: retrieve consent string -- done in 2770294
  • Ads Transparency Spotlight: traverse all frames -- I don't think this is possible? (we can't access 3rd party frames from the custom metrics JS)
  • add item for 'privacy policy keywords' -- done (reused from last year)
  • add item for 'sensitive resources' with subitems 'permissions policy', 'media devices' and 'geolocation' -- I believe you will still restructure these, then they will be complete?

@VictorLeP
Contributor

Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

@max-ostapenko
Contributor Author

> Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

I thought it would be enough to verify the presence of the ClientJS object?

@VictorLeP
Contributor

> > Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).
>
> I thought it would be enough to verify the presence of the ClientJS object?

Sure, but a custom metric is still needed for that if it's not added to Wappalyzer :)

@tunetheweb
Member

You can open a PR with Wappalyzer if you know how to detect it. And that way everyone gets to benefit from it and not just the Web Almanac.

Here's the PR for FingerprintJS, for example: https://github.com/AliasIO/wappalyzer/pull/1790 and a follow-up one to improve on it: https://github.com/AliasIO/wappalyzer/pull/3503

@rviscomi
Member

Be aware that if you're relying on the detection being implemented in Wappalyzer, your PR would need to be merged with them before July 1. If the PR is delayed for any reason, the metric may need to be implemented here as a custom metric anyway.

@max-ostapenko
Contributor Author

The next release of Wappalyzer is in about 2 weeks. @tunetheweb, do you know how long it takes before it is used on our crawling servers?

@tunetheweb
Member

We normally sync it up just before the Web Almanac run to get the latest version.

@pmeenan
Member

pmeenan commented Jun 22, 2021

FWIW, we sync the code to master and don't need to wait for an official release.

@VictorLeP
Contributor

FYI: PRs https://github.com/AliasIO/wappalyzer/pull/4048 (geolocation) and https://github.com/AliasIO/wappalyzer/pull/4050 (fingerprinting) are pending for adding libraries to Wappalyzer.

@max-ostapenko max-ostapenko marked this pull request as ready for review June 29, 2021 00:13
@max-ostapenko max-ostapenko mentioned this pull request Jun 29, 2021
@max-ostapenko
Contributor Author

max-ostapenko commented Jun 29, 2021

> Please also add a few test cases to show it working in WPT.

@rviscomi We've added test websites in the comments, is that what you mean?

@pmeenan OK, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

@VictorLeP
Contributor

> > Please also add a few test cases to show it working in WPT.
>
> @rviscomi We've added test websites in the comments, is that what you mean?

I think he means provide links to actual tests on https://www.webpagetest.org/, I'll try to get to those today.

> @pmeenan OK, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

It's indeed a bit annoying that we cannot do the whole extraction upfront to have clean data already in the requests table, but must merge the meta tag and header values in the SQL query. (Given that the literal definition of <meta http-equiv="..."> is to simulate HTTP headers, I could imagine that there are more cases where the same issue arises.)
@max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

GJFR added a commit to GJFR/legacy.httparchive.org that referenced this pull request Jun 29, 2021
@max-ostapenko
Contributor Author

> @max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

@VictorLeP let's keep it the way it is now for crawling, but I'll add the base64 string to the custom metric.
Then we'll be able to verify whether there is any value in going the hard way (compared to the HAR data) and can adjust it to a UDF.
What do you think?

@VictorLeP
Contributor

> > @max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.
>
> @VictorLeP let's keep it the way it is now for crawling, but I'll add the base64 string to the custom metric.
> Then we'll be able to verify whether there is any value in going the hard way (compared to the HAR data) and can adjust it to a UDF.
> What do you think?

Sounds excellent, it'll be interesting to compare the two sources and see if there is any meaningful difference.
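[Editor's note] To make the trade-off concrete, here is a hedged sketch (not this PR's actual code) of the kind of parsing function that could live either in the custom metric or later be moved into a BigQuery JS UDF. The token layout follows the Chromium Origin Trials framework documentation (1 version byte, a 64-byte signature, a 4-byte big-endian payload length, then a JSON payload); the synthetic token and its field values below are made up for illustration.

```javascript
// Parse a base64-encoded origin-trial token into its JSON payload.
// Layout assumed from the Chromium Origin Trials docs:
// [ 1-byte version | 64-byte signature | 4-byte payload length (BE) | JSON payload ]
function parseOriginTrialToken(tokenB64) {
  const buf = Buffer.from(tokenB64, 'base64');
  const version = buf[0];                      // version byte at offset 0
  const payloadLength = buf.readUInt32BE(65);  // length field sits right after the signature
  const payload = buf.slice(69, 69 + payloadLength).toString('utf8');
  return { version, ...JSON.parse(payload) };
}

// Synthetic token for illustration only (zeroed signature, made-up fields).
const payload = Buffer.from(JSON.stringify({ origin: 'https://example.com:443', feature: 'InterestCohortAPI' }));
const length = Buffer.alloc(4);
length.writeUInt32BE(payload.length, 0);
const token = Buffer.concat([Buffer.from([2]), Buffer.alloc(64), length, payload]).toString('base64');
console.log(parseOriginTrialToken(token).feature); // "InterestCohortAPI"
```

Storing only the raw base64 in the custom metric keeps the crawl simple; this function could then run inside a `CREATE TEMP FUNCTION ... LANGUAGE js` UDF at analysis time.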

@max-ostapenko
Contributor Author

@VictorLeP @rviscomi it seems we are ready for merge.

*
* @todo Check function/variable accesses through string searches (wrappers cannot be used, as the metrics are only collected at the end of the test)
*/
document_interestCohort: testPropertyStringInResponseBodies('document.+interestCohort'),
Member

Not sure about this pattern. .+ could include anything so document and interestCohort could be very far apart in the script. Also, do you mean to use the literal . character here, since it's a regex?

I would expect that you'd only want to match instances of document.interestCohort and not things like documentFOOinterestCohort.

Contributor Author

The point is that we can't know for sure whether there will be a call using a global name or some local variable (example).

So this multiple-keyword match should be better than just testPropertyStringInResponseBodies('interestCohort').
And no, there is no literal `.` there; it's the regex wildcard.
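[Editor's note] A small runnable sketch of the trade-off being discussed; the `strict` variant here is a hypothetical alternative for comparison, not something used in this PR.

```javascript
// Why 'document.+interestCohort' is loose, and what a stricter variant trades away.
const loose = /document.+interestCohort/;        // pattern under discussion in the review
const strict = /document\s*\.\s*interestCohort/; // hypothetical stricter variant

const direct = 'const cohort = await document.interestCohort();';
const aliased = 'const d = document; /* much later */ d.interestCohort();';
const noise = 'documentFOOinterestCohort';

console.log(loose.test(direct), loose.test(aliased), loose.test(noise));    // true true true
console.log(strict.test(direct), strict.test(aliased), strict.test(noise)); // true false false
```

The loose pattern catches aliased calls (the concern raised above) at the cost of also matching unrelated strings like `documentFOOinterestCohort`; the strict one does the reverse.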

Member

@rviscomi rviscomi left a comment

Last two comments

).length;
let privacy_links = Array.from(document.querySelectorAll('a')).filter(a => a.innerText.match(pattern));

return privacy_links.map(link => link.innerText);
Member

Returning the whole text may make aggregation difficult later, for example accounting for additional text / whitespace / case-sensitivity. WDYT about mapping links to the matching word pattern specifically?

Contributor

Or both? Having the whole text might be more useful for debugging, the specific pattern more for additional analysis.

Contributor

Stored both in 00f7160. Feel free to revert, but in that case we could also consider using test instead of match.
WPT run (rtl.de): https://www.webpagetest.org/custom_metrics.php?test=210630_AiDcD7_c350b1535df7dec7bddbafff01698b79&run=1&cached=0
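[Editor's note] A minimal runnable sketch of the "store both" approach resolved above: keep the full link text for debugging and the specific matched keyword for aggregation. The keyword pattern is a placeholder (not the PR's actual regex), and plain objects stand in for DOM anchors so the sketch runs outside a browser.

```javascript
// Placeholder keyword pattern; the real custom metric uses its own list.
const pattern = /privacy|cookies?|gdpr/i;

// Stand-ins for document.querySelectorAll('a') results.
const anchors = [
  { innerText: 'Privacy Policy' },
  { innerText: 'Cookie settings' },
  { innerText: 'Contact us' },
];

const privacy_links = anchors
  .filter(a => pattern.test(a.innerText))
  .map(a => ({
    text: a.innerText,                                    // whole text: useful for debugging
    keyword: a.innerText.match(pattern)[0].toLowerCase(), // matched word: easy to aggregate in SQL
  }));

console.log(privacy_links.map(l => l.keyword)); // [ 'privacy', 'cookie' ]
```

Normalizing the matched keyword to lowercase sidesteps the whitespace and case-sensitivity concerns raised in the review, while the raw text is still available.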

@rviscomi rviscomi merged commit 0b0f16b into HTTPArchive:master Jun 30, 2021
rviscomi pushed a commit that referenced this pull request Jun 30, 2021
* Rename and extend ecommerce custom metric for well-known URLs as per #2211

* Add parser for '/.well-known/security.txt' in well-known.js custom metric

* Add robots.txt data parsing to well-known.js

* Resolve reviewer suggestions

* Add privacy custom metric (#211)

* Replace left-over double quotes

* Update 'robots.txt' in well-known.js to connect user-agents with disallow rules

* Fix issue for 'robots.txt' in well-known.js

* Filter 'robots.txt' disallows in well-known.js
rviscomi pushed a commit that referenced this pull request Aug 30, 2021
* Rename and extend ecommerce custom metric for well-known URLs as per #2211

* Add parser for '/.well-known/security.txt' in well-known.js custom metric

* Add robots.txt data parsing to well-known.js

* Resolve reviewer suggestions

* Add privacy custom metric (#211)

* Replace left-over double quotes

* Update 'robots.txt' in well-known.js to connect user-agents with disallow rules

* Fix issue for 'robots.txt' in well-known.js

* Filter 'robots.txt' disallows in well-known.js

* Add per-metric error-handling and add fix for /robots.txt

* Add fetch error catch (e.g. caused by site CORS policy)