This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Privacy 2021 custom metrics #211

Merged
merged 26 commits into from
Jun 30, 2021

Conversation

max-ostapenko
Contributor

@max-ostapenko max-ostapenko commented Jun 20, 2021

Progress on HTTPArchive/almanac.httparchive.org#2149

@max-ostapenko
Contributor Author

@VictorLeP could you add to-do items to the description of this PR?
I've added some that you've completed, but how much is left?

@VictorLeP
Contributor

> @VictorLeP could you add to-do items to the description of this PR?
> I've added some that you've completed, but how much is left?

I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?

@VictorLeP
Contributor

> > @VictorLeP could you add to-do items to the description of this PR?
> > I've added some that you've completed, but how much is left?
>
> I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')?

I'm guessing this is due to the same reason as me not being able to mark the pull request as ready for review: "Only those with write access to this repository can mark a draft pull request as ready for review."

@tunetheweb
Member

You should have full access now, @VictorLeP. Can you try again?

@VictorLeP
Contributor

Thanks, I don't see an edit option for #211 (comment) yet though. (Just to check I'm looking in the right place: I would expect an 'Edit' item in the menu under the three dots top right of that comment.)

@tunetheweb
Member

Ah looks like this repo is set up differently to the main Web Almanac one so you can’t edit each other’s comments unfortunately.

@VictorLeP
Contributor

OK, thanks for checking!

@VictorLeP
Contributor

@max-ostapenko These are the changes to the to-do list I think should be made:

  • IAB TCF v1: retrieve consent string -- done in 2770294
  • Ads Transparency Spotlight: traverse all frames -- I don't think this is possible? (we can't access 3rd party frames from the custom metrics JS)
  • add item for 'privacy policy keywords' -- done (reused from last year)
  • add item for 'sensitive resources' with subitems 'permissions policy', 'media devices' and 'geolocation' -- I believe you will still restructure these, then they will be complete?

@VictorLeP
Contributor

Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

@max-ostapenko
Contributor Author

> Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).

I thought it would be enough to verify the presence of the ClientJS object?

@VictorLeP
Contributor

> > Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., ClientJS).
>
> I thought it would be enough to verify the presence of the ClientJS object?

Sure, but a custom metric is still needed for that if it's not added to Wappalyzer :)

@tunetheweb
Member

You can open a PR with Wappalyzer if you know how to detect it. And that way everyone gets to benefit from it and not just the Web Almanac.

Here's the PR for FingerprintJS, for example: https://github.com/AliasIO/wappalyzer/pull/1790 and a follow-up one to improve on it: https://github.com/AliasIO/wappalyzer/pull/3503

@rviscomi
Member

Be aware that if you're relying on the detection being implemented in Wappalyzer, your PR would need to be merged with them before July 1. If the PR is delayed for any reason, the metric may need to be implemented here as a custom metric anyway.

@max-ostapenko
Contributor Author

The next release of Wappalyzer is in about 2 weeks. @tunetheweb, do you know how long it takes before it is used on our crawling servers?

@tunetheweb
Member

We normally sync it up just before the Web Almanac run to get the latest version.

@pmeenan
Member

pmeenan commented Jun 22, 2021

FWIW, we sync the code to master and don't need to wait for an official release.

@VictorLeP
Contributor

FYI: PRs https://github.com/AliasIO/wappalyzer/pull/4048 (geolocation) and https://github.com/AliasIO/wappalyzer/pull/4050 (fingerprinting) are pending for adding libraries to Wappalyzer.

@max-ostapenko max-ostapenko marked this pull request as ready for review June 29, 2021 00:13
@max-ostapenko max-ostapenko mentioned this pull request Jun 29, 2021
@max-ostapenko
Contributor Author

max-ostapenko commented Jun 29, 2021

> Please also add a few test cases to show it working in WPT.

@rviscomi We've added test websites in the comments, is that what you mean?

@pmeenan OK, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

@VictorLeP
Contributor

> > Please also add a few test cases to show it working in WPT.
>
> @rviscomi We've added test websites in the comments, is that what you mean?

I think he means provide links to actual tests on https://www.webpagetest.org/, I'll try to get to those today.

> @pmeenan OK, got it. My take was to write one custom metric that checks all the optional feature signals, whether in meta tags or in headers. I now learned that for the most accurate analysis we have to combine it with SQL over the headers in the requests table.

It's indeed a bit annoying that we cannot do the whole extraction upfront to have clean data already in the requests table, but must merge the meta tag and header values in the SQL query. (Given that the literal definition of <meta http-equiv="..."> is to simulate HTTP headers, I could imagine that there are more cases where the same issue arises.)
@max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

GJFR added a commit to GJFR/legacy.httparchive.org that referenced this pull request Jun 29, 2021
@max-ostapenko
Contributor Author

> @max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.

@VictorLeP let's keep it the way it is now for crawling, but I'll add the base64 string to the custom metric.
Then we'll be able to verify whether there is any value in going the hard way (compared to the HAR data) and can adjust it to a UDF.
What do you think?

@VictorLeP
Contributor

> > @max-ostapenko I'm thinking we should then revert to just storing the raw base64 value for the origin value meta tags in the custom metric, and then doing the extraction of all the attributes in the SQL query? We can use a JS UDF there, so it would just be a matter of moving over the function statement.
>
> @VictorLeP let's keep it the way it is now for crawling, but I'll add the base64 string to the custom metric.
> Then we'll be able to verify whether there is any value in going the hard way (compared to the HAR data) and can adjust it to a UDF.
> What do you think?

Sounds excellent, it'll be interesting to compare the two sources and see if there is any meaningful difference.
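[Editor's note] To make the trade-off concrete, here is a hedged sketch (not this PR's actual code) of the kind of parsing function that could live either in the custom metric or later be moved into a BigQuery JS UDF. The token layout follows the Chromium Origin Trials framework documentation (1 version byte, a 64-byte signature, a 4-byte big-endian payload length, then a JSON payload); the synthetic token and its field values below are made up for illustration.

```javascript
// Parse a base64-encoded origin-trial token into its JSON payload.
// Layout assumed from the Chromium Origin Trials docs:
// [ 1-byte version | 64-byte signature | 4-byte payload length (BE) | JSON payload ]
function parseOriginTrialToken(tokenB64) {
  const buf = Buffer.from(tokenB64, 'base64');
  const version = buf[0];                      // version byte at offset 0
  const payloadLength = buf.readUInt32BE(65);  // length field sits right after the signature
  const payload = buf.slice(69, 69 + payloadLength).toString('utf8');
  return { version, ...JSON.parse(payload) };
}

// Synthetic token for illustration only (zeroed signature, made-up fields).
const payload = Buffer.from(JSON.stringify({ origin: 'https://example.com:443', feature: 'InterestCohortAPI' }));
const length = Buffer.alloc(4);
length.writeUInt32BE(payload.length, 0);
const token = Buffer.concat([Buffer.from([2]), Buffer.alloc(64), length, payload]).toString('base64');
console.log(parseOriginTrialToken(token).feature); // "InterestCohortAPI"
```

Storing only the raw base64 in the custom metric keeps the crawl simple; this function could then run inside a `CREATE TEMP FUNCTION ... LANGUAGE js` UDF at analysis time.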

@max-ostapenko
Contributor Author

@VictorLeP @rviscomi it seems we are ready for merge.

*
* @todo Check function/variable accesses through string searches (wrappers cannot be used, as the metrics are only collected at the end of the test)
*/
document_interestCohort: testPropertyStringInResponseBodies('document.+interestCohort'),
Member

Not sure about this pattern. .+ could include anything so document and interestCohort could be very far apart in the script. Also, do you mean to use the literal . character here, since it's a regex?

I would expect that you'd only want to match instances of document.interestCohort and not things like documentFOOinterestCohort.

Contributor Author

The point is that we can't know for sure whether there will be a call using a global name or some local variable (example).

So this multiple-keyword match should be better than just testPropertyStringInResponseBodies('interestCohort').
And no, there is no literal `.` there; it's the regex wildcard.
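[Editor's note] A small runnable sketch of the trade-off being discussed; the `strict` variant here is a hypothetical alternative for comparison, not something used in this PR.

```javascript
// Why 'document.+interestCohort' is loose, and what a stricter variant trades away.
const loose = /document.+interestCohort/;        // pattern under discussion in the review
const strict = /document\s*\.\s*interestCohort/; // hypothetical stricter variant

const direct = 'const cohort = await document.interestCohort();';
const aliased = 'const d = document; /* much later */ d.interestCohort();';
const noise = 'documentFOOinterestCohort';

console.log(loose.test(direct), loose.test(aliased), loose.test(noise));    // true true true
console.log(strict.test(direct), strict.test(aliased), strict.test(noise)); // true false false
```

The loose pattern catches aliased calls (the concern raised above) at the cost of also matching unrelated strings like `documentFOOinterestCohort`; the strict one does the reverse.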

Member

@rviscomi rviscomi left a comment

Last two comments

).length;
let privacy_links = Array.from(document.querySelectorAll('a')).filter(a => a.innerText.match(pattern));

return privacy_links.map(link => link.innerText);
Member

Returning the whole text may make aggregation difficult later, for example accounting for additional text / whitespace / case-sensitivity. WDYT about mapping links to the matching word pattern specifically?

Contributor

Or both? Having the whole text might be more useful for debugging, the specific pattern more for additional analysis.

Contributor

Stored both in 00f7160. Feel free to revert, but in that case we could also consider using test instead of match.
WPT run (rtl.de): https://www.webpagetest.org/custom_metrics.php?test=210630_AiDcD7_c350b1535df7dec7bddbafff01698b79&run=1&cached=0
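[Editor's note] A minimal runnable sketch of the "store both" approach resolved above: keep the full link text for debugging and the specific matched keyword for aggregation. The keyword pattern is a placeholder (not the PR's actual regex), and plain objects stand in for DOM anchors so the sketch runs outside a browser.

```javascript
// Placeholder keyword pattern; the real custom metric uses its own list.
const pattern = /privacy|cookies?|gdpr/i;

// Stand-ins for document.querySelectorAll('a') results.
const anchors = [
  { innerText: 'Privacy Policy' },
  { innerText: 'Cookie settings' },
  { innerText: 'Contact us' },
];

const privacy_links = anchors
  .filter(a => pattern.test(a.innerText))
  .map(a => ({
    text: a.innerText,                                    // whole text: useful for debugging
    keyword: a.innerText.match(pattern)[0].toLowerCase(), // matched word: easy to aggregate in SQL
  }));

console.log(privacy_links.map(l => l.keyword)); // [ 'privacy', 'cookie' ]
```

Normalizing the matched keyword to lowercase sidesteps the whitespace and case-sensitivity concerns raised in the review, while the raw text is still available.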

@rviscomi rviscomi merged commit 0b0f16b into HTTPArchive:master Jun 30, 2021
rviscomi pushed a commit that referenced this pull request Jun 30, 2021
* Rename and extend ecommerce custom metric for well-known URLs as per #2211

* Add parser for '/.well-known/security.txt' in well-known.js custom metric

* Add robots.txt data parsing to well-known.js

* Resolve reviewer suggestions

* Add privacy custom metric (#211)

* Replace left-over double quotes

* Update 'robots.txt' in well-known.js to connect user-agents with disallow rules

* Fix issue for 'robots.txt' in well-known.js

* Filter 'robots.txt' disallows in well-known.js
rviscomi pushed a commit that referenced this pull request Aug 30, 2021
* Rename and extend ecommerce custom metric for well-known URLs as per #2211

* Add parser for '/.well-known/security.txt' in well-known.js custom metric

* Add robots.txt data parsing to well-known.js

* Resolve reviewer suggestions

* Add privacy custom metric (#211)

* Replace left-over double quotes

* Update 'robots.txt' in well-known.js to connect user-agents with disallow rules

* Fix issue for 'robots.txt' in well-known.js

* Filter 'robots.txt' disallows in well-known.js

* Add per-metric error-handling and add fix for /robots.txt

* Add fetch error catch (e.g. caused by site CORS policy)