-
-
Notifications
You must be signed in to change notification settings - Fork 85
Conversation
Co-authored-by: Max Ostapenko <[email protected]>
@VictorLeP could you add to-do items to the description of this PR? |
I don't seem to be allowed to edit your initial comment (which I presume you mean with 'description')? |
I'm guessing this is due to the same reason as me not being able to mark the pull request as ready for review: "Only those with write access to this repository can mark a draft pull request as ready for review." |
You should have full access now @VictorLeP . Can you try again. |
Thanks, I don't see an edit option for #211 (comment) yet though. (Just to check I'm looking in the right place: I would expect an 'Edit' item in the menu under the three dots top right of that comment.) |
Ah looks like this repo is set up differently to the main Web Almanac one so you can’t edit each other’s comments unfortunately. |
OK, thanks for checking! |
@max-ostapenko These are the changes to the to-do list I think should be made:
|
Depending on whether we can use Wappalyzer or not, we might also need custom metrics to detect certain fingerprinting libraries more reliably (e.g., |
I thought it would be enough to verify the presence of |
Sure, but a custom metric is still needed for that if it's not added to Wappalyzer :) |
You can open a PR with Wappalyzer if you know how to detect it. And that way everyone gets to benefit from it and not just the Web Almanac. Here's the PR for FingerprintJs for example: https://github.com/AliasIO/wappalyzer/pull/1790 and a follow up one to improve on that: https://github.com/AliasIO/wappalyzer/pull/3503 |
Be aware that if you're relying on the detection being implemented in Wappalyzer, your PR would need to be merged with them before July 1. If the PR is delayed for any reason, the metric may need to be implemented here as a custom metric anyway. |
Next release of wappalyser is in about 2 weeks. @tunetheweb do you know how much time it takes to be used on our crawling servers? |
We normally sync it up just before the web almanac run to get latest version. |
FWIW, we sync the code to master and don't need to wait for an official release. |
FYI: PRs https://github.com/AliasIO/wappalyzer/pull/4048 (geolocation) and https://github.com/AliasIO/wappalyzer/pull/4050 (fingerprinting) are pending for adding libraries to Wappalyzer. |
@rviscomi We've added test websites in the comments, is that what you mean? @pmeenan ok, got it. My take was to write one custom metric that will check all optional feature signs: in meta tags or in headers. I know learned that for the most accurate analysis we have to combine it with SQL over headers in the requests table. |
I think he means provide links to actual tests on https://www.webpagetest.org/, I'll try to get to those today.
It's indeed a bit annoying that we cannot do the whole extraction upfront to have clean data already in the requests table, but must merge the meta tag and header values in the SQL query. (Given that the literal definition of |
@VictorLeP let's keep it the way it is now for crawling. But I'll add a base64 string to the custom metric. |
Sounds excellent, it'll be interesting to compare the two sources and see if there is any meaningful difference. |
@VictorLeP @rviscomi seems we are ready for merge |
* | ||
* @todo Check function/variable accesses through string searches (wrappers cannot be used, as the metrics are only collected at the end of the test) | ||
*/ | ||
document_interestCohort: testPropertyStringInResponseBodies('document.+interestCohort'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this pattern. .+
could include anything so document
and interestCohort
could be very far apart in the script. Also, do you mean to use the literal .
character here, since it's a regex?
I would expect that you'd only want to match instances of document.interestCohort
and not things like documentFOOinterestCohort
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that we can't know for sure if there will be a call using a global name or some local variable (example)
So this multiple keyword match should be better than just testPropertyStringInResponseBodies('interestCohort')
.
And no, there is no literal .
.
Co-authored-by: Rick Viscomi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last two comments
custom_metrics/privacy.js
Outdated
).length; | ||
let privacy_links = Array.from(document.querySelectorAll('a')).filter(a => a.innerText.match(pattern)); | ||
|
||
return privacy_links.map(link => link.innerText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning the whole text may make aggregation difficult later, for example accounting for additional text / whitespace / case-sensitivity. WDYT about mapping links to the matching word pattern specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or both? Having the whole text might be more useful for debugging, the specific pattern more for additional analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stored both in 00f7160. Feel free to revert, but in that case we could also consider using test
instead of match
.
WPT run (rtl.de): https://www.webpagetest.org/custom_metrics.php?test=210630_AiDcD7_c350b1535df7dec7bddbafff01698b79&run=1&cached=0
Co-authored-by: Rick Viscomi <[email protected]>
* Rename and extend ecommerce custom metric for well-known URLs as per #2211 * Add parser for '/.well-known/security.txt' in well-known.js custom metric * Add robots.txt data parsing to well-known.js * Resolve reviewer suggestions * Add privacy custom metric (#211) * Replace left-over double quotes * Update 'robots.txt' in well-known.js to connect user-agents with disallow rules * Fix issue for 'robots.txt' in well-known.js * Filter 'robots.txt' disallows in well-known.js
* Rename and extend ecommerce custom metric for well-known URLs as per #2211 * Add parser for '/.well-known/security.txt' in well-known.js custom metric * Add robots.txt data parsing to well-known.js * Resolve reviewer suggestions * Add privacy custom metric (#211) * Replace left-over double quotes * Update 'robots.txt' in well-known.js to connect user-agents with disallow rules * Fix issue for 'robots.txt' in well-known.js * Filter 'robots.txt' disallows in well-known.js * Add per-metric error-handling and add fix for /robots.txt * Add fetch error catch (e.g. caused by site CORS policy)
Progress on HTTPArchive/almanac.httparchive.org#2149