This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Security 2021 custom metrics #219

Merged
merged 9 commits into from
Jun 30, 2021

Conversation

GJFR
Member

@GJFR GJFR commented Jun 28, 2021

Progress on HTTPArchive/almanac.httparchive.org#2150.

  • robots.txt
  • /.well-known/security.txt

Also includes renaming and extending ecommerce custom metric for well-known URLs as per HTTPArchive/almanac.httparchive.org#2211.

@rviscomi
Member

Is this ready for review?

@max-ostapenko
Contributor

@GJFR Could you please add the following, as mentioned in #211?

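// privacy
parseResponse('/.well-known/gpc.json', r => {
  return r.text().then(text => {
    let data = {
      'gpc': null
    };
    try {
      // The body may not be valid JSON (e.g. an HTML error page).
      let gpc_data = JSON.parse(text);
      if (typeof gpc_data.gpc === 'boolean') {
        data.gpc = gpc_data.gpc;
      }
    } catch (e) {
      // Leave data.gpc as null when the body cannot be parsed.
    }
    return data;
  });
}),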

@GJFR
Member Author

GJFR commented Jun 29, 2021

@max-ostapenko Your code has been added 👍

@rviscomi Thank you for your comments! I've updated the code. I'm gonna do a quick check and tweak the robots.txt data collecting. Will mark as ready for review ASAP.

let data = {
'signed': false
};
if (text.startsWith('-----BEGIN PGP SIGNED MESSAGE-----')) {

❤️

@GJFR
Member Author

GJFR commented Jun 29, 2021

@GJFR GJFR marked this pull request as ready for review June 29, 2021 10:35
@GJFR GJFR requested a review from rviscomi June 29, 2021 10:36
@SaptakS

SaptakS commented Jun 29, 2021

WPT tests look good to me.

@GJFR
Member Author

GJFR commented Jun 30, 2021

I filtered on all keywords discussed in this thread.

A few thoughts:

  • If we are not that interested in the User-agent entries themselves, we could drop reported user-agents that have no matched Disallow paths to reduce clutter.
  • As shown in the test results below, author seems to be a popular false positive for auth. I'm sure we will encounter similar false positives in the crawl data. In my opinion, we should tighten the filter at the querying stage rather than at the custom metric level, for two reasons:
    • We could overtighten the filter and miss endpoints. It's safer to try stricter filters once we have the crawl data.
    • It's more practical to keep as little filtering logic as possible at the custom metric level. For example, if this metric is later used in a longitudinal analysis spanning multiple crawls, it will be harder to get right if the custom metric filter has been adjusted between crawls. I'm not that familiar with the database/querying costs, though, so I can't say whether it will be worth the (limited?) additional cost.
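
To make the trade-off above concrete, here is a minimal sketch of a loose versus a stricter keyword filter on Disallow paths, plus the user-agent pruning from the first bullet. It is an illustration only, not the code in this PR: the keyword list (apart from auth), the function names, and the sample paths are all hypothetical, and in practice the stricter variant would more likely live in the query than in JavaScript.

```javascript
// Hypothetical keyword list; only `auth` is named in the comment above.
const KEYWORDS = ['auth', 'login', 'admin'];

// Loose filter: a keyword matches if it appears anywhere in the path.
// This is what produces the `author` false positive for `auth`.
function looseMatch(path, keywords = KEYWORDS) {
  const lower = path.toLowerCase();
  return keywords.filter(k => lower.includes(k));
}

// Stricter filter: a keyword must be a whole token of the path,
// where tokens are runs of letters separated by anything else.
function strictMatch(path, keywords = KEYWORDS) {
  const tokens = path.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  return keywords.filter(k => tokens.includes(k));
}

// Drop user-agents whose Disallow paths matched no keyword at all,
// keeping only the matched paths for the remaining user-agents.
function pruneEmptyAgents(byAgent, matcher) {
  const result = {};
  for (const [agent, paths] of Object.entries(byAgent)) {
    const matched = paths.filter(p => matcher(p).length > 0);
    if (matched.length > 0) {
      result[agent] = matched;
    }
  }
  return result;
}

// Hypothetical sample data, shaped as user-agent -> Disallow paths.
const sample = {
  '*': ['/author/', '/admin/'],
  'Googlebot': ['/images/']
};

console.log(pruneEmptyAgents(sample, looseMatch));
console.log(pruneEmptyAgents(sample, strictMatch));
```

Keeping the loose matcher in the custom metric and applying something like strictMatch later means the collected data stays comparable between crawls, at the cost of storing some false positives.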

WPT test runs:

@GJFR GJFR requested a review from rviscomi June 30, 2021 16:43
@rviscomi
Member

LGTM thanks everyone!

@rviscomi rviscomi merged commit 133472e into HTTPArchive:master Jun 30, 2021