Fetch well-known URLs #2211

Closed · nrllh opened this issue May 26, 2021 · 10 comments
Labels: analysis (Querying the dataset)

nrllh (Collaborator) commented May 26, 2021

I think we should discuss fetching well-known URLs (e.g., robots.txt, ads.txt, security.txt), because I see two problems here:

  1. Even if one of these URLs is already being fetched by our crawler, not all contributors know about it. (I noticed that last year security.txt was used by the SEO chapter; we could also have used it for the Security chapter, but we didn't because we (or I 😊) didn't know it was available.)

  2. Fetching these URLs takes extra time, but I think analyzing them is essential to enrich our analysis. We could fetch these files once a year for the Almanac.

My suggestion is to collect interesting URLs in this issue so that all contributors know which URLs are being fetched in addition to the page itself.

I found a list, but it doesn't include all URLs: https://en.wikipedia.org/wiki/List_of_/.well-known/_services_offered_by_webservers - for example, manifest.json and hackers.txt are missing from it.

So what do you think?

nrllh (Collaborator, Author) commented May 26, 2021

For the Security chapter, security.txt, hackers.txt, and robots.txt would be interesting:

| name | path |
| --- | --- |
| robots.txt | /robots.txt |
| security.txt | /.well-known/security.txt |
| hackers.txt | /hackers.txt |

cc @SaptakS @tomvangoethem

SaptakS (Collaborator) commented May 26, 2021

I think these URLs will be really helpful in enabling more interesting analyses. Totally support this.

rockeynebhwani (Contributor) commented:

@nrllh - Last year, I tried to do this for the eCommerce chapter by looking at the following well-known URLs:

.well-known/assetlinks.json
.well-known/apple-app-site-association

My commit from last year is in custom_metrics/ecommerce.js (https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/ecommerce.js)

I was trying to find out how many eCommerce sites have an Android/iOS app and use these standards to declare the app association. I didn't end up including any insights in the eCommerce chapter: I ran out of time, and for some platforms I was getting an empty assetlinks.json file. To get something meaningful for the chapter, I would have needed to parse the content of the file; just detecting the presence of the file was not enough.
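
A rough idea of what that extra parsing could look like; `parseAssetLinks` and its output fields are hypothetical names, though the statement structure follows the public Digital Asset Links format:

```js
// Hypothetical sketch: go beyond a presence check by parsing assetlinks.json
// and extracting declared Android packages, so empty files don't count.
const parseAssetLinks = (body) => {
  try {
    const statements = JSON.parse(body);
    const packages = statements
      .filter((s) => s.target && s.target.namespace === 'android_app')
      .map((s) => s.target.package_name);
    return { hasAndroidApp: packages.length > 0, packages };
  } catch (e) {
    // Empty or malformed files end up here.
    return { hasAndroidApp: false, parseError: true };
  }
};
```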

Tagging the eCommerce 2021 team in case they want to pick this up - @bobbyshaw @rrajiv

rviscomi (Member) commented:

One issue with WPT is that fetching additional URLs in a custom metric does not necessarily make their requests/responses available in the network log that feeds the requests and response_bodies tables. So we would either have to dump the response bodies into the output of the custom metric, or output only some summary statistics about each file. The latter is preferable because we don't know how long some of these files will be, and we don't want to bloat the HAR file (made available in the pages tables). @nrllh could you clarify which approach you're proposing?

cc @pmeenan in case there's a way to make custom metric requests visible in the requests/bodies.

tunetheweb (Member) commented May 26, 2021

I actually quite like the fact that it doesn't appear in requests/bodies; it means you're not polluting the real page-load data. Otherwise the number of requests goes up by one for each additional fetch, and the number of 404s could skyrocket, as most sites won't have many of these URLs.

I do agree we should ideally do the processing in the custom metric, though, and only save summary results back rather than the full file.

pmeenan (Member) commented May 27, 2021

Yeah, I agree with @tunetheweb - they aren't part of the page load, so they shouldn't be in the main requests data.

A few things to watch out for:

  • Make sure to have aggressive timeouts on the fetches so we don't stall the crawl if a lot of sites let the requests time out.
  • If possible, start all of the fetches asynchronously, do the processing, and then await them all so they run in parallel (preferably with all of the fetches in a single custom metric, or a small number of them, because the custom metrics are serialized). A sketch of this pattern follows below.
  • Avoid storing/processing bodies of non-200 responses, in case a friendly 404 page is returned.
  • If the responses may be big, storing the full response will bloat the page data table/queries.

On the last point, we could add more processing to the HARs if we want to store response bodies but prune them out of the page data and into the bodies tables. We could use a well-known metric name that includes the file name, something like "response-body-security.txt", and then the HAR processing could prune out anything that starts with response-body.
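
A minimal sketch of the fetch pattern above, assuming the custom metric can return a promise; `fetchWithTimeout`, the URL list, and the summary fields are illustrative, not the actual implementation:

```js
// Illustrative sketch (not the actual custom metric): parallel fetches with
// aggressive timeouts, skipping bodies of non-200 responses, and returning
// only small summary stats rather than full file contents.
const fetchWithTimeout = (url, timeoutMs = 5000) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return fetch(url, { signal: controller.signal }).finally(() => clearTimeout(timer));
};

const summarize = async (url) => {
  try {
    const response = await fetchWithTimeout(url);
    // Don't process friendly 404 pages or other non-200 responses.
    if (response.status !== 200) {
      return { url, status: response.status, found: false };
    }
    const body = await response.text();
    // Summary stats only, to avoid bloating the page data.
    return { url, status: 200, found: true, size: body.length };
  } catch (e) {
    return { url, timedOutOrFailed: true };
  }
};

// Start all fetches at once and await them together so they run in parallel.
return Promise.all([
  '/robots.txt',
  '/.well-known/security.txt',
  '/hackers.txt',
].map(summarize)).then((results) => JSON.stringify(results));
```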

nrllh (Collaborator, Author) commented May 31, 2021

There is a lot to explore in these files. In robots.txt, it'll be interesting to analyze (for the Security chapter) potential exploitation targets (e.g., secret login links). In security.txt we can check which reporting methods are popular.

That's why I think providing the response body of these files would be better than providing only some statistics; otherwise it could become a limitation for future analyses. Of course, only if it doesn't cause too much overhead.
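
As a middle ground, a summary could still capture the reporting methods; a hypothetical reduction of a security.txt body (function and field names are made up for illustration):

```js
// Hypothetical sketch: reduce a security.txt body to the contact schemes it
// declares, so aggregate queries can answer "which reporting methods are
// popular" without storing the full file.
const securityTxtSummary = (body) => {
  const contacts = body
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('contact:'))
    .map((line) => line.slice('contact:'.length).trim());
  return {
    hasContact: contacts.length > 0,
    // e.g., ['mailto', 'https'] - the scheme alone is enough for aggregate stats.
    schemes: [...new Set(contacts.map((c) => c.split(':')[0]))],
  };
};
```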

tunetheweb added the analysis (Querying the dataset) label and the 2021 Analysis milestone on Jun 1, 2021.
VictorLeP mentioned this issue on Jun 25, 2021.
GJFR (Member) commented Jun 28, 2021

To be completely sure: currently the consensus is for analysts to use custom metrics to collect information on the content of .well-known URLs, right? Or is this information going to be included in the crawl dataset, such that it will be available by query?

Just want to avoid redundant data/work :)

rviscomi (Member) commented Jun 28, 2021

@GJFR yes, custom metrics are the preferred approach for this, but the window to get it in before the July crawl is closing quickly. Per @pmeenan's suggestion, we should combine any custom metrics that rely on external fetches so that they can be parallelized and share the same timeout logic. So whoever implements this should extend ecommerce.js and rename it to something more generic, like well_known.js or external_resources.js. I would still discourage returning the entire contents of the resource, and opt for more specific/aggregatable summary stats instead.

GJFR (Member) commented Jun 28, 2021

I've extended and renamed ecommerce.js to well_known.js in HTTPArchive/legacy.httparchive.org@2a441a0.

It should be easily extensible to other well-known URLs and external sources: just add parseResponse calls, passing the desired URL and, if required, a parser function.
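
Based on that description (the exact signature lives in the linked commit, so treat this as an assumption), adding another URL would look roughly like:

```js
// Assumed usage of parseResponse(url[, parser]) as described above.
// Without a parser, presumably only presence/status is recorded.
parseResponse('/.well-known/change-password');

// With a parser that reduces the body to aggregatable summary stats.
parseResponse('/.well-known/security.txt', (body) => ({
  hasContact: /^contact:/im.test(body),
}));
```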
