Fetch well-known URLs #2211

Closed · nrllh opened this issue May 26, 2021 · 10 comments
Labels: analysis (Querying the dataset)

nrllh (Collaborator) commented May 26, 2021

I think we should discuss fetching well-known URLs (e.g., robots.txt, ads.txt, security.txt), because I see two problems here:

  1. Even if one of these URLs is already being fetched by our crawler, not all contributors know about it. (I noticed that last year security.txt was used by the SEO chapter; we could also have used it for the Security chapter, but we didn't because we (or I 😊) didn't know it was available.)

  2. Fetching these URLs takes extra time, but I think analyzing them is essential to enrich our analysis. We could fetch these files once a year for the Almanac.

My suggestion is to collect interesting URLs in this issue so that all contributors know which URLs are being fetched in addition to the page itself.

I found a list, but it doesn't include all URLs: https://en.wikipedia.org/wiki/List_of_/.well-known/_services_offered_by_webservers - for example, manifest.json and hackers.txt are missing from it.

So what do you think?

nrllh (Collaborator, Author) commented May 26, 2021

For the Security chapter, security.txt, hackers.txt, and robots.txt would be interesting:

| name | path |
| --- | --- |
| robots.txt | /robots.txt |
| security.txt | /.well-known/security.txt |
| hackers.txt | /hackers.txt |

cc @SaptakS @tomvangoethem

SaptakS (Collaborator) commented May 26, 2021

I think these URLs will be really helpful in enabling more interesting analyses. Totally support this.

rockeynebhwani (Contributor) commented:

@nrllh - Last year, I tried to do this for the eCommerce chapter by looking at the following well-known URLs:

.well-known/assetlinks.json
.well-known/apple-app-site-association

My commit from last year is in custom_metrics/ecommerce.js (https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/ecommerce.js)

I was trying to find out how many eCommerce sites have an Android/iOS app and use these standards to declare the app association. I didn't end up including any insights in the eCommerce chapter: I ran out of time, and for some platforms I was getting an empty assetlinks.json file. To get something meaningful for the chapter, I would have needed to parse the content of the file; just detecting the presence of the file was not enough.
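
A rough idea of what that extra parsing could look like; `parseAssetLinks` and its output fields are hypothetical names, though the statement structure follows the public Digital Asset Links format:

```js
// Hypothetical sketch: go beyond a presence check by parsing assetlinks.json
// and extracting declared Android packages, so empty files don't count.
const parseAssetLinks = (body) => {
  try {
    const statements = JSON.parse(body);
    const packages = statements
      .filter((s) => s.target && s.target.namespace === 'android_app')
      .map((s) => s.target.package_name);
    return { hasAndroidApp: packages.length > 0, packages };
  } catch (e) {
    // Empty or malformed files end up here.
    return { hasAndroidApp: false, parseError: true };
  }
};
```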

Tagging the eCommerce 2021 team in case they want to pick this up - @bobbyshaw @rrajiv

rviscomi (Member) commented:

One issue with WPT is that fetching additional URLs in a custom metric does not necessarily make their requests/responses available in the network log that feeds the requests and response_bodies tables. So we would either have to dump the response bodies into the output of the custom metric, or output only some summary statistics about each file. The latter is preferable because we don't know how long some of these files will be, and we don't want to bloat the HAR file (made available in the pages tables). @nrllh could you clarify which approach you're proposing?

cc @pmeenan in case there's a way to make custom metric requests visible in the requests/bodies.

tunetheweb (Member) commented May 26, 2021

I actually quite like the fact that it doesn't appear in requests/bodies; it means you're not polluting the real page-load data. Otherwise the number of requests goes up by one for each additional fetch, and the number of 404s could skyrocket, as most sites won't have many of these URLs.

I do agree we should ideally do the processing in the custom metric, though, and only save summary results back rather than the full file.

pmeenan (Member) commented May 27, 2021

Yeah, I agree with @tunetheweb - they aren't part of the page load, so they shouldn't be in the main requests data.

A few things to watch out for:

  • Make sure to have aggressive timeouts on the fetches so we don't stall the crawl if a lot of sites let the requests time out.
  • If possible, start all of the fetches asynchronously, do the processing, and then await them all so they run in parallel (preferably with all of the fetches in a single custom metric, or a small number of them, because the custom metrics are serialized). A sketch of this pattern follows below.
  • Avoid storing/processing bodies of non-200 responses, in case a friendly 404 page is returned.
  • If the responses may be big, storing the full response will bloat the page data table/queries.

On the last point, we could add more processing to the HARs if we want to store response bodies but prune them out of the page data and into the bodies tables. We could use a well-known metric name that includes the file name, something like "response-body-security.txt", and then the HAR processing could prune out anything that starts with response-body.
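
A minimal sketch of the fetch pattern above, assuming the custom metric can return a promise; `fetchWithTimeout`, the URL list, and the summary fields are illustrative, not the actual implementation:

```js
// Illustrative sketch (not the actual custom metric): parallel fetches with
// aggressive timeouts, skipping bodies of non-200 responses, and returning
// only small summary stats rather than full file contents.
const fetchWithTimeout = (url, timeoutMs = 5000) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return fetch(url, { signal: controller.signal }).finally(() => clearTimeout(timer));
};

const summarize = async (url) => {
  try {
    const response = await fetchWithTimeout(url);
    // Don't process friendly 404 pages or other non-200 responses.
    if (response.status !== 200) {
      return { url, status: response.status, found: false };
    }
    const body = await response.text();
    // Summary stats only, to avoid bloating the page data.
    return { url, status: 200, found: true, size: body.length };
  } catch (e) {
    return { url, timedOutOrFailed: true };
  }
};

// Start all fetches at once and await them together so they run in parallel.
return Promise.all([
  '/robots.txt',
  '/.well-known/security.txt',
  '/hackers.txt',
].map(summarize)).then((results) => JSON.stringify(results));
```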

nrllh (Collaborator, Author) commented May 31, 2021

There is a lot to explore in these files. In robots.txt, it'll be interesting to analyze (for the Security chapter) potential exploitation targets (e.g., secret login links). In security.txt we can check which reporting methods are popular.

That's why I think providing the response body of these files would be better than providing only some statistics; otherwise it could become a limitation for future analyses. Of course, only if it doesn't cause too much overhead.
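
As a middle ground, a summary could still capture the reporting methods; a hypothetical reduction of a security.txt body (function and field names are made up for illustration):

```js
// Hypothetical sketch: reduce a security.txt body to the contact schemes it
// declares, so aggregate queries can answer "which reporting methods are
// popular" without storing the full file.
const securityTxtSummary = (body) => {
  const contacts = body
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('contact:'))
    .map((line) => line.slice('contact:'.length).trim());
  return {
    hasContact: contacts.length > 0,
    // e.g., ['mailto', 'https'] - the scheme alone is enough for aggregate stats.
    schemes: [...new Set(contacts.map((c) => c.split(':')[0]))],
  };
};
```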

tunetheweb added the analysis (Querying the dataset) label and the 2021 Analysis milestone on Jun 1, 2021.
VictorLeP mentioned this issue on Jun 25, 2021.
GJFR (Member) commented Jun 28, 2021

To be completely sure: currently the consensus is for analysts to use custom metrics to collect information on the content of .well-known URLs, right? Or is this information going to be included in the crawl dataset, such that it will be available by query?

Just want to avoid redundant data/work :)

rviscomi (Member) commented Jun 28, 2021

@GJFR yes, custom metrics are the preferred approach for this, but the window to get it in before the July crawl is closing quickly. Per @pmeenan's suggestion, we should combine any custom metrics that rely on external fetches so that they can be parallelized and share the same timeout logic. So whoever implements this should extend ecommerce.js and rename it to something more generic, like well_known.js or external_resources.js. I would still discourage returning the entire contents of the resource, and opt for more specific/aggregatable summary stats instead.

GJFR (Member) commented Jun 28, 2021

I've extended and renamed ecommerce.js to well_known.js in HTTPArchive/legacy.httparchive.org@2a441a0.

It should be easily extensible to other well-known URLs and external sources: just add parseResponse calls, passing the desired URL and, if required, a parser function.
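
Based on that description (the exact signature lives in the linked commit, so treat this as an assumption), adding another URL would look roughly like:

```js
// Assumed usage of parseResponse(url[, parser]) as described above.
// Without a parser, presumably only presence/status is recorded.
parseResponse('/.well-known/change-password');

// With a parser that reduces the body to aggregatable summary stats.
parseResponse('/.well-known/security.txt', (body) => ({
  hasContact: /^contact:/im.test(body),
}));
```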
