Generate 2020 almanac tables on BigQuery #1258
I'm replacing the […]. I plan to regenerate the […].
Was looking over the 2019 PWA queries and I think it's highly likely this data would be asked for in 2020, as they form the basis of most of the 2019 queries. I gather they are just subsets of the […]?
The PWA authors confirmed they do want the same stats as last year, so let us know if it's possible to date-stamp those two tables and add this year's stats. Also see this comment: […]
Or is there a way to partition the […] table?
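If date-stamping is the way to go, I'd imagine something like this (a hypothetical sketch, not a tested migration; it assumes the 2019 rows came from the 2019-07-01 crawl):

```sql
-- Hypothetical sketch: add a date column and backfill the existing 2019 rows,
-- assuming they all came from the July 2019 crawl.
ALTER TABLE `httparchive.almanac.manifests` ADD COLUMN date DATE;

UPDATE `httparchive.almanac.manifests`
SET date = DATE '2019-07-01'
WHERE date IS NULL;
```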
As for generating the […]: not so sure how best to generate the […].
I'm trying this:

```sql
SELECT
  client, page, type, body
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (type = 'script' OR firstHTML = true)
```

And it's telling me it will process ~25 TB. This query produces a much more reasonable 24.5 GB, but I need the bodies for some of the queries:

```sql
SELECT
  client, page, type, count(1)
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (type = 'script' OR firstHTML = false)
GROUP BY
  client, page, type
```
~25 TB is the size of the entire 2020-08-01 partition. BigQuery gives an upper estimate of the query size because it can't know how efficient the clustering will be until runtime. The query should only incur the cost of processing the HTML and JS responses, which are still maybe ~75% of all responses considering that binary files are excluded. So it's efficient in the sense that it doesn't query all response bodies, but if you're still querying most rows it will be expensive, which is why we discourage this approach in favor of custom metrics.
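To illustrate the column-billing point, a counts-only sketch like this references only narrow columns, so the bytes billed stay low even over the full partition:

```sql
-- Sketch: no reference to the wide `body` column, so BigQuery only bills
-- for the narrow columns actually scanned.
SELECT
  client,
  type,
  COUNT(DISTINCT page) AS pages
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01'
  AND (type = 'script' OR firstHTML)
GROUP BY
  client, type
```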
Ah, good to know about binary files not being included. Makes sense, but that was my hope for reducing this. And it does look like JS is the vast majority of requests by volume - and probably even more so by size :-(

```sql
SELECT
  client, type, firstHtml, count(1) AS count
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01'
GROUP BY
  client, type, firstHtml
ORDER BY
  count(1) DESC
```
I found this post from @tomayac and from that I created the SQL below:

```sql
#standardSQL
CREATE TEMPORARY FUNCTION
  pathResolve(path1 STRING,
    path2 STRING)
  RETURNS STRING
  LANGUAGE js AS """
function normalizeStringPosix(e,t){for(var n="",r=-1,i=0,l=void 0,o=!1,h=0;h<=e.length;++h){if(h<e.length)l=e.charCodeAt(h);else{if(l===SLASH)break;l=SLASH}if(l===SLASH){if(r===h-1||1===i);else if(r!==h-1&&2===i){if(n.length<2||!o||n.charCodeAt(n.length-1)!==DOT||n.charCodeAt(n.length-2)!==DOT)if(n.length>2){for(var g=n.length-1,a=g;a>=0&&n.charCodeAt(a)!==SLASH;--a);if(a!==g){n=-1===a?"":n.slice(0,a),r=h,i=0,o=!1;continue}}else if(2===n.length||1===n.length){n="",r=h,i=0,o=!1;continue}t&&(n.length>0?n+="/..":n="..",o=!0)}else{var f=e.slice(r+1,h);n.length>0?n+="/"+f:n=f,o=!1}r=h,i=0}else l===DOT&&-1!==i?++i:i=-1}return n}function resolvePath(){for(var e=[],t=0;t<arguments.length;t++)e[t]=arguments[t];for(var n="",r=!1,i=void 0,l=e.length-1;l>=-1&&!r;l--){var o=void 0;l>=0?o=e[l]:(void 0===i&&(i=getCWD()),o=i),0!==o.length&&(n=o+"/"+n,r=o.charCodeAt(0)===SLASH)}return n=normalizeStringPosix(n,!r),r?"/"+n:n.length>0?n:"."}var SLASH=47,DOT=46,getCWD=function(){return""};if(/^https?:/.test(path2)){return path2;}if(/^\\//.test(path2)){return path1+path2.substr(1);}return resolvePath(path1, path2).replace(/^(https?:\\/)/, '$1/');
""";
SELECT
  DISTINCT
  client,
  REGEXP_REPLACE(page, "^http:", "https:") AS page,
  pathResolve(REGEXP_REPLACE(page, "^http:", "https:"),
    REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)")) AS url,
  body
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)") IS NOT NULL
    AND REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)") != "/")
```

Was that what you did last year, @tomayac?
And btw there's similar code to generate the manifest data. Looks like in that post he got both in one go, but I'd imagine that would only work for service workers defined on the main index page in inline […]. Something like the sketch below might do for the manifest half.
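This is my rough, untested sketch (not the query from the post), and it assumes `rel` appears before `href` in the `<link>` tag:

```sql
-- Hypothetical sketch: pull the manifest URL out of first-HTML response
-- bodies. Only matches <link> tags where rel precedes href.
SELECT DISTINCT
  client,
  page,
  REGEXP_EXTRACT(body, "(?i)<link[^>]+rel=[\"']?manifest[\"']?[^>]*href=[\"']?([^\"' >]+)") AS manifest_url
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  firstHTML AND
  REGEXP_EXTRACT(body, "(?i)<link[^>]+rel=[\"']?manifest[\"']?[^>]*href=[\"']?([^\"' >]+)") IS NOT NULL
```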
Good find! That's from 2018 but I assume it was reused for the 2019 Almanac. I'll try running it and report back here if it works.
Awesome, you figured it out based on the traces I left in the article, all while I was sleeping :-) So, yeah, what @bazzadp found in #1258 (comment) is exactly the way I did it and would do it again in 2020 if I needed to…
Cool, thanks for confirming. BTW the SQL in the post on your origin has lost some of the escapes and so isn't recognised as valid SQL by BigQuery, in case you want to fix it. The SQL in the Medium article seems to work though.
Yeah, Medium is the authoritative source in this case. The export to my origin was lossy unfortunately, tracked as tomayac/blogccasion#3, which I will deal with some time in the future when I have time™.
BTW the data for this comes from here: https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. Looks like only a very small number of updates since last year, so someone just needs to add a date column and convert that table to SQL inserts.
Thanks @bazzadp! I created a new dated table `h2_prioritization_cdns`.

Here's the script snippet I used on the GitHub page (with the table element selected as `$0` in DevTools):

```js
copy(Array.from($0.querySelectorAll('tr')).map(tr => {
  var passFail = tr.querySelector('td:nth-child(2)').textContent.split(' ')[0].toLowerCase().split('');
  passFail[0] = passFail[0].toUpperCase();
  passFail = passFail.join('');
  return `STRUCT<cdn STRING, prioritization_status STRING>(${[tr.querySelector('td:first-child').textContent, passFail].map(JSON.stringify).join(', ')})`;
}).join(`,
`))
```

Then in BigQuery I appended the output of this query to the table:

```sql
SELECT CAST('2020-08-01' AS DATE) AS date, *
FROM UNNEST([
  STRUCT<cdn STRING, prioritization_status STRING>("Akamai", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Amazon CloudFront", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("BitGravity", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Cachefly", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("CDN77", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("CDNetworks", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("CDNsun", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("ChinaCache", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Cloudflare", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("DreamHost", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Edgecast", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Facebook", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Fastly", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Google Cloud CDN", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Google Firebase", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Google Storage", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Highwinds", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Incapsula", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Instart Logic", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("KeyCDN", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("LeaseWeb CDN", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Level 3", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Limelight", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Medianova", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Microsoft Azure", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Netlify", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Reflected Networks", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Rocket CDN", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("section.io", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("Sucuri Firewall", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("StackPath/NetDNA/MaxCDN", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("WordPress.com", "Pass"),
  STRUCT<cdn STRING, prioritization_status STRING>("WordPress.com Jetpack CDN (Photon)", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Yottaa", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Zeit", "Fail"),
  STRUCT<cdn STRING, prioritization_status STRING>("Zenedge", "Fail")
])
```

EDIT: Then check and correct any CDN names as noted below.

Here's the comparison table: […]
I've partitioned the […] table.
Running the following:

```sql
SELECT
  date,
  count(distinct pwa_url) as pwas,
  count(distinct sw_url) as service_workers,
  count(distinct manifest_url) as manifests,
  count(0) as total
FROM
  `httparchive.almanac.pwa_candidates`
GROUP BY
  date
```

Gives this: […]
Which seems roughly realistic: a three-fold increase in use in the last year, which tracks roughly with this graph (though it's possibly closer to a two-fold increase). However, there does seem to be an awful lot of duplication in the data. Should we just include distinct pages? This SQL returns 29,714 rows:

```sql
SELECT date, client, pwa_url, sw_url, manifest_url, count(0)
FROM `httparchive.almanac.pwa_candidates`
GROUP BY date, client, pwa_url, sw_url, manifest_url
HAVING count(0) > 1
```
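If we do dedupe, a minimal sketch of what I mean, assuming those five columns are the whole schema (hypothetical, not something I've run):

```sql
-- Hypothetical dedup pass: rebuild the table keeping one row per
-- (date, client, pwa_url, sw_url, manifest_url). Assumes no other columns
-- exist; a real run would also need to re-declare any partitioning.
CREATE OR REPLACE TABLE `httparchive.almanac.pwa_candidates` AS
SELECT DISTINCT date, client, pwa_url, sw_url, manifest_url
FROM `httparchive.almanac.pwa_candidates`;
```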
However, that aside, I think there are bigger issues. When I run this:

```sql
SELECT
  date,
  count(distinct page) as pages,
  count(distinct url) as urls,
  count(0) as total
FROM
  `httparchive.almanac.manifests`
GROUP BY
  date
```

I get the following: […]
Which is way off. Digging into it more, I think I understand why. https://zeals.co.jp/ for example contains a manifest but not a service worker, and was included in 2019 but not 2020. This is because […]. Similarly, when I run this:

```sql
SELECT
  date,
  count(distinct page) as pages,
  count(distinct url) as urls,
  count(0) as total
FROM
  `httparchive.almanac.service_workers`
GROUP BY
  date
```

I get this: […]
That drop definitely looks suspect, but I think we should look at this again when we run manifests and service_workers independently of each other.
Thanks @bazzadp, that's really helpful. I've regenerated […]. The growth from 2019 to 2020 now looks more natural. Please confirm it looks good from your end.
That looks better to me, thanks! There are still some dupes in there. For example:

```sql
SELECT *
FROM `httparchive.almanac.service_workers`
WHERE date = '2020-08-01' AND page = 'https://www.lix.com/' AND client = 'desktop'
```

Maybe need a DISTINCT on those last two queries? Anyway, nothing major, and we can ignore those.

One more request: there are only 63k service worker pages, so I don't think the 1k or 10k page subsets are going to be very representative of those. Would it be possible to get a reduced […]?
Was just working on this with @bazzadp and we ran the following query to extract all response bodies for pages that have a service worker: […]

This query processes 14.5 TB. The results for it are stored in […].
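For reference, a hypothetical reconstruction of that kind of query (not necessarily the exact one we ran):

```sql
-- Hypothetical sketch: keep response bodies only for pages that
-- registered a service worker.
SELECT b.client, b.page, b.type, b.body
FROM `httparchive.almanac.summary_response_bodies` b
WHERE b.date = '2020-08-01'
  AND b.client = 'mobile'
  AND b.page IN (
    SELECT page
    FROM `httparchive.almanac.service_workers`
    WHERE date = '2020-08-01'
      AND client = 'mobile'
  )
```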
Only mobile?
Yeah. Just for hunting around for now, and I figure mobile and desktop will be similar enough, so I thought I'd halve my costs and runtimes. Can then decide whether it's even useful and whether to create a real service worker bodies table with both clients in the almanac schema, or write queries against the full `summary_response_bodies` table, once I know exactly what I wanna query.
SGTM. There was some duplication in the PWA tables, but I think there are easy workarounds using DISTINCT or GROUP BY in their queries. Any objections to closing this issue?
@rviscomi can we run the following to match the WPT CDN names?

```sql
UPDATE `httparchive.almanac.h2_prioritization_cdns` SET cdn = 'Google' WHERE cdn = 'Google Cloud CDN';
UPDATE `httparchive.almanac.h2_prioritization_cdns` SET cdn = 'Automattic' WHERE cdn = 'WordPress.com';
```

Those seem to be the main two with volume that need setting, based on this query:

```sql
SELECT _cdn_provider, count(0)
FROM `httparchive.almanac.requests` r
LEFT OUTER JOIN `httparchive.almanac.h2_prioritization_cdns` c ON c.date = r.date AND c.cdn = r._cdn_provider
WHERE
  r.date = "2020-08-01" AND
  --firstHTML AND -- Check with both this commented out and not.
  _cdn_provider IS NOT NULL AND
  cdn IS NULL
GROUP BY _cdn_provider
ORDER BY count(0) DESC;
```
Should we update the 2019 "WordPress" CDN to "Automattic" as well?
Yes and no. Yes because it's wrong. No because those aren't the stats we used for the 2019 chapter. WDYT?
Maybe not, so that a JOIN works across years. Ran the updates. Had to change […].
Thanks. Closing again.
Some chapters depend on data from the September crawl. Reopening this to track adding '2020-09-01' dated rows to the […] tables.
The 2020_08_01 HTTP Archive crawl completed yesterday and the tables are available on BigQuery. However, to facilitate Web Almanac analysis, we reorganize the data into the `almanac` dataset to make the results more efficient to query. @paulcalvano and I will be prepping this dataset with the 2020 results. The existing tables already contain 2019 data, and they do not necessarily make that clear. We should retain the 2019 data and alter the table schemas to add a new date field to distinguish the annual editions.
- `parsed_css`: add a `date` column
- `requests`: `summary_requests` metadata with `requests` payloads
- `summary_response_bodies`: `summary_requests` metadata with `response_bodies` blobs

There are also a couple of externally-sourced tables:

- `third_parties`
- `h2_prioritization_cdns`: exists as `h2_prioritization_cdns_201909` and is in use by the 2019 HTTP/2 metric 20_07.sql

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

- `manifests`
- `service_workers`
I'd like to explore whether it's feasible to combine the request/response tables into a single table that contains the summary metadata, request payload, and response bodies. That way there would be no SQL joins to contend with for the analysts. The tables would be enormous, but AFAIK BigQuery only bills for the columns used, so queries that don't require the bodies would be much cheaper. Not sure if performance would be worse.
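A rough sketch of what that combined table could look like; the column names and join keys here are guesses for illustration, not the real schemas:

```sql
-- Hypothetical combined table: partitioned by date and clustered on common
-- filter columns. Queries that skip the wide payload/body columns stay
-- cheap because BigQuery bills per column read.
CREATE TABLE `httparchive.almanac.combined_requests`
PARTITION BY date
CLUSTER BY client, firstHTML, type
AS
SELECT
  s.date,
  s.client,
  s.page,
  s.url,
  s.firstHTML,
  s.type,
  r.payload,  -- full request payload (large)
  b.body      -- response body blob (very large)
FROM `httparchive.almanac.summary_requests` s
LEFT JOIN `httparchive.almanac.requests` r
  USING (date, client, page, url)
LEFT JOIN `httparchive.almanac.response_bodies` b
  USING (date, client, page, url);
```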