This is the description of the final project for Stat 431, Spring 2020, at the University of Pennsylvania.
The ongoing Covid-19 pandemic cut the academic year short and has affected each of our lives. Given its urgency, a great deal of work is being done to mitigate its effects. Some of that work, and some of the decisions being made, are informed by statistics. In this project, you will therefore aim to understand some aspect of how statistics is used to learn about Covid-19. As the instructor, I think it is important for you to appreciate that statistical thinking is crucial, even in a time of crisis.
You will read a paper that addresses some aspect of the pandemic, and write a short summary/report/essay on how statistics is being used. There are many aspects of Covid-19 research that involve statistics, and you are free to choose what interests you most:
- estimating key parameters, such as the reproductive number of the virus, interval between infection and symptoms, epidemic doubling time, serial interval (delay between illness onset in successive cases in chains of transmission), onset-to-first-medical-visit and onset-to-admission time
- a specific interesting problem is estimating R_0 based on the serial interval distribution and the doubling time
- estimating the mortality rate for various groups (this requires first understanding the demographics of the cases, and then possible sampling biases)
- estimating the effectiveness of interventions like masks, travel bans, and social distancing in reducing the spread
- estimating and modelling the growth rate
- estimating the need for various medical resources (beds, masks)
- estimating the effectiveness of vaccines and other treatments (e.g., what is the effect of wearing various types of masks?)
- estimating the economic impact and losses due to the virus (e.g., increase in unemployment rate, decrease in rate of growth of GDP)
- evaluating the impact on the stock market (e.g., assessing the magnitude of the crash)
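One of the topics above, estimating R_0 from the serial interval and the epidemic doubling time, can be illustrated with a short back-of-the-envelope sketch. The numbers below are illustrative assumptions, not estimates from any specific paper; both approximation formulas are standard results for an exponentially growing epidemic (Wallinga & Lipsitch, 2007).

```python
from math import exp, log

# Illustrative inputs (assumptions, not data from any particular study):
doubling_time = 6.2    # days for case counts to double
serial_interval = 7.5  # days between symptom onsets in a transmission chain

# Exponential growth rate r, from cases(t) ~ exp(r * t):
r = log(2) / doubling_time

# Two standard first-order approximations of R_0:
# (a) if every case transmits exactly one serial interval after infection:
R0_point = exp(r * serial_interval)
# (b) if the generation interval is exponentially distributed:
R0_exponential = 1 + r * serial_interval
```

With these inputs the two formulas give roughly 2.3 and 1.8 respectively, which shows how sensitive R_0 estimates are to modeling assumptions about the generation-interval distribution.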
Other aspects: feel free to choose other aspects if you are interested. In that case, you should choose a paper to summarize and ask the instructor to approve it. Moreover, you are free to supplement your work with other additional readings. Make sure to give proper credit for all ideas, facts, etc., by citing the appropriate sources.
- It may be challenging to produce summaries of the highest quality. However, remember that the goal of this project is to exercise your critical-understanding skills. It is about using your knowledge of statistics to wade through primary sources and to sort out fact from fiction about a critically important problem.
- Instead of reading papers, some may wish to do their own data analysis & visualization to help understand Covid-19. There are several statistical analyses and visualizations that give valuable insights: the JHU cases map, the Nextstrain phylogenetic tree, Covid-19 dashboards. However, doing this well takes a lot of time & energy. So, while valuable, it is outside the scope of this class.
For each of these aspects (and many more) there is an extensive body of scientific work being done. You only need to read one paper and, to the best of your ability, summarize its statistical aspects. Here are the questions you should answer. You can structure your paper as a response to these queries:
- What is the goal of the study?
- What type of data do they use? Is it observational or experimental? (refer to the sections where we studied this; if experimental, what kind of randomized study, more specifically) What does the data measure (case medical characteristics, questionnaires)? Are there any possible biases in the data?
- What type of statistical tools do they use?
- 3a. how do they visualize the data? do they use histograms, boxplots, etc? is it compelling?
- 3b. what kind of statistical methods do they use? e.g., data summarization (mean/median, standard deviation), hypothesis testing, confidence intervals, p-values, regression analysis, etc.
- 3c. if possible, go into some detail about one or two examples. e.g., what was the quantity estimated? how did they set up the model? how did they estimate it?
- What do they conclude? how does the data support the conclusions?
- What are your overall impressions? what were the strong points, what were the limitations, what could be improved?
This work should take several days (say 3-4 days) to complete (including reading the main source, additional background reading, planning the summary, and writing).
I anticipate that 3c may be challenging for many papers, because they may use advanced statistical methods that we have not covered in the class. I have a few comments and suggestions: first, the goal is not to understand every subtlety of the technique in 100% detail, but rather to appreciate how the methods are used. The goal is to understand the methods as well as you can, based on your training in the class and on your independent reading in preparation for the project. You are expected to look up relevant methods that appear in the work, and spend some time understanding them. This will also be a valuable experience: no matter how many statistics classes you take, in your work in "the real world" you will always encounter new statistical methods. You need to be prepared to learn about them on the spot. Moreover, as informed citizens of the future, it will serve you well to be able to read beyond what is presented in a processed way in the popular press.
I also anticipate that some aspects may require you to do a bit of independent reading. For instance, if you discuss estimating the basic reproduction number R_0, then you first need to understand what it is and what models are used to estimate it.
For some of the papers, you will need to look at the supplementary material to understand what is being done. The main body of a paper often only briefly describes the methods, and the details (which can be crucial) are often reported in the supplementary materials.
Your report should be about 5-6 pages double-spaced, standard font size and standard formatting (standard margin). You can work in groups of at most two (i.e., one or two people per project). You should turn in your project on Canvas (there will be an Assignment created for this), by 5pm EST on May 5th 2020 (this was the date of the original final exam).
The University of Pennsylvania's Code of Academic Integrity applies: https://catalog.upenn.edu/pennbook/code-of-academic-integrity/. Please add the phrase "This paper represents my own work in accordance with the Code of Academic Integrity" to your paper and sign below.
Here are a few relevant references:
The materials below can serve as a helpful orientation into the extensive literature. However, they themselves are not appropriate to be a main source that you summarize, because they do not contain original analyses.
- A.S. Fauci et al., Covid-19 — Navigating the Uncharted, N Engl J Med 2020; 382:1268-126. 3/26/2020. An overview of some basic epidemiological facts and work. Has pointers to work that is suitable to summarize in the project.
- Congressional Research Service (CRS), 3/26/2020, Global Economic Effects of COVID-19. A well-researched summary of various thoughts on global economic effects. See also this presentation by KPMG, which has a lot of charts.
- L. Wynants et al., Systematic review and critical appraisal of prediction models for diagnosis and prognosis of COVID-19 infection. "COVID-19 related prediction models are quickly entering the academic literature, to support medical decision making at a time where this is urgently needed. Our review indicates proposed models are poorly reported and at high risk of bias."
The articles below can be summarized as part of the project. They are arranged into groups based on their length, complexity, and difficulty. You can choose papers from any group, but if you choose a "basic" paper, then you are expected to cover it in more detail, and the correctness standard will also be higher. For the more complex papers ("medium" and "high" levels), you are not expected to summarize them in full technical detail. We will reward effort and take the difficulty into account.
- C. Huang et al., Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China, Lancet 2020; 395: 497–506. An epidemiological study structured in a direct and easy-to-read way. This was one of the first international reports on Covid-19 patients, and it provided a great deal of valuable insight to the medical community. It has been extremely influential (>1000 citations in a short period of time). It uses several statistical methods, including data visualization (boxplots, etc.) and hypothesis testing (the chi-squared test, etc.). A good choice for a summary; this is perhaps the simplest one to summarize.
- Q. Li et al., Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia, N Engl J Med 2020; 382:1199-1207. A similar, slightly larger and later study.
- N.H.L. Leung et al., Respiratory virus shedding in exhaled breath and efficacy of face masks, Nature Medicine, 4/3/2020. Uses a direct two-sample testing approach to compare the mask-wearing and non-wearing groups. See Table 1b: "P values for comparing the frequency of respiratory virus detection between the mask intervention were obtained by two-sided Fisher’s exact test". You can focus on this particular statistical aspect (i.e., a two-sample test with Fisher's exact test). You can read more about Fisher's exact test on Wikipedia.
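To see what Fisher's exact test actually computes, here is a minimal, self-contained sketch that enumerates the hypergeometric distribution underlying the test. The 2x2 counts in the example call are made up for illustration; they are not the paper's data.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; the two-sided p-value sums the probabilities of all
    tables no more likely than the observed one.
    """
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):  # probability of a table with x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = mask / no mask, columns = virus detected / not.
p_value = fisher_exact_two_sided(3, 17, 10, 10)
```

As a sanity check, the standard Wikipedia example table [[1, 11], [9, 3]] gives a two-sided p-value of about 0.0028 with this function, matching the published value.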
The difficulty level of the following papers varies a lot. Some use methods that we have not discussed in class (e.g., "SIR models"). However, their conclusions are important, and so I include them for those who are most interested in delving into the depths.
- Predictive model used to forecast the death toll: Institute for Health Metrics and Evaluation, UW; Paper, Code, Methods pdf. This model was referred to prominently in the White House address that extended the stay-at-home and social distancing guidelines to April (see below). It is based on a mixed-effects regression model, fit using maximum likelihood. This is a bit beyond what we covered, but with a few hours of reading you should be able to understand it.
- J. Grein et al., Compassionate Use of Remdesivir for Patients with Severe Covid-19, NEJM, April 10, 2020. Reports promising results using remdesivir.
- C. Wang et al., Evolving Epidemiology and Impact of Non-pharmaceutical Interventions on the Outbreak of Coronavirus Disease 2019 in Wuhan, China. Video by Prof. Xihong Lin from HSPH. Another, bigger epidemiological study. Fits an SEIR model to estimate R_0; see here for background: simple SIR model, more sophisticated SEIRS model. A more readable, but perhaps less scientifically rigorous, exposition: link. A quote from the last one, a particularly insightful and readable paragraph:
How can you estimate the true ones? It turns out, there’s a couple of ways. [...] First, through deaths. If you have deaths in your region, you can use that to guess the number of true current cases. We know approximately how long it takes for that person to go from catching the virus to dying on average (17.3 days). That means the person who died on 2/29 in Washington State probably got infected around 2/12. Then, you know the mortality rate. For this scenario, I’m using 1% (we’ll discuss later the details). That means that, around 2/12, there were already around ~100 cases in the area (of which only one ended up in death 17.3 days later). Now, use the average doubling time for the coronavirus (time it takes to double cases, on average). It’s 6.2. That means that, in the 17 days it took this person to die, the cases had to multiply by ~8 (=2^(17/6)). That means that, if you are not diagnosing all cases, one death today means 800 true cases today.
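The quoted back-calculation fits in a few lines of arithmetic. The inputs below are the quote's illustrative assumptions (1% mortality, 17.3 days from infection to death, 6.2-day doubling time), not real data; note that the exact arithmetic gives a growth factor of about 7 (the quote rounds 2^(17/6) to ~8), i.e., roughly 700 true cases per death.

```python
# Assumptions taken from the quoted paragraph (illustrative, not data):
mortality_rate = 0.01            # assumed infection fatality rate (1%)
days_infection_to_death = 17.3   # assumed mean time from infection to death
doubling_time = 6.2              # assumed case doubling time, in days

deaths_today = 1
# Step 1: infer how many infections existed when today's deceased were infected.
cases_at_infection = deaths_today / mortality_rate  # 100 cases
# Step 2: grow those cases forward over the time it took the death to occur.
growth_factor = 2 ** (days_infection_to_death / doubling_time)
true_cases_today = cases_at_infection * growth_factor
```

The point of the exercise is that each input (mortality rate, time to death, doubling time) is itself an estimate with uncertainty, so the final figure is only an order-of-magnitude guess.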
- Q. Zhao et al., A novel analysis of the epidemic outbreak of coronavirus disease 2019, preprint. This work, done in part by Wharton statistics faculty and alumni, argues that the epidemic doubling time (before the lockdown in Wuhan) was about 2.8 days, much lower than the roughly 7 days estimated in other works (e.g., the Li et al. paper cited above). They achieve this by a careful analysis of cases that left Wuhan before the lockdown. Data, analysis
- N. van Doremalen et al., Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1, N Engl J Med 2020. An influential article evaluating the stability of SARS-CoV-2 and SARS-CoV-1 in aerosols and on various surfaces, and estimating their decay rates using a Bayesian regression model. A good example of using sophisticated statistical methods to tackle an important problem. See the supplement, pages 3-5, for their methods.
- C.J. Wang et al., Response to COVID-19 in Taiwan: Big Data Analytics, New Technology, and Proactive Testing. An apparently effective set of measures in Taiwan.
- L. Yan et al., Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan. A statistical/machine learning model that identified three clinical features predictive of disease severity.
- R. Li et al., Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2), Science, 16 Mar 2020. Argues that a lot of infection is undocumented: "We estimate 86% of all infections were undocumented (95% CI: [82%-90%]) prior to 23 January 2020 travel restrictions." This is a very important study. However, the model is a bit beyond the level of this class, and so it may take you some additional reading to understand it.
- S. Flaxman et al., Estimating the number of infections and the impact of nonpharmaceutical interventions on COVID-19 in 11 European countries. Using a hierarchical Bayesian statistical model, estimates that a large number of deaths have been averted in Europe because of infection-control measures such as national lockdowns.
- Economic Policy Institute, 3.5 million workers likely lost their employer-provided health insurance in the past two weeks. This is not an academic paper, but a post by a think tank. However, it is structured similarly to a paper. It studies the important problem of health-insurance loss, and reports the full steps of the data analysis. It uses statistical methods such as basic summary statistics (sample averages, percentages) and some visualization techniques. However, it does not address statistical-inference questions such as assessing uncertainty (which it could improve).
- H. Yilmazkuday, Coronavirus Effects on the U.S. Unemployment: Evidence from Google Trends, 3/23/2020. Uses a structural vector autoregression (SVAR) model, with Bayesian inference, to estimate the relation between Google search queries for "coronavirus", "unemployment", etc.
- Parolin & Wimer, Columbia University Center on Poverty and Social Policy, Forecasting Estimates of Poverty during the COVID-19 Crisis. "Poverty in the United States could reach highest levels in over 50 years"
- W. McKibbin, R. Fernando, The Global Macroeconomic Impacts of COVID-19: Seven Scenarios, Brookings Institution, 3/2/2020. A detailed description of their dynamic stochastic general equilibrium (DSGE) model can be found in the paper by W. McKibbin & A. Triggs.
- The model is referred to in the UN report Shared responsibility, global solidarity: Responding to the socio-economic impact of COVID-19. Fig. 3 there is a good example of a statistical method for illustrating the impact on unemployment. "According to the UN International Labour Organization (ILO), five to 25 million jobs will be eradicated, and the world will lose $860 billion to $3.4 trillion in labor income." ILO article
- OECD Economic Outlook, Mar 2020. Uses the NIGEM model developed by the National Institute of Economic and Social Research, see here. See also this discussion of the use of models at the OECD.
- Gormsen and Koijen, Coronavirus: Impact on Stock Prices and Growth Expectations. Uses dividend futures, contracts that pay the dividends of the aggregate stock market in a given year, to forecast GDP growth.
- WHO paper database
- An analysis of the papers with the highest impact and attention at Altmetric, and the updated database at dimensions.ai
- COVID-19 Open Research Dataset from the Allen Institute for AI & partners: tens of thousands of articles in machine-readable format, and tools to parse the articles
- reviews of papers by an expert team at Mt Sinai Immunology: Twitter
- summary of work being done at UCL, UK: link
- up-to-date data: Worldometers, 1point3acres
- Also data from NY State Dept of Health
- predictive models:
- Institute for Health Metrics and Evaluation, UW, Paper, Code, Methods pdf. Used in the White House address (See below)
- Imperial College, Mar 17 2020
- Note on low quality models on the Statistical Modeling blog
- Other discussion on modelling in Science News
- Dashboards at UCLA ML
- Coverage in NY Times, Mar 31 2020.
Quote: "Models predicting expected spread of the virus in the U.S. paint a grim picture. The coronavirus studies that appear to have convinced President Trump to prolong disruptive social distancing in the United States paint a grim picture of a pandemic that is likely to ravage the country over the next several months, killing close to 100,000 Americans and infecting millions more."
- Statistical models take a prominent role in a White House address:
- Statistical models also form the core of the discussion in a NY Governor's address, 4.5.2020. Quote: "by the data, we could be either very near the apex, or the apex could be a plateau and we could be on it right now, we won't know until a few days [...] that's what the statisticians will tell you today".
- NY Times discussion of models
- Chicago Booth Initiative on Global Markets (IGM)
- Collection of articles in Science, many other sources temporarily collected and accessible through Google Scholar
- International Capital Market Association (ICMA) COVID-19 Market Updates, Market data and commentary. Documents market data, focus on Europe.
- Institute for Supply Management, March 2020 Manufacturing report. Documents decrease in manufacturing and production.
- Are crime rates declining? Chicago Tribune article, Philly police crime maps, weekly reports. Is there enough data to conclude that crime rates are declining? What kind of statistical methods might one use to argue about this?
- Simulation and visualization of exponential spread in the Washington Post
- NYTimes Opinion piece by Penn faculty EJ Emanuel, S Ellenberg, M Levy, The Coronavirus Is Here to Stay, So What Happens Next?. "A likely scenario is that there will be subsequent waves of the disease."
- US Census report on sales, showing the decline. Contains a discussion of methodology, including stratified random sampling and confidence intervals. See also the raw data in the Excel datafile on which this is based. Note that the monthly variation can be much larger, e.g., clothing sales dropped by 50% from Mar 2019 to Mar 2020.
- NY Times article on distorted data. Connects to the notions of sampling bias we had discussed.