dobriban/stat-431-project


Stat 431 project

This is the description of the final project for Stat 431, Spring 2020, at the University of Pennsylvania.

The ongoing Covid-19 pandemic cut the academic year short and affected each of our lives. Given its urgency, a lot of work is being done to mitigate its effects. Some of that work, and some of the decisions being made, are informed by statistics. In this project you will therefore aim to understand some aspect of how statistics is used to learn about Covid-19. As the instructor, I think it is important for you to appreciate that statistical thinking is crucial, even in this time of crisis.

You will read a paper that addresses any aspect of the pandemic, and write a short summary/report/essay on how statistics is being used. There are many aspects of Covid-19 that use statistics, and you are free to choose what interests you most:

Epidemiological aspects:

  • estimating key parameters, such as the reproductive number of the virus, interval between infection and symptoms, epidemic doubling time, serial interval (delay between illness onset in successive cases in chains of transmission), onset-to-first-medical-visit and onset-to-admission time
    • a specific interesting problem is estimating R_0 based on the serial interval distribution and the doubling time
  • estimating the mortality rate for various groups (e.g., this requires first understanding the demographics of the cases, and then understanding possible sampling biases)
  • estimating the effectiveness of interventions like masks, travel bans, and social distancing in reducing the spread
  • estimating and modelling the growth rate
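
As a rough illustration of the R_0 problem mentioned in the list above, the sketch below uses a standard relationship between the exponential growth rate r and the reproductive number, R = 1/M(-r), where M is the moment generating function of the generation interval distribution. The serial interval is used here as a stand-in for the generation interval, which is an assumption, and the serial interval mean and standard deviation are placeholder values, not estimates taken from any paper. The function names are invented for this example.

```python
import math


def growth_rate(doubling_time_days):
    """Exponential growth rate r (per day) implied by a doubling time."""
    return math.log(2) / doubling_time_days


def r0_fixed_interval(r, mean_si):
    """R if every serial interval equals mean_si exactly: R = exp(r * mean)."""
    return math.exp(r * mean_si)


def r0_exponential_interval(r, mean_si):
    """R if the serial interval is exponentially distributed: R = 1 + r * mean."""
    return 1 + r * mean_si


def r0_gamma_interval(r, mean_si, sd_si):
    """R if the serial interval is gamma distributed.

    With shape k = (mean/sd)^2 and scale theta = sd^2/mean, the moment
    generating function gives R = (1 + r * theta)^k.
    """
    shape = (mean_si / sd_si) ** 2
    scale = sd_si ** 2 / mean_si
    return (1 + r * scale) ** shape


if __name__ == "__main__":
    # Placeholder inputs for illustration only (assumptions, not data).
    doubling_time = 6.2   # days
    mean_si = 7.5         # days, hypothetical serial interval mean
    sd_si = 3.4           # days, hypothetical serial interval sd

    r = growth_rate(doubling_time)
    print("growth rate per day:", round(r, 4))
    print("R0, fixed interval:      ", round(r0_fixed_interval(r, mean_si), 2))
    print("R0, gamma interval:      ", round(r0_gamma_interval(r, mean_si, sd_si), 2))
    print("R0, exponential interval:", round(r0_exponential_interval(r, mean_si), 2))
```

The three functions differ only in the assumed shape of the serial interval distribution. For the same mean, a more variable interval gives a smaller R_0, which is one reason papers that share a growth rate estimate can still report different reproductive numbers.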

Medical aspects:

  • estimating the need for various medical resources (beds, masks)
  • estimating the effectiveness of vaccines and other treatments (e.g., what is the effect of wearing various types of masks?)

Economic aspects:

  • estimating the economic impact and losses due to the virus (e.g., increase in unemployment rate, decrease in rate of growth of GDP)
  • evaluating the impact on the stock market (e.g., assessing the magnitude of the crash)

Other aspects: feel free to choose other aspects if you are interested. In that case, you should choose a paper to summarize and ask the instructor to approve it. Moreover, you are free to supplement your work with additional readings. Make sure to give proper credit for all ideas, facts, etc., by citing the appropriate sources.

Special notes

  • It may be challenging to do summaries at the highest quality. However, remember that the goal of this project is an exercise of your critical understanding skills. It is about practicing your knowledge of statistics to wade through primary sources, to sort out fact from fiction about a critically important problem.
  • Instead of reading papers, some may wish to do their own data analysis & visualization to help understand Covid-19. There are several statistical analyses and visualizations that give valuable insights: JHU cases map, Nextstrain phylogenetic tree, covid-19 dashboards. However, doing this in a quality fashion takes a lot of time & energy. So, while valuable, it is outside of the scope of this class.

Instructions:

For each of these aspects (and many more) there is an extensive body of scientific work being done. You only need to read one paper and, to the best of your ability, summarize its statistical aspects. Here are the questions you should answer. You can structure your paper as a response to these queries:

  1. What is the goal of the study?
  2. What type of data do they use? Is it observational or experimental? (Refer to the sections where we studied this; if experimental, say more specifically what kind of randomized study it is.) What does the data measure (case medical characteristics, questionnaires)? Are there any possible biases in the data?
  3. What type of statistical tools do they use?
    • 3a. How do they visualize the data? Do they use histograms, boxplots, etc.? Is it compelling?
    • 3b. What kind of statistical methods do they use? E.g., data summarization (mean/median, standard deviation), hypothesis testing, confidence intervals, p-values, regression analysis.
    • 3c. If possible, go into some detail about one or two examples. E.g., what was the quantity estimated? How did they set up the model? How did they estimate it?
  4. What do they conclude? How does the data support the conclusions?
  5. What are your overall impressions? What were the strong points, what were the limitations, and what could be improved?

This work should take several days (say 3-4 days) to complete (including reading the main source, additional background reading, planning the summary, and writing).

I anticipate that 3c may be challenging for many papers, because they may use advanced statistical methods that we have not covered in the class. I have a few comments and suggestions. First, the goal is not to understand every subtlety of the technique, but rather to appreciate how the methods are used. The goal is to understand the methods as well as you can, based on the training in the class and on your independent reading in preparation for the project. You are expected to look up relevant methods that appear in the work, and spend some time understanding them. This will also be a valuable experience: no matter how many statistics classes you take, in your work in "the real world" you will always encounter new statistical methods. You need to be prepared to learn about them on the spot. Even more, as informed citizens of the future, it will serve you well to read beyond what is presented in a processed way in the popular press.

I also anticipate that some aspects may require you to do a bit of independent reading. For instance, if you talk about estimating the basic reproduction number R_0, then you first need to understand what that is, and what models are used to estimate it.

For some of the papers, you will need to look at the supplementary material to understand what is being done. The main body of papers often just briefly describes the methods, and the details (which can be crucial) are often reported in supplementary materials.

Your report should be about 5-6 pages double-spaced, standard font size and standard formatting (standard margin). You can work in groups of at most two (i.e., one or two people per project). You should turn in your project on Canvas (there will be an Assignment created for this), by 5pm EST on May 5th 2020 (this was the date of the original final exam).

The University of Pennsylvania's Code of Academic Integrity applies: https://catalog.upenn.edu/pennbook/code-of-academic-integrity/. Please add the phrase "This paper represents my own work in accordance with the Code of Academic Integrity" to your paper and sign below.

References:

Here are a few relevant references:

Warmup and orientation

The materials below can serve as a helpful orientation into the extensive literature. However, they themselves are not appropriate to be a main source that you summarize, because they do not contain original analyses.

Scientific articles

The articles below can be summarized as part of the project. They are arranged into groups based on their length, complexity, and difficulty. You can choose papers from any group, but if you choose a "basic" paper, then you are expected to cover it in more detail, and the correctness standard will also be higher. For the more complex papers ("medium" and "high" levels), you are not expected to summarize them in full technical detail. We will reward effort and take into account the difficulty.

Epidemiology and medicine

Basic level

Medium to high level

The difficulty level of the following papers varies a lot. Some use methods that we have not discussed in the class (e.g., "SIR models"). However, their conclusions are important, and so I include them for those who are most interested in delving into the depths.

How can you estimate the true ones? It turns out, there’s a couple of ways. [...] First, through deaths. If you have deaths in your region, you can use that to guess the number of true current cases. We know approximately how long it takes for that person to go from catching the virus to dying on average (17.3 days). That means the person who died on 2/29 in Washington State probably got infected around 2/12. Then, you know the mortality rate. For this scenario, I’m using 1% (we’ll discuss later the details). That means that, around 2/12, there were already around ~100 cases in the area (of which only one ended up in death 17.3 days later). Now, use the average doubling time for the coronavirus (time it takes to double cases, on average). It’s 6.2. That means that, in the 17 days it took this person to die, the cases had to multiply by ~8 (=2^(17/6)). That means that, if you are not diagnosing all cases, one death today means 800 true cases today.
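
The back-of-the-envelope calculation in the passage above can be written out as a short sketch. The function name and its parameters are invented for this illustration, and the inputs are the figures stated in the passage (1% mortality rate, 17.3 days from infection to death, 6.2-day doubling time), which are themselves rough assumptions rather than established values.

```python
def estimate_true_cases(deaths, fatality_rate, days_infection_to_death,
                        doubling_time_days):
    """Rough estimate of current true cases implied by deaths observed today.

    Step 1: deaths / fatality_rate gives the cases that existed when the
            people who died today were infected.
    Step 2: those cases kept doubling during the infection-to-death delay,
            so multiply by 2 ** (delay / doubling time).
    """
    cases_at_infection = deaths / fatality_rate
    growth_factor = 2 ** (days_infection_to_death / doubling_time_days)
    return cases_at_infection * growth_factor


if __name__ == "__main__":
    # Inputs as stated in the passage; all of them are approximate assumptions.
    estimate = estimate_true_cases(
        deaths=1,
        fatality_rate=0.01,
        days_infection_to_death=17.3,
        doubling_time_days=6.2,
    )
    print("estimated true cases today:", round(estimate))
```

With the unrounded inputs the growth factor comes out closer to 7 than to 8, so the estimate is nearer 700 than 800. The result is quite sensitive to the assumed mortality rate and doubling time, which is worth keeping in mind when assessing this kind of argument.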

Economics, finance, social, etc

Basic level

  • Economic Policy Institute, 3.5 million workers likely lost their employer-provided health insurance in the past two weeks. This is not an academic paper, but a post by a think tank. However, it is structured similarly to a paper. It studies the important problem of health insurance loss. It reports the full steps of the data analysis. It uses statistical methods like basic summary statistics (sample averages, percentages), and some visualization techniques. However, it does not address statistical inference questions like assessing uncertainty (which is something they could improve).

Medium level

High level

Other resources
