01_covid19_preprints_published.Rmd

---
title: "COVID-19 Preprints - published"
output: github_document
---

# Background

This file contains code used to harvest information on COVID-19 related preprints (as colected in [Covid19 Preprints](https://github.com/nicholasmfraser/covid19_preprints) which are linked to journal publications.

Currently, data on preprint-publication links is harvested from one source only:

1. Crossref (using the [rcrossref](https://github.com/ropensci/rcrossref) package)

A description of the methods for harvesting data is provided in each relevant section below.

# Load required packages 

```{r message = FALSE, warning = FALSE}

library(lubridate)
library(rcrossref)
library(tidyverse)
library(jsonlite)
library(colorspace)

```


# Set sample date
# Retrieve the latest sample date for preprints

```{r message = FALSE, warning = FALSE}
sample_date <- Sys.Date()

sample_date_preprints <- fromJSON(
  "https://raw.githubusercontent.com/nicholasmfraser/covid19_preprints/master/data/metadata.json"
  ) 

sample_date_preprints <- sample_date_preprints %>%
  .$posted_date %>%
  as.Date()


```


# Import dataset on preprints

This script uses the dataset on preprints collated here: [Covid19 Preprints](https://github.com/nicholasmfraser/covid19_preprints)

```{r message = FALSE, warning = FALSE}

covid_preprints <- read_csv("https://raw.githubusercontent.com/nicholasmfraser/covid19_preprints/master/data/covid19_preprints.csv")

# Filter to preprints with DOI
# Select columns of interest
covid_preprints <- covid_preprints %>%
  filter(identifier_type == "DOI") %>%
  select(source,identifier, posted_date)

# Quering Crossref with all DOIs gets NULL responses for DataCite DOIs -> these can be filtered out

```

# Crossref

Harvesting of Crossref metadata is carried out using the [rcrossref](https://github.com/ropensci/rcrossref) package for R. The `cr_works_` function is used to retrieve all metadata related to Metadata were retrieved for all dois in the dataset of COVID19 preprints. Note that here, the 'low level' `cr_types_` function is used to return all metadata in list format, as this also includes the field 'relation' that is not returned by the 'high level' `cr_works` function.

To identify to the Crossref API and get access to a dedicated API cluster for improved performance, an email address is stored as environment variable in .Renviron 

```{r message = FALSE, warning = FALSE, cache = TRUE}

# Set email as variable "crossref_email" in .Renviron
#file.edit("~/.Renviron")
# Restart R session after saving .Renviron 

# Query dois
dois <- covid_preprints %>%
  pull(identifier)

cr_dois <- cr_works_(dois,
                     parse = TRUE,
                     .progress = "time")

```

Relevant preprint metadata fields are parsed from the list format returned in the previous step, to a more manageable data frame. 

```{r message = FALSE, warning = FALSE, cache = TRUE}

# Function to parse Crossref preprint data to data frame
parseCrossrefDOIs <- function(item) {
  tibble(
    DOI = item$DOI,
    is_preprint_of = if(length(item$relation$`is-preprint-of`)) "is_preprint_of" else NA_character_,
    preprint_of_doi = if(length(item$relation$`is-preprint-of`)) item$relation$`is-preprint-of`[[1]]$id  else NA_character_)
}

# Select element 'message', remove NULL elements
# This removes NULL results from DataCite DOIs
cr_dois_message <- map(cr_dois, "message") %>%
  compact()

# Iterate over posted-content list and build data frame
cr_dois_df <- map_dfr(cr_dois_message, parseCrossrefDOIs)

rm(cr_dois, cr_dois_message)

```


The dataset containing all Covid19 related preprints and the dataset containing information on published articles (from Crossref) are merged and to get a final dataset containing all Crossref preprints with information on linked published articles

```{r message = FALSE, warning = FALSE}

covid_preprints_published <- covid_preprints %>%
  right_join(cr_dois_df, by = c("identifier" = "DOI"))

covid_preprints_published %>%
  write_csv("data/covid19_preprints_published.csv")

covid_preprints_published <- 
  read_csv("data/covid19_preprints_published.csv")

  
```

# Create metadata file (json file with sample date and release date)

```{r message = FALSE, warning = FALSE}

# Set system date as release date
release_date <- Sys.Date()

# Create metadata as list
metadata <- list()
metadata$release_date <- release_date
metadata$sample_date_crossref <- sample_date
metadata$sample_date_preprints <- sample_date_preprints
metadata$url <- "https://github.com/bmkramer/covid19_preprints_published/blob/master/data/covid19_preprints_published.csv?raw=true"

# Save as json file
metadata_json <- toJSON(metadata, pretty = TRUE, auto_unbox = TRUE)
write(metadata_json, "data/metadata.json")

```

# Visualizations

```{r message = FALSE, warning = FALSE}
# Default theme options
theme_set(theme_minimal() +
          theme(text = element_text(size = 12),
          axis.text.x = element_text(angle = 90, vjust = 0.5),
          axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
          axis.title.y = element_text(margin = margin(0, 20, 0, 0)),
          legend.key.size = unit(0.5, "cm"),
          legend.text = element_text(size = 8),
          plot.caption = element_text(size = 10, hjust = 0, color = "grey25", 
                                      margin = margin(20, 0, 0, 0))))
# Create a nice color palette
pal_1 <- colorspace::lighten(pals::tol(n = 10), amount = 0.2)
pal_2 <- colorspace::lighten(pals::tol(n = 10), amount = 0.4)
palette <- c(pal_1, pal_2)
```

```{r message = FALSE, warning = FALSE, include = FALSE}
# For graphs per preprint server

# Group all OSF preprints together
OSF_names <- covid_preprints_published %>%
  count(source) %>%
  filter(str_detect(source, "OSF")) %>%
  arrange(desc(n)) %>%
  pull(source)

covid_preprints_published_viz <- covid_preprints_published %>%
  mutate(source = case_when(
    source %in% OSF_names ~ "OSF",
    TRUE ~ source))

#create vector of names of servers w/ linked preprints
#in descending order of number of preprints
servers_selected <- covid_preprints_published_viz %>%
  mutate(source = case_when(
    source %in% OSF_names ~ "OSF",
    TRUE ~ source)) %>%
  group_by(source) %>%
  summarise_all(~ sum(!is.na(.))) %>%
  ungroup() %>%
  filter(is_preprint_of > 0) %>%
  arrange(desc(identifier)) %>%
  slice(1:7) %>%
  pull(source)


```


```{r message = FALSE, warning = FALSE, include = FALSE}
# Weekly preprint counts

p1 <- covid_preprints_published %>%
  mutate(
    status = case_when(
      is.na(is_preprint_of) ~ "not linked to published journal article",
      !is.na(is_preprint_of) ~ "linked to published journal article"
    ),
    status = factor(status),
    posted_week = ymd(cut(posted_date,
                          breaks = "week",
                          start.on.monday = TRUE))) %>%
  count(status, posted_week) %>%
  ggplot(aes(x = posted_week, y = n, 
             fill = forcats::fct_rev(status))) +
  geom_col() +
  labs(x = "Posted Date (year-month)", y = "Preprints", fill = "status (from Crossref)",
       title = "COVID-19 preprints per week in Crossref", 
       subtitle = paste0("(preprints up until ", sample_date_preprints, ", sample date ", sample_date,")")
       ) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%Y-%m",
               expand = c(0, 0),
               limits = c(ymd("2020-01-13"), ymd(sample_date_preprints))) +
  scale_fill_manual(values = palette) +
  ggsave("outputs/figures/preprints_published/covid19_preprints_published_week.png", width = 12, height = 6)
```

```{r message = FALSE, warning = FALSE, include = FALSE}
# Weekly preprint counts - percentage

p2 <- covid_preprints_published %>%
  mutate(posted_week = ymd(cut(posted_date,
                          breaks = "week",
                          start.on.monday = TRUE))) %>%
 count(posted_week, is_preprint_of) %>%
  group_by(posted_week) %>%
  mutate(prop = n*100 / sum(n)) %>%
  ungroup() %>%
  filter(!is.na(is_preprint_of)) %>%
  ggplot(aes(x = posted_week, y = prop)) +
  geom_col(fill = "#95D8FA") +
  labs(x = "Posted Date (year-month)", 
       y = "% linked to published journal article", 
       fill = "status (from Crossref)",
       title = "Percentage of COVID-19 preprints linked to published journal articles (in Crossref)", 
       subtitle = paste0("(preprints up until ", sample_date_preprints, ", sample date ", sample_date,")")
       ) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%Y-%m",
               expand = c(0, 0),
               limits = c(ymd("2020-01-13"), ymd(sample_date_preprints))) + 
  scale_y_continuous(limits=c(0, 100)) +
  scale_fill_manual(values = palette) +
  ggsave("outputs/figures/preprints_published/percentages/covid19_preprints_published_week_perc.png", width = 12, height = 6)
```


```{r message=FALSE, warning=FALSE, include = FALSE}

# Weekly preprint counts - per preprint server

# Select source to display in graph
# To do: wrap into function and map over servers_selected
var <- servers_selected[1]

# Create graph
p3 <- covid_preprints_published_viz %>%
  filter(source == var) %>%
  mutate(
    status = case_when(
      is.na(is_preprint_of) ~ "not linked to published journal article",
      !is.na(is_preprint_of) ~ "linked to published journal article"
    ),
    status = factor(status),
    posted_week = ymd(cut(posted_date,
                          breaks = "week",
                          start.on.monday = TRUE))) %>%
  count(status, posted_week) %>%
  ggplot(aes(x = posted_week, y = n, 
             fill = forcats::fct_rev(status))) +
  geom_col() +
  labs(x = "Posted Date (year-month)", y = "Preprints", fill = "status (from Crossref)",
       title = paste0("COVID-19 preprints per week on ",var), 
       subtitle = paste0("(preprints up until ", sample_date_preprints, ", sample date ", sample_date,")")
       ) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%Y-%m",
               expand = c(0, 0),
               limits = c(ymd("2020-01-13"), ymd(sample_date_preprints))) +
  scale_fill_manual(values = palette) +
  ggsave(paste0("outputs/figures/preprints_published/covid19_preprints_published_",
                var,
                "_week.png"), 
         width = 12, height = 6)
```

```{r message = FALSE, warning = FALSE, include = FALSE}

# Weekly preprint counts per preprint server - percentage

# Select source to display in graph
# To do: wrap into function and map over servers_selected
var <- servers_selected[7]

# Create graph
p4 <- covid_preprints_published_viz %>%
  filter(source == var) %>%
  mutate(posted_week = ymd(cut(posted_date,
                          breaks = "week",
                          start.on.monday = TRUE))) %>%
 count(posted_week, is_preprint_of) %>%
  group_by(posted_week) %>%
  mutate(prop = n*100 / sum(n)) %>%
  ungroup() %>%
  filter(!is.na(is_preprint_of)) %>%
  ggplot(aes(x = posted_week, y = prop)) +
  geom_col(fill = "#95D8FA") +
  labs(x = "Posted Date (year-month)", 
       y = "% linked to published journal article", 
       fill = "status (from Crossref)",
       title = "Percentage of COVID-19 preprints linked to published journal articles (in Crossref)", 
       subtitle = paste0("(preprints up until ", sample_date_preprints, ", sample date ", sample_date,")")
       ) +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%Y-%m",
               expand = c(0, 0),
               limits = c(ymd("2020-01-13"), ymd(sample_date_preprints))) +
  scale_y_continuous(limits=c(0, 100)) +
  scale_fill_manual(values = palette) +
  ggsave(paste0("outputs/figures/preprints_published/percentages/covid19_preprints_published_",
                var,
                "_week_perc.png"), 
                width = 12, height = 6)

```

```{r message=FALSE, warning=FALSE, include = FALSE}
#Percentage of preprints published

p5 <- covid_preprints_published_viz %>% 
  filter(source %in% servers_selected) %>%
  count(source, is_preprint_of) %>%
  group_by(source) %>%
  mutate(prop = n*100 / sum(n),
         total = sum(n)) %>%
  ungroup() %>%
  mutate(
    #source_label = str_c(source, "\n(", total, ")"),
    source_label = str_c(source),
    data_label = str_c(round(prop,1), "%")) %>%
  filter(!is.na(is_preprint_of)) %>%
  arrange(desc(total)) %>%
  mutate(source_label = factor(source_label),
         source_label = fct_inorder(source_label)) %>%
  ggplot(aes(x = source_label, y = prop, width=.75)) +
  geom_col(color = "grey50", fill = "#95D8FA", size = 0.25, position = "dodge") +
  geom_text(aes(label = data_label), size = 5.5, vjust = -1) +
  ylim(0, 55) +
  labs(x = "", y = "% linked to published journal article", fill ="") +
  guides(fill = FALSE) + 
  theme(axis.text.x = element_text(size = 16, angle = 0, vjust = 1), 
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        #axis.title.y = element_text(size = 12)
        axis.title.y = element_blank())
  ggsave("outputs/figures/preprints_published/covid19_preprints_published_percentage.png", width = 12, height = 6)

```


```{r message = FALSE, warning = FALSE, include = FALSE}

# Create empty figures for table layout in Readme

ggplot() + 
  theme_void() + 
  ggsave(paste0("outputs/figures/preprints_published/empty.png"), 
         width = 12, 
         height = 6)


ggplot() + 
  theme_void() + 
  ggsave(paste0("outputs/figures/preprints_published/empty2.png"), 
         width = 6, 
         height = 6)

```

# Calculate counts for Readme file
```{r}

count <- covid_preprints_published_viz %>%
  group_by(source) %>%
  summarise_all(~ sum(!is.na(.))) %>%
  mutate(perc = (is_preprint_of/identifier)*100,
         perc = round(perc, 1)) %>%
  arrange(desc(identifier))

```