From f01cea377854115fa9c869730909bd56fdcd8315 Mon Sep 17 00:00:00 2001
From: Anthony
Date: Mon, 15 Jan 2018 15:32:08 +0000
Subject: [PATCH] Drug discovery and chemical representation (#774)

This build is based on
https://github.com/greenelab/deep-review/commit/e10f48f6900ae3f088c4d23b52671f2f25c728c0.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/329103733
https://travis-ci.org/greenelab/deep-review/jobs/329103734

[ci skip]

The full commit message that triggered this build is copied below:

Drug discovery and chemical representation (#774)

* GAN already defined
* Address bioRxiv comment
* New chemical representation subsection
* New ROC reference, MoleculeNet includes PR now
* Add WIP references to check build
* Consistently apply existing tags
* Tag typo
* Reorganize chemical features and graphs
* Add graph convolution and Mol2vec
* Rephrasing
* Rephrase per cgreene suggestion
---
 README.md          |   2 +-
 index.html         | 498 ++++++++++++++++++++++++---------------------
 index.html.ots     | Bin 387 -> 387 bytes
 manuscript.pdf     | Bin 867984 -> 876157 bytes
 manuscript.pdf.ots | Bin 387 -> 387 bytes
 5 files changed, 271 insertions(+), 229 deletions(-)

diff --git a/README.md b/README.md
index 14275692..78a9d9da 100644
--- a/README.md
+++ b/README.md
@@ -15,4 +15,4 @@ This directory contains the following files, which are mostly ignored on the `ma
 ## Source

 The manuscripts in this directory were built from
-[`72352b8e472e39d807e21830ff72e2c0adaba1b3`](https://github.com/greenelab/deep-review/commit/72352b8e472e39d807e21830ff72e2c0adaba1b3).
+[`e10f48f6900ae3f088c4d23b52671f2f25c728c0`](https://github.com/greenelab/deep-review/commit/e10f48f6900ae3f088c4d23b52671f2f25c728c0).
diff --git a/index.html b/index.html
index 9410c267..d1461775 100644
--- a/index.html
+++ b/index.html
@@ -61,7 +61,7 @@

Opportunities and obstacles for deep learning in biology and medicine

A DOI-citable preprint of this manuscript is available at https://doi.org/10.1101/142760.

-

This manuscript was automatically generated from greenelab/deep-review@72352b8 on January 15, 2018.

+

This manuscript was automatically generated from greenelab/deep-review@e10f48f on January 15, 2018.

Authors

Travers Ching1,☯, Daniel S. Himmelstein2, Brett K. Beaulieu-Jones3, Alexandr A. Kalinin4, Brian T. Do5, Gregory P. Way2, Enrico Ferrero6, Paul-Michael Agapow7, Wei Xie8, Gail L. Rosen9, Benjamin J. Lengerich10, Johnny Israeli11, Jack Lanchantin12, Stephen Woloszynek9, Anne E. Carpenter13, Avanti Shrikumar14, Jinbo Xu15, Evan M. Cofer16, David J. Harris17, Dave DeCaprio18, Yanjun Qi12, Anshul Kundaje14,19, Yifan Peng20, Laura K. Wiley21, Marwin H.S. Segler22, Anthony Gitter23,24,†, Casey S. Greene2,†

— Author order was determined with a randomized algorithm
— To whom correspondence should be addressed: gitter@biostat.wisc.edu (AG) and greenescientist@gmail.com (CSG)

@@ -361,115 +361,117 @@

Drug repositioning

Drug development

Ligand-based prediction of bioactivity

High-throughput chemical screening in biomedical research aims to improve therapeutic options over a long-term horizon [21]. The objective is to discover which small molecules (also referred to as chemical compounds or ligands) specifically affect the activity of a target, such as a kinase, protein-protein interaction, or broader cellular phenotype. This screening is often one of the first steps in a long drug discovery pipeline, where novel molecules are pursued for their ability to inhibit or enhance disease-relevant biological mechanisms [368]. Initial hits are confirmed to eliminate false positives and proceed to the lead generation stage [369], where they are evaluated for absorption, distribution, metabolism, excretion, and toxicity (ADMET) and other properties. It is desirable to advance multiple lead series, clusters of structurally similar active chemicals, for further optimization by medicinal chemists to protect against unexpected failures in the later stages of drug discovery [368].

-

Computational work in this domain aims to identify sufficient candidate active compounds without exhaustively screening libraries of hundreds of thousands or millions of chemicals. Predicting chemical activity computationally is known as virtual screening. This task has been treated variously as a classification, regression, or ranking problem. In reality, it does not fit neatly into any of those categories. An ideal algorithm will rank a sufficient number of active compounds before the inactives, but the rankings of actives relative to other actives and inactives are less important [370]. Computational modeling also has the potential to predict ADMET traits for lead generation [371] and how drugs are metabolized [372].

-

Ligand-based approaches train on chemicals’ features without modeling target features (e.g. protein structure). Chemical features may be represented as a list of molecular descriptors such as molecular weight, atom counts, functional groups, charge representations, summaries of atom-atom relationships in the molecular graph, and more sophisticated derived properties [373]. Alternatively, chemicals can be characterized with the fingerprint bit vectors, textual strings, or novel learned representations described below. Neural networks have a long history in this domain [19,22], and the 2012 Merck Molecular Activity Challenge on Kaggle generated substantial excitement about the potential for high-parameter deep learning approaches. The winning submission was an ensemble that included a multi-task multi-layer perceptron network [374]. The sponsors noted drastic improvements over a random forest baseline, remarking “we have seldom seen any method in the past 10 years that could consistently outperform [random forest] by such a margin” [375]. Subsequent work (reviewed in more detail by Goh et al. [20]) explored the effects of jointly modeling far more targets than the Merck challenge [376,377], with Ramsundar et al. [377] showing that the benefits of multi-task networks had not yet saturated even with 259 targets. Although DeepTox [378], a deep learning approach, won another competition, the Toxicology in the 21st Century (Tox21) Data Challenge, it did not dominate alternative methods as thoroughly as in other domains. DeepTox was the top performer on 9 of 15 targets and highly competitive with the top performer on the others. However, for many targets there was little separation between the top two or three methods.

-

The nuanced Tox21 performance may be more reflective of the practical challenges encountered in ligand-based chemical screening than the extreme enthusiasm generated by the Merck competition. A study of 22 ADMET tasks demonstrated that there are limitations to multi-task transfer learning that are in part a consequence of the degree to which tasks are related [371]. Some of the ADMET datasets showed superior performance in multi-task models with only 22 ADMET tasks compared to multi-task models with over 500 less-similar tasks. In addition, the training datasets encountered in practical applications may be tiny relative to what is available in public datasets and organized competitions. A study of BACE-1 inhibitors included only 1547 compounds [379]. Machine learning models were able to train on this limited dataset, but overfitting was a challenge and the differences between random forests and a deep neural network were negligible, especially in the classification setting. Overfitting is still a problem in larger chemical screening datasets with tens or hundreds of thousands of compounds because the number of active compounds can be very small, on the order of 0.1% of all tested chemicals for a typical target [380]. This is consistent with the strong performance of low-parameter neural networks that emphasize compound-compound similarity, such as influence-relevance voter [370,381], instead of predicting compound activity directly from chemical features.

-

Much of the recent excitement in this domain has come from what could be considered a creative experimentation phase, in which deep learning has offered novel possibilities for feature representation and modeling of chemical compounds. A molecular graph, where atoms are labeled nodes and bonds are labeled edges, is a natural way to represent a chemical structure. Traditional machine learning approaches relied on preprocessing the graph into a feature vector, such as a fixed-width bit vector fingerprint [382]. The same fingerprints have been used by some drug-target interaction methods discussed above [38]. An overly simplistic but approximately correct view of chemical fingerprints is that each bit represents the presence or absence of a particular chemical substructure in the molecular graph. Modern neural networks, such as those discussed previously for PPI networks, can operate directly on the molecular graph as input. Duvenaud et al. [383] generalized standard circular fingerprints by substituting discrete operations in the fingerprinting algorithm with operations in a neural network, producing a real-valued feature vector instead of a bit vector. Other approaches offer trainable networks that can learn chemical feature representations that are optimized for a particular prediction task. Lusci et al. [384] applied recursive neural networks for directed acyclic graphs to undirected molecular graphs by creating an ensemble of directed graphs in which one atom is selected as the root node. Graph convolutions on undirected molecular graphs have eliminated the need to enumerate artificially directed graphs, learning feature vectors for atoms that are a function of the properties of neighboring atoms and local regions on the molecular graph [385,386].

-

Advances in chemical representation learning have also enabled new strategies for learning chemical-chemical similarity functions. Altae-Tran et al. developed a one-shot learning network [386] to address the reality that most practical chemical screening studies are unable to provide the thousands or millions of training compounds that are needed to train larger multi-task networks. Using graph convolutions to featurize chemicals, the network learns an embedding from compounds into a continuous feature space such that compounds with similar activities in a set of training tasks have similar embeddings. The approach is evaluated in an extremely challenging setting. The embedding is learned from a subset of prediction tasks (e.g. activity assays for individual proteins), and only one to ten labeled examples are provided as training data on a new task. On Tox21 targets, even when trained with one task-specific active compound and one inactive compound, the model is able to generalize reasonably well because it has learned an informative embedding function from the related tasks. Random forests, which cannot take advantage of the related training tasks, trained in the same setting are only slightly better than a random classifier. Despite the success on Tox21, performance on MUV datasets, which contains assays designed to be challenging for chemical informatics algorithms, is considerably worse. The authors also demonstrate the limitations of transfer learning as embeddings learned from the Tox21 assays have little utility for a drug adverse reaction dataset.

-

These novel, learned chemical feature representations may prove to be essential for accurately predicting why some compounds with similar structures yield similar target effects and others produce drastically different results. Currently, these methods are enticing but do not necessarily outperform classic approaches by a large margin. The neural fingerprints [383] were narrowly beaten by regression using traditional circular fingerprints on a drug efficacy prediction task but were superior for predicting solubility or photovoltaic efficiency. In the original study, graph convolutions [385] performed comparably to a multi-task network using standard fingerprints and slightly better than the neural fingerprints [383] on the drug efficacy task but were slightly worse than the influence-relevance voter method on an HIV dataset [370]. Broader recent benchmarking has shown that relative merits of these methods depends on the dataset and cross validation strategy [387], though evaluation often uses auROC (area under the receiver operating characteristic curve), which has limited utility due to the large class imbalance (see Discussion).

-

We remain optimistic for the potential of deep learning and specifically representation learning in drug discovery. Rigorous benchmarking on broad and diverse prediction tasks will be as important as novel neural network architectures to advance the state of the art and convincingly demonstrate superiority over traditional cheminformatics techniques. Fortunately, there has recently been much progress in this direction. The DeepChem software [386,388] and MoleculeNet benchmarking suite [387] built upon it contain chemical bioactivity and toxicity prediction datasets, multiple compound featurization approaches including graph convolutions, and various machine learning algorithms ranging from standard baselines like logistic regression and random forests to recent neural network architectures. Independent research groups have already contributed additional datasets and prediction algorithms to DeepChem. Adoption of common benchmarking evaluation metrics, datasets, and baseline algorithms has the potential to establish the practical utility of deep learning in chemical bioactivity prediction and lower the barrier to entry for machine learning researchers without biochemistry expertise.

-

One open question in ligand-based screening pertains to the benefits and limitations of transfer learning. Multi-task neural networks have shown the advantages of jointly modeling many targets [376,377]. Other studies have shown the limitations of transfer learning when the prediction tasks are insufficiently related [371,386]. This has important implications for representation learning. The typical approach to improve deep learning models by expanding the dataset size may not be applicable if only “related” tasks are beneficial, especially because task-task relatedness is ill-defined. The massive chemical state space will also influence the development of unsupervised representation learning methods [389]. Future work will establish whether it is better to train on massive collections of diverse compounds, drug-like small molecules, or specialized subsets.

+

Computational work in this domain aims to identify sufficient candidate active compounds without exhaustively screening libraries of hundreds of thousands or millions of chemicals. Predicting chemical activity computationally is known as virtual screening. An ideal algorithm will rank a sufficient number of active compounds before the inactives, but the rankings of actives relative to other actives and inactives are less important [370]. Computational modeling also has the potential to predict ADMET traits for lead generation [371] and how drugs are metabolized [372].

+

Ligand-based approaches train on chemicals’ features without modeling target features (e.g. protein structure). Neural networks have a long history in this domain [19,22], and the 2012 Merck Molecular Activity Challenge on Kaggle generated substantial excitement about the potential for high-parameter deep learning approaches. The winning submission was an ensemble that included a multi-task multi-layer perceptron network [373]. The sponsors noted drastic improvements over a random forest baseline, remarking “we have seldom seen any method in the past 10 years that could consistently outperform [random forest] by such a margin” [374], but not all outside experts were convinced [375]. Subsequent work (reviewed in more detail by Goh et al. [20]) explored the effects of jointly modeling far more targets than the Merck challenge [376,377], with Ramsundar et al. [377] showing that the benefits of multi-task networks had not yet saturated even with 259 targets. Although DeepTox [378], a deep learning approach, won another competition, the Toxicology in the 21st Century (Tox21) Data Challenge, it did not dominate alternative methods as thoroughly as in other domains. DeepTox was the top performer on 9 of 15 targets and highly competitive with the top performer on the others. However, for many targets there was little separation between the top two or three methods.

+

The nuanced Tox21 performance may be more reflective of the practical challenges encountered in ligand-based chemical screening than the extreme enthusiasm generated by the Merck competition. A study of 22 ADMET tasks demonstrated that there are limitations to multi-task transfer learning that are in part a consequence of the degree to which tasks are related [371]. Some of the ADMET datasets showed superior performance in multi-task models with only 22 ADMET tasks compared to multi-task models with over 500 less-similar tasks. In addition, the training datasets encountered in practical applications may be tiny relative to what is available in public datasets and organized competitions. A study of BACE-1 inhibitors included only 1547 compounds [379]. Machine learning models were able to train on this limited dataset, but overfitting was a challenge and the differences between random forests and a deep neural network were negligible, especially in the classification setting. Overfitting is still a problem in larger chemical screening datasets with tens or hundreds of thousands of compounds because the number of active compounds can be very small, on the order of 0.1% of all tested chemicals for a typical target [380]. This has motivated low-parameter neural networks that emphasize compound-compound similarity, such as influence-relevance voter [370,381], instead of predicting compound activity directly from chemical features.

+

Chemical featurization and representation learning

+

Much of the recent excitement in this domain has come from what could be considered a creative experimentation phase, in which deep learning has offered novel possibilities for feature representation and modeling of chemical compounds. A molecular graph, where atoms are labeled nodes and bonds are labeled edges, is a natural way to represent a chemical structure. Chemical features can be represented as a list of molecular descriptors such as molecular weight, atom counts, functional groups, charge representations, summaries of atom-atom relationships in the molecular graph, and more sophisticated derived properties [382]. Traditional machine learning approaches relied on preprocessing the graph into a feature vector of molecular descriptors or a fixed-width bit vector known as a fingerprint [383]. The same fingerprints have been used by some drug-target interaction methods discussed above [38]. An overly simplistic but approximately correct view of chemical fingerprints is that each bit represents the presence or absence of a particular chemical substructure in the molecular graph. Instead of using molecular descriptors or fingerprints as input, modern neural networks can represent chemicals as textual strings [384] or images [385] or operate directly on the molecular graph, which has enabled strategies for learning novel chemical representations.
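The simplified view of a fingerprint as a fixed-width vector of substructure indicator bits can be sketched in a few lines. This is a toy illustration under loud assumptions: it hashes short substrings of a SMILES string rather than enumerating substructures on the molecular graph, so it is not a circular fingerprint or any published algorithm, and the names `toy_fingerprint` and `tanimoto` are hypothetical. A real analysis would use a cheminformatics toolkit.

```python
import zlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of length 1..max_len into a fixed-width bit vector.

    Assumption: SMILES substrings stand in for chemical substructures.
    Real fingerprints enumerate substructures on the molecular graph.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike the built-in hash()
            bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by on-bits in either vector."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
benzene = toy_fingerprint("c1ccccc1")

print(tanimoto(ethanol, propanol))
print(tanimoto(ethanol, benzene))
```

Because every substring of `CCO` also occurs in `CCCO`, every bit set for ethanol is also set for propanol. Distinct fragments can collide on the same bit, which is one reason the "one bit per substructure" picture is only approximately correct.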

+

Virtual screening and chemical property prediction have emerged as one of the major application areas for graph-based neural networks. Duvenaud et al. [386] generalized standard circular fingerprints by substituting discrete operations in the fingerprinting algorithm with operations in a neural network, producing a real-valued feature vector instead of a bit vector. Other approaches offer trainable networks that can learn chemical feature representations that are optimized for a particular prediction task. Lusci et al. [387] applied recursive neural networks for directed acyclic graphs to undirected molecular graphs by creating an ensemble of directed graphs in which one atom is selected as the root node. Graph convolutions on undirected molecular graphs have eliminated the need to enumerate artificially directed graphs, learning feature vectors for atoms that are a function of the properties of neighboring atoms and local regions on the molecular graph [388–390]. More sophisticated graph algorithms [391,392] addressed limitations of standard graph convolutions that primarily operate on each node’s local neighborhood. We anticipate that these graph-based neural networks could also be applicable in other types of biological networks, such as the PPI networks we discussed previously.
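The idea that an atom's feature vector becomes a function of its neighbors' properties can be illustrated with a single, untrained graph convolution step. The sketch below is a generic sum-over-neighbors update, not the specific architecture of any cited method; the function names, the one-hot atom features, and the fixed weight matrices are all assumptions chosen so the arithmetic is easy to follow.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def graph_conv_layer(features, neighbors, w_self, w_nbr):
    """One step: h_v' = relu(W_self h_v + W_nbr * sum of neighbor features)."""
    updated = []
    for atom, h in enumerate(features):
        aggregated = [0.0] * len(h)
        for nbr in neighbors[atom]:
            aggregated = [a + b for a, b in zip(aggregated, features[nbr])]
        pre_activation = [
            s + n for s, n in zip(matvec(w_self, h), matvec(w_nbr, aggregated))
        ]
        updated.append([max(0.0, value) for value in pre_activation])
    return updated


# Heavy atoms of ethanol (C-C-O) with hypothetical one-hot features [is_C, is_O]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# Fixed illustrative weights; in a real network these are learned
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_nbr = [[0.5, 0.0], [0.0, 0.5]]

hidden = graph_conv_layer(features, neighbors, w_self, w_nbr)
# Sum pooling over atoms gives one vector for the whole molecule
molecule_vector = [sum(column) for column in zip(*hidden)]

print(hidden)           # [[1.5, 0.0], [1.5, 0.5], [0.5, 1.0]]
print(molecule_vector)  # [3.5, 1.5]
```

The two carbons start with identical features but end with different ones, because only the second carbon is bonded to the oxygen. Stacking more layers would let information travel further across the graph.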

+

Advances in chemical representation learning have also enabled new strategies for learning chemical-chemical similarity functions. Altae-Tran et al. developed a one-shot learning network [389] to address the reality that most practical chemical screening studies are unable to provide the thousands or millions of training compounds that are needed to train larger multi-task networks. Using graph convolutions to featurize chemicals, the network learns an embedding from compounds into a continuous feature space such that compounds with similar activities in a set of training tasks have similar embeddings. The approach is evaluated in an extremely challenging setting. The embedding is learned from a subset of prediction tasks (e.g. activity assays for individual proteins), and only one to ten labeled examples are provided as training data on a new task. On Tox21 targets, even when trained with one task-specific active compound and one inactive compound, the model is able to generalize reasonably well because it has learned an informative embedding function from the related tasks. Random forests, which cannot take advantage of the related training tasks, trained in the same setting are only slightly better than a random classifier. Despite the success on Tox21, performance on the MUV datasets, which contain assays designed to be challenging for chemical informatics algorithms, is considerably worse. The authors also demonstrate the limitations of transfer learning as embeddings learned from the Tox21 assays have little utility for a drug adverse reaction dataset.

+

These novel, learned chemical feature representations may prove to be essential for accurately predicting why some compounds with similar structures yield similar target effects and others produce drastically different results. Currently, these methods are enticing but do not necessarily outperform classic approaches by a large margin. The neural fingerprints [386] were narrowly beaten by regression using traditional circular fingerprints on a drug efficacy prediction task but were superior for predicting solubility or photovoltaic efficiency. In the original study, graph convolutions [388] performed comparably to a multi-task network using standard fingerprints and slightly better than the neural fingerprints [386] on the drug efficacy task but were slightly worse than the influence-relevance voter method on an HIV dataset [370]. Broader recent benchmarking has shown that the relative merits of these methods depend on the dataset and cross-validation strategy [393], though evaluation in this domain often uses auROC (area under the receiver operating characteristic curve) [394], which has limited utility due to the large class imbalance (see Discussion).
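A small worked example may help show why auROC can be uninformative when actives are rare. The data below are invented for illustration and do not come from any cited study: 10 actives among 10,000 compounds (0.1%), with a hypothetical model that ranks 100 inactives ahead of every active. The helper names `auroc` and `precision_at_k` are assumptions, and the pairwise auROC computation is written for clarity rather than speed.

```python
def auroc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count half)."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


def precision_at_k(scores, labels, k):
    """Fraction of the k top-scored compounds that are active."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy ranking: 100 inactives first, then the 10 actives, then 9,890 inactives
labels = [0] * 100 + [1] * 10 + [0] * 9890
scores = [len(labels) - rank for rank in range(len(labels))]  # strictly decreasing

print(round(auroc(scores, labels), 3))      # 0.99
print(precision_at_k(scores, labels, 100))  # 0.0
```

In this toy setting the auROC is about 0.99, yet none of the 100 top-ranked compounds that would be sent for experimental follow-up is active. Metrics that focus on the top of the ranked list capture this failure more directly.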

+

We remain optimistic about the potential of deep learning and specifically representation learning in drug discovery. Rigorous benchmarking on broad and diverse prediction tasks will be as important as novel neural network architectures to advance the state of the art and convincingly demonstrate superiority over traditional cheminformatics techniques. Fortunately, there has recently been much progress in this direction. The DeepChem software [389,395] and MoleculeNet benchmarking suite [393] built upon it contain chemical bioactivity and toxicity prediction datasets, multiple compound featurization approaches including graph convolutions, and various machine learning algorithms ranging from standard baselines like logistic regression and random forests to recent neural network architectures. Independent research groups have already contributed additional datasets and prediction algorithms to DeepChem. Adoption of common benchmarking evaluation metrics, datasets, and baseline algorithms has the potential to establish the practical utility of deep learning in chemical bioactivity prediction and lower the barrier to entry for machine learning researchers without biochemistry expertise.

+

One open question in ligand-based screening pertains to the benefits and limitations of transfer learning. Multi-task neural networks have shown the advantages of jointly modeling many targets [376,377]. Other studies have shown the limitations of transfer learning when the prediction tasks are insufficiently related [371,389]. This has important implications for representation learning. The typical approach to improve deep learning models by expanding the dataset size may not be applicable if only “related” tasks are beneficial, especially because task-task relatedness is ill-defined. The massive chemical state space will also influence the development of unsupervised representation learning methods [384,396]. Future work will establish whether it is better to train on massive collections of diverse compounds, drug-like small molecules, or specialized subsets.

Structure-based prediction of bioactivity

-

When protein structure is available, virtual screening has traditionally relied on docking programs to predict how a compound best fits in the target’s binding site and score the predicted ligand-target complex [390]. Recently, deep learning approaches have been developed to model protein structure, which is expected to improve upon the simpler drug-target interaction algorithms described above that represent proteins with feature vectors derived from amino acid sequences [38,366].

-

Structure-based deep learning methods differ in whether they use experimentally-derived or predicted ligand-target complexes and how they represent the 3D structure. The Atomic CNN [391] takes 3D crystal structures from PDBBind [392] as input, ensuring it uses a reliable ligand-target complex. AtomNet [35] samples multiple ligand poses within the target binding site, and DeepVS [393] and Ragoza et al. [394] use a docking program to generate protein-compound complexes. If they are sufficiently accurate, these latter approaches would have wider applicability to a much larger set of compounds and proteins. However, incorrect ligand poses will be misleading during training, and the predictive performance is sensitive to the docking quality [393].

-

There are two established options for representing a protein-compound complex. One option, a 3D grid, can featurize the input complex [35,394]. Each entry in the grid tracks the types of protein and ligand atoms in that region of the 3D space or descriptors derived from those atoms. Alternatively, DeepVS [393] and atomic convolutions [391] offer greater flexibility in their convolutions by eschewing the 3D grid. Instead, they each implement techniques for executing convolutions over atoms’ neighboring atoms in the 3D space. Gomes et al. demonstrate that currently random forest on a 1D feature vector that describes the 3D ligand-target structure generally outperforms neural networks on the same feature vector as well as atomic convolutions and ligand-based neural networks when predicting the continuous-valued inhibition constant on the PDBBind refined dataset [391]. However, in the long term, atomic convolutions may ultimately overtake grid-based methods, as they provide greater freedom to model atom-atom interactions and the forces that govern binding affinity.

+

When protein structure is available, virtual screening has traditionally relied on docking programs to predict how a compound best fits in the target’s binding site and score the predicted ligand-target complex [397]. Recently, deep learning approaches have been developed to model protein structure, which is expected to improve upon the simpler drug-target interaction algorithms described above that represent proteins with feature vectors derived from amino acid sequences [38,366].

+

Structure-based deep learning methods differ in whether they use experimentally-derived or predicted ligand-target complexes and how they represent the 3D structure. The Atomic CNN [398] and TopologyNet [399] models take 3D structures from PDBBind [400] as input, ensuring the ligand-target complexes are reliable. AtomNet [35] samples multiple ligand poses within the target binding site, and DeepVS [401] and Ragoza et al. [402] use a docking program to generate protein-compound complexes. If they are sufficiently accurate, these latter approaches would have wider applicability to a much larger set of compounds and proteins. However, incorrect ligand poses will be misleading during training, and the predictive performance is sensitive to the docking quality [401].

+

There are two established options for representing a protein-compound complex. One option, a 3D grid, can featurize the input complex [35,402]. Each entry in the grid tracks the types of protein and ligand atoms in that region of the 3D space or descriptors derived from those atoms. Alternatively, DeepVS [401] and atomic convolutions [398] offer greater flexibility in their convolutions by eschewing the 3D grid. Instead, they each implement techniques for executing convolutions over atoms’ neighboring atoms in the 3D space. Gomes et al. demonstrate that, currently, a random forest on a 1D feature vector that describes the 3D ligand-target structure generally outperforms neural networks on the same feature vector, as well as atomic convolutions and ligand-based neural networks, when predicting the continuous-valued inhibition constant on the PDBBind refined dataset [398]. However, in the long term, atomic convolutions may overtake grid-based methods, as they provide greater freedom to model atom-atom interactions and the forces that govern binding affinity.
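The 3D grid representation can be sketched as a simple voxelization in which each cell counts the atoms of each type that fall inside it. This is a minimal illustration, not the featurization used by any of the cited methods: the channel names, box size, resolution, and coordinates are all hypothetical, and a sparse dictionary stands in for the dense tensor a convolutional network would consume.

```python
import math


def voxelize(atoms, origin, box_size, resolution):
    """Count atoms per (channel, ix, iy, iz) cell of a cubic grid.

    atoms: iterable of (channel, x, y, z); coordinates share the units of
    box_size and resolution. Atoms outside the box are ignored.
    """
    n_cells = int(round(box_size / resolution))
    grid = {}
    for channel, x, y, z in atoms:
        index = tuple(
            math.floor((coord - start) / resolution)
            for coord, start in zip((x, y, z), origin)
        )
        if all(0 <= i < n_cells for i in index):
            key = (channel,) + index
            grid[key] = grid.get(key, 0) + 1
    return grid


# Hypothetical atoms from a binding site, with separate protein and ligand channels
atoms = [
    ("protein_C", 0.2, 0.3, 0.1),
    ("protein_N", 1.4, 0.3, 0.1),
    ("ligand_C", 1.6, 0.9, 0.4),
    ("ligand_O", 3.5, 3.5, 3.5),
    ("ligand_C", 9.0, 0.0, 0.0),  # outside the 4 x 4 x 4 box, so it is dropped
]

grid = voxelize(atoms, origin=(0.0, 0.0, 0.0), box_size=4.0, resolution=1.0)
for key in sorted(grid):
    print(key, grid[key])
```

Here the protein nitrogen and one ligand carbon land in the same cell but in different channels, which is how a grid keeps protein and ligand atom types distinguishable. The fixed cell size is also the source of the rigidity that grid-free convolutions over neighboring atoms aim to avoid.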

De novo drug design

-

De novo drug design attempts to model the typical design-synthesize-test cycle of drug discovery [395,396]. It explores an estimated 10^60 synthesizable organic molecules with drug-like properties without explicit enumeration [380]. To test or score structures, algorithms like those discussed earlier are used. To “design” and “synthesize”, traditional de novo design software relied on classical optimizers such as genetic algorithms. Unfortunately, this often leads to overfit, “weird” molecules, which are difficult to synthesize in the lab. Current programs have settled on rule-based virtual chemical reactions to generate molecular structures [396]. Deep learning models that generate realistic, synthesizable molecules have been proposed as an alternative. In contrast to the classical, symbolic approaches, generative models learned from data would not depend on laboriously encoded expert knowledge. The challenge of generating molecules has parallels to the generation of syntactically and semantically correct text [397].

-

As deep learning models that directly output (molecular) graphs remain under-explored, generative neural networks for drug design typically represent chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [398]. This allows treating molecules as sequences and leveraging recent progress in recurrent neural networks. Gómez-Bombarelli et al. designed a SMILES-to-SMILES autoencoder to learn a continuous latent feature space for chemicals [389]. In this learned continuous space it was possible to interpolate between continuous representations of chemicals in a manner that is not possible with discrete (e.g. bit vector or string) features or in symbolic, molecular graph space. Even more interesting is the prospect of performing gradient-based or Bayesian optimization of molecules within this latent space. The strategy of constructing simple, continuous features before applying supervised learning techniques is reminiscent of autoencoders trained on high-dimensional EHR data [112]. A drawback of the SMILES-to-SMILES autoencoder is that not all SMILES strings produced by the autoencoder’s decoder correspond to valid chemical structures. Recently, the Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, has been proposed to alleviate this issue [399].

-

Another approach to de novo design is to train character-based RNNs on large collections of molecules, for example, ChEMBL [400], to first obtain a generic generative model for drug-like compounds [398]. These generative models successfully learn the grammar of compound representations, with 94% [401] or nearly 98% [398] of generated SMILES corresponding to valid molecular structures. The initial RNN is then fine-tuned to generate molecules that are likely to be active against a specific target by either continuing training on a small set of positive examples [398] or adopting reinforcement learning strategies [401,402]. Both the fine-tuning and reinforcement learning approaches can rediscover known, held-out active molecules. The great flexibility of neural networks and progress in generative models offer many opportunities for deep architectures in de novo design (e.g. the adaptation of Generative Adversarial Networks (GANs) for molecules).

+

De novo drug design attempts to model the typical design-synthesize-test cycle of drug discovery [403,404]. It explores an estimated 10^60 synthesizable organic molecules with drug-like properties without explicit enumeration [380]. To test or score structures, algorithms like those discussed earlier are used. To “design” and “synthesize”, traditional de novo design software relied on classical optimizers such as genetic algorithms. Unfortunately, this often leads to overfit, “weird” molecules, which are difficult to synthesize in the lab. Current programs have settled on rule-based virtual chemical reactions to generate molecular structures [404]. Deep learning models that generate realistic, synthesizable molecules have been proposed as an alternative. In contrast to the classical, symbolic approaches, generative models learned from data would not depend on laboriously encoded expert knowledge. The challenge of generating molecules has parallels to the generation of syntactically and semantically correct text [405].

+

As deep learning models that directly output (molecular) graphs remain under-explored, generative neural networks for drug design typically represent chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [406]. This allows treating molecules as sequences and leveraging recent progress in recurrent neural networks. Gómez-Bombarelli et al. designed a SMILES-to-SMILES autoencoder to learn a continuous latent feature space for chemicals [384]. In this learned continuous space it was possible to interpolate between continuous representations of chemicals in a manner that is not possible with discrete (e.g. bit vector or string) features or in symbolic, molecular graph space. Even more interesting is the prospect of performing gradient-based or Bayesian optimization of molecules within this latent space. The strategy of constructing simple, continuous features before applying supervised learning techniques is reminiscent of autoencoders trained on high-dimensional EHR data [112]. A drawback of the SMILES-to-SMILES autoencoder is that not all SMILES strings produced by the autoencoder’s decoder correspond to valid chemical structures. Recently, the Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, has been proposed to alleviate this issue [407].

+

Another approach to de novo design is to train character-based RNNs on large collections of molecules, for example, ChEMBL [408], to first obtain a generic generative model for drug-like compounds [406]. These generative models successfully learn the grammar of compound representations, with 94% [409] or nearly 98% [406] of generated SMILES corresponding to valid molecular structures. The initial RNN is then fine-tuned to generate molecules that are likely to be active against a specific target by either continuing training on a small set of positive examples [406] or adopting reinforcement learning strategies [409,410]. Both the fine-tuning and reinforcement learning approaches can rediscover known, held-out active molecules. The great flexibility of neural networks and progress in generative models offer many opportunities for deep architectures in de novo design (e.g. the adaptation of GANs for molecules).

Discussion

Despite the disparate types of data and scientific goals in the learning tasks covered above, several challenges are broadly important for deep learning in the biomedical domain. Here we examine these factors that may impede further progress, ask what steps have already been taken to overcome them, and suggest future research directions.

Customizing deep learning models reflects a tradeoff between bias and variance

Some of the challenges in applying deep learning are shared with other machine learning methods. In particular, many problem-specific optimizations described in this review reflect a recurring universal tradeoff – controlling the flexibility of a model in order to maximize predictivity. Methods for adjusting the flexibility of deep learning models include dropout, reduced data projections, and transfer learning (described below). One way of understanding such model optimizations is that they incorporate external information to limit model flexibility and thereby improve predictions. This balance is formally described as a tradeoff between “bias and variance” [10].

-

Although the bias-variance tradeoff is common to all machine learning applications, recent empirical and theoretical observations suggest that deep learning models may have uniquely advantageous generalization properties [403,404]. Nevertheless, additional advances will be needed to establish a coherent theoretical foundation that enables practitioners to better reason about their models from first principles.

+

Although the bias-variance tradeoff is common to all machine learning applications, recent empirical and theoretical observations suggest that deep learning models may have uniquely advantageous generalization properties [411,412]. Nevertheless, additional advances will be needed to establish a coherent theoretical foundation that enables practitioners to better reason about their models from first principles.

Evaluation metrics for imbalanced classification

Making predictions in the presence of high class imbalance and differences between training and generalization data is a common feature of many large biomedical datasets, including deep learning models of genomic features, patient classification, disease detection, and virtual screening. Prediction of transcription factor binding sites exemplifies the difficulties with learning from highly imbalanced data. The human genome has 3 billion base pairs, and only a small fraction of them are implicated in specific biochemical activities. Less than 1% of the genome can be confidently labeled as bound for most transcription factors.

Estimating the false discovery rate (FDR) is a standard method of evaluation in genomics that can also be applied to deep learning model predictions of genomic features. Using deep learning predictions for targeted validation experiments of specific biochemical activities necessitates a more stringent FDR (typically 5-25%). However, when predicted biochemical activities are used as features in other models, such as gene expression models, a low FDR may not be necessary.

-

What is the correspondence between FDR metrics and commonly used classification metrics such as auPRC (area under the precision-recall curve) and auROC (area under the receiver-operating-characteristic curve)? auPRC evaluates the average precision, or equivalently, the average FDR across all recall thresholds. This metric provides an overall estimate of performance across all possible use cases, which can be misleading for targeted validation experiments. For example, classification of TF binding sites can exhibit a recall of 0% at 10% FDR and auPRC greater than 0.6. In this case, the auPRC may be competitive, but the predictions are ill-suited for targeted validation that can only examine a few of the highest-confidence predictions. Likewise, auROC evaluates the average recall across all false positive rate (FPR) thresholds, which is often a highly misleading metric in class-imbalanced domains [70,405]. Consider a classification model with recall of 0% at FDR less than 25% and 100% recall at FDR greater than 25%. In the context of TF binding predictions where only 1% of genomic regions are bound by the TF, this is equivalent to a recall of 100% for FPR greater than 0.33%. In other words, the auROC would be 0.9967, but the classifier would be useless for targeted validation. It is not unusual to obtain a chromosome-wide auROC greater than 0.99 for TF binding predictions but a recall of 0% at 10% FDR. Consequently, practitioners must select the metric most tailored to their subsequent use case to use these methods most effectively.

+

What is the correspondence between FDR metrics and commonly used classification metrics such as auPRC (area under the precision-recall curve) and auROC (area under the receiver-operating-characteristic curve)? auPRC evaluates the average precision, or equivalently, the average FDR across all recall thresholds. This metric provides an overall estimate of performance across all possible use cases, which can be misleading for targeted validation experiments. For example, classification of TF binding sites can exhibit a recall of 0% at 10% FDR and auPRC greater than 0.6. In this case, the auPRC may be competitive, but the predictions are ill-suited for targeted validation that can only examine a few of the highest-confidence predictions. Likewise, auROC evaluates the average recall across all false positive rate (FPR) thresholds, which is often a highly misleading metric in class-imbalanced domains [70,413]. Consider a classification model with recall of 0% at FDR less than 25% and 100% recall at FDR greater than 25%. In the context of TF binding predictions where only 1% of genomic regions are bound by the TF, this is equivalent to a recall of 100% for FPR greater than 0.33%. In other words, the auROC would be 0.9967, but the classifier would be useless for targeted validation. It is not unusual to obtain a chromosome-wide auROC greater than 0.99 for TF binding predictions but a recall of 0% at 10% FDR. Consequently, practitioners must select the metric most tailored to their subsequent use case to use these methods most effectively.
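The arithmetic behind this example can be checked with a short sketch. It assumes the idealized classifier described above (recall jumps from 0% to 100% exactly where the FDR reaches 25%) and a 1% positive rate; the function names are illustrative only and are not taken from any cited work.

```python
def fpr_at_fdr(fdr, positive_fraction, recall=1.0):
    """False positive rate implied by a given FDR, recall, and class balance."""
    true_pos = recall * positive_fraction
    # Rearranged from FDR = FP / (FP + TP).
    false_pos = true_pos * fdr / (1.0 - fdr)
    negatives = 1.0 - positive_fraction
    return false_pos / negatives


def step_auroc(fpr_threshold):
    """auROC of a step-shaped ROC curve: recall 0 below the threshold, 1 above it."""
    return 1.0 - fpr_threshold


fpr = fpr_at_fdr(fdr=0.25, positive_fraction=0.01)
auroc = step_auroc(fpr)
print(f"FPR at which recall reaches 100%: {fpr:.4%}")
print(f"auROC of the step classifier: {auroc:.4f}")
```

Under these assumptions the FPR comes out at roughly a third of a percent and the auROC at roughly 0.997, in line with the figures quoted in the text up to rounding, even though no prediction is usable at an FDR below 25%.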

Formulation of classification labels

Genome-wide continuous signals are commonly formulated into classification labels through signal peak detection. ChIP-seq peaks are used to identify locations of TF binding and histone modifications. Such procedures rely on thresholding criteria to define what constitutes a peak in the signal. This inevitably results in a set of signal peaks that are close to the threshold, not sufficient to constitute a positive label but too similar to positively labeled examples to constitute a negative label. To avoid an arbitrary label for these examples they may be labeled as “ambiguous”. Ambiguously labeled examples can then be ignored during model training and evaluation of recall and FDR. The correlation between model predictions on these examples and their signal values can be used to evaluate if the model correctly ranks these examples between positive and negative examples.
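The labeling scheme above can be sketched as follows. This is a minimal illustration, assuming two hypothetical signal thresholds (`negative_below` and `positive_above`) and toy signal values; real peak callers use more elaborate criteria.

```python
def label_peaks(signal_values, negative_below, positive_above):
    """Assign 1 (positive), 0 (negative), or None (ambiguous) to each region."""
    labels = []
    for value in signal_values:
        if value >= positive_above:
            labels.append(1)
        elif value < negative_below:
            labels.append(0)
        else:
            labels.append(None)  # near the threshold: neither clearly bound nor unbound
    return labels


def recall_and_fdr(labels, predictions):
    """Compute recall and FDR while ignoring ambiguously labeled examples."""
    kept = [(y, p) for y, p in zip(labels, predictions) if y is not None]
    tp = sum(1 for y, p in kept if y == 1 and p == 1)
    fp = sum(1 for y, p in kept if y == 0 and p == 1)
    fn = sum(1 for y, p in kept if y == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return recall, fdr


# Toy signal values and thresholds (made up for illustration).
signal = [0.2, 4.8, 5.1, 9.7, 12.3, 1.0]
labels = label_peaks(signal, negative_below=3.0, positive_above=8.0)
predictions = [0, 1, 0, 1, 0, 1]
recall, fdr = recall_and_fdr(labels, predictions)
```

The two regions with intermediate signal receive no label, so the model's predictions on them affect neither recall nor FDR.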

Formulation of a performance upper bound

-

In assessing the upper bound on the predictive performance of a deep learning model, it is necessary to incorporate the between-study variation inherent to biomedical research [406]. Study-level variability limits classification performance and can lead to underestimating prediction error if the generalization error is estimated by splitting a single dataset. Analyses can incorporate data from multiple labs and experiments to capture between-study variation within the prediction model, mitigating some of these issues.

+

In assessing the upper bound on the predictive performance of a deep learning model, it is necessary to incorporate the between-study variation inherent to biomedical research [414]. Study-level variability limits classification performance and can lead to underestimating prediction error if the generalization error is estimated by splitting a single dataset. Analyses can incorporate data from multiple labs and experiments to capture between-study variation within the prediction model, mitigating some of these issues.

Uncertainty quantification

-

Deep learning based solutions for biomedical applications could substantially benefit from guarantees on the reliability of predictions and a quantification of uncertainty. Due to biological variability and precision limits of equipment, biomedical data do not consist of precise measurements but of estimates with noise. Hence, it is crucial to obtain uncertainty measures that capture how noise in input values propagates through deep neural networks. Such measures can be used for reliability assessment of automated decisions in clinical and public health applications, and for guarding against model vulnerabilities in the face of rare or adversarial cases [407]. Moreover, in fundamental biological research, measures of uncertainty help researchers distinguish between true regularities in the data and patterns that are false or merely anecdotal. There are two main uncertainties that one can calculate: epistemic and aleatoric [408]. Epistemic uncertainty describes uncertainty about the model, its structure, or its parameters. This uncertainty is caused by insufficient training data or by a difference in the training set and testing set distributions, so it vanishes in the limit of infinite data. On the other hand, aleatoric uncertainty describes uncertainty inherent in the observations. This uncertainty is due to noisy or missing data, so it vanishes with the ability to observe all independent variables with infinite precision. A good way to represent aleatoric uncertainty is to design an appropriate loss function with an uncertainty variable. In the case of data-dependent aleatoric uncertainty, one can train the model to increase its uncertainty when it is incorrect due to noisy or missing data, and in the case of task-dependent aleatoric uncertainty, one can optimize for the best uncertainty parameter for each task [409]. Meanwhile, there are various methods for modeling epistemic uncertainty, outlined below.

-

In classification tasks, confidence calibration is the problem of using classifier scores to predict class membership probabilities that match the true membership likelihoods. These membership probabilities can be used to assess the uncertainty associated with assigning the example to each of the classes. Guo et al. [410] observed that contemporary neural networks are poorly calibrated and provided a simple recommendation for calibration: temperature scaling, a single parameter special case of Platt scaling [411]. In addition to confidence calibration, there is early work from Chryssolouris et al. [412] that described a method for obtaining confidence intervals with the assumption of normally distributed error for the neural network. More recently, Hendrycks and Gimpel discovered that incorrect or out-of-distribution examples usually have lower maximum softmax probabilities than correctly classified examples, allowing for effective detection of misclassified examples [413]. Liang et al. used temperature scaling and small perturbations to further separate the softmax scores of correctly classified examples and the scores of out-of-distribution examples, allowing for more effective detection [414]. This approach outperformed the baseline approaches by a large margin, establishing a new state-of-the-art performance.

-

An alternative approach for obtaining principled uncertainty estimates from deep learning models is to use Bayesian neural networks. Deep learning models are usually trained to obtain the most likely parameters given the data. However, choosing the single most likely set of parameters ignores the uncertainty about which set of parameters (among the possible models that explain the given dataset) should be used. This sometimes leads to uncertainty in predictions when the chosen likely parameters produce high-confidence but incorrect results. On the other hand, the parameters of Bayesian neural networks are modeled as full probability distributions. This Bayesian approach comes with a whole host of benefits, including better calibrated confidence estimates [415] and more robustness to adversarial and out-of-distribution examples [416]. Unfortunately, modeling the full posterior distribution for the model’s parameters given the data is usually computationally intractable. One popular method for circumventing this high computational cost is called test-time dropout [417], where an approximate posterior distribution is obtained using variational inference. Gal and Ghahramani showed that a stack of fully connected layers with dropout between the layers is equivalent to approximate inference in a Gaussian process model [417]. The authors interpret dropout as a variational inference method and apply their method to convolutional neural networks. This is simple to implement and preserves the possibility of obtaining cheap samples from the approximate posterior distribution. Operationally, obtaining model uncertainty for a given case becomes as straightforward as leaving dropout turned on and predicting multiple times. The spread of the different predictions is a reasonable proxy for model uncertainty. 
This technique has been successfully applied in an automated system for detecting diabetic retinopathy [418], where uncertainty-informed referrals improved diagnostic performance and allowed the model to meet the National Health Service recommended levels of sensitivity and specificity. The authors also found that entropy performs comparably to the spread obtained via test-time dropout for identifying uncertain cases, and therefore it can be used instead for automated referrals.

-

Several other techniques have been proposed for effectively estimating predictive uncertainty as uncertainty quantification for neural networks continues to be an active research area. Recently, McClure and Kriegeskorte observed that test-time sampling improved calibration of the probabilistic predictions, sampling weights led to more robust uncertainty estimates than sampling units, and spike-and-slab sampling is superior to Gaussian dropconnect and Bernoulli dropout [419]. Krueger et al. introduced Bayesian hypernetworks [420] as another framework for approximate Bayesian inference in deep learning, where an invertible generative hypernetwork maps isotropic Gaussian noise to parameters of the primary network allowing for computationally cheap sampling and efficient estimation of the posterior. Meanwhile, Lakshminarayanan et al. proposed using deep ensembles, which are traditionally used for boosting predictive performance, on standard (non-Bayesian) neural networks to obtain well-calibrated uncertainty estimates that are comparable to those obtained by Bayesian neural networks [421]. In cases where model uncertainty is known to be caused by a difference in training and testing distributions, domain adaptation based techniques can help mitigate the problem [422].

-

Despite the success and popularity of deep learning, some deep learning models can be surprisingly brittle. Researchers are actively working on modifications to deep learning frameworks to enable them to handle probability and embrace uncertainty. Most notably, Bayesian modeling and deep learning are being integrated with renewed enthusiasm. As a result, several opportunities for innovation arise: understanding the causes of model uncertainty can lead to novel optimization and regularization techniques, assessing the utility of uncertainty estimation techniques on various model architectures and structures can be very useful to practitioners, and extending Bayesian deep learning to unsupervised settings can be a significant breakthrough [423]. Unfortunately, uncertainty quantification techniques are underutilized in the computational biology communities and largely ignored in the current deep learning for biomedicine literature. Thus, the practical value of uncertainty quantification in biomedical domains is yet to be appreciated.

+

Deep learning based solutions for biomedical applications could substantially benefit from guarantees on the reliability of predictions and a quantification of uncertainty. Due to biological variability and precision limits of equipment, biomedical data do not consist of precise measurements but of estimates with noise. Hence, it is crucial to obtain uncertainty measures that capture how noise in input values propagates through deep neural networks. Such measures can be used for reliability assessment of automated decisions in clinical and public health applications, and for guarding against model vulnerabilities in the face of rare or adversarial cases [415]. Moreover, in fundamental biological research, measures of uncertainty help researchers distinguish between true regularities in the data and patterns that are false or merely anecdotal. There are two main uncertainties that one can calculate: epistemic and aleatoric [416]. Epistemic uncertainty describes uncertainty about the model, its structure, or its parameters. This uncertainty is caused by insufficient training data or by a difference in the training set and testing set distributions, so it vanishes in the limit of infinite data. On the other hand, aleatoric uncertainty describes uncertainty inherent in the observations. This uncertainty is due to noisy or missing data, so it vanishes with the ability to observe all independent variables with infinite precision. A good way to represent aleatoric uncertainty is to design an appropriate loss function with an uncertainty variable. In the case of data-dependent aleatoric uncertainty, one can train the model to increase its uncertainty when it is incorrect due to noisy or missing data, and in the case of task-dependent aleatoric uncertainty, one can optimize for the best uncertainty parameter for each task [417]. Meanwhile, there are various methods for modeling epistemic uncertainty, outlined below.

+

In classification tasks, confidence calibration is the problem of using classifier scores to predict class membership probabilities that match the true membership likelihoods. These membership probabilities can be used to assess the uncertainty associated with assigning the example to each of the classes. Guo et al. [418] observed that contemporary neural networks are poorly calibrated and provided a simple recommendation for calibration: temperature scaling, a single parameter special case of Platt scaling [419]. In addition to confidence calibration, there is early work from Chryssolouris et al. [420] that described a method for obtaining confidence intervals with the assumption of normally distributed error for the neural network. More recently, Hendrycks and Gimpel discovered that incorrect or out-of-distribution examples usually have lower maximum softmax probabilities than correctly classified examples, allowing for effective detection of misclassified examples [421]. Liang et al. used temperature scaling and small perturbations to further separate the softmax scores of correctly classified examples and the scores of out-of-distribution examples, allowing for more effective detection [422]. This approach outperformed the baseline approaches by a large margin, establishing a new state-of-the-art performance.
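To make temperature scaling concrete, the sketch below divides the logits by a single scalar T before the softmax and picks the T that minimizes the negative log-likelihood on a held-out validation set. The grid search, the toy logits, and all function names are assumptions made for illustration; a gradient-based optimizer would normally be used to fit T.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature with the lowest validation NLL (simple grid search)."""
    grid = grid or [0.1 * k for k in range(1, 101)]  # 0.1, 0.2, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: very confident logits, one of four predictions is wrong.
val_logits = [[8.0, 0.0], [7.0, 0.0], [0.0, 9.0], [6.0, 0.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)
```

Because every logit is divided by the same positive constant, the predicted class never changes; only the confidence attached to it is softened (T greater than 1) or sharpened (T less than 1).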

+

An alternative approach for obtaining principled uncertainty estimates from deep learning models is to use Bayesian neural networks. Deep learning models are usually trained to obtain the most likely parameters given the data. However, choosing the single most likely set of parameters ignores the uncertainty about which set of parameters (among the possible models that explain the given dataset) should be used. This sometimes leads to uncertainty in predictions when the chosen likely parameters produce high-confidence but incorrect results. On the other hand, the parameters of Bayesian neural networks are modeled as full probability distributions. This Bayesian approach comes with a whole host of benefits, including better calibrated confidence estimates [423] and more robustness to adversarial and out-of-distribution examples [424]. Unfortunately, modeling the full posterior distribution for the model’s parameters given the data is usually computationally intractable. One popular method for circumventing this high computational cost is called test-time dropout [425], where an approximate posterior distribution is obtained using variational inference. Gal and Ghahramani showed that a stack of fully connected layers with dropout between the layers is equivalent to approximate inference in a Gaussian process model [425]. The authors interpret dropout as a variational inference method and apply their method to convolutional neural networks. This is simple to implement and preserves the possibility of obtaining cheap samples from the approximate posterior distribution. Operationally, obtaining model uncertainty for a given case becomes as straightforward as leaving dropout turned on and predicting multiple times. The spread of the different predictions is a reasonable proxy for model uncertainty. 
This technique has been successfully applied in an automated system for detecting diabetic retinopathy [426], where uncertainty-informed referrals improved diagnostic performance and allowed the model to meet the National Health Service recommended levels of sensitivity and specificity. The authors also found that entropy performs comparably to the spread obtained via test-time dropout for identifying uncertain cases, and therefore it can be used instead for automated referrals.
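The "leave dropout turned on and predict multiple times" procedure can be sketched in a few lines. The tiny one-hidden-layer network, its weights, and the function names below are made up for illustration and do not correspond to any of the cited systems; in a deep learning framework the same effect is obtained by keeping dropout layers active at prediction time.

```python
import math
import random
import statistics


def forward(x, w_hidden, w_out, rng, drop_prob):
    """One stochastic forward pass with dropout applied to the hidden units."""
    hidden = []
    for weights in w_hidden:
        activation = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid output


def mc_dropout_predict(x, w_hidden, w_out, n_samples=200, drop_prob=0.5, seed=0):
    """Mean prediction and its spread over repeated dropout-enabled passes."""
    rng = random.Random(seed)
    samples = [forward(x, w_hidden, w_out, rng, drop_prob) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical fixed weights standing in for a trained model.
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4], [0.7, 0.2]]
w_out = [1.0, -1.5, 0.5, 2.0]
mean_pred, spread = mc_dropout_predict([1.0, 2.0], w_hidden, w_out)
```

A case could then be flagged for referral when `spread` exceeds a chosen threshold; the threshold itself would have to be tuned on validation data.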

+

Several other techniques have been proposed for effectively estimating predictive uncertainty as uncertainty quantification for neural networks continues to be an active research area. Recently, McClure and Kriegeskorte observed that test-time sampling improved calibration of the probabilistic predictions, sampling weights led to more robust uncertainty estimates than sampling units, and spike-and-slab sampling is superior to Gaussian dropconnect and Bernoulli dropout [427]. Krueger et al. introduced Bayesian hypernetworks [428] as another framework for approximate Bayesian inference in deep learning, where an invertible generative hypernetwork maps isotropic Gaussian noise to parameters of the primary network allowing for computationally cheap sampling and efficient estimation of the posterior. Meanwhile, Lakshminarayanan et al. proposed using deep ensembles, which are traditionally used for boosting predictive performance, on standard (non-Bayesian) neural networks to obtain well-calibrated uncertainty estimates that are comparable to those obtained by Bayesian neural networks [429]. In cases where model uncertainty is known to be caused by a difference in training and testing distributions, domain adaptation based techniques can help mitigate the problem [430].

+

Despite the success and popularity of deep learning, some deep learning models can be surprisingly brittle. Researchers are actively working on modifications to deep learning frameworks to enable them to handle probability and embrace uncertainty. Most notably, Bayesian modeling and deep learning are being integrated with renewed enthusiasm. As a result, several opportunities for innovation arise: understanding the causes of model uncertainty can lead to novel optimization and regularization techniques, assessing the utility of uncertainty estimation techniques on various model architectures and structures can be very useful to practitioners, and extending Bayesian deep learning to unsupervised settings can be a significant breakthrough [431]. Unfortunately, uncertainty quantification techniques are underutilized in the computational biology communities and largely ignored in the current deep learning for biomedicine literature. Thus, the practical value of uncertainty quantification in biomedical domains is yet to be appreciated.

Interpretation

-

As deep learning models achieve state-of-the-art performance in a variety of domains, there is a growing need to make the models more interpretable. Interpretability matters for two main reasons. First, a model that achieves breakthrough performance may have identified patterns in the data that practitioners in the field would like to understand. However, this would not be possible if the model is a black box. Second, interpretability is important for trust. If a model is making medical diagnoses, it is important to ensure the model is making decisions for reliable reasons and is not focusing on an artifact of the data. A motivating example of this can be found in Ba and Caruana [424], where a model trained to predict the likelihood of death from pneumonia assigned lower risk to patients with asthma, but only because such patients were treated as higher priority by the hospital. In the context of deep learning, understanding the basis of a model’s output is particularly important as deep learning models are unusually susceptible to adversarial examples [425] and can output confidence scores over 99.99% for samples that resemble pure noise.

+

As deep learning models achieve state-of-the-art performance in a variety of domains, there is a growing need to make the models more interpretable. Interpretability matters for two main reasons. First, a model that achieves breakthrough performance may have identified patterns in the data that practitioners in the field would like to understand. However, this would not be possible if the model is a black box. Second, interpretability is important for trust. If a model is making medical diagnoses, it is important to ensure the model is making decisions for reliable reasons and is not focusing on an artifact of the data. A motivating example of this can be found in Ba and Caruana [432], where a model trained to predict the likelihood of death from pneumonia assigned lower risk to patients with asthma, but only because such patients were treated as higher priority by the hospital. In the context of deep learning, understanding the basis of a model’s output is particularly important as deep learning models are unusually susceptible to adversarial examples [433] and can output confidence scores over 99.99% for samples that resemble pure noise.

As the concept of interpretability is quite broad, many methods described as improving the interpretability of deep learning models take disparate and often complementary approaches.

Assigning example-specific importance scores

Several approaches ascribe importance on an example-specific basis to the parts of the input that are responsible for a particular output. These can be broadly divided into perturbation-based approaches and backpropagation-based approaches.

-

Perturbation-based approaches change parts of the input and observe the impact on the output of the network. Alipanahi et al. [200] and Zhou & Troyanskaya [204] scored genomic sequences by introducing virtual mutations at individual positions in the sequence and quantifying the change in the output. Umarov et al. [209] used a similar strategy, but with sliding windows where the sequence within each sliding window was substituted with a random sequence. Kelley et al. [214] inserted known protein-binding motifs into the centers of sequences and assessed the change in predicted accessibility. Ribeiro et al. [426] introduced LIME, which constructs a linear model to locally approximate the output of the network on perturbed versions of the input and assigns importance scores accordingly. For analyzing images, Zeiler and Fergus [427] applied constant-value masks to different input patches. More recently, marginalizing over the plausible values of an input has been suggested as a way to more accurately estimate contributions [428].

-

A common drawback to perturbation-based approaches is computational efficiency: each perturbed version of an input requires a separate forward propagation through the network to compute the output. As noted by Shrikumar et al. [206], such methods may also underestimate the impact of features that have saturated their contribution to the output, as can happen when multiple redundant features are present. To reduce the computational overhead of perturbation-based approaches, Fong and Vedaldi [429] solve an optimization problem using gradient descent to discover a minimal subset of inputs to perturb in order to decrease the predicted probability of a selected class. Their method converges in many fewer iterations but requires the perturbation to have a differentiable form.

-

Backpropagation-based methods, in which the signal from a target output neuron is propagated backwards to the input layer, are another way to interpret deep networks that sidesteps the inefficiencies of perturbation-based methods. A classic example of this is calculating the gradients of the output with respect to the input [430] to compute a “saliency map”. Bach et al. [431] proposed a strategy called Layerwise Relevance Propagation, which was shown to be equivalent to the element-wise product of the gradient and input [206,432]. Networks with Rectified Linear Units (ReLUs) create nonlinearities that must be addressed. Several variants exist for handling this [427,433]. Backpropagation-based methods are a highly active area of research. Researchers are still actively identifying weaknesses [434], and new methods are being developed to address them [206,435,436]. Lundberg and Lee [437] noted that several importance scoring methods including integrated gradients and LIME could all be considered approximations to Shapley values [438], which have a long history in game theory for assigning contributions to players in cooperative games.

+

Perturbation-based approaches change parts of the input and observe the impact on the output of the network. Alipanahi et al. [200] and Zhou & Troyanskaya [204] scored genomic sequences by introducing virtual mutations at individual positions in the sequence and quantifying the change in the output. Umarov et al. [209] used a similar strategy, but with sliding windows where the sequence within each sliding window was substituted with a random sequence. Kelley et al. [214] inserted known protein-binding motifs into the centers of sequences and assessed the change in predicted accessibility. Ribeiro et al. [434] introduced LIME, which constructs a linear model to locally approximate the output of the network on perturbed versions of the input and assigns importance scores accordingly. For analyzing images, Zeiler and Fergus [435] applied constant-value masks to different input patches. More recently, marginalizing over the plausible values of an input has been suggested as a way to more accurately estimate contributions [436].
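
The virtual-mutation strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any cited study: `toy_model` is a hypothetical stand-in for a trained network (it simply counts a `TATA` motif), and the helper names `in_silico_mutagenesis` and `position_importance` are invented for this example.

```python
from typing import Callable, Dict, List

ALPHABET = "ACGT"


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained network: counts 'TATA' occurrences."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def in_silico_mutagenesis(seq: str, model: Callable[[str], float]) -> List[Dict[str, float]]:
    """For each position, record the output change caused by each substitution."""
    reference = model(seq)
    scores = []
    for pos, ref_base in enumerate(seq):
        deltas = {}
        for alt in ALPHABET:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            # Every perturbed input costs one additional forward pass.
            deltas[alt] = model(mutant) - reference
        scores.append(deltas)
    return scores


def position_importance(scores: List[Dict[str, float]]) -> List[float]:
    """Summarize each position by its largest absolute output change."""
    return [max(abs(d) for d in deltas.values()) for deltas in scores]


seq = "GGTATAGG"
scores = in_silico_mutagenesis(seq, toy_model)
importance = position_importance(scores)
print(importance)
```

In this toy case only the four motif positions receive non-zero scores. Scoring a sequence of length L requires 3L forward passes beyond the reference prediction, one per substitution.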

+

A common drawback to perturbation-based approaches is computational efficiency: each perturbed version of an input requires a separate forward propagation through the network to compute the output. As noted by Shrikumar et al. [206], such methods may also underestimate the impact of features that have saturated their contribution to the output, as can happen when multiple redundant features are present. To reduce the computational overhead of perturbation-based approaches, Fong and Vedaldi [437] solve an optimization problem using gradient descent to discover a minimal subset of inputs to perturb in order to decrease the predicted probability of a selected class. Their method converges in many fewer iterations but requires the perturbation to have a differentiable form.

+

Backpropagation-based methods, in which the signal from a target output neuron is propagated backwards to the input layer, are another way to interpret deep networks that sidesteps the inefficiencies of perturbation-based methods. A classic example of this is calculating the gradients of the output with respect to the input [438] to compute a “saliency map”. Bach et al. [439] proposed a strategy called Layerwise Relevance Propagation, which was shown to be equivalent to the element-wise product of the gradient and input [206,440]. Networks with Rectified Linear Units (ReLUs) create nonlinearities that must be addressed. Several variants exist for handling this [435,441]. Backpropagation-based methods are a highly active area of research. Researchers are still actively identifying weaknesses [442], and new methods are being developed to address them [206,443,444]. Lundberg and Lee [445] noted that several importance scoring methods including integrated gradients and LIME could all be considered approximations to Shapley values [446], which have a long history in game theory for assigning contributions to players in cooperative games.
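
As a rough illustration of a saliency map and of the gradient-times-input score mentioned above, the following sketch differentiates a tiny one-hidden-layer ReLU network by hand. The weights `W`, `b`, and `v` and the input `x` are hypothetical values chosen for readability; a real model would rely on a framework's automatic differentiation instead.

```python
from typing import List, Tuple

# Hypothetical parameters of a tiny network: y = v . relu(W x + b)
W = [[1.0, -1.0, 0.0],
     [0.5, 0.5, 0.0]]
b = [0.0, -3.0]
v = [2.0, 1.0]


def forward(x: List[float]) -> Tuple[List[float], float]:
    """Return the hidden pre-activations and the scalar output."""
    pre = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
           for row, b_i in zip(W, b)]
    hidden = [max(0.0, p) for p in pre]
    y = sum(v_i * h_i for v_i, h_i in zip(v, hidden))
    return pre, y


def saliency(x: List[float]) -> List[float]:
    """Gradient of the output with respect to each input (chain rule)."""
    pre, _ = forward(x)
    # A ReLU passes gradient only through units with positive pre-activation.
    gate = [1.0 if p > 0 else 0.0 for p in pre]
    return [sum(v[i] * gate[i] * W[i][j] for i in range(len(W)))
            for j in range(len(x))]


def gradient_times_input(x: List[float]) -> List[float]:
    """Element-wise product of the gradient and the input."""
    return [g * x_j for g, x_j in zip(saliency(x), x)]


x = [3.0, 1.0, 5.0]
pre, y = forward(x)
grad = saliency(x)
attribution = gradient_times_input(x)
print(grad, attribution, y)
```

The second hidden unit is inactive for this input, so it contributes nothing to the gradient. The attributions happen to sum to the output here only because the single active unit has zero bias; that property should not be assumed in general.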

Matching or exaggerating the hidden representation

-

Another approach to understanding the network’s predictions is to find artificial inputs that produce similar hidden representations to a chosen example. This can elucidate the features that the network uses for prediction and drop the features that the network is insensitive to. In the context of natural images, Mahendran and Vedaldi [439] introduced the “inversion” visualization, which uses gradient descent and backpropagation to reconstruct the input from its hidden representation. The method required placing a prior on the input to favor results that resemble natural images. For genomic sequence, Finnegan and Song [440] used a Markov chain Monte Carlo algorithm to find the maximum-entropy distribution of inputs that produced a similar hidden representation to the chosen input.

-

A related idea is “caricaturization”, where an initial image is altered to exaggerate patterns that the network searches for [441]. This is done by maximizing the response of neurons that are active in the network, subject to some regularizing constraints. Mordvintsev et al. [442] leveraged caricaturization to generate aesthetically pleasing images using neural networks.

+

Another approach to understanding the network’s predictions is to find artificial inputs that produce similar hidden representations to a chosen example. This can elucidate the features that the network uses for prediction and drop the features that the network is insensitive to. In the context of natural images, Mahendran and Vedaldi [447] introduced the “inversion” visualization, which uses gradient descent and backpropagation to reconstruct the input from its hidden representation. The method required placing a prior on the input to favor results that resemble natural images. For genomic sequence, Finnegan and Song [448] used a Markov chain Monte Carlo algorithm to find the maximum-entropy distribution of inputs that produced a similar hidden representation to the chosen input.

+

A related idea is “caricaturization”, where an initial image is altered to exaggerate patterns that the network searches for [449]. This is done by maximizing the response of neurons that are active in the network, subject to some regularizing constraints. Mordvintsev et al. [450] leveraged caricaturization to generate aesthetically pleasing images using neural networks.

Activation maximization

-

Activation maximization can reveal patterns detected by an individual neuron in the network by generating images which maximally activate that neuron, subject to some regularizing constraints. This technique was first introduced in Erhan et al. [443] and applied in subsequent work [430,441,442,444]. Lanchantin et al. [201] applied class-based activation maximization to genomic sequence data. One drawback of this approach is that neural networks often learn highly distributed representations where several neurons cooperatively describe a pattern of interest. Thus, visualizing patterns learned by individual neurons may not always be informative.

+

Activation maximization can reveal patterns detected by an individual neuron in the network by generating images which maximally activate that neuron, subject to some regularizing constraints. This technique was first introduced in Erhan et al. [451] and applied in subsequent work [438,449,450,452]. Lanchantin et al. [201] applied class-based activation maximization to genomic sequence data. One drawback of this approach is that neural networks often learn highly distributed representations where several neurons cooperatively describe a pattern of interest. Thus, visualizing patterns learned by individual neurons may not always be informative.
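
The idea can be sketched for the simplest possible case: a single linear neuron with activation w·x and an L2 penalty as the regularizing constraint. The weights `w`, the penalty strength `l2`, and the step settings are hypothetical; real networks are nonlinear and would use a framework's automatic differentiation in place of the hand-written gradient.

```python
from typing import List


def activation_maximization(w: List[float], l2: float = 0.5,
                            lr: float = 0.1, steps: int = 500) -> List[float]:
    """Gradient ascent on the objective w.x - l2 * ||x||^2, starting from zeros."""
    x = [0.0] * len(w)
    for _ in range(steps):
        # Gradient of the objective with respect to x is w - 2 * l2 * x.
        x = [x_j + lr * (w_j - 2.0 * l2 * x_j) for x_j, w_j in zip(x, w)]
    return x


# Hypothetical weights of the neuron being visualized.
w = [1.0, -2.0, 0.0, 3.0]
x_star = activation_maximization(w)
print([round(value, 3) for value in x_star])
```

For this objective the optimum is w / (2 * l2), so with `l2 = 0.5` the synthesized input converges to the neuron's own weight pattern. Without the penalty the objective would be unbounded, which is one reason a regularizer is needed.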

RNN-specific approaches

-

Several interpretation methods are specifically tailored to recurrent neural network architectures. The most common form of interpretability provided by RNNs is through attention mechanisms, which have been used in diverse problems such as image captioning and machine translation to select portions of the input to focus on generating a particular output [445,446]. Deming et al. [447] applied the attention mechanism to models trained on genomic sequence. Attention mechanisms provide insight into the model’s decision-making process by revealing which portions of the input are used by different outputs. Singh et al. used a hierarchy of attention layers to locate important genome positions and signals for predicting gene expression from histone modifications [183]. In the clinical domain, Choi et al. [448] leveraged attention mechanisms to highlight which aspects of a patient’s medical history were most relevant for making diagnoses. Choi et al. [449] later extended this work to take into account the structure of disease ontologies and found that the concepts represented by the model aligned with medical knowledge. Note that interpretation strategies that rely on an attention mechanism do not provide insight into the logic used by the attention layer.

-

Visualizing the activation patterns of the hidden state of a recurrent neural network can also be instructive. Early work by Ghosh and Karamcheti [450] used cluster analysis to study hidden states of comparatively small networks trained to recognize strings from a finite state machine. More recently, Karpathy et al. [451] showed the existence of individual cells in LSTMs that kept track of quotes and brackets in character-level language models. To facilitate such analyses, LSTMVis [452] allows interactive exploration of the hidden state of LSTMs on different inputs.

-

Another strategy, adopted by Lanchantin et al. [201], looks at how the output of a recurrent neural network changes as longer and longer subsequences are supplied as input to the network, where the subsequences begin with just the first position and end with the entire sequence. In a binary classification task, this can identify those positions which are responsible for flipping the output of the network from negative to positive. If the RNN is bidirectional, the same process can be repeated on the reverse sequence. As noted by the authors, this approach was less effective at identifying motifs compared to the gradient-based backpropagation approach of Simonyan et al. [430], illustrating the need for more sophisticated strategies to assign importance scores in recurrent neural networks.

-

Murdoch and Szlam [453] showed that the output of an LSTM can be decomposed into a product of factors, where each factor can be interpreted as the contribution at a particular timestep. The contribution scores were then used to identify key phrases from a model trained for sentiment analysis and obtained superior results compared to scores derived via a gradient-based approach.

+

Several interpretation methods are specifically tailored to recurrent neural network architectures. The most common form of interpretability provided by RNNs is through attention mechanisms, which have been used in diverse problems such as image captioning and machine translation to select portions of the input to focus on generating a particular output [453,454]. Deming et al. [455] applied the attention mechanism to models trained on genomic sequence. Attention mechanisms provide insight into the model’s decision-making process by revealing which portions of the input are used by different outputs. Singh et al. used a hierarchy of attention layers to locate important genome positions and signals for predicting gene expression from histone modifications [183]. In the clinical domain, Choi et al. [456] leveraged attention mechanisms to highlight which aspects of a patient’s medical history were most relevant for making diagnoses. Choi et al. [457] later extended this work to take into account the structure of disease ontologies and found that the concepts represented by the model aligned with medical knowledge. Note that interpretation strategies that rely on an attention mechanism do not provide insight into the logic used by the attention layer.

+

Visualizing the activation patterns of the hidden state of a recurrent neural network can also be instructive. Early work by Ghosh and Karamcheti [458] used cluster analysis to study hidden states of comparatively small networks trained to recognize strings from a finite state machine. More recently, Karpathy et al. [459] showed the existence of individual cells in LSTMs that kept track of quotes and brackets in character-level language models. To facilitate such analyses, LSTMVis [460] allows interactive exploration of the hidden state of LSTMs on different inputs.

+

Another strategy, adopted by Lanchantin et al. [201], looks at how the output of a recurrent neural network changes as longer and longer subsequences are supplied as input to the network, where the subsequences begin with just the first position and end with the entire sequence. In a binary classification task, this can identify those positions which are responsible for flipping the output of the network from negative to positive. If the RNN is bidirectional, the same process can be repeated on the reverse sequence. As noted by the authors, this approach was less effective at identifying motifs compared to the gradient-based backpropagation approach of Simonyan et al. [438], illustrating the need for more sophisticated strategies to assign importance scores in recurrent neural networks.
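
A minimal sketch of this prefix-based analysis follows. It assumes only that the model can score a sequence of any length; `toy_classifier` is a hypothetical stand-in for a trained RNN, and the 0.5 threshold and function names are likewise illustrative choices rather than details taken from the cited work.

```python
from typing import Callable, List


def toy_classifier(prefix: str) -> float:
    """Hypothetical stand-in for an RNN: positive once 'TATA' has been seen."""
    return 1.0 if "TATA" in prefix else 0.0


def prefix_scores(seq: str, model: Callable[[str], float]) -> List[float]:
    """Score every prefix, from the first position up to the whole sequence."""
    return [model(seq[:end]) for end in range(1, len(seq) + 1)]


def flip_positions(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return 0-based positions where the call flips from negative to positive."""
    flips = []
    previous = False
    for idx, score in enumerate(scores):
        current = score > threshold
        if current and not previous:
            flips.append(idx)
        previous = current
    return flips


seq = "GGTATAGG"
scores = prefix_scores(seq, toy_classifier)
flips = flip_positions(scores)
print(scores, flips)
```

Here the prediction flips at the last base of the motif. For a bidirectional model, the same two functions could be applied to `seq[::-1]`.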

+

Murdoch and Szlam [461] showed that the output of an LSTM can be decomposed into a product of factors, where each factor can be interpreted as the contribution at a particular timestep. The contribution scores were then used to identify key phrases from a model trained for sentiment analysis and obtained superior results compared to scores derived via a gradient-based approach.

Latent space manipulation

-

Interpretation of embedded or latent space features learned through generative unsupervised models can reveal underlying patterns otherwise masked in the original input. Embedded feature interpretation has been emphasized mostly in image and text based applications [102,454], but applications to genomic and biomedical domains are increasing.

-

For example, Way and Greene trained a variational autoencoder (VAE) on gene expression from The Cancer Genome Atlas (TCGA) and used latent space arithmetic to rapidly isolate and interpret gene expression features descriptive of high grade serous ovarian cancer subtypes [455]. The most differentiating VAE features were representative of biological processes that are known to distinguish the subtypes. Latent space arithmetic with features derived using other compression algorithms was not as informative in this context [456]. Embedding discrete chemical structures with autoencoders and interpreting the learned continuous representations with latent space arithmetic has also facilitated predicting drug-like compounds [389]. Furthermore, embedding biomedical text into lower dimensional latent spaces has improved named entity recognition in a variety of tasks including annotating clinical abbreviations, genes, cell lines, and drug names [75–78].

-

Other approaches have used interpolation through latent space embeddings learned by GANs to interpret unobserved intermediate states. For example, Osokin et al. trained GANs on two-channel fluorescent microscopy images to interpret intermediate states of protein localization in yeast cells [457]. Goldsborough et al. trained a GAN on fluorescent microscopy images and used latent space interpolation and arithmetic to reveal underlying responses to small molecule perturbations in cell lines [458].

+

Interpretation of embedded or latent space features learned through generative unsupervised models can reveal underlying patterns otherwise masked in the original input. Embedded feature interpretation has been emphasized mostly in image and text based applications [102,462], but applications to genomic and biomedical domains are increasing.

+

For example, Way and Greene trained a variational autoencoder (VAE) on gene expression from The Cancer Genome Atlas (TCGA) and used latent space arithmetic to rapidly isolate and interpret gene expression features descriptive of high grade serous ovarian cancer subtypes [463]. The most differentiating VAE features were representative of biological processes that are known to distinguish the subtypes. Latent space arithmetic with features derived using other compression algorithms was not as informative in this context [464]. Embedding discrete chemical structures with autoencoders and interpreting the learned continuous representations with latent space arithmetic has also facilitated predicting drug-like compounds [384]. Furthermore, embedding biomedical text into lower dimensional latent spaces has improved named entity recognition in a variety of tasks including annotating clinical abbreviations, genes, cell lines, and drug names [75–78].
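
One simple form of latent space arithmetic is to subtract the mean latent vector of one group from that of another and rank the features by the size of the difference. The sketch below assumes that form; it is not necessarily the exact procedure of the cited studies, and the encodings, group names, and helper functions are hypothetical.

```python
from typing import List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Average a list of latent vectors feature by feature."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(a: List[float], b: List[float]) -> List[float]:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def top_features(diff: List[float], k: int = 1) -> List[int]:
    """Indices of the k latent features with the largest absolute difference."""
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return order[:k]


# Hypothetical latent encodings (rows are samples, columns are latent features).
subtype_a = [[0.9, 0.1, 2.0], [1.1, 0.3, 2.2]]
subtype_b = [[1.0, 0.2, 0.1], [1.0, 0.0, 0.3]]

difference = subtract(mean_vector(subtype_a), mean_vector(subtype_b))
ranked = top_features(difference, k=1)
print(difference, ranked)
```

The top-ranked latent feature (index 2 in this toy data) would then be traced back to the input genes that weight most heavily on it, which is where the biological interpretation takes place.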

+

Other approaches have used interpolation through latent space embeddings learned by GANs to interpret unobserved intermediate states. For example, Osokin et al. trained GANs on two-channel fluorescent microscopy images to interpret intermediate states of protein localization in yeast cells [465]. Goldsborough et al. trained a GAN on fluorescent microscopy images and used latent space interpolation and arithmetic to reveal underlying responses to small molecule perturbations in cell lines [466].

Miscellaneous approaches

-

It can often be informative to understand how the training data affects model learning. Toward this end, Koh and Liang [459] used influence functions, a technique from robust statistics, to trace a model’s predictions back through the learning algorithm to identify the datapoints in the training set that had the most impact on a given prediction. A more free-form approach to interpretability is to visualize the activation patterns of the network on individual inputs and on subsets of the data. ActiVis and CNNvis [460,461] are two frameworks that enable interactive visualization and exploration of large-scale deep learning models. An orthogonal strategy is to use a knowledge distillation approach to replace a deep learning model with a more interpretable model that achieves comparable performance. Towards this end, Che et al. [462] used gradient boosted trees to learn interpretable healthcare features from trained deep models.

-

Finally, it is sometimes possible to train the model to provide justifications for its predictions. Lei et al. [463] used a generator to identify “rationales”, which are short and coherent pieces of the input text that produce similar results to the whole input when passed through an encoder. The authors applied their approach to a sentiment analysis task and obtained substantially superior results compared to an attention-based method.

+

It can often be informative to understand how the training data affects model learning. Toward this end, Koh and Liang [467] used influence functions, a technique from robust statistics, to trace a model’s predictions back through the learning algorithm to identify the datapoints in the training set that had the most impact on a given prediction. A more free-form approach to interpretability is to visualize the activation patterns of the network on individual inputs and on subsets of the data. ActiVis and CNNvis [468,469] are two frameworks that enable interactive visualization and exploration of large-scale deep learning models. An orthogonal strategy is to use a knowledge distillation approach to replace a deep learning model with a more interpretable model that achieves comparable performance. Towards this end, Che et al. [470] used gradient boosted trees to learn interpretable healthcare features from trained deep models.

+

Finally, it is sometimes possible to train the model to provide justifications for its predictions. Lei et al. [471] used a generator to identify “rationales”, which are short and coherent pieces of the input text that produce similar results to the whole input when passed through an encoder. The authors applied their approach to a sentiment analysis task and obtained substantially superior results compared to an attention-based method.

Future outlook

While deep learning lags behind most Bayesian models in terms of interpretability, the interpretability of deep learning is comparable to or exceeds that of many other widely-used machine learning methods such as random forests or SVMs. While it is possible to obtain importance scores for different inputs in a random forest, the same is true for deep learning. Similarly, SVMs trained with a nonlinear kernel are not easily interpretable because the use of the kernel means that one does not obtain an explicit weight matrix. Finally, it is worth noting that some simple machine learning methods are less interpretable in practice than one might expect. A linear model trained on heavily engineered features might be difficult to interpret as the input features themselves are difficult to interpret. Similarly, a decision tree with many nodes and branches may also be difficult for a human to make sense of.

There are several directions that might benefit the development of interpretability techniques. The first is the introduction of gold standard benchmarks that different interpretability approaches could be compared against, similar in spirit to how datasets like ImageNet and CIFAR spurred the development of deep learning for computer vision. It would also be helpful if the community placed more emphasis on domains outside of computer vision. Computer vision is often used as the example application of interpretability methods, but it is not the domain with the most pressing need. Finally, closer integration of interpretability approaches with popular deep learning frameworks would make it easier for practitioners to apply and experiment with different approaches to understanding their deep learning models.

Data limitations

-

A lack of large-scale, high-quality, correctly labeled training data has impacted deep learning in nearly all applications we have discussed. The challenges of training complex, high-parameter neural networks from few examples are obvious, but uncertainty in the labels of those examples can be just as problematic. In genomics labeled data may be derived from an experimental assay with known and unknown technical artifacts, biases, and error profiles. It is possible to weight training examples or construct Bayesian models to account for uncertainty or non-independence in the data, as described in the TF binding example above. As another example, Park et al. [464] estimated shared non-biological signal between datasets to correct for non-independence related to assay platform or other factors in a Bayesian integration of many datasets. However, such techniques are rarely placed front and center in any description of methods and may be easily overlooked.

-

For some types of data, especially images, it is straightforward to augment training datasets by splitting a single labeled example into multiple examples. For example, an image can easily be rotated, flipped, or translated and retain its label [57]. 3D MRI and 4D fMRI (with time as a dimension) data can be decomposed into sets of 2D images [465]. This can greatly expand the number of training examples but artificially treats such derived images as independent instances and sacrifices the structure inherent in the data. CellCnn trains a model to recognize rare cell populations in single-cell data by creating training instances that consist of subsets of cells that are randomly sampled with replacement from the full dataset [284].

-

Simulated or semi-synthetic training data has been employed in multiple biomedical domains, though many of these ideas are not specific to deep learning. Training and evaluating on simulated data, for instance, generating synthetic TF binding sites with position weight matrices [203] or RNA-seq reads for predicting mRNA transcript boundaries [466], is a standard practice in bioinformatics. This strategy can help benchmark algorithms when the available gold standard dataset is imperfect, but it should be paired with an evaluation on real data, as in the prior examples [203,466]. In rare cases, models trained on simulated data have been successfully applied directly to real data [466].

-

Data can be simulated to create negative examples when only positive training instances are available. DANN [34] adopts this approach to predict the pathogenicity of genetic variants using semi-synthetic training data from Combined Annotation-Dependent Depletion (CADD) [467]. Though our emphasis here is on the training strategy, it should be noted that logistic regression outperformed DANN when distinguishing known pathogenic mutations from likely benign variants in real data. Similarly, a somatic mutation caller has been trained by injecting mutations into real sequencing datasets [328]. This method detected mutations in other semi-synthetic datasets but was not validated on real data.

+

A lack of large-scale, high-quality, correctly labeled training data has impacted deep learning in nearly all applications we have discussed. The challenges of training complex, high-parameter neural networks from few examples are obvious, but uncertainty in the labels of those examples can be just as problematic. In genomics labeled data may be derived from an experimental assay with known and unknown technical artifacts, biases, and error profiles. It is possible to weight training examples or construct Bayesian models to account for uncertainty or non-independence in the data, as described in the TF binding example above. As another example, Park et al. [472] estimated shared non-biological signal between datasets to correct for non-independence related to assay platform or other factors in a Bayesian integration of many datasets. However, such techniques are rarely placed front and center in any description of methods and may be easily overlooked.

+

For some types of data, especially images, it is straightforward to augment training datasets by splitting a single labeled example into multiple examples. For example, an image can easily be rotated, flipped, or translated and retain its label [57]. 3D MRI and 4D fMRI (with time as a dimension) data can be decomposed into sets of 2D images [473]. This can greatly expand the number of training examples but artificially treats such derived images as independent instances and sacrifices the structure inherent in the data. CellCnn trains a model to recognize rare cell populations in single-cell data by creating training instances that consist of subsets of cells that are randomly sampled with replacement from the full dataset [284].
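
The label-preserving transformations mentioned above can be sketched without any imaging library by treating an image as a nested list. The 2×2 `image`, the `"positive"` label, and the function names are hypothetical placeholders.

```python
from typing import List, Tuple

Image = List[List[int]]


def rotate90(image: Image) -> Image:
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image: Image) -> Image:
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in image]


def augment(image: Image, label: str) -> List[Tuple[Image, str]]:
    """Return the four rotations and their mirror images, all with one label."""
    examples = []
    current = [list(row) for row in image]
    for _ in range(4):
        examples.append((current, label))
        examples.append((flip_horizontal(current), label))
        current = rotate90(current)
    return examples


image = [[1, 2],
         [3, 4]]
augmented = augment(image, "positive")
print(len(augmented))
```

One labeled image becomes eight training examples. These derived examples are not independent of one another, so in this sketch they would all need to stay on the same side of any train/test split.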

+

Simulated or semi-synthetic training data has been employed in multiple biomedical domains, though many of these ideas are not specific to deep learning. Training and evaluating on simulated data, for instance, generating synthetic TF binding sites with position weight matrices [203] or RNA-seq reads for predicting mRNA transcript boundaries [474], is a standard practice in bioinformatics. This strategy can help benchmark algorithms when the available gold standard dataset is imperfect, but it should be paired with an evaluation on real data, as in the prior examples [203,474]. In rare cases, models trained on simulated data have been successfully applied directly to real data [474].

+

Data can be simulated to create negative examples when only positive training instances are available. DANN [34] adopts this approach to predict the pathogenicity of genetic variants using semi-synthetic training data from Combined Annotation-Dependent Depletion (CADD) [475]. Though our emphasis here is on the training strategy, it should be noted that logistic regression outperformed DANN when distinguishing known pathogenic mutations from likely benign variants in real data. Similarly, a somatic mutation caller has been trained by injecting mutations into real sequencing datasets [328]. This method detected mutations in other semi-synthetic datasets but was not validated on real data.

In settings where the experimental observations are biased toward positive instances, such as MHC protein and peptide ligand binding affinity [257], or the negative instances vastly outnumber the positives, such as high-throughput chemical screening [381], training datasets have been augmented by adding additional instances and assuming they are negative. There is some evidence that this can improve performance [381], but in other cases it was only beneficial when the real training datasets were extremely small [257]. Overall, training with simulated and semi-simulated data is a valuable idea for overcoming limited sample sizes but one that requires more rigorous evaluation on real ground-truth datasets before we can recommend it for widespread use. There is a risk that a model will easily discriminate synthetic examples but not generalize to real data.

-

Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data limitations to some degree. There are also emerging network architectures, such as Diet Networks for high-dimensional SNP data [468]. These use multiple networks to drastically reduce the number of free parameters by first flipping the problem and training a network to predict parameters (weights) for each input (SNP) to learn a feature embedding. This embedding (e.g. from principal component analysis, per class histograms, or a Word2vec [102] generalization) can be learned directly from input data or take advantage of other datasets or domain knowledge. Additionally, in this task the features are the examples, an important advantage when it is typical to have 500 thousand or more SNPs and only a few thousand patients. Finally, this embedding is of a much lower dimension, allowing for a large reduction in the number of free parameters. In the example given, the number of free parameters was reduced from 30 million to 50 thousand, a factor of 600.

+

Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data limitations to some degree. There are also emerging network architectures, such as Diet Networks for high-dimensional SNP data [476]. These use multiple networks to drastically reduce the number of free parameters by first flipping the problem and training a network to predict parameters (weights) for each input (SNP) to learn a feature embedding. This embedding (e.g. from principal component analysis, per class histograms, or a Word2vec [102] generalization) can be learned directly from input data or take advantage of other datasets or domain knowledge. Additionally, in this task the features are the examples, an important advantage when it is typical to have 500 thousand or more SNPs and only a few thousand patients. Finally, this embedding is of a much lower dimension, allowing for a large reduction in the number of free parameters. In the example given, the number of free parameters was reduced from 30 million to 50 thousand, a factor of 600.
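
The arithmetic behind that reduction can be made concrete with a small parameter count. The layer sizes below (`n_snps`, `n_hidden`, `embedding_dim`) are hypothetical values chosen only so that the totals match the 30 million and 50 thousand figures quoted above, and a single-layer auxiliary network is assumed for simplicity.

```python
def dense_parameters(n_inputs: int, n_outputs: int) -> int:
    """Number of weights in a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs


# Hypothetical sizes, chosen only to reproduce the quoted totals.
n_snps = 300_000      # input features (SNPs)
n_hidden = 100        # units in the first hidden layer
embedding_dim = 500   # size of each SNP's feature embedding

# Direct approach: one free weight per (SNP, hidden unit) pair.
direct = dense_parameters(n_snps, n_hidden)

# Auxiliary-network approach: a small network maps each SNP's embedding to
# that SNP's row of first-layer weights, so free parameters no longer scale
# with the number of SNPs.
predicted = dense_parameters(embedding_dim, n_hidden)

factor = direct // predicted
print(direct, predicted, factor)
```

Under these assumptions the free parameters in the first layer depend on the embedding size rather than on the number of SNPs, which is where the 600-fold reduction comes from.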

Hardware limitations and scaling

-

Efficiently scaling deep learning is challenging, and there is a high computational cost (e.g. time, memory, and energy) associated with training neural networks and using them to make predictions. This is one of the reasons why neural networks have only recently found widespread use [469].

-

Many have sought to curb these costs, with methods ranging from the very applied (e.g. reduced numerical precision [470–473]) to the exotic and theoretic (e.g. training small networks to mimic large networks and ensembles [424,474]). The largest gains in efficiency have come from computation with GPUs [469,475–479], which excel at the matrix and vector operations so central to deep learning. The massively parallel nature of GPUs allows additional optimizations, such as accelerated mini-batch gradient descent [476,477,480,481]. However, GPUs also have limited memory, making networks of useful size and complexity difficult to implement on a single GPU or machine [66,475]. This restriction has sometimes forced computational biologists to use workarounds or limit the size of an analysis. Chen et al. [181] inferred the expression level of all genes with a single neural network, but due to memory restrictions they randomly partitioned genes into two separately analyzed halves. In other cases, researchers limited the size of their neural network [28] or the total number of training instances [389]. Some have also chosen to use standard central processing unit (CPU) implementations rather than sacrifice network size or performance [482].

-

While steady improvements in GPU hardware may alleviate this issue, it is unclear whether advances will occur quickly enough to keep pace with the growing biological datasets and increasingly complex neural networks. Much has been done to minimize the memory requirements of neural networks [424,470–473,483,484], but there is also growing interest in specialized hardware, such as field-programmable gate arrays (FPGAs) [479,485] and application-specific integrated circuits (ASICs) [486]. Less software is available for such highly specialized hardware [485]. But specialized hardware promises improvements in deep learning at reduced time, energy, and memory [479]. Specialized hardware may be a difficult investment for those not solely interested in deep learning, but for those with a deep learning focus these solutions may become popular.

-

Distributed computing is a general solution to intense computational requirements and has enabled many large-scale deep learning efforts. Some types of distributed computation [487,488] are not suitable for deep learning [489], but much progress has been made. There now exist a number of algorithms [472,489,490], tools [491–493], and high-level libraries [494,495] for deep learning in a distributed environment, and it is possible to train very complex networks with limited infrastructure [496]. Besides handling very large networks, distributed or parallelized approaches offer other advantages, such as improved ensembling [497] or accelerated hyperparameter optimization [498,499].

-

Cloud computing, which has already seen wide adoption in genomics [500], could facilitate easier sharing of the large datasets common to biology [501,502], and may be key to scaling deep learning. Cloud computing affords researchers flexibility, and enables the use of specialized hardware (e.g. FPGAs, ASICs, GPUs) without major investment. As such, it could be easier to address the different challenges associated with the multitudinous layers and architectures available [503]. Though many are reluctant to store sensitive data (e.g. patient electronic health records) in the cloud, secure, regulation-compliant cloud services do exist [504].

+

Efficiently scaling deep learning is challenging, and there is a high computational cost (e.g. time, memory, and energy) associated with training neural networks and using them to make predictions. This is one of the reasons why neural networks have only recently found widespread use [477].

+

Many have sought to curb these costs, with methods ranging from the very applied (e.g. reduced numerical precision [478–481]) to the exotic and theoretic (e.g. training small networks to mimic large networks and ensembles [432,482]). The largest gains in efficiency have come from computation with GPUs [477,483–487], which excel at the matrix and vector operations so central to deep learning. The massively parallel nature of GPUs allows additional optimizations, such as accelerated mini-batch gradient descent [484,485,488,489]. However, GPUs also have limited memory, making networks of useful size and complexity difficult to implement on a single GPU or machine [66,483]. This restriction has sometimes forced computational biologists to use workarounds or limit the size of an analysis. Chen et al. [181] inferred the expression level of all genes with a single neural network, but due to memory restrictions they randomly partitioned genes into two separately analyzed halves. In other cases, researchers limited the size of their neural network [28] or the total number of training instances [384]. Some have also chosen to use standard central processing unit (CPU) implementations rather than sacrifice network size or performance [490].
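
As a rough illustration of why reduced numerical precision eases GPU memory pressure, the sketch below estimates the memory needed to store a network's weights at different floating-point widths. The parameter count (30 million) and the helper name `weight_memory_mib` are illustrative assumptions. The estimate covers weights only and ignores activations, gradients, and optimizer state, which typically add to the total.

```python
# Storage size of common IEEE 754 floating-point formats, in bytes.
BYTES_PER_VALUE = {"float64": 8, "float32": 4, "float16": 2}


def weight_memory_mib(n_params: int, dtype: str) -> float:
    """Memory needed to hold n_params weights of the given dtype, in MiB."""
    return n_params * BYTES_PER_VALUE[dtype] / 2**20


n_params = 30_000_000  # assumed network size, for illustration only

for dtype in ("float64", "float32", "float16"):
    print(f"{dtype}: {weight_memory_mib(n_params, dtype):.1f} MiB")
```

Halving the width of each stored value halves the weight memory, which is one reason lower-precision formats are attractive when a model does not fit on a single device.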

+

While steady improvements in GPU hardware may alleviate this issue, it is unclear whether advances will occur quickly enough to keep pace with the growing biological datasets and increasingly complex neural networks. Much has been done to minimize the memory requirements of neural networks [432,478–481,491,492], but there is also growing interest in specialized hardware, such as field-programmable gate arrays (FPGAs) [487,493] and application-specific integrated circuits (ASICs) [494]. Less software is available for such highly specialized hardware [493]. But specialized hardware promises improvements in deep learning at reduced time, energy, and memory [487]. Specialized hardware may be a difficult investment for those not solely interested in deep learning, but for those with a deep learning focus these solutions may become popular.

+

Distributed computing is a general solution to intense computational requirements and has enabled many large-scale deep learning efforts. Some types of distributed computation [495,496] are not suitable for deep learning [497], but much progress has been made. There now exist a number of algorithms [480,497,498], tools [499–501], and high-level libraries [502,503] for deep learning in a distributed environment, and it is possible to train very complex networks with limited infrastructure [504]. Besides handling very large networks, distributed or parallelized approaches offer other advantages, such as improved ensembling [505] or accelerated hyperparameter optimization [506,507].
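
One common way to distribute training is synchronous data parallelism, in which each worker computes a gradient on its own shard of a mini-batch and the results are averaged. The sketch below is a minimal single-process illustration of that idea, not a description of any cited tool. The one-parameter model, the toy data, and the function names are all assumptions made for the example.

```python
import math


def mse_gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, n_workers):
    """Split the batch into equal shards, compute each shard's gradient
    (as separate workers would), then average the shard gradients."""
    shard_size, remainder = divmod(len(batch), n_workers)
    if remainder:
        raise ValueError("batch size must be divisible by n_workers")
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    return sum(mse_gradient(w, shard) for shard in shards) / n_workers


# Toy data generated from y = 3x + 1 (hypothetical, for illustration).
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]

full = mse_gradient(0.5, data)
parallel = data_parallel_gradient(0.5, data, n_workers=4)
print(full, parallel)
```

With equally sized shards, the averaged shard gradients match the full-batch gradient, so the workers can proceed independently and synchronize only when exchanging gradients.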

+

Cloud computing, which has already seen wide adoption in genomics [508], could facilitate easier sharing of the large datasets common to biology [509,510], and may be key to scaling deep learning. Cloud computing affords researchers flexibility, and enables the use of specialized hardware (e.g. FPGAs, ASICs, GPUs) without major investment. As such, it could be easier to address the different challenges associated with the multitudinous layers and architectures available [511]. Though many are reluctant to store sensitive data (e.g. patient electronic health records) in the cloud, secure, regulation-compliant cloud services do exist [512].

Data, code, and model sharing

-

A robust culture of data, code, and model sharing would speed advances in this domain. The cultural barriers to data sharing in particular are perhaps best captured by the use of the term “research parasite” to describe scientists who use data from other researchers [505]. A field that honors only discoveries and not the hard work of generating useful data will have difficulty encouraging scientists to share their hard-won data. It’s precisely those data that would help to power deep learning in the domain. Efforts are underway to recognize those who promote an ecosystem of rigorous sharing and analysis [506].

-

The sharing of high-quality, labeled datasets will be especially valuable. In addition, researchers who invest time to preprocess datasets to be suitable for deep learning can make the preprocessing code (e.g. Basset [214] and variationanalysis [326]) and cleaned data (e.g. MoleculeNet [387]) publicly available to catalyze further research. However, there are complex privacy and legal issues involved in sharing patient data that cannot be ignored. Solving these issues will require increased understanding of privacy risks and standards specifying acceptable levels. In some domains high-quality training data has been generated privately, i.e. high-throughput chemical screening data at pharmaceutical companies. One perspective is that there is little expectation or incentive for this private data to be shared. However, data are not inherently valuable. Instead, the insights that we glean from them are where the value lies. Private companies may establish a competitive advantage by releasing data sufficient for improved methods to be developed. Recently, Ramsundar et al. did this with an open source platform DeepChem, where they released four privately generated datasets [507].

-

Code sharing and open source licensing are essential for continued progress in this domain. We strongly advocate following established best practices for sharing source code, archiving code in repositories that generate digital object identifiers, and open licensing [508] regardless of the minimal requirements, or lack thereof, set by journals, conferences, or preprint servers. In addition, it is important for authors to share not only code for their core models but also scripts and code used for data cleaning (see above) and hyperparameter optimization. These improve reproducibility and serve as documentation of the detailed decisions that impact model performance but may not be exhaustively captured in a manuscript’s methods text.

-

Because many deep learning models are often built using one of several popular software frameworks, it is also possible to directly share trained predictive models. The availability of pre-trained models can accelerate research, with image classifiers as an apt example. A pre-trained neural network can be quickly fine-tuned on new data and used in transfer learning, as discussed below. Taking this idea to the extreme, genomic data has been artificially encoded as images in order to benefit from pre-trained image classifiers [325]. “Model zoos” – collections of pre-trained models – are not yet common in biomedical domains but have started to appear in genomics applications [280,509]. However, it is important to note that sharing models trained on individual data requires great care because deep learning models can be attacked to identify examples used in training. One possible solution to protect individual samples includes training models under differential privacy [152], which has been used in the biomedical domain [155]. We discussed this issue as well as recent techniques to mitigate these concerns in the patient categorization section.

-

DeepChem [386–388] and DragoNN [509] exemplify the benefits of sharing pre-trained models and code under an open source license. DeepChem, which targets drug discovery and quantum chemistry, has actively encouraged and received community contributions of learning algorithms and benchmarking datasets. As a consequence, it now supports a large suite of machine learning approaches, both deep learning and competing strategies, that can be run on diverse test cases. This realistic, continual evaluation will play a critical role in assessing which techniques are most promising for chemical screening and drug discovery. Like formal, organized challenges such as the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge [510], DeepChem provides a forum for the fair, critical evaluations that are not always conducted in individual methodological papers, which can be biased toward favoring a new proposed algorithm. Likewise DragoNN (Deep RegulAtory GenOmic Neural Networks) offers not only code and a model zoo but also a detailed tutorial and partner package for simulating training data. These resources, especially the ability to simulate datasets that are sufficiently complex to demonstrate the challenges of training neural networks but small enough to train quickly on a CPU, are important for training students and attracting machine learning researchers to problems in genomics and healthcare.

+

A robust culture of data, code, and model sharing would speed advances in this domain. The cultural barriers to data sharing in particular are perhaps best captured by the use of the term “research parasite” to describe scientists who use data from other researchers [513]. A field that honors only discoveries and not the hard work of generating useful data will have difficulty encouraging scientists to share their hard-won data. It’s precisely those data that would help to power deep learning in the domain. Efforts are underway to recognize those who promote an ecosystem of rigorous sharing and analysis [514].

+

The sharing of high-quality, labeled datasets will be especially valuable. In addition, researchers who invest time to preprocess datasets to be suitable for deep learning can make the preprocessing code (e.g. Basset [214] and variationanalysis [326]) and cleaned data (e.g. MoleculeNet [393]) publicly available to catalyze further research. However, there are complex privacy and legal issues involved in sharing patient data that cannot be ignored. Solving these issues will require increased understanding of privacy risks and standards specifying acceptable levels. In some domains high-quality training data has been generated privately, i.e. high-throughput chemical screening data at pharmaceutical companies. One perspective is that there is little expectation or incentive for this private data to be shared. However, data are not inherently valuable. Instead, the insights that we glean from them are where the value lies. Private companies may establish a competitive advantage by releasing data sufficient for improved methods to be developed. Recently, Ramsundar et al. did this with an open source platform DeepChem, where they released four privately generated datasets [515].

+

Code sharing and open source licensing are essential for continued progress in this domain. We strongly advocate following established best practices for sharing source code, archiving code in repositories that generate digital object identifiers, and open licensing [516] regardless of the minimal requirements, or lack thereof, set by journals, conferences, or preprint servers. In addition, it is important for authors to share not only code for their core models but also scripts and code used for data cleaning (see above) and hyperparameter optimization. These improve reproducibility and serve as documentation of the detailed decisions that impact model performance but may not be exhaustively captured in a manuscript’s methods text.

+

Because many deep learning models are often built using one of several popular software frameworks, it is also possible to directly share trained predictive models. The availability of pre-trained models can accelerate research, with image classifiers as an apt example. A pre-trained neural network can be quickly fine-tuned on new data and used in transfer learning, as discussed below. Taking this idea to the extreme, genomic data has been artificially encoded as images in order to benefit from pre-trained image classifiers [325]. “Model zoos” – collections of pre-trained models – are not yet common in biomedical domains but have started to appear in genomics applications [280,517]. However, it is important to note that sharing models trained on individual data requires great care because deep learning models can be attacked to identify examples used in training. One possible solution to protect individual samples includes training models under differential privacy [152], which has been used in the biomedical domain [155]. We discussed this issue as well as recent techniques to mitigate these concerns in the patient categorization section.

+

DeepChem [389,393,395] and DragoNN [517] exemplify the benefits of sharing pre-trained models and code under an open source license. DeepChem, which targets drug discovery and quantum chemistry, has actively encouraged and received community contributions of learning algorithms and benchmarking datasets. As a consequence, it now supports a large suite of machine learning approaches, both deep learning and competing strategies, that can be run on diverse test cases. This realistic, continual evaluation will play a critical role in assessing which techniques are most promising for chemical screening and drug discovery. Like formal, organized challenges such as the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge [518], DeepChem provides a forum for the fair, critical evaluations that are not always conducted in individual methodological papers, which can be biased toward favoring a new proposed algorithm. Likewise DragoNN (Deep RegulAtory GenOmic Neural Networks) offers not only code and a model zoo but also a detailed tutorial and partner package for simulating training data. These resources, especially the ability to simulate datasets that are sufficiently complex to demonstrate the challenges of training neural networks but small enough to train quickly on a CPU, are important for training students and attracting machine learning researchers to problems in genomics and healthcare.

Multimodal, multi-task, and transfer learning

-

The fact that biomedical datasets often contain a limited number of instances or labels can cause poor performance of deep learning algorithms. These models are particularly prone to overfitting due to their high representational power. However, transfer learning techniques, also known as domain adaptation, enable transfer of extracted patterns between different datasets and even domains. This approach consists of training a model for the base task and subsequently reusing the trained model for the target problem. The first step allows a model to take advantage of a larger amount of data and/or labels to extract better feature representations. Transferring learned features in deep neural networks improves performance compared to randomly initialized features even when pre-training and target sets are dissimilar. However, transferability of features decreases as the distance between the base task and target task increases [511].

-

In image analysis, previous examples of deep transfer learning applications proved large-scale natural image sets [40] to be useful for pre-training models that serve as generic feature extractors for various types of biological images [14,270,512,513]. More recently, deep learning models predicted protein sub-cellular localization for proteins not originally present in a training set [514]. Moreover, learned features performed reasonably well even when applied to images obtained using different fluorescent labels, imaging techniques, and different cell types [515]. However, there are no established theoretical guarantees for feature transferability between distant domains such as natural images and various modalities of biological imaging. Because learned patterns are represented in deep neural networks in a layer-wise hierarchical fashion, this issue is usually addressed by fixing an empirically chosen number of layers that preserve generic characteristics of both training and target datasets. The model is then fine-tuned by re-training top layers on the specific dataset in order to re-learn domain-specific high level concepts (e.g. fine-tuning for radiology image classification [52]). Fine-tuning on specific biological datasets enables more focused predictions.

-

In genomics, the Basset package [214] for predicting chromatin accessibility was shown to rapidly learn and accurately predict on new data by leveraging a model pre-trained on available public data. To simulate this scenario, authors put aside 15 of 164 cell type datasets and trained the Basset model on the remaining 149 datasets. Then, they fine-tuned the model with one training pass of each of the remaining datasets and achieved results close to the model trained on all 164 datasets together. In another example, Min et al. [215] demonstrated how training on the experimentally-validated FANTOM5 permissive enhancer dataset followed by fine-tuning on ENCODE enhancer datasets improved cell type-specific predictions, outperforming state-of-the-art results. In drug design, general RNN models trained to generate molecules from the ChEMBL database have been fine-tuned to produce drug-like compounds for specific targets [398,401].

-

Related to transfer learning, multimodal learning assumes simultaneous learning from various types of inputs, such as images and text. It can capture features that describe common concepts across input modalities. Generative graphical models like RBMs, deep Boltzmann machines, and DBNs, demonstrate successful extraction of more informative features for one modality (images or video) when jointly learned with other modalities (audio or text) [516]. Deep graphical models such as DBNs are well-suited for multimodal learning tasks because they learn a joint probability distribution from inputs. They can be pre-trained in an unsupervised fashion on large unlabeled data and then fine-tuned on a smaller number of labeled examples. When labels are available, convolutional neural networks are ubiquitously used because they can be trained end-to-end with backpropagation and demonstrate state-of-the-art performance in many discriminative tasks [14].

-

Jha et al. [190] showed that integrated training delivered better performance than individual networks. They compared a number of feed-forward architectures trained on RNA-seq data with and without an additional set of CLIP-seq, knockdown, and over-expression based input features. The integrative deep model generalized well for combined data, offering a large performance improvement for alternative splicing event estimation. Chaudhary et al. [517] trained a deep autoencoder model jointly on RNA-seq, miRNA-seq, and methylation data from TCGA to predict survival subgroups of hepatocellular carcinoma patients. This multimodal approach that treated different omic data types as different modalities outperformed both traditional methods (principal component analysis) and single-omic models. Interestingly, multi-omic model performance did not improve when combined with clinical information, suggesting that the model was able to capture redundant contributions of clinical features through their correlated genomic features. Chen et al. [176] used deep belief networks to learn phosphorylation states of a common set of signaling proteins in primary cultured bronchial cells collected from rats and humans treated with distinct stimuli. By interpreting species as different modalities representing similar high-level concepts, they showed that DBNs were able to capture cross-species representation of signaling mechanisms in response to a common stimulus. Another application used DBNs for joint unsupervised feature learning from cancer datasets containing gene expression, DNA methylation, and miRNA expression data [184]. This approach allowed for the capture of intrinsic relationships in different modalities and for better clustering performance over conventional k-means.

-

Multimodal learning with CNNs is usually implemented as a collection of individual networks in which each learns representations from a single data type. These individual representations are further concatenated before or within fully-connected layers. FIDDLE [518] is an example of a multimodal CNN that represents an ensemble of individual networks that take NET-seq, MNase-seq, ChIP-seq, RNA-seq, and raw DNA sequence as input to predict transcription start sites. The combined model radically improves performance over separately trained datatype-specific networks, suggesting that it learns the synergistic relationship between datasets.

-

Multi-task learning is an approach related to transfer learning. In a multi-task learning framework, a model learns a number of tasks simultaneously such that features are shared across them. DeepSEA [204] implemented multi-task joint learning of diverse chromatin factors from raw DNA sequence. This allowed a sequence feature that was effective in recognizing binding of a specific TF to be simultaneously used by another predictor for a physically interacting TF. Similarly, TFImpute [191] learned information shared across transcription factors and cell lines to predict cell-specific TF binding for TF-cell line combinations. Yoon et al. [101] demonstrated that predicting the primary cancer site from cancer pathology reports together with its laterality substantially improved the performance for the latter task, indicating that multi-task learning can effectively leverage the commonality between two tasks using a shared representation. Many studies employed multi-task learning to predict chemical bioactivity [374,377] and drug toxicity [378,519]. Kearnes et al. [371] systematically compared single-task and multi-task models for ADMET properties and found that multi-task learning generally improved performance. Smaller datasets tended to benefit more than larger datasets.

-

Multi-task learning is complementary to multimodal and transfer learning. All three techniques can be used together in the same model. For example, Zhang et al. [512] combined deep model-based transfer and multi-task learning for cross-domain image annotation. One could imagine extending that approach to multimodal inputs as well. A common characteristic of these methods is better generalization of extracted features at various hierarchical levels of abstraction, which is attained by leveraging relationships between various inputs and task objectives.

-

Despite demonstrated improvements, transfer learning approaches pose challenges. There are no theoretically sound principles for pre-training and fine-tuning. Best practice recommendations are heuristic and must account for additional hyper-parameters that depend on specific deep architectures, sizes of the pre-training and target datasets, and similarity of domains. However, similarity of datasets and domains in transfer learning and relatedness of tasks in multi-task learning are difficult to assess. Most studies address these limitations by empirical evaluation of the model. Unfortunately, negative results are typically not reported. A deep CNN trained on natural images boosts performance in radiographic images [52]. However, due to differences in imaging domains, the target task required either re-training the initial model from scratch with special pre-processing or fine-tuning of the whole network on radiographs with heavy data augmentation to avoid overfitting. Exclusively fine-tuning top layers led to much lower validation accuracy (81.4 versus 99.5). Fine-tuning the aforementioned Basset model with more than one pass resulted in overfitting [214]. DeepChem successfully improved results for low-data drug discovery with one-shot learning for related tasks. However, it clearly demonstrated the limitations of cross-task generalization across unrelated tasks in one-shot models, specifically nuclear receptor assays and patient adverse reactions [386].

+

The fact that biomedical datasets often contain a limited number of instances or labels can cause poor performance of deep learning algorithms. These models are particularly prone to overfitting due to their high representational power. However, transfer learning techniques, also known as domain adaptation, enable transfer of extracted patterns between different datasets and even domains. This approach consists of training a model for the base task and subsequently reusing the trained model for the target problem. The first step allows a model to take advantage of a larger amount of data and/or labels to extract better feature representations. Transferring learned features in deep neural networks improves performance compared to randomly initialized features even when pre-training and target sets are dissimilar. However, transferability of features decreases as the distance between the base task and target task increases [519].

+

In image analysis, previous examples of deep transfer learning applications proved large-scale natural image sets [40] to be useful for pre-training models that serve as generic feature extractors for various types of biological images [14,270,520,521]. More recently, deep learning models predicted protein sub-cellular localization for proteins not originally present in a training set [522]. Moreover, learned features performed reasonably well even when applied to images obtained using different fluorescent labels, imaging techniques, and different cell types [523]. However, there are no established theoretical guarantees for feature transferability between distant domains such as natural images and various modalities of biological imaging. Because learned patterns are represented in deep neural networks in a layer-wise hierarchical fashion, this issue is usually addressed by fixing an empirically chosen number of layers that preserve generic characteristics of both training and target datasets. The model is then fine-tuned by re-training top layers on the specific dataset in order to re-learn domain-specific high level concepts (e.g. fine-tuning for radiology image classification [52]). Fine-tuning on specific biological datasets enables more focused predictions.
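
The fix-then-fine-tune procedure described above can be sketched in a few lines. This is a deliberately tiny stand-in, not a reproduction of any cited model: the "pre-trained" lower layers are replaced by a fixed feature function, the top layer is a logistic unit, and the data, learning rate, and function names are all assumptions made for illustration.

```python
import math


def frozen_features(x):
    """Stand-in for pre-trained lower layers; nothing here is ever updated."""
    return [x, x * x]


def predict(head, x):
    """Top layer: logistic unit applied to the frozen features."""
    w0, w1, b = head
    f0, f1 = frozen_features(x)
    return 1.0 / (1.0 + math.exp(-(w0 * f0 + w1 * f1 + b)))


def log_loss(head, data):
    """Mean negative log-likelihood of the binary labels."""
    total = 0.0
    for x, label in data:
        p = predict(head, x)
        total -= math.log(p if label == 1 else 1.0 - p)
    return total / len(data)


def fine_tune(head, data, lr=0.1, epochs=200):
    """Gradient descent on log loss that updates only the top-layer weights."""
    head = list(head)
    for _ in range(epochs):
        grads = [0.0, 0.0, 0.0]
        for x, label in data:
            f0, f1 = frozen_features(x)
            err = predict(head, x) - label  # derivative of log loss w.r.t. the logit
            grads[0] += err * f0
            grads[1] += err * f1
            grads[2] += err
        head = [h - lr * g / len(data) for h, g in zip(head, grads)]
    return head


# Hypothetical target task: label is 1 when |x| > 1.
target_data = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]

initial_head = [0.0, 0.0, 0.0]
tuned_head = fine_tune(initial_head, target_data)
print(log_loss(initial_head, target_data), log_loss(tuned_head, target_data))
```

Only the three top-layer parameters change during training, which keeps the number of values fitted on the small target dataset low. Whether the frozen features suit the target domain is exactly the empirical question raised in the paragraph above.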

+

In genomics, the Basset package [214] for predicting chromatin accessibility was shown to rapidly learn and accurately predict on new data by leveraging a model pre-trained on available public data. To simulate this scenario, authors put aside 15 of 164 cell type datasets and trained the Basset model on the remaining 149 datasets. Then, they fine-tuned the model with one training pass of each of the remaining datasets and achieved results close to the model trained on all 164 datasets together. In another example, Min et al. [215] demonstrated how training on the experimentally-validated FANTOM5 permissive enhancer dataset followed by fine-tuning on ENCODE enhancer datasets improved cell type-specific predictions, outperforming state-of-the-art results. In drug design, general RNN models trained to generate molecules from the ChEMBL database have been fine-tuned to produce drug-like compounds for specific targets [406,409].

+

Related to transfer learning, multimodal learning assumes simultaneous learning from various types of inputs, such as images and text. It can capture features that describe common concepts across input modalities. Generative graphical models like RBMs, deep Boltzmann machines, and DBNs, demonstrate successful extraction of more informative features for one modality (images or video) when jointly learned with other modalities (audio or text) [524]. Deep graphical models such as DBNs are well-suited for multimodal learning tasks because they learn a joint probability distribution from inputs. They can be pre-trained in an unsupervised fashion on large unlabeled data and then fine-tuned on a smaller number of labeled examples. When labels are available, convolutional neural networks are ubiquitously used because they can be trained end-to-end with backpropagation and demonstrate state-of-the-art performance in many discriminative tasks [14].

+

Jha et al. [190] showed that integrated training delivered better performance than individual networks. They compared a number of feed-forward architectures trained on RNA-seq data with and without an additional set of CLIP-seq, knockdown, and over-expression based input features. The integrative deep model generalized well for combined data, offering a large performance improvement for alternative splicing event estimation. Chaudhary et al. [525] trained a deep autoencoder model jointly on RNA-seq, miRNA-seq, and methylation data from TCGA to predict survival subgroups of hepatocellular carcinoma patients. This multimodal approach that treated different omic data types as different modalities outperformed both traditional methods (principal component analysis) and single-omic models. Interestingly, multi-omic model performance did not improve when combined with clinical information, suggesting that the model was able to capture redundant contributions of clinical features through their correlated genomic features. Chen et al. [176] used deep belief networks to learn phosphorylation states of a common set of signaling proteins in primary cultured bronchial cells collected from rats and humans treated with distinct stimuli. By interpreting species as different modalities representing similar high-level concepts, they showed that DBNs were able to capture cross-species representation of signaling mechanisms in response to a common stimulus. Another application used DBNs for joint unsupervised feature learning from cancer datasets containing gene expression, DNA methylation, and miRNA expression data [184]. This approach allowed for the capture of intrinsic relationships in different modalities and for better clustering performance over conventional k-means.

+

Multimodal learning with CNNs is usually implemented as a collection of individual networks in which each learns representations from a single data type. These individual representations are further concatenated before or within fully-connected layers. FIDDLE [526] is an example of a multimodal CNN that represents an ensemble of individual networks that take NET-seq, MNase-seq, ChIP-seq, RNA-seq, and raw DNA sequence as input to predict transcription start sites. The combined model radically improves performance over separately trained datatype-specific networks, suggesting that it learns the synergistic relationship between datasets.
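The concatenation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FIDDLE architecture: plain dense layers stand in for the convolutional sub-networks, the weights are random rather than trained, and the two modalities, their sizes, and all names (`seq_features`, `expr_features`, `init_layer`) are hypothetical.

```python
import random


def affine(inputs, weights, biases):
    """Compute weights @ inputs + biases using plain Python lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)

# Two hypothetical input modalities measured on the same sample.
seq_features = [0.2, 0.7, 0.1, 0.9]   # modality A, 4 values
expr_features = [1.5, 0.3, 0.8]       # modality B, 3 values

# One encoder per modality; each produces its own representation.
encoder_a = init_layer(4, 5, rng)
encoder_b = init_layer(3, 2, rng)
rep_a = relu(affine(seq_features, *encoder_a))
rep_b = relu(affine(expr_features, *encoder_b))

# Fusion: concatenate the per-modality representations.
fused = rep_a + rep_b

# A shared fully-connected layer operates on the fused vector.
head = init_layer(len(fused), 1, rng)
prediction = affine(fused, *head)[0]
```

In a trained model the encoders and the shared layer would be optimized jointly, so that the error signal from the final prediction shapes every modality-specific representation.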

+

Multi-task learning is an approach related to transfer learning. In a multi-task learning framework, a model learns a number of tasks simultaneously such that features are shared across them. DeepSEA [204] implemented multi-task joint learning of diverse chromatin factors from raw DNA sequence. This allowed a sequence feature that was effective in recognizing binding of a specific TF to be simultaneously used by another predictor for a physically interacting TF. Similarly, TFImpute [191] learned information shared across transcription factors and cell lines to predict cell-specific TF binding for TF-cell line combinations. Yoon et al. [101] demonstrated that predicting the primary cancer site from cancer pathology reports together with its laterality substantially improved the performance for the latter task, indicating that multi-task learning can effectively leverage the commonality between two tasks using a shared representation. Many studies employed multi-task learning to predict chemical bioactivity [373,377] and drug toxicity [378,527]. Kearnes et al. [371] systematically compared single-task and multi-task models for ADMET properties and found that multi-task learning generally improved performance. Smaller datasets tended to benefit more than larger datasets.
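The idea of sharing features across tasks can be made concrete with a small sketch. It assumes one common arrangement, sometimes called hard parameter sharing: a single hidden layer feeds one output head per task, and the per-task losses are summed into one objective. The task names, layer sizes, and random weights are hypothetical and do not reproduce any of the cited models.

```python
import math
import random


def affine(inputs, weights, biases):
    """Compute weights @ inputs + biases using plain Python lists."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def binary_cross_entropy(prob, label):
    """Loss for one binary prediction; eps guards against log(0)."""
    eps = 1e-12
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1.0 - prob + eps))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)

# Parameters shared by every task.
shared = init_layer(6, 4, rng)

# One small output head per (hypothetical) task.
heads = {"task_a": init_layer(4, 1, rng),
         "task_b": init_layer(4, 1, rng)}


def forward(features):
    """Return one predicted probability per task from shared features."""
    hidden = relu(affine(features, *shared))
    return {name: sigmoid(affine(hidden, *params)[0])
            for name, params in heads.items()}


features = [0.1, 0.0, 1.0, 0.4, 0.0, 0.7]
labels = {"task_a": 1, "task_b": 0}

probs = forward(features)

# The joint objective is the sum of the per-task losses, so gradients
# from every task would update the shared layer during training.
total_loss = sum(binary_cross_entropy(probs[t], labels[t]) for t in labels)
```

Because every task contributes to the gradient of the shared layer, a task with few labeled examples can borrow statistical strength from related tasks, which is consistent with the observation that smaller datasets tended to benefit more.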

+

Multi-task learning is complementary to multimodal and transfer learning. All three techniques can be used together in the same model. For example, Zhang et al. [520] combined deep model-based transfer and multi-task learning for cross-domain image annotation. One could imagine extending that approach to multimodal inputs as well. A common characteristic of these methods is better generalization of extracted features at various hierarchical levels of abstraction, which is attained by leveraging relationships between various inputs and task objectives.

+

Despite demonstrated improvements, transfer learning approaches pose challenges. There are no theoretically sound principles for pre-training and fine-tuning. Best practice recommendations are heuristic and must account for additional hyper-parameters that depend on specific deep architectures, sizes of the pre-training and target datasets, and similarity of domains. However, similarity of datasets and domains in transfer learning and relatedness of tasks in multi-task learning are difficult to assess. Most studies address these limitations by empirical evaluation of the model. Unfortunately, negative results are typically not reported. A deep CNN trained on natural images boosts performance in radiographic images [52]. However, due to differences in imaging domains, the target task required either re-training the initial model from scratch with special pre-processing or fine-tuning of the whole network on radiographs with heavy data augmentation to avoid overfitting. Exclusively fine-tuning top layers led to much lower validation accuracy (81.4 versus 99.5). Fine-tuning the aforementioned Basset model with more than one pass resulted in overfitting [214]. DeepChem successfully improved results for low-data drug discovery with one-shot learning for related tasks. However, it clearly demonstrated the limitations of cross-task generalization across unrelated tasks in one-shot models, specifically nuclear receptor assays and patient adverse reactions [389].

In the medical domain, multimodal, multi-task and transfer learning strategies not only inherit most methodological issues from natural image, text, and audio domains, but also pose domain-specific challenges. There is a compelling need for the development of privacy-preserving transfer learning algorithms, such as Private Aggregation of Teacher Ensembles [158]. We suggest that these types of models deserve deeper investigation to establish sound theoretical guarantees and determine limits for the transferability of features between various closely related and distant learning tasks.

Conclusions

Deep learning-based methods now match or surpass the previous state of the art in a diverse array of tasks in patient and disease categorization, fundamental biological study, genomics, and treatment development. Returning to our central question: given this rapid progress, has deep learning transformed the study of human disease? Though the answer is highly dependent on the specific domain and problem being addressed, we conclude that deep learning has not yet realized its transformative potential or induced a strategic inflection point. Despite its dominance over competing machine learning approaches in many of the areas reviewed here and quantitative improvements in predictive performance, deep learning has not yet definitively “solved” these problems.

-

As an analogy, consider recent progress in conversational speech recognition. Since 2009 there have been drastic performance improvements with error rates dropping from more than 20% to less than 6% [520] and finally approaching or exceeding human performance in the past year [521,522]. The phenomenal improvements on benchmark datasets are undeniable, but greatly reducing the error rate on these benchmarks did not fundamentally transform the domain. Widespread adoption of conversational speech technologies will require solving the problem, i.e. methods that surpass human performance, and persuading users to adopt them [520]. We see parallels in healthcare, where achieving the full potential of deep learning will require outstanding predictive performance as well as acceptance and adoption by biologists and clinicians. These experts will rightfully demand rigorous evidence that deep learning has impacted their respective disciplines – elucidated new biological mechanisms and improved patient outcomes – to be convinced that the promises of deep learning are more substantive than those of previous generations of artificial intelligence.

+

As an analogy, consider recent progress in conversational speech recognition. Since 2009 there have been drastic performance improvements with error rates dropping from more than 20% to less than 6% [528] and finally approaching or exceeding human performance in the past year [529,530]. The phenomenal improvements on benchmark datasets are undeniable, but greatly reducing the error rate on these benchmarks did not fundamentally transform the domain. Widespread adoption of conversational speech technologies will require solving the problem, i.e. methods that surpass human performance, and persuading users to adopt them [528]. We see parallels in healthcare, where achieving the full potential of deep learning will require outstanding predictive performance as well as acceptance and adoption by biologists and clinicians. These experts will rightfully demand rigorous evidence that deep learning has impacted their respective disciplines – elucidated new biological mechanisms and improved patient outcomes – to be convinced that the promises of deep learning are more substantive than those of previous generations of artificial intelligence.

Some of the areas we have discussed are closer to surpassing this lofty bar than others, generally those that are more similar to the non-biomedical tasks that are now monopolized by deep learning. In medical imaging, diabetic retinopathy [44], diabetic macular edema [44], tuberculosis [53], and skin lesion [4] classifiers are highly accurate and comparable to clinician performance.

In other domains, perfect accuracy will not be required because deep learning will primarily prioritize experiments and assist discovery. For example, in chemical screening for drug discovery, a deep learning system that successfully identifies dozens or hundreds of target-specific, active small molecules from a massive search space would have immense practical value even if its overall precision is modest. In medical imaging, deep learning can point an expert to the most challenging cases that require manual review [53], though the risk of false negatives must be addressed. In protein structure prediction, errors in individual residue-residue contacts can be tolerated when using the contacts jointly for 3D structure modeling. Improved contact map predictions [28] have led to notable improvements in fold and 3D structure prediction for some of the most challenging proteins, such as membrane proteins [237].

Conversely, the most challenging tasks may be those in which predictions are used directly for downstream modeling or decision-making, especially in the clinic. As an example, errors in sequence variant calling will be amplified if they are used directly for GWAS. In addition, the stochasticity and complexity of biological systems imply that for some problems, for instance predicting gene regulation in disease, perfect accuracy will be unattainable.

-

We are witnessing deep learning models achieving human-level performance across a number of biomedical domains. However, machine learning algorithms, including deep neural networks, are also prone to mistakes that humans are much less likely to make, such as misclassification of adversarial examples [523,524], a reminder that these algorithms do not understand the semantics of the objects presented. It may be impossible to guarantee that a model is not susceptible to adversarial examples, but work in this area is continuing [525,526]. Cooperation between human experts and deep learning algorithms addresses many of these challenges and can achieve better performance than either individually [64]. For sample and patient classification tasks, we expect deep learning methods to augment clinicians and biomedical researchers.

+

We are witnessing deep learning models achieving human-level performance across a number of biomedical domains. However, machine learning algorithms, including deep neural networks, are also prone to mistakes that humans are much less likely to make, such as misclassification of adversarial examples [531,532], a reminder that these algorithms do not understand the semantics of the objects presented. It may be impossible to guarantee that a model is not susceptible to adversarial examples, but work in this area is continuing [533,534]. Cooperation between human experts and deep learning algorithms addresses many of these challenges and can achieve better performance than either individually [64]. For sample and patient classification tasks, we expect deep learning methods to augment clinicians and biomedical researchers.

We are optimistic about the future of deep learning in biology and medicine. It is by no means inevitable that deep learning will revolutionize these domains, but given how rapidly the field is evolving, we are confident that its full potential in biomedicine has not been explored. We have highlighted numerous challenges beyond improving training and predictive accuracy, such as preserving patient privacy and interpreting models. Ongoing research has begun to address these problems and shown that they are not insurmountable. Deep learning offers the flexibility to model data in its most natural form, for example, longer DNA sequences instead of k-mers for transcription factor binding prediction and molecular graphs instead of pre-computed bit vectors for drug discovery. These flexible input feature representations have spurred creative modeling approaches that would be infeasible with other machine learning techniques. Unsupervised methods are currently less developed than their supervised counterparts, but they may have the most potential because of how expensive and time-consuming it is to label large amounts of biomedical data. If future deep learning algorithms can summarize very large collections of input data into interpretable models that spur scientists to ask questions that they did not know how to ask, it will be clear that deep learning has transformed biology and medicine.

Methods

Continuous collaborative manuscript drafting

We recognized that deep learning in precision medicine is a rapidly developing area. Hence, diverse expertise was required to provide a forward-looking perspective. Accordingly, we collaboratively wrote this review in the open, enabling anyone with expertise to contribute. We wrote the manuscript in markdown and tracked changes using git. Contributions were handled through GitHub, with individuals submitting “pull requests” to suggest additions to the manuscript.

-

To facilitate citation, we defined a markdown citation syntax. We supported citations to the following identifier types (in order of preference): DOIs, PubMed Central IDs, PubMed IDs, arXiv IDs, and URLs. References were automatically generated from citation metadata by querying APIs to generate Citation Style Language (CSL) JSON items for each reference. Pandoc and pandoc-citeproc converted the markdown to HTML and PDF, while rendering the formatted citations and references. In total, referenced works consisted of 350 DOIs, 6 PubMed Central records, 0 PubMed records, 127 arXiv manuscripts, and 47 URLs (webpages as well as manuscripts lacking standardized identifiers).

+

To facilitate citation, we defined a markdown citation syntax. We supported citations to the following identifier types (in order of preference): DOIs, PubMed Central IDs, PubMed IDs, arXiv IDs, and URLs. References were automatically generated from citation metadata by querying APIs to generate Citation Style Language (CSL) JSON items for each reference. Pandoc and pandoc-citeproc converted the markdown to HTML and PDF, while rendering the formatted citations and references. In total, referenced works consisted of 356 DOIs, 6 PubMed Central records, 0 PubMed records, 128 arXiv manuscripts, and 48 URLs (webpages as well as manuscripts lacking standardized identifiers).

We implemented continuous analysis so the manuscript was automatically regenerated whenever the source changed [147]. We configured Travis CI – a continuous integration service – to fetch new citation metadata and rebuild the manuscript for every commit. Accordingly, formatting or citation errors in pull requests would cause the Travis CI build to fail, automating quality control. In addition, the build process renders templated variables, such as the reference counts mentioned above, to automate the updating of dynamic content. When contributions were merged into the master branch, Travis CI deployed the built manuscript by committing back to the GitHub repository. As a result, the latest manuscript version is always available at https://greenelab.github.io/deep-review. To ensure a consistent software environment, we defined a versioned conda environment of the software dependencies.

-

In addition, we instructed the Travis CI deployment script to perform blockchain timestamping [527,528]. Using OpenTimestamps, we submitted hashes for the manuscript and the source git commit for timestamping in the Bitcoin blockchain [529]. These timestamps attest that a given version of this manuscript (and its history) existed at a given point in time. The ability to irrefutably prove manuscript existence at a past time could be important to establish scientific precedence and enforce an immutable record of authorship.

+

In addition, we instructed the Travis CI deployment script to perform blockchain timestamping [535,536]. Using OpenTimestamps, we submitted hashes for the manuscript and the source git commit for timestamping in the Bitcoin blockchain [537]. These timestamps attest that a given version of this manuscript (and its history) existed at a given point in time. The ability to irrefutably prove manuscript existence at a past time could be important to establish scientific precedence and enforce an immutable record of authorship.
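The hashing step behind this kind of timestamping can be illustrated with the Python standard library. The sketch below only computes SHA-256 digests of two stand-in byte strings; it does not call OpenTimestamps or touch the Bitcoin blockchain, and the sample contents and names (`manuscript_v1`, `sha256_hex`) are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Stand-ins for two versions of a manuscript file.
manuscript_v1 = b"Deep learning review, draft 1\n"
manuscript_v2 = b"Deep learning review, draft 2\n"

digest_v1 = sha256_hex(manuscript_v1)
digest_v2 = sha256_hex(manuscript_v2)

# Changing even one character yields a different digest, so a timestamp
# on digest_v1 attests to exactly that version and no other.
versions_differ = digest_v1 != digest_v2
```

Only the fixed-length digest needs to be submitted for timestamping, so the manuscript itself never has to be published to the timestamping service; anyone holding the file can later recompute the digest and compare.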

Author contributions

-

We created an open repository on the GitHub version control platform (greenelab/deep-review) [530]. Here, we engaged with numerous authors from papers within and outside of the area. The manuscript was drafted via GitHub commits by 27 individuals who met the ICMJE standards of authorship. These were individuals who contributed to the review of the literature; drafted the manuscript or provided substantial critical revisions; approved the final manuscript draft; and agreed to be accountable in all aspects of the work. Individuals who did not contribute in all of these ways, but who did participate, are acknowledged below. We grouped authors into the following four classes of approximately equal contributions and randomly ordered authors within each contribution class. Drafted multiple sub-sections along with extensive editing, pull request reviews, or discussion: A.A.K., B.K.B., B.T.D., D.S.H., E.F., G.P.W., P.A., T.C. Drafted one or more sub-sections: A.E.C., A.S., B.J.L., E.M.C., G.L.R., J.I., J.L., J.X., S.W., W.X. Revised specific sub-sections or supervised drafting one or more sub-sections: A.K., D.D., D.J.H., L.K.W., M.H.S.S., Y.P., Y.Q. Drafted sub-sections, edited the manuscript, reviewed pull requests, and coordinated co-authors: A.G., C.S.G.

+

We created an open repository on the GitHub version control platform (greenelab/deep-review) [538]. Here, we engaged with numerous authors from papers within and outside of the area. The manuscript was drafted via GitHub commits by 27 individuals who met the ICMJE standards of authorship. These were individuals who contributed to the review of the literature; drafted the manuscript or provided substantial critical revisions; approved the final manuscript draft; and agreed to be accountable in all aspects of the work. Individuals who did not contribute in all of these ways, but who did participate, are acknowledged below. We grouped authors into the following four classes of approximately equal contributions and randomly ordered authors within each contribution class. Drafted multiple sub-sections along with extensive editing, pull request reviews, or discussion: A.A.K., B.K.B., B.T.D., D.S.H., E.F., G.P.W., P.A., T.C. Drafted one or more sub-sections: A.E.C., A.S., B.J.L., E.M.C., G.L.R., J.I., J.L., J.X., S.W., W.X. Revised specific sub-sections or supervised drafting one or more sub-sections: A.K., D.D., D.J.H., L.K.W., M.H.S.S., Y.P., Y.Q. Drafted sub-sections, edited the manuscript, reviewed pull requests, and coordinated co-authors: A.G., C.S.G.

Competing interests

A.K. is on the Advisory Board of Deep Genomics Inc. E.F. is a full-time employee of GlaxoSmithKline. The remaining authors have no competing interests to declare.

Acknowledgements

@@ -2323,19 +2325,21 @@

References

Jed Zaretzki, Matthew Matlock, S. Joshua Swamidass
Journal of Chemical Information and Modeling (2013-12-23) https://doi.org/10.1021/ci400518g

-
-

373. Molecular Descriptors for Chemoinformatics. Methods and Principles in Medicinal Chemistry (2009-07-15) https://doi.org/10.1002/9783527628766

-
-

374. Multi-task Neural Networks for QSAR Predictions
+

373. Multi-task Neural Networks for QSAR Predictions
George E. Dahl, Navdeep Jaitly, Ruslan Salakhutdinov
arXiv (2014-06-04) https://arxiv.org/abs/1406.1231v1

-

375. Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships
+

374. Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships
Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, Vladimir Svetnik
Journal of Chemical Information and Modeling (2015-02-23) https://doi.org/10.1021/ci500747n

+
+

375. Did Kaggle Predict Drug Candidate Activities? Or Not?
+Derek Lowe
+In the Pipeline (2012-12-11) http://blogs.sciencemag.org/pipeline/archives/2012/12/11/did_kaggle_predict_drug_candidate_activities_or_not

+

376. Deep learning as an opportunity in virtual screening
Thomas Unterthiner, Andreas Mayr, Günter Klambauer, Marvin Steijaert, Jörg K. Wegner, Hugo Ceulemans, Sepp Hochreiter
@@ -2366,734 +2370,772 @@

References

Alessandro Lusci, David Fooshee, Michael Browning, Joshua Swamidass, Pierre Baldi
Journal of Cheminformatics (2015-12) https://doi.org/10.1186/s13321-015-0110-6

+
+

382. Molecular Descriptors for Chemoinformatics. Methods and Principles in Medicinal Chemistry (2009-07-15) https://doi.org/10.1002/9783527628766

+
-

382. Extended-Connectivity Fingerprints
+

383. Extended-Connectivity Fingerprints
David Rogers, Mathew Hahn
Journal of Chemical Information and Modeling (2010-05-24) https://doi.org/10.1021/ci100050t

+
+

384. Automatic chemical design using a data-driven continuous representation of molecules
+Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
+arXiv (2016-10-07) https://arxiv.org/abs/1610.02415v1

+
+
+

385. Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expert-developed QSAR/QSPR Models
+Garrett B. Goh, Charles Siegel, Abhinav Vishnu, Nathan O. Hodas, Nathan Baker
+arXiv (2017-06-20) https://arxiv.org/abs/1706.06689v1

+
-

383. Convolutional Networks on Graphs for Learning Molecular Fingerprints
+

386. Convolutional Networks on Graphs for Learning Molecular Fingerprints
David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, Ryan P. Adams
(2015) http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints

-

384. Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules
+

387. Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules
Alessandro Lusci, Gianluca Pollastri, Pierre Baldi
Journal of Chemical Information and Modeling (2013-07-22) https://doi.org/10.1021/ci400187y

-

385. Molecular graph convolutions: moving beyond fingerprints
+

388. Molecular graph convolutions: moving beyond fingerprints
Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, Patrick Riley
Journal of Computer-Aided Molecular Design (2016-08) https://doi.org/10.1007/s10822-016-9938-8

-

386. Low Data Drug Discovery with One-Shot Learning
+

389. Low Data Drug Discovery with One-Shot Learning
Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, Vijay Pande
ACS Central Science (2017-04-03) https://doi.org/10.1021/acscentsci.6b00367

-
-

387. MoleculeNet: A Benchmark for Molecular Machine Learning
-Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande
-arXiv (2017-03-02) https://arxiv.org/abs/1703.00564v2

+
+

390. Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction
+Connor W. Coley, Regina Barzilay, William H. Green, Tommi S. Jaakkola, Klavs F. Jensen
+Journal of Chemical Information and Modeling (2017-07-25) https://doi.org/10.1021/acs.jcim.6b00601

+
+
+

391. Learning a Local-Variable Model of Aromatic and Conjugated Systems
+Matthew K. Matlock, Na Le Dang, S. Joshua Swamidass
+ACS Central Science (2018-01-03) https://doi.org/10.1021/acscentsci.7b00405

+
+
+

392. Covariant Compositional Networks For Learning Graphs
+Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, Shubhendu Trivedi
+arXiv (2018-01-07) https://arxiv.org/abs/1801.02144v1

+
+
+

393. MoleculeNet: a benchmark for molecular machine learning
+Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande
+Chemical Science (2018) https://doi.org/10.1039/c7sc02664a

+
+
+

394. What do we know and when do we know it?
+Anthony Nicholls
+Journal of Computer-Aided Molecular Design (2008-02-06) https://doi.org/10.1007/s10822-008-9170-2

-

388. deepchem/deepchem. GitHub (2017) https://github.com/deepchem/deepchem

+

395. deepchem/deepchem. GitHub (2017) https://github.com/deepchem/deepchem

-
-

389. Automatic chemical design using a data-driven continuous representation of molecules
-Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, Alán Aspuru-Guzik
-arXiv (2016-10-07) https://arxiv.org/abs/1610.02415v2

+
+

396. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition
+Sabrina Jaeger, Simone Fulle, Samo Turk
+Journal of Chemical Information and Modeling (2018-01-10) https://doi.org/10.1021/acs.jcim.7b00616

-

390. Structure-Based Virtual Screening for Drug Discovery: a Problem-Centric Review
+

397. Structure-Based Virtual Screening for Drug Discovery: a Problem-Centric Review
Tiejun Cheng, Qingliang Li, Zhigang Zhou, Yanli Wang, Stephen H. Bryant
The AAPS Journal (2012-01-27) https://doi.org/10.1208/s12248-012-9322-0

-

391. Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity
+

398. Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity
Joseph Gomes, Bharath Ramsundar, Evan N. Feinberg, Vijay S. Pande
arXiv (2017-03-30) https://arxiv.org/abs/1703.10603v1

+
+

399. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions
+Zixuan Cang, Guo-Wei Wei
+PLOS Computational Biology (2017-07-27) https://doi.org/10.1371/journal.pcbi.1005690

+
-

392. The PDBbind Database: Methodologies and Updates
+

400. The PDBbind Database: Methodologies and Updates
Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, Shaomeng Wang
Journal of Medicinal Chemistry (2005-06) https://doi.org/10.1021/jm048957q

-

393. Boosting Docking-Based Virtual Screening with Deep Learning
+

401. Boosting Docking-Based Virtual Screening with Deep Learning
Janaina Cruz Pereira, Ernesto Raúl Caffarena, Cicero Nogueira dos Santos
Journal of Chemical Information and Modeling (2016-12-27) https://doi.org/10.1021/acs.jcim.6b00355

-

394. Protein-Ligand Scoring with Convolutional Neural Networks
+

402. Protein-Ligand Scoring with Convolutional Neural Networks
Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, David Ryan Koes
arXiv (2016-12-08) https://arxiv.org/abs/1612.02751v1

-

395. Enabling future drug discovery by de novo design
+

403. Enabling future drug discovery by de novo design
Markus Hartenfeller, Gisbert Schneider
Wiley Interdisciplinary Reviews: Computational Molecular Science (2011-04-25) https://doi.org/10.1002/wcms.49

-

396. De Novo Design at the Edge of Chaos
+

404. De Novo Design at the Edge of Chaos
Petra Schneider, Gisbert Schneider
Journal of Medicinal Chemistry (2016-05-12) https://doi.org/10.1021/acs.jmedchem.5b01849

-

397. Generating Sequences With Recurrent Neural Networks
+

405. Generating Sequences With Recurrent Neural Networks
Alex Graves
arXiv (2013-08-04) https://arxiv.org/abs/1308.0850v5

-

398. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks
+

406. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks
Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, Mark P. Waller
arXiv (2017-01-05) https://arxiv.org/abs/1701.01329v1

-

399. Grammar Variational Autoencoder
+

407. Grammar Variational Autoencoder
Matt J. Kusner, Brooks Paige, José Miguel Hernández-Lobato
arXiv (2017-03-06) https://arxiv.org/abs/1703.01925v1

-

400. ChEMBL: a large-scale bioactivity database for drug discovery
+

408. ChEMBL: a large-scale bioactivity database for drug discovery
A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, J. P. Overington
Nucleic Acids Research (2011-09-23) https://doi.org/10.1093/nar/gkr777

-

401. Molecular De Novo Design through Deep Reinforcement Learning
+

409. Molecular De Novo Design through Deep Reinforcement Learning
Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, Hongming Chen
arXiv (2017-04-25) https://arxiv.org/abs/1704.07555v2

-

402. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
+

410. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E. Turner, Douglas Eck
arXiv (2016-11-09) https://arxiv.org/abs/1611.02796v9

-

403. Understanding deep learning requires rethinking generalization
+

411. Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
arXiv (2016-11-10) https://arxiv.org/abs/1611.03530v2

-

404. Why does deep and cheap learning work so well?
+

412. Why does deep and cheap learning work so well?
Henry W. Lin, Max Tegmark, David Rolnick
arXiv (2016-08-29) https://arxiv.org/abs/1608.08225v3

-

405. The relationship between Precision-Recall and ROC curves
+

413. The relationship between Precision-Recall and ROC curves
Jesse Davis, Mark Goadrich
Proceedings of the 23rd international conference on Machine learning - ICML ’06 (2006) https://doi.org/10.1145/1143844.1143874

-

406. An open investigation of the reproducibility of cancer biology research
+

414. An open investigation of the reproducibility of cancer biology research
Timothy M Errington, Elizabeth Iorns, William Gunn, Fraser Elisabeth Tan, Joelle Lomax, Brian A Nosek
eLife (2014-12-10) https://doi.org/10.7554/elife.04333

-

407. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks
+

415. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks
John Bradshaw, Alexander G. de G. Matthews, Zoubin Ghahramani
arXiv (2017-07-08) https://arxiv.org/abs/1707.02476v1

-

408. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
+

416. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
Alex Kendall, Yarin Gal
arXiv (2017-03-15) https://arxiv.org/abs/1703.04977v2

-

409. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
+

417. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
Alex Kendall, Yarin Gal, Roberto Cipolla
arXiv (2017-05-19) https://arxiv.org/abs/1705.07115v1

-

410. On Calibration of Modern Neural Networks
+

418. On Calibration of Modern Neural Networks
Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger
arXiv (2017-06-14) https://arxiv.org/abs/1706.04599v2

-

411. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods
+

419. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods
John C. Platt
ADVANCES IN LARGE MARGIN CLASSIFIERS http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639

-

412. Confidence interval prediction for neural network models
+

420. Confidence interval prediction for neural network models
G. Chryssolouris, M. Lee, A. Ramsey
IEEE Transactions on Neural Networks (1996) https://doi.org/10.1109/72.478409

-

413. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
+

421. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
Dan Hendrycks, Kevin Gimpel
arXiv (2016-10-07) https://arxiv.org/abs/1610.02136v2

-

414. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
+

422. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
Shiyu Liang, Yixuan Li, R. Srikant
arXiv (2017-06-08) https://arxiv.org/abs/1706.02690v3

-

415. Concrete Problems in AI Safety
+

423. Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané
arXiv (2016-06-21) https://arxiv.org/abs/1606.06565v2

-

416. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
+

424. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
Nicholas Carlini, David Wagner
arXiv (2017-05-20) https://arxiv.org/abs/1705.07263v2

-

417. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
+

425. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Yarin Gal, Zoubin Ghahramani
arXiv (2015-06-06) https://arxiv.org/abs/1506.02142v6

-

418. Leveraging uncertainty information from deep neural networks for disease detection
+

426. Leveraging uncertainty information from deep neural networks for disease detection
Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, Siegfried Wahl
Scientific Reports (2017-12) https://doi.org/10.1038/s41598-017-17876-z

-

419. Robustly representing inferential uncertainty in deep neural networks through sampling
+

427. Robustly representing inferential uncertainty in deep neural networks through sampling
Patrick McClure, Nikolaus Kriegeskorte
arXiv (2016-11-05) https://arxiv.org/abs/1611.01639v6

-

420. Bayesian Hypernetworks
+

428. Bayesian Hypernetworks
David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, Aaron Courville
arXiv (2017-10-13) https://arxiv.org/abs/1710.04759v1

-

421. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
+

429. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell
arXiv (2016-12-05) https://arxiv.org/abs/1612.01474v3

-

422. Domain-Adversarial Training of Neural Networks
+

430. Domain-Adversarial Training of Neural Networks
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky
arXiv (2015-05-28) https://arxiv.org/abs/1505.07818v4

-

423. Yarin Gal - Publications | Oxford Machine Learning
http://www.cs.ox.ac.uk/people/yarin.gal/website/publications.html

+

431. Yarin Gal - Publications | Oxford Machine Learning
http://www.cs.ox.ac.uk/people/yarin.gal/website/publications.html

-

424. Do Deep Nets Really Need to be Deep?
+

432. Do Deep Nets Really Need to be Deep?
Lei Jimmy Ba, Rich Caruana
arXiv (2013-12-21) https://arxiv.org/abs/1312.6184v7

-

425. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
+

433. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images
Anh Nguyen, Jason Yosinski, Jeff Clune
arXiv (2014-12-05) https://arxiv.org/abs/1412.1897v4

-

426. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
+

434. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
arXiv (2016-02-16) https://arxiv.org/abs/1602.04938v3

-

427. Visualizing and Understanding Convolutional Networks
+

435. Visualizing and Understanding Convolutional Networks
Matthew D Zeiler, Rob Fergus
arXiv (2013-11-12) https://arxiv.org/abs/1311.2901v3

-

428. Visualizing Deep Neural Network Decisions: Prediction Difference Analysis
+

436. Visualizing Deep Neural Network Decisions: Prediction Difference Analysis
Luisa M Zintgraf, Taco S Cohen, Tameem Adel, Max Welling
arXiv (2017-02-15) https://arxiv.org/abs/1702.04595v1

-

429. Interpretable Explanations of Black Boxes by Meaningful Perturbation
+

437. Interpretable Explanations of Black Boxes by Meaningful Perturbation
Ruth Fong, Andrea Vedaldi
arXiv (2017-04-11) https://arxiv.org/abs/1704.03296v1

-

430. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
+

438. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
arXiv (2013-12-20) https://arxiv.org/abs/1312.6034v2

-

431. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation
+

439. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation
Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, Wojciech Samek
PLOS ONE (2015-07-10) https://doi.org/10.1371/journal.pone.0130140

-

432. Investigating the influence of noise and distractors on the interpretation of neural networks
+

440. Investigating the influence of noise and distractors on the interpretation of neural networks
Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, Sven Dähne
arXiv (2016-11-22) https://arxiv.org/abs/1611.07270v1

-

433. Striving for Simplicity: The All Convolutional Net
+

441. Striving for Simplicity: The All Convolutional Net
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller
arXiv (2014-12-21) https://arxiv.org/abs/1412.6806v3

-

434. Salient Deconvolutional Networks
+

442. Salient Deconvolutional Networks
Aravindh Mahendran, Andrea Vedaldi
Computer Vision – ECCV 2016 (2016) https://doi.org/10.1007/978-3-319-46466-4_8

-

435. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
+

443. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra
arXiv (2016-10-07) https://arxiv.org/abs/1610.02391v3

-

436. Axiomatic Attribution for Deep Networks
+

444. Axiomatic Attribution for Deep Networks
Mukund Sundararajan, Ankur Taly, Qiqi Yan
arXiv (2017-03-04) https://arxiv.org/abs/1703.01365v2

-

437. An unexpected unity among methods for interpreting model predictions
+

445. An unexpected unity among methods for interpreting model predictions
Scott Lundberg, Su-In Lee
arXiv (2016-11-22) https://arxiv.org/abs/1611.07478v3

-

438. 17. A Value for n-Person Games
+

446. 17. A Value for n-Person Games
L. S. Shapley
Contributions to the Theory of Games (AM-28), Volume II (1953) https://doi.org/10.1515/9781400881970-018

-

439. Understanding Deep Image Representations by Inverting Them
+

447. Understanding Deep Image Representations by Inverting Them
Aravindh Mahendran, Andrea Vedaldi
arXiv (2014-11-26) https://arxiv.org/abs/1412.0035v1

-

440. Maximum Entropy Methods for Extracting the Learned Features of Deep Neural Networks
+

448. Maximum Entropy Methods for Extracting the Learned Features of Deep Neural Networks
Alex I Finnegan, Jun S Song
Cold Spring Harbor Laboratory (2017-02-03) https://doi.org/10.1101/105957

-

441. Visualizing Deep Convolutional Neural Networks Using Natural Pre-images
+

449. Visualizing Deep Convolutional Neural Networks Using Natural Pre-images
Aravindh Mahendran, Andrea Vedaldi
International Journal of Computer Vision (2016-05-18) https://doi.org/10.1007/s11263-016-0911-8

-

442. Inceptionism: Going Deeper into Neural Networks
+

450. Inceptionism: Going Deeper into Neural Networks
Alexander Mordvintsev, Christopher Olah, Mike Tyka
Google Research Blog (2015-06) http://googleresearch.blogspot.co.uk/2015/06/inceptionism-going-deeper-into-neural.html

-

443. Visualizing Higher-Layer Features of a Deep Network
+

451. Visualizing Higher-Layer Features of a Deep Network
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pascal Vincent
University of Montreal (2009-06) http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247

-

444. Understanding Neural Networks Through Deep Visualization
+

452. Understanding Neural Networks Through Deep Visualization
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, Hod Lipson
arXiv (2015-06-22) https://arxiv.org/abs/1506.06579v1

-

445. Neural Machine Translation by Jointly Learning to Align and Translate
+

453. Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
arXiv (2014-09-01) https://arxiv.org/abs/1409.0473v7

-

446. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
+

454. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
arXiv (2015-02-10) https://arxiv.org/abs/1502.03044v3

-

447. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures
+

455. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures
Laura Deming, Sasha Targ, Nate Sauder, Diogo Almeida, Chun Jimmie Ye
arXiv (2016-05-23) https://arxiv.org/abs/1605.07156v1

-

448. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism
+

456. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism
Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, Jimeng Sun
arXiv (2016-08-19) https://arxiv.org/abs/1608.05745v4

-

449. GRAM: Graph-based Attention Model for Healthcare Representation Learning
+

457. GRAM: Graph-based Attention Model for Healthcare Representation Learning
Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F. Stewart, Jimeng Sun
arXiv (2016-11-21) https://arxiv.org/abs/1611.07012v3

-

450. Sequence learning with recurrent networks: analysis of internal representations
+

458. Sequence learning with recurrent networks: analysis of internal representations
Joydeep Ghosh, Vijay Karamcheti
Science of Artificial Neural Networks (1992-07-01) https://doi.org/10.1117/12.140112

-

451. Visualizing and Understanding Recurrent Networks
+

459. Visualizing and Understanding Recurrent Networks
Andrej Karpathy, Justin Johnson, Li Fei-Fei
arXiv (2015-06-05) https://arxiv.org/abs/1506.02078v2

-

452. LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks
+

460. LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks
Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, Alexander M. Rush
arXiv (2016-06-23) https://arxiv.org/abs/1606.07461v2

-

453. Automatic Rule Extraction from Long Short Term Memory Networks
+

461. Automatic Rule Extraction from Long Short Term Memory Networks
W. James Murdoch, Arthur Szlam
arXiv (2017-02-08) https://arxiv.org/abs/1702.02540v2

-

454. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
+

462. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford, Luke Metz, Soumith Chintala
arXiv (2015-11-19) https://arxiv.org/abs/1511.06434v2

-

455. Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders
+

463. Extracting a Biologically Relevant Latent Space from Cancer Transcriptomes with Variational Autoencoders
Gregory P. Way, Casey S. Greene
Cold Spring Harbor Laboratory (2017-08-11) https://doi.org/10.1101/174474

-

456. Evaluating deep variational autoencoders trained on pan-cancer gene expression
+

464. Evaluating deep variational autoencoders trained on pan-cancer gene expression
Gregory P. Way, Casey S. Greene
arXiv (2017-11-13) https://arxiv.org/abs/1711.04828v1

-

457. GANs for Biological Image Synthesis
+

465. GANs for Biological Image Synthesis
Anton Osokin, Anatole Chessel, Rafael E. Carazo Salas, Federico Vaggi
arXiv (2017-08-15) https://arxiv.org/abs/1708.04692v2

-

458. CytoGAN: Generative Modeling of Cell Images
+

466. CytoGAN: Generative Modeling of Cell Images
Peter Goldsborough, Nick Pawlowski, Juan C Caicedo, Shantanu Singh, Anne Carpenter
Cold Spring Harbor Laboratory (2017-12-02) https://doi.org/10.1101/227645

-

459. Understanding Black-box Predictions via Influence Functions
+

467. Understanding Black-box Predictions via Influence Functions
Pang Wei Koh, Percy Liang
arXiv (2017-03-14) https://arxiv.org/abs/1703.04730v2

-

460. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models
+

468. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models
Minsuk Kahng, Pierre Y. Andrews, Aditya Kalro, Duen Horng Chau
arXiv (2017-04-06) https://arxiv.org/abs/1704.01942v2

-

461. Towards Better Analysis of Deep Convolutional Neural Networks
+

469. Towards Better Analysis of Deep Convolutional Neural Networks
Mengchen Liu, Jiaxin Shi, Zhen Li, Chongxuan Li, Jun Zhu, Shixia Liu
arXiv (2016-04-24) https://arxiv.org/abs/1604.07043v3

-

462. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain
+

470. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain
Zhengping Che, Sanjay Purushotham, Robinder Khemani, Yan Liu
arXiv (2015-12-11) https://arxiv.org/abs/1512.03542v1

-

463. Rationalizing Neural Predictions
+

471. Rationalizing Neural Predictions
Tao Lei, Regina Barzilay, Tommi Jaakkola
arXiv (2016-06-13) https://arxiv.org/abs/1606.04155v2

-

464. Functional Knowledge Transfer for High-accuracy Prediction of Under-studied Biological Processes
+

472. Functional Knowledge Transfer for High-accuracy Prediction of Under-studied Biological Processes
Christopher Y. Park, Aaron K. Wong, Casey S. Greene, Jessica Rowland, Yuanfang Guan, Lars A. Bongo, Rebecca D. Burdine, Olga G. Troyanskaya
PLoS Computational Biology (2013-03-14) https://doi.org/10.1371/journal.pcbi.1002957

-

465. DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI
+

473. DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI
Saman Sarraf, Danielle D. DeSouza, John Anderson, Ghassem Tofighi
Cold Spring Harbor Laboratory (2016-08-21) https://doi.org/10.1101/070441

-

466. DeepBound: Accurate Identification of Transcript Boundaries via Deep Convolutional Neural Fields
+

474. DeepBound: Accurate Identification of Transcript Boundaries via Deep Convolutional Neural Fields
Mingfu Shao, Jianzhu Ma, Sheng Wang
Cold Spring Harbor Laboratory (2017-04-07) https://doi.org/10.1101/125229

-

467. A general framework for estimating the relative pathogenicity of human genetic variants
+

475. A general framework for estimating the relative pathogenicity of human genetic variants
Martin Kircher, Daniela M Witten, Preti Jain, Brian J O’Roak, Gregory M Cooper, Jay Shendure
Nature Genetics (2014-02-02) https://doi.org/10.1038/ng.2892

-

468. Diet Networks: Thin Parameters for Fat Genomics
+

476. Diet Networks: Thin Parameters for Fat Genomics
Adriana Romero, Pierre Luc Carrier, Akram Erraqabi, Tristan Sylvain, Alex Auvolat, Etienne Dejoie, Marc-André Legault, Marie-Pierre Dubé, Julie G. Hussin, Yoshua Bengio
International Conference on Learning Representations 2017 (2016-11-04) https://openreview.net/forum?id=Sk-oDY9ge&noteId=Sk-oDY9ge

-

469. Deep learning in neural networks: An overview
+

477. Deep learning in neural networks: An overview
Jürgen Schmidhuber
Neural Networks (2015-01) https://doi.org/10.1016/j.neunet.2014.09.003

-

470. Deep Learning with Limited Numerical Precision
+

478. Deep Learning with Limited Numerical Precision
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
arXiv (2015-02-09) https://arxiv.org/abs/1502.02551v1

-

471. Training deep neural networks with low precision multiplications
+

479. Training deep neural networks with low precision multiplications
Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David
arXiv (2014-12-22) https://arxiv.org/abs/1412.7024v5

-

472. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
+

480. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
arXiv (2015-06-22) https://arxiv.org/abs/1506.06438v2

-

473. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
+

481. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio
arXiv (2016-09-22) https://arxiv.org/abs/1609.07061v1

-

474. Distilling the Knowledge in a Neural Network
+

482. Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, Jeff Dean
arXiv (2015-03-09) https://arxiv.org/abs/1503.02531v1

-

475. Large-scale deep unsupervised learning using graphics processors
+

483. Large-scale deep unsupervised learning using graphics processors
Rajat Raina, Anand Madhavan, Andrew Y. Ng
Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09 (2009) https://doi.org/10.1145/1553374.1553486

-

476. Improving the speed of neural networks on CPUs
+

484. Improving the speed of neural networks on CPUs
Vincent Vanhoucke, Andrew Senior, Mark Z. Mao
(2011) https://research.google.com/pubs/pub37631.html

-

477. On parallelizability of stochastic gradient descent for speech DNNS
+

485. On parallelizability of stochastic gradient descent for speech DNNS
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, Dong Yu
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014-05) https://doi.org/10.1109/icassp.2014.6853593

-

478. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
+

486. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
Stefan Hadjis, Firas Abuzaid, Ce Zhang, Christopher Ré
arXiv (2015-04-16) https://arxiv.org/abs/1504.04343v2

-

479. Growing pains for deep learning
+

487. Growing pains for deep learning
Chris Edwards
Communications of the ACM (2015-06-25) https://doi.org/10.1145/2771283

-

480. Experiments on Parallel Training of Deep Neural Network using Model Averaging
+

488. Experiments on Parallel Training of Deep Neural Network using Model Averaging
Hang Su, Haoyu Chen
arXiv (2015-07-05) https://arxiv.org/abs/1507.01239v2

-

481. Efficient mini-batch training for stochastic optimization
+

489. Efficient mini-batch training for stochastic optimization
Mu Li, Tong Zhang, Yuqiang Chen, Alexander J. Smola
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14 (2014) https://doi.org/10.1145/2623330.2623612

-

482. CGBVS-DNN: Prediction of Compound-protein Interactions Based on Deep Learning
+

490. CGBVS-DNN: Prediction of Compound-protein Interactions Based on Deep Learning
Masatoshi Hamanaka, Kei Taneishi, Hiroaki Iwata, Jun Ye, Jianguo Pei, Jinlong Hou, Yasushi Okuno
Molecular Informatics (2016-08-12) https://doi.org/10.1002/minf.201600045

-

483. cuDNN: Efficient Primitives for Deep Learning
+

491. cuDNN: Efficient Primitives for Deep Learning
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer
arXiv (2014-10-03) https://arxiv.org/abs/1410.0759v3

-

484. Compressing Neural Networks with the Hashing Trick
+

492. Compressing Neural Networks with the Hashing Trick
Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, Yixin Chen
arXiv (2015-04-19) https://arxiv.org/abs/1504.04788v1

-

485. Deep Learning on FPGAs: Past, Present, and Future
+

493. Deep Learning on FPGAs: Past, Present, and Future
Griffin Lacey, Graham W. Taylor, Shawki Areibi
arXiv (2016-02-13) https://arxiv.org/abs/1602.04283v1

-

486. In-Datacenter Performance Analysis of a Tensor Processing Unit
+

494. In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, … Doe Hyun Yoon
arXiv (2017-04-16) https://arxiv.org/abs/1704.04760v1

-

487. MapReduce
+

495. MapReduce
Jeffrey Dean, Sanjay Ghemawat
Communications of the ACM (2008-01-01) https://doi.org/10.1145/1327452.1327492

-

488. Distributed GraphLab
+

496. Distributed GraphLab
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, Joseph M. Hellerstein
Proceedings of the VLDB Endowment (2012-04-01) https://doi.org/10.14778/2212351.2212354

-

489. Large Scale Distributed Deep Networks
+

497. Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le, Mark Z Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, … Andrew Y Ng
Neural Information Processing Systems 2012 (2012-12) http://research.google.com/archive/large_deep_networks_nips2012.html

-

490. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
+

498. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
Advances in neural information processing systems (2015-12) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4907892/

-

491. SparkNet: Training Deep Networks in Spark
+

499. SparkNet: Training Deep Networks in Spark
Philipp Moritz, Robert Nishihara, Ion Stoica, Michael I. Jordan
arXiv (2015-11-19) https://arxiv.org/abs/1511.06051v4

-

492. MLlib: Machine Learning in Apache Spark
+

500. MLlib: Machine Learning in Apache Spark
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, … Ameet Talwalkar
arXiv (2015-05-26) https://arxiv.org/abs/1505.06807v1

-

493. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
+

501. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, … Xiaoqiang Zheng
arXiv (2016-03-14) https://arxiv.org/abs/1603.04467v2

-

494. fchollet/keras
GitHub (2017) https://github.com/fchollet/keras

+

502. fchollet/keras
GitHub (2017) https://github.com/fchollet/keras

-

495. maxpumperla/elephas
GitHub (2017) https://github.com/maxpumperla/elephas

+

503. maxpumperla/elephas
GitHub (2017) https://github.com/maxpumperla/elephas

-

496. Deep learning with COTS HPC systems
+

504. Deep learning with COTS HPC systems
Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, Andrew Ng
(2013-02-13) http://www.jmlr.org/proceedings/papers/v28/coates13.html

-

497. Ensemble-Compression: A New Method for Parallel Training of Deep Neural Networks
+

505. Ensemble-Compression: A New Method for Parallel Training of Deep Neural Networks
Shizhao Sun, Wei Chen, Jiang Bian, Xiaoguang Liu, Tie-Yan Liu
arXiv (2016-06-02) https://arxiv.org/abs/1606.00575v2

-

498. Algorithms for Hyper-parameter Optimization
+

506. Algorithms for Hyper-parameter Optimization
James Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Proceedings of the 24th International Conference on Neural Information Processing Systems (2011) http://dl.acm.org/citation.cfm?id=2986459.2986743

-

499. Random Search for Hyper-Parameter Optimization
+

507. Random Search for Hyper-Parameter Optimization
James Bergstra, Yoshua Bengio
Journal of Machine Learning Research (2012) http://www.jmlr.org/papers/v13/bergstra12a.html

-

500. Cloud computing and the DNA data race
+

508. Cloud computing and the DNA data race
Michael C Schatz, Ben Langmead, Steven L Salzberg
Nature Biotechnology (2010-07) https://doi.org/10.1038/nbt0710-691

-

501. The real cost of sequencing: scaling computation to keep pace with data generation
+

509. The real cost of sequencing: scaling computation to keep pace with data generation
Paul Muir, Shantao Li, Shaoke Lou, Daifeng Wang, Daniel J Spakowicz, Leonidas Salichos, Jing Zhang, George M. Weinstock, Farren Isaacs, Joel Rozowsky, Mark Gerstein
Genome Biology (2016-03-23) https://doi.org/10.1186/s13059-016-0917-0

-

502. The case for cloud computing in genome informatics
+

510. The case for cloud computing in genome informatics
Lincoln D Stein
Genome Biology (2010) https://doi.org/10.1186/gb-2010-11-5-207

-

503. One weird trick for parallelizing convolutional neural networks
+

511. One weird trick for parallelizing convolutional neural networks
Alex Krizhevsky
arXiv (2014-04-23) https://arxiv.org/abs/1404.5997v2

-

504. A view of cloud computing
+

512. A view of cloud computing
Michael Armbrust, Ion Stoica, Matei Zaharia, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin
Communications of the ACM (2010-04-01) https://doi.org/10.1145/1721654.1721672

-

505. Data Sharing
+

513. Data Sharing
Dan L. Longo, Jeffrey M. Drazen
New England Journal of Medicine (2016-01-21) https://doi.org/10.1056/nejme1516564

-

506. Celebrating parasites
+

514. Celebrating parasites
Casey S Greene, Lana X Garmire, Jack A Gilbert, Marylyn D Ritchie, Lawrence E Hunter
Nature Genetics (2017-03-30) https://doi.org/10.1038/ng.3830

-

507. Is Multitask Deep Learning Practical for Pharma?
+

515. Is Multitask Deep Learning Practical for Pharma?
Bharath Ramsundar, Bowen Liu, Zhenqin Wu, Andreas Verras, Matthew Tudor, Robert P. Sheridan, Vijay Pande
Journal of Chemical Information and Modeling (2017-08) https://doi.org/10.1021/acs.jcim.7b00146

-

508. Enhancing reproducibility for computational methods
+

516. Enhancing reproducibility for computational methods
V. Stodden, M. McNutt, D. H. Bailey, E. Deelman, Y. Gil, B. Hanson, M. A. Heroux, J. P. A. Ioannidis, M. Taufer
Science (2016-12-08) https://doi.org/10.1126/science.aah6168

-

509. DragoNN
(2016-11-06) http://kundajelab.github.io/dragonn/

+

517. DragoNN
(2016-11-06) http://kundajelab.github.io/dragonn/

-

510. ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge
(2017) https://www.synapse.org/#!Synapse:syn6131484/wiki/402026

+

518. ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge
(2017) https://www.synapse.org/#!Synapse:syn6131484/wiki/402026

-

511. How transferable are features in deep neural networks?
+

519. How transferable are features in deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson
(2014) https://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-neural-networks

-

512. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
+

520. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, Shuiwang Ji
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15 (2015) https://doi.org/10.1145/2783258.2783304

-

513. Deep convolutional neural networks for annotating gene expression patterns in the mouse brain
+

521. Deep convolutional neural networks for annotating gene expression patterns in the mouse brain
Tao Zeng, Rongjian Li, Ravi Mukkamala, Jieping Ye, Shuiwang Ji
BMC Bioinformatics (2015-05-07) https://doi.org/10.1186/s12859-015-0553-9

-

514. Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning
+

522. Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning
Tanel Pärnamaa, Leopold Parts
G3: Genes|Genomes|Genetics (2017-04-08) https://doi.org/10.1534/g3.116.033654

-

515. Automated analysis of high‐content microscopy data with deep learning
+

523. Automated analysis of high‐content microscopy data with deep learning
Oren Z Kraus, Ben T Grys, Jimmy Ba, Yolanda Chong, Brendan J Frey, Charles Boone, Brenda J Andrews
Molecular Systems Biology (2017-04) https://doi.org/10.15252/msb.20177551

-

516. Multimodal Deep Learning
+

524. Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng
Proceedings of the 28th International Conference on Machine Learning (2011) https://ccrma.stanford.edu/~juhan/pubs/NgiamKhoslaKimNamLeeNg2011.pdf

-

517. Deep Learning based multi-omics integration robustly predicts survival in liver cancer
+

525. Deep Learning based multi-omics integration robustly predicts survival in liver cancer
Kumardeep Chaudhary, Olivier B. Poirion, Liangqun Lu, Lana X. Garmire
Cold Spring Harbor Laboratory (2017-03-08) https://doi.org/10.1101/114892

-

518. FIDDLE: An integrative deep learning framework for functional genomic data inference
+

526. FIDDLE: An integrative deep learning framework for functional genomic data inference
Umut Eser, L. Stirling Churchman
Cold Spring Harbor Laboratory (2016-10-17) https://doi.org/10.1101/081380

-

519. Modeling Reactivity to Biological Macromolecules with a Deep Multitask Network
+

527. Modeling Reactivity to Biological Macromolecules with a Deep Multitask Network
Tyler B. Hughes, Na Le Dang, Grover P. Miller, S. Joshua Swamidass
ACS Central Science (2016-08-24) https://doi.org/10.1021/acscentsci.6b00162

-

520. IBM edges closer to human speech recognition
+

528. IBM edges closer to human speech recognition
BI Intelligence
Business Insider (2017-03-14) http://www.businessinsider.com/ibm-edges-closer-to-human-speech-recognition-2017-3

-

521. Achieving Human Parity in Conversational Speech Recognition
+

529. Achieving Human Parity in Conversational Speech Recognition
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig
arXiv (2016-10-17) https://arxiv.org/abs/1610.05256v2

-

522. English Conversational Telephone Speech Recognition by Humans and Machines
+

530. English Conversational Telephone Speech Recognition by Humans and Machines
George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, … Phil Hall
arXiv (2017-03-06) https://arxiv.org/abs/1703.02136v1

-

523. Intriguing properties of neural networks
+

531. Intriguing properties of neural networks
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus
arXiv (2013-12-21) https://arxiv.org/abs/1312.6199v4

-

524. Explaining and Harnessing Adversarial Examples
+

532. Explaining and Harnessing Adversarial Examples
Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy
arXiv (2014-12-20) https://arxiv.org/abs/1412.6572v3

-

525. Towards the Science of Security and Privacy in Machine Learning
+

533. Towards the Science of Security and Privacy in Machine Learning
Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, Michael Wellman
arXiv (2016-11-11) https://arxiv.org/abs/1611.03814v1

-

526. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks
+

534. Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks
Weilin Xu, David Evans, Yanjun Qi
arXiv (2017-04-04) https://arxiv.org/abs/1704.01155v1

-

527. The Grey Literature — Proof of prespecified endpoints in medical research with the bitcoin blockchain
+

535. The Grey Literature — Proof of prespecified endpoints in medical research with the bitcoin blockchain
Benjamin Gregory Carlisle
(2014-08-25) https://www.bgcarlisle.com/blog/2014/08/25/proof-of-prespecified-endpoints-in-medical-research-with-the-bitcoin-blockchain/

-

528. The most interesting case of scientific irreproducibility?
+

536. The most interesting case of scientific irreproducibility?
Daniel Himmelstein
Satoshi Village (2017-03-08) http://blog.dhimmel.com/irreproducible-timestamps/

-

529. OpenTimestamps: a timestamping proof standard
(2017-05-16) https://opentimestamps.org/

+

537. OpenTimestamps: a timestamping proof standard
(2017-05-16) https://opentimestamps.org/

-

530. greenelab/deep-review
GitHub (2017) https://github.com/greenelab/deep-review

+

538. greenelab/deep-review
GitHub (2017) https://github.com/greenelab/deep-review