-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathappend.tex
256 lines (249 loc) · 6.55 KB
/
append.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
\chapter{Appendix}
\section{Modelling time series}
\label{sec:full_reduced}
An alternative method (as to \ref{sec:mds}) to determine
the dynamic response of a certain
gene in the network is based on the work of \cite{Mar2009}.
Since statistical analyses of microarray time series data usually
suffers from the lack
of replicates at each time point, classical statistical tests
cannot be directly applied. In order to
circumvent this problem and increase the statistical power,
previous studies~\citep{Bar-Joseph2003,Storey2005} have taken into
account all data points at the
same time by fitting the time series with a spline function and
performing likelihood ratio test to judge the significance of
differential expression.
Our approach works in an analogous way.
More specifically, we fitted cubic regression models to each
individual gene expression profile. A cubic model was chosen
because of the moderately large number of time points
available to fit a model with four to eight parameters.
For a single gene, both a full model and a reduced model are
fitted to its time-course expression profiles under the
stimulation and control condition. The full model specifies
a set of parameters that capture the time-dependent curvature
of a gene's expression profile for each separate condition.
In this way, the full model assumes that the expression
profile is different across the two conditions, or
\begin{eqnarray}
\begin{pmatrix}
Y_{control,1}\\
\cdots\\
Y_{control,N}\\
Y_{stimulation,1}\\
\cdots\\
Y_{stimulation,N}
\end{pmatrix}
%
&= \alpha_0
\begin{pmatrix}
1\\
\cdots\\
1\\
0\\
\cdots\\
0
\end{pmatrix}
%
+ \alpha_1
\begin{pmatrix}
t_1\\
\cdots\\
t_N\\
0\\
\cdots\\
0
\end{pmatrix}
%
+ \alpha_2
\begin{pmatrix}
t_1^2\\
\cdots\\
t_N^2\\
0\\
\cdots\\
0
\end{pmatrix}
%
+ \alpha_3
\begin{pmatrix}
t_1^3\\
\cdots\\
t_N^3\\
0\\
\cdots\\
0
\end{pmatrix}\notag \\
%
&+ \beta_0
\begin{pmatrix}
0\\
\cdots\\
0\\
1\\
\cdots\\
1
\end{pmatrix}
%
+ \beta_1
\begin{pmatrix}
0\\
\cdots\\
0\\
t_1\\
\cdots\\
t_N
\end{pmatrix}
%
+ \beta_2
\begin{pmatrix}
0\\
\cdots\\
0\\
t_1^2\\
\cdots\\
t_N^2
\end{pmatrix}
%
+ \beta_3
\begin{pmatrix}
0\\
\cdots\\
0\\
t_1^3\\
\cdots\\
t_N^3
\end{pmatrix}
%
+
\begin{pmatrix}
\epsilon_1\\
\epsilon_2\\
\cdots\\
\cdots\\
\epsilon_{2N-1}\\
\epsilon_{2N}
\end{pmatrix}
\end{eqnarray}
where $t=(t_1,t_2,\dotsc,t_N)$ corresponds to the measured time points, and
$\epsilon = (\epsilon_1,\epsilon_2,\dotsc,\epsilon_{2N})$ represents a random
normal residual error term. The parameters $\alpha_0,\alpha_1,\alpha_2,
\alpha_3$ and $\beta_0,\beta_1,\beta_2,\beta_3$ are the coefficients of the
time covariate in the model for the control and stimulation time series
respectively.
The reduced model, in contrast, is a simpler model that assumes the expression
profiles for different conditions need only be specified by one set of
parameters, or
\begin{align}
\begin{pmatrix}
Y_{control,1}\\
\cdots\\
Y_{control,N}\\
Y_{stimulation,1}\\
\cdots\\
Y_{stimulation,N}
\end{pmatrix}
%
&= \lambda_0
\begin{pmatrix}
1\\
\cdots\\
1\\
1\\
\cdots\\
1
\end{pmatrix}
%
+ \lambda_1
\begin{pmatrix}
t_1\\
\cdots\\
t_N\\
t_1\\
\cdots\\
t_N
\end{pmatrix}
%
+ \lambda_2
\begin{pmatrix}
t_1^2\\
\cdots\\
t_N^2\\
t_1^2\\
\cdots\\
t_N^2
\end{pmatrix}
%
+ \lambda_3
\begin{pmatrix}
t_1^3\\
\cdots\\
t_N^3\\
t_1^3\\
\cdots\\
t_N^3
\end{pmatrix}\notag \\
%
&+
\begin{pmatrix}
\epsilon_1\\
\epsilon_2\\
\cdots\\
\cdots\\
\epsilon_{2N-1}\\
\epsilon_{2N}
\end{pmatrix}
\end{align}
Fitting these two models to a single gene, is equivalent to
proposing two hypotheses: one, that this gene is not differentially regulated
over time and is therefore defined by the reduced model, and two, this gene is
differentially regulated over time and is defined by the full model. To decide
which of these two hypotheses is more plausible given the available data, we
use the analysis of deviance test, which is an extension of the likelihood
ratio test (\ref{fig:full_reduced}).
\begin{figure}[!ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{hdf-fit_cebpd.pdf}
\end{center}
\caption[Full/reduced model fit of gene expression profile]{
{\bf Full/reduced model fit of the expression profile of CEBPD in fibroblasts.}
Shown are the microarray time series under the control and
keratinocyte-conditioned medium stimulation. Dashed lines indicate the full
model which assumes the alternative hypothesis of differential expression
between conditions; The dot-dashed line represents the reduced model of null
hypothesis.
}
\label{fig:full_reduced}
\end{figure}
\section{Impulse model}
\label{sec:impulse}
The qPCR time courses of \tnfa and \sdfonea measured in HPAEC cells are fitted with an impulse model~\citep{Chechik2009}, which is composed of one rising and one falling sigmoidal curve, as defined by
\[
f(x) = \frac{1}{h_1} \cdot \left(h_0+\frac{h_1-h_0}{1+e^{-\beta_1(x-t_1)}}\right) \cdot
\left(h_2+\frac{h_1-h_2}{1+e^{\beta_2(x-t_2)}}\right),
\]
where the three amplitude (height) parameters determine the initial amplitude
($h_0$), the peak amplitude
($h_1$), and the steady state amplitude ($h_2$). The onset time $t_1$ is the time of first transition (inflection point) and the offset $t_2$ is the time of second transition. Finally, the slope parameters $\beta_1$ and $\beta_2$ determine the slope of the first and second transition. Parameters are estimated using a Levenberg-Marquardt non-linear least squares algorithm as implemented in the R package \texttt{minpack.lm}.
\section{Calculating the significantly basal expressed genes}
In order to robustly estimate differentially regulated genes at the basal level and to avoid the commonly encountered
presence of outliers and skewness~\citep{Marko2011}, we fitted
a skew-$t$ distribution~\citep{Azzalini2003} to the $\log_2$ raw intensities at 0h.
In addition to the regular $t$ distribution $f(x)$ with $v_{df}$ degrees of freedom, a skew-$t$ distribution, $f_{skew}(x)$ has an additional
skewing parameter $\gamma$ such that
\[
f_{skew}\left(\frac{x-\xi}{\omega}\right) =
\begin{cases}
\frac{2\gamma}{\gamma^2+1} \cdot f\left(\gamma\frac{x-\xi}{\omega}\right) & x<0\\
\frac{2\gamma}{\gamma^2+1} \cdot f\left(\frac{1}{\gamma}\frac{x-\xi}{\omega}\right) & x \geqslant 0\\
\end{cases}
\]
where $\xi$ and $\omega$ denote the location and scale of the distribution.
The basal level gene expression distribution was fitted using the maximum likelihood
estimation for the
univariate skew-$t$ distribution, implemented in the R package \texttt{sn}.
%\section{Gene set enrichment analysis}
%\label{sec:h838_gage_table}
%\input{flow/h838_table_gage_diff_up.tex}
%\input{flow/h838_table_gage_diff_down.tex}