@@ -78,7 +78,8 @@ <h2>Description</h2>
  We use the CLAP loss as an example, confirming that end-to-end fine-tuning further boosts the generation quality.
  </p>
  <p>
- <b>Please join us at <a href="https://interspeech2024.org" target="_blank">INTERSPEECH 2024</a> at Kos Island, Greece!</b>
+ <b>Please check out <a href="poster.pdf" target="_blank">our poster</a> at
+ <a href="https://interspeech2024.org" target="_blank">INTERSPEECH 2024</a> at Kos Island, Greece!</b>
  </p>
  </section>

@@ -113,40 +114,36 @@ <h2>Main Experiment Results</h2>
  </thead>
  <tbody>
  <tr class="result-row-2" style="color: #898989">
- <td class="result-data-small"><span style="font-weight: 400;">AudioLDM-L (Baseline)</span></td>
- <td class="result-data-2">400</td> <td class="result-data-2">-</td> <td class="result-data">-</td>
+ <td class="result-data-small">AudioLDM-L (Baseline)</td> <td class="result-data-2">400</td>
+ <td class="result-data-2">-</td> <td class="result-data">-</td>
  <td class="result-data">-</td> <td class="result-data-2">-</td> <td class="result-data-2">-</td>
- <td class="result-data-2"><span style="font-weight: 400;">2.08</span></td> <td class="result-data-2">27.12</td>
- <td class="result-data-2">1.86</td>
+ <td class="result-data-2-400">2.08</td> <td class="result-data-2">27.12</td> <td class="result-data-2">1.86</td>
  </tr>
  <tr class="result-row-2" style="color: #898989">
- <td class="result-data-small"><span style="font-weight: 400;">TANGO (Baseline)</span></td>
+ <td class="result-data-small">TANGO (Baseline)</td>
  <td class="result-data-2">400</td> <td class="result-data-2">168</td>
  <td class="result-data"><b>4.136</b></td> <td class="result-data"><b>4.064</b></td>
- <td class="result-data-2"><span style="font-weight: 400;">24.10</span></td> <td class="result-data-2"><b>72.85</b></td>
- <td class="result-data-2"><b>1.631</b></td> <td class="result-data-2"><b>20.11</b></td>
- <td class="result-data-2">1.362</td>
+ <td class="result-data-2-400">24.10</td> <td class="result-data-2"><b>72.85</b></td>
+ <td class="result-data-2"><b>1.631</b></td> <td class="result-data-2"><b>20.11</b></td> <td class="result-data-2">1.362</td>
  </tr>
  <tr class="result-row">
- <td class="result-data-small"><span style="font-weight: 400;">ConsistencyTTA + CLAP-FT</span></td>
+ <td class="result-data-small">ConsistencyTTA + CLAP-FT</td>
  <td class="result-data-2"><b>1</b></td> <td class="result-data-2"><b>2.3</b></td>
  <td class="result-data">3.830</td> <td class="result-data"><b>4.064</b></td>
- <td class="result-data-2"><b>24.69</b></td> <td class="result-data-2"><span style="font-weight: 400;">72.54</span></td>
- <td class="result-data-2">2.406</td> <td class="result-data-2"><span style="font-weight: 400;">20.97</span></td>
- <td class="result-data-2"><span style="font-weight: 400;">1.358</span></td>
+ <td class="result-data-2"><b>24.69</b></td> <td class="result-data-2-400">72.54</td>
+ <td class="result-data-2">2.406</td> <td class="result-data-2-400">20.97</td> <td class="result-data-2-400">1.358</td>
  </tr>
  <tr class="result-row">
- <td class="result-data-small"><span style="font-weight: 400;">ConsistencyTTA</span></td>
+ <td class="result-data-small">ConsistencyTTA</td>
  <td class="result-data-2"><b>1</b></td> <td class="result-data-2"><b>2.3</b></td>
- <td class="result-data"><span style="font-weight: 400;">3.902</span></td> <td class="result-data">4.010</td>
+ <td class="result-data-400">3.902</td> <td class="result-data">4.010</td>
  <td class="result-data-2">22.50</td> <td class="result-data-2">72.30</td>
  <td class="result-data-2">2.575</td> <td class="result-data-2">22.08</td>
  <td class="result-data-2"><b>1.354</b></td>
  </tr>
  <tr class="result-row-2-small" style="color: #898989">
- <td class="result-data"><span style="font-weight: 400;">Ground Truth</span></td>
- <td class="result-data-2">-</td> <td class="result-data-2">-</td>
- <td class="result-data">-</td> <td class="result-data">-</td>
+ <td class="result-data-small">Ground Truth</td> <td class="result-data-2">-</td>
+ <td class="result-data-2">-</td> <td class="result-data">-</td> <td class="result-data">-</td>
  <td class="result-data-2">26.71</td> <td class="result-data-2">100</td>
  <td class="result-data-2">-</td> <td class="result-data-2">-</td> <td class="result-data-2">-</td>
  </tr>
@@ -155,7 +152,90 @@ <h2>Main Experiment Results</h2>
  <p>
  <a href="https://paperswithcode.com/sota/audio-generation-on-audiocaps" target="_blank">This benchmark</a>
  demonstrates how our single-step models stack up against previous methods,
- most of which mostly require hundreds of generation steps.
+ most of which require hundreds of generation steps.
+ </p>
+ </section>
+
+ <section class="section">
+ <h2>Ablation Studies on Distillation Settings</h2>
+ <p>
+ <table class="result-table">
+ <thead>
+ <tr class="result-row">
+ <th class="result-head">Guidance Method</th>
+ <th class="result-head">CFG Weight</th>
+ <th class="result-head">Teacher Solver</th>
+ <th class="result-head">Noise Schedule</th>
+ <th class="result-head-2">FAD ↓</th>
+ <th class="result-head-2">FD ↓</th>
+ <th class="result-head-2">KLD ↓</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr class="result-row-2">
+ <td class="result-data-small">Unguided</td>
+ <td class="result-data-small">1</td>
+ <td class="result-data-small">DDIM</td>
+ <td class="result-data-small">Uniform</td>
+ <td class="result-data-2">13.48</td>
+ <td class="result-data-2">45.75</td>
+ <td class="result-data-2">2.409</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small" rowspan="2">External CFG</td>
+ <td class="result-data-small" rowspan="2">3</td>
+ <td class="result-data-small">DDIM</td>
+ <td class="result-data-small">Uniform</td>
+ <td class="result-data-2">8.565</td>
+ <td class="result-data-2">38.67</td>
+ <td class="result-data-2">2.015</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small">Heun</td>
+ <td class="result-data-small">Karras</td>
+ <td class="result-data-2">7.421</td>
+ <td class="result-data-2">39.36</td>
+ <td class="result-data-2">1.976</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small" rowspan="2">CFG Distillation<br>with Fixed Weight</td>
+ <td class="result-data-small" rowspan="2">3</td>
+ <td class="result-data-small" rowspan="2">Heun</td>
+ <td class="result-data-small">Karras</td>
+ <td class="result-data-2">5.702</td>
+ <td class="result-data-2">33.18</td>
+ <td class="result-data-2">1.494</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small">Uniform</td>
+ <td class="result-data-2">3.859</td>
+ <td class="result-data-2"><b>27.79</b></td>
+ <td class="result-data-2">1.421</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small" rowspan="3">CFG Distillation<br>with Random Weight</td>
+ <td class="result-data-small">4</td>
+ <td class="result-data-small" rowspan="2">Heun</td>
+ <td class="result-data-small" rowspan="2">Uniform</td>
+ <td class="result-data-2-400">3.180</td>
+ <td class="result-data-2-400">27.92</td>
+ <td class="result-data-2-400">1.394</td>
+ </tr>
+ <tr class="result-row-2">
+ <td class="result-data-small">6</td>
+ <td class="result-data-2"><b>2.975</b></td>
+ <td class="result-data-2">28.63</td>
+ <td class="result-data-2"><b>1.378</b></td>
+ </tr>
+ </tbody>
+ </table>
+ Based on these results, we can conclude that:
+ <ul>
+ <li>CFG distillation with random weight is more effective than fixed weight,
+ which is more effective than external CFG.</li>
+ <li>Heun is a better teacher solver than DDIM, and
+ Uniform noise schedule outperforms Karras noise schedule.</li>
+ </ul>
  </p>
  </section>

@@ -183,11 +263,11 @@ <h2>Human Evaluation</h2>
  <h2>Citing Our Work (BibTeX)</h2>
  <div id="bibtex1" class="bibtex" onclick="copyToClipboard('bibtex1')">
  <i class="far fa-copy copy-icon"></i>
- <pre>@article{bai2023accelerating,
+ <pre>@inproceedings{bai2024accelerating,
  author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
- title = {Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
- journal={arXiv preprint arXiv:2309.10740},
- year = {2023}
+ title = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
+ booktitle = {INTERSPEECH},
+ year = {2024}
  }</pre>
  </div>
  </section>