
Add Moonshine #34784

Merged · 93 commits · Jan 10, 2025
Changes from 1 commit
Commits
35434da
config draft
eustlb Nov 18, 2024
7e18038
full encoder forward
eustlb Dec 2, 2024
6517251
full decoder forward
eustlb Dec 3, 2024
b0efed1
fix sdpa and FA2
eustlb Dec 12, 2024
b4d18f9
fix sdpa and FA2
eustlb Dec 12, 2024
b3777e0
moonshine model
eustlb Dec 13, 2024
e313ab5
moonshine model forward
eustlb Dec 13, 2024
7a6935a
fix attention with past_key_values
eustlb Dec 13, 2024
8fda426
add MoonshineForConditionalGeneration
eustlb Dec 13, 2024
d0ed917
fix cache handling and causality for cross attention
eustlb Dec 13, 2024
461f210
no causal attention mask for the encoder
eustlb Dec 15, 2024
22dbaae
model addition (imports etc)
eustlb Dec 15, 2024
72ba8c4
small nit
eustlb Dec 15, 2024
f548504
nits
eustlb Dec 16, 2024
3d52b1e
Update src/transformers/models/moonshine/convert_usefulsensors_to_hf.py
eustlb Dec 16, 2024
8f82a40
add rope_theta
eustlb Dec 16, 2024
fc73b37
nits
eustlb Dec 16, 2024
aedccf5
model doc
eustlb Dec 16, 2024
0954133
Update src/transformers/models/auto/configuration_auto.py
eustlb Dec 17, 2024
b1f0909
imports
eustlb Dec 17, 2024
f647a9f
add MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES
eustlb Dec 17, 2024
2da255d
updates modular
eustlb Dec 17, 2024
407cd3e
make
eustlb Dec 17, 2024
b3da73d
make fix-copies
eustlb Dec 17, 2024
8cf195d
ruff check examples fix
eustlb Dec 17, 2024
f78db3d
fix check_modular_conversion
eustlb Dec 17, 2024
5927846
Merge branch 'main' into add-moonshine
eustlb Dec 17, 2024
b680f0d
nit
eustlb Dec 17, 2024
01a4ea1
nits
eustlb Dec 17, 2024
5124cbd
nits
eustlb Dec 17, 2024
a528bd3
copied from -> imports
eustlb Dec 18, 2024
5648c38
Merge branch 'main' into add-moonshine
eustlb Dec 20, 2024
338c7c0
imports fix
eustlb Dec 20, 2024
2ec366a
integrate attention refacto
eustlb Dec 20, 2024
5261886
modular edge case
eustlb Dec 20, 2024
467b8e4
remove encoder
eustlb Jan 2, 2025
445dfcb
convolutions params in config
eustlb Jan 2, 2025
a656e8c
run modular_model_converter
eustlb Jan 2, 2025
6547634
make
eustlb Jan 2, 2025
9eec03b
Update docs/source/en/model_doc/moonshine.md
eustlb Jan 2, 2025
53b9b9d
MoonshineModelTest
eustlb Jan 2, 2025
fb2258f
correct typo
eustlb Jan 7, 2025
259c6b3
make style
eustlb Jan 7, 2025
59dd57f
integration tests
eustlb Jan 7, 2025
8fe2cdc
Merge branch 'main' into add-moonshine
eustlb Jan 7, 2025
02c4343
make
eustlb Jan 7, 2025
6c7dc79
modular convert
eustlb Jan 7, 2025
89904f9
Merge branch 'main' into add-moonshine
eustlb Jan 8, 2025
829a2b8
name conversion update (up_proj -> fc1 etc)
eustlb Jan 9, 2025
02a1f1f
update config
eustlb Jan 9, 2025
80c39cd
update MLP
eustlb Jan 9, 2025
55ac7e2
update attention
eustlb Jan 9, 2025
4d540af
update encoder layer
eustlb Jan 9, 2025
4f52158
update decoder layer
eustlb Jan 9, 2025
6568576
update convolutions parameters
eustlb Jan 9, 2025
495bd73
update encoder
eustlb Jan 9, 2025
48492c2
remove INPUTS_DOCSTRING
eustlb Jan 9, 2025
21a743f
update decoder
eustlb Jan 9, 2025
3b7af33
update conditional generation
eustlb Jan 9, 2025
563a360
update pretrained model
eustlb Jan 9, 2025
0ebd9a5
imports
eustlb Jan 9, 2025
d0233a7
modular converted
eustlb Jan 9, 2025
f9a6cac
update doc
eustlb Jan 9, 2025
b8a6e3b
fix
eustlb Jan 9, 2025
bfea59a
typo
eustlb Jan 9, 2025
ea150d3
update doc
eustlb Jan 9, 2025
8170cc8
update license
eustlb Jan 9, 2025
9bc7b35
update init
eustlb Jan 9, 2025
2bbb8ee
split config in file
eustlb Jan 9, 2025
b42d7f6
two classes for MLP
eustlb Jan 9, 2025
08272b9
attention from GLM
eustlb Jan 9, 2025
61462db
from GlmRotaryEmbedding
eustlb Jan 9, 2025
c866e58
split MLP
eustlb Jan 9, 2025
b0183be
apply arthur's review suggestions
eustlb Jan 9, 2025
3aa4f8d
apply arthur's review suggestions
eustlb Jan 9, 2025
d64190c
apply arthur's review suggestions
eustlb Jan 9, 2025
0bfb6bc
auto feature extractor
eustlb Jan 9, 2025
1268e13
convert modular
eustlb Jan 9, 2025
141a70b
fix + make
eustlb Jan 9, 2025
70f16dc
Merge branch 'main' into add-moonshine
eustlb Jan 9, 2025
1f3cc63
convert modular
eustlb Jan 9, 2025
77ca19a
make
eustlb Jan 9, 2025
e6b19db
unsplit config
eustlb Jan 9, 2025
7155fbd
use correct checkpoint
eustlb Jan 9, 2025
15b2552
wrap generate
eustlb Jan 9, 2025
89ceca8
Merge branch 'main' into add-moonshine
eustlb Jan 9, 2025
fc4febc
update tests
eustlb Jan 10, 2025
3e2a84b
typos
eustlb Jan 10, 2025
9a8f91d
make
eustlb Jan 10, 2025
834c364
typo
eustlb Jan 10, 2025
b491d95
update doc
eustlb Jan 10, 2025
d357fff
Merge branch 'main' into add-moonshine
eustlb Jan 10, 2025
2a6c59e
Merge branch 'main' into add-moonshine
eustlb Jan 10, 2025
config draft
eustlb committed Nov 18, 2024
commit 35434da084655e6e00546e9c8bbfac7f2d10c6dd
50 changes: 50 additions & 0 deletions src/transformers/models/moonshine/__init__.py
@@ -0,0 +1,50 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {
    "configuration_moonshine": ["MoonshineConfig"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    # Modeling entries will be registered here once the model classes land.
    _import_structure["modeling_moonshine"] = []

if TYPE_CHECKING:
    from .configuration_moonshine import MoonshineConfig

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        pass

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
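A note on the scaffolding above: `_LazyModule` defers importing each submodule until one of its exported names is first accessed, so `import transformers` stays cheap as models are added. A minimal sketch of that mechanism, assuming a simplified stand-in class (the real `_LazyModule` in `transformers/utils` also handles optional backends, pickling, and module specs):

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Simplified stand-in for transformers' _LazyModule (illustrative only)."""

    def __init__(self, name: str, import_structure: dict):
        super().__init__(name)
        # Map each exported name to the submodule that defines it,
        # e.g. "MoonshineConfig" -> "configuration_moonshine".
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr: str):
        module_name = self._name_to_module.get(attr)
        if module_name is None:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        # The submodule is imported only here, on first attribute access.
        module = importlib.import_module(f".{module_name}", self.__name__)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value
```

With `sys.modules[__name__]` replaced by such an object, `from transformers.models.moonshine import MoonshineConfig` only imports `configuration_moonshine` at that moment.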
246 changes: 246 additions & 0 deletions src/transformers/models/moonshine/modular_moonshine.py
@@ -0,0 +1,246 @@
from ...configuration_utils import PretrainedConfig
from ..gemma2.modeling_gemma2 import (
    Gemma2Attention,
    Gemma2FlashAttention2,
    Gemma2SdpaAttention,
)

class MoonshineConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MoonshineModel`]. It is used to instantiate a
    Moonshine model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the Moonshine
    [UsefulSensors/moonshine](https://huggingface.co/UsefulSensors/moonshine) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32768):
            Vocabulary size of the Moonshine model. Defines the number of different tokens that can be represented by
            the `decoder_input_ids` passed when calling [`MoonshineModel`].
        encoder_layers (`int`, *optional*, defaults to 6):
            Number of encoder layers.
        decoder_layers (`int`, *optional*, defaults to 6):
            Number of decoder layers.
        encoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer decoder.
        encoder_ffn_dim (`int`, *optional*, defaults to 1152):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
        decoder_ffn_dim (`int`, *optional*, defaults to 1152):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_start_token_id (`int`, *optional*, defaults to 50257):
            Corresponds to the "<|startoftranscript|>" token, which is automatically used when no `decoder_input_ids`
            are provided to the `generate` function. It is used to guide the model's generation process depending on
            the task.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        activation_function (`str`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        d_model (`int`, *optional*, defaults to 288):
            Dimensionality of the layers.
        dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        scale_embedding (`bool`, *optional*, defaults to `False`):
            Scale embeddings by dividing by sqrt(d_model).
        max_source_positions (`int`, *optional*, defaults to 1500):
            The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
        max_target_positions (`int`, *optional*, defaults to 448):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        pad_token_id (`int`, *optional*, defaults to 50256):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 50256):
            Begin of stream token id.
        eos_token_id (`int`, *optional*, defaults to 50256):
            End of stream token id.
        suppress_tokens (`List[int]`, *optional*):
            A list containing the non-speech tokens that will be used by the logit processor in the `generate`
            function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI correspond to the `english-only` and the
            `multilingual` model respectively.
        begin_suppress_tokens (`List[int]`, *optional*, defaults to `[220, 50256]`):
            A list containing tokens that will be suppressed at the beginning of the sampling process. Initialized as
            the token for `" "` (`blank_token_id`) and the `eos_token_id`.
        use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
            Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
            instance of [`MoonshineForAudioClassification`].
        classifier_proj_size (`int`, *optional*, defaults to 256):
            Dimensionality of the projection before token mean-pooling for classification. Only relevant when using an
            instance of [`MoonshineForAudioClassification`].
        apply_spec_augment (`bool`, *optional*, defaults to `False`):
            Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see
            [SpecAugment: A Simple Data Augmentation Method for Automatic Speech
            Recognition](https://arxiv.org/abs/1904.08779).
        mask_time_prob (`float`, *optional*, defaults to 0.05):
            Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking
            procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If
            reasoning from the probability of each feature vector to be chosen as the start of the vector span to be
            masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease
            the actual percentage of masked vectors. This is only relevant if `apply_spec_augment == True`.
        mask_time_length (`int`, *optional*, defaults to 10):
            Length of vector span along the time axis.
        mask_time_min_masks (`int`, *optional*, defaults to 2):
            The minimum number of masks of length `mask_time_length` generated along the time axis, each time step,
            irrespectively of `mask_time_prob`. Only relevant if
            `mask_time_prob*len(time_axis)/mask_time_length < mask_time_min_masks`.
        mask_feature_prob (`float`, *optional*, defaults to 0.0):
            Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The
            masking procedure generates `mask_feature_prob*len(feature_axis)/mask_feature_length` independent masks
            over the axis. If reasoning from the probability of each feature vector to be chosen as the start of the
            vector span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that
            overlap may decrease the actual percentage of masked vectors. This is only relevant if
            `apply_spec_augment is True`.
        mask_feature_length (`int`, *optional*, defaults to 10):
            Length of vector span along the feature axis.
        mask_feature_min_masks (`int`, *optional*, defaults to 0):
            The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time
            step, irrespectively of `mask_feature_prob`. Only relevant if
            `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`.
        median_filter_width (`int`, *optional*, defaults to 7):
            Width of the median filter used to smooth the cross-attention outputs when computing token timestamps.
            Should be an odd number.

    Example:

    ```python
    >>> from transformers import MoonshineConfig, MoonshineModel

    >>> # Initializing a Moonshine tiny style configuration
    >>> configuration = MoonshineConfig()

    >>> # Initializing a model (with random weights) from the tiny style configuration
    >>> model = MoonshineModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "moonshine"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {
        "num_key_value_heads": "encoder_attention_heads",
        "num_attention_heads": "encoder_attention_heads",
        "hidden_size": "d_model",
    }

    def __init__(
        self,
        vocab_size=32768,
        encoder_layers=6,
        encoder_attention_heads=8,
        decoder_layers=6,
        decoder_attention_heads=8,
        decoder_ffn_dim=1152,
        encoder_ffn_dim=1152,
        encoder_layerdrop=0.0,
        decoder_layerdrop=0.0,
        decoder_start_token_id=50257,
        use_cache=True,
        is_encoder_decoder=True,
        activation_function="gelu",
        d_model=288,
        dropout=0.0,
        attention_dropout=0.0,
        activation_dropout=0.0,
        init_std=0.02,
        scale_embedding=False,
        max_source_positions=1500,
        max_target_positions=448,
        pad_token_id=50256,
        bos_token_id=50256,
        eos_token_id=50256,
        suppress_tokens=None,
        begin_suppress_tokens=[220, 50256],
        use_weighted_layer_sum=False,
        classifier_proj_size=256,
        apply_spec_augment=False,
        mask_time_prob=0.05,
        mask_time_length=10,
        mask_time_min_masks=2,
        mask_feature_prob=0.0,
        mask_feature_length=10,
        mask_feature_min_masks=0,
        median_filter_width=7,
        **kwargs,
    ):
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            is_encoder_decoder=is_encoder_decoder,
            decoder_start_token_id=decoder_start_token_id,
            suppress_tokens=suppress_tokens,
            begin_suppress_tokens=begin_suppress_tokens,
            **kwargs,
        )

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.encoder_layers = encoder_layers
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_layers = decoder_layers
        self.decoder_attention_heads = decoder_attention_heads
        self.decoder_ffn_dim = decoder_ffn_dim
        self.encoder_ffn_dim = encoder_ffn_dim
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.init_std = init_std
        self.encoder_layerdrop = encoder_layerdrop
        self.decoder_layerdrop = decoder_layerdrop
        self.use_cache = use_cache
        self.num_hidden_layers = encoder_layers
        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
        self.max_source_positions = max_source_positions
        self.max_target_positions = max_target_positions

        # Audio classification-specific parameters. Feel free to ignore for other classes.
        self.classifier_proj_size = classifier_proj_size
        self.use_weighted_layer_sum = use_weighted_layer_sum

        # Fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
        # E.g. with mask_time_prob=0.05 and mask_time_length=10, a 1500-frame input yields
        # about int(0.05 * 1500 / 10) = 7 time masks (never fewer than mask_time_min_masks).
        self.apply_spec_augment = apply_spec_augment
        self.mask_time_prob = mask_time_prob
        self.mask_time_length = mask_time_length
        self.mask_time_min_masks = mask_time_min_masks
        self.mask_feature_prob = mask_feature_prob
        self.mask_feature_length = mask_feature_length
        self.mask_feature_min_masks = mask_feature_min_masks

        # draft
        self.median_filter_width = median_filter_width
        self.head_dim = self.d_model // self.encoder_attention_heads
        # Placeholder rotary/attention settings mirroring the Gemma2 template; to be tuned for Moonshine.
        self.max_position_embeddings = 2048
        self.rope_theta = 10000.0
        self.query_pre_attn_scalar = self.head_dim
        self.attention_bias = True
        self.sliding_window = 4096
        self.final_logit_softcapping = 30.0
        self.attn_logit_softcapping = None
        self.final_logit_softcapping_type = None


class MoonshineAttention(Gemma2Attention):
    pass


class MoonshineFlashAttention2(Gemma2FlashAttention2):
    pass


class MoonshineSdpaAttention(Gemma2SdpaAttention):
    pass
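These empty subclasses are the point of the `modular` file: `utils/modular_model_converter.py` (run repeatedly in the commits above) expands each `class MoonshineX(Gemma2X): pass` stub into a full class definition in the generated modeling file, with the `Gemma2` prefix renamed to `Moonshine`. A rough sketch of the effect, assuming a textual rename is an acceptable approximation (the actual converter rewrites the syntax tree and resolves dependencies rather than editing source strings):

```python
import inspect
import re

from transformers.models.gemma2 import modeling_gemma2


def expand_modular_stub(parent_cls: type, old_prefix: str = "Gemma2", new_prefix: str = "Moonshine") -> str:
    """Copy the parent class's source, renaming the model prefix throughout (illustrative only)."""
    source = inspect.getsource(parent_cls)
    # "Gemma2Attention" -> "MoonshineAttention", "Gemma2Config" -> "MoonshineConfig", ...
    return re.sub(rf"\b{old_prefix}(\w*)", rf"{new_prefix}\1", source)


# Roughly what `class MoonshineAttention(Gemma2Attention): pass` asks the converter to emit:
print(expand_modular_stub(modeling_gemma2.Gemma2Attention))
```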