Upgrade Transformers to v4.43.x (adapter-hub#727)
Changes required for sync:
- re-copy Llama & Beit attention
- add clip sdp & flash attn
- fix tie_weights method
- upgrade torch version in tests

---------

Co-authored-by: Leon Engländer <[email protected]>
dainis-boumber and lenglaender committed Aug 30, 2024
1 parent 29916f8 commit 7a247a1
Showing 9 changed files with 215 additions and 32 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/tests_torch.yml
@@ -39,7 +39,7 @@ jobs:
key: ${{ runner.os }}-pip-${{ hashFiles('setup.py') }}
- name: Install
run: |
- pip install torch==2.1.2
+ pip install torch==2.3
pip install .[quality]
- name: Check Quality and Repo Consistency
run: |
@@ -62,7 +62,7 @@ jobs:
${{ runner.os }}-pip-
- name: Install
run: |
- pip install torch==2.1.2
+ pip install torch==2.3
pip install .[sklearn,testing,sentencepiece]
- name: Test
run: |
@@ -85,7 +85,7 @@ jobs:
${{ runner.os }}-pip-
- name: Install
run: |
- pip install torch==2.1.2
+ pip install torch==2.3
pip install .[sklearn,testing,sentencepiece]
- name: Test
run: |
@@ -108,7 +108,7 @@ jobs:
${{ runner.os }}-pip-
- name: Install
run: |
- pip install torch==2.1.2
+ pip install torch==2.3
pip install .[sklearn,testing,sentencepiece]
pip install conllu seqeval
- name: Test Examples
14 changes: 10 additions & 4 deletions docs/huggingface_hub.md
@@ -17,6 +17,7 @@ Alternatively, all adapters on the Hugging Face Model Hub are also listed on [ht

After you have found an adapter you would like to use, loading it into a Transformer model is easy.
For example, for loading and activating the adapter [`AdapterHub/roberta-base-pf-sick`](https://huggingface.co/AdapterHub/roberta-base-pf-sick), write:

```python
from adapters import AutoAdapterModel

@@ -34,20 +35,23 @@ For more options and information, e.g. for managing models via the CLI and Git,

1. **Prepare access credentials**: Before being able to push to the Hugging Face Model Hub for the first time, we have to store our access token in the cache.
This can be done via the `huggingface-cli` by running:
```

```sh
huggingface-cli login
```

2. **Push an adapter**: Next, we can proceed to upload our first adapter.
Let's say we have a standard pre-trained Transformers model with an existing adapter named `awesome_adapter` (e.g. added via `model.add_adapter("awesome_adapter")` and [trained](training.md) afterwards).
We can now push this adapter to the Model Hub using `model.push_adapter_to_hub()` like this:
```python
model.push_adapter_to_hub(
"my-awesome-adapter",
"awesome_adapter",
datasets_tag="imdb"
)
```
This will create a repository `my-awesome-adapter` under your username, generate a default adapter card as `README.md` and upload the adapter named `awesome_adapter` together with the adapter card to the new repository.
`datasets_tag` provides additional information for categorization.
@@ -56,12 +60,14 @@ For more options and information, e.g. for managing models via the CLI and Git,
All adapters uploaded to Hugging Face's Model Hub are automatically also listed on AdapterHub.ml. Thus, for better categorization, ``datasets_tag`` is helpful when uploading a new adapter to the Model Hub. ``datasets_tag`` specifies the dataset the adapter was trained on as an identifier from `Hugging Face Datasets <https://huggingface.co/datasets>`_.
```

Voilà! Your first adapter is on the Hugging Face Model Hub.
Anyone can now run:
```

```python
model.load_adapter("<your_username>/my-awesome-adapter")
```

To update your adapter, simply run `push_adapter_to_hub()` with the same repository name again. This will push a new commit to the existing repository.
You can find the full documentation of `push_adapter_to_hub()` [here](adapters.hub_mixin.PushAdapterToHubMixin.push_adapter_to_hub).
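As a supplementary illustration (not part of the diff), here is a minimal end-to-end sketch combining the calls shown in this documentation page; the base model, adapter names, and username are placeholders.

```python
from adapters import AutoAdapterModel

# Placeholder base model; any model supported by the adapters library works.
model = AutoAdapterModel.from_pretrained("roberta-base")

# Add an adapter and train it (training loop omitted, see training.md).
model.add_adapter("awesome_adapter")
# ... training ...

# Push the trained adapter; this creates the repository "my-awesome-adapter"
# under your username and generates a default adapter card.
model.push_adapter_to_hub(
    "my-awesome-adapter",
    "awesome_adapter",
    datasets_tag="imdb",
)

# Anyone can then load and activate the published adapter in a single call.
adapter_name = model.load_adapter("<your_username>/my-awesome-adapter", set_active=True)
```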
4 changes: 3 additions & 1 deletion docs/loading.md
@@ -55,11 +55,13 @@ adapter_name = model.load_adapter('sst-2')

In the minimal case, that's everything we need to specify to load a pre-trained task adapter for sentiment analysis, trained on the `sst-2` dataset using BERT base and a suitable adapter configuration.
The name of the adapter is returned by [`load_adapter()`](adapters.ModelWithHeadsAdaptersMixin.load_adapter), so we can [activate it](adapter_composition.md) in the next step:

```python
model.set_active_adapters(adapter_name)
```

As the second example, let's have a look at how to load an adapter based on the [`AdapterInfo`](adapters.utils.AdapterInfo) returned by the [`list_adapters()`](adapters.utils.list_adapters) method from [above](#finding-pre-trained-adapters):

```python
from adapters import AutoAdapterModel, list_adapters

@@ -93,4 +95,4 @@ We will go through the different arguments and their meaning one by one:
- By default, the `load_adapter()` method will add the loaded adapter using the identifier string given as the first argument.
To load the adapter using a custom name, we can use the `load_as` parameter.

- Finally, `set_active` will directly activate the loaded adapter for usage in each model forward pass. Otherwise, you have to manually activate the adapter via `set_active_adapters()`.
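A short, hedged sketch of the `load_as` and `set_active` arguments discussed above; the adapter identifier is the `sst-2` example from this page and the base model is a placeholder.

```python
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-uncased")  # placeholder model

# Load the sst-2 adapter under a custom local name and activate it right away,
# so a separate set_active_adapters() call is not needed.
adapter_name = model.load_adapter(
    "sst-2",              # identifier of the pre-trained adapter, as in the example above
    load_as="sentiment",  # custom name instead of the identifier string
    set_active=True,      # activate the adapter for every forward pass
)
print(adapter_name)  # expected to be "sentiment" when load_as is set
```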
4 changes: 2 additions & 2 deletions setup.py
@@ -57,8 +57,8 @@
"sphinx-intl==2.1.0",
"sphinx-multiversion==0.2.4",
"timeout-decorator",
"torch>=1.10,!=1.12.0",
"transformers~=4.42.4",
"torch",
"transformers~=4.43.3",
]


16 changes: 10 additions & 6 deletions src/adapters/heads/model_mixin.py
@@ -53,8 +53,8 @@ class ModelWithFlexibleHeadsAdaptersMixin(ModelWithHeadsAdaptersMixin):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._convert_to_flex_head = True
- if not hasattr(self.config, "custom_heads"):
-     self.config.custom_heads = {}
+ if not hasattr(self, "custom_heads"):
+     self.custom_heads = {}
self._active_heads = []

def head_type(head_type_str: str):
@@ -88,6 +88,8 @@ def _init_head_modules(self):
for head_name, config in self.config.prediction_heads.items():
self.add_prediction_head_from_config(head_name, config)

+ self._add_tied_weights_keys()

# The following methods are required for handling LM heads

def get_output_embeddings(self) -> Union[nn.Module, List[nn.Module]]:
@@ -132,6 +134,8 @@ def tie_weights(self):
self = getattr(self, self.base_model_prefix)
self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

+ super().tie_weights()

def _resize_token_embeddings(self, new_num_tokens, pad_to_multiple_of=None):
old_embeddings = self.get_input_embeddings()
new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens, pad_to_multiple_of)
@@ -174,7 +178,7 @@ def add_prediction_head_from_config(
head_class = MODEL_HEAD_MAP[head_type]
head = head_class(self, head_name, **config)
self.add_prediction_head(head, overwrite_ok=overwrite_ok, set_active=set_active)
- elif head_type in self.config.custom_heads:
+ elif head_type in self.custom_heads:
# we have to re-add the head type for custom heads
self.add_custom_head(head_type, head_name, overwrite_ok=overwrite_ok, **config)
else:
@@ -191,7 +195,7 @@ def get_prediction_heads_config(self):
return heads

def register_custom_head(self, identifier, head):
- self.config.custom_heads[identifier] = head
+ self.custom_heads[identifier] = head

@property
def active_head(self) -> Union[str, List[str]]:
@@ -251,8 +255,8 @@ def set_active_adapters(
)

def add_custom_head(self, head_type, head_name, overwrite_ok=False, set_active=True, **kwargs):
- if head_type in self.config.custom_heads:
-     head = self.config.custom_heads[head_type](self, head_name, **kwargs)
+ if head_type in self.custom_heads:
+     head = self.custom_heads[head_type](self, head_name, **kwargs)
# When a build-in head is added as a custom head it does not have the head_type property
if not hasattr(head.config, "head_type"):
head.config["head_type"] = head_type
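Since `custom_heads` now lives on the model instead of `model.config`, a rough sketch of how the relocated registry is exercised from user code follows; the head type name and base model are assumptions, and the only contract taken from the diff is that `add_custom_head()` instantiates the registered class as `custom_heads[head_type](model, head_name, **kwargs)`.

```python
from adapters import AutoAdapterModel
from adapters.heads import ClassificationHead  # built-in head, registered here as a custom head

model = AutoAdapterModel.from_pretrained("bert-base-uncased")  # placeholder model

# Register a head class under a custom type identifier; after this change the
# registry is a plain attribute on the model rather than on model.config.
model.register_custom_head("my_classification", ClassificationHead)
assert "my_classification" in model.custom_heads

# add_custom_head() instantiates the registered class with (model, head_name, **kwargs).
model.add_custom_head("my_classification", "my_head", num_labels=2)
```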
15 changes: 10 additions & 5 deletions src/adapters/models/beit/modeling_beit.py
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BEiT model."""
"""PyTorch BEiT model."""


import math
@@ -35,6 +35,7 @@ def forward(
output_attentions: bool = False,
relative_position_bias: Optional["BeitRelativePositionBias"] = None,
interpolate_pos_encoding: bool = False,
+ resolution: Optional[Tuple[int]] = None,
) -> Union[Tuple[torch.Tensor], Tuple[torch.Tensor, torch.Tensor]]:
mixed_query_layer = self.query(hidden_states)

@@ -51,9 +52,11 @@

# Add relative position bias if present.
if self.relative_position_bias is not None:
+ height, width = resolution
+ window_size = (height // self.config.patch_size, width // self.config.patch_size)
attention_scores = attention_scores + self.relative_position_bias(
-     interpolate_pos_encoding, attention_scores.shape[2]
- ).unsqueeze(0)
+     window_size, interpolate_pos_encoding, dim_size=hidden_states.shape[1]
+ )

# Add shared relative position bias if provided.
if relative_position_bias is not None:
@@ -89,15 +92,17 @@ def forward(
hidden_states: torch.Tensor,
head_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
- relative_position_bias: Optional[BeitRelativePositionBias] = None,
+ relative_position_bias: Optional["BeitRelativePositionBias"] = None,
interpolate_pos_encoding: bool = False,
+ resolution: Optional[Tuple[int]] = None,
) -> Union[Tuple[torch.Tensor], Tuple[torch.Tensor, torch.Tensor]]:
self_attention_outputs = self.attention(
self.layernorm_before(hidden_states), # in BEiT, layernorm is applied before self-attention
head_mask,
output_attentions=output_attentions,
relative_position_bias=relative_position_bias,
interpolate_pos_encoding=interpolate_pos_encoding,
+ resolution=resolution,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
Expand Down Expand Up @@ -125,4 +130,4 @@ def forward(

outputs = (layer_output,) + outputs

return outputs
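The re-copied BEiT attention derives the relative position bias window from the actual input resolution rather than from the attention score shape; a standalone sketch of that arithmetic, with illustrative numbers, is given below.

```python
# Illustrative values: a 224x224 image with BEiT's commonly used 16x16 patches.
resolution = (224, 224)
patch_size = 16

height, width = resolution
window_size = (height // patch_size, width // patch_size)

print(window_size)                       # (14, 14)
print(window_size[0] * window_size[1])   # 196 patch tokens per image
```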