Add ReFT (LoReFT, NoReFT, DiReFT) #705
Conversation
@calpt Thanks for the PR! I took a quick look, and it looks promising. Here are two minor questions:
Does this mean if I want untied weights among prefix and suffix, I will create two adapters and set …
Thanks!
@frankaging Thanks for looking over this! Re your questions:

Re 1: In the current implementation, one or two modules per layer will be created depending on whether the `tied_weights` option is set (see adapters/src/adapters/methods/reft.py, lines 50 to 62 in 25f1d1c).
This makes it very easy for a user to tie or not tie weights when adding a single Reft adapter, e.g.:

```python
from adapters import AutoAdapterModel, ReftConfig

model = AutoAdapterModel.from_pretrained("...")
config = ReftConfig(
    layers="all", prefix_positions=1, suffix_positions=1, r=1,
    tied_weights=True,  # set to True or False to share weights
)
model.add_adapter("my_reft", config=config)
model.set_active_adapters("my_reft")
```

Re 2: Currently, the Reft implementation always assumes interventions are added to the residual stream, as you explained, since this is the method proposed in the paper. This is done via a PyTorch hook here: adapters/src/adapters/methods/reft.py, lines 162 to 169 in 25f1d1c.
While no other intervention points are added for now, we can easily extend this with similar hooks for other intervention points where it makes sense to do so. Thanks again for looking over this! Please let us know if you have any suggestions or ideas about what we should add or change for the first version!
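To make the residual-stream hook described above concrete, here is a minimal sketch. It is not the code in `reft.py`: the class and helper names are invented, and for brevity it edits all positions rather than only the prefix/suffix tokens ReFT actually selects.

```python
import torch
from torch import nn

class TinyReftIntervention(nn.Module):
    """LoReFT-style edit h + R^T (W h + b - R h) applied to the residual stream."""

    def __init__(self, hidden_size: int, r: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, r, bias=False)  # projection R onto the edited subspace
        self.source = nn.Linear(hidden_size, r)             # learned source W h + b

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Project the difference back into the full hidden space and add it to h.
        return hidden + (self.source(hidden) - self.proj(hidden)) @ self.proj.weight

def attach_to_layer(layer: nn.Module, intervention: nn.Module):
    """Register a forward hook that rewrites the layer's hidden states in its output."""

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = intervention(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)
```

Usage would look like `attach_to_layer(model.roberta.encoder.layer[0], TinyReftIntervention(768, 8))` for a roberta-base-sized model (the attribute path is an assumption about the model class).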
@calpt Thanks for your responses! It makes sense to me. Will the hook work out of the box for accelerated training (e.g., DeepSpeed, etc.)? Any existing tests for this? Thanks!
@frankaging No extensive tests for training at this point. DeepSpeed support in this library is unfortunately flaky in general and not really a focus at the moment, but using e.g. torch distributed or HF Accelerate should work.
This looks great! Just some small comments and questions.
src/adapters/model_mixin.py (outdated)

```
@@ -968,6 +972,17 @@ def forward_context(self, context: ForwardContext, *args, **kwargs):
    if hasattr(self.base_model, "prefix_tuning"):
        context.prefix_states = self.base_model.prefix_tuning(*args, **kwargs)

    # TODO this does not support padding on the left
```
Do we want to leave this TODO open?
Added, please check if this is correct.
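For context on what left-padding support involves: with left padding, the first real token of each sequence no longer sits at index 0, so prefix/suffix positions have to be derived from the attention mask. A hypothetical helper (not the code added in this PR) could look like this:

```python
import torch

def reft_positions(attention_mask: torch.Tensor, prefix: int, suffix: int):
    """Return (batch, prefix) and (batch, suffix) index tensors for left-padded input.

    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding,
    where all padding sits on the left.
    """
    batch_size, seq_len = attention_mask.shape
    device = attention_mask.device
    # Index of the first real token per sequence (argmax returns the first maximal entry).
    first_token = attention_mask.argmax(dim=-1)
    prefix_idx = first_token.unsqueeze(-1) + torch.arange(prefix, device=device)
    # With left padding, the last `suffix` positions are always real tokens.
    suffix_idx = torch.arange(seq_len - suffix, seq_len, device=device).expand(batch_size, suffix)
    return prefix_idx, suffix_idx
```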
Commented on some minor things; everything else looks good & correctly implemented to me. Once the open comments are resolved & left padding is implemented, this is ready to merge.
Looks good to me. This is ready to merge.
Thanks! I've added some quick training results on GLUE tasks in the description; based on those, the implementation looks good. @frankaging re distributed training: I've verified that it works with torch distributed & HF Accelerate via the Trainer class, e.g. for GLUE.
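A minimal sketch of such a Trainer-based run, for orientation only: `AdapterTrainer` and `LoReftConfig` come from this library/PR, but the script name, dataset handling, and hyperparameters below are illustrative assumptions, not the exact setup behind the reported numbers.

```python
# train_reft.py -- illustrative only; launch with e.g.
#   torchrun --nproc_per_node=4 train_reft.py
# or
#   accelerate launch train_reft.py
# and the Trainer picks up the distributed environment automatically.
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments
from adapters import AdapterTrainer, AutoAdapterModel, LoReftConfig

model = AutoAdapterModel.from_pretrained("roberta-base")
model.add_classification_head("rte", num_labels=2)
model.add_adapter("rte", config=LoReftConfig())  # assumes the default preset works out of the box
model.train_adapter("rte")  # freeze the base model, train only the ReFT parameters

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
dataset = load_dataset("glue", "rte").map(
    lambda batch: tokenizer(
        batch["sentence1"], batch["sentence2"],
        truncation=True, padding="max_length", max_length=128,
    ),
    batched=True,
).rename_column("label", "labels")

trainer = AdapterTrainer(
    model=model,
    args=TrainingArguments(output_dir="reft_rte", per_device_train_batch_size=32, num_train_epochs=10),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```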
@calpt Thanks! It's great to see this approach works for different kinds of parallel training! I was re-looking into that orthogonal matrix initialization question (i.e., I was referencing the PEFT repo ticket and asking whether we should remove a redundant init), and I found that in some cases, removing that init step might cause unstable results. Have you looked into this again by doing some tests on your side? Thanks.
Interesting, I haven't tested this specifically. From looking at the code, it makes sense that the orthogonal init is redundant; would you still suggest we re-add it?
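For readers following the init discussion: the redundancy argument assumes the low-rank projection is wrapped in PyTorch's orthogonal parametrization, which re-orthogonalizes the weight on every access, so the effective projection is orthonormal regardless of how the raw weight was initialized. A quick illustration under that assumption (dimensions are arbitrary):

```python
import torch
from torch import nn
from torch.nn.utils.parametrizations import orthogonal

# Wrap a low-rank projection in the orthogonal parametrization; no explicit
# orthogonal init is applied to the underlying weight.
proj = orthogonal(nn.Linear(768, 8, bias=False))

R = proj.weight  # the effective weight, re-materialized through the parametrization
print(torch.allclose(R @ R.T, torch.eye(8), atol=1e-5))  # True: rows are orthonormal
```

If instability shows up only when the explicit init is removed, the difference would come from where the parametrization starts, which is worth a targeted test.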
This PR integrates multiple ReFT variants as new adapter methods.

Paper: https://arxiv.org/pdf/2404.03592
Original code: https://github.com/stanfordnlp/pyreft

## Changes

- Add ReFT module implementation via `ReftLayer`, integrated into all models supported by Adapters. Integration via `init_reft()` method & PyTorch hook.
- Add new `ReftConfig` as base config class with three default instances: `LoReftConfig`, `NoReftConfig` and `DiReftConfig`.
- Method documentation can be found here: https://github.com/adapter-hub/adapters/blob/6c19ea06c143621a735226e477bf772068e55be3/docs/methods.md#reft

## Compatibility

Tested that Pyreft & Adapters produce the same outputs on inference by converting Pyreft checkpoints to Adapters checkpoints (tested settings: LoReft, NoReft, DiReft, weight tying, prefix, suffix, rank, mostly using roberta-base). Script for testing & checkpoint conversion here: https://github.com/calpt/pyreft/blob/main/compatibility.py.

## Evaluation

Roberta-base with LoReFT on GLUE, using hyperparameters similar to the paper:

| Task | Score |
| --- | --- |
| CoLA (Matthews Corr.) | 53.95 |
| MNLI (Acc.) | 83.23 |
| MRPC (F1) | 91.70 |
| QNLI (Acc.) | 90.94 |
| QQP (Acc.) | 86.82 |
| RTE (Acc.) | 76.53 |
| SST-2 (Acc.) | 93.81 |
| STS-B (Spearmanr) | 88.99 |

## Todos

- [x] Modeling implementations
- [x] Add test methods
- [x] Make all checks pass
- [x] Add documentation
- [x] Make sure implementation produces same outputs as original code
- [x] Sanity check training runs
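For reference, a minimal usage sketch of the three default config classes listed under Changes (the checkpoint and adapter names are arbitrary, and it is assumed each config can be instantiated with its defaults):

```python
from adapters import AutoAdapterModel, LoReftConfig, NoReftConfig, DiReftConfig

model = AutoAdapterModel.from_pretrained("roberta-base")

# Each default instance is a preset ReftConfig for one ReFT variant from the paper.
model.add_adapter("loreft_adapter", config=LoReftConfig())
model.add_adapter("noreft_adapter", config=NoReftConfig())
model.add_adapter("direft_adapter", config=DiReftConfig())

# Activate and train one of them; the base model weights stay frozen.
model.train_adapter("loreft_adapter")
model.set_active_adapters("loreft_adapter")
```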