
Add phi3 #1597

Merged · 61 commits · May 17, 2024

Conversation

@abuelnasr0 (Contributor) commented Apr 25, 2024

This PR adds the Phi-3 model.
The PR is not ready for merge yet. I still need to:

  • Add a conversion script and numeric check
  • Add Phi3SuScaledRotaryEmbedding for microsoft/Phi-3-mini-128k-instruct

EDIT:
It's ready for review now.

@abuelnasr0 (Contributor, Author)

Phi3Backbone is now ready to be reviewed and merged!

I have checked numerics for both models, phi3_mini_4k_instruct_en and phi3_mini_128k_instruct_en, and the check produces a good mean difference between the two outputs for float32 and float16 (i.e. 5.2247e-08), but for bfloat16 it produces a lower-quality result (i.e. -0.0004). I think this is normal, isn't it?
Here is a notebook with the script run on the two models: https://www.kaggle.com/code/mohamedabuelnasr/phi3-keras-conversion

@tirthasheshpatel (Contributor) commented Apr 29, 2024

Thanks for the work on this @abuelnasr0, this is awesome!

> I have checked numerics for both models, phi3_mini_4k_instruct_en and phi3_mini_128k_instruct_en, and the check produces a good mean difference between the two outputs for float32 and float16 (i.e. 5.2247e-08), but for bfloat16 it produces a lower-quality result (i.e. -0.0004). I think this is normal, isn't it?

Yes, that's pretty good. Are these absolute tolerance values or relative? Either way, it's below the machine precision of 32-bit and 16-bit floating-point values, so they definitely match!

@abuelnasr0 (Contributor, Author)

@tirthasheshpatel

> Are these absolute tolerance values or relative?

It was supposed to calculate the absolute difference, but I took another look at the function and found that I was calculating the mean of the difference, not of the absolute difference. I have corrected it and run the conversion script again. The results became worse but are still acceptable for float32 (3.0725e-06); for bfloat16 they are bad (0.0254 for the 128k model and 0.0469 for the 4k model), and for float16 (0.0046).
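
For reference, here is a minimal sketch of the corrected check, with random placeholder arrays standing in for the real Keras and huggingface logits:

import numpy as np

# Placeholder arrays standing in for model outputs; the real conversion
# script compares actual logits produced from the same input ids.
rng = np.random.default_rng(0)
keras_logits = rng.normal(size=(1, 8, 32064))
hf_logits = keras_logits + rng.normal(scale=1e-6, size=keras_logits.shape)

# The signed mean lets errors of opposite sign cancel, which is why the
# first numbers looked better; the mean absolute difference cannot cancel.
print("mean diff:    ", np.mean(keras_logits - hf_logits))
print("mean abs diff:", np.mean(np.abs(keras_logits - hf_logits)))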

@mattdangerw (Member) left a comment

Thanks! I still need to go through this more, but I left some initial comments...

Also, see #1605 (comment): we have to do a large rebase on this. Sorry about that!

keras_nlp/models/phi3/phi3_attention.py (outdated, resolved)
keras_nlp/models/phi3/phi3_attention.py (outdated, resolved)
length that the model was trained with. Defaults to `4096`.
rope_max_wavelength (int, optional): The maximum angular wavelength of
the sine/cosine curves, for rotary embeddings. Defaults to `10000`.
rope_scaling_type (str, optional): The type of the rope scaling. Can be
@mattdangerw (Member):

What does "su" stand for? Just the original author of the RoPE paper? We try to avoid short/inscrutable names like this, but I'm not sure there's a good alternative.

Is this called "su" scaling anywhere outside of huggingface? Also, will we need more options here than just two?

@abuelnasr0 (Contributor, Author):

Honestly, I have no idea what "su" stands for; it was just implemented that way in the official model repo on huggingface: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/8a362e755d2faf8cec2bf98850ce2216023d178a/modeling_phi3.py#L142
I tried to search for the source of this naming before your question, but I didn't find anything. It could stand for the paper author, but the implementation is different from what is proposed in the paper.

> Is this called "su" scaling anywhere outside of huggingface?

I didn't see this term anywhere else.

> Also, will we need more options here than just two?

Maybe we will also need 'yarn', if they publish the larger models and they use yarn: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/8a362e755d2faf8cec2bf98850ce2216023d178a/modeling_phi3.py#L183
yarn is introduced here: https://arxiv.org/pdf/2309.00071

@abuelnasr0 (Contributor, Author):

I took another look at the phi-3 paper, and they actually mention that they used LongRoPE. Maybe I was in a hurry when I searched the first time 😅.
So yes, 'su' stands for the original paper author, but with scaling as introduced in the LongRoPE paper. The layer name in the original implementation is SuScaled, but they made the type names shorter in the config, just su or yarn; likewise, the yarn type is not plain yarn but YarnScaled.
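
For anyone following along, here is a rough numpy sketch of what the su/LongRoPE-style scaling does, based on my reading of the huggingface implementation linked above. The ext_factors default is a placeholder; the real per-dimension short/long factors ship in the model config:

import math
import numpy as np

def su_scaled_rope(position, rotary_dim, max_wavelength=10000.0,
                   pretraining_sequence_length=4096,
                   max_sequence_length=131072,
                   ext_factors=None):
    # Rescale the base inverse frequencies per dimension; the real factors
    # come from the config ("short_factor" / "long_factor").
    if ext_factors is None:
        ext_factors = np.ones(rotary_dim // 2)
    inverse_freq = 1.0 / (
        ext_factors * max_wavelength ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    )
    freq = position * inverse_freq
    # On top of the frequency rescaling, the cos/sin embeddings are
    # multiplied by a factor derived from the context-extension ratio.
    scale = max_sequence_length / pretraining_sequence_length
    scaling_factor = math.sqrt(
        1 + math.log(scale) / math.log(pretraining_sequence_length)
    )
    return np.cos(freq) * scaling_factor, np.sin(freq) * scaling_factor

cos, sin = su_scaled_rope(position=5.0, rotary_dim=96)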

keras_nlp/models/phi3/phi3_decoder.py (outdated, resolved)
keras_nlp/models/phi3/phi3_decoder.py (outdated, resolved)
@abuelnasr0 (Contributor, Author) commented May 7, 2024

The model is ready now.
And the output matches the huggingface output for both models.

  • phi3-mini-4k model:

[screenshot: 4k-phi3-model]

  • phi3-mini-128k model:

[screenshot: 128k-phi3-model]

but there is a problem with the tokenizer, described here:

text = "<|user|>\nHow to win?<|end|>\n<|assistant|>"
# The output after adding special_tokens as user_defined_symbols to the sentence_piece model.
keras_nlp  : [1, 29871, 32010, 13, 5328, 304, 5401, 29973, 32007, 13, 32001]
# Same as keras but without adding '▁' at the beginning. Can be configured in the spm model.
llama_cpp  : [1, 32010, 13, 5328, 304, 5401, 29973, 32007, 13, 32001]
# Removes '\n' (LF token) completely.
# Adds '▁' at the beginning (if the text starts with a non-special token) and after each special token.
hf_fast_tok: [1, 32010, 1128, 304, 5401, 29973, 32007, 32001]
# Removes '\n' (LF token) completely. Adds '▁' at the beginning.
# Same as keras when the text doesn't contain '\n'.
hf_tok     : [1, 29871, 32010, 5328, 304, 5401, 29973, 32007, 32001]

The huggingface output should match the sentencepiece output after adding the special tokens, but huggingface handles special tokens outside the sentencepiece library; that's why the output doesn't match.

LlamaTokenizer and LlamaFastTokenizer in huggingface aren't consistent. But if we have to match huggingface output, we should try to match LlamaFastTokenizer, as it is used in the example on the official model page; that would require a lot of workarounds, though, for example like here #1445

NOTE: The generations match in the screenshots because I used a text that is tokenized the same in keras_nlp and huggingface using LlamaTokenizer.
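
For anyone who wants to reproduce the comparison, here is a rough sketch. It assumes a local tokenizer.model with the special tokens already added as user_defined_symbols (for the keras_nlp-style numbers); the checkpoint name is the 4k one, and BOS handling is omitted on the sentencepiece side:

import sentencepiece as spm
from transformers import AutoTokenizer

text = "<|user|>\nHow to win?<|end|>\n<|assistant|>"

# Raw sentencepiece, which is how the keras_nlp tokenizer behaves once the
# special tokens are registered as user_defined_symbols (no BOS added here).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print("sentencepiece:", sp.encode(text))

# huggingface handles the special tokens outside the sentencepiece library,
# which is where the outputs start to diverge.
fast = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", use_fast=True)
slow = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", use_fast=False)
print("hf_fast_tok:", fast(text)["input_ids"])
print("hf_tok     :", slow(text)["input_ids"])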

@abuelnasr0 abuelnasr0 requested a review from mattdangerw May 8, 2024 20:27
@mattdangerw (Member) left a comment

Very impressive work!! Thank you.

I think this is basically good to go, just left some comments on arg naming and the behavior of the specialized rope layer.

Thanks so much!!

keras_nlp/src/models/phi3/Phi3_preprocessor_test.py (outdated, resolved)
decoder.
max_sequence_length (int, optional): The maximum sequence length
that this model might ever be used with. Defaults to `4096`.
original_max_sequence_length (int, optional): The maximum sequence
@mattdangerw (Member):

What if we call these max_sequence_length and training_sequence_length? original_max_sequence_length is just very clunky as a name.

@abuelnasr0 (Contributor, Author):

I changed it again, from training_sequence_length to pretraining_sequence_length. I think this is clearer. I can revert that commit if you would like to keep training_sequence_length.

"padding_mask": padding_mask,
}

def generate(self, inputs, max_length=None, stop_token_ids="auto"):
@mattdangerw (Member):

Why do we need to override generate here? Maybe we should do some refactoring to avoid this need.

@abuelnasr0 (Contributor, Author):

phi3 stops generation at the <|end|> and <|endoftext|> tokens by default. The generation will be bad if we don't stop at <|end|>.
Refactoring would be good. Maybe we can add a stop_token_ids variable to the CausalLMPreprocessor class to be used when stop_token_ids is "auto".
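
To make the behavior concrete, here is roughly how the "auto" default resolves; a sketch using the preset names from this PR, where token_to_id is the keras_nlp tokenizer lookup:

import keras_nlp

causal_lm = keras_nlp.models.Phi3CausalLM.from_preset("phi3_mini_4k_instruct_en")
tokenizer = causal_lm.preprocessor.tokenizer

# With stop_token_ids="auto" (the default), generation stops at both
# <|end|> and <|endoftext|>; passing the ids explicitly is equivalent.
stop_ids = [
    tokenizer.token_to_id("<|end|>"),
    tokenizer.token_to_id("<|endoftext|>"),
]
output = causal_lm.generate(
    "<|user|>\nHow to win?<|end|>\n<|assistant|>",
    max_length=128,
    stop_token_ids=stop_ids,
)
print(output)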

@mattdangerw (Member):

Got it! Thanks for the explainer. I think what you have now makes sense. Some sort of refactoring so a model can specify default stop tokens without touching the "business logic" of generate sgtm, but that does not need to be in this PR.

@mattdangerw (Member):

Also, it's kinda weird that there are two end token ids. Do you know why?

@abuelnasr0 (Contributor, Author):

<|end|> is the EOT (end-of-turn) token and is used when the model is used for chat, to indicate that <|user|> has ended their turn by writing the prompt or that <|assistant|> has ended its turn by generating text, so the turn passes to the other entity.
<|endoftext|> is just the regular EOS (end-of-sequence) token.

Here is an example of model input:

<|user|>\nQuestion<|end|>\n<|assistant|>

keras_nlp/src/models/phi3/phi3_presets.py (outdated, resolved)
tools/sentencepiece_testing/utils.py (resolved)
keras_nlp/src/models/phi3/phi3_rotary_embedding.py (outdated, resolved)
else:
    self.inverse_freq_long_factor = None

def _compute_cos_sin_embedding(self, inputs, start_index=0, positions=None):
@mattdangerw (Member):

This looks almost exactly the same as what's upstream. Is it possible to do this just by overriding self._get_inverse_freq(rotary_dim)?

If so, that would save a lot of code here.

@abuelnasr0 (Contributor, Author):

I will also need to override call(), because the cos and sine embeddings are also multiplied by a factor:
https://github.com/keras-team/keras-nlp/blob/0dff9f1eeda8dc37559c7b7a99514c1a8d469c17/keras_nlp/src/models/phi3/phi3_rotary_embedding.py#L129-L136

tools/sentencepiece_testing/create_phi3_test_proto.py (outdated, resolved)
@mattdangerw (Member)

Copying the presets over now on Kaggle. I will pull this in today.

@mattdangerw (Member)

Updated links, though I think things are still processing on the Kaggle side.

@mattdangerw mattdangerw added the kokoro:force-run Runs Tests on GPU label May 17, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label May 17, 2024
@mattdangerw mattdangerw merged commit a675aeb into keras-team:master May 17, 2024
10 checks passed