Kaggle notebook: https://www.kaggle.com/code/banddaniel/spam-mail-detection-w-tensorflow-distilbert
I tried to detect spam mail by fine-tuning a DistilBERT-based TensorFlow model.
- I applied several preprocessing operations (cleaning, dropping stop words).
- I used a tf.data pipeline for efficient training.
- I used a maximum sequence length of only 20 (BERT models support input lengths up to 512).
- Only 18000 samples were used for training (12000 samples for validation and 20000 samples for testing).
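
The preprocessing and fixed-length steps listed above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the stop-word set, the regular expressions, and the helper names `clean_text` and `truncate_or_pad` are assumptions made for this example, and plain whitespace tokens stand in for the DistilBERT tokenizer's subword tokens.

```python
import re

# Hypothetical, deliberately small stop-word set (assumption for this sketch;
# the notebook's real stop-word list may differ).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

MAX_LEN = 20  # the maximum sequence length mentioned in the list above


def clean_text(text: str) -> str:
    """Lowercase, strip URLs and non-letter characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


def truncate_or_pad(tokens, max_len=MAX_LEN, pad_token="[PAD]"):
    """Cut a token list to max_len, or right-pad it so every row has equal length."""
    tokens = tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))


sample = "WINNER!! You have won a FREE prize. Claim at http://example.com now"
cleaned = clean_text(sample)
fixed = truncate_or_pad(cleaned.split())

print(cleaned)     # cleaned text with stop words removed
print(len(fixed))  # always MAX_LEN
```

With a real tokenizer the same idea applies: every input is truncated or padded to 20 tokens, which keeps batches small and training fast, at the cost of discarding the rest of longer mails.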
![Screenshot 2024-03-14 at 8 45 08 PM](https://private-user-images.githubusercontent.com/50263592/313107207-d169e959-c215-4dd5-a217-e1a78201aedb.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTc4MjEsIm5iZiI6MTczOTIxNzUyMSwicGF0aCI6Ii81MDI2MzU5Mi8zMTMxMDcyMDctZDE2OWU5NTktYzIxNS00ZGQ1LWEyMTctZTFhNzgyMDFhZWRiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDE5NTg0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZmMTVmZTE2ZmMxZmJjYTY5ZDlmM2IzZGQ1OTJjM2NhNzU2NmI0OTEzMzc4ZmRlYTU0YmQxNzgzZGM2MWM5MTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.pZE_k3cyyMij1LomL5G8faGFuVzJ5AHUsoCEPmc4wRc)