-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add a notebook for data validation and preprocessing (#2)
add a notebook for data validation and preprocessing
- Loading branch information
Showing
1 changed file
with
153 additions
and
0 deletions.
There are no files selected for viewing
153 changes: 153 additions & 0 deletions
153
notebooks/validate_and_preprocess_data_by_cohere-finetune.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a9c4f20e-378a-4ed7-af1c-16ee594c9cab", | ||
"metadata": {}, | ||
"source": [ | ||
"# Validate and preprocess data by cohere-finetune" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "39829323-ea35-4018-9cd4-1a29b39b1282", | ||
"metadata": {}, | ||
"source": [ | ||
"This notebook shows you how to validate and preprocess your training and evaluation data by the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository.\n", | ||
"\n", | ||
"Throughout, we use the notation `<some_content_you_must_change>` to denote some content that you must change according to your own use case, e.g., paths to files or directories, etc. Meanwhile, for any content that is not between the angle brackets, you must use it as it is, unless otherwise stated." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fb6d7abd-6d43-4a41-a82d-b72753722739", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1. Setup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d92bb4bb-9d18-48f4-9b66-5a59c1402bd8", | ||
"metadata": {}, | ||
"source": [ | ||
"Get access to the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository by the commands below:\n", | ||
"```\n", | ||
"git clone [email protected]:cohere-ai/cohere-finetune.git\n", | ||
"cd cohere-finetune \n", | ||
"```\n", | ||
"\n", | ||
"Install the necessary Python packages by the command below:\n", | ||
"```\n", | ||
"pip install pandas==2.2.3 liquidpy==0.8.2 transformers==4.44.2\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9ddadfa8-4f40-4865-bf04-f8c17cca1a41", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2. Import packages" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "0b65d40d-9656-4f1d-b99c-e2296f087f76", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import sys\n", | ||
"sys.path = [\"<some_path>/cohere-finetune/src/cohere_finetune\"] + sys.path\n", | ||
"\n", | ||
"from consts import CHAT_PROMPT_TEMPLATE_CMD_R, CHAT_PROMPT_TEMPLATE_CMD_R_08_2024\n", | ||
"from data import CohereDataset\n", | ||
"from preprocess import preprocess\n", | ||
"from tokenizer_utils import create_and_prepare_tokenizer" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1b11b847-b058-44e2-a6a0-8bd2a85d3e87", | ||
"metadata": {}, | ||
"source": [ | ||
"## 3. Set variables and parameters" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e0ea9ce0-1517-4e87-9ac2-604622291c40", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# The following two paths are your raw training and evaluation datasets, which must follow the requirements at \n", | ||
"# https://github.com/cohere-ai/cohere-finetune?tab=readme-ov-file#step-3-prepare-the-training-and-evaluation-data\n", | ||
"input_train_dir = \"<root_dir>/input/data/training\"\n", | ||
"input_eval_dir = \"<root_dir>/input/data/evaluation\"\n", | ||
"\n", | ||
"# The following two paths are the final validated & preprocessed training and evaluation datasets, which are ready to be fed to the model for fine-tuning\n", | ||
"finetune_train_path = \"<root_dir>/finetune/data/train.csv\"\n", | ||
"finetune_eval_path = \"<root_dir>/finetune/data/eval.csv\"\n", | ||
"\n", | ||
"eval_percentage = 0.2 # The percentage of data split from training data for evaluation (ignored if evaluation data are provided)\n", | ||
"hf_model_name_or_path = \"CohereForAI/c4ai-command-r-v01\" # Change it to \"CohereForAI/c4ai-command-r-08-2024\" if you are fine-tuning Command R 08-2024\n", | ||
"prompt_template = CHAT_PROMPT_TEMPLATE_CMD_R # Change it to CHAT_PROMPT_TEMPLATE_CMD_R_08_2024 if you are fine-tuning Command R 08-2024" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "701cfaa8-d271-471c-b8da-7475e0a165ca", | ||
"metadata": {}, | ||
"source": [ | ||
"## 4. Validate and preprocess your training and evaluation data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2f1164e1-620e-4914-af98-083f6854de9f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"cohere_dataset = CohereDataset(train_dir=input_train_dir, eval_dir=input_eval_dir)\n", | ||
"cohere_dataset.convert_to_chat_jsonl()\n", | ||
"\n", | ||
"tokenizer = create_and_prepare_tokenizer(hf_model_name_or_path)\n", | ||
"\n", | ||
"preprocess(\n", | ||
" input_train_path=cohere_dataset.train_path,\n", | ||
" input_eval_path=cohere_dataset.eval_path,\n", | ||
" output_train_path=finetune_train_path,\n", | ||
" output_eval_path=finetune_eval_path,\n", | ||
" eval_percentage=eval_percentage,\n", | ||
" template=prompt_template,\n", | ||
" max_sequence_length=16384,\n", | ||
" tokenizer=tokenizer,\n", | ||
")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "finetune_validate_and_preprocess", | ||
"language": "python", | ||
"name": "finetune_validate_and_preprocess" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.14" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |