Skip to content

Commit

Permalink
Add a notebook for data validation and preprocessing (#2)
Browse files Browse the repository at this point in the history
add a notebook for data validation and preprocessing
  • Loading branch information
youran-qi authored Nov 14, 2024
1 parent 41b2161 commit 0d7b9a9
Showing 1 changed file with 153 additions and 0 deletions.
153 changes: 153 additions & 0 deletions notebooks/validate_and_preprocess_data_by_cohere-finetune.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a9c4f20e-378a-4ed7-af1c-16ee594c9cab",
"metadata": {},
"source": [
"# Validate and preprocess data by cohere-finetune"
]
},
{
"cell_type": "markdown",
"id": "39829323-ea35-4018-9cd4-1a29b39b1282",
"metadata": {},
"source": [
"This notebook shows you how to validate and preprocess your training and evaluation data by the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository.\n",
"\n",
"Throughout, we use the notation `<some_content_you_must_change>` to denote some content that you must change according to your own use case, e.g., paths to files or directories, etc. Meanwhile, for any content that is not between the angle brackets, you must use it as it is, unless otherwise stated."
]
},
{
"cell_type": "markdown",
"id": "fb6d7abd-6d43-4a41-a82d-b72753722739",
"metadata": {},
"source": [
"## 1. Setup"
]
},
{
"cell_type": "markdown",
"id": "d92bb4bb-9d18-48f4-9b66-5a59c1402bd8",
"metadata": {},
"source": [
"Get access to the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository by the commands below:\n",
"```\n",
"git clone [email protected]:cohere-ai/cohere-finetune.git\n",
"cd cohere-finetune \n",
"```\n",
"\n",
"Install the necessary Python packages by the command below:\n",
"```\n",
"pip install pandas==2.2.3 liquidpy==0.8.2 transformers==4.44.2\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "9ddadfa8-4f40-4865-bf04-f8c17cca1a41",
"metadata": {},
"source": [
"## 2. Import packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b65d40d-9656-4f1d-b99c-e2296f087f76",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path = [\"<some_path>/cohere-finetune/src/cohere_finetune\"] + sys.path\n",
"\n",
"from consts import CHAT_PROMPT_TEMPLATE_CMD_R, CHAT_PROMPT_TEMPLATE_CMD_R_08_2024\n",
"from data import CohereDataset\n",
"from preprocess import preprocess\n",
"from tokenizer_utils import create_and_prepare_tokenizer"
]
},
{
"cell_type": "markdown",
"id": "1b11b847-b058-44e2-a6a0-8bd2a85d3e87",
"metadata": {},
"source": [
"## 3. Set variables and parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0ea9ce0-1517-4e87-9ac2-604622291c40",
"metadata": {},
"outputs": [],
"source": [
"# The following two paths are your raw training and evaluation datasets, which must follow the requirements at \n",
"# https://github.com/cohere-ai/cohere-finetune?tab=readme-ov-file#step-3-prepare-the-training-and-evaluation-data\n",
"input_train_dir = \"<root_dir>/input/data/training\"\n",
"input_eval_dir = \"<root_dir>/input/data/evaluation\"\n",
"\n",
"# The following two paths are the final validated & preprocessed training and evaluation datasets, which are ready to be fed to the model for fine-tuning\n",
"finetune_train_path = \"<root_dir>/finetune/data/train.csv\"\n",
"finetune_eval_path = \"<root_dir>/finetune/data/eval.csv\"\n",
"\n",
"eval_percentage = 0.2 # The percentage of data split from training data for evaluation (ignored if evaluation data are provided)\n",
"hf_model_name_or_path = \"CohereForAI/c4ai-command-r-v01\" # Change it to \"CohereForAI/c4ai-command-r-08-2024\" if you are fine-tuning Command R 08-2024\n",
"prompt_template = CHAT_PROMPT_TEMPLATE_CMD_R # Change it to CHAT_PROMPT_TEMPLATE_CMD_R_08_2024 if you are fine-tuning Command R 08-2024"
]
},
{
"cell_type": "markdown",
"id": "701cfaa8-d271-471c-b8da-7475e0a165ca",
"metadata": {},
"source": [
"## 4. Validate and preprocess your training and evaluation data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f1164e1-620e-4914-af98-083f6854de9f",
"metadata": {},
"outputs": [],
"source": [
"cohere_dataset = CohereDataset(train_dir=input_train_dir, eval_dir=input_eval_dir)\n",
"cohere_dataset.convert_to_chat_jsonl()\n",
"\n",
"tokenizer = create_and_prepare_tokenizer(hf_model_name_or_path)\n",
"\n",
"preprocess(\n",
" input_train_path=cohere_dataset.train_path,\n",
" input_eval_path=cohere_dataset.eval_path,\n",
" output_train_path=finetune_train_path,\n",
" output_eval_path=finetune_eval_path,\n",
" eval_percentage=eval_percentage,\n",
" template=prompt_template,\n",
" max_sequence_length=16384,\n",
" tokenizer=tokenizer,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "finetune_validate_and_preprocess",
"language": "python",
"name": "finetune_validate_and_preprocess"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 0d7b9a9

Please sign in to comment.