Add a notebook for data validation and preprocessing (#2)

add a notebook for data validation and preprocessing
cohere-ai · Nov 14, 2024 · 0d7b9a9 · 0d7b9a9
1 parent 41b2161
commit 0d7b9a9
Showing 1 changed file with 153 additions and 0 deletions.
diff --git a/notebooks/validate_and_preprocess_data_by_cohere-finetune.ipynb b/notebooks/validate_and_preprocess_data_by_cohere-finetune.ipynb
@@ -0,0 +1,153 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a9c4f20e-378a-4ed7-af1c-16ee594c9cab",
+   "metadata": {},
+   "source": [
+    "# Validate and preprocess data by cohere-finetune"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "39829323-ea35-4018-9cd4-1a29b39b1282",
+   "metadata": {},
+   "source": [
+    "This notebook shows you how to validate and preprocess your training and evaluation data by the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository.\n",
+    "\n",
+    "Throughout, we use the notation `<some_content_you_must_change>` to denote some content that you must change according to your own use case, e.g., paths to files or directories, etc. Meanwhile, for any content that is not between the angle brackets, you must use it as it is, unless otherwise stated."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb6d7abd-6d43-4a41-a82d-b72753722739",
+   "metadata": {},
+   "source": [
+    "## 1. Setup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d92bb4bb-9d18-48f4-9b66-5a59c1402bd8",
+   "metadata": {},
+   "source": [
+    "Get access to the [cohere-finetune](https://github.com/cohere-ai/cohere-finetune) repository by the commands below:\n",
+    "```\n",
+    "git clone [email protected]:cohere-ai/cohere-finetune.git\n",
+    "cd cohere-finetune    \n",
+    "```\n",
+    "\n",
+    "Install the necessary Python packages by the command below:\n",
+    "```\n",
+    "pip install pandas==2.2.3 liquidpy==0.8.2 transformers==4.44.2\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ddadfa8-4f40-4865-bf04-f8c17cca1a41",
+   "metadata": {},
+   "source": [
+    "## 2. Import packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b65d40d-9656-4f1d-b99c-e2296f087f76",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path = [\"<some_path>/cohere-finetune/src/cohere_finetune\"] + sys.path\n",
+    "\n",
+    "from consts import CHAT_PROMPT_TEMPLATE_CMD_R, CHAT_PROMPT_TEMPLATE_CMD_R_08_2024\n",
+    "from data import CohereDataset\n",
+    "from preprocess import preprocess\n",
+    "from tokenizer_utils import create_and_prepare_tokenizer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1b11b847-b058-44e2-a6a0-8bd2a85d3e87",
+   "metadata": {},
+   "source": [
+    "## 3. Set variables and parameters"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0ea9ce0-1517-4e87-9ac2-604622291c40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The following two paths are your raw training and evaluation datasets, which must follow the requirements at \n",
+    "# https://github.com/cohere-ai/cohere-finetune?tab=readme-ov-file#step-3-prepare-the-training-and-evaluation-data\n",
+    "input_train_dir = \"<root_dir>/input/data/training\"\n",
+    "input_eval_dir = \"<root_dir>/input/data/evaluation\"\n",
+    "\n",
+    "# The following two paths are the final validated & preprocessed training and evaluation datasets, which are ready to be fed to the model for fine-tuning\n",
+    "finetune_train_path = \"<root_dir>/finetune/data/train.csv\"\n",
+    "finetune_eval_path = \"<root_dir>/finetune/data/eval.csv\"\n",
+    "\n",
+    "eval_percentage = 0.2  # The percentage of data split from training data for evaluation (ignored if evaluation data are provided)\n",
+    "hf_model_name_or_path = \"CohereForAI/c4ai-command-r-v01\"  # Change it to \"CohereForAI/c4ai-command-r-08-2024\" if you are fine-tuning Command R 08-2024\n",
+    "prompt_template = CHAT_PROMPT_TEMPLATE_CMD_R  # Change it to CHAT_PROMPT_TEMPLATE_CMD_R_08_2024 if you are fine-tuning Command R 08-2024"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "701cfaa8-d271-471c-b8da-7475e0a165ca",
+   "metadata": {},
+   "source": [
+    "## 4. Validate and preprocess your training and evaluation data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2f1164e1-620e-4914-af98-083f6854de9f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cohere_dataset = CohereDataset(train_dir=input_train_dir, eval_dir=input_eval_dir)\n",
+    "cohere_dataset.convert_to_chat_jsonl()\n",
+    "\n",
+    "tokenizer = create_and_prepare_tokenizer(hf_model_name_or_path)\n",
+    "\n",
+    "preprocess(\n",
+    "    input_train_path=cohere_dataset.train_path,\n",
+    "    input_eval_path=cohere_dataset.eval_path,\n",
+    "    output_train_path=finetune_train_path,\n",
+    "    output_eval_path=finetune_eval_path,\n",
+    "    eval_percentage=eval_percentage,\n",
+    "    template=prompt_template,\n",
+    "    max_sequence_length=16384,\n",
+    "    tokenizer=tokenizer,\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "finetune_validate_and_preprocess",
+   "language": "python",
+   "name": "finetune_validate_and_preprocess"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}