From 13aa945938079e265aa28947e9509a5484d03a2d Mon Sep 17 00:00:00 2001
From: aditya0by0 <aditya0by0@gmail.com>
Date: Tue, 27 Aug 2024 11:24:05 +0200
Subject: [PATCH] add info related to protein dataset

---
 data_exploration.ipynb | 418 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 418 insertions(+)

diff --git a/data_exploration.ipynb b/data_exploration.ipynb
index e36fc1fe..b0c9e78f 100644
--- a/data_exploration.ipynb
+++ b/data_exploration.ipynb
@@ -856,6 +856,424 @@
     "\r\n",
     "These different encodings provide various ways to represent the structure and properties of benzene, each suited to different computational tasks such as molecule identification, database searches, and pattern recognition in cheminformatics.d by different computational tools."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93e328cf-09f9-4694-b175-28320590937d",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92e059c6-36a4-482d-bd0b-a8bd9b10ccde",
+   "metadata": {},
+   "source": [
+    "# Information for Protein Dataset\r\n",
+    "\r\n",
+    "The protein dataset follows the same file structure, class inheritance hierarchy, and methods as described for the ChEBI dataset.\r\n",
+    "\r\n",
+    "### Configuration Parameters\r\n",
+    "\r\n",
+    "Data classes related to proteins can be configured using the following main parameters:\r\n",
+    "\r\n",
+    "- **`go_branch (str)`**: The Gene Ontology (GO) branch. The default value is `\"all\"`, which includes all branches of GO in the dataset.\r\n",
+    "\r\n",
+    "- **`dynamic_data_split_seed (int, optional)`**: The seed for random data splitting, ensuring reproducibility. The default is `42`.\r\n",
+    "\r\n",
+    "- **`splits_file_path (str, optional)`**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. The default is `None`.\r\n",
+    "\r\n",
+    "- **`kwargs`**: Additional keyword arguments passed to `XYBaseDataModule`.\r\n",
+    "\r\n",
+    "### Available GOUniProt Data Classes\r\n",
+    "\r\n",
+    "#### `GOUniProtOver250`\r\n",
+    "\r\n",
+    "A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 250 for selecting classes.\r\n",
+    "\r\n",
+    "- **Inheritance**: Inherits from `_GOUniProtOverX`.\r\n",
+    "\r\n",
+    "#### `GOUniProtOver50`\r\n",
+    "\r\n",
+    "A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 50 for selecting classes.\r\n",
+    "\r\n",
+    "- **Inheritance**: Inherits from `_GOUniProtOverX`.\r\n",
+    "\r\n",
+    "### Instantiation Example\r\n",
+    "\r\n",
+    "```python\r\n",
+    "from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250\r\n",
+    "go_class = GOUniProtOver250()\r\n```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2ffca830-bc0b-421c-8054-0860c95c10f2",
+   "metadata": {},
+   "source": [
+    "## GOUniProt Data File Structure\r\n",
+    "\r\n",
+    "1. **`Raw Data Files`**: (e.g., `.obo` file and `.dat` file)\r\n",
+    "   - **Description**: These files contain the raw GO ontology and Swiss UniProt data, which are downloaded directly from their respective websites. They serve as the foundation for data processing. Since there are no versions associated with this dataset, common raw files are used for all subsets of the data.\r\n",
+    "   - **File Paths**:\r\n",
+    "     - `data/GO_UniProt/raw/${filename}.obo`\r\n",
+    "     - `data/GO_UniProt/raw/${filename}.dat`\r\n",
+    "\r\n",
+    "2. **`data.pkl`**\r\n",
+    "   - **Description**: This file is generated by the `prepare_data` method and contains the processed data in a dataframe format. It includes protein IDs (`swiss_id`), accessions, GO IDs, the data representation (amino acid sequences), and class columns with boolean values.\r\n",
+    "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`\r\n",
+    "\r\n",
+    "3. **`data.pt`**\r\n",
+    "   - **Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, making it ready for model input.\r\n",
+    "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`\r\n",
+    "\r\n",
+    "4. **`classes.txt`**\r\n",
+    "   - **Description**: This file lists the selected GO or UniProt classes based on a specified threshold. It ensures that only the relevant classes are included in the dataset for analysis.\r\n",
+    "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/classes.txt`\r\n",
+    "\r\n",
+    "5. **`splits.csv`**\r\n",
+    "   - **Description**: This file contains saved data splits from previous runs. During subsequent runs, it is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.\r\n",
+    "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/splits.csv`\r\n",
+    "\r\n",
+    "**Note**: If `go_branch` is specified, the `dataset_name` will include the branch name in the format `${dataset_name}_${go_branch}`. ",
+    "Otherwise, it will just be `${dataset_name}`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "61bc261e-2328-4968-aca6-14c48bb24348",
+   "metadata": {},
+   "source": [
+    "## data.pkl"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 123,
+   "id": "31df4ee7-4c03-4ea2-9798-5e5082a74c2b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Size of the data (rows x columns):  (27459, 1050)\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>swiss_id</th>\n",
+       "      <th>accession</th>\n",
+       "      <th>go_ids</th>\n",
+       "      <th>sequence</th>\n",
+       "      <th>41</th>\n",
+       "      <th>75</th>\n",
+       "      <th>122</th>\n",
+       "      <th>165</th>\n",
+       "      <th>209</th>\n",
+       "      <th>226</th>\n",
+       "      <th>...</th>\n",
+       "      <th>2000145</th>\n",
+       "      <th>2000146</th>\n",
+       "      <th>2000147</th>\n",
+       "      <th>2000241</th>\n",
+       "      <th>2000243</th>\n",
+       "      <th>2000377</th>\n",
+       "      <th>2001020</th>\n",
+       "      <th>2001141</th>\n",
+       "      <th>2001233</th>\n",
+       "      <th>2001234</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>14331_ARATH</td>\n",
+       "      <td>P42643,Q945M2,Q9M0S7</td>\n",
+       "      <td>[19222]</td>\n",
+       "      <td>MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>14331_CAEEL</td>\n",
+       "      <td>P41932,Q21537</td>\n",
+       "      <td>[132, 1708, 5634, 5737, 5938, 6611, 7346, 8340...</td>\n",
+       "      <td>MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>14331_MAIZE</td>\n",
+       "      <td>P49106</td>\n",
+       "      <td>[3677, 5634, 10468, 44877]</td>\n",
+       "      <td>MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>14332_MAIZE</td>\n",
+       "      <td>Q01526</td>\n",
+       "      <td>[3677, 5634, 10468, 44877]</td>\n",
+       "      <td>MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14</th>\n",
+       "      <td>14333_ARATH</td>\n",
+       "      <td>P42644,F4KBI7,Q945L2</td>\n",
+       "      <td>[5634, 5737, 6995, 9409, 9631, 16036, 19222, 5...</td>\n",
+       "      <td>MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>...</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>5 rows × 1050 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "       swiss_id             accession  \\\n",
+       "8   14331_ARATH  P42643,Q945M2,Q9M0S7   \n",
+       "9   14331_CAEEL         P41932,Q21537   \n",
+       "10  14331_MAIZE                P49106   \n",
+       "13  14332_MAIZE                Q01526   \n",
+       "14  14333_ARATH  P42644,F4KBI7,Q945L2   \n",
+       "\n",
+       "                                               go_ids  \\\n",
+       "8                                             [19222]   \n",
+       "9   [132, 1708, 5634, 5737, 5938, 6611, 7346, 8340...   \n",
+       "10                         [3677, 5634, 10468, 44877]   \n",
+       "13                         [3677, 5634, 10468, 44877]   \n",
+       "14  [5634, 5737, 6995, 9409, 9631, 16036, 19222, 5...   \n",
+       "\n",
+       "                                             sequence     41     75    122  \\\n",
+       "8   MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT...  False  False  False   \n",
+       "9   MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL...  False  False  False   \n",
+       "10  MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...  False  False  False   \n",
+       "13  MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...  False  False  False   \n",
+       "14  MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL...  False  False  False   \n",
+       "\n",
+       "      165    209    226  ...  2000145  2000146  2000147  2000241  2000243  \\\n",
+       "8   False  False  False  ...    False    False    False    False    False   \n",
+       "9   False  False  False  ...    False    False    False    False    False   \n",
+       "10  False  False  False  ...    False    False    False    False    False   \n",
+       "13  False  False  False  ...    False    False    False    False    False   \n",
+       "14  False  False  False  ...    False    False    False    False    False   \n",
+       "\n",
+       "    2000377  2001020  2001141  2001233  2001234  \n",
+       "8     False    False    False    False    False  \n",
+       "9     False    False    False    False    False  \n",
+       "10    False    False    False    False    False  \n",
+       "13    False    False    False    False    False  \n",
+       "14    False    False    False    False    False  \n",
+       "\n",
+       "[5 rows x 1050 columns]"
+      ]
+     },
+     "execution_count": 123,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pkl_df = pd.DataFrame(pd.read_pickle(r\"data/GO_UniProt/GO250_BP/processed/data.pkl\"))\n",
+    "print(\"Size of the data (rows x columns): \", pkl_df.shape)\n",
+    "pkl_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be0078fd-bcf1-4d4c-b8c6-c84e3aeac99c",
+   "metadata": {},
+   "source": [
+    "## data.pt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 127,
+   "id": "a70f9c35-daca-4728-a9ea-b1212866f421",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Type of loaded data: <class 'list'>\n",
+      "{'features': [10, 14, 15, 23, 13, 14, 11, 11, 14, 16, 20, 27, 25, 28, 22, 10, 14, 21, 17, 14, 27, 18, 14, 27, 16, 22, 27, 27, 10, 28, 27, 25, 10, 27, 21, 28, 14, 21, 14, 28, 20, 21, 20, 27, 17, 15, 28, 27, 27, 16, 19, 17, 17, 11, 28, 14, 22, 21, 19, 28, 12, 13, 14, 16, 16, 14, 11, 26, 16, 12, 12, 11, 11, 12, 27, 18, 21, 27, 27, 11, 16, 13, 19, 20, 20, 29, 28, 11, 17, 12, 16, 20, 22, 16, 11, 21, 12, 27, 15, 27, 17, 11, 20, 12, 24, 20, 13, 12, 17, 21, 17, 17, 20, 15, 12, 17, 28, 23, 14, 14, 14, 11, 13, 20, 11, 21, 28, 25, 22, 17, 21, 10, 21, 13, 20, 22, 29, 16, 22, 17, 14, 27, 25, 21, 11, 13, 18, 27, 16, 21, 20, 14, 14, 27, 29, 15, 17, 15, 14, 22, 21, 14, 14, 18, 20, 12, 14, 19, 11, 27, 17, 14, 23, 15, 29, 23, 12, 16, 17, 13, 17, 14, 17, 19, 25, 11, 28, 25, 22, 22, 27, 12, 17, 19, 11, 23, 20, 16, 14, 24, 19, 17, 14, 21, 18, 14, 25, 20, 27, 14, 12, 14, 27, 17, 20, 15, 17, 13, 27, 27, 11, 22, 21, 20, 11, 15, 17, 12, 10, 18, 17, 17, 16, 20, 19, 17, 15, 17, 26, 15, 11, 20, 10, 18, 20, 20, 28, 14, 20, 20, 12, 21, 27, 14, 14, 23, 14, 14, 14, 21, 23, 14, 20, 27, 18, 18, 11], 'labels': array([False, False, False, ..., False, False, False]), 'ident': '14331_ARATH', 'group': None}\n"
+     ]
+    }
+   ],
+   "source": [
+    "data_pt = torch.load(r\"data/GO_UniProt/GO250_BP/processed/protein_token/data.pt\")\n",
+    "print(\"Type of loaded data:\", type(data_pt))\n",
+    "for i in range(1):\n",
+    "    print(data_pt[i])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "380049c1-2963-4223-b698-a7b59b9fe595",
+   "metadata": {},
+   "source": [
+    "## Protein Representation Using Amino Acid Sequence Notation\n",
+    "\n",
+    "Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.\n",
+    "\n",
+    "### Example Protein Sequence\n",
+    "\n",
+    "Protein: **Lysozyme C** from **Gallus gallus** (Chicken).  \n",
+    "[Lysozyme C - UniProtKB P00698](https://www.uniprot.org/uniprotkb/P00698/entry#function)\n",
+    "\n",
+    "- **Sequence**: `MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL`\n",
+    "- **Sequence Length**: 147\n",
+    "\n",
+    "In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.\n",
+    "\n",
+    "### The 20 Amino Acids and Their One-Letter Notations\n",
+    "\n",
+    "Here is a list of the 20 standard amino acids, along with their one-letter notations and descriptions:\n",
+    "\n",
+    "| One-Letter Notation | Amino Acid Name      | Description                                             |\n",
+    "|---------------------|----------------------|---------------------------------------------------------|\n",
+    "| **A**               | Alanine              | Non-polar, aliphatic amino acid.                        |\n",
+    "| **C**               | Cysteine             | Polar, contains a thiol group, forms disulfide bonds.   |\n",
+    "| **D**               | Aspartic Acid        | Acidic, negatively charged at physiological pH.         |\n",
+    "| **E**               | Glutamic Acid        | Acidic, negatively charged at physiological pH.         |\n",
+    "| **F**               | Phenylalanine        | Aromatic, non-polar.                                    |\n",
+    "| **G**               | Glycine              | Smallest amino acid, non-polar.                         |\n",
+    "| **H**               | Histidine            | Basic, polar, can be positively charged; common in enzyme active sites. |\n",
+    "| **I**               | Isoleucine           | Non-polar, aliphatic.                                   |\n",
+    "| **K**               | Lysine               | Basic, positively charged at physiological pH.          |\n",
+    "| **L**               | Leucine              | Non-polar, aliphatic.                                   |\n",
+    "| **M**               | Methionine           | Non-polar, contains sulfur; encoded by the start codon (AUG). |\n",
+    "| **N**               | Asparagine           | Polar, uncharged.                                       |\n",
+    "| **P**               | Proline              | Non-polar, introduces kinks in protein chains.          |\n",
+    "| **Q**               | Glutamine            | Polar, uncharged.                                       |\n",
+    "| **R**               | Arginine             | Basic, positively charged, involved in binding phosphate groups. |\n",
+    "| **S**               | Serine               | Polar, can be phosphorylated.                           |\n",
+    "| **T**               | Threonine            | Polar, can be phosphorylated.                           |\n",
+    "| **V**               | Valine               | Non-polar, aliphatic.                                   |\n",
+    "| **W**               | Tryptophan           | Aromatic, non-polar, largest amino acid.                |\n",
+    "| **Y**               | Tyrosine             | Aromatic, polar, can be phosphorylated.                 |\n",
+    "\n",
+    "### Understanding Protein Sequences\n",
+    "\n",
+    "In the Lysozyme C sequence shown above, each letter represents one of the amino acids listed in the table. The sequence reflects the specific order of amino acids in the protein, which is critical for its structure and function.\n",
+    "\n",
+    "This notation is used extensively in various bioinformatics tools and databases to study protein structure, function, and interactions.\n",
+    "\n",
+    "\n",
+    "_Note_: For more on amino acid sequences, see https://en.wikipedia.org/wiki/Protein_primary_structure"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "702359d6-5338-4391-b196-2328ba5676a1",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
   }
  ],
  "metadata": {