From 13aa945938079e265aa28947e9509a5484d03a2d Mon Sep 17 00:00:00 2001 From: aditya0by0 Date: Tue, 27 Aug 2024 11:24:05 +0200 Subject: [PATCH] add info related to protein dataset --- data_exploration.ipynb | 418 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 418 insertions(+) diff --git a/data_exploration.ipynb b/data_exploration.ipynb index e36fc1fe..b0c9e78f 100644 --- a/data_exploration.ipynb +++ b/data_exploration.ipynb @@ -856,6 +856,424 @@ "\r\n", "These different encodings provide various ways to represent the structure and properties of benzene, each suited to different computational tasks such as molecule identification, database searches, and pattern recognition in cheminformatics.d by different computational tools." ] + }, + { + "cell_type": "markdown", + "id": "93e328cf-09f9-4694-b175-28320590937d", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "92e059c6-36a4-482d-bd0b-a8bd9b10ccde", + "metadata": {}, + "source": [ + "# Information for Protein Dataset\r\n", + "\r\n", + "The protein dataset follows the same file structure, class inheritance hierarchy, and methods as described for the ChEBI dataset.\r\n", + "\r\n", + "### Configuration Parameters\r\n", + "\r\n", + "Data classes related to proteins can be configured using the following main parameters:\r\n", + "\r\n", + "- **`go_branch (str)`**: The Gene Ontology (GO) branch. The default value is `\"all\"`, which includes all branches of GO in the dataset.\r\n", + "\r\n", + "- **`dynamic_data_split_seed (int, optional)`**: The seed for random data splitting, ensuring reproducibility. The default is `42`.\r\n", + "\r\n", + "- **`splits_file_path (str, optional)`**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. 
The default is `None`.\r\n", + "\r\n", + "- **`kwargs`**: Additional keyword arguments passed to `XYBaseDataModule`.\r\n", + "\r\n", + "### Available GOUniProt Data Classes\r\n", + "\r\n", + "#### `GOUniProtOver250`\r\n", + "\r\n", + "A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 250 for selecting classes.\r\n", + "\r\n", + "- **Inheritance**: Inherits from `_GOUniProtOverX`.\r\n", + "\r\n", + "#### `GOUniProtOver50`\r\n", + "\r\n", + "A class for extracting data from the Gene Ontology and Swiss UniProt dataset with a threshold of 50 for selecting classes.\r\n", + "\r\n", + "- **Inheritance**: Inherits from `_GOUniProtOverX`.\r\n", + "\r\n", + "### Instantiation Example\r\n", + "\r\n", + "```python\r\n", + "from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250\r\n", + "go_class = GOUniProtOver250()\r\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "2ffca830-bc0b-421c-8054-0860c95c10f2", + "metadata": {}, + "source": [ + "## GOUniProt Data File Structure\r\n", + "\r\n", + "1. **`Raw Data Files`**: (e.g., `.obo` file and `.dat` file)\r\n", + " - **Description**: These files contain the raw GO ontology and Swiss UniProt data, which are downloaded directly from their respective websites. They serve as the foundation for data processing. Since there are no versions associated with this dataset, common raw files are used for all subsets of the data.\r\n", + " - **File Paths**:\r\n", + " - `data/GO_UniProt/raw/${filename}.obo`\r\n", + " - `data/GO_UniProt/raw/${filename}.dat`\r\n", + "\r\n", + "2. **`data.pkl`**\r\n", + " - **Description**: This file is generated by the `prepare_data` method and contains the processed data in a dataframe format. It includes protein IDs, data representations (amino acid sequences), and class columns with boolean values.\r\n", + " - **File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`\r\n", + "\r\n", + "3. 
**`data.pt`**\r\n", + " - **Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, making it ready for model input.\r\n", + " - **File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`\r\n", + "\r\n", + "4. **`classes.txt`**\r\n", + " - **Description**: This file lists the selected GO or UniProt classes based on a specified threshold. It ensures that only the relevant classes are included in the dataset for analysis.\r\n", + " - **File Path**: `data/GO_UniProt/${dataset_name}/processed/classes.txt`\r\n", + "\r\n", + "5. **`splits.csv`**\r\n", + " - **Description**: This file contains saved data splits from previous runs. During subsequent runs, it is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`.\r\n", + " - **File Path**: `data/GO_UniProt/${dataset_name}/processed/splits.csv`\r\n", + "\r\n", + "**Note**: If `go_branch` is specified, the `dataset_name` will include the branch name in the format `${dataset_name}_${go_branch}`. Otherwise, it will just be `${dataset_name}`.\r\n" + ] + }, + { + "cell_type": "markdown", + "id": "61bc261e-2328-4968-aca6-14c48bb24348", + "metadata": {}, + "source": [ + "## data.pkl" + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "31df4ee7-4c03-4ea2-9798-5e5082a74c2b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Size of the data (rows x columns): (27459, 1050)\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>swiss_id</th><th>accession</th><th>go_ids</th><th>sequence</th><th>41</th><th>75</th><th>122</th><th>165</th><th>209</th><th>226</th><th>...</th><th>2000145</th><th>2000146</th><th>2000147</th><th>2000241</th><th>2000243</th><th>2000377</th><th>2001020</th><th>2001141</th><th>2001233</th><th>2001234</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>8</th><td>14331_ARATH</td><td>P42643,Q945M2,Q9M0S7</td><td>[19222]</td><td>MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
+ "    <tr><th>9</th><td>14331_CAEEL</td><td>P41932,Q21537</td><td>[132, 1708, 5634, 5737, 5938, 6611, 7346, 8340...</td><td>MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
+ "    <tr><th>10</th><td>14331_MAIZE</td><td>P49106</td><td>[3677, 5634, 10468, 44877]</td><td>MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
+ "    <tr><th>13</th><td>14332_MAIZE</td><td>Q01526</td><td>[3677, 5634, 10468, 44877]</td><td>MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
+ "    <tr><th>14</th><td>14333_ARATH</td><td>P42644,F4KBI7,Q945L2</td><td>[5634, 5737, 6995, 9409, 9631, 16036, 19222, 5...</td><td>MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>...</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td><td>False</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>5 rows × 1050 columns</p>\n",
+ "</div>
" + ], + "text/plain": [ + " swiss_id accession \\\n", + "8 14331_ARATH P42643,Q945M2,Q9M0S7 \n", + "9 14331_CAEEL P41932,Q21537 \n", + "10 14331_MAIZE P49106 \n", + "13 14332_MAIZE Q01526 \n", + "14 14333_ARATH P42644,F4KBI7,Q945L2 \n", + "\n", + " go_ids \\\n", + "8 [19222] \n", + "9 [132, 1708, 5634, 5737, 5938, 6611, 7346, 8340... \n", + "10 [3677, 5634, 10468, 44877] \n", + "13 [3677, 5634, 10468, 44877] \n", + "14 [5634, 5737, 6995, 9409, 9631, 16036, 19222, 5... \n", + "\n", + " sequence 41 75 122 \\\n", + "8 MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT... False False False \n", + "9 MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL... False False False \n", + "10 MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE... False False False \n", + "13 MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVE... False False False \n", + "14 MSTREENVYMAKLAEQAERYEEMVEFMEKVAKTVDVEELSVEERNL... False False False \n", + "\n", + " 165 209 226 ... 2000145 2000146 2000147 2000241 2000243 \\\n", + "8 False False False ... False False False False False \n", + "9 False False False ... False False False False False \n", + "10 False False False ... False False False False False \n", + "13 False False False ... False False False False False \n", + "14 False False False ... 
False False False False False \n", + "\n", + " 2000377 2001020 2001141 2001233 2001234 \n", + "8 False False False False False \n", + "9 False False False False False \n", + "10 False False False False False \n", + "13 False False False False False \n", + "14 False False False False False \n", + "\n", + "[5 rows x 1050 columns]" + ] + }, + "execution_count": 123, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pkl_df = pd.DataFrame(pd.read_pickle(r\"data/GO_UniProt/GO250_BP/processed/data.pkl\"))\n", + "print(\"Size of the data (rows x columns): \", pkl_df.shape)\n", + "pkl_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "be0078fd-bcf1-4d4c-b8c6-c84e3aeac99c", + "metadata": {}, + "source": [ + "## data.pt" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "a70f9c35-daca-4728-a9ea-b1212866f421", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Type of loaded data: \n", + "{'features': [10, 14, 15, 23, 13, 14, 11, 11, 14, 16, 20, 27, 25, 28, 22, 10, 14, 21, 17, 14, 27, 18, 14, 27, 16, 22, 27, 27, 10, 28, 27, 25, 10, 27, 21, 28, 14, 21, 14, 28, 20, 21, 20, 27, 17, 15, 28, 27, 27, 16, 19, 17, 17, 11, 28, 14, 22, 21, 19, 28, 12, 13, 14, 16, 16, 14, 11, 26, 16, 12, 12, 11, 11, 12, 27, 18, 21, 27, 27, 11, 16, 13, 19, 20, 20, 29, 28, 11, 17, 12, 16, 20, 22, 16, 11, 21, 12, 27, 15, 27, 17, 11, 20, 12, 24, 20, 13, 12, 17, 21, 17, 17, 20, 15, 12, 17, 28, 23, 14, 14, 14, 11, 13, 20, 11, 21, 28, 25, 22, 17, 21, 10, 21, 13, 20, 22, 29, 16, 22, 17, 14, 27, 25, 21, 11, 13, 18, 27, 16, 21, 20, 14, 14, 27, 29, 15, 17, 15, 14, 22, 21, 14, 14, 18, 20, 12, 14, 19, 11, 27, 17, 14, 23, 15, 29, 23, 12, 16, 17, 13, 17, 14, 17, 19, 25, 11, 28, 25, 22, 22, 27, 12, 17, 19, 11, 23, 20, 16, 14, 24, 19, 17, 14, 21, 18, 14, 25, 20, 27, 14, 12, 14, 27, 17, 20, 15, 17, 13, 27, 27, 11, 22, 21, 20, 11, 15, 17, 12, 10, 18, 17, 17, 16, 20, 19, 17, 15, 17, 26, 15, 11, 20, 10, 18, 20, 20, 28, 
14, 20, 20, 12, 21, 27, 14, 14, 23, 14, 14, 14, 21, 23, 14, 20, 27, 18, 18, 11], 'labels': array([False, False, False, ..., False, False, False]), 'ident': '14331_ARATH', 'group': None}\n" + ] + } + ], + "source": [ + "data_pt = torch.load(r\"data/GO_UniProt/GO250_BP/processed/protein_token/data.pt\")\n", + "print(\"Type of loaded data:\", type(data_pt))\n", + "for i in range(1):\n", + " print(data_pt[i])" + ] + }, + { + "cell_type": "markdown", + "id": "380049c1-2963-4223-b698-a7b59b9fe595", + "metadata": {}, + "source": [ + "## Protein Representation Using Amino Acid Sequence Notation\n", + "\n", + "Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.\n", + "\n", + "### Example Protein Sequence\n", + "\n", + "Protein: **Lysozyme C** from **Gallus gallus** (Chicken). \n", + "[Lysozyme C - UniProtKB P00698](https://www.uniprot.org/uniprotkb/P00698/entry#function)\n", + "\n", + "- **Sequence**: `MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL`\n", + "- **Sequence Length**: 147\n", + "\n", + "In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.\n", + "\n", + "### The 20 Amino Acids and Their One-Letter Notations\n", + "\n", + "Here is a list of the 20 standard amino acids, along with their one-letter notations and descriptions:\n", + "\n", + "| One-Letter Notation | Amino Acid Name | Description |\n", + "|---------------------|----------------------|---------------------------------------------------------|\n", + "| **A** | Alanine | Non-polar, aliphatic amino acid. |\n", + "| **C** | Cysteine | Polar, contains a thiol group, forms disulfide bonds. 
|\n", + "| **D** | Aspartic Acid | Acidic, negatively charged at physiological pH. |\n", + "| **E** | Glutamic Acid | Acidic, negatively charged at physiological pH. |\n", + "| **F** | Phenylalanine | Aromatic, non-polar. |\n", + "| **G** | Glycine | Smallest amino acid, non-polar. |\n", + "| **H** | Histidine | Polar, weakly basic (partially protonated near physiological pH), common in enzyme active sites. |\n", + "| **I** | Isoleucine | Non-polar, aliphatic. |\n", + "| **K** | Lysine | Basic, positively charged at physiological pH. |\n", + "| **L** | Leucine | Non-polar, aliphatic. |\n", + "| **M** | Methionine | Non-polar, contains sulfur, encoded by the start codon AUG. |\n", + "| **N** | Asparagine | Polar, uncharged. |\n", + "| **P** | Proline | Non-polar, introduces kinks in protein chains. |\n", + "| **Q** | Glutamine | Polar, uncharged. |\n", + "| **R** | Arginine | Basic, positively charged, involved in binding phosphate groups. |\n", + "| **S** | Serine | Polar, can be phosphorylated. |\n", + "| **T** | Threonine | Polar, can be phosphorylated. |\n", + "| **V** | Valine | Non-polar, aliphatic. |\n", + "| **W** | Tryptophan | Aromatic, non-polar, largest amino acid. |\n", + "| **Y** | Tyrosine | Aromatic, polar, can be phosphorylated. |\n", + "\n", + "### Understanding Protein Sequences\n", + "\n", + "In a sequence such as `MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQGQL`, each letter represents one of the amino acids listed above: it begins with methionine (M), lysine (K), and threonine (T). The sequence gives the specific order of amino acids in the protein, which is critical for its structure and function.\n", + "\n", + "This notation is used extensively in bioinformatics tools and databases to study protein structure, function, and interactions.\n", + "\n", + "\n", + "_Note_: For more on amino acid sequences, see https://en.wikipedia.org/wiki/Protein_primary_structure" + ] + }, + { + "cell_type": "markdown", + "id": "702359d6-5338-4391-b196-2328ba5676a1", + "metadata": {}, + "source": [ + "---" + ] } ], "metadata": {