Feature/v1.1.0 #6

Merged
merged 17 commits into from
Jul 2, 2020
1 change: 1 addition & 0 deletions .gitattributes
@@ -1 +1,2 @@
*.json filter=lfs diff=lfs merge=lfs -text
*.tsv filter=lfs diff=lfs merge=lfs -text
73 changes: 56 additions & 17 deletions README.md
@@ -48,18 +48,33 @@ A [setup.py](./setup.py) file is provided in order to simplify the installation
```bash
pip list | grep mtdnn
```
> For Mixed Precision and Distributed Training, please install NVIDIA apex by following instructions [here](https://github.com/NVIDIA/apex#linux)
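
If you want to confirm the apex install before enabling mixed precision, here is a minimal optional check (a sketch; it only assumes the standard `apex.amp` import described in the NVIDIA instructions above):

```Python
# Optional sanity check: make sure NVIDIA apex is importable before
# enabling mixed precision (fp16) or distributed training.
try:
    from apex import amp  # noqa: F401
    print("NVIDIA apex found; mixed precision can be enabled.")
except ImportError:
    print("NVIDIA apex not found; install it before enabling fp16.")
```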

## Run an example
An example Jupyter [notebook](./examples/classification/tc_mnli.ipynb) is provided to show a runnable example using the MNLI dataset. The notebook reads and loads the MNLI data provided for your convenience [here](./sample_data). This dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators.

> **NOTE:** The MNLI data is very large and would need [Git LFS](https://docs.github.com/en/github/managing-large-files/installing-git-large-file-storage) installed on your machine to pull it down.

## How To Use
1. Create a model configuration object, `MTDNNConfig`, with the necessary parameters to initialize the MT-DNN model. Initialization without any parameters defaults to a configuration similar to the one that initializes a BERT model. This configuration object can also be initialized with training and learning parameters like `batch_size` and `learning_rate`. Please consult the class implementation for all parameters.

```Python
BATCH_SIZE = 16
MULTI_GPU_ON = True
MAX_SEQ_LEN = 128
NUM_EPOCHS = 5
config = MTDNNConfig(batch_size=BATCH_SIZE,
max_seq_len=MAX_SEQ_LEN,
multi_gpu_on=MULTI_GPU_ON)
```
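
The same pattern extends to the learning parameters mentioned above; as a sketch (assuming the `learning_rate` keyword and an illustrative value; consult the class implementation for the exact names and defaults):

```Python
# Sketch: adding a learning parameter to the configuration.
# 5e-5 is an illustrative value only.
config = MTDNNConfig(batch_size=BATCH_SIZE,
                     max_seq_len=MAX_SEQ_LEN,
                     multi_gpu_on=MULTI_GPU_ON,
                     learning_rate=5e-5)
```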

1. Define the task parameters to train on and initialize an `MTDNNTaskDefs` object. The definition can cover a single task or multiple tasks to train. `MTDNNTaskDefs` can take a Python dict, YAML, or JSON file with the task definition(s); a file-based sketch follows the dict example below.

```Python
DATA_DIR = "../../sample_data/"
DATA_SOURCE_DIR = os.path.join(DATA_DIR, "MNLI")
tasks_params = {
"mnli": {
"data_format": "PremiseAndOneHypothesis",
        # @@ -73,27 +88,52 @@ (intervening task parameters collapsed in this diff view)
"n_class": 3,
"split_names": [
"train",
"matched_dev",
"mismatched_dev",
"matched_test",
"mismatched_test",
"dev_matched",
"dev_mismatched",
"test_matched",
"test_mismatched",
],
"data_source_dir": DATA_SOURCE_DIR,
"data_process_opts": {"header": True, "is_train": True, "multi_snli": False,},
"task_type": "Classification",
},
}

# Define the tasks
task_defs = MTDNNTaskDefs(tasks_params)
```
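
Since `MTDNNTaskDefs` also accepts a YAML or JSON file, the same definition can be kept on disk. A sketch, assuming the file path is passed in place of the dict (the hypothetical `tasks.yml` mirrors `tasks_params` above):

```Python
# Sketch: the same task definition loaded from a file instead of a dict.
# Assumes a YAML/JSON path can be passed where the dict was passed above;
# "tasks.yml" is a hypothetical file mirroring tasks_params.
task_defs = MTDNNTaskDefs("tasks.yml")
```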

1. Create a data tokenizing object, `MTDNNTokenizer`. Based on the model's initial checkpoint, it wraps the corresponding Hugging Face transformers tokenizer to encode the data into **MT-DNN** format. This becomes the input to the data building stage.

```Python
tokenizer = MTDNNTokenizer(do_lower_case=True)

# Testing out the tokenizer
print(tokenizer.encode("What NLP toolkit do you recommend", "MT-DNN is a fantastic toolkit"))

# ([101, 2054, 17953, 2361, 6994, 23615, 2079, 2017, 16755, 102, 11047, 1011, 1040, 10695, 2003, 1037, 10392, 6994, 23615, 102], None, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```
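
If a different initial checkpoint is used, the tokenizer should match it. As a sketch only (the `model_name` keyword here is an assumption; consult the `MTDNNTokenizer` implementation for the actual argument):

```Python
# Sketch: select the tokenizer for a specific initial checkpoint.
# The `model_name` keyword is assumed, not confirmed from the source.
tokenizer = MTDNNTokenizer(model_name="bert-base-uncased", do_lower_case=True)
```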

1. Create a data building object, `MTDNNDataBuilder`. This class converts the data into the MT-DNN format for each defined task and produces the vectorized data consumed by the next stage.

```Python
## Load and build data
data_builder = MTDNNDataBuilder(tokenizer=tokenizer,
task_defs=task_defs,
data_dir=DATA_SOURCE_DIR,
canonical_data_suffix="canonical_data",
dump_rows=True)

## Build data to MTDNN Format as an iterable of each specific task
vectorized_data = data_builder.vectorize()
```

1. Create a data preprocessing object, `MTDNNDataProcess`. This creates the training, development, and test PyTorch dataloaders needed for training and testing. It also exposes the training options required to initialize the model correctly for all tasks.

```Python
data_processor = MTDNNDataProcess(config=config,
task_defs=task_defs,
vectorized_data=vectorized_data)

# Retrieve the multi task train, dev and test dataloaders
multitask_train_dataloader = data_processor.get_train_dataloader()
# ... (remaining dataloader and training-option retrieval collapsed in this diff view)
```

@@ -131,8 +171,7 @@ A [setup.py](./setup.py) file is provided in order to simplify the installation
1. At this point the MT-DNN model is ready to fit and make predictions. The fit takes an optional `epochs` parameter that overrides the epochs set in the `MTDNNConfig` object.

```Python
model.fit(epochs=NUM_EPOCHS)
```


@@ -141,7 +180,7 @@ Optionally using a previously trained model as checkpoint.

```Python
# Predict using a PyTorch model checkpoint
checkpt = "./model_0.pt"
checkpt = "./checkpoint/model_4.pt"
model.predict(trained_model_chckpt=checkpt)

```
1 change: 1 addition & 0 deletions _config.yml
@@ -0,0 +1 @@
theme: jekyll-theme-cayman
26 changes: 26 additions & 0 deletions ci/component_governance.yml
@@ -0,0 +1,26 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

# Pull request against these branches will trigger this build
pr:
- master

# no CI trigger
trigger: none

jobs:
- job: Component_governance
timeoutInMinutes: 20 # how long to run the job before automatically cancelling
pool:
vmImage: 'ubuntu-16.04'

steps:
- bash: |
python scripts/generate_requirements_txt.py
displayName: 'Generate requirements.txt file from generate_conda_file.py'

- task: ComponentGovernanceComponentDetection@0
inputs:
scanType: 'Register'
verbosity: 'Verbose'
alertWarningLevel: 'High'