Azure Machine Learning - Model Training

In this article, we’ll learn to train a machine learning model in a Notebook within Azure Machine Learning Studio. The article covers the part of the machine learning workflow where we fit the best combination of weights and biases to minimize the loss function. We’ll also look at the features Azure Machine Learning Studio offers to chart the loss metric in real time while the model trains. This article is part of the Azure Machine Learning Series.

  1. Azure Machine Learning - Create Workspace for Machine Learning 
  2. Azure Machine Learning – Create Compute Instance and Compute Cluster 
  3. Azure Machine Learning - Writing Python Script in Notebook 
  4. Azure Machine Learning - Model Training

Microsoft AI

Microsoft AI is a powerful framework that enables organizations, researchers, and non-profits to use AI technologies, offering services and features across domains such as Machine Learning, Robotics, Data Science, IoT, and many more. Learn more about Microsoft AI from this article.

Azure Machine Learning

Azure Machine Learning enriches and consolidates the functionality for model training and deployment that carries over from Machine Learning Studio. It provides tools for machine learning work at all skill levels, offers an open and interoperable framework with support for different languages, and enables robust end-to-end MLOps. It also supports Automated Machine Learning. Read the article Auto ML to learn more about it.

So, where and how do we start if we want to create and deploy a Machine Learning project? Azure Machine Learning provides all the tools through its portal to create the resources and set up the infrastructure needed for any kind of machine learning work.

Pre-requisite

Before starting this tutorial, you first need to create a Machine Learning Workspace in Azure and create a compute instance along with a compute cluster. Follow Azure Machine Learning - Create Workspace for Machine Learning and Azure Machine Learning – Create Compute Instance and Compute Cluster respectively. Next, create the appropriate folders in Notebook within Azure Machine Learning Studio by following the article Azure Machine Learning - Writing Python Script in Notebook.

Once you’ve completed the prerequisites above, you’ll be ready to follow the steps below.

Step 1

Click on Notebook under Author. Your folder structure should look similar to the one shown here. The src folder is inside the learn-ml folder.

Right-click the … next to the src folder and click Create new file.

Step 2

Name this file model.py and click Create.

Defining Convolutional Neural Network

Step 3

Now, in model.py, write the following script. It is taken from the official PyTorch website.

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Now, click the Save button.
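
The 16 * 5 * 5 passed to fc1 follows from the layer sizes. As a rough sketch, assuming 32 x 32 input images (the CIFAR 10 image size) and the kernel sizes used in model.py, the helpers below trace the image width and height through each layer. The names conv_out and flattened_features are illustrative only and are not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size=32, channels=16):
    """Number of values fed into fc1 for a square input image."""
    size = conv_out(image_size, kernel=5)       # conv1: 32 -> 28
    size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
    size = conv_out(size, kernel=5)             # conv2: 14 -> 10
    size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5
    return channels * size * size               # 16 channels of 5 x 5


print(flattened_features())  # 400, i.e. 16 * 5 * 5
```

If the input size or a kernel size changes, the first argument of fc1 and the x.view call in forward would need to change to match.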

Step 4

Similarly, create a new file under the src folder again.

Name this file train.py and click Create.

The file structure will look similar to the one below.

Downloading Training Dataset and Training Model

Step 5

Now, copy the following code into train.py.

import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root="../data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2
)


if __name__ == "__main__":

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
                running_loss = 0.0

    print("Finished Training")

This script downloads the CIFAR 10 dataset through the PyTorch torchvision.datasets API, sets up the network we defined in model.py, and trains it for two epochs using cross-entropy loss and Stochastic Gradient Descent.
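
To make the optimizer step more concrete, here is a minimal plain-Python sketch of the momentum update that torch.optim.SGD documents for its default settings (no dampening, no Nesterov). It is not the PyTorch implementation: sgd_momentum_step is an illustrative helper, the quadratic objective is a made-up toy problem, and the learning rate is raised to 0.1 so the toy converges in a few hundred steps.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad  # decaying sum of past gradients
    param = param - lr * velocity          # move against the accumulated direction
    return param, velocity


# Toy objective (assumption for illustration): f(w) = (w - 3)^2,
# whose gradient is 2 * (w - 3) and whose minimum is at w = 3.
w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * (w - 3), v, lr=0.1)

print(f"w after 200 steps: {w:.4f}")
```

In train.py the same kind of update is applied to every weight and bias of the network each time optimizer.step() is called, using the gradients that loss.backward() has just computed.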

Now, click Save and Run.

Step 6

The Machine Learning Terminal opens as we run the script. Here we can see the CIFAR 10 dataset being downloaded and extracted to the data folder.

Here we can also see the epochs of the training and the loss, printed every 2,000 batches, decreasing as training progresses.

Creating Environment with Package Dependencies

Step 7

Now, create a new file in the learn-ml folder and name it pytorch-env.yml.

Add the following to the file.

name: pytorch-env
channels:
    - defaults
    - pytorch
dependencies:
    - python=3.6.2
    - pytorch
    - torchvision

Creating Control Script

Step 8

Similarly, under the learn-ml folder, create another file.

This will be the control script that runs our training script. Name it run-pytorch.py.

Add the following code to the file.

# run-pytorch.py
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import ScriptRunConfig

# Set compute_target below to your own compute target name ('cpu-cluster' here; the author's is named ojashshrestha11)

if __name__ == "__main__":
    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name='day1-experiment-train')
    config = ScriptRunConfig(source_directory='./src',
                             script='train.py',
                             compute_target='cpu-cluster')

    # set up pytorch environment
    env = Environment.from_conda_specification(
        name='pytorch-env',
        file_path='pytorch-env.yml'
    )
    config.run_config.environment = env

    run = experiment.submit(config)

    aml_url = run.get_portal_url()
    print(aml_url)

Now, when we save and run it, the terminal opens.

Click the link that is printed.

Step 9

Here, we can see that the status of the experiment is Preparing. It will change to Running and then to Completed once the build succeeds.

Now, under Outputs + Logs, check out the files.

Here, under 70_driver_log.txt, we can see the epochs and loss being updated during the training process as the control script runs the training file.

Viewing Real-Time Metrics

Step 10

Let us make some changes to the train.py file.

Add the Run import below the existing model import:

from model import Net
from azureml.core import Run

Set the variable run.

run = Run.get_context()

Add the loss metric log.

run.log('loss', loss)

The final code should look similar to the one below.

import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from model import Net
from azureml.core import Run


# ADDITIONAL CODE: get run from the current context
run = Run.get_context()

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=4,
    shuffle=True,
    num_workers=2
)


if __name__ == "__main__":
    # define convolutional network
    net = Net()
    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    # train the network
    for epoch in range(2):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                # ADDITIONAL CODE: log loss metric to AML
                run.log('loss', loss)
                print(f'epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}')
                running_loss = 0.0
    print('Finished Training')
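
To see how many points this logging pattern produces, the sketch below reproduces the "average every 2,000 batches" logic in plain Python. It assumes the standard CIFAR 10 training set of 50,000 images, which with batch_size=4 gives 12,500 batches per epoch. The helper log_running_loss and the constant per-batch loss of 1.0 are illustrative stand-ins; in train.py the callback would be the call to run.log.

```python
def log_running_loss(batch_losses, log_fn, every=2000):
    """Average per-batch losses over windows of `every` batches and log each average."""
    running = 0.0
    for i, batch_loss in enumerate(batch_losses):
        running += batch_loss
        if i % every == every - 1:
            log_fn(running / every)  # in train.py: run.log('loss', ...)
            running = 0.0


logged = []
batches_per_epoch = 50_000 // 4  # 12,500 batches with batch_size=4
log_running_loss([1.0] * batches_per_epoch, logged.append)

print(len(logged))  # 6 logged values per epoch
```

With two epochs this gives twelve logged values in total. The last 500 batches of each epoch never reach a multiple of 2,000, so they are not included in any logged average.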

Step 11

Open the pytorch-env.yml file and add pip and azureml-sdk to the dependencies.
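
One way the updated file could look is sketched below, assuming azureml-sdk is installed through pip (it is listed under a nested pip section, which is how conda environment files declare pip packages). The rest of the file is unchanged from Step 7.

```yaml
name: pytorch-env
channels:
    - defaults
    - pytorch
dependencies:
    - python=3.6.2
    - pytorch
    - torchvision
    - pip
    - pip:
        - azureml-sdk
```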

Step 12

Now, save and run the run-pytorch.py file.

The terminal will open and, when you press Enter, you’ll be given the experiment link. Click on it.

Step 13

Here, we can see the experiment musing_shark_2k6zn90m is running.

Under Metrics for the experiment, we can see a real-time chart of loss versus iterations. Here, starting from a loss of 2.3000 at the first step, the loss has already dropped to 1.5000 by the 6th iteration.
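
A starting loss of about 2.3 is plausibly what we would expect from an untrained classifier on 10 classes: if the network scores every class equally, the cross-entropy equals ln(10). The sketch below checks this in plain Python. It is not the PyTorch implementation, and cross_entropy is an illustrative helper for a single sample.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


uniform_logits = [0.0] * 10  # every one of the 10 classes scored equally
print(round(cross_entropy(uniform_logits, target=3), 4))  # 2.3026, i.e. ln(10)
```

Any loss below this value suggests the network is doing better than guessing uniformly, which is what the chart shows as training proceeds.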

We can also switch to the table view and obtain these values of loss and steps.

Step 14

Now that the experiment is completed, we can see that by the 10th step the loss has fallen to 1.4000, a significant reduction from the 2.3000 of the first step.

Thus, we have successfully trained our model on the CIFAR 10 dataset and also learned to explore the graphical representation of metrics that Azure Machine Learning Studio offers during the training process.

Deleting Resources

Step 15

To avoid any charges that may be incurred, it is essential that we stop all our resources once our goal is completed.

It is even better to delete the resource group from the Azure Portal, which will completely wipe all the resources that were created under it. Here, we can see that the Machine Learning Workspace, Key Vault, Storage Account, Application Insights, and Container Registry were all created within the ojash-rg resource group. I click on Delete Resource Group, confirm by typing the name of the resource group, and click Delete to remove the resource group and all the resources it contains.

Conclusion

Thus, in this article, we learned to train a model in Notebook within Azure Machine Learning Studio. This article is a follow-up in the Azure Machine Learning Series, which began with creating the Azure Machine Learning Workspace and has now brought us to training machine learning models. We also explored the various features available in Notebook, from creating control scripts to visualizing the model training process in real time through the loss values logged at different steps.