Azure Batch AI Service

What is Artificial Intelligence (AI)?

Artificial Intelligence is the study of how to make computers do things which, at the moment, people do better.

What is Batch AI in Azure?

Batch AI is a managed service that enables data scientists and AI researchers to train AI and other machine learning models on clusters of Azure virtual machines, including VMs with GPU support.



Why Batch AI?

Developing powerful AI algorithms is a compute-intensive and iterative process. Data scientists and AI researchers are working with increasingly large data sets. They are developing models with more layers, and doing so with more experimentation on network design and hyper-parameter tuning. Doing this efficiently requires multiple CPUs or GPUs per model, running experiments in parallel, and having shared storage for training data, logs, and model outputs.

Data scientists and AI researchers are experts in their field, yet managing infrastructure at scale can get in the way. Developing AI at scale requires many infrastructure tasks: provisioning clusters of VMs, installing software and containers, queuing work, prioritizing and scheduling jobs, handling failures, distributing data, sharing results, scaling resources to manage costs, and integrating with tools and workflows. The figure illustrates these workflows.



Batch AI provides resource management and job scheduling specialized for AI training and testing. Key capabilities include:
  • Running long-running batch jobs, iterative experimentation, and interactive training
  • Automatic or manual scaling of VM clusters using GPUs or CPUs
  • Configuring SSH communication between VMs and for remote access
  • Support for any deep learning or machine learning framework, with optimized configuration for popular toolkits such as Microsoft Cognitive Toolkit (CNTK), TensorFlow, and Chainer
  • Priority-based job queue to share clusters and take advantage of low-priority VMs and reserved instances
  • Flexible storage options including Azure Files and a managed NFS server
  • Mounting remote file shares into the VM and optional container
  • Providing job status and restarting in case of VM failures
  • Access to output logs, stdout, stderr, and models, including streaming from Azure Storage
  • Azure command-line interface (CLI), SDKs for Python, C#, and Java, monitoring in the Azure Portal, and integration with Microsoft AI tools
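
To make the priority-based queue and the restart-on-failure behaviour from the list above more concrete, here is a minimal sketch in plain Python. It is not the Batch AI scheduler and does not use the Batch AI SDK. The names JobQueue, submit, run_all, max_retries, and demo are hypothetical and chosen for illustration only, and a VM failure is simulated by raising RuntimeError.

```python
import heapq
import itertools


class JobQueue:
    """Toy queue: higher priority runs first, FIFO among equal priorities.

    Hypothetical illustration only; not part of any Azure SDK.
    """

    def __init__(self, max_retries=1):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submit order
        self.max_retries = max_retries

    def submit(self, name, priority=0, attempts=0):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name, attempts))

    def run_all(self, run_job):
        """Run jobs in priority order; requeue a failed job up to max_retries times."""
        finished = []
        while self._heap:
            neg_priority, _, name, attempts = heapq.heappop(self._heap)
            try:
                run_job(name)
                finished.append((name, "succeeded"))
            except RuntimeError:
                if attempts < self.max_retries:
                    # Restart the job by putting it back at its original priority.
                    self.submit(name, -neg_priority, attempts + 1)
                else:
                    finished.append((name, "failed"))
        return finished


def demo():
    failed_once = set()

    def run_job(name):
        # Simulate a single VM failure for the "tune" job on its first attempt.
        if name == "tune" and name not in failed_once:
            failed_once.add(name)
            raise RuntimeError("simulated VM failure")

    queue = JobQueue(max_retries=1)
    queue.submit("train-small", priority=1)
    queue.submit("tune", priority=5)
    queue.submit("train-large", priority=5)
    return queue.run_all(run_job)


if __name__ == "__main__":
    for name, status in demo():
        print(name, status)
```

In this sketch the restarted job rejoins the queue behind jobs of the same priority that were already waiting, so "train-large" finishes before the retried "tune" job. A real scheduler could equally restart the job at the front of the queue; that is a design choice, not something the service description above specifies.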

The Batch AI SDK supports writing scripts or applications to manage training pipelines and integrate with tools.

The SDK currently provides Python, C#, Java, and REST APIs. Using the Batch AI SDK, you can define and manage clusters and jobs.
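
As a rough illustration of what "defining" a cluster and its jobs can look like in a script, the sketch below models both as plain Python data classes and derives a node count for a scaling decision. It deliberately does not call the Batch AI SDK. ClusterSpec, JobSpec, and target_node_count are hypothetical names, and the sizing rule (run every pending job at once, within the cluster limits) is an assumption made for the example rather than the service's actual autoscale logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of a cluster; not a Batch AI SDK type."""
    name: str
    vm_size: str
    min_nodes: int
    max_nodes: int

    def __post_init__(self):
        if not 0 <= self.min_nodes <= self.max_nodes:
            raise ValueError("expected 0 <= min_nodes <= max_nodes")


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical description of a training job."""
    name: str
    node_count: int
    command: str


def target_node_count(cluster, pending_jobs):
    """Nodes needed to run every pending job at once, clamped to the cluster limits."""
    wanted = sum(job.node_count for job in pending_jobs)
    return max(cluster.min_nodes, min(cluster.max_nodes, wanted))


if __name__ == "__main__":
    # Example values only; the VM size and commands are placeholders.
    cluster = ClusterSpec("demo-cluster", "Standard_NC6", min_nodes=0, max_nodes=4)
    jobs = [
        JobSpec("train-a", node_count=2, command="python train.py"),
        JobSpec("train-b", node_count=3, command="python train.py --lr 0.01"),
    ]
    print(target_node_count(cluster, jobs))
```

Keeping the definitions as immutable data makes them easy to validate, log, and reuse across experiments before they are handed to whichever client library actually creates the resources.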