Getting Started With Azure ML: Chapter 2

Getting Started

Before going any further, I recommend reading the previous part of this series.

Datasets

A dataset is a collection of data from various domains or sectors that has been uploaded to Azure Machine Learning Studio. These sample datasets are free, and anyone can use them in their modelling process.



There are several types of datasets available in the studio, and you can use any of them in your experiments as your requirements dictate. In addition, you can upload your own dataset for your own experiments.

Here are some of the datasets available in the studio:
  1. Book Reviews

    Reviews of books on Amazon, taken from the Amazon.com website by University of Pennsylvania researchers. See the research paper, “Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification.”

    The original dataset has 975K reviews with rankings 1, 2, 3, 4, or 5. The reviews were written in English and are from the time period 1997-2007. This dataset has been down-sampled to 10K reviews.

  2. IMDB Movie Titles

    The dataset contains information about movies that were rated in Twitter tweets: IMDB movie ID, movie name and genre, and production year. There are 17K movies in the dataset. The dataset was introduced in the paper “Movie Tweetings - a Movie Rating Dataset Collected From Twitter.”

  3. Movie Tweets

    The dataset is an extended version of the Movie Tweetings dataset. The dataset has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favourites for this tweet, and number of retweets of this tweet.

  4. Forest fires data

    Contains weather data, such as temperature and humidity indices and wind speed, from an area of northeast Portugal, combined with records of forest fires. This is a difficult regression task, where the aim is to predict the burned area of forest fires.
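
To make the record layout of the Movie Tweets dataset concrete, here is a minimal Python sketch of one instance as a typed tuple. The field names, the field types, and the sample values are my own assumptions, chosen only to mirror the six fields listed in the description; they are not taken from the actual dataset.

```python
from typing import NamedTuple


class MovieTweet(NamedTuple):
    """One rating extracted from a tweet (field names are assumptions)."""
    user_id: int
    imdb_movie_id: str
    rating: int
    timestamp: int      # assumed here to be seconds since the Unix epoch
    favourites: int
    retweets: int


# Hypothetical sample values, for illustration only.
tweet = MovieTweet(
    user_id=1,
    imdb_movie_id="0000001",
    rating=9,
    timestamp=1362000000,
    favourites=0,
    retweets=2,
)

print(tweet.rating, len(tweet))
```

Each row of the dataset would map to one such tuple, which is what makes it easy to feed into a rating-prediction experiment.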

    (For exploring more about datasets visit
Modules

A module can be thought of as an algorithm that you apply to your dataset, which makes it the main tool for training a model on your data. Apart from training, Azure ML Studio provides many other functions that you can perform on your data with a module, such as validation, analysis, and scoring.
Here are some of the most common and handy modules available in the studio:
  1. Linear Regression

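    It creates a linear regression model trained with online gradient descent. Online gradient descent is based on gradient descent itself, whose standard update rule, written with the symbols defined below, is:

    θ_j := θ_j − α · ∂J(θ)/∂θ_j

    Here:

    θ → Parameter Vector
    α → Learning Rate
    j → Index of the parameter (such as the slope) being updated
    J(θ) → Cost Function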

  2. Score Model

    Scores predictions for a trained classification or regression model.

    You can use Score Model to generate predictions using a trained classification or regression model. The predicted value can be in many different formats, depending on the model and your input data.
     
  3. Elementary Statistics

    Calculates specified summary statistics for selected dataset columns.

    You can use the Compute Elementary Statistics module to generate a summary report for your dataset that lists key statistics such as mean, standard deviation, and the range of values for each of the selected columns. This report is useful for analysing the central tendency, dispersion, and shape of data.
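
As a rough idea of what such a summary report contains, here is a minimal Python sketch that computes a few of these statistics for one column using only the standard library. The column values are made up, the `summarize` helper is hypothetical, and the use of the sample standard deviation is my assumption; this is not how the studio module itself is implemented.

```python
import statistics


def summarize(values):
    """Return a few summary statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# Hypothetical column values, for illustration only.
temperature = [18.0, 21.5, 25.0, 22.5, 13.0]

summary = summarize(temperature)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The mean and median describe central tendency, while the standard deviation and range describe dispersion.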

    (There are many other modules. You can explore them all once you are in the studio, specifically in the experiments section.)
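
To show how the training and scoring steps described for Linear Regression and Score Model fit together, here is a minimal Python sketch. It fits a one-feature linear model with online gradient descent (the parameters are updated after every single example) and then scores new inputs with the trained model. The function names, the learning rate, the number of epochs, and the data are all my assumptions for illustration; the studio modules do this work for you and are not implemented this way.

```python
def train_online_gd(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y ~ intercept + slope * x, updating after every single example."""
    intercept, slope = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (intercept + slope * x) - y
            # Gradient of the per-example cost 0.5 * error**2.
            intercept -= learning_rate * error
            slope -= learning_rate * error * x
    return intercept, slope


def score(model, xs):
    """Generate predictions from a trained model, like a scoring step."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Hypothetical noise-free data generated from y = 1 + 2x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

model = train_online_gd(train_x, train_y)
predictions = score(model, [5.0, 6.0])
print([round(p, 3) for p in predictions])
```

Because the data lies exactly on the line y = 1 + 2x, the predictions for 5 and 6 should come out very close to 11 and 13.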

    Now you may be wondering: if so many modules and algorithms are available in the studio, how do you select the one that is best for your dataset?
Parameters: When Selecting an Algorithm

There are some guidelines, or parameters, that you need to take into account when selecting a particular module or algorithm:
  • Accuracy

    Your algorithm doesn’t need to be perfectly accurate. An approximation that comes close is often good enough.

  • Training Time

    Training time is the time it takes to get from your data to a trained model. Always look for the choice that can take you to your goal within a few minutes or, at most, a few hours.

  • Number of Parameters

    This is very important, because some algorithms are designed to handle a large number of parameters while others are designed for fewer. The catch is that both types can do the same work,
    so you need to pick whichever suits your task best.
     
  • Features Available

    This guideline is similar to the number of parameters: some algorithms cope with a large number of features better than others.

  • Linearity

    Many machine learning algorithms make use of linearity. Linear classification algorithms assume that classes can be separated by a straight line. For example:



    The linearity shown above represents how evenly the data is distributed for that particular dataset.
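
To illustrate what "separable by a straight line" means, here is a minimal Python sketch that uses a perceptron, a simple linear classifier that I have swapped in purely for illustration. It is trained on two tiny made-up datasets: one where a straight line can split the classes (logical AND) and one where no straight line can (logical XOR). The function name and the data are assumptions, not something from the studio.

```python
def train_perceptron(points, labels, epochs=100):
    """Try to find a line w0 + w1*x + w2*y = 0 that separates the labels.

    Returns (weights, converged). `converged` is True when a full pass
    over the data makes no mistakes, i.e. a separating line was found.
    """
    w0, w1, w2 = 0.0, 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            predicted = 1 if (w0 + w1 * x + w2 * y) > 0 else -1
            if predicted != label:
                # Nudge the line towards the misclassified point.
                w0 += label
                w1 += label * x
                w2 += label * y
                mistakes += 1
        if mistakes == 0:
            return (w0, w1, w2), True
    return (w0, w1, w2), False


corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # a straight line can separate these
xor_labels = [-1, 1, 1, -1]    # no straight line can separate these

_, and_separable = train_perceptron(corners, and_labels)
_, xor_separable = train_perceptron(corners, xor_labels)
print(and_separable, xor_separable)  # prints: True False
```

When the linearity assumption does not hold, as in the XOR case, a linear algorithm cannot reach full accuracy no matter how long it trains.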

    (I’ll explain all these parameters in detail later.)
Wrapping Up

That’s all from this part. Stay tuned for more.

