Machine Learning Project 2: IRIS dataset


This chapter demonstrates a machine learning application using the IRIS dataset.

IRIS Dataset

The Iris data set is a famous small dataset that is well suited to practicing visualization and analysis techniques.
1. Environment setup
  • Download and install Anaconda Navigator, which bundles a variety of integrated development environments for scientific programming and data analysis and lets you launch them without typing command-line commands.
  • The Spyder IDE provides interactive editing and debugging facilities for data analysis.
  • Check that the necessary libraries are installed, and print their versions:
    import pandas
    print('pandas version is: {}'.format(pandas.__version__))
    import numpy
    print('numpy version is: {}'.format(numpy.__version__))
    import seaborn
    print('seaborn version is: {}'.format(seaborn.__version__))
    import sklearn
    print('sklearn version is: {}'.format(sklearn.__version__))
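The check above stops with an ImportError at the first missing library. As a minimal sketch, assuming Python 3.8 or later, the standard-library `importlib.metadata` module can report on every package in one pass without importing it. The helper name `installed_version` is hypothetical. Note that scikit-learn is published under the distribution name `scikit-learn`, while the module you import is `sklearn`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (scikit-learn provides the sklearn module).
for dist in ("pandas", "numpy", "seaborn", "scikit-learn"):
    found = installed_version(dist)
    print("{}: {}".format(dist, found if found else "not installed"))
```

A missing library shows up as "not installed" in the listing, so all gaps are visible after a single run.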
2. Load and understand the data
  • Pandas is a Python package that provides fast and flexible data analysis for relational or labeled data.
  • Before loading the dataset, store the dataset file in the Spyder working directory.
2.1 Loading the dataset
  # load dataset
  import pandas as pd
  iris = pd.read_csv('Iris.csv')
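The rest of the chapter refers to the columns `sepallength`, `sepalwidth`, `petallength`, `petalwidth`, and `iris`, so `Iris.csv` is assumed to have a header row with those names. The sketch below illustrates that assumed layout using only the standard library. The three sample rows and the label spellings (such as `Iris-setosa`) are placeholders for illustration, not the verified contents of the file, and `load_rows` is a hypothetical helper.

```python
import csv
import io

# Assumed file layout: a header row followed by one flower per line.
SAMPLE_CSV = """sepallength,sepalwidth,petallength,petalwidth,iris
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

FEATURES = ["sepallength", "sepalwidth", "petallength", "petalwidth"]


def load_rows(text):
    """Parse CSV text into dicts, converting the four feature columns to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for name in FEATURES:
            record[name] = float(record[name])
        rows.append(record)
    return rows


rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows,", len(rows[0]), "columns")
```

If your copy of the file uses different header names, the column names used in the later plotting and grouping calls need to change to match.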
2.2 Understanding the dataset
Here we perform a few tasks to understand how the numerical data is categorized.
2.2.1 Preview data
Let’s look at the numerical data for the iris flowers. The dataset contains three species: Iris virginica, Iris versicolor, and Iris setosa. Four features are measured for each flower and stored along with its species. The preview shows the first 15 rows.
  # preview data
  print(iris.head(15))
2.2.2 Description of dataset
Let’s look at a summary of each attribute of the iris instances.
  # description of data
  print(iris.describe())
The four features are summarized by count, mean, standard deviation, min, max, and percentiles.
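To make the summary table less of a black box, here is a minimal standard-library sketch of the same statistics for a single column. The function name `describe` and the sample values are placeholders. Using the sample standard deviation and inclusive (linearly interpolated) quartiles is an assumption about what matches the pandas defaults.

```python
import statistics


def describe(values):
    """Summarize one numeric column with count, mean, std, min, quartiles, max."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": ordered[-1],
    }


sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]  # placeholder values, not the real column
for key, value in describe(sepal_lengths).items():
    print("{:>5}: {:.3f}".format(key, value))
```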
2.2.3 Description of class
Now it is time to view how many instances each class contains.
  # flower distribution
  print(iris.groupby('iris').size())
The dataset contains three classes with 150 instances in total, and every instance is described by the numerical values of its features.
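Grouping by the class column and taking the size of each group is simply a count of how often each label occurs. A minimal sketch of that idea with `collections.Counter`, using a short placeholder list of labels instead of the real column:

```python
from collections import Counter

# Placeholder labels; the real column holds one label per flower.
labels = [
    "Iris-setosa", "Iris-versicolor", "Iris-setosa",
    "Iris-virginica", "Iris-versicolor", "Iris-setosa",
]

class_sizes = Counter(labels)
for species, size in sorted(class_sizes.items()):
    print(species, size)
```

A balanced dataset shows the same count for every class, which matters later because accuracy is easier to interpret when no class dominates.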
2.2.4 Shape of Data
The shape property gives the number of flower instances (rows) and the number of columns.
  # data shape
  print(iris.shape)
The data frame contains 150 samples in 5 columns.
3. Data visualization
  • Visualization techniques give a graphical representation of the iris species and their features. They help reveal correlations between the X and Y variables (independent and dependent variables).
  • We are going to visualize the dataset in two ways: box plots and a pairwise joint plot (scatter plot matrix).
3.1 Boxplot
  • A box plot shows the shape of the data distribution, including its upper and lower quartiles.
  • For each iris species, the box summarizes standard statistics such as the median, the quartiles, and the spread of the values.
  # box plot of each feature, grouped by species
  import matplotlib.pyplot as plt
  import seaborn as sns

  plt.figure(figsize=(15, 10))
  plt.subplot(2, 2, 1)
  sns.boxplot(x='iris', y='sepallength', data=iris)
  plt.subplot(2, 2, 2)
  sns.boxplot(x='iris', y='sepalwidth', data=iris)
  plt.subplot(2, 2, 3)
  sns.boxplot(x='iris', y='petallength', data=iris)
  plt.subplot(2, 2, 4)
  sns.boxplot(x='iris', y='petalwidth', data=iris)
  plt.show()
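The numbers behind a single box can be computed by hand. The sketch below is a minimal illustration, assuming the common convention that whiskers extend to the furthest data points within 1.5 times the interquartile range and that anything beyond is drawn as an outlier. The function name `box_stats` and the sample values are placeholders.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Return the quartiles, whisker ends, and outliers for one box."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - whisker * iqr
    high_limit = q3 + whisker * iqr
    # Whiskers end at the most extreme points that are still within the limits.
    inside = [v for v in ordered if low_limit <= v <= high_limit]
    outliers = [v for v in ordered if v < low_limit or v > high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # placeholder values with one extreme point
print(box_stats(sample))
```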
3.2 Pair plot
The pair plot is used to show the distribution of single variables and the relationship between pairs of variables. It gives a clear view of all the flower classes in a single graph, with each species drawn in a different color in the scatter plots.
  # pairwise joint plot (scatter matrix)
  # 'height' replaces the older 'size' argument in seaborn 0.9 and later
  sns.pairplot(iris, hue='iris', height=3, diag_kind="kde")
  sns.pairplot(iris, hue='iris')
Here, some overlapping data points also appear.
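The relationship that a scatter panel shows visually can also be expressed as a single number. As a minimal sketch, the Pearson correlation coefficient between two features can be computed as follows. The function name `pearson` and the sample values are placeholders, and the function assumes neither input is constant (otherwise the denominator is zero).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder measurements for two features of the same five flowers.
petal_length = [1.4, 1.3, 4.7, 4.5, 6.0]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5]
print("correlation: {:.3f}".format(pearson(petal_length, petal_width)))
```

A value near 1 or -1 corresponds to points lying close to a straight line in the scatter panel, and a value near 0 to a shapeless cloud.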
4. Train and validate data (Machine learning)
Here we separate the dataset into two parts: training data and validation data. We allocate 80% of the data for training and the remaining 20% for validation.
  # dataset splitting
  # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
  from sklearn.model_selection import train_test_split

  array = iris.values
  X = array[:, 0:4].astype(float)   # four numeric feature columns
  Y = array[:, 4]                   # species label
  validation_size = 0.20
  seed = 7
  X_train, X_validation, Y_train, Y_validation = train_test_split(
      X, Y, test_size=validation_size, random_state=seed)
  # k = 10 folds for cross-validation
  num_folds = 10
  num_instances = len(X_train)
  scoring = 'accuracy'
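Conceptually, the split shuffles the row indices with a fixed seed and sets aside a fraction of them for validation. The sketch below illustrates that idea with the standard library only. It does not reproduce scikit-learn's exact shuffling, so the selected rows will differ, and `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n_samples, test_fraction=0.20, seed=7):
    """Shuffle row indices reproducibly and return (train, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed gives the same split every run
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, validation_idx = split_indices(150)
print(len(train_idx), "training rows,", len(validation_idx), "validation rows")
```

Fixing the seed is what makes results comparable between runs: every algorithm is trained and validated on exactly the same rows.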
4.1 Train the model (Modeling)
  • Now it is time to determine the most suitable algorithm for achieving good accuracy.
  • Here we evaluate five frequently used algorithms:
    1. Linear discriminant analysis
    2. Logistic regression
    3. Decision tree classifier
    4. Gaussian Naive Bayes
    5. Support vector machine
  # evaluate models to determine the better algorithm
  from sklearn.model_selection import KFold, cross_val_score
  from sklearn.linear_model import LogisticRegression
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.naive_bayes import GaussianNB
  from sklearn.svm import SVC

  models = []
  models.append(('LR', LogisticRegression()))
  models.append(('LDA', LinearDiscriminantAnalysis()))
  models.append(('CART', DecisionTreeClassifier()))
  models.append(('NB', GaussianNB()))
  models.append(('SVM', SVC()))
  results = []
  names = []
  for name, model in models:
      # shuffle=True is required when random_state is set
      kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
      cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
      results.append(cv_results)
      names.append(name)
      msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
      print(msg)
The support vector machine model achieves a high accuracy score.
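The evaluation loop relies on k-fold cross-validation: the training rows are cut into k folds, and each fold takes one turn as the validation part while the other folds are used for fitting. The sketch below shows how such index pairs can be generated. It is a simplified illustration without shuffling, and `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(120, 10))
for number, (train, validation) in enumerate(folds, start=1):
    print("fold", number, "train:", len(train), "validation:", len(validation))
```

Each row is used for validation exactly once, so the mean of the per-fold scores is a more stable estimate than a single split would give.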
4.2 Algorithm comparison
You can also compare the accuracy of the algorithms with a comparison graph.
  # choose the best model through graphical representation
  import matplotlib.pyplot as plt

  fig = plt.figure(figsize=(15, 10))
  fig.suptitle('Algorithm comparison')
  ax = fig.add_subplot(111)
  plt.boxplot(results)
  ax.set_xticklabels(names)
  plt.show()
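The same comparison can be made numerically by ranking the models on their mean cross-validation score. The sketch below assumes `names` and `results` have the structure built in the evaluation loop (one list of fold scores per model). The model names and scores shown here are made-up placeholders, not measurements from this chapter, and `rank_models` is a hypothetical helper.

```python
import statistics


def rank_models(names, results):
    """Return (name, mean, std) tuples sorted from best to worst mean score."""
    summary = [
        (name, statistics.mean(scores), statistics.pstdev(scores))
        for name, scores in zip(names, results)
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Made-up placeholder scores purely to show the mechanics.
example_names = ["model_a", "model_b", "model_c"]
example_results = [
    [0.90, 0.80, 1.00],
    [0.70, 0.90, 0.80],
    [1.00, 0.90, 1.00],
]

for name, mean, std in rank_models(example_names, example_results):
    print("{}: {:.3f} ({:.3f})".format(name, mean, std))
```

When two models have similar means, the one with the smaller spread across folds is usually the safer choice.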
5. Validate the data (prediction)
Let’s make predictions on the validation data.
  # prediction
  from sklearn.svm import SVC
  from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

  svn = SVC()
  svn.fit(X_train, Y_train)
  predictions = svn.predict(X_validation)
  print(accuracy_score(Y_validation, predictions))
  print(confusion_matrix(Y_validation, predictions))
  print(classification_report(Y_validation, predictions))
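To see what the first two reports measure, here is a minimal standard-library sketch of accuracy and a confusion matrix. It assumes the convention that rows are true classes and columns are predicted classes. The function names and the tiny label lists are placeholders.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true label i predicted as j."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


# Placeholder labels: one 'setosa' flower is misclassified as 'versicolor'.
y_true = ["setosa", "setosa", "versicolor", "versicolor"]
y_pred = ["setosa", "versicolor", "versicolor", "versicolor"]

print("accuracy:", accuracy(y_true, y_pred))
labels, matrix = confusion(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct predictions fall on the diagonal of the matrix, so off-diagonal entries show exactly which species are being confused with each other.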
Testing the new data.
  # verify new data: two samples with four feature values each
  X_new = numpy.array([[3, 2, 4, 0.2], [4.7, 3, 1.3, 0.2]])
  print("X_new.shape: {}".format(X_new.shape))
Validating the prediction
  # validate
  prediction = svn.predict(X_new)
  print("Prediction of Species: {}".format(prediction))


So in this chapter, you learned how to build a machine learning application using the IRIS dataset.
Elavarasan R