A First Machine Learning Project in Python with the Iris Dataset

I. Introduction
 
Machine learning is a subfield of artificial intelligence in which algorithms learn from data, make decisions based on that data, and try to behave like a human being. It has grown into one of the top five in-demand technologies of 2018. The Iris data set is a famous small dataset that is well suited to simple visualization and analysis techniques. In this article, we take a quick look at how to develop a machine learning “hello world” program.
 
II. Prerequisites
 
Spyder IDE (Python 3.6)
 
III. Topics
  1. Environment setup
  2. Loading and understanding data
  3. Data visualization
  4. Train and validate data
  5. Predict the result
1. Environment setup
  • Download and install Anaconda Navigator, which bundles a variety of integrated development environments for scientific programming and data analysis and lets you launch them without using command-line commands.
  • The Spyder IDE provides interactive editing and debugging facilities for data analysis.
  • Check that the necessary libraries are installed, and print their versions:
    # Import each library and print its version
    import pandas
    print('pandas version is: {}'.format(pandas.__version__))
    import numpy
    print('numpy version is: {}'.format(numpy.__version__))
    import seaborn
    print('seaborn version is: {}'.format(seaborn.__version__))
    import sklearn
    print('sklearn version is: {}'.format(sklearn.__version__))
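
If one of these libraries is missing, a plain import stops with an ImportError at the first failure. As a minimal sketch (the helper name check_libraries is made up for this illustration and is not part of any library), the check can be wrapped so that it reports every library, installed or not:

```python
import importlib
import importlib.util

def check_libraries(names):
    """Return {name: version string, or None if the library is not installed}."""
    report = {}
    for name in names:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        module = importlib.import_module(name)
        # Not every module defines __version__, so fall back to "unknown"
        report[name] = getattr(module, "__version__", "unknown")
    return report

for name, version in check_libraries(["pandas", "numpy", "seaborn", "sklearn"]).items():
    print("{}: {}".format(name, version if version else "NOT INSTALLED"))
```

Any library reported as not installed can then be added through Anaconda Navigator before continuing.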
 
2. Loading and understanding data
  • Pandas is a Python package that provides fast, flexible data structures for analyzing relational or labeled data.
  • Before loading the dataset, save the dataset file in the Spyder working directory.
2.1 Loading the dataset
    # Load the dataset
    import pandas as pd
    iris = pd.read_csv('Iris.csv')
 
2.2 Understanding the dataset
 
Here, we are going to do a few tasks to understand how the numerical data is organized.
 
2.2.1 Preview data
 
Let’s look at the numerical data for the iris flowers: four features measured for each of three species. The preview shows the first 15 rows. The dataset contains three types of flowers, Iris virginica, Iris versicolor, and Iris setosa, and the features of each flower are recorded along with its species.
    # Preview the data
    print(iris.head(15))
 
2.2.2 Description of dataset
 
Let’s look at a summary of each attribute of the iris instances.
    # Description of the data
    print(iris.describe())
The four features are summarized by count, mean, standard deviation, min, max, and percentiles.
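
To make these summary statistics concrete, here is a minimal sketch that computes the same quantities by hand for a single column using only the standard library. The sample values are made up for illustration, and the helper names percentile and describe are hypothetical. The sketch assumes the pandas defaults of a sample standard deviation (n - 1) and linearly interpolated percentiles.

```python
import statistics

def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of an already sorted list."""
    position = (len(sorted_values) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight

def describe(values):
    """Summary statistics for one numeric column."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": ordered[0],
        "25%": percentile(ordered, 0.25),
        "50%": percentile(ordered, 0.50),
        "75%": percentile(ordered, 0.75),
        "max": ordered[-1],
    }

# Hypothetical sepal lengths in cm, made up for illustration
sample = [5.1, 4.9, 4.7, 4.6, 5.0]
for key, value in describe(sample).items():
    print("{:>5}: {:.3f}".format(key, value))
```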
                              
     
2.2.3 Description of class

Now it is time to see how many instances of each class the data frame contains.
    # Flower distribution
    print(iris.groupby('iris').size())
The dataset contains three classes with 150 instances in total, and every instance is described by the numerical values of its features.
                             
       
2.2.4 Shape of data

The shape property returns the total number of flower instances (rows) and the number of columns.
    # Data shape
    print(iris.shape)
The data frame contains 150 samples in 5 columns.
                               
         
3. Data visualization
  • Visualization techniques provide a graphical representation of the Iris species and their features. They help reveal correlations between the X and Y variables (independent and dependent variables).
  • Now, we are going to visualize the dataset in two ways: a box plot and a pairwise joint distribution plot (scatter plot).
3.1 Boxplot
  • A box plot shows the shape of the data distribution, including its upper and lower quartiles.
  • For each iris species, the plot shows standard summary statistics such as the median and the spread of the values.
    # Boxplot of each feature, grouped by species
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(15,10))
    plt.subplot(2,2,1)
    sns.boxplot(x='iris',y='sepallength',data=iris)
    plt.subplot(2,2,2)
    sns.boxplot(x='iris',y='sepalwidth',data=iris)
    plt.subplot(2,2,3)
    sns.boxplot(x='iris',y='petallength',data=iris)
    plt.subplot(2,2,4)
    sns.boxplot(x='iris',y='petalwidth',data=iris)
    plt.show()
         
3.2 Pair plot

A pair plot shows the distribution of each single variable and the relationship between each pair of variables. It gives a clear view of all flower classes in a single graph. Each species is drawn in a different color in the scatter plots.
    # Pairwise joint plot (scatter matrix)
    import seaborn as sns

    # With kernel density estimates on the diagonal
    # (the size parameter was renamed to height in seaborn 0.9)
    sns.pairplot(iris, hue='iris', height=3, diag_kind="kde")
    # With the default settings
    sns.pairplot(iris, hue='iris')
Some data points from different species overlap in these plots.
           
           
           
4. Train and validate data (Machine learning)

Here, we’ll split the dataset into two parts, training data and validation data: 80% of the data is used for training and the remaining 20% for validation.
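    # Split the dataset into training and validation sets
    # (sklearn.cross_validation was replaced by sklearn.model_selection)
    from sklearn.model_selection import train_test_split

    array = iris.values
    X = array[:, 0:4].astype(float)   # the four feature columns
    Y = array[:, 4]                   # the species label
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = train_test_split(
        X, Y, test_size=validation_size, random_state=seed)

    # Settings for 10-fold cross-validation (k=10)
    num_folds = 10
    num_instances = len(X_train)
    scoring = 'accuracy'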
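
To illustrate what an 80/20 split does, here is a minimal standard-library sketch that shuffles the row indices and cuts them in two. It is not scikit-learn's implementation, the helper name split_indices is hypothetical, and the shuffled order will differ from the one the library produces for the same seed.

```python
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle the row indices and cut them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, validation_idx = split_indices(150, 0.20, seed=7)
print(len(train_idx), len(validation_idx))  # 120 30
```

With 150 rows, 120 go to training and 30 are held back for validation. Fixing the seed means the same rows land in each part on every run.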
4.1 Train the model (Modeling)
  • Now, it's time to determine the most suitable algorithm for achieving good accuracy.
  • Here, we are going to evaluate five well-known, frequently used algorithms:
    1. Logistic regression
    2. Linear discriminant analysis
    3. Decision tree classifier
    4. Gaussian Naive Bayes
    5. Support vector machine
    # Evaluate each model to determine the better algorithm
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    results = []
    names = []
    for name, model in models:
        # k-fold cross-validation on the training data
        kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
The support vector machine model achieves a high accuracy score.
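
The evaluation above relies on k-fold cross-validation. As a rough standard-library sketch of the idea (without shuffling, and not scikit-learn's own code; the helper name k_fold_indices is hypothetical), the training rows are cut into k folds and each fold serves once as the validation part:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists; each fold validates exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 120 training rows and 10 folds, as in this article
folds = list(k_fold_indices(n_samples=120, n_folds=10))
for number, (train, validation) in enumerate(folds, start=1):
    print("fold {}: {} training rows, {} validation rows".format(
        number, len(train), len(validation)))
```

Each model is trained ten times on 108 rows and scored on the remaining 12, and the mean and standard deviation of those ten scores are what the evaluation loop prints.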
                                     
             
4.2 Algorithm comparison

You can also compare the accuracy of the algorithms visually with a comparison graph.
    # Choose the best model through a graphical representation
    import matplotlib.pyplot as plt

    fig = plt.figure(figsize=(15,10))
    fig.suptitle('Differentiate algorithms')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()
                                                                    
               
5. Validate the data (prediction)

Let’s make predictions on the validation set.
    # Prediction
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    svn = SVC()
    svn.fit(X_train, Y_train)
    predictions = svn.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))
    print(confusion_matrix(Y_validation, predictions))
    print(classification_report(Y_validation, predictions))
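
To show what the accuracy score and the confusion matrix measure, here is a small standard-library sketch. The labels are made up for illustration and are not results from the model above, and the helper names accuracy and confusion_counts are hypothetical.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels, made up for illustration
actual    = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",  "virginica", "virginica"]

print("accuracy: {:.3f}".format(accuracy(actual, predicted)))

# Rows are actual classes and columns are predicted classes
counts = confusion_counts(actual, predicted)
labels = sorted(set(actual))
for a in labels:
    row = [counts[(a, p)] for p in labels]
    print("{:<12}{}".format(a, row))
```

Values on the diagonal are correct predictions. Any value off the diagonal shows which class was mistaken for which.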
               
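Next, create new data to test the model.
    # New data: two samples with four features each
    # (sepal length, sepal width, petal length, petal width)
    import numpy
    X_new = numpy.array([[3, 2, 4, 0.2], [4.7, 3, 1.3, 0.2]])
    print("X_new.shape: {}".format(X_new.shape))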
               
Predict the species of the new samples.
    # Predict the species of the new data
    prediction = svn.predict(X_new)
    print("Prediction of Species: {}".format(prediction))
                 
IV. Summary

I hope this quick machine learning demonstration was easy to follow.