Simple classification techniques on a fruit dataset


Machine learning techniques learn from a historical dataset, categorized in various ways, to predict new observations from the given inputs. Two kinds of data analysis are commonly used to predict future data trends: classification and prediction. Here we will use these techniques to classify various fruits and find the model that predicts them most accurately. Examples of classification applications include mail checking (spam or not), credit card fraud detection, speech recognition, and biometric identification.
Steps to be followed
1. Understand the dataset
2. Methods of showing data (Visualization)
3. Create a train and test set to generate accuracy
1. Understanding the data
The fruits dataset is a multivariate dataset introduced by Mr. Iain Murray from the University of Edinburgh. It contains measurements of dozens of pieces of fruit, such as apples, oranges, and lemons.
1.1 Shape of data
Let’s look at how many instances the dataset has.
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    fruit = pd.read_csv('fruit.csv')
    # Shape of the dataset: (rows, columns)
    print(fruit.shape)
Here, the dataset contains 59 pieces of fruit (rows) and seven columns.
1.2 Types of fruits and count
    # Types of fruit and the number of samples of each
    print(fruit.groupby('fruit_name').size())
    # Bar chart of the same counts
    sns.countplot(x='fruit_name', data=fruit)
    plt.show()
Graphical representation of fruit counts
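The same counts can also be read as proportions, which makes class imbalance easier to judge. The sketch below is only an illustration: it uses a tiny hypothetical data frame in place of fruit.csv, and assumes only that the class column is called fruit_name as in the code above.

```python
import pandas as pd

# Hypothetical stand-in for fruit.csv; only the 'fruit_name' column matters here.
fruit = pd.DataFrame({
    "fruit_name": ["apple", "apple", "orange", "lemon", "orange", "apple"],
})

# Absolute number of samples per class, most frequent first.
counts = fruit["fruit_name"].value_counts()

# Relative frequency of each class (the values sum to 1).
shares = fruit["fruit_name"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

If one class dominates the proportions, plain accuracy can look better than the model really is, so it is worth checking this before modeling.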
1.3 Data features
In the data frame, each row represents one piece of fruit, which is measured by four features.
    # Preview the first 15 rows
    print(fruit.head(15))
1.4 Statistical distribution
The numerical features of the fruits can be summarized by their mean, median, and percentiles. If the features are not on the same scale, we need to apply scaling techniques.
    # Description of the data (count, mean, std, min, percentiles, max)
    print(fruit.describe())
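One rough way to decide whether scaling is needed is to compare the range of each feature. The following is a minimal sketch under stated assumptions: the values are made up for illustration, and only the column names are taken from the article.

```python
import pandas as pd

# Hypothetical measurements; the column names follow the article's dataset.
sample = pd.DataFrame({
    "mass": [192, 180, 86, 356, 116],
    "width": [8.4, 8.0, 6.2, 9.2, 6.0],
    "height": [7.3, 6.8, 4.7, 9.2, 7.5],
    "color_score": [0.55, 0.59, 0.80, 0.75, 0.72],
})

# Range (max - min) of every feature.
ranges = sample.max() - sample.min()
print(ranges)

# Ratio between the widest and the narrowest range.
spread_ratio = ranges.max() / ranges.min()
print(f"Widest range is about {spread_ratio:.0f}x the narrowest")
```

When one feature spans hundreds of units and another spans less than one, distance-based models such as K-nearest neighbor are dominated by the larger feature unless the data is scaled first.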
2. Methods of showing data (Visualization)
Here, we will apply two types of visualization techniques to determine the distribution of variables and their correlations.
2.1 Boxplot
A boxplot shows the distribution of each feature for every type of fruit.
    # Boxplots: one panel per feature, grouped by fruit name
    plt.figure(figsize=(15,10))
    plt.subplot(2,2,1)
    sns.boxplot(x='fruit_name',y='mass',data=fruit)
    plt.subplot(2,2,2)
    sns.boxplot(x='fruit_name',y='width',data=fruit)
    plt.subplot(2,2,3)
    sns.boxplot(x='fruit_name',y='height',data=fruit)
    plt.subplot(2,2,4)
    sns.boxplot(x='fruit_name',y='color_score',data=fruit)
2.2 Pair plot – scatter matrix
Each type of fruit is plotted in a different color, which makes it easier to tell the classes apart and to see the correlations between features.
    # Pair plot (scatter matrix), colored by fruit name
    sns.pairplot(fruit,hue='fruit_name')
3. Create a train and test set to generate accuracy
3.1 Split dataset
Now we will split the data frame into two parts: a train set and a test set.
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    # Feature columns (inputs) and label column (target)
    feature_names = ['mass', 'width', 'height', 'color_score']
    X = fruit[feature_names]
    y = fruit['fruit_label']

    # Split into train and test sets with a fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit the scaler on the train set only, then apply it to both sets
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
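With its default settings, MinMaxScaler rescales each feature as (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below reproduces that arithmetic by hand with NumPy on made-up values (the numbers and the helper name min_max are assumptions for illustration), to show why the scaler is fitted on the train set only.

```python
import numpy as np

# Hypothetical one-feature example (for instance, mass in grams).
train = np.array([[100.0], [150.0], [200.0]])
test = np.array([[90.0], [175.0], [220.0]])

# "Fit": learn the minimum and maximum from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max(values):
    """Rescale values using the training minimum and maximum."""
    return (values - col_min) / (col_max - col_min)

train_scaled = min_max(train)
test_scaled = min_max(test)

print(train_scaled.ravel())  # training values span exactly [0, 1]
print(test_scaled.ravel())   # test values can fall outside [0, 1]
```

Test values outside [0, 1] are expected and harmless. Fitting the scaler on the full dataset instead would leak information about the test set into training.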
3.2 Machine learning (Modeling)
Now it is time to find the best-suited algorithm, the one that gives the highest accuracy. We are going to try some frequently used algorithms for modeling the dataset.
3.2.1 Decision tree
    # DecisionTreeClassifier
    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    print('DecisionTreeClassifier:')
    print('Accuracy of training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
3.2.2 Logistic regression
    # LogisticRegression
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    print('LogisticRegression:')
    print('Accuracy of training set: {:.2f}'
          .format(logreg.score(X_train, y_train)))
    print('Accuracy of test set: {:.2f}'
          .format(logreg.score(X_test, y_test)))
3.2.3 K-nearest neighbor
    # KNeighborsClassifier
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)
    print('KNeighborsClassifier:')
    print('Accuracy of training set: {:.2f}'
          .format(knn.score(X_train, y_train)))
    print('Accuracy of test set: {:.2f}'
          .format(knn.score(X_test, y_test)))
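To make clear what the K-nearest neighbor classifier does, here is a minimal from-scratch sketch of the idea: measure the distance from a new point to every training point, take the k closest, and let them vote. It is not scikit-learn's implementation; the function name knn_predict and the sample points are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical scaled (width, height) values for two kinds of fruit.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
labels = ["lemon", "lemon", "lemon", "apple", "apple", "apple"]

print(knn_predict(points, labels, (0.2, 0.2), k=3))
print(knn_predict(points, labels, (0.9, 0.9), k=3))
```

Because the prediction depends entirely on distances, this is the model that benefits most from the scaling step in section 3.1.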
3.2.4 Gaussian Naive Bayes
    # GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    print('GaussianNB:')
    print('Accuracy of training set: {:.2f}'
          .format(gnb.score(X_train, y_train)))
    print('Accuracy of test set: {:.2f}'
          .format(gnb.score(X_test, y_test)))
3.2.5 Support vector machine
    # SVC
    from sklearn.svm import SVC
    svm = SVC()
    svm.fit(X_train, y_train)
    print('Support vector machine:')
    print('Accuracy of training set: {:.2f}'
          .format(svm.score(X_train, y_train)))
    print('Accuracy of test set: {:.2f}'
          .format(svm.score(X_test, y_test)))
After modeling, we find that K-nearest neighbor is the best model, as it provides the highest accuracy score.
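The five models above can also be compared in a single loop, which avoids repeating the fit and print code. The sketch below is an illustration under stated assumptions: it uses synthetic data from make_blobs as a stand-in for fruit.csv, so its scores say nothing about the fruit dataset, and the random_state and max_iter settings are choices made here for reproducibility and convergence, not part of the article's code.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the fruit data: 59 samples, 4 features, 3 classes.
X, y = make_blobs(n_samples=59, n_features=4, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same scaling procedure as in section 3.1.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support vector machine": SVC(),
}

# Fit each model once and record (train accuracy, test accuracy).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

# Print the models ordered by test accuracy, best first.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    print(f"{name:<24} train={train_acc:.2f} test={test_acc:.2f}")
```

With only a few dozen samples, the ranking can change with a different random_state, so a single split should be read as a rough comparison rather than a final verdict.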
3.3 Prediction
  • Now we have the most accurate model for the final evaluation.
  • The KNN model is run directly on the test set to obtain its final accuracy, confusion matrix, and per-class report.
    # Prediction
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    pred = knn.predict(X_test)
    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
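The three reports are closely related: accuracy, precision, and recall can all be derived from the confusion matrix. The sketch below works this out by hand for a made-up three-class matrix (the numbers are hypothetical, not results from the fruit dataset), using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion matrix for three classes.
# Rows are true classes, columns are predicted classes.
matrix = [
    [4, 0, 1],
    [0, 5, 0],
    [1, 0, 4],
]

n_classes = len(matrix)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n_classes))

# Accuracy: share of all samples that lie on the diagonal.
accuracy = correct / total

# Recall per class: diagonal entry divided by its row sum.
recall = [matrix[i][i] / sum(matrix[i]) for i in range(n_classes)]

# Precision per class: diagonal entry divided by its column sum.
precision = [
    matrix[i][i] / sum(row[i] for row in matrix) for i in range(n_classes)
]

print(f"accuracy = {accuracy:.2f}")
print("recall    =", [round(r, 2) for r in recall])
print("precision =", [round(p, 2) for p in precision])
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number cannot reveal.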


In this article, we compared several classifiers and found the one with the best accuracy on the fruit dataset. I hope you found it easy to follow.