A Complete Python Scikit Learn Tutorial

Python Scikit-Learn

In the previous article, we studied pandas, its functions and their Python implementations. In this article, we will study the main scikit-learn functions and how to use them from Python.

What is Scikit-Learn? 

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
The scikit-learn project started in 2007 as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy.
Its first public release followed in February 2010. The official website is scikit-learn.org.

Are Scikit-Learn and Sklearn the same?

Yes, both refer to the same library. "scikit-learn" is the official name of the package and is the name used when installing it, whereas "sklearn" is the abbreviated name used when importing it in Python code.
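The naming split is easiest to see side by side. The snippet below is a minimal sketch that assumes the package has already been installed under its official name; it only imports the library and prints the installed version.

```python
# Installed from the command line under the official name:
#   pip install scikit-learn
# Imported in Python under the short name:
import sklearn

# __version__ shows which release the interpreter picked up.
print(sklearn.__version__)
```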

Features of Scikit-Learn 

  • Clustering
    for grouping unlabeled data, for example with KMeans.

  • Cross-Validation
    for estimating the performance of supervised models on unseen data.

  • Datasets
    for test datasets and for generating datasets with specific properties for investigating model behaviour.

  • Dimensionality Reduction
    for reducing the number of attributes in data for summarization, visualization and feature selection, for example with principal component analysis (PCA).

  • Ensemble methods
    for combining the predictions of multiple supervised models.

  • Feature extraction
    for defining attributes in image and text data.

  • Feature selection
    for identifying meaningful attributes from which to create supervised models.

  • Parameter Tuning
    for getting the most out of supervised models.

  • Manifold Learning
    for summarizing and depicting complex multi-dimensional data.

  • Supervised Models
    a vast array including, but not limited to, generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
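A few of the features listed above can be sketched in one short script. The choices here are illustrative assumptions, not recommendations from the article: the iris dataset, a k-nearest-neighbours classifier, three clusters, two principal components, and the candidate values in the parameter grid.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Clustering: group the samples without looking at the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Cross-validation: estimate performance on unseen data with 5 folds.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

# Parameter tuning: try several values of n_neighbors and keep the best.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)

print("cluster ids:", sorted(set(cluster_ids.tolist())))
print("reduced shape:", X_2d.shape)
print("mean CV accuracy:", scores.mean())
print("best parameters:", search.best_params_)
```

Each step uses the same fit-based interface, which is why the different feature families can be combined so easily.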

Installing Scikit-Learn in Python 

1. Ubuntu/Linux
  sudo apt update -y
  sudo apt upgrade -y
  sudo apt install python3-tk python3-pip -y
  python3 -m pip install -U scikit-learn
2. Anaconda Prompt
  conda install scikit-learn


Classification

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification predictive modelling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
For example, spam detection in email services is a classification problem. It is a binary classification because there are only two classes: spam and not spam.
A classifier uses training data to learn how the input variables relate to the class. In this case, known spam and non-spam emails are used as the training data. Once the classifier has been trained well, it can be used to classify an email it has not seen before.
Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis and target marketing.
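  from sklearn import datasets
  from sklearn.linear_model import LogisticRegression

  digits = datasets.load_digits()  # load the 8x8 handwritten digits dataset bundled with scikit-learn (not the full MNIST database)
  clf = LogisticRegression(max_iter=5000)  # create a classifier; LinearRegression is a regressor and does not predict classes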
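To make the spam example concrete, here is a minimal sketch of a binary text classifier. The six emails and their labels are made-up toy data, and the choice of a bag-of-words vectorizer with a naive Bayes model is an assumption for illustration; a real spam filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented for this sketch): known spam and non-spam emails.
emails = [
    "win free money now",
    "free prize claim now",
    "cheap loans win cash",
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch tomorrow with the team",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# CountVectorizer turns each email into word counts (X);
# MultinomialNB learns how those counts relate to the class (y).
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# Classify emails the model has not seen before.
new_emails = ["claim your free cash prize", "team meeting tomorrow"]
predictions = spam_filter.predict(new_emails)
print(predictions.tolist())
```

The first new email shares words only with the spam examples and the second only with the non-spam examples, so the model assigns them to those classes.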

Python Scikit-Learn Functions 


1. Loading Dataset

Scikit-learn comes with a few standard datasets, for instance, the iris and digits datasets for classification and the diabetes dataset for regression. (The Boston house prices dataset mentioned in older tutorials was removed in scikit-learn 1.2.)
  from sklearn import datasets

  iris = datasets.load_iris()
  digits = datasets.load_digits()
The above code loads the iris dataset into "iris" and the handwritten digits dataset into "digits".
  • data
    The data attribute holds the feature matrix of the loaded dataset, with one row per sample.
      print(digits.data)
    This prints a two-dimensional array with 1797 rows (one per image) and 64 columns (one per pixel of each 8x8 image). Similarly, we can print iris.data.
  • target
    The target attribute holds the labels of the loaded dataset.
      print(digits.target)
    This prints [0 1 2 ... 8 9 8]. Similarly, we can print iris.target.
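Printing whole arrays is rarely the quickest way to understand a dataset. The short sketch below inspects the shapes and names instead; the attributes used (data, images, feature_names, target_names) all belong to the objects returned by the bundled loaders.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# Shape is (number of samples, number of features).
print(iris.data.shape)      # (150, 4)
print(digits.data.shape)    # (1797, 64)

# digits also keeps each sample as an 8x8 image; data is the flattened form.
print(digits.images.shape)  # (1797, 8, 8)

# Human-readable names for the iris columns and classes.
print(iris.feature_names)
print(iris.target_names.tolist())
```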

2. Breaking data into Training and Test Set

A dataset can be divided into 3 parts:
1. Test Set
It is the part of the dataset used for black-box testing: the aim is to test the model on data it has never seen before. Here we use metrics like the confusion matrix, accuracy score and F1 score.
2. Training Set
It is the part of the dataset used to train the model and make its hypotheses more accurate.
3. Validation Set
In machine learning, when we build a model, we need to validate it before sending it for final testing. The validation set is used to check how the model behaves during development.
For example:
Let us take the Fashion-MNIST clothing dataset.
Here we would like to build a model that can predict the type of each clothing item.
To do so, we would divide the whole dataset in a 7:3 ratio, where 70% of the data is the training + validation data and 30% is the test data.
In Python, to separate the training and test data, we use sklearn.model_selection.train_test_split().
  from sklearn import datasets
  from sklearn.model_selection import train_test_split

  digits = datasets.load_digits()  # load the bundled handwritten digits dataset
  X, y = digits.data, digits.target
  # test_size=0.3 keeps 30% of the samples for testing; random_state makes the split reproducible
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
The above code shows how to divide the digits dataset in a 7:3 ratio.
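train_test_split produces two parts per call, so one common way to obtain all three parts described above is to call it twice. This is a sketch under assumptions: the 20% validation fraction and the use of stratify (which keeps class proportions similar in each part) are illustrative choices, not something the article prescribes.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)

# First split: hold out 30% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Second split: carve a validation set out of the remaining 70%.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0, stratify=y_trainval
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is set aside first so that nothing done with the training and validation parts can leak into the final evaluation.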

3. Learning and Predicting 

Learning is the process of acquiring new, or modifying existing, knowledge, behaviours, skills, values or preferences. The ability to learn is possessed by humans, animals and some machines; there is also evidence for some kind of learning in certain plants.
In machine learning, learning is the process by which a program improves its knowledge by making observations about its environment. In scikit-learn this means fitting a model to training data.
A prediction, or forecast, is a statement about a future event. A prediction is often, but not always, based upon experience or knowledge. There is no universal agreement about the exact difference between "prediction" and "forecast"; different authors and disciplines ascribe different connotations.
In machine learning, “prediction” refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data, for example to forecast the likelihood of a particular outcome such as whether a customer will churn within 30 days. The algorithm generates probable values for an unknown variable for each record in the new data, allowing the model builder to identify what that value will most likely be.
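The idea of "probable values for each record" maps onto the predict_proba method that many scikit-learn classifiers provide. The sketch below uses the iris dataset and logistic regression as stand-ins (assumptions for illustration, since no churn data is available here) and shows that the predicted label is simply the most probable class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Treat 20% of the records as "new" data the model has not seen.
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per new record, one column per class; each row sums to 1.
proba = model.predict_proba(X_new)

# predict returns the class with the highest probability.
labels = model.predict(X_new)
most_likely = model.classes_[np.argmax(proba, axis=1)]

print(proba[:3].round(3))
print(labels[:3])
```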
1. classifier.fit(X, y)
The fit method is provided on every estimator. It usually takes some samples X, targets y if the model is supervised, and potentially other sample properties such as sample_weight. It should:
  • clear any prior attributes stored on the estimator, unless warm_start is used;
  • validate and interpret any parameters, ideally raising an error if invalid;
  • validate the input data;
  • estimate and store model attributes from the estimated parameters and provided data; and
  • return the now fitted estimator to facilitate method chaining.
  from sklearn import datasets
  from sklearn.linear_model import LogisticRegression

  digits = datasets.load_digits()  # load the bundled handwritten digits dataset
  clf = LogisticRegression(max_iter=5000)  # a classifier; max_iter is raised so the solver can converge
  X, y = digits.data[:-10], digits.target[:-10]  # all samples except the last 10
  clf.fit(X, y)  # fit the model to the samples and their targets
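Two points from the list above, that fit returns the fitted estimator and that it accepts sample properties such as sample_weight, can be checked directly. This is a small sketch; the uniform weights are an assumption chosen only to show where the argument goes.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_digits(return_X_y=True)

clf = LogisticRegression(max_iter=5000)

# fit returns the estimator itself, not a copy.
returned = clf.fit(X, y)
print(returned is clf)  # True

# Attributes learned during fit end with an underscore.
print(clf.classes_)
print(clf.coef_.shape)

# Because fit returns the estimator, calls can be chained.
# sample_weight gives one weight per sample (all equal here).
weights = np.ones(len(y))
first_prediction = (
    LogisticRegression(max_iter=5000)
    .fit(X, y, sample_weight=weights)
    .predict(X[:1])
)
print(first_prediction)
```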
2. classifier.predict(X)
Makes a prediction for each sample, usually taking only X as input. In a classifier or regressor, this prediction is in the same target space used in fitting (e.g. one of {‘red’, ‘amber’, ‘green’} if the y in fitting consisted of these strings). Even when the y passed to fit is a list or other array-like, the output of predict should always be an array or sparse matrix. In a clusterer or outlier detector, the prediction is an integer.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
  from sklearn import datasets
  from sklearn.linear_model import LogisticRegression

  digits = datasets.load_digits()  # load the bundled handwritten digits dataset
  clf = LogisticRegression(max_iter=5000)  # a classifier; max_iter is raised so the solver can converge
  X, y = digits.data[:-10], digits.target[:-10]  # all samples except the last 10
  clf.fit(X, y)  # fit the model to the samples and their targets
  # predict expects a 2-D array, so the single sample is reshaped to (1, n_features)
  print(clf.predict(X[0].reshape(1, -1)))
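Both conventions described for predict can be observed on a tiny example. The three-sample dataset and the decision tree classifier below are assumptions made to keep the sketch small; the behaviour shown (NotFittedError before fit, array output in the original label space after fit) is what the text above describes.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

# Toy data: y is a plain Python list of strings.
X = [[0], [1], [2]]
y = ["red", "amber", "green"]

clf = DecisionTreeClassifier(random_state=0)

# Calling predict before fit raises NotFittedError.
try:
    clf.predict([[1]])
    raised = False
except NotFittedError as err:
    raised = True
    print("Not fitted yet:", err)

# After fit, predict returns a NumPy array in the same label space as y.
clf.fit(X, y)
prediction = clf.predict([[1]])
print(type(prediction), prediction.tolist())
```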

4. Performance Analysis

1. sklearn.metrics.confusion_matrix()
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
  • It allows easy identification of confusion between classes, e.g. one class is commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.
  • The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.
  • The confusion matrix shows how your classification model is confused when it makes predictions.
  • It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.
  • The confusion matrix is often used to calculate the accuracy score.
Each cell of a binary confusion matrix is named by whether the prediction was correct (True/False) and which class was predicted (Positive/Negative):
               Positive   Negative
  True         TP         TN
  False        FP         FN
In scikit-learn, the rows of the returned matrix are the actual classes and the columns are the predicted classes, so for the labels 0 and 1 the matrix reads [[TN FP] [FN TP]].
  # Python script for confusion matrix creation.
  from sklearn.metrics import confusion_matrix

  actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
  predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
  results = confusion_matrix(actual, predicted)
  print('Confusion Matrix :')
  print(results)
The output of the above code will be:
  Confusion Matrix :
  [[4 2]
   [1 3]]
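For a binary problem, the four counts can be unpacked from the matrix and used to compute other metrics. The sketch below reuses the ten labels from the example above, written as one list element per label, and compares the counts with the precision, recall and F1 functions from sklearn.metrics.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes,
# so flattening a 2x2 matrix gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

precision = precision_score(actual, predicted)  # tp / (tp + fp) = 3 / 5
recall = recall_score(actual, predicted)        # tp / (tp + fn) = 3 / 4
f1 = f1_score(actual, predicted)                # harmonic mean of the two
print(precision, recall, f1)
```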
2. sklearn.metrics.accuracy_score()
accuracy_score compares predicted labels with the true labels and returns the fraction that match. Estimators also expose a score method, usually on a predictor, which evaluates its predictions on a given dataset and returns a single numerical score. A greater return value should indicate better predictions; accuracy is used for classifiers and R^2 for regressors by default.
If the estimator was not already fitted, calling score should raise an exceptions.NotFittedError.
Some estimators implement a custom, estimator-specific score function, often the likelihood of the data under the model.
Accuracy = (TP + TN) / (TP + TN + FN + FP)
where TP = True Positive
      TN = True Negative
      FN = False Negative
      FP = False Positive
  from sklearn.metrics import accuracy_score

  y_pred = [0, 2, 1, 3]
  y_true = [0, 1, 2, 3]
  print(accuracy_score(y_true, y_pred))  # 0.5, since 2 of the 4 predictions match
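In practice the score is computed on a held-out test set rather than on hand-written lists. The following sketch ties the pieces together under some assumptions: a 7:3 split of the digits dataset and a logistic regression classifier. For more than two classes, the accuracy formula generalizes to the correct predictions on the diagonal of the confusion matrix divided by the total number of predictions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy from the metrics function.
acc = accuracy_score(y_test, y_pred)

# The same value from the confusion matrix:
# correct predictions (diagonal) over all predictions.
cm = confusion_matrix(y_test, y_pred)
acc_from_matrix = cm.trace() / cm.sum()

# The classifier's own score method also reports accuracy.
acc_from_score = clf.score(X_test, y_test)

print(acc, acc_from_matrix, acc_from_score)
```

All three numbers agree, which is a quick check that the matrix and the score describe the same predictions.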
A list of scikit-learn modules and functions is provided in the table below:
Sno. | Function Name | Description
1 | sklearn.base | Base classes for all estimators
2 | sklearn.calibration | Calibration of predicted probabilities
3 | sklearn.cluster | Provides various unsupervised clustering algorithms
4 | sklearn.cluster.k_means | Function that runs the K-Means clustering algorithm (the KMeans class is the estimator form)
5 | sklearn.cluster.SpectralBiclustering | Provides the spectral biclustering algorithm
6 | sklearn.compose | Meta-estimators for building composite models with transformers
7 | sklearn.covariance | Methods and algorithms to robustly estimate the covariance of features given a set of points. The precision matrix, defined as the inverse of the covariance, is also estimated
8 | sklearn.cross_decomposition | Methods and algorithms to support cross decomposition
9 | sklearn.datasets | Utilities to load datasets, including methods to load and fetch popular reference datasets. It also provides artificial data generators
10 | sklearn.decomposition | Matrix decomposition algorithms, including among others PCA, NMF and ICA
11 | sklearn.discriminant_analysis | Linear Discriminant Analysis and Quadratic Discriminant Analysis
12 | sklearn.dummy | Dummy estimators, which are helpful for getting baseline values of metrics for random predictions
13 | sklearn.ensemble | Ensemble-based methods for classification, regression and anomaly detection
14 | sklearn.exceptions | All custom warnings and error classes used across scikit-learn
15 | sklearn.experimental | Importable modules that enable the use of experimental features or estimators
16 | sklearn.feature_extraction | Feature extraction from raw data. It can currently extract features from text and images
17 | sklearn.feature_selection | Feature selection algorithms. It currently provides univariate filter selection methods and the recursive feature elimination algorithm
18 | sklearn.gaussian_process | Gaussian Process-based regression and classification
19 | sklearn.isotonic | Isotonic regression
20 | sklearn.impute | Transformers for missing value imputation
21 | sklearn.kernel_approximation | Several approximate kernel feature maps based on Fourier transforms
22 | sklearn.kernel_ridge | Kernel ridge regression
23 | sklearn.linear_model | Generalized linear models. It includes Ridge regression, Bayesian regression, and Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent. It also implements Stochastic Gradient Descent related algorithms
24 | sklearn.linear_model.LinearRegression | Ordinary least squares linear regression
25 | sklearn.linear_model.LogisticRegression | Logistic regression classifier
26 | sklearn.manifold | Data embedding techniques
27 | sklearn.metrics | Score functions, performance metrics, pairwise metrics and distance computations
28 | sklearn.metrics.accuracy_score | Gives the accuracy classification score
29 | sklearn.metrics.confusion_matrix | Gives the confusion matrix
30 | sklearn.metrics.f1_score | Gives the F1 score, also known as the balanced F-score or F-measure
31 | sklearn.metrics.classification_report | Builds a text report showing the main classification metrics
32 | sklearn.metrics.precision_score | Gives the precision of the classification
33 | sklearn.metrics.mean_absolute_error | Gives the mean absolute error regression loss
34 | sklearn.metrics.mean_squared_error | Gives the mean squared error regression loss
35 | sklearn.mixture | Mixture modelling algorithms
36 | sklearn.model_selection | Model selection functions, such as data splitting, cross-validation and parameter search
37 | sklearn.multiclass | Multiclass and multilabel classification
38 | sklearn.multioutput | Multioutput regression and classification. The estimators provided in this module are meta-estimators: they require a base estimator to be provided in their constructor. The meta-estimator extends single-output estimators to multi-output estimators
39 | sklearn.naive_bayes | Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions
40 | sklearn.neighbors | The k-nearest neighbours algorithm
41 | sklearn.neural_network | Models based on neural networks
42 | sklearn.pipeline | Utilities to build a composite estimator, as a chain of transforms and estimators
43 | sklearn.inspection | Tools for model inspection
44 | sklearn.preprocessing | Scaling, centring, normalization and binarization methods (imputation is provided by sklearn.impute)
45 | sklearn.random_projection | Random projections, a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes
46 | sklearn.semi_supervised | Semi-supervised learning algorithms. These algorithms use small amounts of labelled data and large amounts of unlabelled data for classification tasks. This module includes Label Propagation
47 | sklearn.svm | Support Vector Machine algorithms
48 | sklearn.tree | Decision tree-based models for classification and regression
49 | sklearn.utils | Various utilities
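Several of the modules in the table are designed to be combined. As an illustrative sketch, the script below chains sklearn.preprocessing and sklearn.svm with sklearn.pipeline, splits the data with sklearn.model_selection, and reports results with sklearn.metrics. The particular choice of StandardScaler, SVC with default settings, and a 7:3 split is an assumption for the example.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A pipeline is a chain of transforms followed by a final estimator.
# Fitting it fits the scaler on the training data, then the classifier.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# The pipeline behaves like a single estimator for predict and score.
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
print(classification_report(y_test, model.predict(X_test)))
```

Keeping the scaler inside the pipeline means it only ever learns from the training data, which avoids leaking information from the test set.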


In this article, we studied Python scikit-learn: its features, how to install it, classification, how to load datasets, how to break a dataset into training and test sets, learning and predicting, performance analysis, and the various modules that scikit-learn provides. We hope you were able to follow everything. If you have any doubts, please post your query in the comments.
In the next article, we will learn about matplotlib, the next library in the series.
Congratulations! You have climbed the next step towards becoming a successful ML engineer.