Random Forest - A Machine Learning Technique

Decision tree 

 
Random Forest is a method for classification, regression, and other kinds of prediction. It is built on the notion of a decision tree: a binary tree-like graph of decisions and their possible consequences.
 
 
The image above shows a possible decision tree for a training set of items with three features: X, Y, and Z. Z is the resolution feature, the one items are classified by. Usually a resolution feature contains categorical values; for numerical versions of the algorithm, categorical values should be replaced by numeric codes. A training set is a collection of items with known values of Z. After the decision tree has been generated, each terminal node contains some items from the training set, and the arithmetic mean of their Z values can be taken as that node's predicted value.
 
For example, suppose there is an item {X=9, Y=3, Z=?} and a prediction for its Z value is required. Descending the tree according to its conditions, we end up in terminal node #4, so the predicted Z is the arithmetic mean of the Z values of the training items in that node.
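The descent itself is mechanical and can be sketched in a few lines. The following Python snippet uses a hypothetical dictionary-based node layout with made-up thresholds (not the library's actual data structures) to show how an item walks down to a terminal node and receives that node's mean Z:

```python
# Hypothetical node layout: internal nodes hold a (feature, threshold)
# condition; terminal nodes hold the mean Z of their training items.
def descend(node, item):
    """Walk the tree until a terminal node and return its mean Z."""
    while "mean_z" not in node:  # still an internal node
        if item[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["mean_z"]

# Tiny hand-built tree with illustrative thresholds: split on X at 5,
# then on Y at 4 (the real thresholds come from tree generation).
tree = {
    "feature": "X", "threshold": 5,
    "left": {
        "feature": "Y", "threshold": 4,
        "left": {"mean_z": 1.0},
        "right": {"mean_z": 2.0},
    },
    "right": {"mean_z": 3.0},
}

print(descend(tree, {"X": 9, "Y": 3}))  # X > 5, so the right leaf: 3.0
```

The item never needs a Z value of its own; the terminal node it lands in supplies the prediction.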
 
Generation of a decision tree 
  
The algorithm to generate a decision tree is recursive. Each iteration finds the best way to split the current training subset into two parts and thereby yields a dividing condition. Which division is the best? There are several criteria. The Gini coefficient is better suited to categorical values (classification). Since the example values are numerical (regression), the RSS (residual sum of squares) criterion will be applied here, but the Gini coefficient is an option as well.
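As a plain-Python sketch of the two criteria (not part of the library): RSS measures the spread of numeric Z values around their mean, while the Gini coefficient measures the impurity of a set of class labels — both are zero for a perfectly homogeneous subset.

```python
from collections import Counter

def rss(zs):
    """Residual sum of squares around the subset mean (regression criterion)."""
    if not zs:
        return 0.0
    mean = sum(zs) / len(zs)
    return sum((z - mean) ** 2 for z in zs)

def gini(zs):
    """Gini impurity of a subset of class labels (classification criterion)."""
    n = len(zs)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(zs).values())

print(rss([1, 1, 2, 2]))   # 1.0: four deviations of 0.5 from the mean 1.5
print(gini([1, 1, 2, 2]))  # 0.5: two equally likely classes
print(rss([2, 2]), gini([2, 2]))  # 0.0 0.0: homogeneous subset
```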
 
The following steps describe the splitting algorithm in general. For each feature (except the resolution feature, Z in the example above):
  1. Calculate RSS for the current subset;
  2. Sort items by feature value;
  3. For each possible division into two subsets, calculate RSS for the left (L) and right (R) subsets;
  4. Find the division with min(RSS(L)+RSS(R)) and return the feature name and the dividing value.
Criteria to stop the recursion:
  1. RSS for the current subset is equal to 0;
  2. The item count in the current subset is less than or equal to the acceptable maximum number of items in a terminal node. 
As a result, a decision tree is generated and the training set is divided into a number of categories (terminal nodes).
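The splitting steps and stop criteria above can be sketched as a short recursive function. This is a minimal Python illustration with an assumed dictionary-based node layout, not the library's implementation:

```python
def rss(zs):
    """Residual sum of squares around the subset mean."""
    mean = sum(zs) / len(zs)
    return sum((z - mean) ** 2 for z in zs)

def best_split(items, features, target="Z"):
    """Try every feature and every boundary between sorted items;
    return (score, feature, threshold) minimising RSS(L) + RSS(R)."""
    best = None
    for f in features:
        srt = sorted(items, key=lambda it: it[f])
        for i in range(1, len(srt)):
            left = [it[target] for it in srt[:i]]
            right = [it[target] for it in srt[i:]]
            score = rss(left) + rss(right)
            if best is None or score < best[0]:
                thr = (srt[i - 1][f] + srt[i][f]) / 2
                best = (score, f, thr)
    return best

def build_tree(items, features, max_leaf=2, target="Z"):
    zs = [it[target] for it in items]
    # Stop criteria: homogeneous subset, or few enough items for a leaf.
    if len(items) <= max_leaf or rss(zs) == 0:
        return {"mean_z": sum(zs) / len(zs)}      # terminal node
    _, f, thr = best_split(items, features, target)
    left = [it for it in items if it[f] <= thr]
    right = [it for it in items if it[f] > thr]
    if not left or not right:                      # degenerate split: stop
        return {"mean_z": sum(zs) / len(zs)}
    return {"feature": f, "threshold": thr,
            "left": build_tree(left, features, max_leaf, target),
            "right": build_tree(right, features, max_leaf, target)}

# Four made-up items; the tree should split them on X around 5.
data = [{"X": 1, "Y": 1, "Z": 1}, {"X": 2, "Y": 1, "Z": 1},
        {"X": 8, "Y": 5, "Z": 2}, {"X": 9, "Y": 6, "Z": 2}]
tree = build_tree(data, ["X", "Y"], max_leaf=2)
print(tree["feature"], tree["threshold"])  # X 5.0
```

Thresholds are taken midway between adjacent sorted values, so any item falling strictly between two training values is still routed deterministically.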
 
Random forest
 
A forest is a collection of trees. And what makes a forest "random"? Each of its decision trees is generated from a random subset of the initial training set. The important thing is that the subsets are drawn with replacement, so any subset used to generate a decision tree can contain duplicate training items. Each tree of the forest makes its own prediction; the prediction of the forest is the arithmetic mean of all the trees' predictions.
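The sample-with-replacement-then-average mechanics can be sketched independently of how a single tree is built. In this plain-Python sketch all names are illustrative, and a trivial predictor that memorises the mean Z of its bootstrap sample stands in for a real decision tree:

```python
import random

def grow_forest(train_set, tree_count, train_tree, ratio=1.0):
    """Grow `tree_count` trees, each on a bootstrap sample of the
    training set (drawn with replacement, so duplicates are allowed)."""
    n = max(1, int(len(train_set) * ratio))
    return [train_tree([random.choice(train_set) for _ in range(n)])
            for _ in range(tree_count)]

def forest_predict(forest, predict, item):
    """Forest prediction = arithmetic mean of the trees' predictions."""
    return sum(predict(tree, item) for tree in forest) / len(forest)

random.seed(0)  # make the random sampling reproducible for the demo
data = [{"X": 1, "Z": 1}, {"X": 2, "Z": 1}, {"X": 8, "Z": 2}, {"X": 9, "Z": 2}]

# Stand-in "tree": just the mean Z of its bootstrap sample; a real
# implementation would build a decision tree from the sample instead.
mean_tree = lambda sample: sum(it["Z"] for it in sample) / len(sample)
forest = grow_forest(data, tree_count=10, train_tree=mean_tree)
print(forest_predict(forest, lambda t, _item: t, {"X": 5}))
```

Because each tree sees a slightly different sample, individual predictions vary; averaging them is what gives the forest its stability, and it is also why any resolution carries a small random component.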
 
Library
 
First, prepare a Microsoft Excel file with training data in the format shown below.
 
X        Y        Z
4.49519  4.972543 1
3.050089 2.18187  2
0.943726 1.617163 1
3.886888 4.365553 1
...      ...      ...
 
where X, Y, and Z are the feature names of a training item (the values of each feature are known; Z is the resolution feature and contains numeric codes standing for categorical class labels). Please keep in mind that the current version of the library does not validate the data file.
 
Library usage:

  using RandomForest.Lib.Numerical.Interfaces;

  // Parameters to "grow" the forest
  ForestGrowParameters p = new ForestGrowParameters
  {
      ExportDirectoryPath = [folder to save json tree files],
      ExportToJson = true,
      ResolutionFeatureName = [name of the feature to classify training items by],
      ItemSubsetCountRatio = [ratio to get a number of training items to form subset and generate tree],
      TrainigDataPath = [path to training data MS Excel file],
      MaxItemCountInCategory = [max number of items in a terminal node],
      TreeCount = [number of trees to generate],
      SplitMode = [RSS or Gini]
  };

  // Method to "grow" the forest
  IForest forest = ForestFactory.Create();
  forest.GrowAsync(p);

  // Method to make a prediction for an item
  IItemNumerical item = forest.CreateItem();
  item.SetValue("X", 1.23);
  item.SetValue("Y", 2.05);
  double z = forest.Resolve(item);
Application
 
  1. Select a Microsoft Excel file with training set data.
  2. Select a folder to save the JSON files.
  3. Input the name of the resolution feature (Z in training-data-sample.xlsx).
  4. Set the number of trees.
  5. Set the max number of items in a category/leaf (5-10 recommended).
  6. Set the training subset count ratio (this affects the algorithm's duration; it depends on the training set size).
  7. Click the "Generate" button and wait until progress reaches 100 percent.
  8. After generation completes, a list of item feature names appears (X, Y in training-data-sample.xlsx). Input the corresponding values and click the "Resolve" button to get a prediction for the resolution feature (Z).
Example training set
 
Here are some sample resolutions. Note that any resolution has a random component, and (for the given example training set) any fractional value should be rounded to the nearest integer.
  1. X = 2.5, Y = 1.75, Z => 1
  2. X = 3, Y = 3, Z => 0
  3. X = 4, Y = 5, Z => 1.992 (2)
  4. X = 5, Y = 5, Z => 2
  5. X = 3.8, Y = 4, Z => 1.54804761904762 (2) 
The sample source code can be found on GitHub.