Decision Trees are a supervised machine learning method for classification. A Decision Tree has a root node, internal nodes (decision nodes), and leaf nodes. The root node is the start of the tree: it has no incoming edges, zero or more outgoing edges, and holds the entire training set. Internal nodes have exactly one incoming edge and two or more outgoing edges, and each contains a test condition on one of the variables. Leaf nodes have one incoming edge and zero outgoing edges, and each is assigned a class label; reaching a leaf means a label has been determined for the current data vector, and that label is then assigned to it. There are technically an infinite number of Decision Trees that can be constructed from a set of variables when at least one variable is continuous, because a continuous variable can be split on any value, which leads to an infinite number of different splits and, in turn, an infinite number of different trees. The way around this problem when constructing a Decision Tree with a continuous variable is to grow the tree greedily with an algorithm such as Hunt's Algorithm, which chooses one good split at a time rather than searching over every possible tree.
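For readers who want to see these pieces in code, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (the library, the iris dataset, and the max_depth setting are illustrative assumptions, not part of the discussion above); the printed text dump shows the root node, the internal decision nodes with their test conditions, and the leaf nodes with their class labels:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: the iris dataset that ships with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target

# criterion="gini" grows the tree greedily, one impurity-reducing split at a time
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Text dump of the fitted tree: the root node, internal (decision) nodes with
# their test conditions, and leaf nodes with the class they predict
print(export_text(tree, feature_names=list(iris.feature_names)))
```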
Split Measures
So, how does the Decision Tree algorithm decide which variable, and which value of that variable, to split on? It uses a numerical measure called Information Gain, computed with an impurity measure such as GINI or Entropy. Information Gain is the difference between the impurity before the split and the weighted average of the impurities of the nodes after the split; essentially, it quantifies how much a split at a tree node un-mixes the labels at that node. GINI and Entropy are two commonly used impurity measures. When there are two labels, Entropy ranges from 0 to 1 and GINI ranges from 0 to 0.5; an Entropy of 1 or a GINI of 0.5 is the worst case, maximum impurity. With more than two labels, Entropy can exceed 1 and GINI can exceed 0.5. The higher the Information Gain, the purer the labels after the split. So, the algorithm calculates the Information Gain for the possible splits at the current node and chooses the variable and value whose split produces the highest Information Gain.
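As a concrete sketch of these definitions, the helper functions below (hypothetical names, plain Python, not from any particular library) compute GINI, Entropy, and Information Gain from the labels at a node, following the formulas described above:

```python
from collections import Counter
from math import log2

def gini(labels):
    """GINI impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy: negative sum of p * log2(p) over the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children, impurity=gini):
    """Impurity of the parent node minus the weighted average impurity of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted
```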
In the image on the left, the split has low Information Gain because both new nodes still have relatively high impurity (a roughly equal distribution of the classes). In the image on the right, the split has high Information Gain because both new nodes have relatively low impurity, with mostly one class in each. (Image from Medium.com)
Example
This example walks through how a split is determined using GINI on a relatively small dataset. The following dataset has two variables (Color and Diameter) and a label giving the type of fruit.
| Color  | Diameter | Label |
| ------ | -------- | ----- |
| Green  | 3        | Apple |
| Yellow | 3        | Apple |
| Red    | 1        | Grape |
| Red    | 1        | Grape |
| Yellow | 3        | Lemon |
The GINI of the full dataset (the root node, with 2 Apples, 2 Grapes, and 1 Lemon) is:
GINI = 1 – ((2/5)^2 + (2/5)^2 + (1/5)^2) = 0.64
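Assuming the hypothetical gini helper from the Split Measures sketch, this root impurity can be reproduced as follows:

```python
# Labels at the root node: the whole dataset
fruits = ["Apple", "Apple", "Grape", "Grape", "Lemon"]
print(round(gini(fruits), 2))  # 0.64
```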
The first step is to split the root node, which contains all of the data vectors. To decide how to split, the GINI and Information Gain values need to be calculated for each candidate split. One option is to split the data on whether Color is Green, in the following way:
Color is not Green:

| Color  | Diameter | Label |
| ------ | -------- | ----- |
| Yellow | 3        | Apple |
| Red    | 1        | Grape |
| Red    | 1        | Grape |
| Yellow | 3        | Lemon |

Color is Green:

| Color | Diameter | Label |
| ----- | -------- | ----- |
| Green | 3        | Apple |
GINI (Color is not Green) = 1 – ((1/4)^2 + (2/4)^2 + (1/4)^2) = 0.625
GINI (Color is Green) = 1 – ((1/1)^2) = 0
Information Gain = 0.64 – ((4/5)*(0.625) + (1/5)*(0)) = 0.14
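Assuming the hypothetical helpers and the fruits list from the earlier snippets, this candidate split can be checked as follows:

```python
# Child nodes produced by splitting on Color is Green
not_green = ["Apple", "Grape", "Grape", "Lemon"]
green = ["Apple"]
print(round(information_gain(fruits, [not_green, green]), 2))  # 0.14
```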
Another option is to split the data on whether Color is Red, in the following way:
Color is Red:

| Color | Diameter | Label |
| ----- | -------- | ----- |
| Red   | 1        | Grape |
| Red   | 1        | Grape |

Color is not Red:

| Color  | Diameter | Label |
| ------ | -------- | ----- |
| Green  | 3        | Apple |
| Yellow | 3        | Apple |
| Yellow | 3        | Lemon |
GINI (Color is Red) = 1 – ((2/2)^2) = 0
GINI (Color is not Red) = 1 – ((2/3)^2 + (1/3)^2) = 4/9 ≈ 0.444
Information Gain = 0.64 – ((2/5)*(0) + (3/5)*(4/9)) ≈ 0.373
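The same hypothetical helpers reproduce this value as well:

```python
# Child nodes produced by splitting on Color is Red
red = ["Grape", "Grape"]
not_red = ["Apple", "Apple", "Lemon"]
print(round(information_gain(fruits, [red, not_red]), 3))  # 0.373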
Since the second option has the higher Information Gain, the root node is split on whether the Color is Red or not.
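To tie the example together, the toy loop below (again assuming the hypothetical helpers and the fruits list from the earlier snippets; this sketches only the split-selection step, not a full tree-building algorithm) evaluates each possible Color split and picks the one with the highest Information Gain:

```python
colors = ["Green", "Yellow", "Red", "Red", "Yellow"]

best_value, best_gain = None, -1.0
for value in sorted(set(colors)):
    # Partition the labels by whether each row's Color equals this value
    match = [label for color, label in zip(colors, fruits) if color == value]
    rest = [label for color, label in zip(colors, fruits) if color != value]
    gain = information_gain(fruits, [match, rest])
    print(f"Color == {value}: Information Gain = {gain:.3f}")
    if gain > best_gain:
        best_value, best_gain = value, gain

print(f"Best split: Color == {best_value}")  # Best split: Color == Red
```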