cv.DTrees - MATLAB File Help
cv.DTrees

Decision Trees

The class represents a single decision tree or a collection of decision trees.

The current public interface of the class allows the user to train only a single decision tree; however, the class is capable of storing multiple decision trees and using them for prediction (by summing responses or using a voting scheme), and the classes derived from cv.DTrees (such as cv.RTrees and cv.Boost) use this capability to implement decision tree ensembles.

Decision Trees

The ML classes discussed in this section implement Classification and Regression Tree algorithms described in [Breiman84].

The class cv.DTrees represents a single decision tree or a collection of decision trees. It's also a base class for cv.RTrees and cv.Boost.

A decision tree is a binary tree (a tree in which each non-leaf node has two child nodes). It can be used either for classification or for regression. For classification, each tree leaf is marked with a class label; multiple leaves may have the same label. For regression, a constant is assigned to each tree leaf, so the approximation function is piecewise constant.
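For illustration, a minimal sketch of training a classification tree with this class (assumes mexopencv is installed; the toy data and property values are arbitrary, and responses given as integer labels are assumed to be treated as categorical by default):

    % two Gaussian blobs, class labels 1 and 2
    X = [randn(50,2)+1; randn(50,2)-1];   % 100 samples, 2 features
    Y = [ones(50,1); 2*ones(50,1)];       % class labels
    model = cv.DTrees();
    model.MaxDepth = 10;                  % limit the tree depth
    model.train(X, Y);                    % build the tree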

Predicting with Decision Trees

To reach a leaf node and obtain a response for the input feature vector, the prediction procedure starts with the root node. From each non-leaf node the procedure goes to the left (selects the left child node as the next observed node) or to the right based on the value of a certain variable whose index is stored in the observed node. The following variables are possible:

- Ordered variables. The variable value is compared with a threshold that is also stored in the node. If the value is less than the threshold, the procedure goes to the left; otherwise, it goes to the right. For example, if the weight is less than 1 kilogram, the procedure goes to the left, else to the right.
- Categorical variables. A discrete variable value is tested to see whether it belongs to a certain subset of values (also stored in the node) from a limited set of values the variable could take. If it does, the procedure goes to the left; otherwise, it goes to the right. For example, if the color is green or red, go to the left, else to the right.

So, in each node, a pair of entities (variable_index, decision_rule (threshold/subset)) is used. This pair is called a split (split on the variable variable_index). Once a leaf node is reached, the value assigned to this node is used as the output of the prediction procedure.
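Once trained, prediction applies this procedure to each input row. A hedged sketch, continuing the example above:

    Xtest = [1 1; -1 -1];          % two query samples, one per row
    pred = model.predict(Xtest);   % returns the value stored in the reached leaf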

Sometimes, certain features of the input vector are missing (for example, in the darkness it is difficult to determine the object color), and the prediction procedure may get stuck at a certain node (in the mentioned example, if the node is split by color). To avoid such situations, decision trees use so-called surrogate splits. That is, in addition to the best "primary" split, every tree node may also be split on one or more other variables with nearly the same results.
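A sketch of enabling surrogate splits before training (whether surrogate splits are actually built may depend on the OpenCV version wrapped by mexopencv):

    model = cv.DTrees();
    model.UseSurrogates = true;   % build surrogate splits alongside the primary ones
    model.train(X, Y);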

Training Decision Trees

The tree is built recursively, starting from the root node. All training data (feature vectors and responses) is used to split the root node. In each node the optimum decision rule (the best "primary" split) is found based on some criteria. In machine learning, a Gini "purity" criterion is used for classification, and the sum of squared errors is used for regression. Then, if necessary, the surrogate splits are found. They resemble the results of the primary split on the training data. All the data is divided using the primary and the surrogate splits (as is done in the prediction procedure) between the left and the right child node. Then, the procedure recursively splits both left and right nodes. At each node the recursive procedure may stop (that is, stop splitting the node further) in one of the following cases:

- Depth of the constructed tree branch has reached the specified maximum value.
- Number of training samples in the node is less than the specified threshold, so it is not statistically representative to split the node further.
- All the samples in the node belong to the same class or, in case of regression, the variation is too small.
- The best found split does not give any noticeable improvement compared to a random choice.
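These stopping conditions map onto the properties listed in the Property Summary below; a hedged sketch of configuring them (the values are arbitrary):

    model = cv.DTrees();
    model.MaxDepth = 8;               % cap the depth of any branch
    model.MinSampleCount = 10;        % do not split nodes with fewer samples
    model.RegressionAccuracy = 0.01;  % termination criterion for regression trees
    model.train(X, Y);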

When the tree is built, it may be pruned using a cross-validation procedure, if necessary. That is, some branches of the tree that may lead to model overfitting are cut off. Normally, this procedure is applied only to standalone decision trees. Tree ensembles usually build trees that are small enough, and they use their own protection schemes against overfitting.
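Pruning is controlled by the cross-validation properties listed below; a sketch (support for `CVFolds > 1` may depend on the wrapped OpenCV version):

    model.CVFolds = 10;              % prune using 10-fold cross-validation
    model.Use1SERule = true;         % harsher pruning, more compact tree
    model.TruncatePrunedTree = true; % physically remove pruned branches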

Variable Importance

Besides prediction, which is the obvious use of decision trees, a tree can also be used for various data analyses. One of the key properties of the constructed decision tree is the ability to compute the importance (relative decisive power) of each variable. For example, in a spam filter that uses a set of words occurring in a message as a feature vector, the variable importance rating can be used to determine the most "spam-indicating" words and thus help keep the dictionary size reasonable.

Importance of each variable is computed over all the splits on this variable in the tree, primary and surrogate ones. Thus, to compute variable importance correctly, the surrogate splits must be enabled in the training parameters, even if there is no missing data.
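The cv.DTrees method summary below does not expose an importance getter; in mexopencv, variable importance is typically retrieved through the derived cv.RTrees class. A hedged sketch (the CalculateVarImportance property name is assumed from the cv.RTrees wrapper):

    rt = cv.RTrees();
    rt.CalculateVarImportance = true;  % assumed cv.RTrees property
    rt.train(X, Y);
    imp = rt.getVarImportance();       % one importance score per variable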

References

[Breiman84]:

Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. "Classification and Regression Trees". CRC Press, 1984.

See also: cv.RTrees, cv.Boost
Class Details
Superclasses: handle
Sealed: false
Construct on load: false
Constructor Summary
DTrees Creates/trains a new decision tree model 
Property Summary
CVFolds If `CVFolds > 1` then the algorithm prunes the built decision tree using K-fold cross-validation, where K is equal to `CVFolds`. 
MaxCategories Cluster possible values of a categorical variable into `K <= MaxCategories` clusters to find a suboptimal split. 
MaxDepth The maximum possible depth of the tree. 
MinSampleCount If the number of samples in a node is less than this parameter, then the node will not be split. 
Priors The array of a priori class probabilities, sorted by the class label value. 
RegressionAccuracy Termination criteria for regression trees. 
TruncatePrunedTree If true then pruned branches are physically removed from the tree. 
Use1SERule If true then pruning will be harsher, making the tree more compact. 
UseSurrogates If true then surrogate splits will be built. 
id Object ID 
Method Summary
  addlistener Add listener for event. 
  calcError Computes error on the training or test dataset 
  clear Clears the algorithm state 
  delete Destructor 
  empty Returns true if the algorithm is empty 
  eq == (EQ) Test handle equality. 
  findobj Find objects matching specified conditions. 
  findprop Find property of MATLAB handle object. 
  ge >= (GE) Greater than or equal relation for handles. 
  getDefaultName Returns the algorithm string identifier 
  getNodes Returns all the nodes 
  getRoots Returns indices of root nodes 
  getSplits Returns all the splits 
  getSubsets Returns all the bitsets for categorical splits 
  getVarCount Returns the number of variables in training samples 
  gt > (GT) Greater than relation for handles. 
  isClassifier Returns true if the model is a classifier 
  isTrained Returns true if the model is trained 
Sealed   isvalid Test handle validity. 
  le <= (LE) Less than or equal relation for handles. 
  listener Add listener for event without binding the listener to the source object. 
  load Loads algorithm from a file or a string 
  lt < (LT) Less than relation for handles. 
  ne ~= (NE) Not equal relation for handles. 
  notify Notify listeners of event. 
  predict Predicts response(s) for the provided sample(s) 
  save Saves the algorithm parameters to a file or a string 
  train Trains a decision tree 
Event Summary
ObjectBeingDestroyed Notifies listeners that a particular object has been destroyed.