Package org.wdssii.decisiontree

Provides classes that implement Quinlan's C4.5 algorithm in Java to train a decision tree based on labeled data (supervised learning) and to classify unlabeled cases using a trained decision tree.

Interface Summary
DecisionTree Capable of classifying new data points into categories.
DecisionTreeCreator The learning algorithm for a decision tree.
FitnessFunction Performance measure of a classifier.
 

Class Summary
AxialDecisionTree A decision tree each of whose branches depends on only one attribute.
AxialTreeNode A node in an axial decision tree has two branches and a condition to decide which branch to take.
Classify Uses a trained decision tree to classify cases.
FitnessFunction.Split  
GainRatioFitnessFunction Picks the threshold that maximizes the information gain, but uses the gain ratio as the fitness of the attribute.
GainRatioFitnessFunction.ValueAndCategory  
MulticategorySkillScore Computes and returns skill scores for a multi-category forecast.
QuinlanC45AxialDecisionTreeCreator C4.5 learning algorithm to create an axial decision tree.
Train Command-line program that invokes Quinlan's C4.5 algorithm.
 

Exception Summary
DecisionTreeCreator.TreeCreationException  
QuinlanC45AxialDecisionTreeCreator.TreeCreationException  
 

Package org.wdssii.decisiontree Description

Provides classes that implement Quinlan's C4.5 algorithm in Java to train a decision tree based on labeled data (supervised learning) and to classify unlabeled cases using a trained decision tree. The trained decision tree can be saved using java.io.ObjectOutputStream or java.beans.XMLEncoder.
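
As background for the gain-ratio criterion named in the class summary, the following is a minimal sketch of how the gain ratio of a single threshold split can be computed. GainRatioSketch and its methods are hypothetical names used only for illustration; they are not part of this package, and GainRatioFitnessFunction may differ in its details.

```java
import java.util.Arrays;

// Hypothetical illustration only; not a class of org.wdssii.decisiontree.
final class GainRatioSketch {
    /** Shannon entropy, in bits, of a vector of per-category counts. */
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Gain ratio of the binary split "value <= threshold" on one attribute. */
    static double gainRatio(float[] values, int[] labels, int numCategories, float threshold) {
        int[] all = new int[numCategories];
        int[] left = new int[numCategories];
        int[] right = new int[numCategories];
        int nLeft = 0;
        for (int i = 0; i < values.length; i++) {
            all[labels[i]]++;
            if (values[i] <= threshold) {
                left[labels[i]]++;
                nLeft++;
            } else {
                right[labels[i]]++;
            }
        }
        int n = values.length;
        int nRight = n - nLeft;
        if (nLeft == 0 || nRight == 0) {
            return 0.0; // degenerate split: every case falls on one side
        }
        // Information gain: entropy before the split minus weighted entropy after it.
        double gain = entropy(all)
                - ((double) nLeft / n) * entropy(left)
                - ((double) nRight / n) * entropy(right);
        // Split information: entropy of the branch sizes themselves.
        double splitInfo = entropy(new int[] {nLeft, nRight});
        return gain / splitInfo;
    }
}
```

For example, with attribute values 1, 2, 3, 4 and labels 0, 0, 1, 1, the threshold 2.5 separates the two categories perfectly and scores higher than the threshold 1.5.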

Training from Command-Line

  1. Create a training file with one training example per line. Separate the columns with commas and put the label in the last column. All columns must be numeric (convert any text columns to numbers). Labels must be integers 0, 1, 2, ..., N-1. The labels may be sparse, i.e., some category (say, category=3) may have no training examples; in that case, the decision tree will never output that category.
  2. Run the training program specifying the training file and where the output files should be placed:
        java org.wdssii.decisiontree.Train trainingfile.csv outdir
    
  3. There are three optional parameters that you can specify when training:
        java org.wdssii.decisiontree.Train trainingfile.csv outdir 0.1    1   " "
    
    1. The first optional parameter is the pruningFraction: the fraction of the training set that is held out for validation. Specify zero if you don't want a validation set. A pruningFraction between 0.1 and 0.3 is recommended to limit overfitting.
    2. The second optional parameter is whether to randomly shuffle the training set before training. This is recommended in case your dataset has an implicit order. Shuffling only serves to make the validation set a random selection; in the absence of a validation set, it has no effect.
    3. The third optional parameter changes the separation character in the dataset from the default, a comma (",").
  4. After training, the output directory contains a file named decisiontree.xml, which holds the final decision tree in a form that can be read and used by this package. It also contains a file named DecisionTree.java, which holds the trained decision tree as Java source code. You can incorporate the generated source code in your programs; there are no restrictions whatsoever.
  5. To try out a trained decision tree, run the Classify program on a test dataset that has the same format as the training dataset. You need to provide the decisiontree.xml produced by training:
        java org.wdssii.decisiontree.Classify outdir/decisiontree.xml testingfile.csv outdir
    
    The output directory will contain a file in which each line holds the decision tree's output for the corresponding line of inputs.
  6. If the testing file is labeled, i.e., its last column holds the correct ("expected") labels, the Classify program also reports the True Skill Statistic, a measure of skill where 1 is the best performance and 0 is no better than random.
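
The file format described in step 1 can be produced from in-memory arrays with a few lines of Java. The following is a minimal sketch; TrainingFileSketch and its methods are hypothetical helpers, not part of this package. It assumes the default comma separator and writes no header line, since the format above does not describe one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a class of org.wdssii.decisiontree.
final class TrainingFileSketch {
    /** Formats each example as "attr1,attr2,...,label", one example per line. */
    static List<String> toCsvLines(float[][] data, int[] categories) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (float v : data[i]) {
                sb.append(v).append(',');
            }
            sb.append(categories[i]); // the label goes in the last column
            lines.add(sb.toString());
        }
        return lines;
    }

    /** Writes the formatted lines to the given file, replacing any existing content. */
    static void write(Path file, float[][] data, int[] categories) throws IOException {
        Files.write(file, toCsvLines(data, categories));
    }
}
```

A call such as TrainingFileSketch.write(Paths.get("trainingfile.csv"), data, categories) would then produce a file that can be passed to the Train program as shown in step 2.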

Usage from Java programs

   // TRAIN
   float[][] data = new float[numTraining][numAttr];
   int[] categories = new int[numTraining];
   // populate arrays
   ...
   QuinlanC45AxialDecisionTreeCreator classifier = new QuinlanC45AxialDecisionTreeCreator(0.1); // pruning fraction
   DecisionTree tree = classifier.learn(data, categories);
   
   
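   // SAVE (the import lines belong at the top of your source file)
   import java.beans.XMLEncoder;
   import java.io.FileOutputStream;
   XMLEncoder encoder = new XMLEncoder(new FileOutputStream("decisiontree.xml"));
   encoder.writeObject(tree);
   encoder.close();
   
   // RETRIEVE
   import java.beans.XMLDecoder;
   import java.io.FileInputStream;
   // XMLDecoder reads from an InputStream, not from a file name
   XMLDecoder decoder = new XMLDecoder(new FileInputStream("decisiontree.xml"));
   AxialDecisionTree tree = (AxialDecisionTree) decoder.readObject();
   decoder.close();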


   // CLASSIFY USING TREE
   float[] data = new float[numAttr];
   // populate array
   ...
   int category = tree.classify(data);


   // COMPUTE SKILL
   float[][] data = new float[numTesting][numAttr];
   int[] categories = new int[numTesting]; // true categories
   MulticategorySkillScore tss = new MulticategorySkillScore(tree.getNumCategories());
   for (int i = 0; i < numTesting; ++i) {
       int result = tree.classify(data[i]);
       tss.update(categories[i], result); // (true category, predicted category)
   }
   float trueSkillScore = tss.getTSS();
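
To make the skill measure more concrete, here is a minimal sketch of one common multi-category form of the True Skill Statistic (also known as the Peirce or Hanssen-Kuipers skill score), computed from a contingency table. TrueSkillSketch is a hypothetical class written for illustration; whether MulticategorySkillScore uses exactly this formula is an assumption, not something stated in this documentation.

```java
// Hypothetical illustration only; not a class of org.wdssii.decisiontree.
final class TrueSkillSketch {
    private final long[][] table; // table[observed][predicted]
    private long total;

    TrueSkillSketch(int numCategories) {
        table = new long[numCategories][numCategories];
    }

    /** Records one case with its true (observed) and predicted category. */
    void update(int observed, int predicted) {
        table[observed][predicted]++;
        total++;
    }

    /**
     * Returns (accuracy - chance agreement) / (1 - sum of squared observed frequencies).
     * The result is NaN if no cases were recorded or if all cases share one true category.
     */
    double tss() {
        int k = table.length;
        double correct = 0.0;
        double chance = 0.0;
        double obsSq = 0.0;
        for (int i = 0; i < k; i++) {
            long obs = 0;
            long pred = 0;
            for (int j = 0; j < k; j++) {
                obs += table[i][j];  // row sum: cases whose true category is i
                pred += table[j][i]; // column sum: cases predicted as category i
            }
            double po = (double) obs / total;
            double pp = (double) pred / total;
            correct += (double) table[i][i] / total;
            chance += po * pp;
            obsSq += po * po;
        }
        return (correct - chance) / (1.0 - obsSq);
    }
}
```

Under this definition, a classifier that labels every case correctly scores 1, and one that always predicts the same category scores 0, which matches the interpretation given for the Classify program.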
   

Since:
2.0