Running VFML's Tools and Learners
This section describes how to use the more than two dozen tools and
learning algorithms that come with VFML to solve your own data mining
problems.
Getting Data
To use the tools you will first need some data sets. The
getting started documentation
contains information on how to download the data sets that come
with VFML. This section briefly describes how to construct your own.
The native data format for VFML is the C4.5 format, which was introduced
by Ross Quinlan with his C4.5 decision tree induction algorithm. See
the VFML appendix for a detailed
description of the format; a brief description follows.
Each data set in C4.5 format consists of several files: a
<stem>.names file that describes the data schema, a
<stem>.data file that contains training data, an optional
<stem>.test file that contains testing data, and an optional
<stem>.prune file that contains prune data. Nearly all of
the VFML tools expect to find at least a .names file, and many also
require a .data file.
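The file layout above can be checked with a few lines of Python before
running any tool. This is only a sketch and not part of VFML: the
function name find_dataset_files is hypothetical, and treating only the
.names file as required is an assumption based on the description above.

```python
from pathlib import Path

# The four extensions named in the text.
EXTENSIONS = ("names", "data", "test", "prune")


def find_dataset_files(stem, directory="."):
    """Map each extension to its Path, or to None if the file is missing.

    Hypothetical helper (not a VFML function). Only <stem>.names is
    treated as required here, which is an assumption.
    """
    base = Path(directory)
    found = {}
    for ext in EXTENSIONS:
        path = base / f"{stem}.{ext}"
        found[ext] = path if path.is_file() else None
    if found["names"] is None:
        raise FileNotFoundError(f"{stem}.names not found in {base}")
    return found
```

A missing .data file is reported as None rather than raised as an error,
because not every tool needs one.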
The .names Format
A .names file describes the data in the data set. It enumerates the
classes, attributes, and their possible values. Look at the following
sample .names file:
| This file contains a make-believe problem which has been designed to help
| introduce users to the VFML framework.
| The goal is to predict if a banana is edible or spoiled from how many days
| it has been sitting on the counter and how many brown spots it has.
| The classes
edible, spoiled.
| The first attribute, a number representing how many days it has been on the
| counter.
days: continuous.
| The second attribute, an indication of how many brown spots the banana has
spots: none, few, many.
As you can see, lines beginning with a | are comments. The first
non-comment element in the file is a list of the classes. Following
that is a list of the attributes, each with a description of the
values it can take (or 'continuous' for attributes that can take
any numeric value). There is also a distinguished value called
'ignore'; many of the VFML tools respect it and will not use an
attribute marked this way.
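To make the structure concrete, here is a minimal Python sketch that
reads a .names file like the sample above. It is not VFML's own parser,
and the name parse_names is hypothetical. It assumes one entry per line,
that each entry ends with a period, and that everything after a | is a
comment; the full C4.5 format may allow more than this sketch handles.

```python
def parse_names(text):
    """Parse simple .names text into (classes, attributes).

    Simplified sketch under the assumptions stated above; not the
    parser VFML itself uses.
    """
    classes = None
    attributes = []  # (name, spec) pairs in file order
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop comment text
        if not line:
            continue
        line = line.rstrip(".").strip()  # drop the terminating period
        if classes is None:
            # The first non-comment entry is the list of classes.
            classes = [c.strip() for c in line.split(",")]
            continue
        name, _, spec = line.partition(":")
        spec = spec.strip()
        if spec in ("continuous", "ignore"):
            attributes.append((name.strip(), spec))
        else:
            values = [v.strip() for v in spec.split(",")]
            attributes.append((name.strip(), values))
    return classes, attributes


sample = """| The classes
edible, spoiled.
days: continuous.
spots: none, few, many.
"""
classes, attributes = parse_names(sample)
print(classes)     # ['edible', 'spoiled']
print(attributes)  # [('days', 'continuous'), ('spots', ['none', 'few', 'many'])]
```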
The .data, .test, and .prune Formats
These three file types share the same format. See the following
sample .data file, which contains data conforming to the .names file
listed above:
1, few, edible
?, none, edible
7, many, spoiled
2, many, spoiled
?, many, spoiled
Each line contains an example as a comma-separated list of attribute
values. The final value on the line is the value of the class
attribute. For instance, the first example represents a banana that
has been on the counter for 1 day, has few spots, and is still edible.
Notice also that some of the attribute values are '?', which the
C4.5 format uses when the value of an attribute is not known. Many
of the VFML tools and learners handle such values gracefully.
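The example format just described can be read with a short Python
sketch. Again, this is an illustration rather than VFML code: the name
parse_examples is hypothetical, values are kept as strings, and the
sketch assumes one example per line with no comments or trailing
periods.

```python
def parse_examples(text, num_attributes):
    """Parse .data/.test/.prune text into (values, label) pairs.

    Unknown values written as '?' become None. Simplified sketch under
    the assumptions stated above.
    """
    examples = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(",")]
        expected = num_attributes + 1  # attribute values plus the class
        if len(fields) != expected:
            raise ValueError(
                f"line {line_number}: expected {expected} fields, "
                f"got {len(fields)}"
            )
        *values, label = fields
        values = [None if v == "?" else v for v in values]
        examples.append((values, label))
    return examples


data = """1, few, edible
?, none, edible
7, many, spoiled
"""
for values, label in parse_examples(data, num_attributes=2):
    print(values, label)
```

Mapping '?' to None keeps unknown values distinct from any legitimate
attribute value, so later code can decide how to treat them.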
The VFML Tools
Once you have a data set you will probably want to perform some
learning. There are a few useful arguments that are supported by
practically every VFML program.
- -f <stem>: Tells the tool to load data from <stem>.names,
  <stem>.data, etc. The default stem is 'DF'.
- -source <dir>: Tells the tool to look in the specified directory for
  the data files. The default is the current working directory.
- -v: Increases the amount of debugging output displayed by the tool.
  Pass the -v argument multiple times to increase the amount of output
  further.
- -u: For learning programs, tells the program to measure the
  accuracy of the learned model on the data in <stem>.test and
  print the result in a format suitable for use with batchtest.
- -h: The most important argument to know about. Tells the
  program to print a complete list of its options with descriptions
  of what they do. These argument lists are the best source of
  documentation for the individual VFML tools.
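If you drive the tools from a script, the common arguments above can be
assembled programmatically. The following Python sketch only builds the
argument list; build_command is a hypothetical helper, and
"some_learner" is a placeholder rather than the name of a real VFML
program. Run a tool with -h to confirm which of these options it
actually accepts.

```python
def build_command(tool, stem="DF", source_dir=None, verbosity=0, test=False):
    """Build an argument list using the common options described above.

    Hypothetical helper, not part of VFML. 'tool' is whatever program
    you intend to run.
    """
    args = [tool, "-f", stem]
    if source_dir is not None:
        args += ["-source", source_dir]
    args += ["-v"] * verbosity  # repeat -v once per verbosity level
    if test:
        args.append("-u")  # only meaningful for learning programs
    return args


cmd = build_command("some_learner", stem="banana",
                    source_dir="datasets", verbosity=2, test=True)
print(cmd)
# ['some_learner', '-f', 'banana', '-source', 'datasets', '-v', '-v', '-u']
```

The resulting list can be passed to subprocess.run from the Python
standard library once "some_learner" is replaced with a real program.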
To learn more about the available tools, see the list of tools; to
learn more about the available learning programs, see the list of
learners. When you find tools and learners that interest you, run them
with the -h argument to get more detailed documentation.