Example for: basic functions of Example and ExampleSpec.
This is a simple example that introduces everything you'll need to load a dataset and
extract the information you will need to execute a learning algorithm. It includes a
made-up data set, a sample makefile, and a program that loads the data set, collects
statistics from it, and frees it. The example's files are in the <VFML-root>/examples/scan-dataset/
directory. This document presents the code with a detailed commentary and some
suggestions for modifications.
You might like to go to the <VFML-root>/examples/scan-dataset/
directory and get your favorite code/text editor ready.
The dataset used for the scan-dataset example is made-up. Each example represents a banana sitting on a kitchen counter. The attributes tell how long each banana has been sitting on the counter and how many black spots each has. From this information, you would like to predict if the banana is edible or spoiled. Unfortunately people are always forgetting how long ago they got their bananas, so the attribute containing that information is sometimes unknown.
Look at the test.names file for the C4.5-name description of the dataset.
The test.data file contains the examples. Notice the '?'s, which indicate that
some of the attribute values are unknown. The first line of the .data file
represents a banana that has been on the counter for 1 day, has a few spots, and is still
edible.
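To make the '?' convention concrete, here is a minimal sketch of how a single field from a data line could be tested for the unknown marker before it is converted to a number. The helper name parse_continuous_field is hypothetical and is not part of VFML, which does this work for you when it reads examples. The sketch assumes the line has already been split into fields.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper, not part of VFML: parse one continuous field taken
   from a C4.5-style data line. Returns 1 and stores the number in *value
   when the field is known. Returns 0, leaving *value untouched, when the
   field is the unknown marker '?' or does not start with a number. */
int parse_continuous_field(const char *field, double *value) {
    char *end;
    double parsed;

    assert(field != NULL && value != NULL);

    /* skip leading white space */
    while (isspace((unsigned char)*field)) {
        field++;
    }

    if (*field == '?') {
        return 0;               /* unknown value */
    }

    parsed = strtod(field, &end);
    if (end == field) {
        return 0;               /* nothing numeric was consumed */
    }

    *value = parsed;
    return 1;
}
```

A caller would check the return value first and only use the number when it is 1, which mirrors the way the example program checks for unknown values before reading them.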
Glance at the makefile; the top couple of lines contain information you would need to update if you wanted to use the file with another project.
The makefile is set up to work as is for the scan-dataset example. Make sure
you've properly installed the VFML library (see the Getting
Started section if you haven't done this yet), and changed to the <VFML-root>/examples/scan-dataset/
directory. Type 'make' to build the example program. Run it by typing scan-dataset,
and look at the output.
Now let's take a look at the code. Load scan-dataset.c into your editor.
#include "uwml.h"
#include <stdio.h>
These two include files will appear in just about every project built with VFML. The first includes all the VFML interfaces; the second is needed to work with files, something you will do in most of your VFML projects.
The next couple of lines declare some global variables which we'll use to keep statistics about the data. We used globals to highlight the separation of this less interesting code from the code that does the real work of the example.
int main(void) {
    ExampleSpecPtr es = ExampleSpecRead("test.names");
    ExamplePtr e;
    FILE *exampleIn = fopen("test.data", "r");
These lines load the example spec, declare an example pointer, and open the example data file. The example spec is very important: it contains a complete description of the dataset, including the attributes, their types and values, and the classes. Your program will query the example spec to determine how to work with a particular dataset, what values to expect, and how to iterate over them. You will also need to pass the spec to various VFML interfaces, so it might be a good thing to make global in your projects.
exampleIn is a file handle to the data file, opened for reading. The program will read examples from this file, one at a time, until there are no more left to read.
Note that the file names are hard coded as test.<names, data>; test is called the filestem. Your programs will need to
accept a command line argument which allows the filestem to be set at runtime.
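One way to support a runtime filestem is to build both file names from a single string. The following is a minimal sketch in standard C; build_dataset_path is a hypothetical helper, not a VFML function, and the buffer sizes are left to the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of VFML: writes "<stem>.<ext>" into out.
   Returns 1 on success, 0 if the result did not fit in outSize bytes
   (in which case out holds a truncated name and should not be used). */
int build_dataset_path(char *out, size_t outSize,
                       const char *stem, const char *ext) {
    int n;

    assert(out != NULL && stem != NULL && ext != NULL);

    n = snprintf(out, outSize, "%s.%s", stem, ext);
    return n >= 0 && (size_t)n < outSize;
}
```

In main, the stem could come from argv[1] when an argument is given and default to "test" otherwise. The resulting names would then be passed to ExampleSpecRead and fopen in place of the hard-coded strings.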
The next couple of lines make calls to the ExampleSpec interface to figure out some properties of the test dataset. First we figure out how many attributes and classes there are.
    printf("There are %d attributes.\n",
           ExampleSpecGetNumAttributes(es));
    printf("There are %d classes.\n",
           ExampleSpecGetNumClasses(es));
Then we figure out some more information about the attributes. In the example, we hard code the attribute indexes; a real learner would have to be more sophisticated. Notice that attribute indexing (and all other indexing in VFML) is zero-based, just like C arrays.
    if(ExampleSpecIsAttributeContinuous(es, 0)) {
        printf(" Attribute with index 0 is continuous.\n");
    }
    if(ExampleSpecIsAttributeDiscrete(es, 1)) {
        printf(" Attribute with index 1 is discrete "
               "and has %d values.\n",
               ExampleSpecGetAttributeValueCount(es, 1));
    }
The Scan-Dataset program loads, examines, and frees the examples from the data set one at a time. Most learners will need to load the entire dataset into RAM and do some significant processing. Finding the right data structure can be a bit problematic. Arrays have quick random access but are a bit inconvenient when you don't know the size of the data set ahead of time. Linked lists are easy to build but are slow to access.
After reading an example, the program tests the values of its attributes and records some statistics. As above, the example program hard codes indexes to attributes and values.
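The array-versus-list tradeoff can be softened with an array that grows as items are added. The following is a minimal sketch in standard C. PtrArray and its functions are hypothetical names chosen for illustration; they are not VFML types, and VFML may offer its own containers that are a better fit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of pointers, not a VFML type. */
typedef struct {
    void **items;     /* stored pointers */
    size_t count;     /* slots in use */
    size_t capacity;  /* slots allocated */
} PtrArray;

/* Start empty; storage is allocated on the first append. */
void PtrArrayInit(PtrArray *a) {
    assert(a != NULL);
    a->items = NULL;
    a->count = 0;
    a->capacity = 0;
}

/* Append one pointer, doubling the storage when it is full.
   Returns 1 on success, 0 if memory could not be allocated. */
int PtrArrayAppend(PtrArray *a, void *item) {
    assert(a != NULL);
    if (a->count == a->capacity) {
        size_t newCapacity = a->capacity ? a->capacity * 2 : 16;
        void **grown = realloc(a->items, newCapacity * sizeof *grown);
        if (grown == NULL) {
            return 0;   /* the old storage is still valid */
        }
        a->items = grown;
        a->capacity = newCapacity;
    }
    a->items[a->count++] = item;
    return 1;
}

/* Release the storage; the stored pointers themselves are not freed. */
void PtrArrayFree(PtrArray *a) {
    assert(a != NULL);
    free(a->items);
    a->items = NULL;
    a->count = 0;
    a->capacity = 0;
}
```

A learner could append each pointer returned by ExampleRead instead of freeing it right away, then index into items for random access. Because the capacity doubles, appends stay cheap on average even when the data set size is unknown in advance. Each stored example would still need to be freed with ExampleFree before calling PtrArrayFree.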
    e = ExampleRead(exampleIn, es);
    while(e != 0) { /* ExampleRead returns 0 at EOF */
        /* keep a count of the examples */
        gNumExamples++;
        /* keep a count of how many of them are spoiled */
        if(!ExampleIsClassUnknown(e)) {
            if(ExampleGetClass(e) == 1) {
                gNumSpoiled++;
            }
        }
Scan-dataset always checks each value to make sure it isn't 'unknown' before attempting to use it. The result of accessing an unknown value is undefined.
        /* keep a sum of the number of days */
        if(!ExampleIsAttributeUnknown(e, 0)) {
            gSumDays +=
                ExampleGetContinuousAttributeValue(e, 0);
        } else {
            gNumDaysUnknown++;
        }
        /* keep a total of the number of bananas
           that have a few spots */
        if(!ExampleIsAttributeUnknown(e, 1)) {
            if(ExampleGetDiscreteAttributeValue(e, 1)
                                                == 1) {
                gNumFewSpots++;
            }
        }
VFML allows you to access any attribute as either continuous or discrete, but accessing with the wrong type will return a garbage value at best. You should always use the ExampleSpec interface to check an attribute's type before you access it.
        /* now move on to the next example */
        ExampleFree(e);
        e = ExampleRead(exampleIn, es);
    }
Scan-dataset prints out some statistics when it's done scanning the data.
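When you turn counters like these into summary statistics, remember that unknown values were left out of the sum. The following is a minimal sketch of an average that accounts for this. mean_of_known is a hypothetical helper, not part of VFML or the example program, and the parameter types are assumptions, since the declarations of the globals are not shown here.

```c
#include <assert.h>

/* Hypothetical helper: mean of the known values, given their sum, the total
   number of examples, and how many of those had an unknown value.
   Returns 0.0 when no value was known, avoiding a division by zero. */
double mean_of_known(double sum, long total, long unknown) {
    long known;

    assert(total >= 0 && unknown >= 0 && unknown <= total);

    known = total - unknown;
    return known > 0 ? sum / (double)known : 0.0;
}
```

With the counters from the example, a call such as mean_of_known(gSumDays, gNumExamples, gNumDaysUnknown) would give the average number of days over only those bananas whose age was recorded.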