bindata File Reference

Detailed Description

Converts continuous attributes into discrete ones.

Converts all continuous attributes in a data set to categorical ones. It makes two passes over the data: one to gather the statistics needed to pick bin boundaries, and one to do the conversion. The first pass can be run on a sample instead of the full data by using the -samples argument below.
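The two-pass structure described above can be sketched as follows. This is only an illustration in Python, not bindata's source: the function names gather_ranges and convert are hypothetical, equal-width bins are assumed for concreteness, and clamping values that fall outside the sampled range into the first or last bin is an assumption, since this page does not say how bindata treats them.

```python
def gather_ranges(rows, samples=None):
    """Pass 1: record (min, max) for each column.

    If `samples` is given, only the first `samples` rows are examined,
    which mirrors the idea behind the -samples argument.
    """
    subset = rows if samples is None else rows[:samples]
    columns = list(zip(*subset))
    return [(min(col), max(col)) for col in columns]


def convert(rows, ranges, bins=10):
    """Pass 2: replace each value with a bin index in 0..bins-1."""
    out = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            width = (hi - lo) / bins
            # A constant column has zero width; put everything in bin 0.
            index = 0 if width == 0 else int((value - lo) / width)
            # Assumption: out-of-range values are clamped to the end bins.
            new_row.append(min(max(index, 0), bins - 1))
        out.append(new_row)
    return out
```

Under these assumptions a test set would be converted by calling convert on it with the ranges gathered from the training data, so the test set never influences the boundaries.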

bindata selects bin boundaries with one of two methods. The default method finds the range of each attribute (by identifying its highest and lowest values) and divides that range into equal-width bins. The other method, selected with -gaussian, assumes that the attribute was generated from a Gaussian, estimates the mean and variance of that Gaussian from the data, and sets bin boundaries so that each bin holds an equal share of the Gaussian's probability mass.
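As a rough illustration of the two boundary-selection methods, the Python sketch below computes the interior boundaries for a single attribute. The function names are hypothetical and this is not bindata's code. In particular, the Gaussian fit here uses the sample standard deviation (via statistics.NormalDist.from_samples, Python 3.8+), which is an assumption; this page does not say which variance estimator bindata uses.

```python
from statistics import NormalDist


def equal_width_boundaries(values, bins=10):
    """Split [min, max] into `bins` intervals of equal width.

    Returns the bins - 1 interior cut points.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def gaussian_boundaries(values, bins=10):
    """Fit a Gaussian and cut it so each bin holds 1/bins of its mass.

    Needs at least two distinct values so the fitted standard
    deviation is positive.
    """
    dist = NormalDist.from_samples(values)  # assumption: sample std dev
    # The i-th cut point is the quantile at probability i / bins.
    return [dist.inv_cdf(i / bins) for i in range(1, bins)]
```

With equal-width bins the cut points are evenly spaced, so outliers can leave most bins nearly empty. The Gaussian cut points crowd together near the mean and spread out in the tails, which keeps the bins more evenly filled when the attribute is roughly normal.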

Thanks: to Chun-Hsiang Hung for doing the core development work for this tool.

Wish List: that this tool would have more methods for selecting bin boundaries, for example one that picks boundaries to reduce entropy.

Arguments

-f <filestem>
- Set the stem name (default DF)
-fout <filestem>
- Set the name of the output dataset (default DF-out)
-source <dir>
- Set the directory that contains the dataset (default '.')
-target <dir>
- Set the directory to contain the output dataset (default '.')
-test
- Also convert the test set (the test set is not used to pick boundaries)
-samples <n>
- Use the first n examples to pick bin boundaries (default: use the whole training set)
-bins <n>
- Set the number of bins (default 10)
-gaussian
- Pick bin boundaries by fitting a Gaussian and making even probability bins
-h
- Display usage information and exit.
-v
- Can be used multiple times to increase the debugging output
