cvfdt File Reference
Detailed Description
Learns a DecisionTree from a high-speed time-changing data stream (or very large data set).
Learns a decision tree from a high-speed time-changing data stream or a very large data set as described in this paper. cvfdt does not work with continuous attributes.
CVFDT learns a tree as follows. It starts with a single leaf and starts collecting training examples from a data stream (with the -stdin argument) or from the file stem.data. When it has enough data to know, with high confidence that it knows which attribute is the best to partition the data with, it turns the leaf into an internal node splitting on that attribute and starts learning at the new leaves recursively. CVFDT maintains a window of training examples and keeps its learned tree up-to-date with this window by monitoring the quality of its old decisions as data moves into and out of the window. In particular, whenever a new example is read it is added to the statistics at all the nodes in the tree that it passes through, the last example in the window is forgotten from every node where it had previously had an effect, and the validity of all statistical tests are checked. If cvfdt detects a change it starts growing an alternate tree in parallel which is rooted at the newly-invalidated node. When the alternate is more accurate on new data than the original the original is replaced by the alternate and freed.
cvfdt takes input and does output in c4.5 format. It expects to find the files <stem>.names
and <stem>.data
.
- Thanks:
- to Laurie Spencer for doing the core development work for cvfdt.
- Wish List:
- Modify this learner to work with continuous attributes.
An API to this learner like the one to learning BeliefNet structure in beliefnet-engine.h
Arguments
- -f 'filestem'
- Set the name of the dataset (default DF)
- -source 'dir'
- Set the source data directory (default '.')
- -u
- Test the learner's accuracy on data in 'stem'.test
- -sc 'allowed chance of error in each decision'
- Default 0.01 that's 1 percent
- -stdin
- Reads training examples from stdin instead of from 'stem'.data (default off) NOTE this disables the rescans switch
- -tc 'tie error'
- Call a tie when hoeffding e < this. (default 0.05)
- -rescans 'count'
- Naievely consider each example 'count' times with no concern for using it once per level of the induced tree (default 1)
- -chunk 'count'
- Wait until 'count' examples accumulate at a leaf before testing for a split (default 300)
- -growMegs 'count'
- Limit dynamic memory allocation to 'count' megabytes (default 1000)
- -window 'count'
- Number of examples in the window (default 50000)
- -cache 'count'
- Number of examples from the window to keep in memory (default 10000)
- -schedule '#'
- Run tests every # examples (default 10000).
- -altTest '#'
- Interval for building and testing alternate trees (default 10000).
- -incrementalReporting
- As each example arrives test the current learned model with it and then learn on it (default off).
- -v
- Can be used multiple times to increase the debugging -output
- -checkSize 'count'
- Wait until 'count' examples accumulate before checking for questionable nodes(default 10000)
- -h
- Run cvfdt -h for a list of the arguments and their meanings.
Generated for VFML by
hosted by