This is a very simple 'learner', but it may be useful as a baseline to compare your learner against; predicting with 99% accuracy isn't impressive if 98% of the examples have the same class.
The mostcommonclass learner works in time proportional to the number of training examples and uses space proportional to the number of classes. It should be able to work on large datasets.
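The time and space bounds above follow from a single counting pass. The sketch below illustrates the idea in Python; it is not the tool's actual implementation. The function name most_common_class is hypothetical, and the sketch assumes each data line is comma-separated with the class label in the last field, optionally followed by a period.

```python
from collections import Counter
from typing import Iterable


def most_common_class(lines: Iterable[str]) -> str:
    """Return the most frequent class label seen in a stream of data lines.

    Assumption: each line is comma-separated and the class label is the
    last field, optionally followed by a trailing period.
    """
    counts: Counter = Counter()  # one entry per distinct class label
    for line in lines:  # a single pass; examples are never stored
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = line.rstrip(".").rsplit(",", 1)[-1].strip()
        counts[label] += 1
    if not counts:
        raise ValueError("no examples found")
    # Ties go to the label that was encountered first.
    return counts.most_common(1)[0][0]
```

Because only the per-class counts are kept, memory use does not grow with the number of examples, so the input can be streamed from a file of any size.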
The learner reads its input and writes its output in C4.5 format. It expects to find the files <stem>.names and <stem>.data.
Depending on its command-line arguments, it either outputs the most common class or tests its error rate on <stem>.test.
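The error-rate test can be pictured as follows: predict the majority class of the training labels for every test example and report the fraction of misses. This is a minimal sketch under assumptions, not the tool's own code; baseline_error_rate is a hypothetical name, and the labels are assumed to have already been extracted from the data and test files.

```python
from collections import Counter
from typing import Iterable


def baseline_error_rate(train_labels: Iterable[str],
                        test_labels: Iterable[str]) -> float:
    """Error rate of always predicting the training set's majority class."""
    counts = Counter(train_labels)
    if not counts:
        raise ValueError("no training examples")
    majority = counts.most_common(1)[0][0]

    total = 0
    errors = 0
    for label in test_labels:  # one pass over the test labels
        total += 1
        if label != majority:
            errors += 1
    if total == 0:
        raise ValueError("no test examples")
    return errors / total
```

Any other learner evaluated on the same test file should beat this number before its accuracy is considered meaningful.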
mostcommonclass -f banana -source datasets/banana
Looks for a dataset named 'banana' in the 'datasets/banana' directory. Outputs the name of the most common class in the dataset.