Tracks and maintains the sufficient statistics needed to calculate the entropy and Gini index of discrete and continuous attributes, and answers some queries about the probability of events in the data.
Data Structures | |
struct | _ExampleGroupStats_ |
Sufficient statistics for Entropy and Gini. | |
Typedefs | |
typedef _ExampleGroupStats_ | ExampleGroupStats |
Sufficient statistics for Entropy and Gini. | |
typedef _ExampleGroupStats_ * | ExampleGroupStatsPtr |
Sufficient statistics for Entropy and Gini. | |
Functions | |
ExampleGroupStatsPtr | ExampleGroupStatsNew (ExampleSpecPtr es, AttributeTrackerPtr at) |
Creates a structure to track sufficient statistics. | |
void | ExampleGroupStatsFree (ExampleGroupStatsPtr egs) |
Frees all the memory that was being used by the structure. | |
void | ExampleGroupStatsDeactivate (ExampleGroupStatsPtr egs) |
Temporarily frees the memory being used to hold statistics. | |
void | ExampleGroupStatsReactivate (ExampleGroupStatsPtr egs) |
Reallocates the memory that is freed by a call to ExampleGroupStatsDeactivate. | |
void | ExampleGroupStatsAddExample (ExampleGroupStatsPtr egs, ExamplePtr e) |
Adds the information from the example to the statistics structure. | |
void | ExampleGroupStatsWrite (ExampleGroupStatsPtr egs, FILE *out) |
A debugging function that prints a representation of the stats structure to the specified file. | |
long | ExampleGroupStatsNumExamplesSeen (ExampleGroupStatsPtr egs) |
Number of examples being tracked by the structure. | |
AttributeTrackerPtr | ExampleGroupStatsGetAttributeTracker (ExampleGroupStatsPtr egs) |
Returns the attribute tracker associated with the structure. | |
int | ExampleGroupStatsIsAttributeActive (ExampleGroupStatsPtr egs, int num) |
Tests if the attribute is active. | |
void | ExampleGroupStatsIgnoreAttribute (ExampleGroupStatsPtr egs, int num) |
Frees the memory being used by the attribute and stops tracking it. | |
int | ExampleGroupStatsGetMostCommonClassLaplace (ExampleGroupStatsPtr egs, int addClass, int addCount) |
Returns the index of the most common class, but adds addCount samples to addClass. | |
int | ExampleGroupStatsGetMostCommonClass (ExampleGroupStatsPtr egs) |
Returns the index of the most common class. | |
long | ExampleGroupStatsGetMostCommonClassCount (ExampleGroupStatsPtr egs) |
Returns the number of examples with the most common class that were seen by the structure. | |
int | ExampleGroupStatsGetMostCommonClassForAttVal (ExampleGroupStatsPtr egs, int att, int val) |
Returns the most common class among examples where the specified attribute has the specified value. | |
int | ExampleGroupStatsIsPure (ExampleGroupStatsPtr egs) |
Returns 1 if all the examples shown to the structure have the same class. | |
float | ExampleGroupStatsGetValuePercent (ExampleGroupStatsPtr egs, int attNum, int valNum) |
Returns the fraction of examples that have the specified value for the specified attribute. | |
double | ExampleGroupStatsGetValueGivenClassMEstimate (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum) |
Returns P(att = value | class). | |
float | ExampleGroupStatsGetClassPercent (ExampleGroupStatsPtr egs, int classNum) |
Returns P(class). | |
float | ExampleGroupStatsGetPercentBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float thresh) |
Returns the fraction of examples with a value below the specified threshold. | |
double | ExampleGroupStatsGetValueGivenClassMEstimateLogP (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum) |
Returns the log of a smoothed P(att = value | class). | |
double | ExampleGroupStatsGetClassLogP (ExampleGroupStatsPtr egs, int classNum) |
Returns the log of the fraction of examples that have the specified class. | |
float | ExampleGroupStatsEntropyTotal (ExampleGroupStatsPtr egs) |
Returns the entropy of the class attribute of all examples seen so far. | |
float | ExampleGroupStatsEntropyDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum) |
Returns the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. | |
float | ExampleGroupStatsEntropyPlusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta) |
Returns an upper bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. | |
float | ExampleGroupStatsEntropyMinusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta) |
Returns a lower bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. | |
void | ExampleGroupStatsEntropyContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh) |
Finds the entropy of the best split thresholds. | |
float | ExampleGroupStatsGiniTotal (ExampleGroupStatsPtr egs) |
Returns the Gini index of the class attribute of all examples seen so far. | |
float | ExampleGroupStatsGiniDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum) |
Returns the weighted Gini index of the class attribute after partitioning the data by the values of the specified attribute. | |
void | ExampleGroupStatsGiniContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh) |
Finds the Gini index of the best split thresholds. | |
void | ExampleGroupStatsIgnoreSplitsWorseThanEntropy (ExampleGroupStatsPtr egs, int attNum, float entropyThresh) |
Stop monitoring some thresholds. | |
void | ExampleGroupStatsIgnoreSplitsWorseThanGini (ExampleGroupStatsPtr egs, int attNum, float giniThresh) |
Stop monitoring some thresholds. | |
int | ExampleGroupStatsLimitSplitsEntropy (ExampleGroupStatsPtr egs, int attNum, int maxSplits, int pruneDownTo) |
Reduce the number of thresholds being considered if above the max. | |
void | ExampleGroupStatsStopAddingSplits (ExampleGroupStatsPtr egs, int attNum) |
Stop adding new split thresholds, but continue to use future examples to evaluate the existing ones. | |
int | ExampleGroupStatsNumSplitThresholds (ExampleGroupStatsPtr egs, int attNum) |
Returns the number of thresholds that are being monitored for the specified attribute. | |
int | ExampleGroupStatsGetMostCommonClassAboveThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold) |
Returns the most common class above the specified value. | |
int | ExampleGroupStatsGetMostCommonClassBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold) |
Returns the most common class below the specified value. |
|
Sufficient statistics for Entropy and Gini.
|
|
Adds the information from the example to the statistics structure.
|
|
Temporarily frees the memory being used to hold statistics. Does not free the whole structure. A later call to ExampleGroupStatsReactivate will restore the memory (but not the counts that used to be there). This is a convenient way to focus RAM usage (and learning) on one part of the instance space while keeping the bookkeeping around to quickly resume learning in another. You should not add examples to a deactivated structure. |
|
Finds the entropy of the best split thresholds. Calculates the entropy of splitting the specified attribute by every threshold under consideration (values are sorted and then a threshold is considered between each pair of adjacent values that have different classes). The remaining arguments return the entropy of the best and second best thresholds, along with the thresholds themselves. This function adds an MDL penalty similar to the one Quinlan uses in C4.5. Should only be called for continuous attributes. |
|
Returns the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. Should only be called for discrete attributes. |
|
Returns a lower bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. This uses the Hoeffding bound and the empirical probabilities to return a value that is lower than the true entropy with probability 1 - delta. Should only be called for discrete attributes. |
|
Returns an upper bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute. This uses the Hoeffding bound and the empirical probabilities to return a value that is higher than the true entropy with probability 1 - delta. Should only be called for discrete attributes. |
|
Returns the entropy of the class attribute of all examples seen so far.
|
|
Frees all the memory that was being used by the structure.
|
|
Returns the attribute tracker associated with the structure.
|
|
Returns the log of the fraction of examples that have the specified class.
|
|
Returns P(class).
|
|
Returns the index of the most common class.
|
|
Returns the most common class above the specified value. Should only be called for continuous attributes. |
|
Returns the most common class below the specified value. Should only be called for continuous attributes. |
|
Returns the number of examples with the most common class that were seen by the structure.
|
|
Returns the most common class among examples where the specified attribute has the specified value. Should only be called for discrete attributes. |
|
Returns the index of the most common class, but adds addCount samples to addClass. Use addClass of -1 for no addition (or just call ExampleGroupStatsGetMostCommonClass). This addition lets you, for example, smooth the class towards the parent class during decision tree induction. |
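The effect of addClass and addCount can be sketched on a plain array of class counts. `most_common_class_laplace` is a hypothetical stand-in written for this example, and resolving ties in favor of the lowest class index is an assumption; the library may break ties differently.

```c
#include <assert.h>

/* Hypothetical helper: index of the largest class count after adding
 * add_count virtual samples to add_class. Pass add_class = -1 to add
 * nothing. Ties go to the lowest class index (an assumption). */
int most_common_class_laplace(const long *counts, int num_classes,
                              int add_class, int add_count)
{
    int best = -1;
    long best_count = -1;
    int c;

    assert(num_classes > 0);
    assert(add_class >= -1 && add_class < num_classes);

    for (c = 0; c < num_classes; c++) {
        long n = counts[c];
        if (c == add_class) n += add_count;
        if (n > best_count) {
            best_count = n;
            best = c;
        }
    }
    return best;
}
```

With counts of 3 and 4, class 1 wins on its own, but adding two samples to class 0 (for example because the parent node predicted class 0) tips the result to class 0.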
|
Returns the fraction of examples with a value below the specified threshold. Should only be called for continuous attributes. |
|
Returns P(att = value | class). Returns the fraction of examples among those that have the specified class that have the specified value for the specified attribute, but smooths the return value by adding a small amount (that decreases with the number of samples seen) to each class count first. Should only be called for discrete attributes. |
|
Returns the log of a smoothed P(att = value | class). Returns the log of the fraction of examples among those that have the specified class that have the specified value for the specified attribute, but smooths the return value by adding a small amount (that decreases with the number of samples seen) to each class count first. Should only be called for discrete attributes. |
|
Returns the fraction of examples that have the specified value for the specified attribute. Should only be called for discrete attributes. |
|
Finds the Gini index of the best split thresholds. Calculates the Gini index of splitting the specified attribute by every threshold under consideration (values are sorted and then a threshold is considered between each pair of adjacent values that have different classes). The remaining arguments return the Gini index of the best and second best thresholds, along with the thresholds themselves. Should only be called for continuous attributes. |
|
Returns the weighted Gini index of the class attribute after partitioning the data by the values of the specified attribute. Should only be called for discrete attributes. |
|
Returns the Gini index of the class attribute of all examples seen so far.
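For comparison with entropy, the Gini index of a class distribution is 1 minus the sum of squared class probabilities. The sketch below computes it from per-class counts; `gini_from_counts` is a hypothetical helper written for this example, not a library function.

```c
#include <assert.h>

/* Hypothetical helper: Gini index of a class distribution given the
 * number of examples seen for each class. */
double gini_from_counts(const long *counts, int num_classes)
{
    long total = 0;
    double sum_sq = 0.0;
    int c;

    assert(num_classes > 0);
    for (c = 0; c < num_classes; c++) {
        assert(counts[c] >= 0);
        total += counts[c];
    }
    if (total == 0) return 0.0; /* no examples seen */

    for (c = 0; c < num_classes; c++) {
        double p = (double)counts[c] / (double)total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}
```

A pure set scores 0, and k equally frequent classes score 1 - 1/k.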
|
|
Frees the memory being used by the attribute and stops tracking it. This is useful if you decide that some attribute will not be used (perhaps using some statistical tests) and would like to use the memory elsewhere. |
|
Stop monitoring some thresholds. Stop monitoring every threshold with an entropy worse than the specified value. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute). Should only be called for continuous attributes. |
|
Stop monitoring some thresholds. Stop monitoring every threshold with a Gini index worse than the specified value. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute). Should only be called for continuous attributes. |
|
Tests if the attribute is active. Returns 1 if the attribute was active in the initial attribute tracker and has not been ignored by a call to ExampleGroupStatsIgnoreAttribute since then. |
|
Returns 1 if all the examples shown to the structure have the same class.
|
|
Reduce the number of thresholds being considered if above the max. If the attribute is monitoring more than 'maxSplits' split thresholds this function will find the best 'pruneDownTo' based on entropy and start ignoring all the rest. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute). Returns the number of thresholds that were pruned. Should only be called for continuous attributes. |
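The maxSplits / pruneDownTo behavior can be sketched on a plain array of candidate thresholds. Everything here is hypothetical: the `Candidate` type, the `limit_splits` helper, and the choice to sort the array are illustration only, and the sketch ignores the interpolation bookkeeping mentioned above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float thresh; float entropy; } Candidate; /* hypothetical */

/* Order candidates from lowest (best) to highest (worst) entropy. */
static int by_entropy(const void *a, const void *b)
{
    float x = ((const Candidate *)a)->entropy;
    float y = ((const Candidate *)b)->entropy;
    return (x > y) - (x < y);
}

/* If more than max_splits candidates are tracked, keep only the
 * prune_down_to candidates with the lowest entropy. Updates *count
 * and returns the number of candidates that were pruned. */
int limit_splits(Candidate *cands, int *count, int max_splits, int prune_down_to)
{
    int pruned;

    assert(prune_down_to >= 0 && prune_down_to <= max_splits);
    if (*count <= max_splits) return 0; /* under the limit: nothing to do */

    qsort(cands, (size_t)*count, sizeof *cands, by_entropy);
    pruned = *count - prune_down_to;
    *count = prune_down_to; /* entries past this point are no longer tracked */
    return pruned;
}
```

Pruning to a number below maxSplits, rather than exactly to it, avoids having to prune again as soon as the next threshold is added.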
|
Creates a structure to track sufficient statistics. Creates a structure to track the statistics needed to calculate several common machine learning metrics for the attributes that are active in the AttributeTracker. This function takes over the memory for the AttributeTracker and will free it when ExampleGroupStatsFree is called. For categorical attributes this uses memory proportional to the number of classes times the number of values of the attribute. For continuous attributes this uses constant memory at first, but as examples are added with ExampleGroupStatsAddExample the memory grows proportionally with the number of unique values of the attribute. |
|
Number of examples being tracked by the structure. Outputs the number of examples added to the structure with ExampleGroupStatsAddExample since the last call to ExampleGroupStatsReactivate. |
|
Returns the number of thresholds that are being monitored for the specified attribute. Should only be called for continuous attributes. |
|
Reallocates the memory that is freed by a call to ExampleGroupStatsDeactivate.
|
|
Stop adding new split thresholds, but continue to use future examples to evaluate the existing ones. Should only be called for continuous attributes. |
|
A debugging function that prints a representation of the stats structure to the specified file.
|