Machine Learning

Purity and Impurity

A set of labels is pure if all the labels in that set belong to the same class.
A set of labels is impure if the labels in that set belong to two or more different classes.
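
The two definitions above can be checked in a few lines of Python. This is a minimal sketch, not a standard library function: the name `is_pure` is made up for illustration, and treating an empty set as not pure is an assumption, since the definitions above do not cover that case.

```python
def is_pure(labels):
    """Return True if every label belongs to the same class.

    Assumption: an empty collection is treated as not pure.
    """
    # A set keeps only the distinct classes, so a pure collection
    # collapses to exactly one element.
    return len(set(labels)) == 1


print(is_pure(["cat", "cat", "cat"]))  # True: one class only
print(is_pure(["cat", "dog", "cat"]))  # False: two classes present
```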

Gini Impurity

Gini impurity is a measure of how impure a set is. It is the probability that a randomly chosen data point would be assigned the wrong class if it were labeled at random according to the class distribution of the dataset.
In simple words, say we have a dataset with two classes, each making up 50% of the data. If we pick a random point and assign it a label at random according to that same 50/50 distribution, there is a 50% chance that the point is misclassified. So the Gini impurity is 0.5.

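The formula for Gini impurity is
Gini Impurity = 1 - (p1^2 + p2^2 + ... + pn^2)
where p1, p2, ..., pn are the proportions of the n classes in the set, so they sum to 1. A pure set has a Gini impurity of 0, and the maximum value, 1 - 1/n, is reached when all n classes are equally likely.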
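
To make the formula concrete, here is a minimal sketch that computes it from a list of labels. The function name `gini_impurity` is illustrative, and raising an error on an empty list is an assumption, since the formula is undefined when there are no labels.

```python
from collections import Counter


def gini_impurity(labels):
    """Compute 1 - (p1^2 + p2^2 + ... + pn^2) for a list of class labels."""
    n = len(labels)
    if n == 0:
        # Assumption: with no labels there are no proportions to compute.
        raise ValueError("labels must be non-empty")
    # Count how often each class occurs, then turn counts into proportions.
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["a", "a", "b", "b"]))  # 0.5, the two-class 50/50 case
print(gini_impurity(["a", "a", "a"]))       # 0.0, a pure set
```

The first call reproduces the 50/50 two-class case described earlier, and the second shows that a pure set has zero impurity.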