Probably Approximately Correct (PAC)
Table of Contents

1. The learning problem
2. Finite hypothesis classes are PAC learnable
3. Agnostic learning

1. The learning problem
PAC stands for “probably approximately correct”. In machine learning we want to find a hypothesis that is as close as possible to the ground truth. Since we only have access to a sample from the true distribution, the hypothesis that one builds is itself a function of the sample data, and is therefore a random variable. The question we want to answer is whether the error incurred by a particular hypothesis on the sample is approximately the same as its error under the true distribution, within a certain confidence level.
Suppose we have a binary classification problem (the same applies for multi-class) with classes $y \in \{0,1\}$, and we are given a training dataset $S=\{(x_i,y_i)\}_{i=1}^{m}$ with $m$ data-points. Each data-point is characterised by $d$ features, and represented as a vector $x \in \mathcal{X} \subseteq \mathbb{R}^d$. We want to find a map between these features and the corresponding class $y$:

$$h: \mathcal{X} \to \{0,1\}.$$
This map, however, does not always exist. There are problems for which we can only determine the class up to a certain confidence level. In this case we say that the learning problem is agnostic, while when the map exists we say that the problem is realisable. For example, image recognition is an agnostic problem.
Let us assume for the moment that such a map exists, that is, we are in the realisable case. The learner chooses a set of hypotheses $\mathcal{H}$, and by doing this it introduces bias into the problem: a different learner may choose a different hypothesis set. Then, in order to find the hypothesis that most accurately represents the data, the learner chooses the one that has the smallest empirical risk, that is, the error on the training set. In other words, one tries to find the minimum of the sample loss function

$$L_S(h) = \frac{1}{m}\sum_{i=1}^{m}\left(1-\delta_{h(x_i),\,y_i}\right),$$
with $\delta_{a,b}$ the Kronecker delta function. Denote the solution of this optimisation problem as $h_S$. The true or generalisation error is defined instead as the unbiased average

$$L_{\mathcal{D}}(h) = \mathbb{E}_{x\sim\mathcal{D}}\left[1-\delta_{h(x),\,f(x)}\right],$$
where $\mathcal{D}$ is the real distribution and $f$ the true labelling map. In the case of classification, the generalisation error is also the probability of misclassifying a point:

$$L_{\mathcal{D}}(h) = \mathbb{P}_{x\sim\mathcal{D}}\left[h(x) \neq f(x)\right].$$
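As a concrete illustration, here is a minimal sketch in Python (NumPy) of the empirical 0-1 loss defined above; the function name and the toy arrays are hypothetical, chosen only for this example.

```python
import numpy as np

def empirical_loss(h, X, y):
    """Empirical 0-1 risk L_S(h): fraction of training points misclassified by h."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy usage with a hypothesis that always predicts class 0.
X_train = np.array([[0.2, 0.1], [1.5, -0.3], [0.7, 0.9]])
y_train = np.array([0, 1, 0])
print(empirical_loss(lambda x: 0, X_train, y_train))  # 0.333..., one of three points misclassified
```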
If we choose $\mathcal{H}$ appropriately we may find $L_S(h_S) = 0$. This can happen, for example, by memorising the data. In this case, we say that the hypothesis is overfitting the data. Although memorising the data results in zero empirical error, the solution is not very instructive because it does not tell us how well it will perform on unseen data.
The overfitting solution performs very well on the training data because the learner used prior knowledge to choose a hypothesis set with sufficient capacity (or complexity) to accommodate the entire dataset. In the above minimisation problem, one should find a solution that does well (small error) on a large number of samples rather than having a very small error on the given sample. Overfitting solutions should be avoided as they can lead to misleading conclusions. Instead, the learner should aim at obtaining a training error that is comparable to the error obtained with different samples.
To make things practical, consider the problem of classifying points on a 2D plane as red or blue. The decision boundary is a circumference of radius $R$ concentric with the origin of the plane, which colours points that are inside as red and outside as blue. See the figure below. The training dataset consists of $m$ data-points sampled independently and identically distributed (i.i.d.) from a distribution $\mathcal{D}$.

Here the circumference of radius $R$ denotes the ground truth, which classifies points as red or blue depending on whether they are inside or outside of the circle, respectively.
The learning problem is to find a hypothesis that has small error on unseen data.
Assume that the learner has prior knowledge of the ground truth (realisability assumption); that is, the learner knows that the best hypothesis is a circumference but does not know its radius. One of the simplest algorithms is to consider the set of concentric circumferences and minimise the empirical risk. This can be achieved by drawing a decision boundary that is as close as possible to the outermost red (or innermost blue) data-points. This guarantees that when the sample has an infinite number of points, that is $m \to \infty$, we recover the exact decision boundary: the circumference of radius $R$. The empirical risk minimisation problem gives the solution represented in the figure below by the circumference of radius $R_S$. However, newly generated data-points may lie in between $R_S$ and $R$, and would therefore be misclassified.

a) The hypothesis is a circumference of radius $R_S$ concentric with the origin, determined by the outermost red data-point. This ensures that the entire training set is correctly classified. b) The circumference of radius $R_\epsilon$ corresponds to a hypothesis that has generalisation error $\epsilon$.
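For this hypothesis class, empirical risk minimisation reduces to taking the largest radius among the red training points. The sketch below is one possible implementation; the data-generating distribution (uniform on a square) and all names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R_true = 1.0   # ground-truth radius, assumed for this toy example
m = 50         # number of training points

# Sample m points i.i.d. from a uniform distribution on [-2, 2]^2 and label them
# red (inside the circle, class 1) or blue (outside, class 0).
X = rng.uniform(-2.0, 2.0, size=(m, 2))
y = (np.linalg.norm(X, axis=1) <= R_true).astype(int)

# ERM over concentric circumferences: the radius of the outermost red point
# classifies the whole training set correctly and has the smallest empirical risk.
red_radii = np.linalg.norm(X[y == 1], axis=1)
R_S = red_radii.max() if red_radii.size > 0 else 0.0

print(f"R_S = {R_S:.3f} <= R = {R_true}")  # R_S never exceeds the true radius
```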
Given that this is an overfitting solution, one has to be careful about how well it generalises. It is possible that the generalisation error is small for such a solution, but one has to be confident about how common this situation is. If the sample that led to that solution is a rare event, then we should not trust its predictions, and we should expect a large generalisation error. Therefore we are interested in bounding the probability of making a bad prediction, that is,

$$\mathbb{P}\left[L_{\mathcal{D}}(h_S) > \epsilon\right] \le \delta. \tag{1}$$
Conversely, this tells us with confidence of at least $1-\delta$ that

$$L_{\mathcal{D}}(h_S) \le \epsilon. \tag{2}$$
A PAC learnable hypothesis class is one for which a bound of the form of Eq. 1 can be established for arbitrary $\epsilon, \delta \in (0,1)$, given a sufficiently large sample.
In the case of the circumference example, define $R_\epsilon \le R$ as the radius for which $\mathbb{P}\left[R_\epsilon \le \lVert x \rVert \le R\right] = \epsilon$, with $h_\epsilon$ the corresponding hypothesis. Any hypothesis corresponding to a radius smaller than $R_\epsilon$ leads to a generalisation error larger than $\epsilon$. The probability of sampling a point that falls in the region between $R_\epsilon$ and $R$ is precisely $\epsilon$, and conversely the probability of falling outside that region is $1-\epsilon$. Since $L_{\mathcal{D}}(h_S) > \epsilon$ requires all $m$ training points to fall outside that region, it is easy to see that the probability that we need equals

$$\mathbb{P}\left[L_{\mathcal{D}}(h_S) > \epsilon\right] = \mathbb{P}\left[R_S < R_\epsilon\right] = (1-\epsilon)^m.$$
Using the bound $(1-\epsilon)^m \le e^{-\epsilon m}$ we can choose $\delta = e^{-\epsilon m}$, and thus equivalently $\epsilon = \frac{1}{m}\ln\frac{1}{\delta}$. Hence, using Eq. 2, we have

$$L_{\mathcal{D}}(h_S) \le \frac{1}{m}\ln\frac{1}{\delta}$$
with probability of at least $1-\delta$.
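This bound can be checked numerically. The following sketch estimates $\mathbb{P}[L_{\mathcal{D}}(h_S) > \epsilon]$ for the circle example by Monte Carlo and compares it with $e^{-\epsilon m}$; the uniform sampling distribution on $[-2,2]^2$ and all names are assumptions made purely for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
R_true, m, eps = 1.0, 100, 0.05
n_trials = 2000

def generalisation_error(R_S):
    # For x uniform on [-2, 2]^2, the error of the hypothesis with radius R_S is
    # the probability mass of the annulus R_S < ||x|| <= R_true.
    return np.pi * (R_true**2 - R_S**2) / 16.0

failures = 0
for _ in range(n_trials):
    X = rng.uniform(-2.0, 2.0, size=(m, 2))
    radii = np.linalg.norm(X, axis=1)
    red = radii[radii <= R_true]              # points labelled red by the ground truth
    R_S = red.max() if red.size > 0 else 0.0  # ERM solution
    failures += generalisation_error(R_S) > eps

print(f"estimated P[L_D(h_S) > eps] = {failures / n_trials:.4f}")
print(f"bound exp(-eps * m)         = {np.exp(-eps * m):.4f}")
```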
2. Finite hypothesis classes are PAC learnable
Let us assume that we have a finite hypothesis class with $N$ hypotheses, that is, $|\mathcal{H}| = N$, and that this class is realisable, meaning that it contains a hypothesis $h^*$ for which $L_{\mathcal{D}}(h^*) = 0$. We want to upper bound the generalisation error of a hypothesis obtained using empirical risk minimisation, that is, we want to find a bound of the form

$$\mathbb{P}\left[L_{\mathcal{D}}(h_S) > \epsilon\right] \le \delta. \tag{3}$$
Define $\mathcal{H}_B$ as the set of hypotheses that have generalisation error larger than $\epsilon$ (such a hypothesis does not necessarily minimise the empirical risk). We call this the set of bad hypotheses:

$$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D}}(h) > \epsilon\}.$$
Similarly, one can define the set of misleading training sets $M$ as those that lead to a bad hypothesis with zero empirical error. That is,

$$M = \{S : \exists\, h \in \mathcal{H}_B,\; L_S(h) = 0\}.$$
Since we assume the class is realisable, the hypothesis $h_S$ in Eq. 3 must have $L_S(h_S) = 0$; therefore, if it also has $L_{\mathcal{D}}(h_S) > \epsilon$, the sample data is a misleading dataset. So we need the probability of sampling a misleading dataset, $\mathbb{P}\left[S \in M\right]$. Using

$$M = \bigcup_{h \in \mathcal{H}_B} \{S : L_S(h) = 0\}$$
and the union bound $\mathbb{P}\left[A \cup B\right] \le \mathbb{P}\left[A\right] + \mathbb{P}\left[B\right]$, we have

$$\mathbb{P}\left[S \in M\right] \le \sum_{h \in \mathcal{H}_B} \mathbb{P}\left[L_S(h) = 0\right].$$
Now, for each $h \in \mathcal{H}_B$ we can put a bound on $\mathbb{P}\left[L_S(h) = 0\right]$. Since $L_{\mathcal{D}}(h) > \epsilon$, the probability of misclassifying a data-point is larger than $\epsilon$, and conversely a point will be correctly classified with probability smaller than $1-\epsilon$. Therefore, as the event $L_S(h) = 0$ requires all $m$ i.i.d. points to be correctly classified, we have

$$\mathbb{P}\left[L_S(h) = 0\right] \le (1-\epsilon)^m \le e^{-\epsilon m}.$$
The final bound becomes

$$\mathbb{P}\left[S \in M\right] \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}.$$
Setting $\delta = |\mathcal{H}|\, e^{-\epsilon m}$, we have with a probability of at least $1-\delta$ that

$$L_{\mathcal{D}}(h_S) \le \frac{1}{m}\ln\frac{|\mathcal{H}|}{\delta}.$$
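Read as a sample-complexity statement, this says that $m \ge \frac{1}{\epsilon}\ln\frac{|\mathcal{H}|}{\delta}$ samples suffice. A small sketch of this calculation (the numbers are arbitrary, chosen only to show the scaling):

```python
import math

def sample_complexity(H_size, eps, delta):
    """Samples sufficient for a realisable finite class of size H_size to reach
    generalisation error <= eps with probability >= 1 - delta."""
    return math.ceil(math.log(H_size / delta) / eps)

for H_size in (10, 1_000, 1_000_000):
    print(H_size, sample_complexity(H_size, eps=0.05, delta=0.01))
# The required m grows only logarithmically with the size of the hypothesis class.
```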
3. Agnostic learning
In agnostic learning we no longer have an exact mapping between the features and the classes. Instead, the classes themselves are sampled from a probability distribution given the features, that is, we have $\mathbb{P}\left[y \mid x\right]$. In the realisable case this probability is always either zero or one. Given this, we extend the distribution to both the features and the classes, so we have $(x, y) \sim \mathcal{D}$.
The definition of the generalisation error is slightly changed to

$$L_{\mathcal{D}}(h) = \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[h(x) \neq y\right].$$
Because we no longer have the realisability condition, showing that a problem is PAC learnable is a bit more complicated. For this purpose we use one of the most useful inequalities in statistics:
Hoeffding’s Inequality:

$$\mathbb{P}\left[\,\left|\bar{\theta}_m - \mathbb{E}[\theta]\right| > \epsilon\,\right] \le 2\exp\left(-\frac{2 m \epsilon^2}{(b-a)^2}\right)$$

for a random variable $\theta$ with $a \le \theta \le b$ and any distribution. Here $\bar{\theta}_m = \frac{1}{m}\sum_{i=1}^{m}\theta_i$ is the sample mean and $\mathbb{E}[\theta]$ is the distribution average. We can apply this property to the empirical loss and the generalisation loss. Since they are quantities between zero and one (they are probabilities), we have

$$\mathbb{P}\left[\,\left|L_S(h) - L_{\mathcal{D}}(h)\right| > \epsilon\,\right] \le 2 e^{-2 m \epsilon^2}.$$
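Hoeffding’s inequality is easy to check numerically for a bounded random variable. The sketch below uses a Bernoulli variable with mean $p$; the parameter values are arbitrary and only meant to show that the empirical deviation probability stays below the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, eps = 0.3, 200, 0.05   # Bernoulli mean, sample size, deviation threshold
n_trials = 20_000

# Fraction of trials whose sample mean deviates from p by more than eps.
samples = rng.binomial(1, p, size=(n_trials, m))
deviations = np.abs(samples.mean(axis=1) - p)

print(f"empirical P[|mean - p| > eps]     = {(deviations > eps).mean():.4f}")
print(f"Hoeffding bound 2 exp(-2 m eps^2) = {2 * np.exp(-2 * m * eps**2):.4f}")
```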
We are interested in the probability of sampling a training set which gives a misleading prediction for some hypothesis in the class. So we want to bound, using the union bound over $\mathcal{H}$,

$$\mathbb{P}\left[\exists\, h \in \mathcal{H} : \left|L_S(h) - L_{\mathcal{D}}(h)\right| > \epsilon\right] \le \sum_{h \in \mathcal{H}} \mathbb{P}\left[\,\left|L_S(h) - L_{\mathcal{D}}(h)\right| > \epsilon\,\right],$$

and thus using Hoeffding’s inequality we have

$$\mathbb{P}\left[\exists\, h \in \mathcal{H} : \left|L_S(h) - L_{\mathcal{D}}(h)\right| > \epsilon\right] \le 2\,|\mathcal{H}|\, e^{-2 m \epsilon^2}.$$

We set $\delta = 2\,|\mathcal{H}|\, e^{-2 m \epsilon^2}$, and conclude that with probability of at least $1-\delta$, for every $h \in \mathcal{H}$,

$$\left|L_S(h) - L_{\mathcal{D}}(h)\right| \le \sqrt{\frac{\ln\left(2|\mathcal{H}|/\delta\right)}{2m}}.$$
Say that we take $h = h_S$, the solution we obtain after minimising the empirical loss; then

$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \sqrt{\frac{\ln\left(2|\mathcal{H}|/\delta\right)}{2m}}. \tag{4}$$
This equation demonstrates clearly the trouble with overfitting. To memorise the data we need to use hypothesis classes with large cardinality, so that the solution has enough capacity to accommodate each data-point. This makes the second term on the r.h.s. of the inequality Eq. 4 very large, loosening the bound on the generalisation error instead of making it tighter. The point is that we should minimise the empirical error together with that term, so as to make the bound on the true error smaller. This leads us to the idea of regularisation in machine learning, whereby the empirical loss is endowed with correction terms that penalise highly complex solutions.
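To see the trade-off concretely, the sketch below evaluates the right-hand side of Eq. 4 for a few (training error, class size) pairs; the numbers are invented purely for illustration and do not come from a real experiment.

```python
import math

def generalisation_bound(train_error, H_size, m, delta=0.05):
    """Right-hand side of Eq. 4: empirical error plus the complexity term."""
    return train_error + math.sqrt(math.log(2 * H_size / delta) / (2 * m))

m = 100
# Hypothetical trade-off: richer classes fit the training data better (lower
# empirical error) but pay a larger complexity penalty in the bound.
for train_error, H_size in [(0.20, 10), (0.10, 10**4), (0.00, 10**12)]:
    print(f"L_S = {train_error:.2f}, |H| = {H_size:.0e} "
          f"-> bound = {generalisation_bound(train_error, H_size, m):.3f}")
# With these numbers the middle class gives the tightest bound: memorising the
# data with a huge class makes the complexity term dominate.
```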