Given a set of input data points \( \{x_1,\dots,x_m\} \) associated with a set of output data \(\{y_1,\dots,y_m\}\), the goal is to train a model that is able to predict \(y\) from \(x\).
The problem of supervised learning can be formulated as follows: Given an unknown function $$g: \mathcal{X}\rightarrow \mathcal{Y}$$ known as the ground truth, which maps input instances \(\boldsymbol{x} \in \mathcal{X}\) to output labels \(y \in \mathcal{Y}\), along with training data \(\mathbf{D} = \{(\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_n, y_n)\}\), the goal is to find a function $$h:\mathcal{X}\rightarrow\mathcal{Y}$$ that approximates the correct mapping \(g\) as closely as possible. In decision theory, "approximates as closely as possible" is made precise by specifying a loss function that assigns a cost to producing an incorrect label; the goal is then to minimize the expected loss. The distribution over \(\mathcal{X}\) and the ground truth function \(g:\mathcal{X}\rightarrow\mathcal{Y}\) are unknown and can only be estimated empirically from a large number of samples of \(\mathcal{X}\) together with their corresponding labels in \(\mathcal{Y}\).

The type of label being predicted determines the appropriate loss function. For binary classification, for example, a simple zero-one loss function is sufficient: it assigns a loss of 1 to any incorrect label and 0 otherwise. Minimizing this loss over a set of test data is therefore equivalent to maximizing the accuracy of the classification procedure.
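As a concrete illustration, the following is a minimal sketch (in Python, with invented labels) of the zero-one loss and the resulting empirical risk over a test set; since the risk is simply one minus the accuracy, minimizing the former maximizes the latter.

```python
# Zero-one loss and empirical risk over a small test set.
# The labels below are invented purely for illustration.

def zero_one_loss(y_true, y_pred):
    """Assign a loss of 1 to each incorrect label, 0 otherwise."""
    return [0 if t == p else 1 for t, p in zip(y_true, y_pred)]

def empirical_risk(y_true, y_pred):
    """Average zero-one loss over the test set."""
    losses = zero_one_loss(y_true, y_pred)
    return sum(losses) / len(losses)

y_true = [1, 0, 1, 1, 0]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0]   # labels produced by some classifier h

risk = empirical_risk(y_true, y_pred)   # 0.2
accuracy = 1.0 - risk                   # 0.8: minimizing zero-one loss maximizes accuracy
print(risk, accuracy)
```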
From a probabilistic point of view, the supervised learning problem is to estimate a function \(f\) that gives the probability of each possible output label \(y\) given a particular input instance \(\boldsymbol{x}\): $$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = f\left(\boldsymbol{x};\boldsymbol{\theta}\right)$$
where \(\boldsymbol{x}\) is the feature-vector input and the function \(f\) is typically parametrized by some parameters \(\boldsymbol{\theta}\). There are two approaches to the problem:
- Discriminative approach, where the function \(f\) is estimated directly;
- Generative approach, where the inverse probability \(p(\boldsymbol{x}|{\rm y})\) is estimated and combined with the prior probability \(p({\rm y}|\boldsymbol\theta)\) using Bayes' rule, as follows: $$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = \frac{p(\boldsymbol{x}|{\rm y})\, p({\rm y}|\boldsymbol\theta)}{\sum_{y' \in \mathcal{Y}} p(\boldsymbol{x}|y')\, p(y'|\boldsymbol\theta)}$$
In the case that the labels are continuously distributed[^1], the denominator becomes an integral:
$$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = \frac{p(\boldsymbol{x}|{\rm y})\, p({\rm y}|\boldsymbol\theta)}{\int_{\mathcal{Y}} p(\boldsymbol{x}|y')\, p(y'|\boldsymbol\theta)\, \mathrm{d}y'}$$
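To make Bayes' rule concrete, below is a minimal sketch of the generative approach for discrete labels. The one-dimensional Gaussian model for \(p(x|{\rm y})\) and the toy data are assumptions made purely for illustration, not something prescribed by the formulation above.

```python
# Minimal generative classifier: model p(x|y) as a 1-D Gaussian per class,
# p(y) as the class frequency, and obtain the posterior p(y|x) via Bayes' rule,
# normalizing by the sum over all labels in the denominator.
# All numbers here are toy data for illustration.
import math
from collections import defaultdict

def fit(xs, ys):
    """Estimate per-class mean, variance and prior from training pairs."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    params = {}
    for y, vals in groups.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals) or 1e-6  # avoid zero variance
        params[y] = (mu, var, len(vals) / len(xs))  # (mean, variance, prior p(y))
    return params

def gaussian(x, mu, var):
    """Class-conditional density p(x|y) under the Gaussian assumption."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, params):
    """p(y|x) = p(x|y) p(y) / sum over y' of p(x|y') p(y')."""
    joint = {y: gaussian(x, mu, var) * prior for y, (mu, var, prior) in params.items()}
    z = sum(joint.values())  # the denominator of Bayes' rule
    return {y: v / z for y, v in joint.items()}

params = fit([1.0, 1.2, 0.9, 3.1, 2.8, 3.3], [0, 0, 0, 1, 1, 1])
print(posterior(2.0, params))  # posterior probabilities for labels 0 and 1
```

This is essentially the idea behind the Naive Bayes classifier listed below, restricted to a single feature.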
The following supervised learning methods are among the most commonly applied machine learning methods; a minimal sketch of the nearest-neighbor method is given after the list:
- Naive Bayes Classifier
- Maximum Entropy
- Decision Tree
- Linear Regression
- Support Vector Machines (SVM)
- Neural Networks
- Nearest Neighbor
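As a concrete instance of one method from this list, here is a minimal sketch of a 1-nearest-neighbor classifier using Euclidean distance; the training points and labels are invented for illustration.

```python
# 1-nearest-neighbor classification: label a query point with the label
# of its closest training point under Euclidean distance.
# The training data below are toy values for illustration.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to the query."""
    distances = [euclidean(x, query) for x in train_x]
    nearest = min(range(len(train_x)), key=lambda i: distances[i])
    return train_y[nearest]

train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = ["a", "a", "b", "b"]
print(predict_1nn(train_x, train_y, (0.8, 0.9)))  # -> "b"
```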