Maximum Entropy modeling has been successfully applied to many fields, such as Computer Vision, Spatial Physics and Natural Language Processing, and Maximum Entropy models can learn from many heterogeneous information sources. Maximum entropy classifiers are considered alternatives to Naive Bayes classifiers because they do not require the naive assumption of statistically independent features. However, training them is significantly slower than training Naive Bayes classifiers, so they may not be appropriate when the number of classes is very large.

A typical binary maximum entropy classifier can be implemented using logistic regression. Logistic regression is built around the logistic function:

$$ L_f(z) = \frac{1}{1 + e^{-z} } $$

where the variable \(z\) is a linear combination of the model parameters and the feature values:

$$z = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$$

Here \(\beta_0, \ldots, \beta_m\) are the parameters learned during training and \(x_1, \ldots, x_m\) are the feature values of the input.
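As a minimal sketch, the logistic function applied to such a weighted sum can be written as follows; the weight and feature values here are purely illustrative, not taken from any trained model:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    """Linear combination z = b + w1*x1 + ... + wm*xm."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Illustrative two-feature input with made-up weights.
z = score(weights=[0.5, -1.2], bias=0.1, features=[2.0, 1.0])
p = logistic(z)  # probability of the positive class, in (0, 1)
```

A probability above 0.5 (i.e. \(z > 0\)) would be read as predicting the positive class.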

Generally, maximum entropy classifiers are implemented using the principle of maximum entropy. Suppose we have some testable label \(L\) about a quantity \(x\) taking values in \(\{x_1, x_2, \ldots, x_n\}\), where \(x_i\) is a feature vector of size \(m\). This label can be expressed by \(m\) constraints on the expectations of the feature values; that is, the probability distribution must satisfy the following, where \(f_k(x_i)\) returns the \(k\)th element of the feature vector \(x_i\):

$$\sum_{i=1}^n p(x_i|L)f_k(x_i) = C_k \qquad k = 1, \ldots,m.$$

In addition, the probabilities must sum to one.

$$\sum_{i=1}^n p(x_i|L) = 1.$$

The probability distribution with maximum information entropy subject to the above constraints is:

$$p(x_i|L) = \frac{e^{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)}}{\sum_{j=1}^n e^{\lambda_1 f_1(x_j) + \cdots + \lambda_m f_m(x_j)}}$$
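This distribution is a softmax over the weighted feature sums. A minimal sketch of computing it, given fixed \(\lambda_k\) values and the feature vectors (both illustrative here), might look like:

```python
import math

def maxent_distribution(lambdas, feature_vectors):
    """p(x_i|L) proportional to exp(sum_k lambda_k * f_k(x_i)),
    normalized over all x_i (a softmax over weighted feature sums)."""
    scores = [math.exp(sum(l * f for l, f in zip(lambdas, fv)))
              for fv in feature_vectors]
    total = sum(scores)
    return [s / total for s in scores]

# Example: with all lambdas zero, every x_i gets equal probability.
uniform = maxent_distribution([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Note that scaling all feature sums by adding a constant leaves the distribution unchanged, since the constant cancels in the normalization.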

The \(\lambda_k\) parameters are Lagrange multipliers whose particular values are determined by the constraints according to

$$\sum_{i=1}^n \frac{e^{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)}}{\sum_{j=1}^n e^{\lambda_1 f_1(x_j) + \cdots + \lambda_m f_m(x_j)}} f_k(x_i) = C_k \qquad k = 1, \ldots, m.$$

The above equation represents a system of \(m\) simultaneous equations in the \(\lambda_k\), which is usually solved by numerical methods.
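One simple numerical approach is to adjust each \(\lambda_k\) in proportion to how far the expected feature value is from its target \(C_k\), which is plain gradient ascent on the dual. The sketch below is illustrative only; practical implementations use more sophisticated optimizers such as iterative scaling or L-BFGS:

```python
import math

def fit_lambdas(feature_vectors, targets, steps=5000, lr=0.5):
    """Solve the m constraint equations E_p[f_k] = C_k by simple
    gradient ascent: lambda_k moves toward satisfying its constraint.
    Illustrative sketch, not a production optimizer."""
    m = len(targets)
    lambdas = [0.0] * m
    for _ in range(steps):
        # Distribution p(x_i|L) under the current lambdas.
        scores = [math.exp(sum(l * f for l, f in zip(lambdas, fv)))
                  for fv in feature_vectors]
        total = sum(scores)
        probs = [s / total for s in scores]
        # Expected feature values E_p[f_k] under that distribution.
        expected = [sum(p * fv[k] for p, fv in zip(probs, feature_vectors))
                    for k in range(m)]
        # Step each lambda_k toward its constraint C_k.
        lambdas = [l + lr * (c - e)
                   for l, c, e in zip(lambdas, targets, expected)]
    return lambdas
```

For example, with two outcomes whose single feature is 0 and 1 respectively, requiring an expected feature value of 0.7 drives the fitted distribution to assign probability 0.7 to the second outcome.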