User:Akela1101/Supervised Learning Algorithms

Gradient Descent
An iterative method for approximately minimizing $$J(\bar{w})$$.

Let $$\bar{w}^0 = 0$$ and let α (the learning rate) be a small positive number. Then, given a precision ε, repeat while $$J(\bar{w}) > \epsilon$$:

$$\bar{w} := \bar{w} - \alpha \nabla J$$

The gradient $$\nabla J$$ points in the direction of steepest growth of J, so stepping against it decreases J.

$$w_i := w_i - \alpha {\partial J\over\partial w_i} = w_i - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(h(\bar{x}^j) - y^j\biggr) x_i^j = w_i - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(w_0 + w_1 x_1^j + .. + w_n x_n^j - y^j\biggr) x_i^j$$, for each $$i \in 0 .. n$$

$$\triangleleft$$

For n = 1 there are only two updates per step (both sums are computed before either weight changes):

$$w_0 := w_0 - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(w_0 + w_1 x_1^j - y^j\biggr)$$

$$w_1 := w_1 - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(w_0 + w_1 x_1^j - y^j\biggr) x_1^j$$

$$\triangleright$$
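For concreteness, the n = 1 steps above can be sketched in Python; the data and α = 0.1 are illustrative assumptions:

```python
# Toy data generated by y = 1 + 2x; gradient descent should recover
# w0 ≈ 1, w1 ≈ 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
m = len(x)
w0, w1 = 0.0, 0.0
alpha = 0.1

for step in range(1000):
    # Compute both sums first, so w0 and w1 are updated simultaneously.
    err = [w0 + w1 * x[j] - y[j] for j in range(m)]
    grad0 = sum(err) / m
    grad1 = sum(err[j] * x[j] for j in range(m)) / m
    w0 -= alpha * grad0
    w1 -= alpha * grad1
```

Note that computing both gradients before applying either update matters: updating w0 first would change the errors used for w1.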

Logistic Regression (Classification)
As usual, a set of m samples $$\{\bar{x}^j \rightarrow y^j\}$$, where $$\bar{x}^j$$ is a vector $$\{x_1, x_2,.. x_n\}^j$$ and $$j \in 1 .. m$$.

We need to classify whether y is 0 or 1.

$$\Box$$

Logistic function $$g(z) = \frac{1}{1 + e^{-z}} \in (0, 1)$$

Let $$h(\bar{x}) = g(\bar{w} \cdot \bar{x}) = \frac{1}{1 + e^{-\bar{w} \cdot \bar{x}}}$$, where $$\bar{x} = \{ 1, x_1, x_2, .. x_n \}$$ and $$\bar{w} = \{ w_0, w_1, w_2, .. w_n \}$$

If h < 0.5, prediction is 0, and if h ≥ 0.5, prediction is 1.

$$J(\bar{w}) = -\frac{1}{m}\sum_{j = 1}^m \biggl(y^j \ln{(h(x^j))} + (1 - y^j) \ln{(1- h(x^j))}\biggr) \xrightarrow[w]{} min$$

The expression $$-\bigl(y^j \ln{(h(x^j))} + (1 - y^j) \ln{(1- h(x^j))}\bigr)$$ is called the cross-entropy.

$$\blacksquare$$

Gradient descent looks the same as in linear regression:

$$w_i := w_i - \alpha {\partial J\over\partial w_i} = w_i - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(h(\bar{x}^j) - y^j\biggr) x_i^j = w_i - \frac{\alpha}{m} \sum_{j = 1}^m \biggl(\frac{1}{1 + e^{-\bar{w} \cdot \bar{x}^j}} - y^j\biggr) x_i^j$$
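A minimal sketch of this update on a toy one-feature dataset; the data, α and step count are illustrative assumptions, and x includes the bias term $$x_0 = 1$$:

```python
import math

# Two samples of each class, separable around x_1 = 2.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
m = len(xs)
w = [0.0, 0.0]
alpha = 0.5

def h(w, x):
    # Logistic hypothesis g(w · x).
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

for step in range(2000):
    # Batch gradient for each component w_i.
    grads = [sum((h(w, xs[j]) - ys[j]) * xs[j][i] for j in range(m)) / m
             for i in range(2)]
    w = [w[i] - alpha * grads[i] for i in range(2)]
```

After training, the decision boundary h = 0.5 sits between the two classes.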

Regularization
Is a technique that helps avoid overfitting.

$$J(\bar{w}) = \frac{1}{2m}\sum_{j = 1}^m \biggl(h(x^j) - y^j\biggr)^2 + \frac{\lambda}{2m}\sum_{i = 1}^n w_i^2 \xrightarrow[w]{} min$$

The additional term with λ shrinks the coefficients in $$\bar{w}$$, making the model fit the training data less precisely but generalize better to new data.
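The shrinking effect can be sketched by fitting the same toy data with and without the penalty; the data and hyperparameters are illustrative, and the bias $$w_0$$ is left unpenalized (a common convention):

```python
# Toy data generated by y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
m = len(x)
alpha = 0.1

def fit(lam):
    # Regularized gradient descent: only the (lam / m) * w1 term is new.
    w0, w1 = 0.0, 0.0
    for step in range(2000):
        err = [w0 + w1 * x[j] - y[j] for j in range(m)]
        g0 = sum(err) / m
        g1 = sum(err[j] * x[j] for j in range(m)) / m + (lam / m) * w1
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
    return w0, w1
```

With lam = 0 the slope comes out near the true value 2; with a large lam it is pulled toward zero.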

Multi-value classification (One vs All)
Logistic Regression can also be used when y takes more than 2 values: $$y \in 1 .. r$$.

$$\Box$$

Let $$y_k := (y = k) \ ? \ 1 : 0$$ for $$k \in 1 .. r$$. I.e. for each k divide all samples into 2 sets, one with y = k and one with all others.

So we need to build r functions $$h_k(\bar{x})$$ in the same manner as $$h(\bar{x})$$.

$$J_k(\bar{w}) = -\frac{1}{m}\sum_{j = 1}^m \biggl(y_k^j \ln{(h_k(x^j))} + (1 - y_k^j) \ln{(1- h_k(x^j))}\biggr) + \frac{\lambda}{2m} \sum_{i = 1}^n w_{k,i}^2 \xrightarrow[w_k]{} min$$

Classification is as simple as finding k for which $$h_k(\bar{x})$$ is maximum.

$$\blacksquare$$

Neural Networks
Actually, what we've got above is similar to a one-layer neural network, where each $$h_k(\bar{x})$$ represents one neuron.

$$\bar{y} = g(W\cdot\bar{x})$$, where W is a weights matrix.

Note that applying g at the output layer is not necessary for prediction, as g is monotonic and y is the last layer.

Multilayer NN is built similarly by applying g function to each layer.

$$\bar{y}^{|L+1|} = g(W^{|L|}\cdot\bar{y}^{|L|})$$

$$\Box$$

Instead of using $$J_k(\bar{w})$$ one by one, W for one layer can be found with a sum of them:

$$J(W) = -\frac{1}{m}\sum_{j = 1}^m \sum_{k = 1}^r \biggl(y_k^j \ln{(h_k(x^j))} + (1 - y_k^j) \ln{(1- h_k(x^j))}\biggr) + \frac{\lambda}{2m} \sum_{k = 1}^r \sum_{i = 1}^n w_{k,i}^2 \xrightarrow[W]{} min$$

For multiple layers the algorithm is similar:

$$J(\{W^{|L|}\}) = -\frac{1}{m}\sum_{j = 1}^m \sum_{k = 1}^{r^{|q|}} \biggl(y_k^j \ln{(h_k^{|q|}(y^{|q-1|}))} + (1 - y_k^j) \ln{(1- h_k^{|q|}(y^{|q-1|}))}\biggr) + \frac{\lambda}{2m} \sum_{L = 1}^q \sum_{k = 1}^{r^{|L|}} \sum_{i = 1}^{n^{|L|}} (w_{k,i}^{|L|})^2 \xrightarrow[\{W^{|L|}\}]{} min$$

where $$\{W^{|L|}\}$$ is a set of weight matrices for each layer, and $$y^{|1|} = x^j$$ — input parameters.

$$\blacksquare$$
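A forward-pass sketch of $$\bar{y}^{|L+1|} = g(W^{|L|}\cdot\bar{y}^{|L|})$$; the weights below are hand-picked (an assumption, not learned) so that a two-layer network computes XOR, and each layer's input is augmented with a bias unit 1:

```python
import math

def g(z):
    # Logistic activation.
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, y):
    # One layer: prepend the bias unit, then apply g(W · y) row by row.
    y = [1.0] + y
    return [g(sum(w * yi for w, yi in zip(row, y))) for row in W]

W1 = [[-10.0, 20.0, 20.0],   # hidden unit ~ OR(x1, x2)
      [-30.0, 20.0, 20.0]]   # hidden unit ~ AND(x1, x2)
W2 = [[-10.0, 20.0, -20.0]]  # output ~ OR and not AND = XOR

def forward(x):
    return layer(W2, layer(W1, x))[0]

# forward([0,0]) ≈ 0, forward([1,0]) ≈ 1, forward([0,1]) ≈ 1, forward([1,1]) ≈ 0
```

This also illustrates why the hidden layers need the nonlinearity g: without it the two matrices would collapse into one linear map, which cannot represent XOR.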

Support Vector Machines (Linear Classification)
Is a technique similar to Logistic Regression that can sometimes be faster and more flexible.

Given the per-sample cost functions: $$cost_1^* = -\ln\biggl(\frac{1}{1 + e^{-\bar{w} \cdot \bar{x}}}\biggr),\ cost_0^* = -\ln\biggl(1 - \frac{1}{1 + e^{-\bar{w} \cdot \bar{x}}}\biggr)$$

Replace them with similar piecewise-linear functions: $$cost_1 = (z < 1)\ ?\ a - az : 0,\ cost_0 = (z > -1)\ ?\ a + az : 0$$

where a is some positive number and $$z = \bar{w} \cdot \bar{x}$$.
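With the illustrative choice a = 1 these are the usual hinge costs:

```python
# Piecewise-linear SVM costs with a = 1 (an assumption for illustration).
def cost_1(z):
    # Penalizes z < 1 when y = 1: (z < 1) ? 1 - z : 0.
    return max(0.0, 1.0 - z)

def cost_0(z):
    # Penalizes z > -1 when y = 0: (z > -1) ? 1 + z : 0.
    return max(0.0, 1.0 + z)
```

The cost is zero only beyond the margin (z ≥ 1, resp. z ≤ -1), which is what pushes the separating plane away from the samples.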

$$\Box$$

With this simple approach samples are also divided linearly, but with some margin.

The hypothesis: $$h(x) = (\bar{w} \cdot \bar{x} < 0)\ ?\ 0 : 1$$

$$J(\bar{w}) = \frac{1}{m}\sum_{j = 1}^m \biggl(y^j cost_1(\bar{w} \cdot \bar{x}^j) + (1 - y^j) cost_0(\bar{w} \cdot \bar{x}^j)\biggr) \xrightarrow[w]{} min$$

$$\blacksquare$$

Support Vector Machines (Kernels)
Kernels are used to obtain classification boundaries more complex than linear ones.

Gaussian Kernel: $$K^j(x) = e^{-\frac{\|x - x^j\|^2}{2\sigma^2}}$$, one for each training sample.

Thus x is replaced with $$\bar{K} = \{ 1, K^1(x), .. K^m(x)\}$$.

$$\Box$$

$$h(x) = (\bar{w} \cdot \bar{K} < 0)\ ?\ 0 : 1$$, i.e. near samples with positive weights it tends to give 1.

$$J(\bar{w}) = \frac{1}{m}\sum_{j = 1}^m \biggl(y^j cost_1(\bar{w} \cdot \bar{K}^j) + (1 - y^j) cost_0(\bar{w} \cdot \bar{K}^j)\biggr) \xrightarrow[w]{} min$$

$$\blacksquare$$
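A sketch of the kernel feature map; σ = 1, the sample points and the weight vector w are illustrative assumptions:

```python
import math

# Two training samples; each input x gets a feature per sample,
# plus the leading 1: K = {1, K^1(x), .., K^m(x)}.
samples = [[0.0, 0.0], [3.0, 3.0]]
sigma = 1.0

def kernel_features(x):
    feats = [1.0]
    for s in samples:
        d2 = sum((xi - si) ** 2 for xi, si in zip(x, s))
        feats.append(math.exp(-d2 / (2 * sigma ** 2)))
    return feats

def h(w, x):
    # Hypothesis: (w · K < 0) ? 0 : 1.
    K = kernel_features(x)
    return 0 if sum(wi * ki for wi, ki in zip(w, K)) < 0 else 1
```

Each feature is near 1 close to its sample and decays to 0 away from it, so with suitable weights the decision regions become "islands" around the samples rather than a single half-plane.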

Decision Tree
Is another approach to data classification. It is simpler to describe and visualize, but harder to implement properly.

Again, a set of m samples $$\{\bar{x}^j \rightarrow y^j\}$$, where $$\bar{x}^j$$ is a vector $$\{x_1, x_2,.. x_n\}^j$$ and $$j \in 1 .. m$$.

y can take multiple values: $$y \in 1 .. r$$.

While kernel-based classification splits the space by distance to samples, i.e. with circles,

a Decision Tree does the splitting with planes parallel to the axes, i.e. with rectangles.

$$\Box$$
 * Quantization. All continuous parameters should be converted to discrete (categorical) ones. I.e. if $$x_i$$ can take N values and N is big, those values should be grouped into a smaller set.

$$\triangleleft$$ $$\{10, 10, 12, 15\},\ \{21, 22, 25, 26\},\ \{34, 35, 36\}, ..$$ $$\triangleright$$ $$\blacksquare$$
 * Choose some parameter $$i \in 1 .. n$$ and split the dataset based on $$x_i$$
 * For each subset $$A_i$$:
 * if some value u has $$y^j = u$$ for $$j \in A_i$$ with probability p, say 95%, then stop splitting this subset.
 * else go to the previous step.

There are at least 3 implementation problems here (no specific answers are given for the moment):
 * 1) How to convert continuous parameters?
 * 2) How to choose splitting parameter?
 * 3) When to stop splitting? How to avoid overfitting?

But in any approach the idea is to select the purest subsets, i.e. $$p \rightarrow 100\%$$:
 * Gini impurity: $$E = 1 - \sum_{k\in1..r}p_k^2$$

$$\triangleleft$$ If some class has $$p_k = 1$$, E = 0. If two classes have $$p_k = \frac{1}{2}$$ each, E = 0.5 $$\triangleright$$
 * Information gain: $$E = H - H_{subset}$$, where entropy $$H = - \sum_{k\in1..r}p_k \log_{2} p_k$$. The best split is the one with the most information gain.
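Both impurity measures are easy to sketch for a list of labels:

```python
import math

def gini(labels):
    # 1 - sum of squared class frequencies within the subset.
    m = len(labels)
    return 1.0 - sum((labels.count(u) / m) ** 2 for u in set(labels))

def entropy(labels):
    # H = -sum p_k log2 p_k over the classes present in the subset.
    m = len(labels)
    return -sum((labels.count(u) / m) * math.log2(labels.count(u) / m)
                for u in set(labels))
```

A pure subset has impurity 0 under both measures; a 50/50 split gives Gini 0.5 and entropy 1.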

Random forest
The idea is to generate multiple randomly initialized trees, and then select the most popular result.

Randomization can concern:
 * Learning on random subset of samples.
 * Learning on random subset of features.
 * Random selection of tree splitting parameter.
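The three randomization ideas plus the voting can be sketched with stand-in predictors (hypothetical constant "trees", since a full tree learner is out of scope here):

```python
import random
from collections import Counter

def bootstrap(samples, rng):
    # Random subset of samples (drawn with replacement).
    return [rng.choice(samples) for _ in samples]

def feature_subset(n_features, rng, k):
    # Random subset of k features to consider at a split.
    return rng.sample(range(n_features), k)

def forest_predict(trees, x):
    # Each tree votes; the most popular class wins.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Stand-ins for trained trees: two vote 1, one votes 0.
trees = [lambda x: 1, lambda x: 1, lambda x: 0]
```

In a real forest each tree would be trained on its own bootstrap sample, considering only a random feature subset at each split; the aggregation step stays exactly this simple.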