Machine Learning - WEEK 1 2 3- 线性回归 、逻辑回归、梯度下降法及其优化算法、Octave 入门

WEEK 1、2、3

本文为个人笔记,只记了重要内容,不适合新手入手

线性回归

  • 样本$(x^{(i)},y^{(i)}),i \in 1,2,…,m$
  • $x^{(i)}=(x_1^{(i)},x_2^{(i)},…,x_n^{(i)})$ ,假设 $x^{(i)}$ 具有 $n$ 个特征
  • 假想函数(目标函数):
    $$h_\theta(x^{(i)})=\theta_0+\theta_1x_1^{(i)}+\theta_2x_2^{(i)}+…+\theta_n x_n^{(i)}$$
  • $h_\theta(x^{(i)})$ 表达式视具体情况而定
  • 线性回归线性 的含义是参数 $\theta_j$ 都是一次的,而非 $h_\theta$ 的自变量,回归 应该是代价函数 $J$ 的自变量的回归
  • $\theta=(\theta_0,\theta_1,\theta_2,…,\theta_n)$ 称为参数,Machine Learning 的目的就是求出参数的合适取值使得 $h_\theta(x^{(i)})$ 更能体现 $x^{(i)} \to y^{(i)}$ 的映射关系
  • 为此我们提出了一个衡量参数取值好坏的函数——代价函数:
    $$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}{(h_\theta(x^{(i)}) - y^{(i)})^2}$$
  • 现在问题转变为了求使得代价函数 $J(\theta)$ 最小时的参数 $\theta$,下面给出两种方法:

梯度下降法

时间复杂度 $O(kn^2)$ ,适合当 n > 10000 时 (k 为学习步数)
$$\theta_j := \theta_j - \alpha \frac {\partial J(\theta)}{\partial \theta_j}$$

  • $\alpha$ 称学习速率(或步长),取值视情况而定,取值过大会导致不收敛;若当取值适当,在趋近极值时 $\frac {\partial J(\theta)}{\partial \theta_j}$ 也会变小,所以此时梯度下降法是收敛的,所以没必要担心 $\alpha$ 在趋近极值点的时候不改变导致无法收敛。
  • 注意要先求出所有的 $temp_j= \theta_j - \alpha \frac {\partial J(\theta)}{\partial \theta_j}$ ,再对所有的 $\theta_j = temp_j$

若取 $x_0^{(i)}=1$ ,求偏导后可以写成:
$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}{[(h_\theta(x^{(i)}) - y^{(i)})\cdot x_j^{(i)}]}$$

用矩阵来表述的话:
$$\theta := \theta - \frac{\alpha}{m}[X^T \cdot (X \cdot \theta - y)]$$

  • $\theta \in \mathbb{R}^{(n+1) \times 1},X \in \mathbb{R}^{m \times (n+1)},y \in \mathbb{R}^{m \times 1}$

特征优化:

尽量让 $-1<x_i<1$
$$x_i := \frac {x_i-\mu_i}{s_i}$$

  • $\mu_i$ 为第 $i$ 个特征的平均取值,$s_i$ 为第 $i$ 个特征的极差 $(max - min)$

传统方法

时间复杂度 $O(n^3)$ ,适合当 n < 10000 时

直接令$ \frac {\partial J(\theta)}{\partial \theta_i}=0$,解得:
$$\theta=(X^TX)^{-1}X^Ty$$

  • $\theta \in \mathbb{R}^{(n+1) \times 1},X \in \mathbb{R}^{m \times (n+1)},y \in \mathbb{R}^{m \times 1}$

code

featureNormalize

1
2
3
4
5
6
7
8
9
function [X_norm, mu, sigma] = featureNormalize(X)
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
for iter = 1:size(X, 2)
mu(iter) = mean(X(:, iter));
sigma(iter) = std(X(:, iter));
X_norm(:, iter) = (X(:, iter) - mu(iter)) / sigma(iter);
end
end

computeCostMulti

1
2
3
4
function J = computeCost(X, y, theta)
m = length(y); % number of training examples
J = 1 / (2 * m) * sum((X * theta - y) .^ 2);
end

gradientDescentMulti

1
2
3
4
5
6
7
8
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
theta -= alpha / m * X' * (X * theta - y);
J_history(iter) = computeCostMulti(X, y, theta);
end
end

normalEqn

1
2
3
4
function [theta] = normalEqn(X, y)
theta = zeros(size(X, 2), 1);
theta = pinv(X' * X) * X' * y;
end

Octave

https://www.gnu.org/software/octave/

Ubuntu Install:

sudo apt-add-repository ppa:octave/stable
sudo apt-get update
sudo apt-get install octave

Octave 入门:

内容比较多,推荐两篇文章

http://blog.csdn.net/weixin_36106941/article/details/64443944
https://www.cnblogs.com/leezx/p/5635056.html

逻辑回归

线性回归是用来预测某个点的取值,逻辑回归是预测某个点具有某种特征的概率

为了达到我们的目的,重现建立模型:
$$\begin{aligned}
h_\theta(x) & = g(\theta^Tx)\\ \\
g(z) & = \frac{1}{1+e^{-z}}\\ \\
J(\theta) &=\frac {1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)},y^{(i)}))\\
\\
Cost(h_\theta(x,y)) &=
\begin{cases}
&-log(h_\theta(x)) && y = 1 \\ \\
&-log(1 - h_\theta(x)) && y = 0
\end{cases}
\end{aligned}
$$
Note: $y = 0$ or $1$ always , and log is ln

可以写成:
$$Cost(h_\theta(x,y)) = -[y\cdot log(h_\theta(x)) + (1 - y)\cdot log(1 - h_\theta(x)]$$

对代价函数求偏倒后发现和线性回归代价函数求偏倒的结果形式上是完全一样的:
$$\frac {\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}{[(h_\theta(x^{(i)}) - y^{(i)})\cdot x_j^{(i)}]}$$

cost function code

1
2
3
4
5
6
function [J, grad] = costFunction(theta, X, y)
m = length(y); % number of training examples
tmp = sigmoid(X*theta);
J = -1 / m * (y'*log(tmp)+(1-y)'*log(1-tmp));
grad = 1 / m * X' * (sigmoid(X * theta) - y);
end

高级优化算法

Optimization algorithms:

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages:

  • No need to manually pick $\alpha$
  • Often faster than grdicent descent.

disadvantages:

  • More complex

调用 Octave 优化算法 example :

[first] 定义 cost function:

costFunction.m

1
2
3
function [jval, gradient] = costFunction(theta, X, y)
% jval := J(theta)
% gradient := grad J(theta)

[then] 键入命令

options = optimset('GrandObj', 'on', 'MaxIter', '100');
initialTheta = zeros(n + 1, 1)
[optTheta, functionVal, exitFlag] = fminunc(@(t)costFunction(t, X, y), initialTheta, options);

正则化项

线性回归

为了防止 overfitting(过度拟合),对 cost function 引入了正则化项 $\frac\lambda{2m}\sum_{i=1}^{n}\theta_i^2$
$$J(\theta)=\frac{1}{2m}[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{i=1}^{n}\theta_i^2]$$
求偏导:
$$ \frac {\partial J(\theta)}{\partial \theta_j} =
\begin{cases}
&\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} &j=0\\ \\
&\frac{1}{m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \lambda\cdot \theta_j]&j>0
\end{cases}
$$
注意!!! $\theta_0$ 不需要惩罚

梯度下降法

Repeat {
$$\begin{aligned}
\theta_0 & := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\
\theta_j & := \theta_j - \alpha\frac{1}{m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \lambda\cdot \theta_j]
\end{aligned}$$
}

对上整理后:

$$\theta_j := \theta_j(1- \alpha\frac\lambda m) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$

由于 $1- \alpha\frac\lambda m<1$ ,可以将 $\theta_j(1- \alpha\frac\lambda m)$ 写成 $\theta_j\cdot 0.99$

常规方法

直接令$ \frac {\partial J(\theta)}{\partial \theta_i}=0$,解得:
$$\theta=(X^TX + \lambda \cdot \underbrace {\begin{bmatrix} 0 & 0 & 0 &\cdots &0 \\ 0 & 1 & 0 &\cdots &0 \\ 0 & 0 & 1 &\cdots &0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0&0&0&\cdots &1 \end{bmatrix}}_{(n+1)\times (n+1)})^{-1}X^Ty$$

  • Suppose $m \le n$ (m: examples , n : features)
  • 在 $\lambda>0$ 时,括号内的矩阵总是可逆的
  • $\theta \in \mathbb{R}^{(n+1) \times 1},X \in \mathbb{R}^{m \times (n+1)},y \in \mathbb{R}^{m \times 1}$

逻辑回归

对 cost function 引入了正则化项 $\frac\lambda{2m}\sum_{i=1}^{n}\theta_i^2$
$$\begin{aligned}
J(\theta) &=\frac {1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)},y^{(i)}))+\frac\lambda{2m}\sum_{i=1}^{n}\theta_i^2 \\
&=-\frac {1}{m}\sum_{i=1}^{m}[y^{(i)}\cdot log(h_\theta(x^{(i)})) + (1 - y^{(i)})\cdot log(1 - h_\theta(x^{(i)})]+\frac\lambda{2m}\sum_{i=1}^{n}\theta_i^2
\end{aligned}$$
求偏导数后结果形式和线性回归是一样的:
$$ \frac {\partial J(\theta)}{\partial \theta_j} =
\begin{cases}
&\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} &j=0\\ \\
&\frac{1}{m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+ \lambda\cdot \theta_j]&j>0
\end{cases}
$$

注意!!! $\theta_0$ 不需要惩罚

cost function code

1
2
3
4
5
6
function [J, grad] = costFunctionReg(theta, X, y, lambda)
tmp = sigmoid(X*theta);
J = -1 / m * (y'*log(tmp)+(1-y)'*log(1-tmp)) + lambda / (2 * m) * sum(theta(2:size(theta, 1),1) .^ 2);
theta(1) = 0;
grad = 1 / m * (X' * (tmp - y) + lambda * theta);
end


https://www.coursera.org/learn/machine-learning
教学方: Andrew Ng, Co-founder, Coursera; Adjunct Professor, Stanford University; formerly head of Baidu AI Group/Google Brain