
CS229 Lecture 2: Linear Regression and Gradient Descent

Linear Regression

Choosing the parameters is a very important part of the learning algorithm.

To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here the $\theta_i$'s are the parameters, also called weights. With the convention that $x_0 = 1$ (the intercept term), the hypothesis can be written compactly as

$$h(x) = \sum_{i=0}^{d} \theta_i x_i = \theta^T x$$

Now, given a training set, how do we pick, or learn, the parameters θ? To formalize this, we will define a function that measures, for each value of the θ’s, how close the h(x(i))’s are to the corresponding y(i)’s.

That is exactly the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

This should look familiar as the least-squares cost function; let's continue.
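As a concrete illustration, here is a minimal NumPy sketch of $h_\theta$ and $J(\theta)$ on a toy dataset; the data, variable names, and the assumption that each $x^{(i)}$ carries a leading intercept feature $x_0 = 1$ are my own choices, not from the notes.

```python
import numpy as np

# Toy training set: each row of X is one example x^(i), with x_0 = 1 as the intercept feature.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def J(theta, X, y):
    """Cost J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

theta = np.zeros(2)
print(J(theta, X, y))   # cost of the all-zero parameter vector: 0.5*(25+49+81) = 77.5
```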

LMS algorithm

We want to choose θ so as to minimize J(θ).

Specifically, let’s consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. Let's first work out the partial derivative for the case of a single training example $(x, y)$:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
&= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
&= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{d} \theta_i x_i - y \right) \\
&= \left( h_\theta(x) - y \right) x_j
\end{aligned}$$

For a single training example, this gives the update rule:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoff learning rule.

The magnitude of the update is proportional to the error term $\left( y^{(i)} - h_\theta(x^{(i)}) \right)$: an example with a larger error produces a larger change to the parameters.
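The derivative worked out above can be sanity-checked numerically. Below is a minimal sketch (with toy values chosen purely for illustration) comparing the analytic form $(h_\theta(x) - y)\,x_j$ against a central finite difference of $\frac{1}{2}(h_\theta(x) - y)^2$.

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])       # x[0] = 1 is the intercept feature
y = 1.5

def cost_single(th):
    # Cost contribution of a single example: (1/2)(h_theta(x) - y)^2
    return 0.5 * (th @ x - y) ** 2

analytic = (theta @ x - y) * x       # (h_theta(x) - y) * x_j for every j

eps = 1e-6
numeric = np.array([
    (cost_single(theta + eps * e) - cost_single(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(np.allclose(analytic, numeric))  # True: the derivation checks out
```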

There are two ways to adapt this method to a training set with more than one example. The first is simply to apply the LMS update rule above to one training example at a time, as described below.

The second is:

$$\theta := \theta + \alpha \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$

This method looks at every example in the entire training set on every step, and is called batch gradient descent.
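Here is a minimal batch gradient descent sketch following this update; the toy data, learning rate, and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])             # rows are x^(i), with intercept feature x_0 = 1
y = np.array([5.0, 7.0, 9.0])

theta = np.zeros(2)
alpha = 0.05                           # learning rate

for _ in range(2000):
    # Batch update: theta := theta + alpha * sum_i (y^(i) - h_theta(x^(i))) * x^(i)
    errors = y - X @ theta
    theta = theta + alpha * (X.T @ errors)

print(theta)                           # approaches [1, 2], i.e. the line y = 1 + 2x
```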

The first way, by contrast, loops through the training set and, for each example, performs the update

$$\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$

In this algorithm, we update the parameters according to the error on a single example at a time. This algorithm is called stochastic gradient descent (also incremental gradient descent).

Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent.
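For comparison, a minimal stochastic gradient descent sketch on the same kind of toy data, updating θ from one example at a time (again, the data, learning rate, and number of passes are illustrative).

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])             # rows are x^(i), with intercept feature x_0 = 1
y = np.array([5.0, 7.0, 9.0])

theta = np.zeros(2)
alpha = 0.05

for _ in range(500):                   # passes over the training set
    for x_i, y_i in zip(X, y):
        # Per-example update: theta := theta + alpha * (y^(i) - h_theta(x^(i))) * x^(i)
        theta = theta + alpha * (y_i - theta @ x_i) * x_i

print(theta)                           # also close to [1, 2]
```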

Least squares revisited

Let X be the design matrix whose rows are the training inputs $(x^{(i)})^T$, and let $\vec{y}$ be the vector of target values. Since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, we can easily write the residuals in matrix form:

$$X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T \theta \\ \vdots \\ (x^{(n)})^T \theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(n)}) - y^{(n)} \end{bmatrix}$$

Also, using the fact that $z^T z = \sum_i z_i^2$ for a vector $z$, we can go one step further:

$$\frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = J(\theta)$$
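A quick sketch (with made-up numbers) confirming that the vectorized expression matches the element-wise sum:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = np.array([0.3, 1.1])

r = X @ theta - y                      # the vector X*theta - y
vectorized = 0.5 * (r @ r)             # (1/2)(X theta - y)^T (X theta - y)
elementwise = 0.5 * sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

print(np.isclose(vectorized, elementwise))  # True
```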

Finally, to minimize J, let's derive its gradient:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) \\
&= \frac{1}{2} \nabla_\theta \left( (X\theta)^T X\theta - (X\theta)^T \vec{y} - \vec{y}^T (X\theta) + \vec{y}^T \vec{y} \right) \\
&= \frac{1}{2} \nabla_\theta \left( \theta^T (X^T X) \theta - \vec{y}^T (X\theta) - \vec{y}^T (X\theta) \right) \\
&= \frac{1}{2} \nabla_\theta \left( \theta^T (X^T X) \theta - 2 (X^T \vec{y})^T \theta \right) \\
&= \frac{1}{2} \left( 2 X^T X \theta - 2 X^T \vec{y} \right) \\
&= X^T X \theta - X^T \vec{y}
\end{aligned}$$

To minimize J, we set its gradient to zero, obtaining the normal equations:

$$X^T X \theta = X^T \vec{y}$$

Solving for θ gives the closed-form value $\theta = (X^T X)^{-1} X^T \vec{y}$.
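In code, the normal equations can be solved with a direct linear solve; below is a minimal sketch on the same kind of toy data (using a solve rather than forming the explicit inverse, a standard numerical choice).

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# Solve X^T X theta = X^T y instead of computing (X^T X)^{-1} explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                           # [1, 2]: the exact least-squares solution
```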

Probabilistic interpretation

In this section, you will see a set of probabilistic assumptions under which least-squares regression arises very naturally. Assume that the target variables and the inputs are related via

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term capturing unmodeled effects or random noise. We express this assumption by modeling the errors with a normal (Gaussian) distribution:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

This implies that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

Note: θ is not a random variable; the semicolon indicates that the distribution is parameterized by θ rather than conditioned on it.

Therefore, given the assumptions above, we can write down $p(\vec{y} \mid X; \theta)$, which, viewed as a function of θ, we call the likelihood function:

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$

By the independence assumption on the $\epsilon^{(i)}$'s, this can also be written as

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

How should we choose the best parameters? The principle of maximum likelihood says we should choose θ so as to make the probability of the data as high as possible.

That is, we should choose θ to maximize $L(\theta)$.

Equivalently, we can maximize any strictly increasing function of $L(\theta)$. For example, we can maximize the log likelihood $\log L(\theta)$ instead:

$$\begin{aligned}
\log L(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \right) \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
\end{aligned}$$

Hence, maximizing $L(\theta)$ is the same as minimizing

$$\frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

which we recognize as $J(\theta)$, our original least-squares cost function.

Note that our choice of θ does not depend on $\sigma^2$; in fact, we would arrive at the same answer even if $\sigma^2$ were unknown.
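As a numerical sanity check of the identity derived above, the following sketch (with synthetic data and an arbitrary σ of my own choosing) compares the log likelihood against $n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
sigma = 0.7
y = X @ theta + rng.normal(scale=sigma, size=n)

residuals = y - X @ theta

# Log likelihood: sum over examples of the log Gaussian density of each residual.
log_L = np.sum(np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
               - residuals**2 / (2 * sigma**2))

# The identity derived above: log L = n*log(1/(sqrt(2*pi)*sigma)) - J(theta)/sigma^2
J = 0.5 * np.sum(residuals**2)
identity = n * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - J / sigma**2

print(np.isclose(log_L, identity))     # True
```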

Locally weighted linear regression

This post is licensed under CC BY 4.0 by the author.