
CS229 Lecture 2: Linear Regression and Gradient Descent

Linear Regression

Choosing the parameters is a very important part of the learning algorithm.

To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here the $\theta_i$'s are the parameters, also called weights. With the convention that $x_0 = 1$ (the intercept term), the hypothesis can be written compactly as

$$h(x) = \sum_{i=0}^{d} \theta_i x_i = \theta^T x$$

Now, given a training set, how do we pick, or learn, the parameters θ? To formalize this, we will define a function that measures, for each value of the θ’s, how close the h(x(i))’s are to the corresponding y(i)’s.

That is exactly the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

This should look familiar as the least-squares cost function; let's continue.
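As a concrete illustration, here is a minimal NumPy sketch of $h_\theta$ and $J(\theta)$ on a toy dataset; the data, variable names, and the assumption that each $x^{(i)}$ carries a leading intercept feature $x_0 = 1$ are my own choices, not from the notes.

```python
import numpy as np

# Toy training set: each row of X is one example x^(i), with x_0 = 1 as the intercept feature.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def J(theta, X, y):
    """Cost J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

theta = np.zeros(2)
print(J(theta, X, y))   # cost of the all-zero parameter vector: 0.5*(25+49+81) = 77.5
```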

LMS algorithm

We want to choose θ so as to minimize J(θ).

Specifically, let’s consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. Let's first work out the partial derivative for the case of a single training example $(x, y)$:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
&= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
&= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{d} \theta_i x_i - y \right) \\
&= \left( h_\theta(x) - y \right) x_j
\end{aligned}$$

For a single training example, this gives the update rule:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoff learning rule.

The magnitude of the update is proportional to the error term $\left( y^{(i)} - h_\theta(x^{(i)}) \right)$: an example with a larger error produces a larger change to the parameters.
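The derivative worked out above can be sanity-checked numerically. Below is a minimal sketch (with toy values chosen purely for illustration) comparing the analytic form $(h_\theta(x) - y)\,x_j$ against a central finite difference of $\frac{1}{2}(h_\theta(x) - y)^2$.

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])       # x[0] = 1 is the intercept feature
y = 1.5

def cost_single(th):
    # Cost contribution of a single example: (1/2)(h_theta(x) - y)^2
    return 0.5 * (th @ x - y) ** 2

analytic = (theta @ x - y) * x       # (h_theta(x) - y) * x_j for every j

eps = 1e-6
numeric = np.array([
    (cost_single(theta + eps * e) - cost_single(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(np.allclose(analytic, numeric))  # True: the derivation checks out
```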

There are two ways to adapt this method to a training set with more than one example. The first is simply to apply the LMS update rule above to one training example at a time, as described below.

The second is:

$$\theta := \theta + \alpha \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$

This method looks at every example in the entire training set on every step, and is called batch gradient descent.
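Here is a minimal batch gradient descent sketch following this update; the toy data, learning rate, and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])             # rows are x^(i), with intercept feature x_0 = 1
y = np.array([5.0, 7.0, 9.0])

theta = np.zeros(2)
alpha = 0.05                           # learning rate

for _ in range(2000):
    # Batch update: theta := theta + alpha * sum_i (y^(i) - h_theta(x^(i))) * x^(i)
    errors = y - X @ theta
    theta = theta + alpha * (X.T @ errors)

print(theta)                           # approaches [1, 2], i.e. the line y = 1 + 2x
```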

The first way, by contrast, loops through the training set and, for each example, performs the update

$$\theta := \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$

In this algorithm, we update the parameters according to the error on a single example at a time. This algorithm is called stochastic gradient descent (also incremental gradient descent).

Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent.
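For comparison, a minimal stochastic gradient descent sketch on the same kind of toy data, updating θ from one example at a time (again, the data, learning rate, and number of passes are illustrative).

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])             # rows are x^(i), with intercept feature x_0 = 1
y = np.array([5.0, 7.0, 9.0])

theta = np.zeros(2)
alpha = 0.05

for _ in range(500):                   # passes over the training set
    for x_i, y_i in zip(X, y):
        # Per-example update: theta := theta + alpha * (y^(i) - h_theta(x^(i))) * x^(i)
        theta = theta + alpha * (y_i - theta @ x_i) * x_i

print(theta)                           # also close to [1, 2]
```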

Least squares revisited

Let X be the design matrix whose rows are the training inputs $(x^{(i)})^T$, and let $\vec{y}$ be the vector of target values. Since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, we can easily write the residuals in matrix form:

$$X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T \theta \\ \vdots \\ (x^{(n)})^T \theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(n)}) - y^{(n)} \end{bmatrix}$$

Also, using the fact that $z^T z = \sum_i z_i^2$ for a vector $z$, we can go one step further:

$$\frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = J(\theta)$$
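A quick sketch (with made-up numbers) confirming that the vectorized expression matches the element-wise sum:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = np.array([0.3, 1.1])

r = X @ theta - y                      # the vector X*theta - y
vectorized = 0.5 * (r @ r)             # (1/2)(X theta - y)^T (X theta - y)
elementwise = 0.5 * sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

print(np.isclose(vectorized, elementwise))  # True
```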

Finally, to minimize J, let's derive its gradient:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) \\
&= \frac{1}{2} \nabla_\theta \left( (X\theta)^T X\theta - (X\theta)^T \vec{y} - \vec{y}^T (X\theta) + \vec{y}^T \vec{y} \right) \\
&= \frac{1}{2} \nabla_\theta \left( \theta^T (X^T X) \theta - \vec{y}^T (X\theta) - \vec{y}^T (X\theta) \right) \\
&= \frac{1}{2} \nabla_\theta \left( \theta^T (X^T X) \theta - 2 (X^T \vec{y})^T \theta \right) \\
&= \frac{1}{2} \left( 2 X^T X \theta - 2 X^T \vec{y} \right) \\
&= X^T X \theta - X^T \vec{y}
\end{aligned}$$

To minimize J, we set its gradient to zero, obtaining the normal equations:

$$X^T X \theta = X^T \vec{y}$$

Solving for θ gives the closed-form value $\theta = (X^T X)^{-1} X^T \vec{y}$.
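In code, the normal equations can be solved with a direct linear solve; below is a minimal sketch on the same kind of toy data (using a solve rather than forming the explicit inverse, a standard numerical choice).

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# Solve X^T X theta = X^T y instead of computing (X^T X)^{-1} explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                           # [1, 2]: the exact least-squares solution
```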

Probabilistic interpretation

In this section, you will see a set of probabilistic assumptions under which least-squares regression arises very naturally. Assume that the target variables and the inputs are related via

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term capturing unmodeled effects or random noise. We express this assumption by modeling the errors with a normal (Gaussian) distribution:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

This implies that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

Note: θ is not a random variable; the semicolon indicates that the distribution is parameterized by θ rather than conditioned on it.

Therefore, given the assumptions above, we can write down $p(\vec{y} \mid X; \theta)$, which, viewed as a function of θ, we call the likelihood function:

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$

By the independence assumption on the $\epsilon^{(i)}$'s, this can also be written as

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

How should we choose the best parameters? The principle of maximum likelihood says we should choose θ so as to make the probability of the data as high as possible.

That is, we should choose θ to maximize $L(\theta)$.

Equivalently, we can maximize any strictly increasing function of $L(\theta)$. For example, we can maximize the log likelihood $\log L(\theta)$ instead:

$$\begin{aligned}
\log L(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
&= \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \right) \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
\end{aligned}$$

Hence, maximizing $L(\theta)$ is the same as minimizing

$$\frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$$

which we recognize as $J(\theta)$, our original least-squares cost function.

Note that our choice of θ does not depend on $\sigma^2$; in fact, we would arrive at the same answer even if $\sigma^2$ were unknown.
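As a numerical sanity check of the identity derived above, the following sketch (with synthetic data and an arbitrary σ of my own choosing) compares the log likelihood against $n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} J(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
sigma = 0.7
y = X @ theta + rng.normal(scale=sigma, size=n)

residuals = y - X @ theta

# Log likelihood: sum over examples of the log Gaussian density of each residual.
log_L = np.sum(np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
               - residuals**2 / (2 * sigma**2))

# The identity derived above: log L = n*log(1/(sqrt(2*pi)*sigma)) - J(theta)/sigma^2
J = 0.5 * np.sum(residuals**2)
identity = n * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - J / sigma**2

print(np.isclose(log_L, identity))     # True
```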

Locally weighted linear regression

This post is licensed under CC BY 4.0 by the author.