CS229 - Linear Regression and Gradient Descent (Lecture 2)
Linear Regression
Choosing the parameters is an important part of a learning algorithm.
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$$

(using the convention $x_0 = 1$ so the intercept term is absorbed into the vector notation). Here the $\theta_j$'s are the parameters, also called weights.
Now, given a training set, how do we pick, or learn, the parameters θ? To formalize this, we will define a function that measures, for each value of the θ's, how close the $h_\theta(x^{(i)})$'s are to the corresponding $y^{(i)}$'s.
This is exactly the cost function:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

It should look familiar as the least-squares cost function that gives rise to ordinary least-squares regression; let's continue.
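As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and the cost function above. The variable names (`X`, `y`, `theta`) and the toy data are my own illustrative choices, not part of the lecture notes.

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x for every row of X (X already includes the x_0 = 1 column)."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = hypothesis(theta, X) - y
    return 0.5 * residuals @ residuals

# Toy data: 3 examples, one feature plus the intercept column x_0 = 1.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(cost(np.array([1.0, 2.0]), X, y))  # exact fit, so the cost is 0.0
```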
LMS algorithm
We want to choose θ so as to minimize J(θ).
Specifically, let's consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{(simultaneously for all } j\text{)}$$

Here α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
For a single training example $(x^{(i)}, y^{(i)})$, the partial derivative works out to $\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$, which gives the update rule:

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule.
The magnitude of the update is proportional to the error term $\left(y^{(i)} - h_\theta(x^{(i)})\right)$: a training example with a larger error causes a larger change to the parameters.
There are two ways to modify this method for a training set with more than one example. The first is to sum the LMS update above over all examples:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$

This method looks at every example in the entire training set on every step, and is called batch gradient descent.

The second is to loop over the training set and, for each example in turn, update the parameters using only that example:

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \quad \text{(for every } j\text{, for } i = 1, \ldots, m\text{)}$$

In this algorithm, we update the parameters according to the error of a single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent.
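A minimal NumPy sketch of both variants follows. The step size, iteration counts, and toy data are illustrative choices of mine, not values from the lecture.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_steps=2000):
    """Batch update: every step uses the whole training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        errors = y - X @ theta                  # (y^(i) - h_theta(x^(i))) for all i
        theta = theta + alpha * X.T @ errors    # sum of the per-example updates
    return theta

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=500):
    """Stochastic update: each step uses a single training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta
            theta = theta + alpha * error * X[i]
    return theta

# Toy data generated from y = 1 + 2*x, with the intercept column x_0 = 1.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(batch_gradient_descent(X, y))       # converges to approximately [1, 2]
print(stochastic_gradient_descent(X, y))  # also converges to approximately [1, 2]
```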
Least squares revisited
Define the design matrix $X$ whose rows are the training inputs $(x^{(i)})^T$, and let $\vec{y}$ be the $m$-dimensional vector of target values. Since $h_\theta(x^{(i)}) = (x^{(i)})^T\theta$, we have

$$X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$

Also, using the fact that $z^T z = \sum_i z_i^2$ for any vector $z$,

$$\frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$

Finally, to minimize $J$, let's work out its gradient:

$$\nabla_\theta J(\theta) = X^T X\theta - X^T\vec{y}$$

To minimize $J$, we set its gradient to zero and obtain the normal equations

$$X^T X\theta = X^T\vec{y}$$

so the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by

$$\theta = (X^T X)^{-1} X^T\vec{y}$$
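Below is a minimal NumPy sketch of solving the normal equations on the same toy data as before; using `np.linalg.solve` instead of an explicit matrix inverse is my numerical choice, not something prescribed by the lecture.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x, with the intercept column x_0 = 1.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

# Solve the normal equations X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```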
Probabilistic interpretation
In this section, we will see a set of probabilistic assumptions under which least-squares regression arises very naturally.
We assume that the target variables and the inputs are related via

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term capturing either unmodeled effects or random noise. We further assume that the $\epsilon^{(i)}$ are distributed IID according to a Gaussian with mean zero and variance $\sigma^2$. This implies that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$

Note: the notation $p(y^{(i)} \mid x^{(i)}; \theta)$ indicates that the distribution is parameterized by $\theta$; since $\theta$ is not a random variable, we do not condition on it.
Therefore, given the assumptions above, we can write down $p(\vec{y} \mid X;\theta)$, the probability of the data given the design matrix $X$ and the parameters $\theta$. When viewed explicitly as a function of $\theta$, this quantity is called the likelihood function:

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$

By the independence assumption on the $\epsilon^{(i)}$'s, this can also be written as

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
How should we choose our best guess of the parameters $\theta$? The principle of maximum likelihood says that we should choose $\theta$ so as to make the probability of the data as high as possible. That is, we should choose $\theta$ to maximize $L(\theta)$.

Instead of maximizing $L(\theta)$ directly, we can also maximize any strictly increasing function of $L(\theta)$. In particular, the derivations are simpler if we instead maximize the log likelihood $\ell(\theta)$:

$$\ell(\theta) = \log L(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Hence, maximizing $\ell(\theta)$ is the same as minimizing

$$\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

which is exactly what we called $J(\theta)$, our original least-squares cost function. Note also that our final choice of $\theta$ does not depend on $\sigma^2$: we would arrive at the same answer even if $\sigma^2$ were unknown.
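As a quick sanity check, here is a hedged NumPy sketch showing numerically that the negative log likelihood under this Gaussian model and $J(\theta)/\sigma^2$ differ only by a constant that does not depend on $\theta$, so both objectives share the same minimizer. The data and the particular value of $\sigma$ are arbitrary illustrative choices.

```python
import numpy as np

def neg_log_likelihood(theta, X, y, sigma):
    """-log L(theta) under y^(i) = theta^T x^(i) + Gaussian noise with std sigma."""
    m = len(y)
    residuals = y - X @ theta
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + residuals @ residuals / (2 * sigma**2)

def least_squares_cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (y^(i) - theta^T x^(i))^2."""
    residuals = y - X @ theta
    return 0.5 * residuals @ residuals

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])        # intercept column plus one feature
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=20)  # simulated under the assumed model

sigma = 0.3
for theta in (np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([-3.0, 5.0])):
    gap = neg_log_likelihood(theta, X, y, sigma) - least_squares_cost(theta, X, y) / sigma**2
    print(gap)  # the same value for every theta, so minimizing either objective gives the same theta
```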