
[ESL] Chapter 3: Linear Methods for Regression

The linear model assumes that the regression function \(E(Y|X)\) is linear in the inputs \(X_{1},X_{2},...,X_{p}\). With few training cases, a low signal-to-noise ratio, or sparse data, a linear model can sometimes outperform a fancier nonlinear one.

3.2 Linear Regression Models and Least Squares

Given an input vector \(X^{T}=(X_{1},X_{2},...,X_{p})\), the linear regression model for predicting the output Y is:

\(f(X) = \beta_{0}+\sum_{j=1}^{p}{X_{j}\beta_{j}}\)                       (3.1)

Suppose we have training data \((x_{1},y_{1}),...,(x_{N},y_{N})\), where \(x_{i}=(x_{i1},x_{i2},...,x_{ip})^{T}\) and \(\beta=(\beta_{0},\beta_{1},...,\beta_{p})^{T}\).

We estimate \(\beta\) by minimizing the residual sum of squares:

\(RSS(\beta) = \sum_{i=1}^{N}(y_{i}-f(x_{i}))^{2} = \sum_{i=1}^{N}(y_{i}-\beta_{0}-\sum_{j=1}^{p}{x_{ij}\beta_{j}})^{2}\)                                              (3.2)
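As a quick numpy sketch (the name `rss` and the matrix convention are mine, not the book's), (3.2) is a one-liner once \(\mathbf{X}\) carries a leading column of ones:

```python
import numpy as np

def rss(beta, X, y):
    """Residual sum of squares (3.2). X is the N x (p+1) matrix with a
    leading column of ones, so beta[0] plays the role of the intercept."""
    residuals = y - X @ beta
    return residuals @ residuals
```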

From a statistical point of view, this criterion is reasonable if the training observations are drawn independently at random from their population; even if the \(x_{i}\) are not drawn randomly, it remains valid as long as the \(y_{i}\) are conditionally independent given the inputs \(x_{i}\).

To minimize (3.2), write \(\mathbf{X}\) for the \(N\times(p+1)\) matrix whose rows are the input vectors, with a 1 in the first position; then

\(RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)\)     (3.3)

This is a quadratic function of its \(p+1\) parameters. Differentiating with respect to \(\beta\):

\(\frac{\partial{RSS}}{\partial{\beta}} = -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)\)      (3.4)

\(\frac{\partial^{2}RSS}{\partial{\beta}\partial{\beta}^{T}} = 2\mathbf{X}^{T}\mathbf{X}\)    (3.5)

If \(\mathbf{X}\) has full column rank, then \(\mathbf{X}^{T}\mathbf{X}\) is positive definite; setting the first derivative (3.4) to zero yields the unique solution

\(\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\)           (3.6)
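Here is a minimal sketch of (3.6) on made-up synthetic data (`N`, `p`, `beta_true`, and the noise level are illustrative, not from the book). In practice `np.linalg.lstsq` is preferable because it avoids forming \(\mathbf{X}^{T}\mathbf{X}\) at all; solving the normal equations directly is fine for this small example:

```python
import numpy as np

# Synthetic setup (N, p, beta_true are illustrative).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # leading column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# (3.6): solve the normal equations X^T X beta = X^T y.
# Solving is numerically safer than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```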

For an input vector \(x_{0}\), the predicted value is \(\hat{f}(x_{0}) = (1:x_{0})^{T}\hat{\beta}\); the fitted values at the training inputs are

\(\hat{\mathbf{y}}=\mathbf{X}\hat{\beta}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\)      (3.7)

The matrix \(H=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\) is called the hat matrix, and also the projection matrix, because \(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto the subspace spanned by the columns of \(\mathbf{X}\).
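Continuing the sketch above, the projection interpretation is easy to check numerically: \(H\) should be symmetric and idempotent, and its trace should equal \(p+1\):

```python
# Hat matrix H = X (X^T X)^{-1} X^T, reusing X, y, beta_hat from above.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                                  # (3.7): fitted values

assert np.allclose(y_hat, X @ beta_hat)        # same fit, two routes
assert np.allclose(H, H.T)                     # symmetric
assert np.allclose(H @ H, H)                   # idempotent: projecting twice = once
assert np.isclose(np.trace(H), X.shape[1])     # trace = p + 1, the rank of X
```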

Now consider the sampling properties of \(\hat{\beta}\). Assume that the \(y_{i}\) are uncorrelated with constant variance \(\sigma^{2}\), and that the \(x_{i}\) are fixed (non-random). Then (3.6) gives the variance-covariance matrix:

\(Var(\hat{\beta})=(\mathbf{X}^{T}\mathbf{X})^{-1}\sigma^{2}\)                    (3.8)

Derivation: since \(\mathbf{y}=\mathbf{X}\beta+\mu\), where \(\mu\) denotes the error vector, substituting into (3.6) gives

                       \(\hat{\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{X}\beta+\mu)\)

                           \(=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}\beta+(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mu\)

                           \(=\beta+(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mu\)

              \(\hat{\beta}-\beta=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mu\)

Therefore:

\(Cov(\hat{\beta})=E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{T}\right]\)

                           \(=E\left[((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mu)((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mu)^{T}\right]\)

                           \(=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}E\left[\mu\mu^{T}\right]\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\)

Since \(E(\mu\mu^{T})=\sigma^{2}I\), the middle factor collapses and the expression reduces to \((\mathbf{X}^{T}\mathbf{X})^{-1}\sigma^{2}\), which is exactly (3.8).
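As a sanity check on (3.8), still in the synthetic setup from the sketch above, we can hold \(\mathbf{X}\) fixed, redraw the noise many times, and compare the empirical covariance of \(\hat{\beta}\) against the theoretical one:

```python
# Monte Carlo check of (3.8), reusing X, beta_true, rng, N from above:
# hold X fixed, redraw the noise, refit, and compare covariances.
sigma = 0.1
A = np.linalg.solve(X.T @ X, X.T)              # (X^T X)^{-1} X^T, computed once
draws = np.array([A @ (X @ beta_true + rng.normal(scale=sigma, size=N))
                  for _ in range(20000)])
emp_cov = np.cov(draws, rowvar=False)
theory = np.linalg.inv(X.T @ X) * sigma**2
print(np.abs(emp_cov - theory).max())          # tiny next to theory's diagonal
```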

An unbiased estimate of the variance \(\sigma^{2}\) is given by

\(\hat{\sigma}^{2}=\frac{1}{N-p-1}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}\)
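Continuing the sketch one last time, the unbiased estimate in numpy, together with the standard errors it implies through (3.8):

```python
# Unbiased estimate of sigma^2, reusing X, y, beta_hat, N, p from above.
# Dividing by N - p - 1 rather than N accounts for the p + 1 fitted parameters.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)       # close to the true 0.1**2 = 0.01

# Plugging sigma2_hat into (3.8) yields estimated standard errors for beta_hat.
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2_hat)
print(sigma2_hat, se)
```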