*This post is the second in our series on the history and foundations of econometric and machine learning models. Part 1 is online here.*

## Geometric Properties of this Linear Model

Let us define the scalar product on \mathbb{R}^n, ⟨\mathbf{a},\mathbf{b}⟩=\mathbf{a}^T\mathbf{b}, and let \|\cdot\| denote the associated Euclidean norm, \|\mathbf{a}\|=\sqrt{\mathbf{a}^T\mathbf{a}} (denoted \|\cdot\|_{\ell_2} in the next post). Let \mathcal{E}_X denote the space spanned by all linear combinations of the components of \mathbf{X} (including the constant). If the explanatory variables are linearly independent, \mathbf{X} has full (column) rank and \mathcal{E}_X is a space of dimension p+1. Let us assume from now on that the variables \mathbf{x} and y are centered. Note that no distributional assumption is made in this section: the geometric properties follow from the properties of expectation and variance in the space of finite-variance variables.

With this notation, it should be noted that the linear model is written m(\mathbf{x})=⟨\mathbf{x},\beta⟩. The space H_z=\{\mathbf{x}\in\mathbb{R}^{p+1}:m(\mathbf{x})=z\} is an (affine) hyperplane that separates the space in two. Let us define the orthogonal projection operator on \mathcal{E}_X, \Pi_X =\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T. Thus, the prediction is \widehat{\mathbf{y}}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}=\Pi_X\mathbf{y}. Since \widehat{\varepsilon}=\mathbf{y}-\widehat{\mathbf{y}}=(\mathbb{I}-\Pi_X)\mathbf{y}=\Pi_{X^\perp}\mathbf{y}, we note that \widehat{\varepsilon}\perp\mathbf{x}, which will be interpreted as meaning that the residuals are an innovation term, unpredictable in the sense that \Pi_{X}\widehat{\varepsilon}=\mathbf{0}. The Pythagorean theorem is written here: \Vert \mathbf{y} \Vert^2=\Vert \Pi_{X}\mathbf{y} \Vert^2+\Vert \Pi_{X^\perp}\mathbf{y} \Vert^2=\Vert \Pi_{X}\mathbf{y}\Vert^2+\Vert \mathbf{y}-\Pi_{X}\mathbf{y}\Vert^2=\Vert\widehat{\mathbf{y}}\Vert^2+\Vert\widehat{\varepsilon}\Vert^2, which is classically translated in terms of sums of squares: \underbrace{\sum_{i=1}^n y_i^2}_{n\times\text{total variance}}=\underbrace{\sum_{i=1}^n \widehat{y}_i^2}_{n\times\text{explained variance}}+\underbrace{\sum_{i=1}^n (y_i-\widehat{y}_i)^2}_{n\times\text{residual variance}}. The coefficient of determination, R^2, is then interpreted as the squared cosine of the angle \theta between \mathbf{y} and \Pi_X \mathbf{y}: R^2=\frac{\Vert \Pi_{X} \mathbf{y}\Vert^2}{\Vert \mathbf{y}\Vert^2}=1-\frac{\Vert \Pi_{X^\perp} \mathbf{y}\Vert^2}{\Vert \mathbf{y}\Vert^2}=\cos^2(\theta).

An important application was obtained by Frisch & Waugh (1933), when the explanatory variables are divided into two groups, \mathbf{X}=[\mathbf{X}_1 |\mathbf{X}_2], so that the regression becomes y=\beta_0+\mathbf{X}_1 \beta_1+\mathbf{X}_2 \beta_2+\varepsilon. Frisch & Waugh (1933) showed that two successive projections could be considered. Indeed, if \mathbf{y}_2^\star=\Pi_{X_1^\perp} \mathbf{y} and \mathbf{X}_2^\star=\Pi_{X_1^\perp}\mathbf{X}_2, we can show that \widehat{\beta}_2=[{\mathbf{X}_2^\star}^T \mathbf{X}_2^\star]^{-1}{\mathbf{X}_2^\star}^T \mathbf{y}_2^\star. In other words, the overall estimate is equivalent to the combination of independent estimates of the two models when \mathbf{X}_2^\star=\mathbf{X}_2, i.e. \mathbf{X}_2\in \mathcal{E}_{X_1}^\perp, which can be written \mathbf{x}_1\perp\mathbf{x}_2. We obtain here the Frisch-Waugh theorem, which guarantees that if the explanatory variables in the two groups are orthogonal, then the overall estimate is equivalent to two independent regressions, one on each set of explanatory variables. This is a theorem of double projection, on orthogonal spaces. Many results and interpretations are obtained through geometric interpretations (fundamentally related to the links between conditional expectation and orthogonal projection in the space of finite-variance variables).
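As a quick numerical illustration of these projections, here is a minimal sketch in Python (using only numpy, with simulated data; all names and values are illustrative assumptions, not taken from the post) that checks the Frisch-Waugh result: the coefficients on \mathbf{X}_2 from the full regression coincide with those obtained by first projecting \mathbf{y} and \mathbf{X}_2 on the orthogonal complement of \mathcal{E}_{X_1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one regressor
X2 = rng.normal(size=(n, 2))                             # two more regressors
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = np.column_stack([X1, X2]) @ beta + rng.normal(size=n)

def proj(X):
    """Orthogonal projection matrix onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

X = np.column_stack([X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)              # full regression

M1 = np.eye(n) - proj(X1)                                 # projection on the orthogonal of E_{X1}
y_star, X2_star = M1 @ y, M1 @ X2
beta2_fw = np.linalg.solve(X2_star.T @ X2_star, X2_star.T @ y_star)

print(beta_hat[2:])   # coefficients on X2 from the full model
print(beta2_fw)       # same values, obtained via the two successive projections
```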

This geometric interpretation might help to get a better understanding of the problem of under-identification, i.e. the case where the real model would be y_i=\beta_0+ \mathbf{x}_1^T \beta_1+\mathbf{x}_2^T \beta_2+\varepsilon_i, but the estimated model is y_i=b_0+\mathbf{x}_1^T \mathbf{b}_1+\eta_i. The maximum likelihood estimator of \mathbf{b}_1 is \widehat{\mathbf{b}}_1=\beta_1 + \underbrace{(\mathbf{X}_1^T\mathbf{X}_1)^{-1} \mathbf{X}_1^T \mathbf{X}_{2} \beta_2}_{\beta_{12}}+\underbrace{(\mathbf{X}_1^{T}\mathbf{X}_1)^{-1} \mathbf{X}_1^T\varepsilon}_{\nu}, so that \mathbb{E}[\widehat{\mathbf{b}}_1]=\beta_1+\beta_{12}, the bias (\beta_{12}) being null only in the case where \mathbf{X}_1^T \mathbf{X}_2=\mathbf{0} (i.e. \mathbf{X}_1\perp \mathbf{X}_2): we find here a consequence of the Frisch-Waugh theorem.
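To make the bias term \beta_{12} concrete, here is a small simulated sketch (the data-generating process and all numbers are illustrative assumptions): when \mathbf{x}_1 and \mathbf{x}_2 are correlated, the short regression of y on \mathbf{x}_1 alone recovers \beta_1+\beta_{12} rather than \beta_1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, beta2 = 10_000, 1.0, 2.0, -3.0

# correlated regressors: omitting x2 biases the coefficient on x1
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])
b_short = np.linalg.solve(X1.T @ X1, X1.T @ y)       # short regression: y on x1 only

# bias term beta_12 = (X1'X1)^{-1} X1'X2 beta2, computed on the sample
X2 = x2.reshape(-1, 1)
beta12 = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ np.array([beta2])

print(b_short[1])           # close to beta1 + beta12, not beta1
print(beta1 + beta12[1])    # beta1 plus the bias on the slope
```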

On the other hand, over-identification corresponds to the case where the real model would be y_i=\beta_0+\mathbf{x}_1^T \beta_1+\varepsilon_i, but the estimated model is y_i=b_0+ \mathbf{x}_1^T \mathbf{b}_1+\mathbf{x}_2^T \mathbf{b}_2+\eta_i. In this case, the estimate is unbiased, in the sense that \mathbb{E}[\widehat{\mathbf{b}}_1]=\beta_1, but the estimator is not efficient. Later on, we will discuss an effective method for selecting variables (and thus avoiding over-identification).
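A short simulation can make this efficiency loss visible (again a sketch with purely illustrative numbers): across replications, the coefficient on \mathbf{x}_1 is centered on \beta_1 in both regressions, but its variance is larger when the irrelevant variable \mathbf{x}_2 is included.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_rep, beta0, beta1 = 100, 2000, 1.0, 2.0

b_short, b_long = [], []
for _ in range(n_rep):
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # correlated but irrelevant variable
    y = beta0 + beta1 * x1 + rng.normal(size=n)     # true model uses x1 only

    Xs = np.column_stack([np.ones(n), x1])
    Xl = np.column_stack([np.ones(n), x1, x2])
    b_short.append(np.linalg.solve(Xs.T @ Xs, Xs.T @ y)[1])
    b_long.append(np.linalg.solve(Xl.T @ Xl, Xl.T @ y)[1])

print(np.mean(b_short), np.var(b_short))   # unbiased, smaller variance
print(np.mean(b_long), np.var(b_long))     # still unbiased, but larger variance
```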

## From parametric to non-parametric

We can rewrite equation (4) in the form \widehat{\mathbf{y}}=\Pi_X\mathbf{y}, which helps us see the forecast directly as a linear transformation of the observations. More generally, a linear predictor can be obtained by considering m(\mathbf{x})=\mathbf{s}_{\mathbf{x}}^T \mathbf{y}, where \mathbf{s}_{\mathbf{x}} is a weight vector, which depends on \mathbf{x}, interpreted as a smoothing vector. Using the vectors \mathbf{s}_{\mathbf{x}_i}, calculated from the observations \mathbf{x}_i, we obtain a matrix \mathbf{S} of size n\times n, and \widehat{\mathbf{y}}=\mathbf{S}\mathbf{y}. In the case of the linear regression described above, \mathbf{s}_{\mathbf{x}}=\mathbf{X}[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{x}, and in that case \text{trace}(\mathbf{S}) is the number of columns in the \mathbf{X} matrix (the number of explanatory variables). In this context of more general linear predictors, \text{trace}(\mathbf{S}) is often seen as equivalent to the number of parameters (or the complexity, or dimension, of the model), and \nu=n-\text{trace}(\mathbf{S}) is then the number of degrees of freedom (see Ruppert et al., 2003; Simonoff, 1996). The principle of parsimony says that we should minimize this dimension (the trace of the matrix \mathbf{S}) as much as possible. But in the general case, this dimension is more difficult to obtain explicitly.
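As a sanity check, here is a minimal sketch (simulated data, illustrative dimensions) that builds the smoothing matrix \mathbf{S} of the linear regression explicitly and verifies that its trace equals the number of columns of \mathbf{X}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # constant + p regressors
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

S = X @ np.linalg.solve(X.T @ X, X.T)   # smoothing (hat) matrix: y_hat = S y
y_hat = S @ y                           # fitted values as a linear transform of y

print(np.trace(S))       # p + 1 = 4, the number of columns of X
print(n - np.trace(S))   # residual degrees of freedom
```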

The estimator introduced by Nadaraya (1964) and Watson (1964), in the case of a simple non-parametric regression, is also written in this form, since \widehat{m}_h(x)=\mathbf{s}_{x}^T\mathbf{y}=\sum_{i=1}^n \mathbf{s}_{x,i}y_i where \mathbf{s}_{x,i}=\frac{K_h(x-x_i)}{K_h(x-x_1)+\cdots+K_h(x-x_n)}, where K(\cdot) is a kernel function, which assigns a weight that is lower the further x_i is from x, and h>0 is the bandwidth. The introduction of this meta-parameter h is an important issue, as it should be chosen wisely. Using asymptotic expansions, we can show that if X has density f, \text{bias}[\widehat{m}_h(x)]=\mathbb{E}[\widehat{m}_h(x)]-m(x)\sim {h^2}\left(\frac{C_1 }{2}m''(x)+C_2 m'(x)\frac{f'(x)}{f(x)}\right) and \text{Var}[\widehat{m}_h(x)]\sim\frac{C_3}{{nh}}\frac{\sigma(x)}{f(x)} for some constants that can be estimated (see Simonoff (1996) for a discussion). These two functions evolve in opposite directions as h varies, as shown in Figure 1 (where the meta-parameter on the x-axis is, actually, h^{-1}). Keep in mind that we will see a similar graph in the context of machine learning models.
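For concreteness, here is a short sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the simulated regression function and the bandwidth value are illustrative assumptions):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate at x0, with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # kernel weights K_h(x0 - x_i)
    return np.sum(w * y) / np.sum(w)         # weighted average of the y_i

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

grid = np.linspace(0, 1, 11)
m_hat = np.array([nadaraya_watson(x0, x, y, h=0.05) for x0 in grid])
print(np.round(m_hat, 2))   # rough estimate of sin(2*pi*x) on the grid
```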

**Figure 1.** Choice of meta-parameter and the Goldilocks problem: it must not be too large (otherwise there is too much variance), nor too small (otherwise there is too much bias).

The natural idea is then to try to minimize the mean squared error, the MSE, defined as \text{bias}[\widehat{m}_h (x)]^2+\text{Var}[\widehat{m}_h (x)], and then integrate over x, which gives an optimal value for h of the form h^\star=O(n^{-1/5}), and reminds us of Silverman’s rule – see Silverman (1986). In higher dimensions, for continuous \mathbf{x} variables, a multivariate kernel with bandwidth matrix \mathbf{H} can be used, and \mathbb{E}[\widehat{m}_{\mathbf{H}}(\mathbf{x})]\sim m(\mathbf{x})+\frac{C_1}{2}\text{trace}\big(\mathbf{H}^Tm''(\mathbf{x})\mathbf{H}\big)+C_2\frac{m'(\mathbf{x})^T\mathbf{H}\mathbf{H}^T \nabla f(\mathbf{x})}{f(\mathbf{x})}, while \text{Var}[\widehat{m}_{\mathbf{H}}(\mathbf{x})]\sim\frac{C_3}{n~\text{det}(\mathbf{H})}\frac{\sigma(\mathbf{x})}{f(\mathbf{x})}.
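The bias-variance trade-off behind Figure 1 can also be reproduced numerically. The sketch below (same illustrative simulated setup as above, Gaussian kernel, arbitrary bandwidth grid) estimates the squared bias and the variance of \widehat{m}_h(x) at one point by Monte Carlo: for small h the variance dominates, for large h the bias does.

```python
import numpy as np

rng = np.random.default_rng(5)

def nw(x0, x, y, h):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

def mse_at_point(h, x0=0.3, n=100, n_rep=500):
    """Monte-Carlo bias^2 and variance of the NW estimator at x0."""
    est = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.uniform(0, 1, size=n)
        y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
        est[r] = nw(x0, x, y, h)
    m_true = np.sin(2 * np.pi * x0)
    return (est.mean() - m_true) ** 2, est.var()

for h in [0.01, 0.05, 0.2, 0.5]:
    b2, v = mse_at_point(h)
    print(f"h={h}: bias^2={b2:.4f}, variance={v:.4f}, MSE={b2 + v:.4f}")
```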

If \mathbf{H} is a diagonal matrix, with the same term h on the diagonal, then h^\star=O(n^{-1/(4+\dim(\mathbf{x}))}). However, in practice, there will be more interest in the integrated version of the quadratic error, MISE(\widehat{m}_{h})=\mathbb{E}[MSE(\widehat{m}_{h}(X))]=\int MSE(\widehat{m}_{h}(x))dF(x), and we can prove that MISE[\widehat{m}_h]\sim \overbrace{\frac{h^4}{4}\left(\int x^2k(x)dx\right)^2\int\big[m''(x)+2m'(x)\frac{f'(x)}{f(x)}\big]^2dx}^{\text{bias}^2} +\overbrace{\frac{\sigma^2}{nh}\int k^2(x)dx \cdot\int\frac{dx}{f(x)}}^{\text{variance}} as n→∞ and nh→∞. Here we find an asymptotic relationship that again recalls Silverman’s (1986) order of magnitude, h^\star =n^{-\frac{1}{5}}\left(\frac{C_1\int \frac{dx}{f(x)}}{C_2\int \big[m''(x)+2m'(x)\frac{f'(x)}{f(x)}\big]dx}\right)^{\frac{1}{5}}. The main problem here, in practice, is that many of the terms in the expression above are unknown. Machine learning offers computational techniques, where the econometrician is used to looking for asymptotic (mathematical) properties.
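One such computational technique is to bypass the unknown quantities in the MISE expansion and pick h by leave-one-out cross-validation. Here is a minimal sketch (simulated data again; the grid of candidate bandwidths is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def nw_loo(x, y, h):
    """Leave-one-out Nadaraya-Watson predictions for each observation."""
    pred = np.empty(len(x))
    for i in range(len(x)):
        w = np.exp(-0.5 * ((x[i] - x) / h) ** 2)
        w[i] = 0.0                                # drop observation i
        pred[i] = np.sum(w * y) / np.sum(w)
    return pred

candidates = np.linspace(0.01, 0.3, 30)
cv_scores = [np.mean((y - nw_loo(x, y, h)) ** 2) for h in candidates]
h_star = candidates[int(np.argmin(cv_scores))]
print(h_star)   # data-driven bandwidth, no asymptotic constants needed
```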

*To be continued (references mentioned above are online here)…*