Least Squares for Harmonic Analysis


In many engineering and precision measurement situations, we must determine the amplitude of a sinusoidal signal at a known frequency, \(f\). This can be done in several ways with both frequency-domain and time-domain techniques. Here I’d like to review a time-domain method using simple least-squares fitting, which was regularly applied by the Eot-Wash group for precision gravity measurements.

A Quick Least-Squares Introduction

Least Squares Fit for a line

We are concerned with estimating some deterministic parameter from sampled data with random noise. A simple example is fitting a line to some noisy data (figure 1). The observations \(y_i\) look like a linear function of our data \(x_i\). We start by parameterizing this with a model \(\hat{y}_i = w_1 x_i + w_0 + \epsilon \), where \(\epsilon\) is additive noise which we assume to be normally distributed, \(\epsilon \sim \mathcal{N}(0, \sigma)\). With our model in mind, we now want a way to find the parameters that best match the data. One method is least-squares fitting: we look for the parameters \(w_0, w_1\) that minimize the sum of the squared error between our model and all of our data points:

$$ \underset{w_0, w_1}{\text{min}} C =\underset{w_0, w_1}{\text{min}} \sum_{i}|y_i - \hat{y}_i|^2 = \underset{w_0, w_1}{\text{min}} \sum_{i}|y_i-(w_1 x_i + w_0)|^2$$

where we have labeled the square-error, \(C\), as the “cost”. It turns out that in many situations this cost is exactly the quantity one wants to minimize, and the solution provides an unbiased estimate of the model parameters. From calculus, we know that to find an extremum we should find where the derivative is zero. So to minimize the square-error with respect to \(w_0, w_1\), we need to solve the equations

$$ \left.\frac{\partial C}{\partial w_0}\right|_{w_0^*,w_1^*} = 0 = -2\sum_{i}(y_i-(w_1^* x_i + w_0^*)) $$ $$\rightarrow w_1^*\sum_{i} x_i + w_0^*\sum_{i} 1 = \sum_{i}y_i$$ $$ \left.\frac{\partial C}{\partial w_1}\right|_{w_0^*,w_1^*} = 0 = -2\sum_{i}x_i(y_i-(w_1^* x_i + w_0^*))$$ $$\rightarrow w_1^*\sum_{i} x_i^2 + w_0^* \sum_{i}x_i = \sum_{i}x_iy_i$$

Since we already know all of the values of the data and observations \(x_i, y_i\), all of the sums are easily computed. We are left with a simple system of two equations in two unknowns \(w_0, w_1\), which we can solve1.
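As a quick check, here is a minimal numpy sketch of that two-equation solve, using made-up data (the slope, offset, and noise level are arbitrary):

```python
import numpy as np

# Made-up data: a line with slope 2 and offset 1, plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)

# The two normal equations derived above, written as a 2x2 linear system:
#   w1*sum(x)   + w0*N      = sum(y)
#   w1*sum(x^2) + w0*sum(x) = sum(x*y)
A = np.array([[x.sum(), x.size],
              [(x**2).sum(), x.sum()]])
b = np.array([y.sum(), (x * y).sum()])
w1, w0 = np.linalg.solve(A, b)
print(f"slope ~ {w1:.3f}, offset ~ {w0:.3f}")
```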

Polynomials

Now imagine that we look at our data and it appears our observations have a quadratic, cubic, or some other polynomial dependence. We simply adjust our model to include this new dependence with additional model parameters, \(\hat{y}_i = w_0 + w_1 x_i + w_2 x_i^2 + w_3 x_i^3 + …\). We proceed by defining the cost as the sum of square-errors and, to minimize it with respect to our model parameters, we again take the derivative with respect to each parameter and set it equal to zero.

For a quadratic model, we find the following equations:

$$ \left.\frac{\partial C}{\partial w_0}\right|_{w_0^*,w_1^*, w_2^*} = 0 = -2\sum_{i}(y_i-(w_0^* + w_1^* x_i + w_2^* x_i^2)) $$ $$\rightarrow w_0^*\sum_{i} 1 + w_1^*\sum_{i} x_i + w_2^*\sum_{i} x_i^2 = \sum_{i}y_i$$ $$ \left.\frac{\partial C}{\partial w_1}\right|_{w_0^*,w_1^*, w_2^*} = 0 = -2\sum_{i}x_i(y_i-(w_0^* + w_1^* x_i + w_2^* x_i^2))$$ $$\rightarrow w_0^*\sum_{i} x_i + w_1^*\sum_{i} x_i^2 + w_2^* \sum_{i}x_i^3 = \sum_{i}x_iy_i$$ $$ \left.\frac{\partial C}{\partial w_2}\right|_{w_0^*,w_1^*, w_2^*} = 0 = -2\sum_{i}x_i^2(y_i-(w_0^* + w_1^* x_i + w_2^* x_i^2))$$ $$\rightarrow w_0^*\sum_{i} x_i^2 + w_1^*\sum_{i} x_i^3 + w_2^* \sum_{i}x_i^4 = \sum_{i}x_i^2 y_i$$

Again, we can compute all of the sums, and we have three equations with three unknowns. It may be becoming apparent that this will be the case for any number of polynomial terms. This is because our model is linear in each of the fit parameters, \(w_i\), so the derivative of the cost function always results in an equation linear in the parameters.

Well, it turns out that we can simplify this work with some linear algebra. Let’s group our observations and model parameters into vectors, \(y = [y_0, y_1, …, y_{N-1}]^T\), \(w = [w_0, w_1, w_2, …]^T\). Since our model is linear in the parameters, we can write it as \(\hat{y}_i = X_{i j}w_{j} + \bm{\epsilon}_i\) (summing over \(j\)), with each row of the data matrix given by \(X_{i} = [1, x_i, x_i^2, …]\). So to add another degree to our polynomial model, we simply include another column with our \(x\) values raised to the corresponding power.

The cost is identical, but appears a bit simpler. $$ C=\bm{\epsilon}^T \bm{\epsilon} = |y-Xw|^2 $$

So that the minimization problem is: $$ \min_{w}~|y-Xw|^2 $$

And, with a little more algebra, we can compute the derivative of the cost function and find a solution for the model parameters (being a little sloppy in taking the derivative with respect to the vector \(w\))

$$ \frac{\partial C}{\partial w}\bigg\rvert_{\hat{w}} = -2 X^T(y-X\hat{w}) = 0 $$ $$ \rightarrow \hat{w} = (X^{T}X)^{-1}X^{T}y $$

The result is elegant and requires a few simple matrix operations in order to find the best-fit values. As long as we can figure out how to construct our data matrix (sometimes referred to as a design matrix), we can just use some python or Matlab or whatever and compute. Not only that, but remember that we simply require that the model is linear in the fit parameters. That is

$$\hat{y}_i = w_0 f_0(x_i) + w_1 f_1(x_i) + … + \epsilon_i $$

In the polynomial case, the functions \(f_j\) were powers of \(x_i\), but they could be whatever crazy function you want: exponentials, Bessel functions, … Go nuts!
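As a sketch of how little code this takes, here is a numpy version for a quadratic model; the basis list and the fabricated data are just placeholders, and any functions of \(x\) could be swapped in:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 200)
y = 0.5 - 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Each column of the design matrix is one basis function evaluated at x.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]
X = np.column_stack([f(x) for f in basis])

# w_hat = (X^T X)^{-1} X^T y; np.linalg.lstsq solves the same minimization
# without explicitly forming the inverse.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should land near [0.5, -1.5, 2.0]
```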

Least-Squares for Sinusoidal Models

Now that we know how to perform a linear least-squares analysis, let’s suppose we wish to find the amplitude of a sinusoid at frequency \(f\) with unknown phase. Situations like this often arise when making measurements in the presence of low-frequency noise. For instance, imagine you want to measure the light from an LED on a detector in a bright room. Turning on the LED may not change the amount of light on the detector by a noticeable amount, especially if shadows from people or clouds move past or various lights are dimming or flickering. However, if we turn our LED on and off at some known frequency, we only need to pay attention to the changes in light occurring at that frequency, and we can essentially ignore the changes in background light occurring at other rates. This is often referred to as amplitude modulation and lock-in detection2.

So let’s generate some fake observations of a sinusoidally-varying position signal, \(y_i(t_i)\), with frequency \(f\) buried in some noise. The naive model to determine the amplitude and phase is $$ \tilde{y}_i = A_0 \sin(2 \pi f t_i + \phi_0) + w_0 + \epsilon_i $$ where our model parameters (\(A_0, \phi_0, w_0\)) are the unknown amplitude and phase, along with a constant offset from \(0\). However, this model is non-linear in the fit parameter \(\phi_0\)! And it appears that our simple least-squares methods are no longer applicable. But, with a clever change of variables, we can reformulate it as a linear model: $$ \hat{y}_i = w_1 \cos(2 \pi f t_i) + w_2 \sin(2 \pi f t_i) + w_0 + \epsilon_i $$ where now our model parameters are the corresponding cosine and sine coefficients. The overall amplitude and phase can then be recovered from $$ A_0 = \sqrt{w_1^2 + w_2^2}, ~~ \phi_0 = \tan^{-1}(w_1/w_2) ~.$$

We construct the design matrix with rows given as \(X_i = [1, \cos(\omega t_i), \sin(\omega t_i)] \), where we have substituted \(\omega = 2\pi f\), and the solution for the model parameters follows in the same manner as before.
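Here is a minimal sketch of that fit with fabricated data; the frequency, sample rate, amplitude, and noise level are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
f, fs = 5.0, 200.0                      # signal frequency and sample rate [Hz]
t = np.arange(0.0, 10.0, 1.0 / fs)      # 10 s of samples
A_true, phi_true = 0.7, 0.3
y = A_true * np.sin(2 * np.pi * f * t + phi_true) + 0.1 + rng.normal(0.0, 1.0, t.size)

# Design matrix with a constant, cosine, and sine column at the known frequency.
w_ang = 2 * np.pi * f
X = np.column_stack([np.ones_like(t), np.cos(w_ang * t), np.sin(w_ang * t)])
w0, w1, w2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Recover the amplitude and phase of A0*sin(w t + phi0).
A_fit = np.hypot(w1, w2)
phi_fit = np.arctan2(w1, w2)
print(f"A0 ~ {A_fit:.3f}, phi0 ~ {phi_fit:.3f}")
```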

This technique can be expanded to include other harmonics of the sinusoid in a similar manner to the way we included more polynomial terms before. We simply increase the size of our design matrix to include the additional basis functions. So if we cared about an oscillation at frequency \(f\) and its third and fifth harmonics, each row of our design matrix would now be $$ X_i = [1, \cos(\omega t_i), \sin(\omega t_i), \cos(3\omega t_i), \sin(3\omega t_i), \cos(5\omega t_i), \sin(5\omega t_i)]. $$

And we could include more and more terms so long as there are more data points than fit parameters.
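A small helper for building such a matrix might look like the following sketch; the function name and the default harmonic list are my own choices:

```python
import numpy as np

def harmonic_design_matrix(t, f, harmonics=(1, 3, 5), constant=True):
    """Columns: optional constant, then cos/sin pairs at each harmonic of f."""
    w_ang = 2 * np.pi * f
    cols = [np.ones_like(t)] if constant else []
    for n in harmonics:
        cols.append(np.cos(n * w_ang * t))
        cols.append(np.sin(n * w_ang * t))
    return np.column_stack(cols)

# Usage: X = harmonic_design_matrix(t, f); w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```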

Weighted Least-Squares and Parameter Covariance

In some cases, we may have some a priori knowledge of our measurements and may wish to weight some elements in the fit more than others. We may consider this in the same way as our previous problem, by including a weighting matrix \(R\) in our cost3 $$ C = \bm{\epsilon}^TR\bm{\epsilon} = (y-Xw)^TR(y-Xw). $$

We will assume that \(R\) is symmetric, \(R^T = R\), so that again, we may minimize the cost with respect to the model parameters $$ \frac{\partial C}{\partial w}\bigg\rvert_{\hat{w}} = -2X^T R(y-X \hat{w}) = 0 $$ $$ \rightarrow \hat{w} = (X^T R X)^{-1}X^T R y ~.$$
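A direct transcription of that estimator, assuming a symmetric weight matrix \(R\) is already in hand:

```python
import numpy as np

def weighted_least_squares(X, y, R):
    """Solve w_hat = (X^T R X)^{-1} X^T R y without forming the inverse explicitly."""
    XtR = X.T @ R
    return np.linalg.solve(XtR @ X, XtR @ y)
```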

Let’s now consider the covariance of our estimate of the model parameters, $$ \Sigma_w = \mathbb{E}\left[(\hat{w}-w)(\hat{w}-w)^T\right] ~.$$ First, note that with \(y = Xw + \epsilon\), $$ \hat{w}-w = (X^T R X)^{-1} X^T R y - w $$ $$ = (X^T R X)^{-1} X^T R (Xw+\epsilon) - w $$ $$ = (X^T R X)^{-1} X^T R\epsilon ~. $$

The covariance is then $$ \Sigma_w = (X^T R X)^{-1} X^T R~\mathbb{E}\left[\epsilon \epsilon^T\right] R X (X^T R X)^{-1} ~. $$ We now see that if we choose our weights to be the inverse covariance matrix of the data, \(R = \Sigma_X^{-1}\) with \(\Sigma_X = \mathbb{E}\left[\epsilon \epsilon^T\right]\), the covariance estimate greatly simplifies to $$ \Sigma_w = (X^T \Sigma_X^{-1}X)^{-1} ~.$$ This can be useful for estimating fit parameters from correlated data. It also becomes obvious that if each data point has identical, independent uncertainty, \(\Sigma_X = \text{Var}[\epsilon_i]I = \sigma^2 I\), then the parameter covariance is $$ \Sigma_w = \sigma^2 (X^T X)^{-1} ~.$$
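As a quick Monte Carlo sanity check of that last pair of expressions, here is a sketch with fabricated, exponentially correlated noise (the covariance model, signal, and sizes are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
t = np.arange(1, N + 1) * 0.01
X = np.column_stack([np.ones(N), np.cos(2 * np.pi * 5 * t), np.sin(2 * np.pi * 5 * t)])
w_true = np.array([0.1, 0.7, -0.3])

# Fabricated data covariance: variance 0.25 with correlations decaying as 0.8^|i-j|.
idx = np.arange(N)
Sigma_X = 0.25 * 0.8 ** np.abs(np.subtract.outer(idx, idx))
R = np.linalg.inv(Sigma_X)

# Predicted parameter covariance when R = Sigma_X^{-1}.
Sigma_w = np.linalg.inv(X.T @ R @ X)

# Empirical covariance of the weighted estimates over many noise realizations.
L = np.linalg.cholesky(Sigma_X)
w_hats = np.array([np.linalg.solve(X.T @ R @ X, X.T @ R @ (X @ w_true + L @ rng.normal(size=N)))
                   for _ in range(2000)])
print(np.diag(Sigma_w))     # predicted parameter variances
print(w_hats.var(axis=0))   # should agree to within Monte Carlo error
```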

Orthogonality

From our calculations of the parameter variance-covariance matrix, we see that the covariance terms disappear when the columns of the data matrix are orthogonal, \(X^T X = c I\), for some constant \(c\). If we choose an orthogonal set of basis functions for our model, the parameter estimates will be uncorrelated.

We consider the sinusoidal basis of harmonics of some frequency, ignoring the addition of a constant term for now. It’s generally a good idea to have a consistent sample rate, \(f_s = 1/\Delta t\), and we’ll assume we have collected data at times \([\Delta t, 2\Delta t, \dots, N\Delta t]\). Then the lowest frequency we might be interested in is \(\omega = 2\pi/(N\Delta t)\). The highest frequency we can expect to resolve is a signal that oscillates up and then down on alternating samples, known as the Nyquist frequency, \(f_{Nyq} = 1/(2\Delta t) = f_s/2\), which corresponds to the \(m = N/2\) harmonic of \(\omega\). The complete set of harmonics \([1, 2, …, m=N/2]\) of \(\omega\) then leads to the data matrix

$$ X = \begin{pmatrix} \cos(\omega \Delta t) & \dots & \cos(m\omega \Delta t) & \sin(\omega \Delta t) & \dots & \sin(m\omega \Delta t) \\ \vdots & & & & \ddots & \vdots \\ \cos(\omega N \Delta t) & \dots &\cos(m\omega N\Delta t) & \sin(\omega N\Delta t) & \dots & \sin(m\omega N \Delta t) \end{pmatrix} .$$

Then

$$ X^T X = \begin{pmatrix} \sum\cos_1\cos_1 & \dots & \sum\cos_1\cos_m &\sum\cos_1\sin_1 & \dots &\sum\cos_1\sin_m \\ \vdots & \ddots & & & & \vdots \\ \sum\cos_m\cos_1 & \dots & \sum\cos_m\cos_m &\sum\cos_m\sin_1 & \dots &\sum\cos_m\sin_m \\ \vdots & & & \ddots & & \vdots \\ \sum\cos_1\sin_m & \dots & \sum\cos_m\sin_m &\sum\sin_1\sin_m & \dots &\sum\sin_m\sin_m \\ \end{pmatrix} $$ where we denote $$ \sum\cos_j\cos_k = \sum_{q=1}^{N}\cos(j\omega q \Delta t)\cos(k\omega q \Delta t) ~.$$

It is relatively simple to show that for \(0 < j,k < N/2\) $$ \sum_{q=1}^{N}\cos(j\omega q \Delta t)\cos(k\omega q \Delta t) = \sum_{q=1}^{N}\sin(j\omega q \Delta t)\sin(k\omega q \Delta t) = \frac{N}{2}\delta_{j,k} ~,$$ and $$ \sum_{q=1}^{N}\cos(j\omega q \Delta t)\sin(k\omega q \Delta t) = 0 ~.$$ (The Nyquist harmonic \(m=N/2\) is a special case: its sine column vanishes entirely and its cosine sum gives \(N\) rather than \(N/2\), so we leave it out in what follows.)

So that $$ (X^T X)^{-1} = \frac{2}{N} I~. $$

You also may have noticed that substituting this into our least-squares fit is identical to the discrete Fourier transform (DFT), fitting to pairs of out-of-phase sinusoids (sines and cosines) rather than complex exponentials. It is also easy to see that a constant term is the same as a cosine term at the \(0^{th}\) harmonic. The same identities hold, except now $$ \sum_{q=1}^{N}\cos(0 \cdot q \Delta t)\cos(0 \cdot q \Delta t) = \sum_{q=1}^{N} 1 = N~.$$
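Here is a quick numerical check of both the orthogonality claim and the connection to the DFT; \(N\), \(\Delta t\), and the test signal are arbitrary, and the Nyquist harmonic is left out as noted above:

```python
import numpy as np

N, dt = 64, 0.1
t = np.arange(1, N + 1) * dt            # samples at dt, 2*dt, ..., N*dt
w_ang = 2 * np.pi / (N * dt)            # fundamental angular frequency

m = N // 2 - 1                          # harmonics 1 .. N/2 - 1 (no Nyquist term)
X = np.column_stack([np.cos(k * w_ang * t) for k in range(1, m + 1)] +
                    [np.sin(k * w_ang * t) for k in range(1, m + 1)])
print(np.allclose(X.T @ X, (N / 2) * np.eye(2 * m)))     # True: orthogonal columns

# Least-squares harmonic amplitudes match the (scaled) DFT magnitudes.
y = 0.3 * np.cos(3 * w_ang * t) - 0.8 * np.sin(7 * w_ang * t)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
amp_ls = np.hypot(w_hat[:m], w_hat[m:])
amp_dft = 2.0 / N * np.abs(np.fft.rfft(y)[1:m + 1])
print(np.allclose(amp_ls, amp_dft))                      # True
```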

Another example of an orthogonal set of basis functions is the Legendre polynomials, an alternative to the polynomial basis used in our earlier example.

Filter Response and Correction

Sometimes we may be interested in a filtered version of our observation data. In particular, we will consider finite impulse response (FIR) filters, where the output is a linear combination of the input at a finite number of time delays4.

As an example we consider observation data consisting of the position of some inertial sensor. However, our model requires fitting the force on the sensor. We assume we know the mass \(M\) and the spring constant \(K\) of the inertial sensor. Then a derived version of the force \(F_i\) from the observations \(y_i\) is $$ F_i = M\frac{(y_{i-2}+y_{i}-2y_{i-1})}{2\Delta T^2} + K y_{i-1} $$ $$ = c_1 y_{i} + c_1 y_{i-2} + c_2 y_{i-1}, ~~ c_1 = \frac{M}{2\Delta T^2}, c_2 = K-\frac{M}{\Delta T^2}$$ where we have used a backward symmetric difference to estimate the acceleration. However, this now correlates different observations of \(F_i\). If we assume the position observations have uncorrelated noise \(Cov[y_i y_j] = \sigma^2 \delta_{i j}\), then we find $$ \begin{aligned} Cov[F_i F_j] &= Cov[(c_1 y_{i} + c_1 y_{i-2} + c_2 y_{i-1})(c_1 y_{j} + c_1 y_{j-2} + c_2 y_{j-1})] \cr &=~~ c_1^2 (Cov[y_i y_j] + Cov[y_i y_{j-2}] + Cov[y_{i-2} y_j] + Cov[y_{i-2}y_{j-2}]) \cr & + c_2^2 Cov[y_{i-1}y_{j-1}] \cr & + c_1 c_2(Cov[y_i y_{j-1}] + Cov[y_{i-2}y_{j-1}] + Cov[y_{i-1}y_j] + Cov[y_{i-1}y_{j-2}]) \cr &= \sigma^2\left( c_1^2 (2\delta_{i j} + \delta_{i,j-2} + \delta_{i-2, j}) + c_2^2 \delta_{i j} + c_1 c_2(2\delta_{i,j-1} + 2\delta_{i-1, j})\right) \end{aligned} $$

So we see that the covariance matrix is banded, with a diagonal term plus off-diagonal terms at lags one and two. We can feed this into our weighted least-squares calculation to account for the added correlation of the derived force.
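A sketch of assembling that banded covariance matrix; the mass, spring constant, time step, and noise level are placeholder values:

```python
import numpy as np

# Placeholder sensor parameters, sampling, and position noise level.
M, K, dT, sigma = 0.5, 20.0, 0.01, 1e-3
c1 = M / (2 * dT**2)
c2 = K - M / dT**2

n = 200                                  # number of derived force samples
lag = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Banded covariance of the derived force from the expression above:
# diagonal 2*c1^2 + c2^2, lag-one band 2*c1*c2, lag-two band c1^2 (times sigma^2).
Sigma_F = sigma**2 * ((2 * c1**2 + c2**2) * (lag == 0)
                      + 2 * c1 * c2 * (lag == 1)
                      + c1**2 * (lag == 2))

# R = Sigma_F^{-1} then drops into the weighted least-squares solution as before.
R = np.linalg.inv(Sigma_F)
```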

As a second example, we may have observations contaminated by an offset drifting with a random walk, \(\epsilon_i= \sum_{q=1}^{i} \eta_q\), with \(\eta_i\sim\mathcal{N}(0, \sigma) \): $$ \hat{y}_i = y_i + \epsilon_i ~. $$ A simple difference of adjacent observations is again a filter, and leads to white noise: $$ \hat{z}_i = \hat{y}_i - \hat{y}_{i-1} = y_i - y_{i-1} + \eta_i ~.$$ However, let’s now assume we are interested in fitting a high-frequency sinusoidal signal in \(y_i\). Fitting \(\hat{z}_i\) has made our noise uncorrelated, but the differencing filter also changes the amplitude and phase of the sinusoid at the fit frequency. We proceed by fitting \(\hat{z}_i\) for the sinusoid and then correct the fitted amplitude and phase using the filter response.

Take the frequency response of the original observations to be \(y_n = y(\omega)e^{i\omega n\Delta T}\). Then the response of \(z_n\) is $$ \begin{aligned} z_n &= y_n-y_{n-1} \cr &= y(\omega)(e^{i\omega n\Delta T} - e^{i\omega (n-1)\Delta T}) \cr &= y(\omega)(1-e^{-i\omega \Delta T})e^{i\omega n \Delta T} \cr &= 2ie^{-i\omega \Delta T/2}\sin(\omega \Delta T/2)y(\omega)e^{i\omega n \Delta T} \cr &= z(\omega)y(\omega)e^{i\omega n \Delta T}~, \end{aligned} $$ where the filter response is $$ z(\omega) = 2ie^{-i\omega \Delta T/2}\sin(\omega \Delta T/2)~. $$

Taking the modulus and complex angle of \(z(\omega)\) relates the amplitude and phase estimated from a sinusoidal fit to \(z\) at frequency \(\omega\) back to those of a fit to \(y\).
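In code, a sketch of that correction; the frequency, time step, and fitted coefficients below are placeholders standing in for the output of a fit to \(z\):

```python
import numpy as np

def difference_response(w_ang, dT):
    """z(w) = 2i * exp(-i*w*dT/2) * sin(w*dT/2), the response of z_n = y_n - y_{n-1}."""
    return 2j * np.exp(-1j * w_ang * dT / 2) * np.sin(w_ang * dT / 2)

# Placeholder cosine/sine coefficients from a least-squares fit to z at frequency f.
f, dT = 5.0, 0.01
w1_z, w2_z = 0.12, -0.34
A_z, phi_z = np.hypot(w1_z, w2_z), np.arctan2(w1_z, w2_z)

# Undo the filter's gain and phase shift to recover the amplitude and phase in y.
zw = difference_response(2 * np.pi * f, dT)
A_y = A_z / np.abs(zw)
phi_y = phi_z - np.angle(zw)
print(f"A_y ~ {A_y:.3f}, phi_y ~ {phi_y:.3f}")
```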

Conclusions

Least-squares analysis is a simple yet effective tool for harmonic analysis, amounting to one matrix inversion and a couple of matrix multiplications to find the optimal fit parameters. Least-squares fits additionally give the covariance structure of the fit parameters given knowledge of the observation errors. If the basis functions for the fit are orthogonal, the covariance matrix simplifies to something proportional to the identity matrix. It is also possible to analytically account for the effects of correlation or attenuation from FIR filters.


  1. If we only had one point in \((x, y)\), then this system is underdetermined and there are an infinite number of choices for \(w_0, w_1\) that would parameterize a line going right through the point. We will ignore this subtlety, but more can be found regarding a suitable solution under Moore-Penrose Inverse ↩︎

  2. A nice detailed explanation of lock-in amplifiers can be found in the manual for the Stanford Research Systems devices ↩︎

  3. https://en.wikipedia.org/wiki/Weighted_least_squares ↩︎

  4. https://en.wikipedia.org/wiki/Finite_impulse_response ↩︎