Let’s say we have some data. Maybe it’s a spectrum, maybe it’s time series photometry. For this scenario, let’s say we have *n** *measurements of some parameter and that we know each *x_{i} *value for our data (frequencies, times) very precisely (here, the

*i*subscript refers to an individual measurement). We also know the measurement values (the

*y*at each

_{i}*x*), and have a good idea of the uncertainty (

_{i}*σ*) associated with each data point. If we want to find the best-fitting model to these observations*, then the standard practice is to evaluate the model at our data points, measure the residual (

_{i}*r*

*) between our data and the model, and then calculate a*

_{i}*χ*value by evaluating the sum of (

^{2}*r*)

_{i}/σ_{i}^{2}. The model that minimizes this value is our best-fitting model.

But what are we really doing here? While it may not be evident from the math above, the calculation of *χ ^{2}* has its roots in matrix manipulation. In fact, we can show through a few matrix operations why

*χ*works for finding the best fit to a system of equations. You may have seen this before if you have taken a linear algebra course! If our model is linear, then we can write that

^{2}*b*where

**=****A**x,**A**

**is a set of linear equations,**

*b*our measurements and

*x*our model parameters. We then have a matrix that looks like this:

You might remember from your linear algebra course that, to find the best-fitting value for **x**, we want to minimize (*b*-**A***x*)^{T}(*b*-**A***x*). (If you don’t remember this, you don’t have to take my word for it: check out this pdf for a quick refresher.) In the case where each value of *b* also has a known uncertainty, then we instead want to minimize (*b*-**A***x*)^{T}**C**^{-1}(*b*-**A***x*), where **C** is a diagonal matrix, with the squares of the individual uncertainties (the *variances*) along the diagonal.

That looks like a lot of math. But in this case, (*b*-**A***x*) is our residual vector! To simplify our lives, let’s call it *r. *In this case, the value we want to minimize is (*r*)^{T}**C**^{-1}(*r*) ≡ ∑ (*r _{i} /σ_{i}*)

^{2}. This is exactly chi-squared! Evaluating

*χ*is, in this case, identical to performing these matrix operations.

^{2 }There’s one assumption we’ve made throughout which makes the matrix math identical to evaluating chi-squared: we assumed that when we collected our data, the individual *y _{i}* observations were independent. That is, each

*y*depends only on its

_{i}*x*, and not on other values of

_{i}*y*. This is often a reasonable assumption, but not always! For example, a light curve of a star might contain contain correlated noise caused by the star’s rotation or stellar activity. Quasar light curves also might contain correlated noise due to slow changes in the accretion rate of the central black hole. Sometimes, the correlated noise isn’t astrophysical, but instrumental: observations from

_{i}*Spitzer*, for example, often contain correlated noise as the spacecraft moves slightly over the course of an observation.

Correlated noise can significantly affect our results if not properly accounted for. In general, correlated noise causes our error bars to be underestimated. This makes sense! To understand why, let’s look at some real (fake) data. Here, I created some data by randomly drawing from a model that has a positive slope and adding correlated noise. The data are shown in the figure to the left.

If our noise is uncorrelated, and a few consecutive data points are all significantly above our model (like around* x*=-1 in our example), then they will all contribute to a calculation of a large value for *χ ^{2}*. In this case, a model that provides a larger value near

*x*=-1 will be heavily favored over the “truth.” Really, these points are above our model

*because all the other points around them are above our model.*Therefore,

*χ*is insufficient to fully describe our observations. Because

^{2 }*y*and

_{i}*depend on each other (as well as*

*y*_{i+1}*y*, and all the other nearby data points), each residual should be trusted less than the full

_{i-1},*y*_{i+2}*r*weight we discussed above.

_{i}/σ_{i}

To see this graphically, let’s try to fit these fake observations to a straight line. In our fitting, let’s treat the observations as if they were perfectly uncorrelated, and attempt to minimize *χ ^{2}*. The result of this fit is shown in the figure to the right. The red shaded region shows our 1

*σ*range on the line that describes our data. As we predicted, the model predicts a significantly larger value near

*x*=-1 than the truth. In this case, we have a very tight measurement of what the slope of the line that describes our data should be. We would expect that if the shaded region were representative of the actual acceptable range of parameters that fit our data, that the truth (our black line) would be included in the shaded region. Unfortunately, the constraints do not match reality at all, as only significantly shallower slopes than the truth are predicted. In this case, not accounting for correlated noise means we have the wrong answer! It also means that we think we did a great job in our measurements. We have very small error bars in our fit; the red shaded region is very narrow, so we think the region in which the truth could lie is tightly constrained.

Luckily, there are techniques to account for correlated noise. To do this correctly, we need to leave behind our simple chi-square calculation and use the power of **C**. Not the ocean, but our uncertainty matrix from earlier! Before, we said if our noise is perfectly uncorrelated, or white, then *χ ^{2 }*is mathematically the same as evaluating (

*r*)

^{T}

**C**

^{-1}(

*r*).

**C**‘s official name is the “covariance matrix.” Each element in the covariance matrix shows the relation between two observations

*y*and

_{i}*. If the two observations depend strongly on each other,*

*y*_{j }*C*is relatively large; if the two observations have no dependence on each other,

_{ij}*C*is zero. If the two observations are the same (

_{ij}*i=j*), then

*C*

_{ii}is the variance

*σ*

_{i}^{2}.

Therefore, by modifying the covariance matrix **C** and solving (*r*)^{T}**C**^{-1}(*r*) instead of just summing (*r _{i} /σ_{i}*)

^{2}, we can account for correlated noise in our model. In the figure to the left, I’ve modified my uncertainty calculation to account for my updated covariance matrix, and again attempted to find the best-fitting line to these fake data. The region of acceptable values is shown in blue. This time, we find that the truth is included in our 1

*σ*uncertainties! Our error bars are larger, but these are physically meaningful: the original error bars (in red) didn’t accurately describe the uncertainty in our model, while these do. With this fit, we can be satisfied that we have the most accurate result the data can provide.

Of course, in this example I knew what my correlated noise *should *be, and was able to modify **C** to exactly represent this noise model. In the real world, we don’t always know exactly what our noise model should be, so we don’t know exactly what **C** should look like. This makes the problem more difficult, but not intractable! There are two primary techniques used to treat correlated noise in real life:

1) Estimate **C** with some other data, and use this covariance matrix instead of a perfectly diagonal one. This is useful when you have plenty of data from which a reasonable estimate of **C** can be drawn. An example of this might be *Kepler* data, when looking at planet transits. The transit of an Earth-like planet occurs once per year, for 13 hours, leaving 364.75 days of stellar and instrumental noise between each transit. This data can be used to probe the noise and its correlations for the star in question, which can then be applied to each transit event.

2) Assume a functional form of **C** and allow the parameters that define this function to vary. Maybe you know that a certain point is going to be influenced by points that are very close to it only. Maybe you’re looking at a rotating star and you know the noise is going to be correlated with noise exactly one stellar rotation period apart. In these cases, you can define **C** as a function of a few other parameters and fit these parameters at the same time as you fit your astrophysical model. The benefit of this technique is that you can constrain the correlation in your noise, *as long as the function that you have assumed is representative of the true noise in your data! *This technique is the basis of Gaussian processes, a popular technique for dealing with correlated noise in data.

Correlated noise can be tricky, but it doesn’t have to be scary! There are existing techniques to deal with it, and even packages to make these techniques accessible to newcomers. Understanding when to go beyond *χ ^{2 }*is necessary to ensure that your uncertainties are correctly estimated; techniques like the ones above should be a part of any observational astronomer’s toolkit.

*The author would like to thank Dan Foreman-Mackey (NYU) for use of the figures in this post.*

*Here, I’m assuming we know what the functional form of the model should be. Model selection is an entirely different topic, and could be the topic of a separate blog post/textbook.

#### Ben Montet

#### Latest posts by Ben Montet (see all)

- Newer Horizons Beyond Pluto – October 21, 2014
- Welcome to Mars, MAVEN and MOM! – September 23, 2014
- Hide and Seek Planets – July 29, 2014

## Discussion

## No comments yet.