In June of this year, I had the good fortune to attend the “Modern Statistical and Computational Methods for Analysis of Kepler Data” workshop at the Statistical and Applied Mathematical Sciences Institute in Research Triangle Park, North Carolina (shorthand: exoSAMSI. An exoplanet conference at SAMSI. Get it?). SAMSI’s goal is “to forge a synthesis of the statistical sciences and the applied mathematical sciences with disciplinary science to confront the very hardest and most important data- and model-driven scientific challenges.” This workshop brought together 30 scientists (including 13 grad students) from astronomy and statistics to discuss some of the big questions raised by Kepler and how advanced statistical techniques might be used to answer them. After two days of talks, we split into three working groups and spent the next three weeks trying to answer those questions. It was a lot of fun! We don’t usually cover conference proceedings in great detail here at Astrobites, but this wasn’t really a typical conference (two days of talks and 13 days of working?!). So, over the next two days we’ll take a closer look at what happens at workshops like this one and see what it looks like when statisticians and astronomers hang out.
Model selection and comparison: Ben Nelson
I arrived at SAMSI with a very specific goal in mind: I wanted to learn how to compare different models of exoplanet systems in a quantifiable manner for a particular set of observations (transit or radial velocity), by evaluating each model’s marginal likelihood. The marginal likelihood answers the question “given a model, how probable are the data I have, averaged over all of the model’s possible parameter values?”
Previous Astrobites have addressed applications of Bayesian statistics to astrophysical problems, especially Markov chain Monte Carlo (MCMC) algorithms. MCMC is a great method for estimating the uncertainties of your model parameters once you have selected a model to study. But which model should you choose? For example, in the problem I’m faced with, I have a set of Doppler radial velocity observations of a star with 3 known planets. There may be an additional planet, but its signal is weak. By calculating the marginal likelihoods of the 3-planet and 4-planet models, I get a quantifiable way of determining which model better represents the observations. To calculate a marginal likelihood, we marginalize (that is, integrate) the likelihood, weighted by the prior, over all of the model’s parameters. The problem is that this is hard: our distributions have many parameters and do not generally have a nice analytic form.
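To make the comparison step concrete, here is a minimal sketch of turning two marginal likelihoods into a Bayes factor. The log-marginal-likelihood values are made up for illustration; in a real analysis they would come out of the hard integral described above.

```python
import numpy as np

# Hypothetical log marginal likelihoods for the two competing models
# (log values, because marginal likelihoods are often tiny numbers).
logZ_3planet = -1520.4   # made-up number for illustration
logZ_4planet = -1516.1   # made-up number for illustration

# The Bayes factor compares the two models; with equal prior odds it
# equals the posterior odds of the 4-planet model over the 3-planet one.
bayes_factor = np.exp(logZ_4planet - logZ_3planet)
print(bayes_factor)
```

A Bayes factor of this size (roughly 70) would count as “strong” evidence for the 4-planet model on the commonly used Jeffreys scale.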
To learn how to calculate a marginal likelihood, I wanted to interact with statisticians. SAMSI meetings are a great opportunity to become familiar with more modern and efficient machinery developed by statisticians, which benefits the entire astronomy community. In my discussions with statisticians, one particular method stood out: Monte Carlo integration.
In Monte Carlo integration, instead of performing the integral analytically or using a numerical technique like the trapezoidal rule, you draw random points from a distribution over the parameter space. By evaluating the likelihood at many such points and averaging the results, you are effectively integrating over the model parameters. You would never do this in a one-dimensional case, where the trapezoidal rule is far more efficient, but when you have to integrate over 5 or more dimensions this technique can be very fast: the cost of grid-based methods grows exponentially with the number of dimensions, while the Monte Carlo error shrinks like 1/√N regardless of dimension.
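As a minimal illustration of the idea (a toy 5-dimensional Gaussian likelihood with a uniform prior, not an actual radial velocity model), estimating a marginal likelihood by Monte Carlo integration might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (hypothetical): a 5-parameter Gaussian likelihood centered
# at zero, with a uniform prior on the box [-5, 5] in every dimension.
ndim = 5

def log_likelihood(theta):
    # Product of standard normals, one per parameter
    return -0.5 * np.sum(theta ** 2, axis=-1) - 0.5 * ndim * np.log(2 * np.pi)

# Monte Carlo integration: draw parameter values from the prior and
# average the likelihood there.  Z ≈ (1/N) Σ_i L(θ_i), with θ_i ~ prior.
n_samples = 200_000
samples = rng.uniform(-5.0, 5.0, size=(n_samples, ndim))
Z = np.exp(log_likelihood(samples)).mean()
print(Z)
```

In this toy case the answer can be checked by hand: the Gaussian integrates to one and the uniform prior density is (1/10) per dimension, so the true marginal likelihood is 10⁻⁵.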
I presented my marginal likelihood problem and was later approached by a grad student in statistics who said he had an algorithm that did exactly what I needed (via a method called importance sampling). It was one of those rare moments when “interdisciplinary magic” arose from a simple discussion.
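I can’t reproduce that student’s algorithm here, but the basic idea of importance sampling is to draw from a proposal distribution concentrated where the integrand is large, then reweight each draw by the ratio of the true integrand to the proposal density. A toy sketch on the same kind of 5-dimensional problem as above (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
ndim = 5

def log_likelihood(theta):
    # Same toy Gaussian likelihood as before
    return -0.5 * np.sum(theta ** 2, axis=-1) - 0.5 * ndim * np.log(2 * np.pi)

log_prior = -ndim * np.log(10.0)  # uniform prior density on the box [-5, 5]^5

# Importance sampling: draw from a proposal q concentrated where the
# integrand is large (a Gaussian slightly wider than the likelihood)
# and weight each draw by likelihood * prior / proposal.
n_samples = 50_000
sigma_q = 1.2
samples = rng.normal(0.0, sigma_q, size=(n_samples, ndim))
log_q = (-0.5 * np.sum((samples / sigma_q) ** 2, axis=-1)
         - ndim * np.log(sigma_q * np.sqrt(2.0 * np.pi)))
log_w = log_likelihood(samples) + log_prior - log_q

# Draws that fall outside the prior's support get zero weight
in_box = np.all(np.abs(samples) <= 5.0, axis=-1)
weights = np.where(in_box, np.exp(log_w), 0.0)
Z = weights.mean()
print(Z)
```

Because almost every draw lands where the likelihood is appreciable, this estimate has far less scatter than sampling directly from the prior, at the same number of likelihood evaluations.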
While the meeting specialized in working with Kepler photometry, the general methodology behind these techniques can be applied to a myriad of topics. Modern model selection techniques certainly provide us an opportunity to tackle some ambiguous Kepler systems to determine the probability of various explanations for our observations.
Hierarchical Models: Megan Shabram
The Kepler mission has provided us with 5,000 candidate planets, enough to analyze the population statistically. Efforts are underway to use this unprecedented sample to infer population distributions and occurrence rates for the Kepler sample. Understanding the distribution of exoplanet sizes and orbital parameters can reveal the end states of complicated planet formation processes whose relative importance we are still trying to disentangle. Given a sample of planets, the simplest way to look at the distribution of a given parameter is to make a histogram of the best-fit parameter values for each planet. This approach is limited in that it fails to account for selection effects and other biases introduced by the data acquisition process. To address this limitation, astronomers and statisticians at the 2013 SAMSI meeting worked together to begin analyzing the Kepler sample with a rigorous population analysis method known as hierarchical Bayesian analysis. A hierarchical Bayesian model is a robust way to estimate a population distribution, and it offers many benefits. For instance, it allows us to incorporate each observation’s unique uncertainty properties, propagating them into the estimation of the population parameters. Additionally, selection effects can be accounted for directly in the inference. Another major benefit is the ability to use information about individual objects to make inferences about other objects.
A hierarchical Bayesian model consists of multiple levels, with each level described by the information at that level and the levels below it. At the bottom level are the most basic observations or data, in this case exoplanet orbital parameters, planet properties, and host star properties. The mid-level consists of a physically motivated analysis model (usually a parametric model) whose form is chosen to approximate each parameter’s true distribution; in this example, that might be the distribution of exoplanet orbital properties consistent with the data at the bottom level. The top level, referred to as the “hyperpriors,” consists of distributions for the parameters that describe the mid-level model. The hyperprior distributions might be chosen so that they span the appropriate parameter space for the particular problem at hand. The parameters that describe the population would then be determined by the posterior mode and credible interval returned by the hierarchical model. While using the data to infer the population parameters, the model simultaneously calculates posteriors for each observation used in the model. In practice, these per-object estimates are “shrunk” toward the population distribution, which typically brings them closer to the true values than the raw measurements for each of the observed systems used in the hierarchical inference. To learn more about hierarchical Bayesian analysis, check out this book.
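As a toy version of this three-level setup (a normal-normal model with made-up numbers, far simpler than a full Kepler analysis), the shrinkage effect can be demonstrated in a few lines:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bottom level (simulated data): each planet has a true parameter x_i
# drawn from a population N(mu, tau^2); we observe y_i = x_i + noise
# with a known, object-specific uncertainty sigma_i.
mu_true, tau_true, n_planets = 0.3, 0.05, 200
x_true = rng.normal(mu_true, tau_true, n_planets)
sigma = rng.uniform(0.05, 0.2, n_planets)        # heteroscedastic errors
y = x_true + rng.normal(0.0, sigma)

# Mid/top level: fit the population parameters (mu, tau) by maximizing
# the marginal likelihood y_i ~ N(mu, tau^2 + sigma_i^2) over a grid in
# tau (a simple stand-in for a full hierarchical MCMC).
def profile_fit(tau):
    var = tau ** 2 + sigma ** 2
    mu_hat = np.sum(y / var) / np.sum(1.0 / var)  # precision-weighted mean
    return 0.5 * np.sum(np.log(var) + (y - mu_hat) ** 2 / var), mu_hat

taus = np.linspace(0.01, 0.2, 200)
losses, mus = zip(*(profile_fit(t) for t in taus))
tau_hat, mu_hat = taus[np.argmin(losses)], mus[np.argmin(losses)]

# Back to the bottom level: the conjugate posterior for each planet
# "shrinks" the raw measurement toward the population mean, with the
# weight set by the relative precisions.
w = (1.0 / sigma ** 2) / (1.0 / sigma ** 2 + 1.0 / tau_hat ** 2)
x_post = w * y + (1.0 - w) * mu_hat

mse_raw = np.mean((y - x_true) ** 2)
mse_post = np.mean((x_post - x_true) ** 2)
print(mse_raw, mse_post)
```

The shrunken estimates land closer to the truth, on average, than the raw measurements, which is exactly the benefit described above: information pooled across the whole population sharpens the inference for each individual object.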
Tomorrow we’ll hear more reactions from SAMSI, including the statistician’s perspective!
Ben Nelson is a graduate student in astronomy at Penn State, working with Eric Ford.
Megan Shabram is a graduate student in astronomy at Penn State, working with Eric Ford.
SAMSI is a partnership of Duke University, N.C. State University, UNC-Chapel Hill, and the National Institute of Statistical Sciences, in collaboration with the William R. Kenan, Jr. Institute for Engineering, Technology, and Science. For more info, check out samsi.info.