- Title: Automated Transient Identification in the Dark Energy Survey
- Authors: D. A. Goldstein, C. B. D’Andrea, J. A. Fischer, et al.
- First Author’s Institution: U. C. Berkeley
The problem of Big Data in Astronomy
Astronomers work with a lot of data. A serious ton of data. And the rate at which telescopes and simulations pour out data is increasing rapidly. Take the upcoming Large Synoptic Survey Telescope, or LSST. Each image taken by the telescope will be several GBs in size. Between the 2000 nightly images, processing over 10 million sources in each image, and then sending up to 100,000 transient alerts, the survey will result in more than 10 Terabytes of data… EVERY NIGHT! Throughout the entire project, more than 60 Petabytes of data will be generated. At one GB per episode, it would take 6 million seasons of Game of Thrones to amount to that much data. That’s a lot of science coming out of LSST.
One of the largest problems with handling such large amounts of astronomical data is how to efficiently search for transient objects: things that appear and disappear in the sky. A major challenge of transient astronomy is distinguishing something that truly brightened (like a supernova) from an instrumental artifact (such as a faulty pixel, a cosmic-ray hit, or another image defect). With so many images coming in, you can’t retake each one to check whether a source is real.
Before the era of Big Data, astronomers could often check images by eye. With this process, known as “scanning”, a trained scientist could often distinguish between a real source and an artifact. As more images have come streaming in, citizen science projects have arisen to harness the power of eager members of the public in categorizing data. But with LSST on the horizon, astronomers are in desperate need of better methods for classifying images.
Bring in Machine Learning
Today’s paper makes use of a computational technique known as machine learning to solve this problem. Specifically, the authors use “supervised machine learning classification”. The goal of this method is to assign a classification to an object (here, artifact or real source) based on particular features of the object that can be quantified. The method is “supervised” because it requires a training set: a series of objects and their features, along with known classifications. The training set is used to teach the algorithm how to classify objects. Rather than having a scientist hand-craft the rules that define each class, the algorithm develops these rules itself as it learns. Once the training set has been processed, the algorithm can classify new objects based on their features (see Fig. 1).
To better understand supervised machine learning, imagine you are trying to identify species of trees. A knowledgeable friend tells you to study the color of the bark, the shape of the leaves, and the number of leaves on a twig — these are the features you’ll use to classify. Your friend shows you many trees, and tells you their species name (this is your training set), and you learn to identify each species based on their features. With a large enough training set, you should be able to classify the next tree you come to, without needing a classification from your friend. You are now ready to apply a “supervised learning” method to new data!
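The tree analogy can be sketched in a few lines of code. This is a deliberately simple nearest-neighbor rule, not the authors’ actual method, and the features and species labels below are made up for illustration; but it captures the core idea of supervised classification: label a new object by comparing its features to a labeled training set.

```python
import math

# Hypothetical training set: (bark darkness, leaf roundness, leaves per twig)
# paired with a known species label — the "friend who tells you the answer".
training_set = [
    ((0.8, 0.2, 5), "oak"),
    ((0.7, 0.3, 7), "oak"),
    ((0.2, 0.9, 1), "birch"),
    ((0.3, 0.8, 1), "birch"),
]

def classify(features):
    """Label a new object with the class of its nearest training example."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest_features, nearest_label = min(
        training_set, key=lambda pair: distance(pair[0], features)
    )
    return nearest_label

# A new tree whose features resemble the oak examples:
print(classify((0.75, 0.25, 6)))  # -> "oak"
```

Real classifiers differ mainly in how they generalize from the training set, but the workflow — quantify features, train on labeled examples, then classify unseen objects — is the same.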
Using machine learning to improve transient searches
The authors of today’s paper developed a machine learning algorithm called autoScan, which classifies candidate transient objects as artifacts or real sources. They apply this technique to imaging data from the Dark Energy Survey, or DES. The DES Supernova program is designed to measure the accelerating expansion of the universe by imaging over 3000 supernovae and obtaining spectra for each. Observing from the Chilean Andes, DES will be somewhat akin to a practice run for LSST in terms of data output.
The autoScan algorithm uses a long list of features (such as an object’s flux and shape) and a training set of almost 900,000 sources and artifacts. After this training set was processed, the authors tested the algorithm’s performance on a validation set: more objects with known classifications that were not used in training. autoScan correctly identified real sources in the validation set 96% of the time, with a false detection rate (claiming an artifact to be a source) of only 2.5% (see Fig. 2).
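The two numbers quoted above can be made concrete with a quick sketch. The validation counts below are invented for illustration (the paper reports only the resulting rates, not these tallies), but the definitions are the standard ones: efficiency is the fraction of real sources recovered, and the false detection rate is the fraction of artifacts mistakenly flagged as real.

```python
def efficiency(true_positives, false_negatives):
    """Fraction of real sources the classifier correctly flags as real."""
    return true_positives / (true_positives + false_negatives)

def false_detection_rate(false_positives, true_negatives):
    """Fraction of artifacts mistakenly flagged as real sources."""
    return false_positives / (false_positives + true_negatives)

# Hypothetical validation set: 10,000 real sources and 10,000 artifacts.
print(efficiency(9600, 400))            # -> 0.96  (96% of real sources found)
print(false_detection_rate(250, 9750))  # -> 0.025 (2.5% of artifacts slip through)
```

Tuning a classifier means trading these two numbers against each other: a looser threshold catches more real sources but lets through more artifacts, and vice versa.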
With autoScan, the authors are prepared to analyze new data coming live from the Dark Energy Survey. By efficiently separating real transients from instrumental artifacts, they can greatly improve the efficiency of detecting sources like supernovae. Even so, better techniques, such as more careful construction of training sets, will be needed to keep driving down the rate of false positives.
Machine learning algorithms will become critical to the success of future large surveys like LSST, where person-power alone will be entirely insufficient to manage the incoming data. The same can be said for Gaia, TESS, the Zwicky Transient Facility, and pretty much any other upcoming astronomical survey. Citizen science projects will still have many practical uses, and the trained eye of a professional astronomer will always be essential. But in the age of Big Astro, computers will continue to become more and more integral parts of managing the daily operations of research and discovery.