Cloud computing for (observational) astronomy

This article is based on the AAS 241 special session "Astronomy and Cloud Computing." Invited speakers: Bruce Berriman, Senior Scientist, IPAC, Caltech; Ivelina Momcheva, Data Science Group Lead, MPIA; Mario Juric, Professor, University of Washington, Seattle.

Chances are you used a cloud-based service today. If you’re a student, Canvas (built on Amazon Web Services) and Google Docs both run on the cloud. Most of us depend on cloud computing so deeply that we don’t remember when we made the switch. For app developers, cloud computing is prevalent for good reason: a data center provides pay-as-you-go storage and computing, and the resources are flexible and easy to scale, which is ideal for fast innovation.

New way to work

Prof. Mario Juric laid out a vision for the day-to-day workflow of astronomers, especially observers. Until now, a large dataset would be stored on a server on the departmental intranet. The researcher would identify a subset of the data, download it, and analyze that subset on a laptop. This model does not work for the big programs of the future. With the expected data volume of surveys like LSST, any meaningful fraction of the data will be too large to download. Thus we need cloud computing to “bring the code to the data”: both the data and the code would be hosted in the cloud, and the analysis would run on cloud-based computing resources. Researchers would then not need to download anything except the final plots and data products.

To illustrate what it means to “bring the code to the data,” Dr. Momcheva explained how she reprocessed Hubble Space Telescope (HST) archival data in the cloud. While working at the Space Telescope Science Institute (STScI), she moved the entire HST data archive to the cloud. The archive hosts 1.2 TB of astronomy data, mostly in FITS files. She also worked with the astropy developers to integrate cloud queries into the existing astropy package; astropy v5.2.1 now includes a tutorial page on how to retrieve a cutout from a FITS file hosted in the cloud. Once the archive is in the cloud, the team can re-run the data reduction whenever the pipeline improves. Compared to re-reducing the data at STScI, moving the pipeline to the cloud and running it on cloud-hosted data is both faster and cheaper.
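For a flavor of what this looks like in practice, here is a minimal sketch (loosely following the astropy documentation; the exact S3 object path is illustrative, not guaranteed) of opening a cloud-hosted FITS file and pulling out just a small cutout. It assumes astropy 5.2 or later plus the optional fsspec and s3fs packages.

```python
from astropy.io import fits

# Illustrative S3 URI in the public HST bucket -- substitute a real
# object key from the archive you are working with.
uri = "s3://stpubdata/hst/public/j8pu/j8pu0y010/j8pu0y010_drc.fits"

# Opening the file does not download it; fsspec streams bytes on demand.
with fits.open(uri, fsspec_kwargs={"anon": True}) as hdul:
    # .section fetches only the requested pixel range over the network,
    # so the full image never has to land on your laptop.
    cutout = hdul[1].section[100:200, 100:200]

print(cutout.shape)  # (100, 100)
```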

The faster completion time comes from the vast computing resources available on the cloud. Cloud computing is best for “pleasingly parallel” tasks, in which each of the parallel tracks is independent of the others. Cloud resources are highly scalable but do not necessarily move data around quickly, so workloads that need little communication between tasks benefit the most. An example of a “pleasingly parallel” cooking project is baking cookies: each cookie bakes in eight minutes, but you can bake multiple cookies in one oven at the same time. Cloud computing is like taking all your cookie dough to a baking facility with a thousand ovens, so you can bake all your cookies simultaneously and finish in eight minutes. On the other hand, making cookie dough is less parallelizable because you need to retrieve multiple ingredients from different places and combine them in a specific order.
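In code, a pleasingly parallel job is simply an independent function mapped over many inputs. The sketch below, with hypothetical file names and a placeholder reduction step, shows the pattern using Python’s standard library.

```python
from concurrent.futures import ProcessPoolExecutor

def reduce_one(filename):
    """Placeholder for a per-exposure reduction that needs no other files."""
    return f"reduced {filename}"

# Hypothetical list of exposures; each one is an independent "cookie".
filenames = [f"exposure_{i:03d}.fits" for i in range(1000)]

if __name__ == "__main__":
    # With enough workers (ovens), the total time approaches the time
    # it takes to process a single file.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(reduce_one, filenames))
```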

Cheaper project cost comes from allocating resources carefully. Dr. Bruce Berriman cautioned researchers to pay close attention to the type of computing resource they purchase on the cloud: high-performance CPUs finish a computation faster, but they may not be a good early investment because of their disproportionate cost. As an example of cost management, Dr. Momcheva tallied up the cost of prototyping new software on the cloud and found that trying out new code on a smaller scale and expanding later saves money. There are also many educational and research discounts. Dr. Momcheva convinced Amazon Web Services to host the HST dataset for free and make it available to the public, and AWS offers free research credits that students and faculty can apply for.
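As a toy illustration of that trade-off (all rates and run times below are made-up placeholders, not quoted cloud prices), debugging on cheap instances and reserving the fast hardware for the final pass comes out well ahead:

```python
cheap_rate = 0.10    # $/hour for a modest instance (hypothetical)
fast_rate = 2.00     # $/hour for a high-performance instance (hypothetical)

debug_hours = 40     # iterating on the pipeline with a small test sample
full_run_hours = 10  # one clean pass over the full dataset

prototype_then_scale = cheap_rate * debug_hours + fast_rate * full_run_hours
everything_on_fast = fast_rate * (debug_hours + full_run_hours)

print(f"prototype small, then scale: ${prototype_then_scale:.2f}")
print(f"debug on fast hardware too:  ${everything_on_fast:.2f}")
```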

Challenges, and what the astro community needs to do

Cloud storage is different from the file systems we are used to on regular computers. While files on a computer are organized in directories and subdirectories, cloud services use object storage, which has no hierarchical structure. Files are stored in a “bucket,” and you find a file by its unique identifier (its key) within that bucket. Metadata that links each object’s identifier to information about its contents is therefore crucial for finding the data you need.
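To make the difference concrete, here is a minimal sketch of talking to object storage with boto3 (the bucket and key names are hypothetical): there are no directories to browse, only keys to list and fetch, which is why good metadata matters so much.

```python
import boto3

s3 = boto3.client("s3")

# The closest thing to "looking inside a folder" is listing keys by prefix.
listing = s3.list_objects_v2(Bucket="my-astro-bucket", Prefix="hst/obs001/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# To get a file back, you ask for its full key, not a path on a disk.
response = s3.get_object(Bucket="my-astro-bucket", Key="hst/obs001/image.fits")
data = response["Body"].read()
```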

At an even more fundamental level, common astronomy data formats like FITS were not designed for the cloud. Dr. Momcheva described a workaround implemented in the astropy cloud routines: the user specifies which pixels to retrieve from the cloud rather than thinking in terms of whole files (the role played by the .section call in the sketch above). All three panelists advocated for building an interconnected ecosystem of astronomy software libraries that are native to the cloud, which could vastly improve the “zero to first plot” time for anyone new to a large dataset.

As we enter the era of big surveys, observational astronomy will inevitably need cloud computing. Even though cloud computing takes extra money and time to set up for now, it is a great investment in your future. Eventually, cloud computing will make science more accessible to everyone: there will be no need to own expensive hardware or to have institutional access to datasets. Especially for students: embrace the future of astronomy on the cloud!

Astrobite edited by Sarah Bodansky

Featured image credit: Zili Shen

About Zili Shen

Hi! I am a Ph.D. student in Astronomy at Yale University. My research focuses on ultra-diffuse galaxies and their globular cluster populations. Since I came to Yale, I have worked on two "dark-matter-free" galaxies, NGC1052-DF2 and DF4. I have been coping with the pandemic and working from home by making sourdough bread and baking various cookies and cakes, reading books ranging from philosophy to virology, going on daily hikes or runs, and watching too many TV shows.
