Why Data is Eating the Universe




9:00 AM

Why Data is Eating the Universe: The Coming Age of Massive Sky Surveys
From Killer Asteroids to Dark Energy: How Apache Spark and Delta Lake can enable the next generation of discoveries in Astronomy

Based on the feedback, we will be switching this to an online event. Thanks!

Over the past decade, astronomy has morphed into an extremely data-rich field, with numerous telescope projects dedicated to scanning the sky every night to find and measure the properties of the tens of billions of visible objects in the sky. For example, Rubin Observatory’s Legacy Survey of Space and Time (LSST; http://lsst.org) will be the most comprehensive optical astronomy project ever undertaken. Starting in 2024, the LSST will take panoramic images of the entire visible sky twice each week for 10 years, building up the deepest, widest image of the Universe. The resulting hundreds of petabytes of imaging data for close to 40 billion objects can enable scientific investigations ranging from the properties of near-Earth asteroids to characterizations of dark matter and dark energy. At the same time, the sheer volume and richness of the data make it difficult to analyze with classical data management tools.

This is where Spark can help. At the UW’s DIRAC Institute, we’re about to embark on a 5-year LINCC Frameworks project to develop analysis frameworks built on industry-standard solutions and to enable astronomers to work scalably with petabytes of data stored both in the cloud and on traditional HPC resources. Spark, combined with astronomy-specific extensions we developed, enabled us to prototype a system that gave our researchers exploratory access to large astronomical datasets. In this talk, we will describe the challenges of astronomical data analysis, how we tweaked Spark to analyze 2Bn of astronomical time-series data, some hopes and visions for a (cloud-based) future, and how you could get involved with the largest data analysis problem in the history of optical astronomy.

Mario Juric: I’m interested in astronomical ‘Big Data’: developing and applying methods and algorithms that let us use large data sets to answer research questions. Major astronomical surveys of today are routinely collecting hundreds of terabytes of images, creating databases with billions of objects and several billion measurements. Astronomers working on large surveys are becoming part data scientists. In my research, I go where the data takes me: I’ve worked on topics ranging from asteroids in the Solar System to Galactic structure to the large-scale structure of the universe. My current focus is using survey data to understand the structure and evolution of the Milky Way. I also lead the Data Management team for the Large Synoptic Survey Telescope, a project to build the largest sky survey ever undertaken.

Colin Slater: I work on understanding interactions between the Milky Way and the population of dwarf galaxies in the Local Group. This includes observing the tidal debris left behind by dwarfs as they fall onto the Galaxy, along with modeling the changing properties of dwarfs as they become satellites of the Milky Way. Much of my work uses data from the Pan-STARRS survey. I am part of the LSST Data Management System Science Team, and I support that project with analyses of the scientific requirements and expected performance of the survey.