Data Engineering

Cloud computing has revolutionized the way business is done on the Internet. By linking hundreds to thousands of computers together with petabyte-scale data storage, it makes tasks that are intensive in both data and computation tractable. Many parallels can be drawn between astronomy today and the early DNA sequencing efforts and the research they enabled. Initial approaches to DNA sequencing aimed at the careful (and slow, serial) collection of data through the sequencing of long DNA strands, requiring multinational consortia and thousands of person-years of effort. It was not until the advent of Celera’s (fast, parallel) shotgun sequencing, combined with the algorithmic breakthroughs that could link those separate strands, that whole-genome sequencing became common. Many of our fundamental questions about the origin of the universe are ripe for similar disruptive software innovation.

DiRAC draws upon the resources and expertise in large-scale computation available within the University of Washington’s computational and statistical communities. This includes high-performance computing, statistics, data management systems, and machine-learning techniques as applied to astronomical data.

The projects we’re presently working on include:

  • Vera C. Rubin Observatory in the Cloud: Apache Spark is a general engine for big data processing that enables the easy scaling of analyses to thousands of processors. We are extending the LSST image processing techniques to run seamlessly on Spark and other big-data platforms.
  • astroML 2.0: A new version of astroML, the popular machine learning library for astrophysics. The new methodologies in this second version will include hierarchical Bayesian techniques, autoencoders, and deep learning.
  • Stream processing of light curves and images: We are developing algorithms and software to process the alert streams from the ZTF and LSST surveys.

This work is in support of our two major survey programs:

  • The Zwicky Transient Facility, a rapid time-domain survey of the Northern Sky
  • The Legacy Survey of Space and Time (LSST) on the Simonyi Survey Telescope: the largest optical survey ever undertaken, where we lead the Solar System algorithms and data effort.

Core Team:

  • Faculty: Magda Balazinska (Computer Science), Eric Bellm, Andy Connolly, Mario Juric
  • DiRAC Fellows: Zach Golkhou, Colin Slater, Jorge Vergaras
  • Graduate Students: Dino Bektešević, Hayden Smotherman