Data Engineering

“Scaling science to match the data rates from a new generation of telescopes and surveys”

By integrating machine learning within databases, we can scale astronomical analyses (such as the separation of stars from quasars) to hundreds of millions of sources.

Cloud computing has revolutionized the way business is done on the Internet. By linking hundreds to thousands of computers together with petabyte-scale data storage, tasks that are massively intensive in both data and computation can be addressed. Many parallels can be drawn between astronomy today and the early DNA sequencing efforts and the research they enabled. Initial approaches to DNA sequencing were aimed at the careful (and slow, serial) collection of data through the sequencing of long DNA strands, requiring multinational consortia and thousands of man-years of effort. It was not until the advent of Celera’s (fast, parallel) shotgun sequencing, combined with the algorithmic breakthroughs that could link those separate strands, that whole-genome sequencing became common. Many of our fundamental questions about the origin of the universe are ripe for similar disruptive software innovation.

DIRAC draws upon the resources and expertise for large-scale computation that are available within the University of Washington computational and statistical communities. This includes high-performance computing, statistics, data management systems, and machine learning techniques as applied to astronomical data.

The projects we’re presently working on include:

  • astroML 2.0: A new version of astroML, the popular machine learning library for astrophysics. The new methodologies in this second edition will include hierarchical Bayesian techniques, autoencoders, and deep learning.
  • Stream processing of light curves and images: We are developing algorithms and software to process the alert streams from the ZTF and LSST telescopes.
  • LSST on Spark: Apache Spark is a general engine for big data processing that enables the easy scaling of analyses to thousands of processors. We are extending the LSST image processing techniques to run seamlessly on Spark and other big-data platforms.

This work is in support of our two major survey programs:

Core Team:

  • Faculty: Magda Balazinska (Computer Science), Eric Bellm, Andy Connolly, Mario Juric
  • DIRAC Fellows: Zach Golkou, Maria Patterson, Colin Slater, Jorge Vergaras
  • Graduate Students: Dino Bektešević, Hayden Smotherman