Carnegie Mellon, UW to Pioneer Platforms that Harness Astrophysical Data to Unravel the Universe’s Mysteries

Close your eyes and imagine the night sky filled with billions of stars, galaxies, stellar clusters and asteroids. Incredible, right? Over the next decade, those celestial images will be captured through the Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory in Chile. 

The University of Washington is one of the four founders of the LSST Project, which will be the most ambitious and comprehensive optical astronomy survey ever undertaken. And faculty and researchers from the University of Washington DiRAC Institute will play leading roles in developing its science capabilities and data processing pipelines.

Carnegie Mellon University and the University of Washington have announced an expansive, multi-year collaboration to create new software platforms to analyze large astronomical datasets generated by the upcoming Legacy Survey of Space and Time (LSST), which will be carried out by the Vera C. Rubin Observatory in northern Chile. The open-source platforms are part of the new LSST Interdisciplinary Network for Collaboration and Computing (LINCC) and will fundamentally change how scientists use modern computational methods to make sense of big data. 

Through the LSST, the Rubin Observatory, a joint initiative of the National Science Foundation and the Department of Energy, will collect and process more than 20 terabytes of data each night — and up to 10 petabytes each year for 10 years — and will build detailed composite images of the southern sky. Over its expected decade of observations, astrophysicists estimate the Department of Energy’s LSST Camera will detect and capture images of an estimated 30 billion stars, galaxies, stellar clusters and asteroids. Each point in the sky will be visited around 1,000 times over the survey’s 10 years, providing researchers with valuable time series data. 

Scientists plan to use this data to address fundamental questions about our universe, such as the formation of our solar system, the course of near-Earth asteroids, the birth and death of stars, the nature of dark matter and dark energy, the universe’s murky early years and its ultimate fate, among other things.

“Our goal is to maximize the scientific output and societal impact of Rubin LSST, and these analysis tools will go a huge way toward doing just that,” said Jeno Sokoloski, director for science at the LSST Corporation. “They will be freely available to all researchers, students, teachers and members of the general public.”

The Rubin Observatory will produce an unprecedented data set through the LSST. To take advantage of this opportunity, the LSST Corporation created the LSST Interdisciplinary Network for Collaboration and Computing (LINCC), whose launch was announced August 9 at the Rubin Observatory Project & Community Workshop. One of LINCC’s primary goals is to create new and improved analysis infrastructure that can accommodate the data’s scale and complexity that will result in meaningful and useful pipelines of discovery for LSST data.

“Many of the LSST’s science objectives share common traits and computational challenges. If we develop our algorithms and analysis frameworks with forethought, we can use them to enable many of the survey’s core science objectives,” said Rachel Mandelbaum, professor of physics and member of the McWilliams Center for Cosmology at Carnegie Mellon.

The LINCC analysis platforms are supported by Schmidt Futures, a philanthropic initiative founded by Eric and Wendy Schmidt that bets early on exceptional people making the world better. This project is part of Schmidt Futures’ work in astrophysics, which aims to accelerate our knowledge about the universe by supporting the development of software and hardware platforms to facilitate research across the field of astronomy.

“Many years ago, the Schmidt family provided one of the first grants to advance the original design of the Vera C. Rubin Observatory. We believe this telescope is one of the most important and eagerly awaited instruments in astrophysics in this decade. By developing platforms to analyze the astronomical datasets captured by the LSST, Carnegie Mellon University and the University of Washington are transforming what is possible in the field of astronomy,” said Stuart Feldman, chief scientist at Schmidt Futures.

“Tools that utilize the power of cloud computing will allow any researcher to search and analyze data at the scale of the LSST, not just speeding up the rate at which we make discoveries but changing the scientific questions that we can ask,” said Andrew Connolly, a professor of astronomy, director of the eScience Instituteand former director of the Data Intensive Research in Astrophysics and Cosmology (DiRAC) Institute at the University of Washington.

Connolly and Carnegie Mellon’s Mandelbaum will co-lead the project, which will consist of programmers and scientists based at Carnegie Mellon and the University of Washington, who will create platforms using professional software engineering practices and tools. Specifically, they will create a “cloud-first” system that also supports high-performance computing (HPC) systems in partnership with the Pittsburgh Supercomputing Center (PSC), a joint effort of Carnegie Mellon and the University of Pittsburgh, and the National Science Foundation’s NOIRLab. LSSTC will run programs to engage the LSST Science Collaborations and broader science community in the design, testing and use of the new tools.

“The software funded by this gift will magnify the scientific return on the public investment by the National Science Foundation and the Department of Energy to build and operate Rubin Observatory’s revolutionary telescope, camera and data systems,” said Adam Bolton, director of the Community Science and Data Center (CSDC) at NSF’s NOIRLab. CSDC will collaborate with LINCC scientists and engineers to make the LINCC framework accessible to the broader astronomical community.

Through this new project, new algorithms and processing pipelines developed at LINCC will be able to be used across fields within astrophysics and cosmology to sift through false signals, filter out noise in the data and flag potentially important objects for follow-up observations. The tools developed by LINCC will support a “census of our solar system” that will chart the courses of asteroids; help researchers to understand how the universe changes with time; and build a 3D view of the universe’s history.

“The Pittsburgh Supercomputing Center is very excited to continue to support data-intensive astrophysics research being done by scientists worldwide. The work will set the stage for the forefront of computational infrastructure by providing the community with tools and frameworks to handle the massive amount of data coming off of the next generation of telescopes,” said Shawn Brown, director of the PSC. 

Northwestern University and the University of Arizona, in addition to Carnegie Mellon and the University of Washington, are hub sites for LINCC. The University of Pittsburgh will partner with the Carnegie Mellon hub.  

Sifting through the Static

Trans-Neptunian objects provide a window into the history of the solar system, but they can be challenging to observe due to their distance from the Sun and relatively low brightness.

In the recently published paper, Sifting through the Static: Moving Object Detection in Difference Images, DiRAC researchers report the detection of 75 moving objects that could not be linked to any other known objects, the faintest of which has a VR magnitude of 25.02 ± 0.93 using the Kernel-Based Moving Object Detection (KBMOD) platform.

They recover an additional 24 sources with previously known orbits and place constraints on the barycentric distance, inclination, and longitude of ascending node of these objects. The unidentified objects have a median barycentric distance of 41.28 au, placing them in the outer solar system. The observed inclination and magnitude distribution of all detected objects is consistent with previously published KBO distributions. They describe extensions to KBMOD, including a robust percentile-based lightcurve filter, an in-line graphics-processing unit filter, new coadded stamp generation, and a convolutional neural network stamp filter, which allow KBMOD to take advantage of difference images.

These enhancements mark a significant improvement in the readiness of KBMOD for deployment on future big data surveys such as LSST.

ADS Published Paper access here.

Letter From the Director

As we come to a close of a challenging but scientifically exciting academic year, I’m delighted to share in this newsletter some of the work and discoveries made by DiRAC researchers over the past months. 

We start with a profile of Dr. Stephen Portillo, a DiRAC Postdoctoral Fellow whose work at the intersection of statistics, machine learning, and astronomy is making it possible for us to precisely measure even the densest areas of our Galaxy. Then read about how Dr. Kyle Boone, a DiRAC and NSF Postdoctoral Fellow, uses “supernova twins” for precision cosmology — precise measurements of distances in the universe. Find out how Joachim Moeyens, one of our graduate students, is advancing the state-of-the-art in discovery of dwarf planets, comets, and asteroids in the Solar System with novel object discovery algorithms. And finally, stay for an interview with DiRAC’s associate director Prof. Jim Davenport about the searches for intelligent life in the universe, and tale of a rare eclipsing binary system, RR Hydrae. 

These are just some of the many accomplishments our researchers made in a year marked by the stresses of the pandemic and remote work. I am especially proud by how we’ve pulled through this difficult times by supporting and caring each other, and through it all managed to push forward the boundaries of science. As we move into the summer and plan for return to campus in the fall, I can’t help but be excited by the prospect of our entire DiRAC community being in person, together, again!

Mario Jurić

Professor, Department of Astronomy
Director, DiRAC Institute

Meet DiRAC’s Research Team: Dr. Stephen Portillo

Stephen Portillo’s research focuses on using advances in statistics and machine learning to allow more science to be done with existing astronomical data sets. On the statistics front, he has been developing probabilistic cataloging, a Bayesian Markov chain Monte Carlo method that improves source detection and measurement in crowded images. On the machine learning front, he has been applying autoencoders, a type of deep neural network, to enable astronomers to more easily find patterns and outliers in large datasets.

Dr. Portillo is a DiRAC Postdoctoral Fellow and UW Data Science Postdoctoral Fellow. He joined the DiRAC Institute in September 2018 after finishing his PhD in Astronomy and Astrophysics at Harvard University. Before graduate studies, he completed a BSc in Astrophysics at the University of Alberta in Edmonton, Canada in 2012.

Crowded images are difficult to analyze because sources can appear blended with their close neighbors. This problem will worsen with more sensitive observatories like Rubin Observatory, James Webb Space Telescope, and Roman Space Telescope that will see unprecedented numbers of objects in the same area of sky. Unlike traditional methods that first identify sources before measuring them, probabilistic cataloging treats source identification probabilistically. Dr. Portillo has shown that this method can find stars four times fainter than state-of-the-art methods in extremely crowded images with 1 star per 10 pixels.

At DiRAC, Dr. Portillo has joined the KBMOD team, who are developing GPU-accelerated software to find Kuiper belt objects. Recently, he has been developing a method to correct for the Earth’s motion in the solar system, allowing KBMOD to better track objects over longer periods of time. He is also extending probabilistic cataloging to search for binaries among the objects found by KBMOD, because these binaries are powerful probes of the dynamical history of the outer Solar System.

Working with Prof. Connolly, Dr. Portillo has implemented a variational autoencoder to galaxy spectra from the Sloan Digital Sky Survey. He showed that the autoencoder can summarize spectra with thousands of pixels with only six numbers and easily separates known classes of galaxies. Currently, he is working with students to use this autoencoder to find massive black hole binaries, rare objects identified by unusual spectra.

Dr. Portillo is also passionate about public outreach and teaching. While at DiRAC, he has given an Astronomy at Home talk and given virtual presentations to school groups. He was also a lecturer at AstroHackWeek 2020 and is currently co-instructing an undergraduate course on astrostatistics with Prof. Juric.

Dr. Portillo is excited to be at the DiRAC Institute because it brings together researchers interested in all aspects of data-intensive science from software engineering to statistics and machine learning. He is also happy to be a part of the eScience Institute that encourages researchers across scientific fields to find commonalities in the ways they analyze data.

Astronomers document the rise and fall of a rarely observed stellar dance

A team led by Dr. James Davenport, research assistant professor of astronomy at the UW and associate director of the UW’s DIRAC Institute, analyzed more than 125 years of observations of HS Hydra – from astro-photographic plates in the late 1800s to 2019 observations by TESS – and showed how this system has changed dramatically over the course of just a few generations. The two stars began to eclipse in small amounts starting around a century ago, increasing to almost full eclipses by the 1960s. The degree of eclipsing then plummeted over the course of just a half century, and will cease around February 2021.

Read the full article here.

_ _ _ _

Astronomers document the rise and fall of a rarely observed stellar dance

The sun is the only star in our system. But many of the points of light in our night sky are not as lonely. By some estimates, more than three-quarters of all stars exist as binaries — with one companion — or in even more complex relationships. Stars in close quarters can have dramatic impacts on their neighbors. They can strip material from one another, merge or twist each other’s movements through the cosmos.

And sometimes those changes unfold over the course of a few generations.

That is what a team of astronomers from the University of Washington, Western Washington University and the University of California, Irvine discovered when they analyzed more than 125 years of astronomical observations of a nearby stellar binary called HS Hydrae. This system is what’s known as an eclipsing binary: From Earth, the two stars appear to pass over one another — or eclipse one another — as they orbit a shared center of gravity. The eclipses cause the amount of light emitted by the binary to dim periodically.

_ _ _ _

Continue reading this article by James Urton in the UW press release here.

THOR: An Algorithm for Cadence-Independent Asteroid Discovery

One of the significant research focuses at the DiRAC Institute has been the development of next generation asteroid and comet discovery algorithms. DiRAC researchers have published a pre-print detailing one such algorithm called “Tracklet-less Heliocentric Orbit Recovery” (THOR). Applied to observations from the Zwicky Transient Facility (ZTF), THOR recovered 97% of the known objects with at least 5 observations, a factor of 1.5-2 more than the current generation of discovery algorithms. In addition to recovering most of the known objects, THOR would have discovered nearly 500 new Solar System objects (including a parabolic/hyperbolic comet) had it been running when the observations were made in 2018. 

The Hammer and the Comet 

One of the significant research focuses at the DiRAC Institute has been the development of next generation asteroid and comet discovery algorithms. DiRAC researchers have published a pre-print detailing one such algorithm called “Tracklet-less Heliocentric Orbit Recovery” (THOR). Discovering minor planets involves having current generation astronomical surveys observe the same area of the sky at least twice in one night. The two sets of observations of the same region of the sky can then be scanned for what is known as a “tracklet”: a motion vector made of at least two observations that could represent the actual motion of a Solar System object. This observing pattern is repeated over the course of a 2-week window until enough tracklets are observed so that they can be used to discover asteroids and comets. 

Requiring tracklets for Solar System discovery has two striking consequences: first, any astronomical survey with the goal of discovering minor planets must observe the same area of sky at least twice in one night thereby limiting the amount of sky the telescope could observe in a single night. Second, any dataset that was not constructed with a tracklet building cadence is not a dataset suitable for Solar System discovery. Leveraging the latest innovations in Solar System discovery and backed by large scale computing, THOR has addressed these concerns by removing the requirement for tracklets to be made and enabling minor planet discovery without the need for a specific cadence of observations. 

As a proof-of-concept demonstration, THOR was applied to two weeks of observations from the Zwicky Transient Facility (ZTF) that were taken in early September 2018. THOR recovered 97% of all Solar System objects known at the time. ZTF’s own discovery algorithm, ZMODE, which relies on tracklet-like observations for discovery, could at best recover 68% of the same population. The Vera C. Rubin Observatory’s discovery algorithm which also relies on tracklets would at best recover 45% of the same population. In other words, by enabling Solar System discovery without requiring a specific pattern of observations, THOR can recover 1.5-2 times as many asteroids and comets as the current generation of algorithms. Of the 21,000+ orbits that were recovered by THOR, 488 were identified as high quality discovery candidates that could not be associated with any objects that were known in 2018. 

DiRAC researchers then posed the question: had THOR been running on ZTF when the data were taken in 2018, how many asteroids and comets would it have discovered? Of the 488 discovery candidates identified, THOR would have discovered at least 477 new asteroids and comets had it been running as ZTF’s discovery algorithm when the observations were made. 

The observations and best-fit trajectories of the 11 remaining objects are shown in the figure. Subsequent analysis showed that of the 11 candidates, 10 are as yet undiscovered objects. The e > 1 candidate (2nd row, 3rd column) represents “precovery” observations of parabolic/hyperbolic comet C/2018 U1 that was discovered on October 27th 2018 by the Mount Lemmon Survey. Precovery observations are observations of an object that pre-date its original discovery date. The ZTF data on which THOR was tested and developed were taken 6-8 weeks prior to the discovery date of this comet. Had THOR been running in September 2018, it would have allowed ZTF to claim the discovery of this fascinating object. 

Following their success using just two weeks of ZTF observations, the research team behind THOR is now working to process all three years of ZTF observations. They anticipate this should yield the discovery of several hundred new Solar System objects. THOR is completely open-source and available on GitHub. The eventual goal of the THOR project is to launch a discovery service where surveys can submit their observations and, powered by THOR, they can be processed and analyzed for the discovery of new asteroids and comets.

Joachim Moeyens

Joachim Moeyens is a graduate student in the Department of Astronomy at the University of Washington. He is interested in big data and software driven solutions to problems in astronomy. During his undergraduate studies at the University of Washington, he was presented with the opportunity to work on a research project for the Vera Rubin Observatory’s Legacy Survey of Space and Time (LSST).  For his doctoral thesis, Joachim is working on algorithms that discover minor planets in astronomical surveys, in particular, on Rubin Observatory’s Solar System Processing pipelines, and on a novel algorithm named Tracklet-less Heliocentric Orbit Recovery (THOR).

Supernovae Twins Open Up New Possibilities for Precision Cosmology

Type Ia supernovae are some of the most powerful tools for testing different theories of gravity. These supernovae are explosions of massive stars that all look remarkably similar. By measuring how bright a supernova is, we can figure out how far away it is. Type Ia supernovae were used to make the initial discovery of dark energy in 1998. We have since used supernovae to measure the properties of dark energy with better and better precision with the goal of determining what it really is. In our new work, we developed a technique that uses the spectra of Type Ia supernovae to improve how well we can measure the distances to them. Our new technique can measure these distances around twice as well as previous techniques, and our results will be very important for measurements of dark energy with upcoming surveys such as the Legacy Survey of Space and Time (LSST) at the Rubin Observatory, or for the Nancy Grace Roman Space Telescope.

The upper left figure shows the spectra — brightness versus wavelength — for two supernovae. One is nearby and one is very distant. To measure dark energy, scientists need to measure the distance between them very accurately, but how do they know whether they are the same? The lower right figure compares the spectra — showing that they are indeed “twins.” This means their relative distances can be measured to an accuracy of 3 percent. The bright spot in the upper-middle is a Hubble Space Telescope image of supernova 1994D (SN1994D) in galaxy NGC 4526. (Graphic credit: Zosia Rostomian/Berkeley Lab; photo credit: NASA/ESA)

Kyle Boone is a lead author on two papers published in The Astrophysical Journal that report these findings. Currently a postdoctoral fellow at the University of Washington, his research focuses on developing novel statistical methods for astronomy and cosmology. He is particularly interested in using Type Ia supernovae to probe the accelerated expansion of the universe that we believe is due to some form of “dark energy”. Dr. Boone is a former graduate student of Nobel laureate Saul Perlmutter, the Berkeley Lab senior scientist and UC Berkeley professor who led one of the teams that originally discovered dark energy. Dr. Perlmutter was also a co-author on both studies.

The Twins Embedding of Type Ia Supernovae. II. Improving Cosmological Distance Estimates

The Twins Embedding of Type Ia Supernovae. I. The Diversity of Spectra at Maximum Light

Going Dark: The Mystery of Vanishing Stars

Surveys like ZTF and the LSST on the Vera C. Rubin Observatory are improving our understanding for nearly every area of modern astronomy. Sometimes, however, these large projects discover something truly unexpected…

James Davenport (UW research assistant professor, and the Associate Director of the DiRAC Institute) and Beatriz Villaroel (Stockholm University) were interviewed this April by the SETI Institute‘s Seth Shostak about the “VASCO” project to search for disappearing stars. In this hour-long discussion, Davenport and Villaroel discuss the importance of searching for intelligent life in the universe, and finding the unexpected in our data.

Vera C. Rubin Observatory Science Pipelines in the Cloud

The Legacy Survey of Space and Time (LSST) is an upcoming sky survey that aims to conduct a longitudinal 10 year long survey in which answers to open questions about dark matter, dark energy, hazardous asteroids and the formation and structure of Milky Way galaxy will be searched for. To find these answers Vera C. Rubin Observatory will image, in high quality, the entire night sky every three nights. It is estimated that in the 10 years of operations it will deliver a total of 500 petabytes (PB) of data. Rubin will produce 20TB of data per night, generating a petabyte of data in month and a half. For comparison, the largest to date released astronomical dataset is the Pan-STARRS (PS) Data Release 2 (DR2). PS DR2 is the result of nearly 4 years worth of observations and measures 1.6 PB in size.

Science catalogs, on which most of the science will be performed, are produced by image reduction pipeline that is a part of Rubin’s code base called Science Pipelines. While Rubin Science Pipelines attempt to adopt a set of image processing algorithms and metrics that cover as many science goals as possible, and while the Rubin will set aside 10% of their compute power to be shared by the collaboration members for additional (re)processing, Rubin can not guarantee absolute coverage of all conceivable science cases. Science cases for which Science Pipelines produce sub-optimal measurements could require additional reprocessing as well some science cases might require a completely different processing.

Enabling processing of the underlying pixel data is a very challenging problem, as at these data volumes large supporting and compute infrastructure is required. Without considerable infrastructure it will also be impossible to independently verify whether Science Pipelines measurements are biased or not. Thinking forward, Rubin is likely not going to be the last, nor the only such, large scale sky survey.  Reproducibility and repeatability are necessary for maximizing the usability and impact of Rubin data.

If pixel data re/processing were accessible to more astronomers it would undoubtedly positively impact shareability, repeatability, and reproducibility going forward and would, in general, positively impact the type and quantity of science that can be done with such data. Additionally, due to how the Science Pipelines are implemented, all of this applies to other instruments such as Sloan Digital Sky Survey (SDSS), Dark Energy Camera (DECam), Hyper Suprime-Cam (HSC), PS and more…

Modeling astronomical workflows after approach taken by the IT industry which has, in a lot of cases, significantly surpassed Rubin’s data volumes and adopting cloud based solutions allows us to resolve some of these tensions. Instead of maintaining our own compute infrastructure, we cheaply rent one. Instead of hosting datasets on our own infrastructure, where equal high-speed availability for all users can be difficult to achieve, we host it in one of many storage services offered by cloud service providers, which were designed to provide high speed access to anybody at anytime.

Rubin Data Management (DM) team is aware of these issues and commissioned an Amazon Web Services (AWS) Proof of Concept (PoC) group to ascertain what are the code base changes required to enable use of cloud services, to determine whether a cloud deployment of Data Release Production (DRP) is feasible, to measure its performance, determine the final cost of investigate other more-native cloud options for certain system components that are difficult to develop and/or maintain. The plan and its phases were described in the DMTN-114 document. The group’s members and notes from biweekly meetings can be seen on the AWS PoC Confluence pages.

Rubin Data Management architecture

The Data Butler is the overarching data IO abstraction through which all Rubin data access is mediated. The main purpose of the Data Butler is to isolate the end user from file organization, filetypes and related file access mechanisms by exposing datasets as, mostly, Python objects. Datasets are referred to by their unique IDs, or by a set of identifying references, which are then resolved through an registry that matches the dataset IDs, or references, to the location, file format and the Python object/type the dataset is to be read as. A Dataset can be fully contained in a single file but often they are comprised of combined data supplied from multiple sources.

The Data Butler architecture diagram showing how the two main components, the Registry and the Datastore, are related to each other.

Data Butler consists of two interacting but separate abstractions: the Datastore and the Registry. The system that persists, reads and potentially modifies the data sources, i.e. files, comprising a dataset is called a Datastore. A Registry holds metadata, relationships and provenance for managed datasets.

The Registry is almost always backed by an SQL database and the Datastore is usually backed by a shared filesystem. A major focus of AWS POC was to implement, and investigate issues related to, an S3 backed Datastore and a PostgreSQL backed Registry.

At the time of writing, Data Butler implementation is referred to as Generation 2 Butler. Generation 3 Butler is the revised and re-implemented version of the Generation 2 Butler that incorporates many lessons learned during implementation and use of Gen. 2 Data Butler. The description provided above, and the work described below, are all done using Gen. 3 Butler. The Gen. 3 Butler is expected to replace Gen. 2 Butler by mid of 2020.

Implementing an S3 Datastore and an PostgreSQL Registry

Science Pipelines access data purely through the Butler, therefore the file access mechanism is completely opaque to Science Pipelines. Writing a Datastore and a Registry that can interface to appropriate cloud services effectively enables the entire DRP to run while storing, reading and querying relationships directly from cloud services.

As mentioned earlier, AWS was chosen as the cloud service provider for this project. AWS is a leading cloud service provider with the widest selection of different services on offer. The choice of AWS was mainly done to its cheap pricing and wide popularity, and therefore support. AWS services are interfaces through their official Python SDK module called “boto3”. Mocking of these services, for testing purposes, was achieved by using the “moto” library. The AWS S3 service was to back the Datastore and the RDS service for the Registry.

S3: Simple Storage Service

S3 is object storage provided by AWS. Unlike the more traditional file systems that manage data as a file hierarchy, or data blocks within sectors and tracts, object storage manages data as objects. This allows the storage of massive amounts of unstructured data where each object typically includes the data, related metadata and is identified by a globally unique identifier.

S3, specifically, promises 99.999999999\% durability as uploading an object to it automatically stores it across multiple systems, thus also ensuring scalability and performance. Objects can be stored in collections called Buckets for easier administrative purposes. Access, read, write, delete and other permissions can be managed at the account, bucket or individual object level. Logging is available for all actions on the Bucket level and/or at the individual object granularity. It is also possible to define and issue complex alert conditions on Bucket or objects which can trigger execution of arbitrary actions or workflows.

Relational Database Service (RDS) and PostgreSQL

PostgreSQL is one of the most popular open source relational database systems available. The choice to go with PostgreSQL was based on the fact that it’s a very popular and well supported open source software that suffers from no additional licensing fees usually associated with proprietary software. Additionally, a wider support of Relational Database Management Systems (RDBMS) was desired as, at the time, the implemented registries were backed by SQLite, an open source but relativey poor-on-features DB, and Oracle, a powerful but proprietary RDBMS. To launch and configure the RDBs the AWS Relational Database Service (RDS) was used.

S3Datastore

The first tentative S3 datastore implementation was written in March 2019. The tentative implementation revealed a major issue with how Uniform Resource Identifier (URI) were manipulated. After expansive discussions, and an example re-implementation called S3Location to demonstrate the issues, in May 2019 a new ButlerURI class resolved the problems. During the initial implementation of the Butler it was assumed that a shared file system is the only datastore mechanism that will be used. Every call to OS functionality in the Butler module had to be changed to the new interface (Butler, Config, ButlerConfig, YAML Loader etc.). Naturally, now that this interface exists, it will be much easier to amend the existing functionality and adapt the code in the future.

Formatters, the interfaces that would construct Python objects from the data in the files, received a revision also. Changes were made to nearly every formatter (JsonFormatter, PickeFormatter, YamlFormatter, PexConfigFormatter and the generic abstract class Formatter) in the data management code base. Changes allowed them to construct objects not only from files, but also by deserializing from bytes. Since many of the objects in the Rubin code base support serialization to bytes almost none of them need to be downloaded and saved to a temporary file before opening, but can be constructed from memory instead. This applies more generally, than just to AWS, too. For example, due to the chnages made downloading the data via HTTP request avoids the need to persist the data locally.

The S3 Datastore was accepted and merged into the master branch shortly after that in PR-179. The S3 Datastore implementation demonstrated that significant portions of datastore code is, or can be made, more general. This prompted a major code refactoring to avoid code duplication leaving the entire code base cleaner in the end.

PostgreSQL and RDS Registry

When the AWS PoC Group started Oracle backed registry was a fresh addition to the DM code. Because Oracle and PostgreSQL are normatively very similar, the initial implementation of PostgreSQL registry was based on it. SQLAlchemy was used to interface with the DBAPIs. There was one new problem and a one previously known one that affected the new registry. One was the camel case table naming convention, where neither Oracle nor PostgreSQL allow uppercase table or column names. The other one were missing SQL expression compilers for views, which are not correctly created by SQLAlchemy for neither PostgreSQL nor Oracle RBMS. Once view compilers were written PostgreSQL registry was re-implemented in terms of the more general SQL registry class.

On of the bigger issues during development was the data model complexity. Because data is so difficult to mock a decision that integration tests are run against handpicked prepared real data. This data is usually bundled in a continuous integration module named after the instrument that recorded it; the canonical one being the ci_hsc containing data created by the Hyper Suprime-Cam on Subaru telescope. To run the integration tests existing SQLite registries were migrated to PostgreSQL in July and the processing pipeline was executed against it.

Integration testing revealed a major issue. As was described above, the Datasets can actually consist of multiple different building blocks, each of which is also treated as a Dataset. This means that if some processing produces a new Dataset and ingest in into the data repository, it’s not known in advance whether the dataset, or one of its constituent parts, already exist in the registry. Instead of issuing many select queries to check for existence, DM code instead relied on catching unique, primary and foreign key constraint failures to catch these edge cases. A roll-back of the single failed transaction is issued and only the parts of the dataset that are new would then be inserted. This was an intentional optimization as majority of ingest operations ingest new datasets, ergo such constraints are rarely triggered.

In PostgreSQL, however, if a single statement in a transaction fails, all previous and future statements in that transaction are invalidated too. This means that in case of a failure it would be necessary to run all previously executed statements in that transaction too, then fix the broken statement and re-executing it. Technically speaking, this seems to follow the SQL standards more closely, but almost none of the other RDBMS’s implement it in this way.

A stopgap solution, based on issuing a savepoint statement between every SQL statement and then issuing an explicit rollback to that savepoint, which works for all currently implemented registries, was implemented in PR-190 and a more complete solution, based on on-conflict SQL statements, was implemented in PR-196. The PostgreSQL registry implementation was merged into the master branch of Gen. 3 Butler in PR-161.

RDS performance issues

What we want in the end is to be able to run the complete DRP on a dataset, on a cluster of AWS instances, using AWS S3 and RDS services. Skipping slightly ahead of how we achieved this execution and how the workflow is executed, there is an important lesson regarding PostgreSQL views that was learned and I want to share.

The DRP workflow is defined by a Directed Acyclic Graph (DAG). The Rubin DAG, so called QuantumGraph, is basically an XML file that sets the order of execution of commands, their compute requirements as well as the datasets they need and datasets they produce. In this way none of the commands that require data produced by some other command will be executed before it, and none of the commands will execute on instances with insufficient resources.

The QuantumGraph is created by issuing a very large SQL statement that, effectively, creates a Cartesian product between many, of the 15-ish or so, tables in the registry. From that product it then sub-selects the relevant datasets, based on what tasks were defined in the workflow configuration, and creates the XML DAG. The results are then parsed in Python and a slew of many different, small, follow-up queries are issued. This diverse workflow presents a very difficult challenge in DB optimization. Still, the Rubin DM team reports that it takes them on average 0.5 to 1.5h to create QuantumGraphs for larger datasets on their infrastructure, the difference attributed to how complicated the workflow is. In our tests we were unable to create even the simplest QuantumGraphs even when doubling on the available resources and were forced to abort multiple times after 30+h of execution.

The lack of performance did not seem related to hardware limitations but SQL ones instead. Further inspection then revealed that the fault lied in PostgreSQL views. PostgreSQL, and Oracle, regularly collect statistics on database objects used in queries. These statistics are then used to generate execution plans. These execution plans are then oftentimes cached. This in general results in performance gains. But because Butler generates all SQL dynamically there are no guarantees that a cached, or even prepared execution plans can be executed. In PostgreSQL many of these optimizations then fail. When the views are part of larger statements they often get materialized completely, even if the outer statements have strict constraints on them.

The performance penalty was debilitating where even a simple query such as:

select * from visit_detector_patch_join limit 4;

took up to 12 seconds (for 4 results!). After moving the views to materialized views, and adding triggers to recreate the materialized views on any insert that would update them the same statement executes in 0.021 miliseconds. However, even after the changes, the simplest QuantumGraphs still took 40 minutes to an 1h to create. Indexing and several further memory related optimizations then reduced the time it takes to create a QuantumGraph for an entire tract (a small subset of a night) to under 2 minutes. Recently the SQL schema and data model for the registries was reworked and completely re-implemented, completely removing views from the schema.

What does this get us and what can we do with it?

Processing is invoked via the pipetask command, explained bellow. Output logs show calibration steps performed alongside with their measured metrics.

What all of this gives us is an ability to run Science Pipelines (instrument signature removal, image characterization, image calibration, source detection, image differencing, coaddition…) from a local machine, or even better yet an AWS compute instance where the data repository is hosted in an S3 Bucket and the registry is backed by an RDS PostgreSQL database. The above described code changes give us this ability, natively from within the Rubin Science Pipelines themselves. Now that the processing is able to execute from a single machine, we want to learn how we can employ the cloud resources to scale out and process many datasets simultaneously.

Scaling the processing.

The second major effort of the AWS POC group was focused on how do we scale the processing in the cloud to (reasonably) arbitrary sized clusters of instances. This problem is multifaceted. A part of the problem was already briefly described above: how to organize execution so that no actions, that require products of a previous action(s), are executed before former actions complete. Another facet of the problem is how do we actually automatically procure compute resources in the cloud, and how do we configure said instances.

Science Pipeline workflow

A simple workflow that processes raw exposure into calibrate exposures. Along the way it detects sources and performs astrometric and photometric measurements.

Briefly, the DPR was already imagined as a Directed Acyclic Graph (DAG) execution. To execute a workflow users target datasets and a single or multiple different Tasks such as calibration, source detection etc. Each Task can be given some configuration and the entire pipeline is given general information about the data repository, such as were the root of the datastore is and at what URI can the registry be connected to. All these configurations are provided in YAML format. The workflow machinery churns out the dependencies and creates a simple DAG that is then executed. The Rubin DAG is called the QuantumGraph and its role is to organize the order in which processing needs to occur. The command invoking this whole machinery is called a pipetask to show it’s not as scary as it sounds here is an example:

pipetask -d 'visit.visit=903334'           # IDs or references to datasets we want to process.                   
         -b 's3://bucket/butler.yaml'      # Data repository config file (where the datastore/registry is etc.)     
         -p lsst.ip.isr                    # List of packages containing tasks that will be executed.             
         -i calib,shared/ci_hsc            # input dataset collections required (flat, dark, bias exposures etc.)  
        -o outcoll                         # Name of the output collection, i.e. where the results will be stored 
         run                                                                                                      
         -t isrTask.IsrTask:isr            # Tasks to execute,                                                    
         -C isr:/home/centos/configs/isr.py # and a config for each task                                           

In this case we target a single task – the instrument signature removal task. Note how the data repository does not need to be local but that it’s possible to target configuration hosted on S3.

HTCondor and condor_annex

HTCondor is a popular open source distributed computing software. Simplified, it is a scheduler that executes jobs submitted to its queue. Each job has a submit script that defines which queue it is placed in, what are the jobs resource requirements, where should errors be redirected to etc. Typical environments in which HTCondor is most often run are co-local compute resources with a shared, or networked, filesystems.

Thankfully however, HTCondor clusters can be configured very flexibly. A module called condor_annex extends the typical functionality of HTCondor by adding in cloud resources. It currently supports AWS only but Google Compute Engine and Microsoft Azure will be implemented shortly. Using condor_annex we can take advantage of the AWS’s Elastic Compute Service (EC2) to procure computational resources on which jobs in HTCondor’s queue can execute.

Example of terminal status output of a condor cluster using EC2 compute instances executing a DAG submitted through Pegasus.

To execute the QuantumGraph on the cluster procured by condor_annex a workflow management system called Pegasus is employed. In the future, as the pipetask workflow matures, this package might not be required anymore, but for now think of it as the interface that converts the QuantumGraph into submit scripts and adds them to HTCondor’s cluster queue in the correct order.

Pegasus status detailing the DAG currently being executed. The workflow is processing raw images into calibrated exposures.

The configuration of the HTCondor cluster can be pretty tedious and complicated so the AWS POC group offers pre-prepared Amazon Machine Image(s) (AMIs) that contain the Rubin Stack, Pegasus and HTCondor with condor_annex pre-installed and configured. If you wish to see how the configuration is done, or replicate the configuration yourself, see the accompanying documents on setting up Amazon Linux 2 or Centos 7 environments and then follow HTCondor and condor_annex setup guide.

An AMI is effectively an image of the system that you want present on the EC2 instance after booting it up. It prescribes the OS, software, any additional installed packages and data that will exists on that instance. The AMIs are based on HVM and should work on all EC2 instance types, however they are still an early product and are shareable on a per-case basis. Contact dinob at uw.edu for access to the AMIs.

Conclusion

At the time of writing the scalability, pricing and usability are still being evaluated. The final conclusions will be reported on in DMTN-137 but the preliminary results and behavior are very promising.

A ~35 person demo was held on November 8th at Kavli Petabytes to Science workshop in Cambridge, MA. Details of the demo, as well as more detailed explanations of the workflow accompanied by images, can be found on the tutorial website. Each attendee spun up their own 10 instance cluster and executed a 100 job workflow based on the ci_hsc data. The total cost of the demo was ~35USD.

Additional larger scale testing, so far, were performed with clusters up to a hundred r5.xlarge, m5.xlarge and m5.2xlarge. These instances have 2 to 4 cores per instance with 8 to 32GB of memory on which a tract-sized datasets were processed; a tract is a small sub-selection of a whole nightly Rubin run. Further scaling and pricing tests will be executed up to a whole nightly run at which point more qualitative performance characteristics and pricing can be established.

AWS PoC showed that it is not impossible, nor is it significantly more difficult to run the Rubin DRP in the cloud compared to setting up a local compute resource. The preliminary tests indicate that the cloud definitely has the potential for significant scaling while still remaining affordable. Given the variety of problems mentioned in the introduction, as of yet, there is almost a certainty that such functionality would be desirable and accepted in the community.

Another important AWS PoC result was exercising the DM code. While implementing the support for AWS into the DM code we discovered many assumptions that were made in the DM code. Many of these assumptions were not only limiting the implementation of the AWS support, but also made the Butler less flexible than necessary. Even if cloud aspect is not pursued further, just recounting the contributions made in return made this exercise worth while. As a result of the AWS PoC Groups work, the Butler code base is left at more general and flexible than it started.