We introduce Dagster, an open source Python library for building ETL processes, ML pipelines, and similar software systems, all of which we call data applications.
Data applications are graphs of functional computations that consume and produce data assets. Dagster provides abstractions and tools for modeling the semantics of these applications: a unified type system, a data dependency graph, a configuration system, a structured API for emitting events such as data quality tests and materializations, and high-quality developer tools built on those abstractions. The computations themselves can be written in whatever tools their builders already use -- Spark jobs for data engineers, SQL statements for analysts, Python for data scientists -- and can be deployed to arbitrary orchestration engines, such as Airflow, Dask, or Kubernetes-based execution.
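The core idea -- computations declared as functions, wired into a dependency graph that emits events as assets are materialized -- can be sketched in plain Python. This is a conceptual illustration, not Dagster's actual API; the `Graph`, `Step`, and event names here are invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    name: str
    deps: List[str]
    fn: Callable[..., object]

@dataclass
class Graph:
    """A toy data-dependency graph: steps, wiring, and an event log."""
    steps: Dict[str, Step] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)

    def step(self, name, deps=()):
        # Decorator that registers a function as a node in the graph.
        def register(fn):
            self.steps[name] = Step(name, list(deps), fn)
            return fn
        return register

    def run(self):
        # Execute steps in dependency order, feeding each step the
        # outputs of its upstream dependencies, and log an event
        # (analogous to a materialization) per step.
        results = {}
        for name in self._topo_order():
            step = self.steps[name]
            results[name] = step.fn(*[results[d] for d in step.deps])
            self.events.append(f"materialized:{name}")
        return results

    def _topo_order(self):
        order, seen = [], set()
        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for d in self.steps[name].deps:
                visit(d)
            order.append(name)
        for name in self.steps:
            visit(name)
        return order

graph = Graph()

@graph.step("extract")
def extract():
    return [1, 2, 3]

@graph.step("transform", deps=("extract",))
def transform(rows):
    return [r * 10 for r in rows]

results = graph.run()
```

Dagster layers typing, configuration, and tooling on top of exactly this kind of graph, so each step's inputs, outputs, and emitted events are inspectable by the surrounding system.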
The result is data systems that are more reliable, testable, and understandable, that leverage the existing tools that already work, and that are deployable to your own infrastructure.
Twitter relies heavily on Scala/JVM and has deep expertise in this area. For instance, we've built Finagle for low-latency client/server RPCs, Heron for near-real-time data processing, and Scalding for offline use cases (Hadoop/Spark). In contrast, the ML world is focused on the Python/C++ stack.
To provide a TensorFlow inference offering for the different use cases at Twitter, we've had to overcome multiple problems to make it reliable, cost-effective, and scalable to large models. In this presentation, we'll share our key learnings.
We'll do a deep dive into specific performance issues we've had to deal with, show how we handled them, and describe the tools and techniques we built both to mitigate the issues we observed and to establish quality gates that prevent future regressions. We'll place particular emphasis on observability: catching performance issues early through automatic performance regression analysis on key metrics (CPU usage, memory usage, latency, throughput). We'll also discuss deciding what you should optimize for (throughput vs. latency, for instance) and thinking early about your performance goals and Service Level Objectives before working on a new model.
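A quality gate of the kind described above can be as simple as comparing a candidate build's metrics against a baseline with a per-metric drift budget. The following is a hypothetical sketch -- the metric names, baseline values, and tolerances are illustrative, not Twitter's internal tooling.

```python
# Baseline metrics for the currently deployed model (illustrative values).
BASELINE = {"p99_latency_ms": 12.0, "cpu_cores": 8.0, "qps": 50_000.0}

# Allowed relative change per metric. A positive budget caps increases
# (latency, CPU must not grow too much); a negative budget caps decreases
# (throughput must not drop too much).
TOLERANCE = {"p99_latency_ms": 0.10, "cpu_cores": 0.10, "qps": -0.05}

def regressions(candidate, baseline=BASELINE, tolerance=TOLERANCE):
    """Return the metrics whose relative change exceeds the drift budget."""
    failed = []
    for metric, base in baseline.items():
        change = (candidate[metric] - base) / base
        limit = tolerance[metric]
        if limit >= 0 and change > limit:    # grew beyond the budget
            failed.append(metric)
        elif limit < 0 and change < limit:   # shrank beyond the budget
            failed.append(metric)
    return failed

# A candidate whose p99 latency regressed by ~17% would fail the gate.
candidate = {"p99_latency_ms": 14.0, "cpu_cores": 8.2, "qps": 51_000.0}
failed = regressions(candidate)
```

In practice such a check runs automatically against load-test results for every candidate, blocking deployment when any key metric drifts outside its budget rather than relying on a human to spot regressions on dashboards.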
All of these aspects have helped us successfully serve 50+ different models in production, handling 20M to 40M+ requests per second.
At the end of this talk, we hope you will better understand the choices Twitter made along the way to create a reliable JVM-based inference pipeline, and that you will be able to benefit from our experience.