
Thursday, November 14

9:40am PST

Enabling Real-Time Querying of Data Using Apache Druid, Flink, and Kafka
In this talk, we'll learn more about how Apache Druid powers alerting against real-time data at Lyft, which is useful for several use cases, including validating A/B tests, checking the accuracy of emails sent to customers, and powering internal tools. We'll talk about the challenges we faced while setting up our real-time ingestion pipeline into Druid using Apache Flink and Kafka, and how we went about solving them.
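The alerting pattern the abstract describes — aggregate recent events and compare against a threshold — can be sketched in a few lines. This is an illustrative toy only (Lyft's pipeline issues Druid queries; the function and parameter names here are hypothetical):

```python
from collections import deque

def alert_on_window(stream, window=5, threshold=10.0):
    """Emit an alert index whenever the mean of the last `window`
    metric values exceeds `threshold` (toy sliding-window alerting)."""
    buf = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(stream):
        buf.append(value)
        if len(buf) == window and sum(buf) / window > threshold:
            alerts.append(i)
    return alerts
```

In a real deployment the windowed aggregate would come from a Druid query over freshly ingested Kafka events rather than an in-process buffer.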


Sharanya Santhanam

Software Engineer, Lyft
I'm a Software Engineer @ Lyft working on the Data Platform Infrastructure team. I work on interactive query engines. Interested in chatting about Druid & Presto.

Shiv Toolsidass

Software Engineering - Data Infrastructure, Lyft

Thursday November 14, 2019 9:40am - 10:10am PST

10:20am PST

How to Eliminate Surprises In Your Data
How do you know you can trust the accuracy of the data flowing through a pipeline, and the insights derived from it? At Spotify, we have an infrastructure team focused on data quality to address this problem. From the cultural changes we’re making to give data engineers a quality mindset, to the specific tools we’ve written, we’ll explain how we increase confidence and eliminate surprises in our data contents, and how we approach problems in the wide space of ‘data quality.’ You’ll learn about a few key moments in the pipeline lifecycle when data quality might be compromised, and the approach we took to improving them.
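One concrete flavor of "eliminating surprises" is an automated audit of each pipeline output before downstream consumers see it. A minimal sketch in plain Python (the checks and names are hypothetical, not Spotify's actual tooling):

```python
def audit_partition(rows, expected_schema, min_rows=1):
    """Run basic quality checks on one pipeline output partition:
    row-count sanity, schema conformance, and null detection."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"row count {len(rows)} below minimum {min_rows}")
    for i, row in enumerate(rows):
        for field, ftype in expected_schema.items():
            value = row.get(field)
            if value is None:
                issues.append(f"row {i}: null in required field '{field}'")
            elif not isinstance(value, ftype):
                issues.append(f"row {i}: field '{field}' is not {ftype.__name__}")
    return issues
```

A check like this would typically run as a gate between pipeline stages, failing the run (or paging an owner) when the issue list is non-empty.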


Idrees Khan

Senior Data Engineer, Spotify

Thursday November 14, 2019 10:20am - 10:50am PST

11:00am PST

From datasets to tables in a multitenant data lake
Salesforce Einstein democratizes access to world-class machine learning in the Salesforce ecosystem by making it easier to build trusted, scalable, and efficient ML-powered apps. A major effort required is making tenant data available to those ML processes. This talk will cover our journey to change the major abstraction offered by the data microservices in the Einstein platform, moving from the dataset to the table: in particular, why we think the new abstraction is more useful for consumers of the service, and the technology choices we have made.


Thomas Gerber

Director of Engineering, Salesforce

Thursday November 14, 2019 11:00am - 11:30am PST

11:40am PST

Vectorized Query Processing for CPUs and GPUs using Apache Arrow
Query processing technology has seen rapid development since the iconic C-Store paper was published in 2005. The focus has been on designing query processing algorithms and data structures that efficiently utilize the CPU and leverage changing trends in hardware to deliver optimal performance. In this talk we will explore different types of vectorized query processing in Dremio using Apache Arrow.

Columnar data has become the de facto format for building high-performance query engines that run analytical workloads. Apache Arrow is an in-memory columnar data format that houses canonical in-memory representations for both flat and nested data structures. It is a natural complement to on-disk formats like Apache Parquet and Apache ORC. Data stored in a columnar format is amenable to processing using vectorized instructions (SIMD) available on all modern architectures. Query processing algorithms can implement simple and efficient code that operates on the columnar values in a tight loop, providing fast and CPU cache-friendly access patterns. Operations like SUM, FILTER, COUNT, MIN, and MAX on columnar data can be made more efficient by leveraging the data-level parallelism of SIMD instructions.

Columnar data can be encoded using lightweight algorithms like dictionary encoding, run-length encoding, bit packing, and delta encoding that are far more CPU-efficient than general-purpose compression algorithms like LZO and ZLIB. Furthermore, vectorized query processing algorithms can be written in a manner that is aware of column-level encoding and can, in some cases, operate directly on the compressed column values. This saves CPU-memory bandwidth, since we need only decompress the necessary column values. The columnar format also allows us to efficiently utilize CPU and GPU caches by filling cache lines with related data (column values from an in-memory vector).

With the increasing use of GPUs and FPGAs, efficient use of the smaller on-chip memory available in these architectures is especially important. In addition, Apache Arrow allows for zero-copy, shared access to buffers so that multiple processes can operate on the same data more efficiently. On the storage side, columnar representation of on-disk data makes a good case for efficient utilization of disk I/O bandwidth for analytical queries. Dremio's query processing engine leverages the columnar formats of Apache Arrow and Parquet for in-memory and on-disk representations respectively. We have vectorized implementations of operators like hash join and hash aggregation, to name a few.
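The vectorized operations the abstract describes can be illustrated with NumPy arrays standing in for Arrow vectors. This is a sketch of the concepts, not Dremio's implementation:

```python
import numpy as np

# Columnar layout: each column is one contiguous array, so a
# FILTER + SUM becomes tight, SIMD-friendly vectorized operations.
prices = np.array([10.0, 25.0, 7.5, 40.0, 12.5])
quantities = np.array([2, 1, 4, 1, 3])

mask = prices > 10.0                         # vectorized FILTER
revenue = (prices * quantities)[mask].sum()  # vectorized SUM over survivors

# Dictionary encoding: store small integer codes plus a dictionary,
# and operate directly on the codes without decoding every value.
dictionary = np.array(["US", "CA", "MX"])
codes = np.array([0, 0, 1, 2, 0])
us_rows = (codes == 0).sum()                 # COUNT on encoded values
```

The dictionary-encoded COUNT shows the "operate on compressed values" idea: the comparison runs on the integer codes, and the string dictionary is never touched.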


Jacques Nadeau

CTO & Co-founder, Dremio

Thursday November 14, 2019 11:40am - 12:10pm PST

1:00pm PST

Was He Wright All Along? Software After Moore's Law
Moore's Law is indisputably ending -- but what was it even in the first place?  In particular, can the phenomenon we think of as Moore's Law actually be better explained by Theodore Wright in a 1936 paper on aircraft economics?  If so, could Wright's Law continue to apply as Moore's Law ends?  More generally, what are the ramifications for software as silicon-based microprocessors reach their physical limitations?  In this talk, we will explore the end of Moore's Law, the prospects for Wright's Law in microprocessors, and what it all means for those of us who build software systems.


Thursday November 14, 2019 1:00pm - 1:30pm PST

1:40pm PST

Scaling Financial Automation on TypeBus
Financial systems are known to move at glacial speeds, making it tricky to build innovative systems in a world where everyone wants to access their data in real time. Tally provides financial automation to our customers in an innovative way using TypeBus, a framework for building distributed microservices in Scala using Akka Streams and Kafka. TypeBus makes it possible to run various asynchronous tasks while remaining available for customers to access their data in real time.

There are many libraries out there focused on delivering low-latency responses. There are fewer which aim to be transparent to the user, provide auditability with baked-in retries and back pressure, and can be easily distributed across a cluster. In this session, we'll discuss problems Tally faced building microservices in the past and why we moved to TypeBus.


Tabitha Blagdon

Engineering Manager, Tally

Kaoru Kohashigawa

Senior Platform Engineer, Tally

Thursday November 14, 2019 1:40pm - 2:10pm PST

2:20pm PST

End-to-End ML Pipelines with KubeFlow and TensorFlow Extended (TFX)
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + Airflow + Jupyter

In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow. Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google. KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices, including hyper-parameter tuning, distributed model training, and model tracking. Airflow is the most widely used pipeline orchestration framework in machine learning.

Prerequisites: a modern browser — and that's it! Every attendee will receive a cloud instance. Nothing will be installed on your local laptop, and everything can be downloaded at the end of the workshop.

Agenda:
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set up ML training pipelines with KubeFlow and Airflow
4. Transform data with TFX Transform
5. Validate training data with TFX Data Validation
6. Train models with Jupyter, Keras, and TensorFlow 2.0
7. Run a notebook directly on the Kubernetes cluster with KubeFlow Fairing
8. Analyze models using TFX Model Analysis and Jupyter
9. Perform hyper-parameter tuning with KubeFlow and Katib
10. Select the best model using KubeFlow experiment tracking
11. Reproduce model training with the TFX Metadata Store
12. Deploy the model to production with TensorFlow Serving and Istio
13. Save and download your workspace

Key takeaways: attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.


Chris Fregly

Developer Advocate, AI and Machine Learning, AWS

Thursday November 14, 2019 2:20pm - 2:50pm PST

3:00pm PST

Apache Flink 2.0: Unified Enterprise Data Processing System and Beyond
As the most popular and widely adopted stream processing framework, Apache Flink powers some of the world's largest stream processing use cases in companies like Netflix, Alibaba, Uber, Lyft, Pinterest, Yelp, etc.

In this talk, we will first go over use cases and basic (yet hard to achieve!) requirements of stream processing, and how Flink stands out with some of its unique core building blocks, like pipelined execution, native event time support, state support, and fault tolerance.

We will then take a look at how Flink is going beyond stream processing into areas like unified streaming/batch data processing, enterprise integration with Hive, AI/machine learning, and serverless computation; where Flink fits with its distinct value; and what development is going on in the Flink community to fill the gap.
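Native event-time support, one of the core building blocks mentioned above, can be illustrated with a toy tumbling-window aggregator. This is plain Python for illustration only; Flink's actual API (windows, watermarks, triggers) differs:

```python
from collections import defaultdict

def event_time_windows(events, window_size):
    """Assign events to tumbling windows by their *event* timestamp
    (when they happened), not by arrival order, so late or
    out-of-order events still land in the correct window."""
    windows = defaultdict(float)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        windows[window_start] += value
    return dict(windows)
```

Note the third event below arrives out of order (timestamp 3 after timestamp 12) yet is still credited to the first window — the property that processing-time windowing cannot guarantee.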


Bowen Li

Senior Software Engineer, Alibaba
Bowen is a committer on Flink and a senior engineer at Alibaba. He frequently gives talks on Flink at conferences, and organizes Flink meetups and events in Seattle.

Thursday November 14, 2019 3:00pm - 3:30pm PST

3:40pm PST

Swift for TensorFlow: Machine Learning with No Boundaries
Swift for TensorFlow is a platform for the next generation of machine learning that leverages innovations like first-class differentiable programming to seamlessly integrate deep neural networks with traditional software development. In this session, learn how Swift for TensorFlow can make advanced machine learning research easier and why Jeremy Howard’s fast.ai has chosen it for the latest iteration of their deep learning course.


Paige Bailey

Developer Advocate (TensorFlow), Google

Thursday November 14, 2019 3:40pm - 4:10pm PST

4:20pm PST

Hack Weekend: ML models on mobile
ML models are increasingly deployed on phones, but what does it actually take to go from a state-of-the-art model in Python to running that model on a phone? Spoiler alert: a lot. Erik recaps his weekend of attempting to go from zero mobile-development or on-device-model experience to running GPT-2 in an iOS app, covering the problems he encountered using Core ML, ONNX, and TFLite, how to solve those problems, and why Swift for TensorFlow has the potential to change everything.


Erik Reppel

ML Platform Engineer, Coinbase
Erik Reppel is an engineer on the Machine Learning and Platform team at Coinbase where he primarily works on improving the quality of ML tooling and deploying ML models at scale.

Thursday November 14, 2019 4:20pm - 4:50pm PST

5:00pm PST

Machine Learning's Missed Opportunity in Visual Data Management
ApertureData's platform accelerates AI applications through a data management solution that redefines how large visual data sets are stored, searched, and processed. It exposes a unified interface that allows users to store and search both the data and the metadata associated with visual artifacts (images or videos).

ApertureData's platform provides several innovative features: the ability to evolve metadata easily without requiring a costly schema change, first-class status for feature vectors and bounding boxes, the ability to perform similarity searches, and the ability to perform common pre-processing operations close to the data. The platform is designed to be pluggable, allowing data to be stored on different backends and serving any machine learning pipeline. Based on our current work with customers, our platform, when used for a medical imaging use case, provides up to a 5× improvement over the range of queries commonly executed in the field, and can save upwards of 2 months per data scientist per machine learning deployment for every new application that wants to exploit data to gather insights.

What other makeshift solutions fail to address is that once AI is ready to be commercialized, managing the onslaught of real visual data is going to be a killer for real deployments. Our talk will explain how the ApertureData platform achieves this performance and functionality for a wide range of application domains, and will include a demo showing how to use it.
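The similarity search feature described above boils down to nearest-neighbor lookup over feature vectors. A minimal cosine-similarity sketch in NumPy (illustrative only, not ApertureData's API; real systems use approximate indexes for scale):

```python
import numpy as np

def nearest_neighbors(query, vectors, k=3):
    """Return indices of the k stored feature vectors most similar
    to `query` under cosine similarity (exact, brute-force search)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                  # cosine similarity against every vector
    return np.argsort(-scores)[:k]  # best matches first
```

In a visual data management system the `vectors` would be embeddings extracted from images or video frames, and the returned indices would map back to the stored artifacts and their metadata.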


Vishakha Gupta-Cledat

Founder and CEO, ApertureData
I am the Founder and CEO of ApertureData. Prior to that, I was at Intel Labs for over 7 years, where I led the design and development of VDMS (the Visual Data Management System), which forms the core of the ApertureData platform. I have a Ph.D. in Computer Science from the Georgia Institute...

Thursday November 14, 2019 5:00pm - 5:30pm PST
Friday, November 15

9:40am PST

Human-Centric ML Infrastructure at Netflix
In this talk, we will share our experiences on building Metaflow, a Python library that empowers data scientists at Netflix to prototype, build, deploy, and operate end-to-end machine learning solutions. We started building Metaflow at Netflix to provide a solid foundation for hundreds of internal ML use cases, from classical statistical analysis to large-scale applications of deep learning. Metaflow is designed with a human-centric mindset: instead of reinventing the wheel for large-scale computing or machine learning, we integrate existing solutions into a delightfully consistent and easy-to-use package. This talk focuses on our philosophy towards Machine Learning infrastructure and dives into the internals of Metaflow; it will highlight lessons that we have learned in building a Python library that needs to be robust, performant, and flexible enough to solve a large set of complex real-world business problems related to machine learning. This talk is for you if you want to learn how to develop systems for big data and ML in Python.


Savin Goyal

Senior Software Engineer, Netflix

Ville Tuulos

Architect, Netflix

Friday November 15, 2019 9:40am - 10:10am PST

10:20am PST

machine learning and mobile
We will discuss different approaches for bringing machine learning to mobile devices, then build an end-to-end pipeline using Swift for TensorFlow and MLIR to train and deploy models to a phone.


brett koonce

cto, quarkworks

Friday November 15, 2019 10:20am - 10:50am PST

11:00am PST

Ludwig, a Code-Free Deep Learning Toolbox
The talk will introduce Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without the need to write code. It is unique in its ability to make deep learning easier to understand for non-experts and to enable faster model-improvement iteration cycles for experienced machine learning developers and researchers alike. By using Ludwig, experts and researchers can simplify the prototyping process and streamline data processing so that they can focus on developing deep learning architectures.


Piero Molino

Senior ML / NLP Research Scientist, Uber AI Labs

Friday November 15, 2019 11:00am - 11:30am PST

11:40am PST

Weld: An Optimizing Runtime for High Performance Data Analytics
Developers write software by combining independently written libraries and functions. Even though individual functions in these libraries are optimized, the lack of end-to-end optimization can cause order-of-magnitude slowdowns in the whole workflow compared to a tuned implementation written in C. For example, even though TensorFlow uses highly tuned linear algebra functions for each of its operators, workflows that combine these operators can be 16× slower than hand-tuned code. Similarly, workflows that perform relational processing in Spark SQL or Pandas, numerical processing in NumPy, or a combination of these tasks spend much of their time in data movement across processing functions and could run up to 100× faster if optimized end to end.

Weld is an ongoing open-source project from Stanford to accelerate data-intensive applications by as much as 100×. It does so by JIT-compiling parallel code and optimizing across functions within a single library as well as across different libraries, so developers can write modular code and still get close to bare-metal performance without incurring expensive data movement costs. Weld's compiler uses a new, explicitly parallel functional intermediate representation to capture the structure of data-parallel workloads such as SQL, machine learning, and graph analytics, and then optimizes across them using an adaptive optimizer that takes hardware characteristics into account.

We demonstrate how Weld can be incrementally integrated into these libraries by porting only the most impactful operators first, without breaking compatibility with other operators in the library and without changing the libraries' APIs (so users do not need to change their application code). We also show how Weld speeds up existing workloads in these frameworks and enables speed-ups of two orders of magnitude in applications that combine them. The Weld library and Weld-enabled versions of the Pandas and NumPy libraries are available to download on PyPI.
Weld is open source at https://www.weld.rs.
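The data-movement cost Weld eliminates is easy to see in a toy example: an unfused NumPy expression materializes full intermediate arrays and makes multiple passes over memory, while a fused loop reads each element once. Illustrative only — in interpreted Python the explicit loop is actually slower; Weld's point is that it JIT-compiles the fused form to fast parallel code:

```python
import numpy as np

def unfused(a, b, c):
    # Each step materializes a full intermediate array and makes
    # an extra pass over memory -- the cost Weld targets.
    t1 = a * b
    t2 = t1 + c
    return t2.sum()

def fused(a, b, c):
    # A fused loop touches each element once and keeps a running
    # sum in a register; no intermediates are written to memory.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i] + c[i]
    return total
```

Both compute the same result; the difference is memory traffic, which dominates for large arrays that do not fit in cache.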


Shoumik Palkar

Ph.D. Student, Stanford University

Friday November 15, 2019 11:40am - 12:10pm PST

1:00pm PST

Discovering Your Model's Known Unknowns and Unknown Unknowns
Selecting the right training data for human review is known as Active Learning. Almost every company invents (or reinvents) the same Active Learning strategies and too often they repeat the same avoidable errors. This talk will share some common Active Learning strategies, with PyTorch examples, covering: Least Confidence Sampling, Entropy-based Sampling, Cluster-based Sampling, Model-based Outliers, Monte Carlo Dropouts (Deep Bayesian Active Learning), Representative Sampling, and Sampling for Real-World Diversity.
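Two of the listed strategies, least-confidence and entropy-based sampling, fit in a few lines. NumPy is used here rather than PyTorch for brevity; this is a sketch, not the speaker's code:

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty = 1 - probability of the top predicted class."""
    return 1.0 - probs.max(axis=1)

def entropy(probs):
    """Prediction entropy; highest when the model is most unsure."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def select_for_labeling(probs, k=2, strategy=least_confidence):
    """Pick the k unlabeled examples the model is least sure about,
    given an (n_examples, n_classes) array of predicted probabilities."""
    scores = strategy(probs)
    return np.argsort(-scores)[:k]  # most uncertain first
```

Swapping `strategy=entropy` changes the sampling criterion without touching the selection logic, which is how most of the strategies in the talk's list compose.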


Rob Munro

Humanitarian and technology experience includes: working in post-conflict development in Liberia and Sierra Leone for UNHCR; researching health communications in Malawi; software development supporting endangered languages; running crowdsourced translation following disasters in Haiti...

Friday November 15, 2019 1:00pm - 1:30pm PST

1:40pm PST

Introduction to Geospatial Analysis for the Uninitiated SQL Data Engineer
The talk will introduce GIS to uninitiated SQL data engineers. It will be most useful to someone who writes SQL queries and pipelines, knows nothing about GIS, and wants to enhance their analysis with geospatial data.


Michael Entin

Software Engineer, Google Inc
Senior Software Engineer on the Google BigQuery team. Before joining the Dremel team, worked on various data processing projects at Microsoft: SQL Server Integration Services, Analysis Services, a distributed platform for AdCenter Business Intelligence, etc.

Friday November 15, 2019 1:40pm - 2:10pm PST

2:20pm PST

Large Scale On-Demand Low-Latency Near Real-Time Predictions
Predictive machine learning is optimizing customer experiences across many industries. This session presents the development process at Sony PlayStation that delivers scalable real-time low-latency predictive ML-based solutions on the cloud.


Gabor Melli

Senior Director of Engineering (ML&AI), Sony Interactive Entertainment

Friday November 15, 2019 2:20pm - 2:50pm PST

3:00pm PST

Lessons Learnt Building Domain Specific NLP Pipelines
At Indix (acquired by Avalara), our goal was to build the "Google of Products". The product catalog currently has 3+ billion products, amassed by crawling 5,000+ retailer and brand web sites. Naturally, we needed a robust NLP pipeline to make sense of unstructured text data at this scale.

The first part of the talk will cover the evolution of the architecture, building blocks, and algorithms of the NLP pipeline. The building blocks I will cover are language models, word embeddings, and the knowledge graph. The algorithms I will cover are classification, entity extraction, document similarity, and query understanding (for the e-commerce domain).

Post-acquisition by Avalara, the team was tasked with making sense of unstructured text data in the tax compliance domain with limited data. The second part of the talk will focus on how we fine-tuned the e-commerce NLP pipeline and transferred our learnings from the e-commerce domain to the tax compliance domain.
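Document similarity, one of the algorithms listed, reduces in its simplest form to bag-of-words cosine similarity. A toy sketch (a production pipeline would use TF-IDF weighting or embeddings instead of raw counts):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Bag-of-words cosine similarity between two documents -- the
    simplest building block behind matching near-duplicate product
    titles or scoring query/document relevance."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

At catalog scale this exact pairwise computation is replaced by inverted indexes or approximate nearest-neighbor search, but the scoring idea is the same.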


Rajesh Muppalla

Senior Director of Engineering, Avalara
Sr. Director of Engineering at Avalara. Using AI to solve Tax automation. Previously co-Founder, Indix (acquired by Avalara), Tech Lead on go.cd. Topics - Machine Learning, Data Pipelines, Continuous Delivery, Mentoring

Friday November 15, 2019 3:00pm - 3:30pm PST

3:40pm PST

Reliable, High-Scale TensorFlow Inference Pipelines at Twitter

Twitter heavily relies on Scala and the JVM and has deep expertise in this area. For instance, we’ve built Finagle for low-latency client/server RPCs, Heron for near-real-time data processing, and Scalding for offline use cases (Hadoop/Spark). In comparison, the ML world is focused on the Python/C++ stack.

To provide a reliable TensorFlow inference offering for the different use cases at Twitter, we’ve had to overcome multiple problems to make it reliable, cost-effective, and scalable to large models. In this presentation, we’ll present our key learnings.

We’ll do a deep dive into specific performance issues that we’ve had to deal with, show how we handled them, and describe the tools and techniques we built to mitigate the issues we observe, as well as the quality gates that prevent issues in the future. We’ll place particular emphasis on observability: catching performance issues early through automatic performance regression analysis on key metrics (CPU usage, memory usage, latency, throughput). We’ll also talk about deciding what you should optimize for (throughput vs. latency, for instance) and thinking early about your performance goals and Service Level Objectives before working on a new model.

All of these aspects have helped us successfully serve 50+ different models in production, handling 20M to 40M+ requests per second.

At the end of this talk, we hope that you will better understand the choices Twitter made along the way to create a reliable JVM-based inference pipeline, and that you will be able to benefit from our experience.
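Automatic performance regression analysis of the kind described can be sketched as a percentile comparison between a baseline build and a candidate build. Plain Python, with hypothetical names and thresholds (not Twitter's tooling):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[idx]

def latency_regressed(baseline_ms, candidate_ms, p=99, tolerance=1.10):
    """Flag the candidate build if its p99 latency exceeds the
    baseline's by more than `tolerance` (10% headroom here)."""
    return percentile(candidate_ms, p) > tolerance * percentile(baseline_ms, p)
```

A gate like this runs on every deploy candidate, comparing fresh benchmark samples against the last known-good build; the same shape applies to CPU and memory metrics.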


Briac Marcatté

Staff ML Engineer, Twitter

Shajan Dasan

Staff ML Engineer, Twitter
Staff Machine Learning Engineer at Twitter. Working on distributed systems for the last 15 years.

Friday November 15, 2019 3:40pm - 4:10pm PST

4:20pm PST

Build Your Own ML Data Feedback Loop
Machine learning models should learn from their history. Data collection and labeling is often the rate-limiting step of AI research. At Curai, our AI tools are deployed in a real-world healthcare setting, giving us the opportunity to learn from their usage. This talk will focus on how to build a semi-automated data feedback loop for ML model retraining, highlighting the specific use case at Curai.

A data feedback loop consists of several key components. First, model output is presented to the user (in our case, a doctor or health professional), who can choose to accept or reject a medical suggestion. This usage data is then sent to data sinks and forwarded to a data store, where post-processing and additional calculations can happen (for example, calculating the edit distance between two strings). Processed data can then be sent down (most simply, through a CSV) to a model for retraining or fine-tuning, and the resulting v2 model can then be tested for accuracy and re-deployed into the product. In short, the semi-automated data feedback loop allows for rapid iteration and continuous learning for AI/ML models.

This talk will focus on specific technologies I and my teammates have used, including, but not limited to, integration with StackDriver, BigQuery, and LaunchDarkly. Attendees will learn how to build a semi-automated data feedback loop, see practical code examples and anecdotes of my own failures and successes in this domain, and consider the ethical implications of using user-generated data for model retraining. There is tremendous potential for AI in healthcare, and closing the data loop for model retraining can help solve one of the key challenges in this domain and continuously improve machine learning models.
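The edit-distance post-processing step mentioned in the abstract is classic Levenshtein distance, which scores how far a clinician's accepted text ended up from the model's suggestion. A compact dynamic-programming sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn `a`
    into `b`. Two rolling rows keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

A distance of zero means the suggestion was accepted verbatim; larger distances are a useful training signal for which suggestions needed heavy correction.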


Sophia Sanchez

Machine Learning Engineer, Curai

Friday November 15, 2019 4:20pm - 4:50pm PST

5:00pm PST

Next-generation frameworks for Large-scale Machine Learning
As the deep-learning revolution matures, there is ever-growing demand for bigger datasets, larger models and more compute infrastructure. What is the role of algorithmic design in this?  I will show several ways to infuse structure into deep networks to overcome these limitations, viz., through tensors, graphs, physical laws, and simulations. Tensorized neural networks lead to large rates of compression while improving on generalization and robustness. In order to speed up multi-node model training, I will demonstrate how simple gradient compression (SignSGD) leads to communication savings while preserving accuracy. Thus, with better algorithmic design, it is possible to obtain “free lunches” and obtain better efficiency in ML.
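SignSGD, mentioned above, compresses each worker's gradient to its sign (1 bit per parameter) and aggregates across workers by majority vote. A minimal NumPy sketch (illustrative; the full algorithm in the literature adds details such as momentum and step-size scaling):

```python
import numpy as np

def signsgd_step(params, grads_per_worker, lr=0.01):
    """One SignSGD update: each worker sends only the sign of its
    gradient; the server takes a majority vote per parameter and
    steps in the winning direction."""
    signs = np.sign(grads_per_worker)   # workers compress to {-1, 0, +1}
    vote = np.sign(signs.sum(axis=0))   # majority vote across workers
    return params - lr * vote
```

The communication saving is the point: each worker transmits 1 bit per parameter instead of a 32-bit float, a 32× reduction, while the majority vote keeps the update direction accurate.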


Anima Anandkumar

Professor, Caltech
Anima Anandkumar holds dual positions in academia and industry. She is a Bren professor in the Caltech CMS department and a director of machine learning research at NVIDIA, where she leads the research group that develops next-generation AI algorithms.

Friday November 15, 2019 5:00pm - 5:30pm PST