WNCG Seminar Series: Scalable and User-Friendly Machine Learning in Apache Spark
Modern datasets are rapidly growing in size and complexity, and this wealth of data holds promise for many transformational applications. Machine learning seems poised to deliver on this promise: over the past several decades, the field has proposed and rigorously evaluated a wide range of data processing techniques. However, concerns over scalability and usability present major roadblocks to the wider adoption of these methods. In this talk, I will describe the MLbase project, which aims to address these concerns by developing machine learning functionality on top of Apache Spark, a popular cluster computing engine designed for iterative computation. I will first describe MLlib, Spark’s scalable machine learning library, which grew out of the MLbase project. I will then discuss higher-level components of MLbase, focusing on hyperparameter optimization as a means of simplifying the construction of machine learning pipelines.