WG211M24Shaikhha
Modern data science pipelines have become increasingly complex, employing diverse workloads such as tensor algebra, graph processing algorithms, and relational query processing. This diversity often leads to the use of loosely coupled data processing frameworks that require moving data across different stages of the analytics pipeline. However, this constant movement of data introduces significant inefficiencies in resource utilization and increases energy consumption, both critical concerns in today's computing environments.
In this talk, I present a novel compilation-based approach designed to bring computation closer to the data, thereby mitigating these inefficiencies. The approach centers on designing domain-specific languages (DSLs) that exploit the data's inherent structure through algebraic optimizations. By tailoring these DSLs to specific data types and computational patterns, we can perform computations more efficiently and reduce data movement across the pipeline. I will demonstrate how this approach significantly outperforms state-of-the-art frameworks across a wide range of applications, including database query processing and tensor operations. These results illustrate how compilation techniques and DSLs can be harnessed to optimize data-intensive workloads and pave the way for more efficient and sustainable data science practices.
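To make the idea of an algebraic optimization concrete, here is a minimal sketch in plain Python. It is not the DSL or compiler presented in the talk. It only illustrates one well-known rewrite of the kind such a compiler might apply: using distributivity to push a sum through a join, so that the join result is never materialized. The function names and the toy relations `r` and `s` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

def naive_join_sum(r, s):
    """Materialize the join of r and s on their key, then aggregate a * b."""
    joined = [(k1, a, b) for (k1, a) in r for (k2, b) in s if k1 == k2]
    return sum(a * b for (_, a, b) in joined)

def factorized_join_sum(r, s):
    """Same result without materializing the join.

    Uses distributivity: for each key k,
        sum over pairs (a, b) of a * b == (sum of a) * (sum of b).
    """
    sum_r = defaultdict(int)
    for k, a in r:
        sum_r[k] += a          # pre-aggregate r per key
    sum_s = defaultdict(int)
    for k, b in s:
        sum_s[k] += b          # pre-aggregate s per key
    # Combine the per-key partial sums for keys present in both inputs.
    return sum(sum_r[k] * sum_s[k] for k in sum_r if k in sum_s)

# Hypothetical toy relations of (key, value) pairs.
r = [(1, 2), (1, 3), (2, 5)]
s = [(1, 10), (1, 20), (3, 7)]

assert naive_join_sum(r, s) == factorized_join_sum(r, s)
```

The naive version builds an intermediate list whose size can grow with the product of the matching rows, while the rewritten version makes one pass over each input and keeps only one partial sum per key. In a compilation-based setting, a rewrite like this would be applied automatically to the program rather than written by hand.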