PhD Research School Speakers

Introduction to the Learning Directed Operating System Expedition

Photo of Aditya Akella

Aditya Akella, Director of LDOS Expedition

Aditya Akella is a Regents Chair Professor of Computer Science at UT Austin. He obtained his Ph.D. from CMU and his B.Tech. from IIT Madras. Prior to joining UT Austin, he spent fifteen years as a professor at UW-Madison. Prof. Akella works on improving the performance, reliability, and correctness of cloud and Internet infrastructure. His research straddles the boundary between computer networking and adjacent areas such as operating systems, databases, and formal methods. Prof. Akella has won numerous awards for his research, teaching, and service contributions, and his research has impacted production systems run by some of the world’s largest tech companies.


Bridging ML and Formal Methods for Trustworthy Network Management

As computer networks support increasingly advanced applications and grow faster and more complex, managing them demands automated solutions that surpass human capability. While Machine Learning (ML) offers promise, its fragility, lack of guarantees, and susceptibility to attacks make it unreliable for high-stakes network management. Conversely, formal methods provide strong guarantees but struggle with scalability and require complete models.

This talk explores how blending ML with formal methods can lead to intelligent, reliable, and attack-resistant network management. By combining learning-based adaptability with logical reasoning, we can develop solutions that are both scalable and verifiable, paving the way for networks that are not just automated—but truly trustworthy.
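
To make the combination concrete, here is a minimal sketch (my illustration, not the speaker's system) of one common pattern: a learned model proposes a network configuration and an SMT solver checks a safety property before the proposal is deployed. The two-path traffic-split scenario, the capacity numbers, and the some_learned_policy step are all illustrative assumptions.

from z3 import Real, Solver, sat

def verify_split(split_a, split_b, demand_gbps, cap_a_gbps, cap_b_gbps):
    """Return True iff the proposed two-path split keeps both links within capacity."""
    a, b = Real("a"), Real("b")
    s = Solver()
    s.add(a == split_a, b == split_b)        # pin the ML proposal
    s.add(a >= 0, b >= 0, a + b == 1)        # it must be a valid split
    s.add(a * demand_gbps <= cap_a_gbps)     # link A is not overloaded
    s.add(b * demand_gbps <= cap_b_gbps)     # link B is not overloaded
    return s.check() == sat

# proposal = some_learned_policy(traffic_matrix)   # hypothetical ML step
proposal = (0.7, 0.3)
if verify_split(*proposal, demand_gbps=100, cap_a_gbps=80, cap_b_gbps=40):
    print("deploy the learned proposal")
else:
    print("reject it and fall back to a verified default")

The division of labor is the point: the learned component supplies adaptability, while the solver supplies the guarantee that nothing unsafe reaches the network.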

Photo of Maria Apostolaki

Maria Apostolaki, Princeton University

Maria Apostolaki is an Assistant Professor of Electrical and Computer Engineering at Princeton University. Her research spans networking and security, with a focus on combining ML with formal methods for more trustworthy network management. She has received the Google Research Scholar Award, IETF/IRTF Applied Networking Research Prizes, and Commendations for Outstanding Teaching. Maria earned her PhD from ETH Zurich and was a postdoctoral researcher at Carnegie Mellon University before joining Princeton.


Accelerating Software Development: The LLM (R)evolution

Large language models are achieving state-of-the-art results across a wide variety of domains, eclipsing past work in well-studied areas like auto-completion. I argue that they should also presage a "Cambrian explosion": a wave of radically new kinds of software development tools powered by AI that will make all our lives easier. I propose a paradigm for how we can best rethink existing tools to leverage a combination of LLMs and PL technologies like static and dynamic analysis, which promises to evolve our software tools far beyond their current capacities. I'll talk about this in the context of a range of tools built in my lab, including a profiler that proposes optimizations (Scalene), a debugger that actually debugs code and proposes fixes by leveraging real-world knowledge (ChatDBG), compiler error messages that actually explain the problem and propose solutions (CWhy), and data analysis frameworks that actually analyze your data (FlowCo).
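
As a flavor of the LLM-plus-program-analysis pattern behind these tools, here is a minimal sketch (my own illustration, not the implementation of Scalene or ChatDBG): a standard profiler gathers dynamic information about a run, and its output is handed to a language model together with the source code. The ask_llm helper is a hypothetical stand-in for whatever model API is used.

import cProfile
import io
import pstats

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred model API here")  # hypothetical stand-in

def profile_report(func, *args) -> str:
    """Profile one call and return the five hottest functions as text."""
    prof = cProfile.Profile()
    prof.runcall(func, *args)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()

def suggest_optimizations(source_code: str, func, *args) -> str:
    report = profile_report(func, *args)
    prompt = ("Here is a Python program and a profile of one run.\n"
              f"--- code ---\n{source_code}\n--- profile ---\n{report}\n"
              "Suggest concrete optimizations for the hottest functions.")
    return ask_llm(prompt)  # the model reasons over the code plus real measurements

Grounding the model in analysis output, rather than prompting it with code alone, is what distinguishes this paradigm from a plain chat assistant.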
  
Photo of Emery Berger

Emery Berger, University of Massachusetts Amherst

Emery Berger is a Professor of Computer Science at the University of Massachusetts Amherst, the flagship campus of the UMass system, and an Amazon Scholar at Amazon Web Services. At UMass, Professor Berger leads the PLASMA lab, whose research has led to numerous impactful software systems. Professor Berger is also the developer and sole maintainer of the influential CSrankings.org site, which has served over 3 million users. He served six years as an elected member of the SIGPLAN Executive Committee and a decade as Associate Editor of TOPLAS; he served as Program Chair for PLDI 2016 and co-Program Chair of ASPLOS 2021. His honors include an NSF CAREER Award; Most Influential Paper Awards at OOPSLA, PLDI, and ASPLOS; five CACM Research Highlights; and Best Paper Awards at FAST, OOPSLA, SOSP, and OSDI. He is an ACM Fellow.


ML and Generative AI for Data Systems

Machine learning (ML) and Generative AI (GAI) are changing the way we build, operate, and use data systems. For example, ML-enhanced algorithms, such as learned scheduling algorithms and indexes/storage layouts, are being deployed in commercial data services; GAI code assistants help developers build features more quickly; ML-based techniques simplify operations by automatically tuning system knobs; and GAI-based assistants help to debug operational issues. Most importantly, though, Generative AI is reshaping the way users interact with data systems. Even today, all leading cloud providers offer natural language to SQL (NL2SQL) features as part of their Python notebook or SQL editors to increase the productivity of analysts. Business-line users are starting to use natural language as part of their visualization platforms or enterprise search, whereas application developers are exploring new ways to expose (structured) data as part of their GAI-based experiences using RAG and other techniques. Some even go so far as to say that "English will become the new SQL," despite the obvious challenge that English is often more ambiguous.
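
To make the NL2SQL pattern concrete, here is a minimal sketch (my illustration, not any vendor's feature): the model is prompted with the schema and the user's question, returns SQL, and the system executes it like any other query. The nl2sql_llm function is a hypothetical stand-in, stubbed here with a hand-written query, and the orders table is an invented example.

import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, day TEXT);"

def nl2sql_llm(schema: str, question: str) -> str:
    # Hypothetical model call: a real system would prompt an LLM with the schema
    # and the question; this stub returns the query such a model might produce.
    return ("SELECT customer, SUM(total) AS revenue FROM orders "
            "GROUP BY customer ORDER BY revenue DESC LIMIT 3;")

def answer(db: sqlite3.Connection, question: str):
    sql = nl2sql_llm(SCHEMA, question)  # natural language in, SQL out
    return db.execute(sql).fetchall()   # the generated SQL runs against the database

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
db.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
               [(1, "acme", 120.0, "mon"), (2, "acme", 30.0, "tue"), (3, "zeta", 99.0, "mon")])
print(answer(db, "Who are our top three customers by revenue?"))

The ambiguity mentioned above lives in the first step: different users can mean different things by "top customers," which is why generated SQL typically still needs review or guardrails.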

Arguably, industry is leading many of these efforts, and they are happening at unprecedented speed: almost every week there is a new product announcement. Yet much of this work feels ad hoc, and despite all the announcements, many challenges remain before ML/GAI for systems becomes truly practical in these areas. In this talk, I will provide an overview of some of these recent developments and outline how academic solutions often differ from those deployed in industry. Finally, I will list several opportunities for academia to not only contribute but also build a better, more grounded foundation.

Photo of Tim Kraska

Tim Kraska, Massachusetts Institute of Technology

Tim Kraska is a director of applied science at Amazon Web Services (AWS), a professor of Electrical Engineering and Computer Science (EECS) in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), and co-director of MIT's Data Systems and AI Lab (DSAIL@CSAIL); he co-founded Instancio and Einblick Analytics (both acquired). Currently, his research focuses on using ML/GAI for data systems. Before joining MIT, Tim was an Assistant Professor at Brown and spent time at Google Brain. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and has received several awards, including the VLDB Early Career Research Contribution Award, the Intel Outstanding Researcher Award, the VMware Systems Research Award, the university-wide Early Career Research Achievement Award at Brown University, and an NSF CAREER Award, as well as several best paper and demo awards at VLDB, SIGMOD, and ICDE.


AI-Driven Reliability: Towards "Trouble-Free" Cloud-Scale Systems

In his 1998 Turing Award lecture, Jim Gray challenged us to create a "trouble-free" system so reliable that it is used by millions daily and yet managed by a single part-time person. The solution has remained elusive due to the ever-increasing complexity and scale of software systems. In this talk, I will discuss how recent advancements in AI are helping us take meaningful steps toward this goal, with a focus on tackling reliability challenges in Microsoft's cloud systems. Specifically, I will discuss three concrete problems: finding complex bugs, monitoring semantic violations in production, and localizing failures at scale. I will explain why traditional methods fall short and how AI has helped us address these issues. Additionally, I will share insights from our successes and failures in designing AI-driven reliability solutions, along with key design principles we've learned along the way.
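
To make one of these problems concrete, here is a minimal sketch (my illustration, not Microsoft's production monitors) of what a semantic violation can look like: a protocol-level invariant, such as "every lease that is acquired is eventually released," checked over a stream of log events.

from collections import defaultdict

def find_violations(events):
    """events: iterable of (op, resource) pairs, e.g. ("acquire", "lease-42")."""
    held = defaultdict(int)
    violations = []
    for op, resource in events:
        if op == "acquire":
            held[resource] += 1
        elif op == "release":
            if held[resource] == 0:
                violations.append(("release-without-acquire", resource))
            else:
                held[resource] -= 1
    violations += [("never-released", r) for r, count in held.items() if count > 0]
    return violations

print(find_violations([("acquire", "lease-42"), ("acquire", "lease-7"),
                       ("release", "lease-42")]))
# -> [('never-released', 'lease-7')]

The hard parts at cloud scale are discovering which invariants are worth checking and mapping noisy telemetry onto clean events like these, which is where learned components can help.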

Photo of Suman Nath

Suman Nath, Microsoft Research

Suman Nath is a Research Manager at Microsoft Research Redmond, where he leads the Systems Reliability group. He earned his PhD in Computer Science from Carnegie Mellon University in 2005. His current research interests focus on the performance and reliability of distributed and AI systems. His work has been successfully deployed in Microsoft systems and has received Best Paper Awards at conferences such as ACM SOSP, ACM NSDI, ICDE, ACM MobiSys, and ACM SoCC. He is an IEEE Fellow and an ACM Distinguished Member.


Early Experience Using ML in Linux and LAKE

Operating systems play a fundamental role in enabling emerging applications such as AR/VR and assistive robotics to run on modern computer systems. However, today’s OSes were developed for outdated architectures and machine organizations, to support applications that no longer reflect modern workloads. Moreover, today’s OSes manage a computer’s resources (e.g., CPU, memory, I/O) using human-coded heuristic policies designed to provide reasonable worst-case performance across a wide array of environments, ranging from embedded/IoT devices to server-class machines, resulting in sub-optimal resource management and under-utilization. Machine learning (ML) offers a promising path forward for OSes: replacing fixed, hand-coded heuristics with learned policies that can adapt to dynamic environments and unlock the full potential of modern computer systems. The talk will share our experience integrating ML into Linux and the LAKE kernel, highlighting early findings and elaborating on how these findings have shaped our system-building agenda in the LDOS project. The talk will also discuss practical challenges that arise when applying ML to such systems.
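
As a sketch of the heuristic-to-learned-policy shift described above (my illustration, not Linux or LAKE code), consider readahead: a fixed window versus a policy that adapts its window to the observed prefetch hit rate. The adaptive rule below is a stand-in for a genuinely learned model.

FIXED_READAHEAD = 32  # pages; a typical hand-coded heuristic: one size for all workloads

class LearnedReadahead:
    """Pick the next readahead window from the recent prefetch hit rate
    (a stand-in for a real learned model such as a small regression or RL policy)."""
    def __init__(self, lo=4, hi=256):
        self.window, self.lo, self.hi = 32, lo, hi

    def update(self, hit_rate):
        # Grow the window when prefetched pages are being used, shrink it otherwise.
        if hit_rate > 0.8:
            self.window = min(self.hi, self.window * 2)
        elif hit_rate < 0.3:
            self.window = max(self.lo, self.window // 2)
        return self.window

policy = LearnedReadahead()
for hit_rate in [0.9, 0.9, 0.2, 0.5]:
    print(policy.update(hit_rate))  # adapts per workload: 64, 128, 64, 64

A real deployment also has to answer practical questions of the kind the talk discusses, for example where inference runs, what latency and memory it may consume, and how to fall back safely when the model is wrong.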

Photo of Chris Rossbach

Christopher J. Rossbach, University of Texas at Austin

Christopher J. Rossbach is an associate professor of Computer Science at UT Austin, an alumnus of VMware Research Group and of Microsoft Research's Silicon Valley Lab, and co-founder of graph computing startup Katana Graph. He leads the Systems, Concurrent, and Emerging Architectures Research Group (SCEA) at UT Austin and co-directs the Learning-Directed Operating Systems NSF Expeditions Project. His technical interests are in operating systems, distributed systems, and OS, architectural, and PL support for parallel hardware.


Synthetic Data Generation for Accelerating Networked Systems Innovation

Our conversations with stakeholders in academic, federal, and commercial organizations across various sectors (e.g., security, telemetry, finance) tell us that, at every step along the way, lack of access to realistic and diverse data hampers innovation in networked systems. For instance, systems are trained on data that is not representative of actual workloads; there is no way to quantitatively assess systems; machine learning workflows experience data drift; and system audit/feedback is not effective in the field. The result today is poor systems, lack of transparency, substantial effort spent on debugging, reproduction, and resolution, and an inability to share insights across customers. In this talk, we will cover some of our ongoing efforts to demonstrate the feasibility of using synthetic data produced by deep generative models to address these pain points for various tasks (e.g., telemetry, anomaly detection, model training, long-term storage, transmission of telemetry). We will highlight key fidelity, scalability, and privacy challenges and trade-offs in existing approaches, and make a case for bridging systems domain-specific insights with recent advances in machine learning and privacy to tackle these challenges. We will also cover example results from applying these techniques to different types of networking and systems datasets and use cases.
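
As a minimal illustration of the generate-then-evaluate workflow (my sketch; the work described here uses deep generative models such as GANs rather than the toy resampler below), the loop is: fit a generator to real telemetry, sample synthetic records, and check fidelity before the data is shared or used for training.

import random
import statistics

# Stand-in for real flow-size telemetry (heavy-tailed, as flow sizes tend to be).
real_flow_bytes = [random.lognormvariate(8, 1.5) for _ in range(10_000)]

def toy_generator(train, n):
    """Toy stand-in for a trained deep generative model: jittered resampling."""
    return [max(1.0, random.choice(train) * random.uniform(0.9, 1.1)) for _ in range(n)]

synthetic = toy_generator(real_flow_bytes, 10_000)

# Fidelity check on a key marginal; real evaluations also cover joint distributions,
# downstream-task accuracy, and privacy leakage.
for name, data in [("real", real_flow_bytes), ("synthetic", synthetic)]:
    qs = statistics.quantiles(data, n=4)
    print(f"{name}: median={qs[1]:.0f} bytes, p75={qs[2]:.0f} bytes")

The interesting trade-offs appear exactly where this toy glosses over them: a generator faithful enough to preserve joint structure is also the one most at risk of memorizing, and therefore leaking, the data it was trained on.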

Photo of Vyas Sekar

Vyas Sekar, Carnegie Mellon University

Vyas Sekar is the Tan Family Professor of Electrical and Computer Engineering at Carnegie Mellon University. He also serves as the Chief Scientist at Conviva and is a co-founder of Rockfish Data, a startup commercializing his academic research on synthetic data. His work has been recognized with numerous awards, including the SIGCOMM Rising Star Award, the SIGCOMM Test of Time Award, the NSA Science of Security prize, the NSF CAREER award, the Internet Research Task Force Applied Networking Research Prize, the Intel Outstanding Researcher Award, and the IIT Madras Young Alumni Achiever Award. He received his B.Tech from IIT Madras, where he was awarded the President of India Gold Medal, and his PhD from Carnegie Mellon University.


From Compilers to Code Whisperers: Can Generative AI Solve the Optimization Puzzle?

As Moore’s Law slows, the challenge of optimizing program performance shifts toward higher-level abstractions like algorithm selection and API decisions—domains traditionally dependent on human expertise. In this talk, I explore how Large Language Models (LLMs) can revolutionize performance optimization by automating complex code transformations. I will present Performance-Improving Edits, a novel dataset that fuels the ability of LLMs to generate high-performance code edits, outpacing human efforts in competitive programming. This talk dives into the potential of generative AI to augment modern compilers, demonstrating how techniques like fine-tuning, reward-conditioning, and self-play can scale LLM capabilities to handle diverse optimization tasks. I will share insights on how these advancements can be applied to compiler design, achieving substantial speedups and enabling more efficient computing architectures. Finally, I outline a vision for scaling generative AI to autonomously manage code optimizations across platforms and hardware targets.
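
As a flavor of the generate-and-verify loop such work relies on (my sketch, not the actual Performance-Improving Edits pipeline), a model-proposed edit is accepted only if it passes a correctness gate and measurably beats the original. The propose_edit function is a hypothetical model call, stubbed here with a hand-written rewrite.

import timeit

slow_src = "def total(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s\n"

def propose_edit(src: str) -> str:
    # Stand-in for an LLM fine-tuned on performance-improving edits.
    return "def total(n):\n    return n * (n - 1) // 2\n"

def load(src):
    ns = {}
    exec(src, ns)          # compile the candidate into a callable
    return ns["total"]

slow, fast = load(slow_src), load(propose_edit(slow_src))
assert all(slow(n) == fast(n) for n in (0, 1, 10, 1000))   # correctness gate
t_slow = timeit.timeit(lambda: slow(10_000), number=200)
t_fast = timeit.timeit(lambda: fast(10_000), number=200)
print(f"speedup: {t_slow / t_fast:.0f}x")                  # keep the edit only if > 1

Techniques like fine-tuning, reward-conditioning, and self-play come in on the propose_edit side; the measurement side is what turns plausible-looking rewrites into verified speedups.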

Photo of Amir Yazdanbakhsh

Amir Yazdanbakhsh, Google DeepMind

Amir Yazdanbakhsh received his Ph.D. degree in Computer Science from the Georgia Institute of Technology, where his research was recognized with fellowships from Microsoft and Qualcomm. He is currently a Research Scientist at Google DeepMind, working at the intersection of machine learning (ML) and computer architecture. Amir's research focuses on applying ML to design efficient computing systems and improve the efficiency of large-scale data centers. He has also contributed to the design of large-scale distributed systems for training ML applications, leading the development of a reinforcement learning system that scales across TPU Pods and efficiently orchestrates thousands of actors to tackle complex real-world tasks. Additionally, Amir is actively involved in hardware-software codesign, shaping the next generation of Google's ML accelerators.