MSCS Thesis Presentation - Shiqi Pan
— 4:30pm
Location: In Person - Reddy Conference Room, Gates Hillman 4405
Speaker: Shiqi Pan, Master's Student, Computer Science Department, Carnegie Mellon University
Modern large language models contain operations with vastly different computational characteristics: projections and MLPs are compute-bound, while attention mechanisms are memory-bound. Hybrid architectures combining sliding window attention, linear attention, and Mixture of Experts further complicate this operational heterogeneity. Meanwhile, datacenters deploy heterogeneous GPUs with complementary profiles—H100s excel at compute-intensive workloads while H20s better serve memory-bound operations. This creates opportunities for operation-level disaggregation: matching different operations to specialized hardware.
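As a rough illustration of why these operations fall on opposite sides of the roofline, the short sketch below compares the arithmetic intensity of a projection/MLP GEMM with that of a single decode-step attention read of the KV cache. The model dimensions, sequence lengths, and spec-sheet ridge points are illustrative assumptions for this announcement, not measurements from the thesis.

# Back-of-the-envelope roofline comparison (illustrative assumptions only).

def gemm_intensity(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs/byte) of an (m x k) @ (k x n) projection/MLP GEMM."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def decode_attention_intensity(kv_len, head_dim, bytes_per_elem=2):
    """Arithmetic intensity of one decode-step attention head reading its KV cache."""
    flops = 4 * kv_len * head_dim                          # q @ K^T plus weights @ V
    bytes_moved = 2 * kv_len * head_dim * bytes_per_elem   # K and V reads dominate
    return flops / bytes_moved

# Approximate spec-sheet ridge points (dense FP16 TFLOPS / TB/s); adjust to your hardware.
ridge_h100 = 989e12 / 3.35e12   # ~295 FLOPs/byte
ridge_h20  = 148e12 / 4.0e12    # ~37 FLOPs/byte

print(f"MLP GEMM (4096 tokens, 4096 -> 16384): {gemm_intensity(4096, 4096, 16384):.0f} FLOPs/byte")
print(f"Decode attention (kv_len=8192, head_dim=128): {decode_attention_intensity(8192, 128):.1f} FLOPs/byte")
print(f"Ridge points -> H100: {ridge_h100:.0f} FLOPs/byte, H20: {ridge_h20:.0f} FLOPs/byte")

Under these assumptions the GEMM sits far above either ridge point (compute-bound), while decode attention sits at roughly one FLOP per byte (memory-bound), which is the gap that operation-level disaggregation aims to exploit.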
However, two critical gaps prevent realizing these opportunities. First, no framework systematically characterizes how hybrid LLM operations perform on heterogeneous hardware. Second, current serving systems use rigid layer-granularity pipeline parallelism, preventing specialized placement of individual operations.
This thesis addresses both gaps. We develop quantitative performance models that characterize operation-level costs, arithmetic intensity, and bottlenecks for attention variants, MLP, and MoE operations, motivating architecture-aware, disaggregated placement of operations. We also design and implement a flexible system extending vLLM that supports arbitrary operation-level stage definitions and non-contiguous stage patterns through multi-visit execution, metadata caching, zero-copy tensor transmission, and tensor reordering for FlashAttention compatibility.
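To make "operation-level stage definitions" concrete, the following is a hypothetical sketch of how such a placement plan might be expressed. The names (OpStage, PlacementPlan, the device strings, and the operation labels) are invented for illustration and are not the interface of the thesis system or of vLLM.

# Hypothetical operation-level placement plan; names and structure are illustrative only.
from dataclasses import dataclass, field

@dataclass
class OpStage:
    """A pipeline stage defined at operation granularity rather than layer granularity."""
    device: str    # e.g. "h100:0" or "h20:0"
    ops: list      # operation names this stage executes
    layers: range  # layers whose listed ops run on this device

@dataclass
class PlacementPlan:
    stages: list = field(default_factory=list)

# Compute-heavy projections and MLPs on an H100, memory-bound attention on an H20.
# Each layer is visited by both stages, which is what multi-visit execution permits.
plan = PlacementPlan(stages=[
    OpStage(device="h100:0", ops=["qkv_proj", "o_proj", "mlp"], layers=range(0, 32)),
    OpStage(device="h20:0",  ops=["attention"],                 layers=range(0, 32)),
])

for stage in plan.stages:
    print(stage.device, stage.ops, f"layers {stage.layers.start}-{stage.layers.stop - 1}")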
This work provides the analytical foundation and system infrastructure for operation-aware heterogeneous LLM serving, enabling future research in automated configuration and deployment optimization.
Thesis Committee
Rashmi K. Vinayak (Chair)
Zhihao Jia
Additional Information
For More Information: amalloy@cs.cmu.edu