Junxiong Wang
I obtained my PhD in Computer Science from Cornell University, where I worked at the intersection of systems and large language models, with a focus on linear models and their hybrid variants.
I lead multiple research projects at Together AI, including adaptive speculative decoding, inference-time training, and efficient RL rollouts.
Recent Publications
-
Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
In submission, 2026
-
Haojun Xia*, Xiaoxia Wu*, Jisen Li*, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Leon Song
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
Conference on Machine Learning and Systems (MLSys), 2026
-
Zelei Shao*, Vikranth Srivatsa*, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Conference on Machine Learning and Systems (MLSys), 2026
-
Jiaqi Leng*, Xiang Hu*, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
International Conference on Learning Representations (ICLR), 2026
-
Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
International Conference on Learning Representations (ICLR), 2026
-
Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed S. Abdelfattah, Alexander M. Rush
Overfill: Two-Stage Models for Efficient Language Model Decoding
Conference on Language Modeling (COLM), 2025
-
Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Workshop on Efficient Reasoning, Neural Information Processing Systems (NeurIPS), 2025 (Best Paper Award)
-
Junxiong Wang*, Daniele Paliotta*, Avner May, Alexander M. Rush, Tri Dao
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Models / Video / Code / Blog
Neural Information Processing Systems (NeurIPS), 2024
A shorter version appeared at the ICML 2024 2nd Workshop on Efficient Systems for Foundation Models (ES-FoMo).
Email: Firstname@cs.cornell.edu /
Github /
HuggingFace Models /
Papers /
Twitter