Junxiong Wang
I obtained my PhD in Computer Science from Cornell University, where I worked at the intersection of systems and large language models, with a focus on linear models and their hybrid variants.
I lead multiple research projects at Together AI, including adaptive speculative decoding, inference-time training, and efficient RL rollouts.
If you would like to see my CV, please feel free to contact me by email.
Recent Publications
-
Haojun Xia*, Xiaoxia Wu*, Jisen Li*, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, Shuaiwen Leon Song
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
In submission, 2025
-
Zelei Shao*, Vikranth Srivatsa*, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
In submission, 2025
-
Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, Ben Athiwaratkun
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
In submission, 2025
-
Jiaqi Leng*, Xiang Hu*, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
In submission, 2025
-
Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed S. Abdelfattah, Alexander M. Rush
Overfill: Two-Stage Models for Efficient Language Model Decoding
Conference on Language Modeling (CoLM), 2025
-
Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Workshop on Efficient Reasoning (Best Paper Award), Neural Information Processing Systems (NeurIPS), 2025
-
Junxiong Wang*, Daniele Paliotta*, Avner May, Alexander M. Rush, Tri Dao
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Models, Video, Code, Blog
Neural Information Processing Systems (NeurIPS), 2024
A shorter version appeared at the 2nd Workshop on Efficient Systems for Foundation Models (ES-FoMo) at ICML 2024
Email: Firstname@cs.cornell.edu /
Github /
HuggingFace Models /
Papers /
Twitter