Craw4LLM: Efficient Web Crawling for LLM Pretraining
Shi Yu, Zhiyuan Liu, Chenyan Xiong
ACL 2025 (Findings) source ↗
Research
Publications from the group, spanning foundation-model pretraining, retrieval, and applied language modeling. Sorted reverse-chronologically.
João Coelho, Bruno Martins, João Magalhães, Chenyan Xiong
Association for Computing Machinery, Inc. source ↗
Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
arXiv preprint source ↗
Hao Kang*, Zichun Yu*, Chenyan Xiong
arXiv source ↗
Liwen Sun, James Zhao, Megan Han, Chenyan Xiong
Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) source ↗
Cathy Jiao, Weizhen Gao, Aditi Raghunathan, Chenyan Xiong
Findings of the Association for Computational Linguistics (NAACL 2025) source ↗
Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
arXiv source ↗
Xiaochuan Li, Zichun Yu, Chenyan Xiong
ICLR 2025 source ↗
Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
arXiv source ↗
Luyang Zhang, Cathy Jiao, Beibei Li, Chenyan Xiong
arXiv source ↗
João Coelho, Bruno Martins, Joao Magalhaes, Jamie Callan, Chenyan Xiong
Association for Computational Linguistics source ↗
Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, Chenyan Xiong
International Conference on Machine Learning (ICML) source ↗
Zichun Yu, Spandan Das, Chenyan Xiong
Neural Information Processing Systems (NeurIPS) source ↗