Selected Publications
Google Scholar | Group Publication Page
2025
-
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Shi Yu, Zhiyuan Liu, and Chenyan Xiong
In Findings of ACL, 2025
-
Group-Level Data Selection for Efficient Pretraining
Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, and Chenyan Xiong
In NeurIPS, 2025
-
Fairshare Data Pricing via Data Valuation for Large Language Models
Luyang Zhang, Cathy Jiao, Beibei Li, and Chenyan Xiong
In NeurIPS, 2025
2024
-
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu, Chenyan Xiong, Arnold Overwijk, and Wen-tau Yih
In ICML, 2024
-
ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, and Chenyan Xiong
In ICML, 2024
2022
-
ClueWeb22: 10 Billion Web Documents with Rich Information
Arnold Overwijk, Chenyan Xiong, and Jamie Callan
In SIGIR, 2022
2021
-
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk
In ICLR, 2021
-
COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
Yu Meng, Chenyan Xiong, Payal Bajaj, Paul Bennett, Jiawei Han, Xia Song, and others
In NeurIPS, 2021