Publications

Selected Publications


2025

  1. Craw4LLM: Efficient Web Crawling for LLM Pretraining
    Shi Yu, Zhiyuan Liu, and Chenyan Xiong
    In Findings of ACL, 2025
  2. Group-Level Data Selection for Efficient Pretraining
    Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, and Chenyan Xiong
    In NeurIPS, 2025
  3. Fairshare Data Pricing via Data Valuation for Large Language Models
    Luyang Zhang, Cathy Jiao, Beibei Li, and Chenyan Xiong
    In NeurIPS, 2025

2024

  1. MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
    Zichun Yu, Spandan Das, and Chenyan Xiong
    In NeurIPS, 2024
  2. ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
    Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, and Chenyan Xiong
    In ICML, 2024

2022

  1. ClueWeb22: 10 Billion Web Documents with Rich Information
    Arnold Overwijk, Chenyan Xiong, and Jamie Callan
    In SIGIR, 2022

2021

  1. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk
    In ICLR, 2021
  2. COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
    Yu Meng, Chenyan Xiong, Payal Bajaj, Paul Bennett, Jiawei Han, Xia Song, and others
    In NeurIPS, 2021