CX Research Group

Research

2025

Craw4LLM: Efficient Web Crawling for LLM Pretraining
Shi Yu, Zhiyuan Liu, Chenyan Xiong
ACL 2025 (Findings)  ·  28 Jul 2025  ·  doi:10.18653/v1/2025.findings-acl.712
Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization
João Coelho, Bruno Martins, João Magalhães, Chenyan Xiong
Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval  ·  13 Jul 2025  ·  doi:10.1145/3726302.3730162
Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
arXiv  ·  30 May 2025  ·  doi:10.48550/arXiv.2506.00209
FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
Hao Kang, Zichun Yu, Chenyan Xiong
arXiv  ·  26 May 2025  ·  doi:10.48550/arXiv.2505.20225
On the Feasibility of In-Context Probing for Data Attribution
Cathy Jiao, Weizhen Gao, Aditi Raghunathan, Chenyan Xiong
Findings of the Association for Computational Linguistics (NAACL 2025)  ·  29 Apr 2025  ·  doi:10.18653/v1/2025.findings-naacl.286
Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
Liwen Sun, James Zhao, Megan Han, Chenyan Xiong
Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)  ·  29 Apr 2025  ·  doi:10.48550/arXiv.2407.15268
Group-Level Data Selection for Efficient Pretraining
Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
arXiv  ·  20 Feb 2025  ·  doi:10.48550/arXiv.2502.14709
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
Xiaochuan Li, Zichun Yu, Chenyan Xiong
ICLR 2025  ·  22 Jan 2025  ·  doi:10.48550/arXiv.2410.14208
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
arXiv  ·  15 Jan 2025  ·  doi:10.48550/arXiv.2507.09424
Fairshare Data Pricing via Data Valuation for Large Language Models
Luyang Zhang, Cathy Jiao, Beibei Li, Chenyan Xiong
arXiv  ·  01 Jan 2025  ·  doi:10.48550/arXiv.2502.00198

2024

Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
João Coelho, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)  ·  11 Aug 2024  ·  doi:10.18653/v1/2024.acl-short.35
ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
Liwen Sun, Abhineet Agarwal, Aaron Kornblith, Bin Yu, Chenyan Xiong
International Conference on Machine Learning (ICML)  ·  21 Jul 2024  ·  doi:10.48550/arXiv.2402.13448
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Zichun Yu, Spandan Das, Chenyan Xiong
Neural Information Processing Systems (NeurIPS)  ·  10 Jun 2024  ·  doi:10.48550/arXiv.2406.06046