Research
2025
Craw4LLM: Efficient Web Crawling for LLM Pretraining
ACL 2025 (Findings)
·
28 Jul 2025
·
doi:10.18653/v1/2025.findings-acl.712
Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization
Association for Computing Machinery, Inc.
·
14 Jul 2025
·
doi:10.1145/3726302.3730162
Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization
Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval
·
13 Jul 2025
·
doi:10.1145/3726302.3730162
Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
arXiv
·
30 May 2025
·
doi:10.48550/arXiv.2506.00209
FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
arXiv
·
26 May 2025
·
doi:10.48550/arXiv.2505.20225
On the Feasibility of In-Context Probing for Data Attribution
Findings of the Association for Computational Linguistics: NAACL 2025
·
29 Apr 2025
·
doi:10.18653/v1/2025.findings-naacl.286
Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)
·
29 Apr 2025
·
doi:10.48550/arXiv.2407.15268
Group-Level Data Selection for Efficient Pretraining
arXiv
·
20 Feb 2025
·
doi:10.48550/arXiv.2502.14709
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning
ICLR 2025
·
22 Jan 2025
·
doi:10.48550/arXiv.2410.14208
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
arXiv
·
15 Jan 2025
·
doi:10.48550/arXiv.2507.09424
Fairshare Data Pricing via Data Valuation for Large Language Models
arXiv
·
01 Jan 2025
·
doi:10.48550/arXiv.2502.00198
2024
Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
Association for Computational Linguistics
·
11 Aug 2024
·
doi:10.18653/v1/2024.acl-short.35
ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
International Conference on Machine Learning (ICML)
·
21 Jul 2024
·
doi:10.48550/arXiv.2402.13448
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Neural Information Processing Systems (NeurIPS)
·
10 Jun 2024
·
doi:10.48550/arXiv.2406.06046
Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
·
01 Jan 2024
·
doi:10.18653/v1/2024.acl-short.35