Wikibench Allows Wikipedians To Improve AI Evaluation Datasets

Marylee Williams
Thursday, May 16, 2024

Wikibench allows Wikipedia communities to collaboratively curate AI evaluation datasets that represent their communities' norms and values while using discussion to navigate disagreements and labeling ambiguities.

Every two seconds, someone makes an edit to a Wikipedia page. And while many of those edits help curate and develop the internet's encyclopedia, some of them aren't made in good faith.

To monitor recent changes, accepting the beneficial and reverting the malicious, Wikipedia patrollers have turned to artificial intelligence to predict whether an edit is vandalism or was made in bad faith. While such tools can help the volunteers who edit and monitor the pages (known as Wikipedians) quickly identify and address issues, Wikipedia communities have no method for evaluating how well these tools fit their specific needs. The tools also don't account for the content-moderation norms and values that vary across Wikipedia's language editions.

Carnegie Mellon University School of Computer Science researchers have developed a solution to this problem in the form of Wikibench, which allows Wikipedia communities to collaboratively curate AI evaluation datasets that represent their communities' norms and values while using discussion to navigate disagreements and labeling ambiguities. The research team from the Human-Computer Interaction Institute included Ph.D. student Tzu-Sheng Kuo; Assistant Professor Ken Holstein; Haiyi Zhu, the Daniel P. Siewiorek Associate Professor; Assistant Professor Sherry Tongshuang Wu; and Meng-Hsin Wu, a Master of Human-Computer Interaction alumnus. It also included Aaron Halfaker, a researcher at Microsoft; Jiwoo Kim of Columbia University; and Zirui Cheng of Tsinghua University.

Researchers chose Wikipedia because there was a demonstrated need among patrollers to both evaluate and train the AI systems deployed in their individual communities.

"Wikibench demonstrates that it's possible to have an approach that allows the people impacted by an AI system to be the people who evaluate the system," Kuo said.

The name Wikibench comes from the concept of benchmarking, the systematic comparison of an AI model's performance against a ground-truth standard. Unlike typical AI benchmarking, Wikibench acknowledges that what counts as good performance on socially constructed tasks like content moderation can be highly community-specific.
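To make the benchmarking idea concrete, here is a minimal Python sketch that scores one model's predictions against two sets of reference labels. It is not Wikibench's code: the edit IDs, the label strings and the two communities are hypothetical, and plain accuracy stands in for whatever metric a community might actually choose.

```python
def accuracy(predictions, reference_labels):
    """Fraction of shared edits where the model's label matches the reference label."""
    shared = predictions.keys() & reference_labels.keys()
    if not shared:
        raise ValueError("no overlapping edits to score")
    correct = sum(predictions[edit] == reference_labels[edit] for edit in shared)
    return correct / len(shared)


# Hypothetical model output for four edits.
model_predictions = {
    "edit-1": "damaging",
    "edit-2": "not damaging",
    "edit-3": "damaging",
    "edit-4": "not damaging",
}

# Hypothetical reference labels curated by two different communities.
# They differ only on edit-3, which community B treats as damaging.
community_a_labels = {
    "edit-1": "damaging",
    "edit-2": "not damaging",
    "edit-3": "not damaging",
    "edit-4": "not damaging",
}
community_b_labels = dict(community_a_labels, **{"edit-3": "damaging"})

score_a = accuracy(model_predictions, community_a_labels)
score_b = accuracy(model_predictions, community_b_labels)
print(f"community A: {score_a:.2f}, community B: {score_b:.2f}")
```

In this toy setup the same model scores differently depending on whose labels serve as the reference, which is the sense in which good performance can be community-specific.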

"The goal is to create space for community members to find consensus, where possible, but also to identify areas where different community members may have genuine disagreements," said Holstein, whose Co-{Augmentation, Learning and AI} Lab (CoALA) studies how AI systems are designed and used by workers.

Researchers deployed Wikibench on English Wikipedia. To lower the barrier to participation and seamlessly integrate it into existing workflows, the team developed a simple plug-in that allowed Wikipedians to add or discuss labels while patrolling edits. Consider, for example, an AI model that flags an edit as vandalism. While a patroller using Wikibench might label this edit as damaging because it doesn't have a source, they may also believe the edit was done in good faith because the person wasn't trying to vandalize the page. Wikibench can also be used to curate other types of datasets that involve different labels.

Beyond collecting labels, Wikibench creates a campaign page that allows the public to see the dataset and build consensus through discussions about disagreements. This space helps participants clarify ambiguities in labeling, potentially leading to changes in their labels. If they don't reach a resolution, Wikibench preserves the information about their disagreement.
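One way to picture labels that keep disagreement visible is the Python sketch below. It is an illustration under stated assumptions rather than Wikibench's actual data model: the `Label` class, the `summarize` function, the patroller names and the use of a simple majority vote are all hypothetical. Each label records the two judgments described above, whether an edit is damaging and whether it was made in good faith.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """One patroller's judgment of one edit (hypothetical structure)."""
    patroller: str
    damaging: bool
    good_faith: bool


def _majority(votes):
    """Return the most common value, or None when there are no votes or a tie."""
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def summarize(labels):
    """Aggregate labels for one edit while keeping the full vote breakdown."""
    damage_votes = Counter(label.damaging for label in labels)
    intent_votes = Counter(label.good_faith for label in labels)
    return {
        "damaging": _majority(damage_votes),
        "good_faith": _majority(intent_votes),
        # The raw counts are kept so that disagreement is not erased.
        "damaging_votes": dict(damage_votes),
        "good_faith_votes": dict(intent_votes),
        "disputed": len(damage_votes) > 1 or len(intent_votes) > 1,
    }


# Three hypothetical patrollers agree the edit is damaging but split on intent.
labels = [
    Label("PatrollerA", damaging=True, good_faith=True),
    Label("PatrollerB", damaging=True, good_faith=True),
    Label("PatrollerC", damaging=True, good_faith=False),
]
summary = summarize(labels)
print(summary)
```

Keeping the vote counts next to the majority label means an unresolved split stays in the dataset instead of being flattened into a single answer.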

"Disagreements are OK, and some of the disagreements represent genuine differences in perspective," Holstein said. "But in other cases, when community members have opportunities to discuss seeming disagreements, they realize that they don't actually disagree."

Wikibench is currently on hold while the Wikimedia Foundation transitions to a new machine learning system, but the work was presented at this month's Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI 2024) in Honolulu. Learn more about the project and its future on the Wikibench research page.

For More Information

Aaron Aupperlee | 412-268-9068