A study led by Stanford University found that, on contract law reasoning tasks, law professors were more likely to choose AI-generated answers than answers written by their peers. The research team believes this shows that large language models, in certain specialized scenarios, are already approaching the standards commonly used to evaluate work in the legal field.
Nearly 3,000 blind test comparisons
The study invited 16 professors from 14 U.S. law schools, including Stanford, Yale, New York University, the University of Chicago, Georgetown University, UCLA, and the University of Virginia, to help write the questions. The 40 questions covered contract law principles, case law, hypotheticals, and policy discussions.
In 2,918 blind comparisons, reviewing professors were asked to choose which of two anonymous responses they would prefer to give to their students. Google's Gemini 2.5 Pro was preferred over the human response 75.92% of the time, while NotebookLM had a win rate of 74.75%.
Dominant in multiple question types
The study found that AI outperformed human answers across multiple question types, including recall-based questions on case law, legal provisions, and legal principles, as well as hypothetical analysis and policy discussions. Researchers also examined whether the professors' judgments were merely a matter of personal preference, and found that their choices were more consistent than chance would predict.
To rule out the possibility that the preference merely reflected a more formal writing style, the team further analyzed characteristics such as answer length, structure, depth of reasoning, legal basis, tone, clarity, and pedagogical support. The study concludes that these surface-level factors are not enough to fully explain the professors' preference for AI answers.
Fewer harmful content tags
The study also compared the proportion of answers flagged as harmful: 3.41% for Gemini, 3.64% for NotebookLM, and 12.06% for human answers. In a separate set of additional model comparisons, Anthropic's Claude Opus 4.7 ranked first, followed by OpenAI's ChatGPT 5.4.
However, the study also notes that the test did not measure whether the answers aligned with each professor's individual teaching preferences. AI answers may therefore be generally acceptable without precisely matching the teaching style of a particular instructor.
The legal industry is still weighing the pace of adoption
The research comes as courts, law firms, and law schools are still debating how AI should be integrated into legal workflows. Supporters argue that AI can make legal services more efficient and will become a fundamental tool for future legal professionals.
However, the legal industry remains wary of AI hallucinations. The report mentions that in April of this year, the law firm Sullivan & Cromwell admitted in a U.S. bankruptcy court that one of its filings contained AI-generated false quotations.