Universities Are Racing to Catch AI Cheaters. Their Own Researchers Say the Tools Do Not Work.
When Princeton University faculty voted earlier this month to require proctors for in-person exams starting July 1, ending a 133-year tradition of unsupervised testing under the school's honor code, they cited a crisis they could not actually measure. The crisis was AI cheating. The instrument they reached for to address it was designed for a threat nobody can count.
Cornell and Berkeley researchers published the largest survey of undergraduate AI use to date in Science last week. The study analyzed survey responses from more than 95,000 students at 20 public research universities in the United States, collected between March and August 2024 — the first full academic year after ChatGPT's November 2022 launch. The numbers dominated headlines everywhere: 37 percent used GenAI monthly or more; 62 percent of computer science students did, compared to 24 percent in the arts. Nine percent of students who used AI had submitted AI-generated work they knew might not be allowed. Among daily users, that figure rose to 26 percent.
Every outlet treated those percentages as settled facts. They are estimates, built on an indirect questioning technique the paper itself describes in terms that received almost no coverage.
To estimate cheating rates, the researchers used list randomization. Students were randomly assigned to one of two groups. One group reported how many of three non-sensitive statements about AI use applied to them. The other received the same three statements plus a fourth: the sensitive one about submitting AI-generated work as their own. No respondent indicated which statements applied; each reported only a count. The researchers then inferred the population-level cheating rate from the difference between the two groups' average counts.
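The logic of that inference can be illustrated with a short simulation. This is a minimal sketch, not the study's code: the function names, the sample size, and the probabilities for the three non-sensitive statements are assumptions made up for illustration, and the 9 percent rate is plugged in only to mirror the reported figure. It uses the textbook difference-in-means estimator for list experiments, which may differ from the paper's exact specification.

```python
import random
from statistics import mean, variance


def list_experiment_estimate(control_counts, treatment_counts):
    """Difference in mean counts between the two groups.

    Returns the estimated share of the population for whom the
    sensitive statement is true, plus a standard error.
    """
    diff = mean(treatment_counts) - mean(control_counts)
    se = (variance(treatment_counts) / len(treatment_counts)
          + variance(control_counts) / len(control_counts)) ** 0.5
    return diff, se


def simulate(n=20000, true_rate=0.09, seed=0):
    """Simulate respondents who report only a count, never which items apply.

    The baseline probabilities (0.5, 0.3, 0.7) are hypothetical.
    """
    rng = random.Random(seed)
    control, treatment = [], []
    for _ in range(n):
        # How many of the three non-sensitive statements apply.
        baseline = sum(rng.random() < p for p in (0.5, 0.3, 0.7))
        # Whether the sensitive statement applies to this respondent.
        sensitive = rng.random() < true_rate
        if rng.random() < 0.5:
            control.append(baseline)  # three-item list
        else:
            treatment.append(baseline + int(sensitive))  # four-item list
    return control, treatment


control, treatment = simulate()
estimate, se = list_experiment_estimate(control, treatment)
print(f"estimated rate: {estimate:.3f} (standard error {se:.3f})")
```

The function returns one number for the whole group. Nothing in either list of counts reveals which respondents the sensitive statement applies to.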
This is not a minor technical distinction. List randomization produces a population estimate. It cannot identify which individual student cheated, cannot be matched to any specific assignment, and cannot tell a dean or a professor that a particular piece of work was AI-generated. It indicates roughly what share of the group did it, under survey conditions that may not reflect how students behave when stakes are real and visible. The paper's authors note the 9 percent estimate is likely conservative because some students may not recognize their GenAI use as a violation. The estimate is also two years old. GenAI tools have grown significantly more capable since March 2024.
The same paper contains a passage about detection that received even less attention. Text-based detection methods are imperfect and likely to miss GenAI use when AI-assisted text is substantially edited, the authors write, "underscoring detection as an evolving cat-and-mouse problem rather than a settled technical solution." The researchers who produced the survey numbers that universities are citing as justification for policy changes are also saying, in the same paper, that the tools being deployed to catch what they measured do not reliably work.
Princeton's response illustrates the dynamic. The faculty abolished a 133-year tradition of unsupervised exams to require proctors. Columbia computer science professors are redesigning assessments to emphasize oral defense of work rather than the work itself. Across higher education, institutions are responding to a quantified problem with instruments the quantifiers say are unreliable.
The survey also contains findings that complicate the cheating frame. Cheating estimates varied sharply by discipline: economics showed 17 percent, journalism 16 percent, biology among the lowest at 5 percent. Non-STEM fields showed higher estimated cheating rates than STEM despite lower overall AI usage in those fields. Meanwhile, adoption gaps tracked demographic lines: 33 percent of female students used GenAI regularly versus 45 percent of male students; 29 percent of underrepresented minorities versus 39 percent of white and Asian students. The researchers note these disparities may widen as GenAI tools become more specialized and costly.
Igor Chirikov, the Berkeley co-author, told UC Berkeley News that the findings point to a need for discipline-specific assessment reform rather than blanket bans or universal detection. His co-author Rene Kizilcec put it more bluntly in the Cornell release: "Assessment reform is necessary and urgent." Both statements are correct. What most coverage missed is that the proposed reform responds to a measurement that cannot identify the individuals it describes, and to a threat that, by the researchers' own assessment, detection tools cannot reliably catch.
Chirikov himself noted that the survey already feels like it belongs to a past life. The policy wave it triggered is just beginning.