GPTZero and Turnitin False Positives on Non-Native Writers

May 21, 2026  ·  7 min read

If English is not your first language and an AI detector flagged your paper, the research is on your side. Multiple peer-reviewed studies have found that detectors like GPTZero and Turnitin misclassify human-written text from non-native English speakers at rates that would be unacceptable in any other evidentiary context.

The research landscape on detector bias

The most widely cited study on this issue comes from a Stanford team led by Weixin Liang, published in the journal Patterns (Cell Press) in 2023. Liang and colleagues tested seven commercial GPT detectors against TOEFL essays written by non-native English speakers and against essays written by U.S. eighth-graders. The detectors performed reasonably well on the native-speaker essays. On the non-native essays, they failed badly.

Averaged across the seven detectors, more than half of the TOEFL essays (about 61 percent) were misclassified as AI-generated. Roughly 98 percent of the TOEFL essays were flagged as AI-written by at least one detector, and all seven tools unanimously misclassified about 20 percent of them (18 of 91). The same detectors classified the U.S. eighth-grader essays as human-written with high accuracy.

What the Liang study actually measured

The methodology matters because it explains the size of the effect. The researchers used a publicly available corpus of 91 TOEFL essays from a Chinese educational forum, all written by non-native English speakers, and 88 essays written by U.S. eighth-grade students. They ran both sets through seven detectors: GPTZero, Originality.AI, Quil.org, Sapling, Crossplag, ZeroGPT, and OpenAI's own classifier (since withdrawn).

The TOEFL essays were entirely human-written. The detectors were not tested against ambiguous or edited text. They were tested against text with a known, verified human origin. The misclassifications were not edge cases. They were the dominant outcome on essays from non-native writers.

Note
The Liang study is widely referenced in policy discussions but is sometimes summarized inaccurately. The 61 percent figure often cited online is the average false-positive rate across the seven detectors on the TOEFL essays. It is not the share of essays flagged by every tool, and it is not the rate for any single detector. The headline finding is broader: detector performance collapses on non-native English writing across multiple commercial tools.
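The different kinds of figure a study like this reports are easy to conflate, so a small sketch may help. The verdict table below is made up for illustration (four essays, three detectors) and is not data from the study; the function names are hypothetical. Every essay is assumed to be human-written, so each flag counts as a false positive.

```python
# Hypothetical verdicts: one row per essay, one column per detector.
# True means "flagged as AI". All essays are human-written,
# so every True is a false positive.
verdicts = [
    [True,  True,  True],
    [True,  False, False],
    [False, False, False],
    [True,  True,  False],
]

def average_false_positive_rate(verdicts):
    """Mean of the per-detector false-positive rates."""
    n_essays = len(verdicts)
    n_detectors = len(verdicts[0])
    per_detector = [
        sum(row[d] for row in verdicts) / n_essays
        for d in range(n_detectors)
    ]
    return sum(per_detector) / n_detectors

def share_flagged_by_any(verdicts):
    """Share of essays flagged by at least one detector."""
    return sum(any(row) for row in verdicts) / len(verdicts)

def share_flagged_by_all(verdicts):
    """Share of essays flagged unanimously by every detector."""
    return sum(all(row) for row in verdicts) / len(verdicts)

print(average_false_positive_rate(verdicts))  # 0.5
print(share_flagged_by_any(verdicts))         # 0.75
print(share_flagged_by_all(verdicts))         # 0.25
```

The same set of verdicts yields three different numbers, which is why it matters to say which one you are quoting when you cite the study.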

Why detectors penalize non-native writing

The Liang team identified the mechanism. AI detectors rely heavily on perplexity, a measure of how statistically predictable a text is to a language model. Lower perplexity means the model finds the word choices more expected, and detectors interpret that as a signal of machine generation.
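To make the idea concrete, here is a minimal sketch of perplexity under a toy word-frequency model. It is an illustration only: commercial detectors use large neural language models, not this unigram model, and the function name, reference text, and smoothing choice are all assumptions made for the example.

```python
import math
from collections import Counter

def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model
    fit on `reference`. Lower values mean more predictable wording."""
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    total = len(ref_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one word")

    log_prob = 0.0
    for tok in tokens:
        # Counter returns 0 for words never seen in the reference.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))

reference = "the cat sat on the mat and the dog sat on the rug"
common = "the cat sat on the mat"
unusual = "a feline reclined atop an ottoman"

print(unigram_perplexity(common, reference) < unigram_perplexity(unusual, reference))  # True
```

The sentence built from familiar words scores lower than the one built from rare words, even though a person wrote both. That is the signal a perplexity-based detector reads as machine-like.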

Non-native English writers, particularly at the TOEFL level, tend to:

  • Use a smaller, more common vocabulary, which produces lower perplexity
  • Avoid idioms, slang, and culturally specific phrasing that increases unpredictability
  • Favor simpler, more uniform sentence structures, which reduces burstiness
  • Stick closely to standard grammatical patterns rather than stylistic variation

These are the same features that careful AI-generated prose tends to exhibit. A perplexity-based detector cannot distinguish a non-native writer being cautious from a language model producing default output. The researchers demonstrated this directly: when they used ChatGPT to rewrite the TOEFL essays with word choices closer to a native speaker's, the detectors flagged far fewer of them as AI, and the average false-positive rate fell from about 61 percent to about 12 percent.
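Burstiness has no single standard definition, and detector vendors do not publish their exact formulas. One common proxy is how much sentence length varies across a text. The sketch below uses that proxy (standard deviation of sentence length divided by the mean); the function names and sample texts are hypothetical and chosen only to illustrate the idea.

```python
import re
import statistics

def sentence_lengths(text):
    """Word count of each sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length (stdev / mean).
    This is an assumed proxy, not any detector's actual formula."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # variation is undefined for fewer than two sentences
    return statistics.pstdev(lengths) / statistics.mean(lengths)

uniform = "I like my school. I study every day. My teacher is kind."
varied = ("Stop. When the rain finally eased, we walked the long way "
          "home past the shuttered bakery. Nobody spoke.")

print(burstiness(uniform))                        # 0.0
print(burstiness(varied) > burstiness(uniform))   # True
```

The first sample is plain, correct, and entirely human, yet it scores zero on this measure because every sentence has the same length. A cautious writer working in a second language can produce exactly that profile.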

What this means for GPTZero and Turnitin cases

GPTZero was one of the seven tools tested directly by Liang and colleagues. Turnitin was not included in that study, and its AI detection model is proprietary, so the Liang results cannot be assumed to transfer to it one for one. It is still a statistical classifier of the same general kind, and Turnitin has publicly acknowledged limitations in its detector and recommends that scores not be used as the sole basis for an academic integrity finding.

The 2023 paper by Weber-Wulff et al. in the International Journal for Educational Integrity tested fourteen detection tools and reached a similar conclusion: none performed reliably enough across varied conditions to be considered evidentiary. The non-native speaker problem is one specific failure mode within a broader reliability problem.

Using this research in your defense

If you are a non-native English speaker and a detector score is the basis of an accusation against you, the Liang study is directly relevant. Your written response should:

  1. Identify yourself as a non-native English speaker and state your first language and approximate years of formal English instruction
  2. Cite the Liang et al. 2023 study in Patterns by author, year, and journal
  3. Quote the specific finding that all seven tested detectors unanimously misclassified about 20 percent of the TOEFL essays (18 of 91) as AI-generated
  4. Reference Turnitin's own published guidance that scores should not be the sole basis for a finding
  5. Provide drafts, notes, or writing samples that demonstrate your authentic writing voice and process

Our deeper look at Turnitin and non-native writers covers how to frame this in a hearing, and the procedural rights FAQ explains what evidence you are entitled to request from your institution.

Tip
If your institution uses Turnitin or GPTZero, ask in writing what threshold score it considers significant, whether that threshold accounts for non-native English speakers, and whether any human reviewer assessed the flagged passages before the accusation was filed.

If you have already had a hearing and the outcome went against you, the appeal package is built around the evidentiary grounds these cases tend to turn on. If you are drafting a response, NotBot generates a personalized defense package that cites the Liang study, addresses the specific detector that flagged you, and incorporates your own writing process and language background, ready in about a minute.

