AI tools in education are increasingly being judged on the wrong clock.
A large panel study from central China following more than 26,000 students across 30 months suggests that AI assistance can do exactly what product teams hope for in the short run: cut homework time, raise assignment scores, and push adoption toward ubiquity. But the same data also points to a slower-moving downside. Closed-book exam performance fell, and the full gap on high-stakes entrance exams did not become visible until roughly two years after first AI use.
That timing matters as much as the direction of the effect. Homework is immediate, measurable, and easy to optimize. Learning loss, at least in this dataset, is not. If an education product is only instrumented to report faster completion and better assignment grades, it can look healthy for months while quietly degrading the skill set that matters most when students must perform without assistance.
What the study found
The study tracked students in grades 7 through 12 in a county with more than one million residents, combining monthly exams, homework scores, homework completion times, and high-stakes entrance exam results. Self-reported AI usage rose from near zero to about 80% over the study period, with the sharpest increase lining up with the releases of DeepSeek V2.5 in September 2024 and DeepSeek R1 in January 2025. The most-used tools were Doubao, DeepSeek, ChatGLM, Ernie Bot, and Qwen.
The short-term numbers are the part most likely to get showcased in a product deck. Homework completion time dropped from 64 minutes to 45 minutes, a reduction of roughly 30%. Homework scores rose by about 18%. Those are meaningful gains in operational efficiency, and for teachers and students under time pressure, they are not trivial.
But the same study reports that closed-book exam scores fell by 20% to 24%, depending on the analysis and exam type. More importantly, the measured effect on entrance exams lagged the start of AI use by around two years. In other words, the tools appear to deliver immediate labor savings, while the larger academic cost becomes detectable only once students who have come to rely on AI have to sit assessments without it.
That is not a final verdict on AI in education. It is a warning about evaluation horizons.
Why the lag changes the product problem
For developers of tutoring systems, homework copilots, and classroom AI workflows, the obvious temptation is to optimize for the metrics that move fastest: completion time, assignment accuracy, teacher satisfaction, and adoption. Those are all useful, but this study suggests they are incomplete and potentially misleading if treated as outcome proxies.
A product can improve task execution while weakening the underlying competence required for future independent performance. That gap is especially important in education, where the point of practice is not merely to finish work faster, but to build durable internal models: recall, transfer, procedural fluency, and resistance to prompt dependence.
For measurement purposes, the study's findings argue for separating at least three layers:
- Immediate productivity metrics: time saved, answer completion rate, homework scores.
- Intermediate mastery metrics: closed-book quizzes, delayed recall tests, transfer tasks, concept variation prompts.
- Long-horizon outcome metrics: standardized exams, entrance exams, and other no-assistance assessments.
If a system looks good on layer one but weak on layers two and three, the product may be accelerating output at the expense of learning. That is not an implementation detail; it is the central design problem.
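One way to make the three layers above concrete is a small scorecard that checks whether the fast and slow metrics agree. The sketch below is illustrative only: the `LayerChange` type, the `diagnose` function, the 2% noise band, and the example inputs are all assumptions made for this example, not anything taken from the study.

```python
from dataclasses import dataclass


@dataclass
class LayerChange:
    """Relative change versus a baseline cohort, e.g. 0.18 means +18%."""
    productivity: float   # layer 1: homework scores, time saved
    mastery: float        # layer 2: closed-book quizzes, delayed recall
    long_horizon: float   # layer 3: entrance or standardized exams


def diagnose(change: LayerChange, tolerance: float = 0.02) -> str:
    """Classify a cohort by whether fast and slow metrics agree.

    `tolerance` is a hypothetical noise band; pick it from your own data.
    """
    fast_gain = change.productivity > tolerance
    slow_loss = min(change.mastery, change.long_horizon) < -tolerance
    if fast_gain and slow_loss:
        return "output up, learning down"
    if fast_gain:
        return "output up, learning not yet shown to fall"
    if slow_loss:
        return "learning down without productivity gain"
    return "no clear change"


# Illustrative inputs only; they are not the study's per-layer results.
example = LayerChange(productivity=0.18, mastery=-0.20, long_horizon=-0.05)
print(diagnose(example))
```

The point of the sketch is the shape of the check, not the thresholds: a layer-one gain is reported only alongside whatever layers two and three currently show.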
What should change in tooling
The study’s most practical implication is that education AI should be instrumented less like a chat feature and more like a longitudinal intervention.
That means building measurement into the product itself, not outsourcing all judgment to grades produced months later. Teams rolling out AI tutors should consider:
- Delayed evaluation windows that compare cohorts over 12, 18, and 24+ months rather than only weekly engagement or semester grades.
- Closed-book checkpoints that are intentionally AI-free and use item formats designed to detect retention, not just assisted completion.
- Transfer tests that vary surface form while preserving underlying concepts, so a system cannot hide shallow learning behind repeatable homework patterns.
- Dependency signals such as overuse of solution generation, declining time on unaided practice, or widening gaps between assisted and unassisted performance.
- Prompting constraints that encourage explanation, retrieval, and error correction rather than direct answer delivery.
This is where model design and product design start to converge. A tutoring assistant that can solve a problem instantly is not necessarily a tutoring assistant that helps a student learn to solve the next problem alone. In practice, that may require different interaction policies, different default behaviors, and different success metrics.
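As a rough illustration of the dependency signal listed above, the gap between assisted and unassisted performance can be tracked per period and flagged when it widens. This is a minimal sketch under stated assumptions: the function names, the three-period window, the two-point threshold, and the sample scores are all hypothetical, and a real system would need noise handling and per-student baselines.

```python
from statistics import mean


def assistance_gap(assisted: list[float], unassisted: list[float]) -> list[float]:
    """Per-period gap between assisted and unassisted scores (same scale)."""
    if len(assisted) != len(unassisted):
        raise ValueError("score series must cover the same periods")
    return [a - u for a, u in zip(assisted, unassisted)]


def gap_is_widening(gaps: list[float], window: int = 3,
                    min_increase: float = 2.0) -> bool:
    """Compare the mean gap in the last `window` periods with the first `window`.

    Returns True when the recent mean exceeds the early mean by at least
    `min_increase` points. Both defaults are illustrative assumptions.
    """
    if len(gaps) < 2 * window:
        return False  # not enough history to compare two windows
    return mean(gaps[-window:]) - mean(gaps[:window]) >= min_increase


# Hypothetical monthly scores for one student (0-100 scale).
assisted = [80, 82, 84, 86, 88, 90]
unassisted = [78, 78, 77, 76, 75, 74]
gaps = assistance_gap(assisted, unassisted)
print(gaps, gap_is_widening(gaps))
```

A widening gap does not prove dependence on its own, but it is a cheap prompt to schedule an AI-free checkpoint earlier than planned.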
Guardrails for rollout
The clearest operational lesson is to resist full-scale, unmonitored deployment based on short-term gains alone.
A more defensible rollout strategy would look like this:
- Stagger adoption by cohort or classroom, so there is a baseline group for comparison.
- Pre-register the success metrics before launch, including unassisted assessment scores and delayed retention measures.
- Track cohorts for at least 24 months, because the study suggests the main high-stakes effect may not be detectable until roughly that point.
- Separate permitted from restricted use: allow AI for drafting, hints, and feedback, but keep practice for core competencies AI-free.
- Audit for substitution effects, where students use AI to finish work faster but spend less time encoding the material themselves.
This is especially important for buyers making procurement decisions in schools and districts. A tool that improves homework throughput may still be net harmful if it reduces independent performance on standardized or entrance exams. The business case has to include the downstream cost of weaker mastery, not just the immediate efficiency gain.
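A staggered rollout makes a simple comparison possible: the change in unassisted scores for the cohort that received the tool, minus the change for the cohort that has not yet received it. The sketch below uses a basic difference-in-differences calculation, which is a technique chosen here for illustration and not necessarily the study's method. The function name and all scores are hypothetical, and the estimate carries no standard errors or controls for cohort differences.

```python
from statistics import mean


def diff_in_diff(treated_before: list[float], treated_after: list[float],
                 control_before: list[float], control_after: list[float]) -> float:
    """Change in the treated cohort minus change in the control cohort.

    Inputs are unassisted (AI-free) assessment scores on a common scale.
    A negative result means the treated cohort lost ground relative to control.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical closed-book scores before launch and at a later checkpoint.
effect = diff_in_diff(
    treated_before=[70, 72, 68], treated_after=[64, 66, 62],
    control_before=[71, 69, 70], control_after=[70, 68, 69],
)
print(effect)
```

Repeating the same calculation at each pre-registered checkpoint (for example 12, 18, and 24 months) shows whether the estimate is stable or still moving when a procurement decision is due.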
What teams should watch next
For product leaders, the next phase is not about asking whether AI belongs in education at all. It is about building a measurement framework that can distinguish help from dependency.
That means reporting success in two registers at once: what the tool does for homework today, and what it does to unaided performance over time. If those curves diverge, the product is creating a hidden debt that will surface later in the lifecycle, likely after adoption has already become entrenched.
For educators and policymakers, the practical response is to require longer evaluation periods and broader outcome definitions before endorsing AI at scale. For researchers, the open question is not whether AI raises or lowers scores in the abstract, but which interaction patterns preserve learning while still capturing the productivity benefits students clearly want.
The study does not settle the education question. It does something more useful for anyone building or buying these systems: it shows why a fast win can become a slow liability if the measurement window ends before the damage becomes visible.



