Why LLMs are plateauing – and what that means for software security

By John Smith, CTO EMEA at Veracode.

Thursday, 5th February 2026

There’s no doubt the AI-generated code landscape has evolved at an unprecedented rate over the last year. The rise of vibe coding, where developers use large language models (LLMs) to generate functional code, has fundamentally changed how software is built.

As AI becomes embedded into everything from apps to full-scale company operations, it's clear significant effort has gone into training LLMs for correctness, with newer and larger models becoming increasingly effective at generating code with the expected functionality. Less attention, however, has been paid to whether the produced code is secure. The result? Mountains of production code that works in practice but quietly embeds and spreads significant security vulnerabilities.

At the same time, LLMs are enabling attackers to identify and exploit these flaws faster than ever. With defence capabilities lagging behind, the gap between attackers and defenders is widening at a critical time, just as enterprises are increasingly reliant on AI.

Security is flatlining across most AI models

Despite recent rapid surface-level progress in AI models and their ability to generate functional code, there is growing evidence that security has failed to keep pace. Recent research shows that most generative AI (GenAI) tools, including popular models such as Anthropic’s Claude, Google’s Gemini and xAI’s Grok, are producing glaring security flaws. Across all models, languages, CWEs (Common Weakness Enumerations) and tasks, only around 55% of generation tasks produce secure code, meaning LLMs are introducing a detectable OWASP Top 10 vulnerability nearly half the time.

Surprisingly, this heightened vulnerability risk was consistent across the different model types, with no significant difference between the smaller and larger models. Whilst the ability to generate syntactically correct code has improved dramatically, security remains stubbornly stagnant. Simply scaling models or updating training data is insufficient to meaningfully improve security outcomes.

The notable exception is OpenAI’s reasoning GPT-5 models, which take extra steps to think through problems before producing code. These models achieved substantially higher security pass rates of 70% and above, compared to 50-60% for previous generations. In contrast, GPT-5-chat, a non-reasoning variant, lagged at 52%, suggesting that reasoning alignment, not model scale, drives these gains. It’s possible OpenAI’s tuning examples include high-quality secure code or explicitly teach models to reason about security trade-offs, which would explain the higher pass rate.

Language-specific trends have also emerged. Many of the AI models perform much worse on Java code generation tasks than on any other language, with security pass rates of less than 30%, while Python, C# and JavaScript generally fall between 38% and 45%. At the same time, newer models, especially reasoning-tuned ones, are performing better at generating secure C# and Java code, likely reflecting AI labs’ focus on major enterprise languages.

Why is LLM security stagnating? 

The root of the problem lies in the nature of the training data, made up of public code samples scraped from the internet. As a result, the data contains both secure and insecure examples, including deliberately vulnerable projects like WebGoat – an insecure Java application used for security training. The models treat all these examples as legitimate ways to satisfy a coding request, learning patterns that don’t reliably distinguish safe from unsafe implementations.

Because most LLMs train on this same publicly available data, they produce security flaws in similar patterns. And as that data remains largely unchanged over time, increasingly supplemented with synthetic and AI-generated code, security performance has remained stagnant across generations of models.

This also helps explain why Java is particularly problematic. Java has a long history as a server-side implementation language and predates widespread recognition of vulnerabilities like SQL injection. Its training data therefore likely contains many more security vulnerabilities than that of newer languages like C# or Python, leading models to perform significantly worse on Java-specific tasks.

The security blind spot in vibe coding

These findings raise huge concerns for AI-assisted development and the growing popularity of vibe coding. While these practices accelerate productivity, developers rarely specify security constraints when prompting LLMs, even though doing so would dramatically improve the security of the generated code.

For example, a developer might prompt a model to generate a database query without specifying whether it should be constructed using a prepared statement (safe) or string concatenation (unsafe). This effectively leaves the decision to the LLM which, as the findings show, chooses incorrectly nearly half the time. Alarmingly, this issue shows little sign of improving.
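To make the difference concrete, below is a minimal Java (JDBC) sketch of the two patterns; the table, column and method names are illustrative rather than drawn from the research above.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UserLookup {

    // Unsafe: the pattern a model often produces when the prompt says nothing
    // about security. Concatenating untrusted input into the SQL string
    // allows SQL injection (e.g. email = "' OR '1'='1").
    static ResultSet findUserUnsafe(Connection conn, String email) throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(
                "SELECT id, name FROM users WHERE email = '" + email + "'");
    }

    // Safe: a prepared statement keeps the query structure separate from the
    // data, so the driver treats the input purely as a value.
    static ResultSet findUserSafe(Connection conn, String email) throws SQLException {
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, name FROM users WHERE email = ?");
        stmt.setString(1, email);
        return stmt.executeQuery();
    }
}

A prompt that never mentions parameterised queries gives the model no reason to prefer the second form over the first, since both satisfy the functional requirement equally well.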

And the risks are already surfacing in practice. A recent incident with an AI coding tool on the Replit platform caused the deletion of an entire live production database during a code freeze – a clear warning of what can go wrong when AI-generated code is trusted without sufficient guardrails. 

The implications for developers and organisations

Given these persistent shortfalls, relying on model improvements alone is not a viable security strategy. While newer reasoning models offer a clear advantage, security performance remains highly variable and even the best-performing models introduce vulnerabilities in nearly a third of cases. 

AI coding assistants are powerful tools, but they cannot replace skilled developers or comprehensive security programmes. A layered approach to risk management is essential: continuous scanning and validation with static application security testing (SAST) and software composition analysis (SCA), regardless of code origin, combined with proactive blocking of malicious dependencies, is crucial to preventing vulnerabilities from reaching production pipelines.

AI-powered remediation tools can also assist developers by providing real-time guidance and automated fixes, yet responsibility for secure implementation ultimately remains with humans.

The hidden cost of AI-generated code

AI coding assistants and agentic workflows represent the future of software development and will continue to evolve at a rapid pace. But while LLMs have become adept at generating functionally correct code, they continue to produce security vulnerabilities at a troublingly high rate – an issue that won’t be easy to fix.

The challenge for every organisation is ensuring security evolves alongside these new capabilities. Addressing this requires security-specific training, reasoning alignment, and a recognition that security cannot be an afterthought if we want to prevent the accumulation of masses of security debt. 

Until AI labs prioritise security in training and alignment processes, developers and security teams must treat AI-generated code as an inherently untrusted input – a principle that must be considered in day-to-day vibe coding.
