Grounding LLMs with Precise Code Intelligence

Large language models are increasingly embedded in software development workflows, from code completion to automated refactoring to security review. The central risk is well understood but poorly mitigated: these models hallucinate. They generate plausible-looking code that references functions that do not exist, struct fields that were renamed three versions ago, or API contracts that were never honoured by the actual implementation. In a security context, the consequences of hallucinated code passing review are severe. A fabricated authentication check that appears syntactically correct can sail through both automated CI and distracted human review, creating vulnerabilities that are invisible precisely because they look right.

Our approach to auditing LLM-assisted codebases centres on a principle we have validated across dozens of engagements: preventing the generation of incorrect information is categorically more effective than detecting it after the fact. The mechanism that makes this possible is compiler-accurate code intelligence—specifically, semantic code graphs produced by protocols such as SCIP (the SCIP Code Intelligence Protocol, a recursive acronym). Unlike the approximate token-level understanding that language models derive from training corpora, SCIP indexes produce exact, compiler-verified maps of every symbol definition, reference, type relationship, and implementation boundary in a codebase. When this graph is provided as context to an LLM, the model's output is constrained to symbols and relationships that actually exist in the code under analysis.
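As a minimal sketch of what "structured context" can mean in practice, the snippet below models index-derived symbols and renders them into a prompt preamble. The `Symbol` fields mirror, loosely, the kind of information a semantic index records; the field names, the qualified-identifier notation, and `promptContext` are illustrative assumptions, not the actual SCIP schema.

```go
package main

import "fmt"

// Symbol is an illustrative, simplified stand-in for one entry of a
// SCIP-derived symbol table (not the real SCIP wire format).
type Symbol struct {
	Qualified string // fully qualified identifier, e.g. "example.com/app/user.User#Role"
	Kind      string // "field", "method", "type", ...
	Type      string // compiler-verified type of the symbol
}

// promptContext renders the symbols as a compact block to prepend to a
// code-generation prompt, constraining the model to names that exist.
func promptContext(syms []Symbol) string {
	out := "Symbols present in the target codebase (use only these):\n"
	for _, s := range syms {
		out += fmt.Sprintf("- %s (%s) : %s\n", s.Qualified, s.Kind, s.Type)
	}
	return out
}

func main() {
	fmt.Print(promptContext([]Symbol{
		{"example.com/app/user.User#Role", "field", "user.Role"},
	}))
}
```

The design point is that the context is structured and exhaustive for the relevant scope, rather than whatever raw file contents happen to fit in the window.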

Three Failure Modes We Observe Repeatedly

In our security assessments of AI-assisted development environments, three failure patterns dominate. The first is struct field hallucination: a model asked to generate code that accesses a data structure invents field names that are plausible given the naming conventions of the codebase but do not exist. In one engagement, a model generated an access control check against a user.IsAdmin field on a struct that contained user.Role with an enumerated type. The hallucinated field never compiled (it appeared only inside a comment-like suggestion), but had it been accepted, the check would always have evaluated to the zero value. With a SCIP-derived field list provided as context, the model correctly referenced user.Role == RoleAdmin.
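A hypothetical reconstruction of this pattern, with the type and field names assumed from the engagement described above:

```go
package main

import "fmt"

// Role is the enumerated type the real struct carried.
type Role int

const (
	RoleUser Role = iota
	RoleAdmin
)

// User has a Role field; the model hallucinated a non-existent IsAdmin bool.
type User struct {
	Name string
	Role Role
}

// isAdmin is the access-control check the model should have generated.
// `return u.IsAdmin` would not compile; with the SCIP field list in
// context, the model emitted the comparison against the real field:
func isAdmin(u User) bool {
	return u.Role == RoleAdmin
}

func main() {
	fmt.Println(isAdmin(User{Name: "a", Role: RoleAdmin})) // true
	fmt.Println(isAdmin(User{Name: "b", Role: RoleUser}))  // false
}
```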

The second pattern is confused sibling types. In large codebases with multiple packages that define similar types—a Request type in an HTTP handler package alongside a Request type in an RPC package, for example—models frequently cross-wire fields and methods between the two. This is particularly dangerous in security-sensitive code paths where the HTTP request type carries sanitised input while the RPC type carries raw internal data. The semantic code graph disambiguates these completely, because it carries fully qualified type identifiers rather than bare names.
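To make the sibling-type hazard concrete, the sketch below flattens the two same-named types into one file with distinct names; in the real codebase they would live in separate packages and both be called Request, distinguished only by their fully qualified symbols in the index. All names here are illustrative.

```go
package main

import "fmt"

// HTTPRequest stands in for httpapi.Request: carries sanitised input.
// A SCIP index records it under a fully qualified symbol, so it can
// never be conflated with the RPC type that shares its bare name.
type HTTPRequest struct {
	SanitisedBody string
}

// RPCRequest stands in for rpcapi.Request: carries raw internal data.
type RPCRequest struct {
	RawPayload []byte
}

// handle accepts only the sanitised HTTP type; passing an RPCRequest is a
// compile error, so cross-wired fields cannot slip through.
func handle(r HTTPRequest) string {
	return r.SanitisedBody
}

func main() {
	fmt.Println(handle(HTTPRequest{SanitisedBody: "ok"}))
}
```

The qualified identifier in the semantic graph plays the same role the type system plays here: it makes the two Request types unconfusable.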

The third and most subtle pattern is the missed hardcoded return. We have observed models generate what appears to be a complete implementation of a function, including error handling and edge cases, while failing to notice that the actual function in the codebase contains a hardcoded early return—a debugging artifact or a feature flag that was never removed. The generated code looks like a correct implementation, which causes reviewers to approve it without checking the existing function body. Compiler-accurate code intelligence surfaces the actual implementation, including control flow, which forces the model to acknowledge and work with what the code actually does rather than what it ought to do.
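A hypothetical reconstruction of the hardcoded-return pattern: the function body contains a leftover debug short-circuit, so a model summarising the intended behaviour, rather than reading the actual body, would describe verification logic that never runs. The function name, flag, and verification scheme are invented for illustration.

```go
package main

import "fmt"

// debugBypass is the feature flag that was never removed.
var debugBypass = true

// verifySignature looks like a complete implementation, but while the
// flag is set, the hardcoded early return makes every signature "verify".
func verifySignature(payload, sig []byte) bool {
	if debugBypass {
		return true // debugging artifact: unconditionally accept
	}
	// The real check below is unreachable while debugBypass is true.
	return len(sig) > 0 && string(sig) == fmt.Sprintf("sig(%x)", payload)
}

func main() {
	fmt.Println(verifySignature([]byte("m"), nil)) // true, despite a nil signature
}
```

An index that surfaces the actual function body, control flow included, forces both the model and the reviewer to confront the early return.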

Architectural Recommendation

For organisations adopting LLM-assisted development at scale, we recommend a layered approach. First, generate and maintain SCIP indexes as part of the CI pipeline—they are a build artifact, not an afterthought. Second, any LLM integration that generates or modifies code should receive the relevant portion of the semantic graph as structured context, not just raw file contents. Third, establish automated verification that LLM-generated code references only symbols present in the current index. This last step functions as a lightweight static analysis gate that catches hallucinations before they reach human review. The cost of maintaining precise code intelligence is modest compared to the cost of a vulnerability introduced by a model that was confidently wrong.
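The third step above can be sketched with Go's standard parser: extract the selector names (x.Field) from a generated snippet and flag any that are absent from the symbol set extracted from the index. This is a deliberately naive approximation, matching bare names without type resolution; a production gate would resolve each selector against the fully qualified symbols in the SCIP index.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// checkSymbols parses a generated Go snippet and returns the selector
// names that do not appear in the known-symbol set. Name-only matching
// is a simplification; real gating would use qualified symbols.
func checkSymbols(src string, known map[string]bool) []string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "gen.go", src, 0)
	if err != nil {
		return []string{"parse error: " + err.Error()}
	}
	var unknown []string
	ast.Inspect(f, func(n ast.Node) bool {
		if sel, ok := n.(*ast.SelectorExpr); ok && !known[sel.Sel.Name] {
			unknown = append(unknown, sel.Sel.Name)
		}
		return true
	})
	return unknown
}

func main() {
	index := map[string]bool{"Role": true} // field names taken from the index
	gen := "package p\nfunc f(u User) bool { return u.IsAdmin }"
	fmt.Println(checkSymbols(gen, index)) // flags IsAdmin as unknown
}
```

Run as a CI step over every model-generated diff, this catches the struct-field hallucinations described earlier before they reach human review.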

The broader lesson from this work is that the security of AI-assisted software development depends less on the capability of the model and more on the quality of the context it receives. A smaller model with perfect context will produce safer output than a frontier model operating on partial file contents and inferred type information. Organisations investing in AI security should direct their resources accordingly: invest in the infrastructure that constrains model output to reality, rather than in post-hoc detection of the errors that unconstrained models inevitably produce.