It has been widely discussed that using LLMs for code development can create security vulnerabilities, given that they might unknowingly propagate insecure patterns found in their training data. Further, there is an ongoing debate about whether LLM-generated code may include snippets derived from copyrighted repositories, raising IP concerns and creating legal risks. There is also the subtler danger of developer overreliance on LLMs. The convenience of having an AI write code can tempt developers to skip foundational thinking or lose touch with core concepts, which often leads them to trust output that is syntactically correct and seemingly functional without fully understanding or reviewing it. The result is often code that, at best, is suboptimal and, at worst, does not work at all.
The latter relates to the core argument of this article: given their lack of context awareness, LLMs often make poor architectural decisions. LLMs operate on a snapshot of input. Without deep integration into the full codebase, they often lack the context needed for consistent naming conventions, variable reuse, or an understanding of app-specific logic. Lacking any sense of system architecture, they do not reason about scalability, modularity, or long-term maintainability the way a seasoned engineer does. Their solutions might work for now but create technical debt in the long run – tight coupling, global state overuse, or non-idiomatic design (see the sketch below). Consequently, the time saved writing code is outweighed by the time spent debugging and aligning the AI’s output, and this gets significantly worse once LLMs are tasked with writing long and complex code.
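To make that kind of technical debt concrete, here is a deliberately simplified, hypothetical Python sketch contrasting the global-state style generated code often drifts into with an explicitly parameterized alternative. All names are illustrative, not taken from any real model output.

```python
# --- Style LLM output often drifts into: implicit global state, tight coupling ---
config = {"tax_rate": 0.19}      # module-level mutable state
cart = []                        # shared global list

def add_item(price):
    cart.append(price)           # every caller silently mutates the same global

def checkout():
    subtotal = sum(cart)
    return subtotal * (1 + config["tax_rate"])   # hidden dependency on `config`

# --- More maintainable alternative: dependencies are explicit and testable ---
from dataclasses import dataclass, field

@dataclass
class Cart:
    tax_rate: float
    items: list[float] = field(default_factory=list)

    def add_item(self, price: float) -> None:
        self.items.append(price)

    def checkout(self) -> float:
        return sum(self.items) * (1 + self.tax_rate)

if __name__ == "__main__":
    cart_a = Cart(tax_rate=0.19)
    cart_a.add_item(10.0)
    print(round(cart_a.checkout(), 2))   # 11.9 – state is local to the instance
```

The global version works in a single script, but it breaks down as soon as there are two carts, concurrent requests, or unit tests – exactly the “works for now, hurts later” pattern described above.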
But why do LLMs often fail at writing long and complex code? There are several interrelated reasons rooted in both how they work and the nature of software engineering. Here is a breakdown of why this happens:
1. Limited Context Window
Most LLMs have a fixed “context window” – a limit on how much text they can “see” and remember at once. While this window is expanding (e.g., 32K–128K tokens), it is still much smaller than a large codebase. When generating long code, the model loses track of earlier definitions, variable names, and logic, which can lead to inconsistencies, forgotten imports, misused variables, and broken references. This holds even in the (for now) hypothetical case of an unlimited context window: the high noise-to-signal ratio would cause the model to hallucinate, because it cannot effectively distinguish between more and less important tokens.
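A hypothetical illustration of this failure mode: assume the definition below has scrolled out of the model’s context window, so a later call site is regenerated against a remembered but stale signature. The function and parameter names are invented for the example.

```python
# Early in a long generation (assume this has since scrolled out of context):
def load_user(user_id: int, *, include_orders: bool = False) -> dict:
    """Fetch a user record; orders are attached only on request."""
    return {"id": user_id, "orders": [] if include_orders else None}

# Hundreds of lines later, the model "re-imagines" the signature it can no longer see:
# profile = load_user(user_id="42", with_orders=True)   # TypeError: unexpected keyword
#
# The stale call passes a string instead of an int and invents a `with_orders`
# keyword that never existed – exactly the broken references described above.
profile = load_user(42, include_orders=True)             # what the call should look like
print(profile)
```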
2. No True Understanding or Reasoning
LLMs do not truly “understand” code – they predict the next token based on patterns in training data. They can mimic syntax and common patterns, but they do not build or manipulate internal representations of program logic the way a compiler or a human does. Consequently, they cannot reason about program flow, performance, or correctness at scale, which can lead to logical errors, inefficient structures, or unhandled edge cases that no amount of pattern matching can fix.
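As a hypothetical illustration, the first `median` below resembles the kind of pattern a model might reproduce from training data, yet it mishandles even-length inputs and empty lists. The corrected version requires reasoning about behaviour, not just surface form.

```python
# Pattern-plausible but logically wrong: it "looks like" median code seen in training data.
def median_naive(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]        # wrong for even-length lists, crashes on []

# Reasoned version: handles the even case and the empty case explicitly.
def median(values):
    if not values:
        raise ValueError("median() of empty sequence")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_naive([1, 2, 3, 4]))   # 3 – silently wrong (the true median is 2.5)
print(median([1, 2, 3, 4]))         # 2.5
```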
3. Difficulty Maintaining Global State
Complex software often involves managing state across many functions, modules, and files. LLMs struggle with this because they do not persist internal state across generations (unless manually engineered to) and do not “know” the entire program as an evolving whole. For example, an LLM might define a data structure in one part of the code and then forget its shape or name later.
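A hypothetical sketch of that failure: a record shape is established early, and a “later” part of the generation quietly assumes different field names because the model keeps no persistent memory of the structure it created. The class and field names are illustrative.

```python
from dataclasses import dataclass

# Defined early in the generation: the canonical shape of an order record.
@dataclass
class Order:
    order_id: str
    total_cents: int

orders = [Order(order_id="A-100", total_cents=2599)]

# Much later, without persistent knowledge of that shape, generated code may assume
# fields like `id` and `amount` instead – an AttributeError at runtime:
# revenue = sum(o.amount for o in orders if o.id.startswith("A-"))

# Consistent with the actual definition:
revenue = sum(o.total_cents for o in orders if o.order_id.startswith("A-"))
print(revenue / 100)   # 25.99
```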
4. High Error Accumulation
When generating longer code, small mistakes compound. A bad assumption early in the code can ripple downstream: a single typo or logic flaw in a foundational function can corrupt everything that depends on it, and the effect worsens in complex codebases. The longer the code, the more likely it is to fail as cumulative errors grow.
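A minimal, hypothetical sketch of how one flawed assumption compounds: a foundational conversion helper silently treats milliseconds as seconds, and every figure derived from it is off by three orders of magnitude.

```python
# Foundational helper with a single bad assumption: the input is actually milliseconds.
def to_seconds(duration_ms: float) -> float:
    return duration_ms            # flaw: should be duration_ms / 1000

# Everything built on top inherits the error and amplifies its impact.
def average_latency_s(samples_ms: list[float]) -> float:
    return sum(to_seconds(s) for s in samples_ms) / len(samples_ms)

def breaches_sla(samples_ms: list[float], sla_s: float = 0.5) -> bool:
    return average_latency_s(samples_ms) > sla_s

samples = [120, 180, 240]                 # milliseconds
print(average_latency_s(samples))         # 180.0 "seconds" instead of 0.18
print(breaches_sla(samples))              # True – a false alarm caused by the helper
```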
5. Lack of Intent Awareness
Human developers build complex software by working toward clear, evolving goals. LLMs do not know the high-level intent of what a developer is trying to build unless the developer explicitly specifies it in detail. As a result, the model may produce code that is technically correct but misaligned with the developer’s goal, leading to wrong abstractions, poor architecture, and missing features.
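A hypothetical example of technically correct but misaligned output: asked to “remove duplicate users”, a model might drop only exact duplicate rows, while the developer’s unstated intent was one record per email address. Both versions run; only one matches the goal.

```python
users = [
    {"email": "a@example.com", "name": "Ada"},
    {"email": "a@example.com", "name": "Ada L."},   # same person, different spelling
    {"email": "b@example.com", "name": "Bob"},
]

# Literal reading of "remove duplicates": drop only rows that are exactly identical.
unique_rows = [dict(t) for t in {tuple(sorted(u.items())) for u in users}]
print(len(unique_rows))     # 3 – nothing removed, intent missed

# The intended behaviour (one record per email) had to be stated explicitly:
by_email = {}
for u in users:
    by_email.setdefault(u["email"], u)
print(len(by_email))        # 2 – what the developer actually wanted
```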
6. Training Data Bias and Gaps
LLMs are trained on public code, which is often incomplete or outdated, non-idiomatic or poorly written, and lacking real-world scale or complexity. So, when asked to generate enterprise-grade, multi-module systems, LLMs are simply out of their depth – because they have rarely (if ever) seen such code during training.
7. No Feedback Loop
Humans build software iteratively, constantly testing, refactoring, debugging, and improving. LLMs do not learn from feedback unless they are fine-tuned or specifically designed to do so in a reinforcement learning setting. The model cannot “realize” it is making a mistake unless someone tells it so, and developers end up with static, one-shot generations that lack iterative refinement.
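One common mitigation is to bolt a feedback loop on from the outside. The sketch below is a hypothetical harness, not a real API: `generate_code` is a stand-in for whatever model call is used, and the loop simply re-prompts with the failing test output until the tests pass or the attempt budget runs out.

```python
import subprocess

def generate_code(prompt: str) -> str:
    """Stand-in for a real model call; assumed to return a Python module as text."""
    raise NotImplementedError

def run_tests(path: str = "tests/") -> tuple[bool, str]:
    """Run pytest and return (passed, combined output)."""
    result = subprocess.run(["pytest", path, "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def generate_with_feedback(task: str, max_attempts: int = 3) -> str | None:
    prompt = task
    for _ in range(max_attempts):
        code = generate_code(prompt)
        with open("candidate.py", "w") as f:
            f.write(code)
        passed, report = run_tests()
        if passed:
            return code
        # Feed the failure back in – the refinement step the model cannot do on its own.
        prompt = f"{task}\n\nYour previous attempt failed these tests:\n{report}"
    return None
```

The harness externalizes the test-debug-refactor cycle described above; the model itself still learns nothing between attempts.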
In summary, LLMs are best at short, well-scoped problems – generating utility functions, refactoring small snippets, or answering specific questions. For large and complex projects, they can help accelerate development, but they cannot take charge.