How Will Infrastructure Change in the Age of AI Agents?
I've been thinking about a few questions lately: How will infrastructure evolve in the AI era? As AI coding becomes increasingly capable, what is the core competitive advantage of infrastructure? What should next-generation infrastructure look like?
I want to set aside what I'm currently working on and reason from first principles about the direction and future of the infrastructure landscape as a whole. Not reasoning upward from a specific product, but looking downward from the top — the way AI applications run has changed, and the underlying infrastructure must change with it. So how will it change?
The following are some thoughts from recent reflection. They aren't necessarily correct, nor are they final conclusions — just my understanding at this particular stage. Each point will continue to be expanded and revised going forward. Continuing to think matters more than reaching conclusions.
A Basic Cognitive Framework
Infrastructure is not created out of thin air — it is shaped by the demands of the applications above it.
In the internet era, millions of users accessed a website simultaneously, and traditional databases couldn't handle the load, so Redis emerged for caching and Nginx for load balancing. In the mobile internet era, backends split from monoliths into hundreds of microservices that needed asynchronous communication, so Kafka emerged. In the cloud computing era, applications needed to scale up and down at any time, so Kubernetes and Docker emerged.
The same logic applies every time: the pattern of upper-layer applications changes → existing infrastructure becomes a poor fit → new infrastructure is created.
So the core question for reasoning about AI-era infrastructure is: What is fundamentally different about the way AI applications (especially Agents) run compared to traditional applications?
Based on my understanding, I've identified seven possible changes, each of which could reshape the form of underlying infrastructure.
From "Request-Response" to "Always Alive"
Traditional software runs on a "request-response" model. A user clicks a button, the server processes it, returns a result, and it's done. The whole process might take only a few hundred milliseconds. The server doesn't need to remember you; the next request starts fresh.
AI Agents are different. An Agent is "alive" — once created, it runs continuously. It observes its environment in the background, collects information, thinks through strategies, executes actions, waits for feedback, and adjusts its strategy again. This process might last minutes, hours, or even days.
An analogy: traditional software is like a restaurant waiter — you order, it serves, then it moves on to the next table. An AI Agent is like your personal assistant, always by your side, observing your state and proactively handling things for you.
What does this mean for infrastructure?
Existing compute architectures assume tasks are short-lived. Serverless functions execute for a few seconds and end; Kubernetes Pods can be released once they finish handling a request. But Agents run for long periods, and their resource consumption is extremely uneven — they need GPUs when thinking, almost no resources when waiting for feedback, and then suddenly need heavy compute again when new information arrives.
Billing models don't fit either. Charge per request? Agents are always running, with no clear "requests." Charge by time? Agents spend most of their time waiting, and customers shouldn't pay for waiting. Charge by resource consumption? Consumption fluctuates too wildly for customers to budget.
A new compute layer is needed: one that can "suspend" Agents (preserving state without consuming resources) and "wake" them (instantly resuming execution), switching seamlessly between the two states. This is somewhat like mobile app background management — freeze when not in use, restore immediately when needed. But today's cloud computing was never designed for this model. In the future, an "Agent Runtime" designed specifically for Agents will emerge — an underlying platform for efficiently running, suspending, waking, and migrating Agents.
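The suspend/wake lifecycle described above can be sketched in a few lines. This is a toy illustration, not a real runtime: `AgentRuntime`, `AgentSnapshot`, and all their fields are hypothetical names, and JSON serialization stands in for real state capture.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class AgentSnapshot:
    """Hypothetical agent state: enough to resume without replaying history."""
    agent_id: str
    goal: str
    memory: list = field(default_factory=list)
    suspended_at: float = 0.0


class AgentRuntime:
    """Toy runtime: suspend frees compute and keeps only a cheap serialized
    snapshot; wake restores it so execution resumes where it left off."""

    def __init__(self):
        self._active = {}   # agent_id -> AgentSnapshot, holding compute
        self._frozen = {}   # agent_id -> JSON string, no compute held

    def spawn(self, agent_id, goal):
        snap = AgentSnapshot(agent_id=agent_id, goal=goal)
        self._active[agent_id] = snap
        return snap

    def suspend(self, agent_id):
        snap = self._active.pop(agent_id)
        snap.suspended_at = time.time()
        # State persists as cheap bytes; the compute slot is released.
        self._frozen[agent_id] = json.dumps(asdict(snap))

    def wake(self, agent_id):
        snap = AgentSnapshot(**json.loads(self._frozen.pop(agent_id)))
        self._active[agent_id] = snap
        return snap
```

A real implementation would snapshot model context and in-flight tool calls rather than a dataclass, and park frozen state in cheap storage. The point is the API shape: spawn, suspend, wake, with state surviving the round trip.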
From "Fixed Pipelines" to "Dynamic Networks"
In traditional software, data flows are designed in advance by architects. User request → API gateway → Service A → Service B → Database: this chain is determined before deployment and doesn't change at runtime.
In the AI Agent era, data flows are dynamic. While executing a task, an Agent may decide on the fly to call another Agent, which in turn calls a third. This call relationship isn't designed upfront — it's decided by the Agent itself at runtime based on need.
An analogy: traditional software is like a factory assembly line, where each part follows a fixed path. AI Agents are like employees in a company — today collaborating with department A, tomorrow meeting with department B, the day after that pulling together an ad-hoc cross-departmental project. Who works with whom, and how, is decided in real time.
What does this mean for infrastructure?
Existing message brokers (Kafka, RabbitMQ) assume that topics and subscription relationships are predefined. You have to create the topic first and configure who produces and who consumes. But communication between Agents is established dynamically — two Agents may need to collaborate today and not tomorrow. If every collaboration requires manually creating a topic and configuring subscriptions, it simply can't keep pace with Agents.
Service discovery has the same problem. Existing service registries assume services are relatively fixed. But Agents are created and destroyed dynamically — a complex task might temporarily create ten Agents, all of which are destroyed when the task completes. Existing service discovery mechanisms can't handle this level of frequent registration and deregistration.
What's needed is: a communication infrastructure that supports dynamic topologies — Agents can establish communication channels at any time, join or leave at any time, messages can be reliably delivered, and none of this requires manual configuration.
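What "no manual configuration" could look like can be shown with a toy in-memory bus (all names here are invented; a real system would add persistence, delivery guarantees, and access control). A channel exists only while someone is subscribed, and joining one is a single call rather than an admin step.

```python
from collections import defaultdict, deque


class DynamicBus:
    """Toy message bus with no predeclared topics: a channel comes into
    existence on first join and disappears when its last subscriber leaves."""

    def __init__(self):
        self._subs = defaultdict(set)     # channel -> subscribed agent ids
        self._inbox = defaultdict(deque)  # agent id -> pending messages

    def join(self, channel, agent_id):
        self._subs[channel].add(agent_id)  # no admin step, no config file

    def leave(self, channel, agent_id):
        self._subs[channel].discard(agent_id)
        if not self._subs[channel]:
            del self._subs[channel]        # empty channel is garbage-collected

    def publish(self, channel, sender, payload):
        for agent_id in self._subs.get(channel, ()):
            if agent_id != sender:
                self._inbox[agent_id].append((channel, sender, payload))

    def poll(self, agent_id):
        box = self._inbox[agent_id]
        return box.popleft() if box else None
```

For example, two agents spun up for one task can `join("task-42", ...)`, exchange messages, and `leave` when the task completes, at which point the channel itself vanishes.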
From "Storing Data" to "Storing Knowledge"
Traditional applications store structured data — user tables, order tables, log files. The meaning of the data is given by application code; the database itself doesn't understand the semantics. If you store a row {"name": "John", "age": 30}, the database only knows this is a string and a number — it doesn't know who John is or what being 30 means.
AI systems need to store "knowledge" — not just raw data, but also the meaning of the data, the relationships between data points, and the context of the data. When an Agent makes a decision, it isn't querying a single row — it needs to understand "what is this user's recent behavioral pattern?", "which previous events is this event related to?", "how credible is this piece of information?"
An analogy: a traditional database is like a library's bookshelf, organized by call number — if you know where the book is, you can find it. What AI needs is a librarian who can understand content — you ask "is there material on the influence of 19th-century French economic crises on literature?" and it can find relevant books from different shelves, knowing which are most relevant and most authoritative.
What does this mean for infrastructure?
Current storage technology is fragmented — structured data goes in MySQL, documents in MongoDB, search in Elasticsearch, vectors in Pinecone, graph relationships in Neo4j. Different kinds of data live in different systems, and a single question often means stitching several of them together.
AI applications need a unified "knowledge layer" — when data is stored, the system automatically understands its semantics, builds vector indices, associates related data, and annotates context. Querying uses not exact SQL condition matching but semantic querying — "find all information related to this topic, sorted by relevance."
Today's vector databases are the first step in this direction, but they're too primitive. The "knowledge store" of the future will need: semantic indexing (understanding data meaning), relational reasoning (understanding associations between data), temporal awareness (understanding the timeliness of data), and credibility assessment (understanding the reliability of data). This is an entirely new storage category, and no mature product exists yet.
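A rough sketch of what such a query path could look like, with a bag-of-words counter standing in for a real embedding model and a single credibility multiplier standing in for real trust modeling — everything here is a simplification for illustration:

```python
import math
import time
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class KnowledgeStore:
    """Toy 'knowledge layer': each record carries an embedding, a timestamp,
    and a credibility score; queries rank by semantic relevance, not by
    exact condition matching."""

    def __init__(self):
        self._records = []

    def put(self, text, credibility=1.0, related_to=None):
        self._records.append({
            "text": text,
            "vec": embed(text),           # semantic index, built on write
            "stored_at": time.time(),     # temporal awareness
            "credibility": credibility,   # reliability annotation
            "related_to": related_to or [],
        })

    def query(self, question, top_k=3):
        q = embed(question)
        ranked = sorted(self._records,
                        key=lambda r: cosine(q, r["vec"]) * r["credibility"],
                        reverse=True)
        return [r["text"] for r in ranked[:top_k]]
```

A production knowledge store would use learned embeddings, a real graph for `related_to`, and time-decay in the ranking; the sketch only shows how semantics, time, and credibility attach to data at write time.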
From "Watching Metrics" to "Understanding Causality"
Traditional observability is about watching numbers — CPU utilization at 80%, request latency at 200ms, error rate at 0.1%. These metrics tell you "what state is the system in," but not "why is it in this state." Experienced SREs can infer causes from metrics, but that inference happens in the human brain, not in the tooling.
AI system observability is an entirely different problem. An Agent makes a decision — say, rejecting a transaction, recommending a product, or sending an email. What you need to know is not "how much CPU," but: why did it make that decision? What data did it reference? What was its reasoning process? If it had been given different data, would it have decided differently?
An analogy: traditional observability is like a car dashboard — speed, fuel level, temperature; you see the numbers and know the state. What AI systems need is a dashcam plus a black box — not just recording the outcome, but recording the entire decision-making process, which can be replayed and analyzed after the fact.
What does this mean for infrastructure?
Existing monitoring tools (Prometheus, Grafana, Datadog) are all designed around "metrics" — collecting values, drawing charts, setting thresholds, triggering alerts. But an Agent's decision process is not a single value — it's a complex chain of reasoning: what was the input → which tools were called → what intermediate results were obtained → what conclusion was ultimately reasoned → what action was executed.
This reasoning chain needs to be fully recorded, searchable, replayable, and analyzable. And it's not enough to just do post-hoc analysis — during Agent execution, it may be necessary to monitor in real time whether the reasoning is deviating from expectations, and to be able to intervene promptly if it does.
This gives rise to an entirely new category: Agent observability. It requires visualization of reasoning chains, attribution analysis of decisions, automatic detection of anomalous reasoning, and cross-Agent causal tracing (e.g., tracing how Agent A's output led Agent B to a wrong decision). The market for this category may be no smaller than Datadog's — because every company using AI needs to understand and audit AI behavior, especially in highly regulated industries like finance, healthcare, and law.
From "Keeping Out Outsiders" to "Managing Insiders"
The traditional security model is simple: trust internal, defend against external. Firewalls separate the company network from the outside world, and internal services trust each other by default. Even with zero-trust architecture, the core is still "verify identity, then authorize" — who are you? Do you have permission to access this resource?
AI Agents introduce a completely new security problem: Agents themselves may do dangerous things — and not because they're "malicious," but because they've made a "judgment error." An Agent with permission to access a database might delete important data due to a reasoning error. An Agent with permission to send emails might send inappropriate content. It has legitimate permissions, but its behavior is wrong.
An analogy: traditional security is like a company's access control system — outsiders can't get in, and those who do can move freely within the company. AI Agent-era security is like managing a group of interns — they have badges that let them into the company, but you need to watch everything they do to make sure they don't cause a disaster.
What does this mean for infrastructure?
Security checks need to move from "at access time" down to "at execution time." It's not enough to let an Agent proceed once it has obtained permission — before every action is executed, you need to check: is this action reasonable in the current context? Is the amount of data it wants to delete abnormal? Does the content it wants to send comply with policy?
This is "intent verification" — not just verifying identity and permissions, but also verifying the reasonableness of the behavior. It requires a real-time behavioral analysis engine, a context-based policy engine, anomaly detection, and automatic circuit-breaker mechanisms. Almost no product does this today, but as Agents become more numerous and their permissions grow larger, this need will only become more urgent.
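A minimal sketch of the idea, with made-up thresholds: identity and permission checks have already passed by the time this layer runs; it only asks whether the specific action is reasonable in context, and trips a circuit breaker after repeated rejections.

```python
class IntentVerifier:
    """Toy execution-time check: the agent already *has* permission;
    we still ask whether this particular action is reasonable right now."""

    def __init__(self, max_rows_deleted=100, max_failures=3):
        self.max_rows_deleted = max_rows_deleted  # hypothetical policy limit
        self.max_failures = max_failures
        self._failures = 0
        self.tripped = False  # circuit breaker: True pauses the agent

    def check(self, action):
        """Return (allowed, reason) for a proposed action dict."""
        if self.tripped:
            return False, "circuit breaker open: agent paused for review"
        if action["type"] == "delete" and action["rows"] > self.max_rows_deleted:
            self._failures += 1
            if self._failures >= self.max_failures:
                self.tripped = True  # repeated bad intent: stop everything
            return False, f"deleting {action['rows']} rows exceeds limit"
        return True, "ok"
```

A real policy engine would evaluate context (who asked, what changed recently, what the normal baseline looks like) rather than a fixed row count, but the control flow — check every action, refuse, escalate to a hard stop — is the core of intent verification.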
From "One Protocol Rules All" to "Protocol Jungle"
The communication protocols of the traditional internet were relatively unified — the web uses HTTP, inter-service communication uses gRPC or REST, messaging uses the Kafka protocol or AMQP. Each scenario has one dominant protocol, and everyone uses the same one.
In the AI Agent era, protocols will become extremely diverse. Because the entities an Agent communicates with are so varied: natural language for interacting with humans, MCP protocol for interacting with tools, A2A protocol for interacting with other Agents, MQTT for interacting with IoT devices, SQL or APIs for interacting with data systems. The same Agent may need to communicate with all of these simultaneously.
An analogy: traditional software is like a domestic company where everyone just speaks the same language. An AI Agent is like an employee at a multinational company — speaking English with American colleagues, French with French clients, Japanese with Japanese suppliers, and needing to accurately translate information between all these languages.
What does this mean for infrastructure?
Existing infrastructure is typically designed around a single protocol — Kafka serves the Kafka protocol, MQTT brokers serve the MQTT protocol, HTTP gateways serve the HTTP protocol. Bridging components are needed to interconnect them, and every bridge adds another layer of latency and another failure point.
The future requires "natively multi-lingual" infrastructure — a single system that naturally supports multiple protocols, with data flow between different protocols built in and requiring no external bridging. Instructions issued by an Agent via MCP can be directly converted into MQTT messages sent to devices, and device responses can be converted into Kafka messages entering a data pipeline. The entire process completes within one system, with protocol conversion automatic and transparent.
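The core mechanic here is a shared internal envelope. The sketch below is illustrative only — the field names and the MCP/MQTT/Kafka message shapes are simplified stand-ins, not the real wire formats of those protocols:

```python
import json


def normalize(protocol, raw):
    """Map an inbound message from any supported protocol into one
    internal envelope, so routing needs no pairwise protocol bridges."""
    if protocol == "mcp":
        return {"source": "mcp", "kind": "command",
                "target": raw["tool"], "body": raw["arguments"]}
    if protocol == "mqtt":
        return {"source": "mqtt", "kind": "telemetry",
                "target": raw["topic"], "body": json.loads(raw["payload"])}
    raise ValueError(f"unsupported inbound protocol: {protocol}")


def emit(envelope, protocol):
    """Render the internal envelope in a target protocol's shape."""
    if protocol == "mqtt":
        return {"topic": envelope["target"],
                "payload": json.dumps(envelope["body"])}
    if protocol == "kafka":
        return {"topic": f'{envelope["source"]}.{envelope["kind"]}',
                "key": envelope["target"],
                "value": json.dumps(envelope["body"])}
    raise ValueError(f"unsupported outbound protocol: {protocol}")
```

With this shape, the example from the text — an MCP instruction becoming an MQTT message to a device — is `emit(normalize("mcp", msg), "mqtt")`: one conversion through one internal form, rather than a dedicated MCP-to-MQTT bridge.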
From "Humans Managing Systems" to "Systems Managing Systems"
Traditional infrastructure operations are human-driven. DBAs tune database parameters, SREs write alerting rules, architects design scaling strategies. Tools are aids; decisions are made by humans.
In the AI era, the operation of infrastructure itself will be taken over by AI Agents, because the scale and complexity of AI systems exceed what humans can manage. When a company runs thousands of Agents, each with its own resource needs, communication patterns, and storage requirements — and these needs are dynamically changing — no human ops team can manage this level of complexity in real time.
An analogy: traditional operations is like manual driving — watching the road, pressing the accelerator, turning the wheel; humans are in full control throughout. AI-era operations is like autonomous driving — the system perceives the environment, makes decisions, and executes operations on its own; humans only need to set the destination and supervise.
What does this mean for infrastructure?
The competitiveness of infrastructure products will gain a new dimension: "Can it be effectively managed by AI?"
APIs need to be sufficiently complete and consistent — AI Agents manage infrastructure through APIs, and if the API design is poor, the Agent can't manage it well. Behavior needs to be predictable — Agents need to understand "if I adjust this parameter, how will the system change?"; if behavior is unpredictable, Agents can't make correct decisions. Feedback needs to be real-time and structured — Agents need to know the result of an operation in real time, not buried in logs that humans have to dig through.
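A small sketch of what "structured feedback" could mean in practice (hypothetical types and fields): an operation returns both what the API contract predicted and what was actually observed, so a managing Agent can compute the deviation directly instead of parsing human-oriented logs.

```python
from dataclasses import dataclass


@dataclass
class OperationResult:
    """Structured feedback a managing Agent can act on directly."""
    operation: str
    success: bool
    expected: dict  # what the API contract predicted would happen
    observed: dict  # what was actually measured after the change

    def deviation(self):
        """Keys where the system did not behave as predicted."""
        return {k for k in self.expected
                if self.observed.get(k) != self.expected[k]}
```

If scaling to 5 replicas only yields 3, `deviation()` returns `{"replicas"}` — a machine-readable signal the Agent can react to immediately, which is exactly the predictability-plus-feedback property the text describes.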
This will lead to an interesting outcome: infrastructure products with cleaner designs will actually have an advantage. Legacy systems with heavy historical baggage, messy APIs, and unpredictable behavior will become increasingly difficult for AI to manage. Newly designed products that consider AI manageability from day one will gain additional competitive advantages because they are easier for AI to operate.
Summary: The Overall Direction of Infrastructure Evolution
Looking at all seven changes together, the direction of infrastructure evolution is:
From stateless to stateful, from fixed topologies to dynamic meshes, from storing capacity to storing semantics, from monitoring to understanding, from perimeter security to intent security, from single-protocol to multi-protocol, from manual operations to AI autonomy.
In one sentence: from "tools designed for humans" to "platforms designed for AI."
The design assumption of traditional infrastructure is: humans configure, humans monitor, humans decide, humans operate. The design assumption of future infrastructure will become: AI uses, AI monitors, AI decides, AI operates. The human role shifts from "operator" to "policy setter" — defining goals and constraints, leaving the actual execution to AI.
This is not a distant future — it is a gradual process already underway. Each of these directions will give rise to new infrastructure products, and most directions currently have no mature solutions.
We are standing at the boundary between two eras — running new-paradigm applications on old-paradigm infrastructure. The gap between them is the opportunity for this generation of infrastructure.
