[{"data":1,"prerenderedAt":2920},["ShallowReactive",2],{"/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them":3,"related-/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them":510},{"id":4,"title":5,"authorId":6,"body":7,"category":461,"created":462,"description":463,"extension":464,"faqs":465,"featurePriority":481,"head":482,"landingPath":482,"meta":483,"navigation":498,"ogImage":482,"path":499,"robots":482,"schemaOrg":482,"seo":500,"sitemap":501,"stem":502,"tags":503,"__hash__":509},"blog/blog/1036.top-5-voice-ai-agent-failures-and-how-to-fix-them.md","Top 5 Voice AI Agent Failures and How to Fix Them","salome-koshadze",{"type":8,"value":9,"toc":452},"minimark",[10,14,17,26,34,37,40,68,71,117,123,126,130,133,159,162,205,211,214,217,221,224,250,253,285,291,294,297,323,326,370,376,379,382,408,411,449],[11,12,13],"p",{},"Voice AI agents break in predictable ways. The STT misreads a word and triggers the wrong action. The agent forgets what was said two turns ago. A user asks something slightly outside scope and hits a dead end. The response takes three seconds and the user assumes the call dropped. Someone calls in distress and gets a robotic recitation of account details.",[11,15,16],{},"None of these are unsolvable research problems. They are engineering and design gaps - specific, fixable, and common enough that most production voice agents will encounter all five. This article covers each failure in concrete terms and explains exactly what to do about it.",[18,19],"nuxt-picture",{":height":20,":width":21,"alt":22,"loading":23,"provider":24,"src":25},"450","800","Overview of common voice AI agent failures including speech recognition errors, context loss, and robotic speech","lazy","none","/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them/1.svg",[27,28,30],"h2",{"id":29},"_1-incorrect-speech-to-text-conversion-and-intent-interpretation",[31,32,33],"strong",{},"1. 
Incorrect Speech-to-Text Conversion and Intent Interpretation",[11,35,36],{},"Voice AI agents rely on accurately converting spoken words into text and then correctly identifying what the user wants. When this process fails, the agent cannot provide a helpful or relevant response. This often leads to user frustration and a breakdown in the interaction. These problems manifest in various ways, from simple transcription errors to complete misinterpretations of a user's goal.",[11,38,39],{},"A common real-world scenario: a user calls support and says \"cancel my renewal,\" but background noise causes the agent to transcribe it as \"cancel my annual.\" The agent pulls up annual plan details instead of processing a cancellation. The user repeats themselves twice, grows frustrated, and either hangs up or demands a human - an entirely avoidable failure.",[41,42,43,50,56,62],"ul",{},[44,45,46,49],"li",{},[31,47,48],{},"Irrelevant Responses:"," The agent gives answers unrelated to the user's actual question.",[44,51,52,55],{},[31,53,54],{},"Repetitive Clarifications:"," The agent repeatedly asks the user to rephrase or clarify their request.",[44,57,58,61],{},[31,59,60],{},"Task Failure:"," The agent attempts an action that does not align with what the user asked for, or fails to start any action.",[44,63,64,67],{},[31,65,66],{},"Long Interaction Times:"," Users spend too much time trying to get their point across, lengthening calls or sessions.",[11,69,70],{},"These issues often come from limitations in the agent's acoustic models or natural language processing (NLP) components. Background noise, diverse accents, rapid speech, and domain-specific terminology can all make speech conversion difficult. A narrow intent model may also struggle with varied phrasing or complex requests, even when the speech conversion is perfect. 
A useful benchmark: a word error rate (WER) above 10-15% in production typically translates directly into failed interactions - users begin dropping off when they have to repeat themselves more than once.",[41,72,73,93,99,105,111],{},[44,74,75,78,79,86,87,92],{},[31,76,77],{},"Extensive Data Training:"," Train the speech recognition model with a wide array of audio data, including different accents, speaking speeds, and environmental noises. Modern STT providers like ",[80,81,85],"a",{"href":82,"rel":83},"https://deepgram.com",[84],"nofollow","Deepgram"," (Nova-2 model) and ",[80,88,91],{"href":89,"rel":90},"https://www.assemblyai.com",[84],"AssemblyAI"," offer models pre-trained on millions of hours of diverse speech, which is a far better baseline than training from scratch.",[44,94,95,98],{},[31,96,97],{},"Domain-Specific Vocabulary:"," Create and regularly update a vocabulary list relevant to the agent's tasks. Most production STT APIs - including Deepgram, AssemblyAI, and Google Speech-to-Text - support keyword boosting or custom vocabulary, letting you weight industry terms, product names, and service details so the model favors them during transcription.",[44,100,101,104],{},[31,102,103],{},"Contextual NLP Models:"," Implement NLP models that consider the full conversation history and user profile to better predict intent. LLM-based approaches (using GPT-4o or Claude as the intent layer) handle ambiguous or varied phrasing far better than rigid rule-based classifiers or older slot-filling frameworks.",[44,106,107,110],{},[31,108,109],{},"Fallback to Human Agents:"," If the AI agent detects low confidence in its interpretation, it should offer a smooth transfer to a human support agent. Set a confidence threshold (commonly 0.6-0.7) below which the agent stops guessing and hands off instead. 
This prevents endless loops of confusion.",[44,112,113,116],{},[31,114,115],{},"User Feedback Loops:"," Log every interaction where the confidence score was low or the user had to repeat themselves. Review these transcripts weekly - even a sample of 50-100 failed calls is usually enough to surface the top misrecognized phrases and intent gaps. Feed those patterns back into your vocabulary lists and intent training data.",[27,118,120],{"id":119},"_2-difficulty-with-multi-turn-conversations-and-context-retention",[31,121,122],{},"2. Difficulty with Multi-Turn Conversations and Context Retention",[11,124,125],{},"Voice AI agents frequently struggle to maintain context across several turns of a conversation. This limitation prevents them from engaging in natural, human-like dialogue, often leading to fragmented and frustrating user experiences.",[18,127],{":height":20,":width":21,"alt":128,"loading":23,"provider":24,"src":129},"Diagram showing how poor context retention breaks multi-turn voice conversations","/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them/2.svg",[11,131,132],{},"Consider this three-turn exchange with a travel booking agent. Turn 1: \"I want to fly to Paris next Friday.\" Turn 2: \"Make it business class.\" Turn 3: \"What's the total price?\" - a stateless agent hits turn 3 with no memory of Paris or business class, returns a confused response, and the user has to start over. 
Each turn was handled correctly in isolation; the failure is entirely in the connective tissue between them.",[41,134,135,141,147,153],{},[44,136,137,140],{},[31,138,139],{},"Disregard for Prior Statements:"," The agent asks for information already provided by the user within the same interaction.",[44,142,143,146],{},[31,144,145],{},"Inability to Answer Follow-Up Questions:"," Users cannot ask clarifying questions about a previous agent response or build upon it.",[44,148,149,152],{},[31,150,151],{},"Fragmented Interactions:"," Each exchange feels like a new, isolated conversation, rather than a continuous dialogue with a memory of past inputs.",[44,154,155,158],{},[31,156,157],{},"User Frustration from Repetition:"," Users need to re-state information or re-explain their situation frequently, lengthening the interaction.",[11,160,161],{},"The core problem often lies in how the agent's memory and state management systems are designed. Many basic agents reset their understanding after each user utterance, failing to store or effectively retrieve details from earlier in the discussion. This can result from simple session limitations or a lack of sophisticated context-tracking mechanisms in the underlying AI models.",[41,163,164,175,181,193,199],{},[44,165,166,169,170,174],{},[31,167,168],{},"Effective Session Management:"," The most direct fix is passing the full conversation history to your LLM on every turn - the same ",[171,172,173],"code",{},"messages"," array pattern used by the OpenAI and Anthropic APIs. Store each user utterance and agent response as it happens (in memory, or in Redis for multi-server deployments), then include the entire history in each new API call. This alone eliminates most context-loss failures.",[44,176,177,180],{},[31,178,179],{},"Contextual Variable Tracking:"," As each turn is processed, extract and store key entities - dates, names, amounts, locations - in a session object alongside the conversation history. 
On each subsequent turn, inject these into the prompt so the agent can reference them explicitly, even if the LLM's attention drifts across a long conversation.",[44,182,183,186,187,192],{},[31,184,185],{},"Dialogue State Tracking:"," Maintain a lightweight state object that records the user's current goal and what information has already been collected. Frameworks like ",[80,188,191],{"href":189,"rel":190},"https://www.langchain.com/langgraph",[84],"LangGraph"," are built specifically for this - they let you define conversation states as nodes in a graph, with conditional transitions based on what the user has and hasn't confirmed.",[44,194,195,198],{},[31,196,197],{},"Anaphora Resolution:"," When you pass full conversation history to a capable LLM (GPT-4o, Claude), pronoun resolution (\"it,\" \"that,\" \"them\") is largely handled automatically - the model has the antecedents in context. The failure mode is usually a missing history, not a missing capability.",[44,200,201,204],{},[31,202,203],{},"Conversation Testing:"," Rather than treating multi-turn failures as a training problem, write automated conversation tests: scripted dialogues where you assert the agent's state at each turn. Catching regressions early - before they reach users - is more practical than tuning after the fact.",[27,206,208],{"id":207},"_3-limited-scope-and-handling-of-out-of-scope-requests",[31,209,210],{},"3. Limited Scope and Handling of Out-of-Scope Requests",[11,212,213],{},"Voice AI agents are usually built to handle a specific set of tasks or answer questions within a defined knowledge domain. Users, however, do not always know these operational boundaries and might ask questions that fall outside the agent's programmed capabilities. 
This mismatch often leads to ineffective interactions, leaving users without the help they seek.",[11,215,216],{},"These problems surface when an agent cannot address even slightly related inquiries or offer guidance for topics it was not explicitly trained on. The result can be a rigid, unhelpful system that quickly diverts users to other channels, rather than providing a complete self-service option.",[18,218],{":height":20,":width":21,"alt":219,"loading":23,"provider":24,"src":220},"Illustration of a voice AI agent hitting its scope limit and failing to handle out-of-domain requests","/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them/3.svg",[11,222,223],{},"A typical example: a banking voice agent built to handle balance checks and fund transfers. A user asks \"can you help me dispute a charge?\" - a completely reasonable request for a banking assistant. The agent responds \"I'm sorry, I can't help with that\" and ends the call. No alternative offered, no direction given, no handoff attempted. The user calls back, waits in a queue, and speaks to a human who handles it in two minutes.",[41,225,226,232,238,244],{},[44,227,228,231],{},[31,229,230],{},"Repetitive \"Cannot Help\" Responses:"," The agent frequently states it cannot assist with a request, without offering alternative solutions.",[44,233,234,237],{},[31,235,236],{},"Abrupt Conversation Endings:"," The system terminates the interaction or sends the user to a generic support line for requests it does not understand.",[44,239,240,243],{},[31,241,242],{},"Inability to Adapt:"," The agent fails to answer slightly rephrased questions even if the core intent is within its general area of knowledge.",[44,245,246,249],{},[31,247,248],{},"Unnecessary Channel Shifts:"," Users are moved to a human agent or directed to a website for minor requests that a more capable AI could potentially manage.",[11,251,252],{},"The underlying cause for these limitations often stems from the agent's architecture. 
Agents built on rigid intent classifiers - where every request must match a predefined label - fail hard at the boundary. An unrecognized intent returns nothing. Agents built on LLMs with a well-crafted system prompt handle edge cases far more gracefully, because the model can reason about related topics even if they weren't explicitly covered in the prompt.",[41,254,255,261,267,273,279],{},[44,256,257,260],{},[31,258,259],{},"LLM-Based Intent Handling:"," Replace or supplement rigid intent classifiers with an LLM reasoning layer. Give the model a clear system prompt defining its scope, then let it decide what it can and cannot help with - it will handle near-miss and rephrased queries far better than a fixed classifier ever will.",[44,262,263,266],{},[31,264,265],{},"Graceful Handoff Protocols:"," The difference between a good and bad handoff is specificity. \"I can't help with that\" is a dead end. \"I can't process disputes directly, but I can transfer you to our disputes team now - they typically resolve these within 24 hours. Want me to connect you?\" is a complete interaction. Script these handoff responses for every known out-of-scope category and route to the correct destination, not a generic queue.",[44,268,269,272],{},[31,270,271],{},"Clear Scope Communication at the Start:"," Open every session with a one-sentence scope statement - \"I can help you check balances, transfer funds, or pay bills\" - so users know what to ask before they hit a wall. This also reduces the volume of out-of-scope requests significantly.",[44,274,275,278],{},[31,276,277],{},"Basic Conversational Capacity:"," Handle greetings, thank-yous, and minor digressions naturally rather than returning an error. 
These are trivial to cover in a system prompt and go a long way toward making the agent feel competent rather than brittle.",[44,280,281,284],{},[31,282,283],{},"Continuous Feedback and Expansion:"," Export every interaction flagged as out-of-scope and review the top patterns monthly. If the same unhandled request appears more than a handful of times, it belongs in scope - add it to the system prompt or route it to the appropriate resource.",[27,286,288],{"id":287},"_4-response-latency-and-unnatural-speech-delivery",[31,289,290],{},"4. Response Latency and Unnatural Speech Delivery",[11,292,293],{},"Voice AI agents seek to offer prompt, effective service. However, delays in their replies or an overly robotic speaking style can lessen the quality of the user experience. These difficulties make interactions feel less natural and more frustrating, often discouraging users from depending on automated systems.",[11,295,296],{},"The dead air problem is concrete: a user asks a question, the agent goes silent for three seconds while processing, and the user says \"hello?\" - assuming the call dropped. Many hang up. Research consistently shows that humans begin perceiving pauses as uncomfortable around 1.5 seconds; beyond 3 seconds, drop-off rates climb sharply. 
In voice, latency is not a performance metric - it is a user experience failure.",[41,298,299,305,311,317],{},[44,300,301,304],{},[31,302,303],{},"Hesitation in Responses:"," The agent takes noticeable pauses before speaking, creating awkward silence.",[44,306,307,310],{},[31,308,309],{},"Robotic or Monotone Speech:"," The agent's voice lacks natural intonation, making it difficult to listen to for extended periods.",[44,312,313,316],{},[31,314,315],{},"Slow Interaction Pace:"," The overall conversation feels sluggish due to delays at each turn of the dialogue.",[44,318,319,322],{},[31,320,321],{},"Increased User Impatience:"," Users may interrupt or terminate the call due to long wait times for replies.",[11,324,325],{},"The pipeline behind every voice agent response has three sequential steps: speech-to-text, LLM inference, and text-to-speech synthesis. Each adds latency, and most teams only optimize one of the three. Unnatural speech is a separate problem - it usually comes from choosing a TTS engine based on price rather than voice quality.",[41,327,328,334,352,358,364],{},[44,329,330,333],{},[31,331,332],{},"Stream the Full Pipeline:"," The single biggest latency win is streaming - start the TTS engine as soon as the LLM outputs its first tokens, rather than waiting for the complete response. Most modern LLM APIs (OpenAI, Anthropic) and TTS providers support streaming. Chaining them cuts perceived response time by 40-60% without changing any underlying model.",[44,335,336,339,340,345,346,351],{},[31,337,338],{},"Choose Low-Latency Providers at Each Step:"," Not all APIs perform equally under real-time constraints. For TTS, ",[80,341,344],{"href":342,"rel":343},"https://elevenlabs.io",[84],"ElevenLabs"," and ",[80,347,350],{"href":348,"rel":349},"https://cartesia.ai",[84],"Cartesia"," are optimized for low time-to-first-audio; Cartesia in particular is built for real-time voice applications with sub-200ms latency. 
For STT, Deepgram's streaming API returns results word-by-word as the user speaks, eliminating the wait for silence detection.",[44,353,354,357],{},[31,355,356],{},"Bridge Unavoidable Gaps with Filler Phrases:"," When a lookup or API call genuinely takes time, have the agent speak a bridging phrase immediately - \"Let me pull that up for you\" or \"One moment while I check.\" This is not a hack; it is how humans naturally handle processing time in conversation. Pre-generate these as cached audio clips to play instantly.",[44,359,360,363],{},[31,361,362],{},"Pre-compute Common Responses:"," Greetings, confirmations, error messages, and other high-frequency responses do not need real-time synthesis. Generate and cache their audio at deployment time - the agent serves a file, not a live API call.",[44,365,366,369],{},[31,367,368],{},"Use a Modern TTS Voice:"," Older TTS engines (including many telephony defaults) produce robotic output because they lack neural prosody modeling. ElevenLabs, PlayHT, and Azure Neural TTS all offer voices that pass casual listening. Test any voice by playing it to someone unfamiliar with the project - if they notice it's synthetic within 10 seconds, it needs upgrading.",[27,371,373],{"id":372},"_5-lack-of-empathy-and-inappropriate-tone",[31,374,375],{},"5. Lack of Empathy and Inappropriate Tone",[11,377,378],{},"Voice AI agents, by their nature, are machines. However, the absence of perceived empathy or the use of an unsuitable tone can make user interactions cold, unhelpful, or even offensive. This is particularly noticeable when users are dealing with sensitive issues or expressing frustration.",[11,380,381],{},"The failure mode here is jarring. A user says: \"I've been trying to fix this for three days and I'm absolutely exhausted.\" The agent responds: \"I understand. Your account status is active. Would you like to reset your password?\" - it processed the words but ignored everything that mattered. 
The user doesn't feel heard, and no amount of accurate information makes up for that in the moment.",[41,383,384,390,396,402],{},[44,385,386,389],{},[31,387,388],{},"Insensitive Responses:"," The agent gives factual, but emotionally detached, replies to users experiencing difficulties.",[44,391,392,395],{},[31,393,394],{},"Monotone Delivery for Emotional Context:"," The voice agent uses a flat tone when a situation calls for a calming or understanding voice.",[44,397,398,401],{},[31,399,400],{},"Failure to Acknowledge User Feelings:"," The agent does not recognize or respond to expressed emotions such as frustration, sadness, or anger.",[44,403,404,407],{},[31,405,406],{},"Generic or Robotic Language:"," The use of overly formal or standard phrases when a more personable approach would be helpful.",[11,409,410],{},"The core issue is that most voice agents treat every turn as a factual query. Emotional context requires a different response pattern - acknowledge first, then inform. This is less a model capability problem and more a design and prompting problem.",[41,412,413,425,431,437,443],{},[44,414,415,418,419,424],{},[31,416,417],{},"Vocal Emotion Detection:"," ",[80,420,423],{"href":421,"rel":422},"https://www.hume.ai",[84],"Hume AI"," is built specifically for this - it analyzes vocal prosody, pace, and tone to return emotion scores in real time, separate from the words spoken. Integrating it as a pre-processing step lets the agent know a user is distressed before the LLM formulates its reply.",[44,426,427,430],{},[31,428,429],{},"Acknowledge Before Answering:"," The most reliable empathy fix is a system prompt instruction: require the agent to acknowledge the user's stated emotion before providing any information when distress signals are detected. 
Not \"I understand this must be frustrating\" as a boilerplate prefix - but a response that references the specific situation, like \"Three days of this sounds genuinely exhausting - let me make sure we sort it out right now.\"",[44,432,433,436],{},[31,434,435],{},"Escalation as Empathy:"," Define hard escalation triggers - a user expresses anger twice, mentions legal action, or uses distress language in consecutive turns. At that point, the most empathetic response is an immediate human handoff, not another attempt by the bot. Script the handoff warmly: \"I want to make sure you get the right help - let me connect you with someone directly.\"",[44,438,439,442],{},[31,440,441],{},"Expressive TTS Settings:"," ElevenLabs allows control over stability and similarity boost, which directly affects how expressive or measured a voice sounds. A lower stability setting produces more natural variation - useful for emotionally engaged conversation. Test different configurations for your specific use case rather than using defaults.",[44,444,445,448],{},[31,446,447],{},"Review Escalated Interactions:"," Flag every session that ended in an emotional escalation or human handoff and listen to a sample weekly. The goal is not to prevent all escalations - some are correct - but to identify where the agent made the situation worse and adjust the response scripts accordingly.",[11,450,451],{},"Most voice AI failures are not model failures - they are design failures. The STT gets the words wrong because nobody added domain vocabulary. The agent forgets context because nobody passed the message history. The handoff feels cold because nobody scripted it. The fixes across all five areas here are engineering decisions, not research problems, and most can be implemented incrementally without rebuilding from scratch. 
Pick the failure that is costing the most right now and start there.",{"title":453,"searchDepth":454,"depth":454,"links":455},"",2,[456,457,458,459,460],{"id":29,"depth":454,"text":33},{"id":119,"depth":454,"text":122},{"id":207,"depth":454,"text":210},{"id":287,"depth":454,"text":290},{"id":372,"depth":454,"text":375},"voice-ai","2026-03-25","Explore the top 5 common voice AI agent failures-from speech recognition errors to lack of empathy-and learn practical strategies to fix them for better user experiences.","md",[466,469,472,475,478],{"question":467,"answer":468},"What is the most common failure in voice AI agents?","Incorrect speech-to-text conversion and intent misinterpretation is the most frequent failure. A word error rate above 10-15% in production typically translates directly into failed interactions. Using a modern STT provider like Deepgram or AssemblyAI with keyword boosting for domain-specific terms reduces this significantly.",{"question":470,"answer":471},"How can I improve context retention in a voice AI agent?","The most direct fix is passing the full conversation history to your LLM on every turn - the same messages array pattern used by the OpenAI and Anthropic APIs. Store each utterance as it happens and include the entire history in each new API call. For more complex flows, LangGraph lets you define conversation states with conditional transitions.",{"question":473,"answer":474},"Why does my voice AI agent sound robotic?","Robotic speech usually means an outdated TTS engine without neural prosody modeling. Switching to ElevenLabs, PlayHT, or Cartesia makes an immediate difference. 
Also check streaming - if you wait for the full LLM response before starting synthesis, perceived latency makes the interaction feel unnatural regardless of voice quality.",{"question":476,"answer":477},"How do I handle requests that fall outside my voice agent's scope?","Replace rigid intent classifiers with an LLM reasoning layer - it handles near-miss and rephrased queries far better. For true out-of-scope requests, script specific handoff responses for each category rather than a generic 'I can't help with that.' Open every session with a one-sentence scope statement so users know what to ask before they hit a wall.",{"question":479,"answer":480},"Can a voice AI agent detect user emotions?","Yes. Hume AI analyzes vocal prosody, pace, and tone to return emotion scores in real time, separate from the words spoken. Pair this with a system prompt instruction to acknowledge emotional state before responding, and define hard escalation triggers - like repeated expressions of anger - for immediate human handoff.",0,null,{"shortTitle":484,"relatedLinks":485},"Top 5 Voice AI Agent Failures",[486,490,494],{"text":487,"href":488,"description":489},"Challenges of Building Reliable Voice AI Agents on Live Websites","/blog/challenges-of-building-reliable-voice-ai-agents-on-live-websites","A deep dive into the technical and operational hurdles of deploying voice agents reliably in production.",{"text":491,"href":492,"description":493},"The Rise of the Voice-Enabled Web: 10 Use Cases You Can't Ignore","/blog/the-rise-of-the-voice-enabled-web-10-use-cases-you-cant-ignore","Discover how voice-enabled web experiences are transforming industries across ten real-world use cases.",{"text":495,"href":496,"description":497},"Top 5 Voice AI Agents for Website Integration in 2026","/blog/top-5-voice-ai-agents-for-website-integration-in-2026","A curated comparison of the best voice AI platforms you can integrate into your website 
today.",true,"/blog/top-5-voice-ai-agent-failures-and-how-to-fix-them",{"title":5,"description":463},{"loc":499},"blog/1036.top-5-voice-ai-agent-failures-and-how-to-fix-them",[504,505,506,507,508],"voice-agents","ai-agents","speech-recognition","nlp","conversational-ai","LhLOwjO56ICwloR8vK_xRWwzvJJoduYEuw11FYxKjz0",[511,2172],{"id":512,"title":513,"authorId":514,"body":515,"category":505,"created":2147,"description":2148,"extension":464,"faqs":482,"featurePriority":482,"head":482,"landingPath":482,"meta":2149,"navigation":498,"ogImage":482,"path":2161,"robots":482,"schemaOrg":482,"seo":2162,"sitemap":2163,"stem":2164,"tags":2165,"__hash__":2171},"blog/blog/1012.dom-downsampling-for-llm-based-web-agents.md","DOM Downsampling for LLM-Based Web Agents","thassilo-schiepanski",{"type":8,"value":516,"toc":2132},[517,523,546,550,557,562,578,582,588,592,610,636,639,643,646,657,663,694,698,718,730,735,750,764,767,771,791,795,803,815,819,822,1215,1221,1228,1392,1399,1490,1497,1569,1578,1584,1593,1597,1603,1613,1625,1859,1877,1899,1905,1948,1952,1964,1973,1977,1982,1985,1989,1995,2000,2038,2042,2048,2052,2062,2066,2069,2128],[18,518],{":width":519,"alt":520,"format":521,"loading":23,"src":522},"900","Downsampling visualised for digital images and HTML","webp","/blog/dom-downsampling-for-web-agents/1.png",[11,524,525,530,531,530,536,541,542,545],{},[80,526,529],{"href":527,"rel":528},"https://operator.chatgpt.com",[84],"Operator (OpenAI)",", ",[80,532,535],{"href":533,"rel":534},"https://www.director.ai",[84],"Director (Browserbase)",[80,537,540],{"href":538,"rel":539},"https://browser-use.com",[84],"Browser Use"," – we are currently witnessing the rise of ",[31,543,544],{},"web AI agents",". The first iteration of serviceable web agents was enabled by frontier LLMs, which act as instantaneous domain model backends. 
The domain here corresponds to the landscape of web application UIs.",[27,547,549],{"id":548},"what-is-a-snapshot","What is a Snapshot?",[11,551,552,553,556],{},"Web agents provide an LLM with a task and the serialised runtime state of a currently browsed web application (e.g., a screenshot). The LLM is expected to suggest relevant actions to perform in the web application. Serialisation of such runtime state is referred to as a ",[31,554,555],{},"snapshot",". The snapshot technique primarily determines the quality of LLM interaction suggestions.",[558,559,561],"h3",{"id":560},"gui-snapshots","GUI Snapshots",[11,563,564,565,568,569,573,574,577],{},"Screenshots – for consistency reasons referred to as ",[31,566,567],{},"GUI snapshots"," – resemble how humans visually perceive web application UIs. LLM APIs subsidise the use of image input through upstream compression. Compression, however, irreversibly affects image dimensions, which takes away pixel precision; there is no way to suggest interactions like ",[570,571,572],"em",{},"“click at 100, 735”",". As a workaround, early web agents used ",[570,575,576],{},"grounded"," GUI snapshots. Grounding describes adding visual cues to the GUI, such as bounding boxes with numerical identifiers. Grounding lets the LLM refer to specific parts of the page by identifier, so the agent can trace back interaction targets.",[18,579],{":width":519,"alt":580,"format":521,"loading":23,"src":581},"Grounded GUI snapshot as implemented by Browser Use","/blog/dom-downsampling-for-web-agents/2.png",[11,583,584],{},[585,586,587],"small",{},"Grounded GUI snapshot as implemented by Browser Use.",[558,589,591],{"id":590},"dom-snapshots","DOM Snapshots",[11,593,594,595,605,606,609],{},"LLMs arguably are much better at understanding code than images. 
Research supports that they excel at describing and classifying HTML, and at navigating an inherent UI",[596,597,598],"sup",{},[80,599,604],{"href":600,"ariaDescribedBy":601,"dataFootnoteRef":453,"id":603},"#user-content-fn-1",[602],"footnote-label","user-content-fnref-1","1",". The DOM (document object model) – a web browser's runtime state model of a web application – translates back to HTML. For this reason, ",[31,607,608],{},"DOM snapshots"," offer a compelling alternative to GUI snapshots. DOM snapshots offer a handful of key advantages:",[611,612,613,616,619,622,625],"ol",{},[44,614,615],{},"DOM snapshots connect with LLM code (HTML) interpretation abilities.",[44,617,618],{},"DOM snapshots can be compiled from deep clones, hidden from supervision (unlike GUI grounding).",[44,620,621],{},"DOM snapshots render as text input, which on average consumes less bandwidth than screenshots.",[44,623,624],{},"DOM snapshots allow for exact programmatic targeting of elements (e.g., via CSS selectors).",[44,626,627,628,631,632,635],{},"DOM snapshots are available with the ",[171,629,630],{},"DOMContentLoaded"," event (whereas the GUI completes initial rendering with ",[171,633,634],{},"load",").",[11,637,638],{},"Yet, DOM snapshots have a major problem: they can exhaust the model context. Whereas GUI snapshots commonly cost four figures of tokens, a raw DOM snapshot can cost hundreds of thousands of tokens. To connect with LLM code interpretation abilities, developers have therefore used element extraction techniques – picking only (likely) important elements from the DOM. Element extraction flattens the DOM tree, which disregards hierarchy as a potential UI feature (how do elements relate to each other?).",[27,640,642],{"id":641},"dom-downsampling-a-novel-approach","DOM Downsampling: A Novel Approach",[11,644,645],{},"To be usable by web agents, DOM snapshots require client-side pre-processing – similar to how LLM vision APIs process image input. 
Downsampling is a fundamental signal processing technique that reduces data that would otherwise exceed time or space constraints, under the assumption that the majority of relevant features is retained. Picture JPEG compression as an example: put simply, a JPEG image stores only an average colour for patches of pixels. The bigger the patches, the smaller the file. Although some detail is lost, key image features – colours, edges, objects – remain recognisable – up to a large patch size.",[11,647,648,649,652,653,656],{},"We transfer the concept of ",[31,650,651],{},"downsampling"," to ",[31,654,655],{},"DOMs",". In particular, such an approach retains HTML characteristics that might be valuable for an LLM backend. We define UI features as concepts that, to a substantial degree, facilitate LLM suggestions on how to act in the UI in order to solve related web-based tasks.",[27,658,660],{"id":659},"d2snap",[570,661,662],{},"D2Snap",[11,664,665,666,674,682,690,691,693],{},"We recently proposed ",[80,667,670],{"href":668,"rel":669},"https://arxiv.org/abs/2508.04412",[84],[31,671,672],{},[570,673,662],{},[596,675,676],{},[80,677,681],{"href":678,"ariaDescribedBy":679,"dataFootnoteRef":453,"id":680},"#user-content-fn-2",[602],"user-content-fnref-2","2",[596,683,684],{},[80,685,689],{"href":686,"ariaDescribedBy":687,"dataFootnoteRef":453,"id":688},"#user-content-fn-3",[602],"user-content-fnref-3","3"," – a first-of-its-kind downsampling algorithm for DOMs. Herein, we'll briefly explain how the ",[570,692,662],{}," algorithm works, and how it can be utilised to build efficient and performant web agents.",[558,695,697],{"id":696},"how-it-works","How it works",[11,699,700,701,703,704,530,707,710,711,714,715,635],{},"There are essentially three relevant types of DOM nodes, corresponding to HTML concepts: elements, text, and attributes. We defined and empirically adjusted three node-specific procedures. 
",[570,702,662],{}," downsamples at a variable ratio, configured through procedure-specific parameters ",[171,705,706],{},"k",[171,708,709],{},"l",", and ",[171,712,713],{},"m"," (",[171,716,717],{},"∈ [0, 1]",[719,720,721],"blockquote",{},[11,722,723,724,729],{},"We used ",[80,725,728],{"href":726,"rel":727},"https://openai.com/index/hello-gpt-4o/",[84],"GPT-4o"," to create a downsampling ground truth dataset by having it classify HTML elements and score their semantics by relevance for understanding the underlying UI – a UI feature degree.",[731,732,734],"h4",{"id":733},"procedure-elements","Procedure: Elements",[11,736,737,739,740,345,743,746,747,749],{},[570,738,662],{}," downsamples (simplifies) elements by merging container elements like ",[171,741,742],{},"section",[171,744,745],{},"div"," together. A parameter ",[171,748,706],{}," controls the merge ratio depending on the total DOM tree height. For competing concepts, such as element name, the ground truth determines which element's characteristics to keep – comparing UI feature scores.",[11,751,752,753,530,755,757,758,763],{},"Elements in content elements (",[171,754,11],{},[171,756,719],{},", ...) are translated to a more concise ",[80,759,762],{"href":760,"rel":761},"https://www.markdownguide.org/basic-syntax/",[84],"Markdown"," representation.",[11,765,766],{},"Interactive elements – definite interaction target candidates – are kept as is.",[731,768,770],{"id":769},"procedure-text","Procedure: Text",[11,772,773,775,776,779,787,788,790],{},[570,774,662],{}," downsamples text by dropping a fraction of it. Natural units of text are space-separated words, or punctuation-separated sentences. We reuse the ",[570,777,778],{},"TextRank",[596,780,781],{},[80,782,786],{"href":783,"ariaDescribedBy":784,"dataFootnoteRef":453,"id":785},"#user-content-fn-4",[602],"user-content-fnref-4","4"," algorithm to rank sentences in text nodes. 
The lowest-ranking fraction of sentences, denoted by parameter ",[171,789,709],{},", is dropped.",[731,792,794],{"id":793},"procedure-attributes","Procedure: Attributes",[11,796,797,799,800,802],{},[570,798,662],{}," downsamples attributes by dropping those with a name that, according to ground truth, holds a UI feature degree below a threshold. Parameter ",[171,801,713],{}," denotes this threshold.",[719,804,805],{},[11,806,807,808,814],{},"Check out the ",[80,809,811,813],{"href":668,"rel":810},[84],[570,812,662],{}," paper"," to learn about the algorithm in-depth.",[558,816,818],{"id":817},"example-of-a-downsampled-dom","Example of a Downsampled DOM",[11,820,821],{},"Consider a partial DOM state, serialised as HTML:",[823,824,828],"pre",{"className":825,"code":826,"language":827,"meta":453,"style":453},"language-html shiki shiki-themes catppuccin-latte night-owl","\u003Csection class=\"container\" tabindex=\"3\" required=\"true\" type=\"example\">\n  \u003Cdiv class=\"mx-auto\" data-topic=\"products\" required=\"false\">\n    \u003Ch1>Our Pizza\u003C/h1>\n    \u003Cdiv>\n      \u003Cdiv class=\"shadow-lg\">\n        \u003Ch2>Margherita\u003C/h2>\n        \u003Cp>\n          A simple classic: mozzarela, tomatoes and basil.\n          An everyday choice!\n        \u003C/p>\n        \u003Cbutton type=\"button\">Add\u003C/button>\n      \u003C/div>\n      \u003Cdiv class=\"shadow-lg\">\n        \u003Ch2>Capricciosa\u003C/h2>\n        \u003Cp>\n          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.\n          A true favourite!\n          \u003C/p>\n        \u003Cbutton type=\"button\">Add\u003C/button>\n      \u003C/div>\n    \u003C/div>\n  
\u003C/div>\n\u003C/section>\n","html",[171,829,830,897,940,963,972,993,1012,1021,1027,1033,1043,1072,1082,1101,1119,1128,1134,1140,1150,1177,1186,1196,1206],{"__ignoreMap":453},[831,832,835,839,842,846,849,853,857,859,862,864,866,868,870,873,875,877,880,882,885,887,889,892,894],"span",{"class":833,"line":834},"line",1,[831,836,838],{"class":837},"s9rnR","\u003C",[831,840,742],{"class":841},"sY2RG",[831,843,845],{"class":844},"swkLt"," class",[831,847,848],{"class":837},"=",[831,850,852],{"class":851},"sbuKk","\"",[831,854,856],{"class":855},"sfrMT","container",[831,858,852],{"class":851},[831,860,861],{"class":844}," tabindex",[831,863,848],{"class":837},[831,865,852],{"class":851},[831,867,689],{"class":855},[831,869,852],{"class":851},[831,871,872],{"class":844}," required",[831,874,848],{"class":837},[831,876,852],{"class":851},[831,878,879],{"class":855},"true",[831,881,852],{"class":851},[831,883,884],{"class":844}," type",[831,886,848],{"class":837},[831,888,852],{"class":851},[831,890,891],{"class":855},"example",[831,893,852],{"class":851},[831,895,896],{"class":837},">\n",[831,898,899,902,904,906,908,910,913,915,918,920,922,925,927,929,931,933,936,938],{"class":833,"line":454},[831,900,901],{"class":837},"  \u003C",[831,903,745],{"class":841},[831,905,845],{"class":844},[831,907,848],{"class":837},[831,909,852],{"class":851},[831,911,912],{"class":855},"mx-auto",[831,914,852],{"class":851},[831,916,917],{"class":844}," data-topic",[831,919,848],{"class":837},[831,921,852],{"class":851},[831,923,924],{"class":855},"products",[831,926,852],{"class":851},[831,928,872],{"class":844},[831,930,848],{"class":837},[831,932,852],{"class":851},[831,934,935],{"class":855},"false",[831,937,852],{"class":851},[831,939,896],{"class":837},[831,941,943,946,949,952,956,959,961],{"class":833,"line":942},3,[831,944,945],{"class":837},"    \u003C",[831,947,948],{"class":841},"h1",[831,950,951],{"class":837},">",[831,953,955],{"class":954},"s2kId","Our 
Pizza",[831,957,958],{"class":837},"\u003C/",[831,960,948],{"class":841},[831,962,896],{"class":837},[831,964,966,968,970],{"class":833,"line":965},4,[831,967,945],{"class":837},[831,969,745],{"class":841},[831,971,896],{"class":837},[831,973,975,978,980,982,984,986,989,991],{"class":833,"line":974},5,[831,976,977],{"class":837},"      \u003C",[831,979,745],{"class":841},[831,981,845],{"class":844},[831,983,848],{"class":837},[831,985,852],{"class":851},[831,987,988],{"class":855},"shadow-lg",[831,990,852],{"class":851},[831,992,896],{"class":837},[831,994,996,999,1001,1003,1006,1008,1010],{"class":833,"line":995},6,[831,997,998],{"class":837},"        \u003C",[831,1000,27],{"class":841},[831,1002,951],{"class":837},[831,1004,1005],{"class":954},"Margherita",[831,1007,958],{"class":837},[831,1009,27],{"class":841},[831,1011,896],{"class":837},[831,1013,1015,1017,1019],{"class":833,"line":1014},7,[831,1016,998],{"class":837},[831,1018,11],{"class":841},[831,1020,896],{"class":837},[831,1022,1024],{"class":833,"line":1023},8,[831,1025,1026],{"class":954},"          A simple classic: mozzarela, tomatoes and basil.\n",[831,1028,1030],{"class":833,"line":1029},9,[831,1031,1032],{"class":954},"          An everyday choice!\n",[831,1034,1036,1039,1041],{"class":833,"line":1035},10,[831,1037,1038],{"class":837},"        \u003C/",[831,1040,11],{"class":841},[831,1042,896],{"class":837},[831,1044,1046,1048,1051,1053,1055,1057,1059,1061,1063,1066,1068,1070],{"class":833,"line":1045},11,[831,1047,998],{"class":837},[831,1049,1050],{"class":841},"button",[831,1052,884],{"class":844},[831,1054,848],{"class":837},[831,1056,852],{"class":851},[831,1058,1050],{"class":855},[831,1060,852],{"class":851},[831,1062,951],{"class":837},[831,1064,1065],{"class":954},"Add",[831,1067,958],{"class":837},[831,1069,1050],{"class":841},[831,1071,896],{"class":837},[831,1073,1075,1078,1080],{"class":833,"line":1074},12,[831,1076,1077],{"class":837},"      
\u003C/",[831,1079,745],{"class":841},[831,1081,896],{"class":837},[831,1083,1085,1087,1089,1091,1093,1095,1097,1099],{"class":833,"line":1084},13,[831,1086,977],{"class":837},[831,1088,745],{"class":841},[831,1090,845],{"class":844},[831,1092,848],{"class":837},[831,1094,852],{"class":851},[831,1096,988],{"class":855},[831,1098,852],{"class":851},[831,1100,896],{"class":837},[831,1102,1104,1106,1108,1110,1113,1115,1117],{"class":833,"line":1103},14,[831,1105,998],{"class":837},[831,1107,27],{"class":841},[831,1109,951],{"class":837},[831,1111,1112],{"class":954},"Capricciosa",[831,1114,958],{"class":837},[831,1116,27],{"class":841},[831,1118,896],{"class":837},[831,1120,1122,1124,1126],{"class":833,"line":1121},15,[831,1123,998],{"class":837},[831,1125,11],{"class":841},[831,1127,896],{"class":837},[831,1129,1131],{"class":833,"line":1130},16,[831,1132,1133],{"class":954},"          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.\n",[831,1135,1137],{"class":833,"line":1136},17,[831,1138,1139],{"class":954},"          A true favourite!\n",[831,1141,1143,1146,1148],{"class":833,"line":1142},18,[831,1144,1145],{"class":837},"          \u003C/",[831,1147,11],{"class":841},[831,1149,896],{"class":837},[831,1151,1153,1155,1157,1159,1161,1163,1165,1167,1169,1171,1173,1175],{"class":833,"line":1152},19,[831,1154,998],{"class":837},[831,1156,1050],{"class":841},[831,1158,884],{"class":844},[831,1160,848],{"class":837},[831,1162,852],{"class":851},[831,1164,1050],{"class":855},[831,1166,852],{"class":851},[831,1168,951],{"class":837},[831,1170,1065],{"class":954},[831,1172,958],{"class":837},[831,1174,1050],{"class":841},[831,1176,896],{"class":837},[831,1178,1180,1182,1184],{"class":833,"line":1179},20,[831,1181,1077],{"class":837},[831,1183,745],{"class":841},[831,1185,896],{"class":837},[831,1187,1189,1192,1194],{"class":833,"line":1188},21,[831,1190,1191],{"class":837},"    
\u003C/",[831,1193,745],{"class":841},[831,1195,896],{"class":837},[831,1197,1199,1202,1204],{"class":833,"line":1198},22,[831,1200,1201],{"class":837},"  \u003C/",[831,1203,745],{"class":841},[831,1205,896],{"class":837},[831,1207,1209,1211,1213],{"class":833,"line":1208},23,[831,1210,958],{"class":837},[831,1212,742],{"class":841},[831,1214,896],{"class":837},[11,1216,1217,1218,1220],{},"Here are some ",[570,1219,662],{}," downsampling results, which are based on different parametric configurations. A percentage denotes the reduced size.",[731,1222,1224,1227],{"id":1223},"k3-l3-m3-55",[171,1225,1226],{},"k=.3, l=.3, m=.3"," (55%)",[823,1229,1231],{"className":825,"code":1230,"language":827,"meta":453,"style":453},"\u003Csection tabindex=\"3\" type=\"example\" class=\"container\" required=\"true\">\n  # Our Pizza\n  \u003Cdiv class=\"shadow-lg\">\n    ## Margherita\n    A simple classic: mozzarela, tomatoes, and basil.\n    \u003Cbutton type=\"button\">Add\u003C/button>\n    ## Capricciosa\n    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.\n    \u003Cbutton type=\"button\">Add\u003C/button>\n  
\u003C/div>\n\u003C/section>\n",[171,1232,1233,1281,1286,1304,1309,1314,1340,1345,1350,1376,1384],{"__ignoreMap":453},[831,1234,1235,1237,1239,1241,1243,1245,1247,1249,1251,1253,1255,1257,1259,1261,1263,1265,1267,1269,1271,1273,1275,1277,1279],{"class":833,"line":834},[831,1236,838],{"class":837},[831,1238,742],{"class":841},[831,1240,861],{"class":844},[831,1242,848],{"class":837},[831,1244,852],{"class":851},[831,1246,689],{"class":855},[831,1248,852],{"class":851},[831,1250,884],{"class":844},[831,1252,848],{"class":837},[831,1254,852],{"class":851},[831,1256,891],{"class":855},[831,1258,852],{"class":851},[831,1260,845],{"class":844},[831,1262,848],{"class":837},[831,1264,852],{"class":851},[831,1266,856],{"class":855},[831,1268,852],{"class":851},[831,1270,872],{"class":844},[831,1272,848],{"class":837},[831,1274,852],{"class":851},[831,1276,879],{"class":855},[831,1278,852],{"class":851},[831,1280,896],{"class":837},[831,1282,1283],{"class":833,"line":454},[831,1284,1285],{"class":954},"  # Our Pizza\n",[831,1287,1288,1290,1292,1294,1296,1298,1300,1302],{"class":833,"line":942},[831,1289,901],{"class":837},[831,1291,745],{"class":841},[831,1293,845],{"class":844},[831,1295,848],{"class":837},[831,1297,852],{"class":851},[831,1299,988],{"class":855},[831,1301,852],{"class":851},[831,1303,896],{"class":837},[831,1305,1306],{"class":833,"line":965},[831,1307,1308],{"class":954},"    ## Margherita\n",[831,1310,1311],{"class":833,"line":974},[831,1312,1313],{"class":954},"    A simple classic: mozzarela, tomatoes, and 
basil.\n",[831,1315,1316,1318,1320,1322,1324,1326,1328,1330,1332,1334,1336,1338],{"class":833,"line":995},[831,1317,945],{"class":837},[831,1319,1050],{"class":841},[831,1321,884],{"class":844},[831,1323,848],{"class":837},[831,1325,852],{"class":851},[831,1327,1050],{"class":855},[831,1329,852],{"class":851},[831,1331,951],{"class":837},[831,1333,1065],{"class":954},[831,1335,958],{"class":837},[831,1337,1050],{"class":841},[831,1339,896],{"class":837},[831,1341,1342],{"class":833,"line":1014},[831,1343,1344],{"class":954},"    ## Capricciosa\n",[831,1346,1347],{"class":833,"line":1023},[831,1348,1349],{"class":954},"    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.\n",[831,1351,1352,1354,1356,1358,1360,1362,1364,1366,1368,1370,1372,1374],{"class":833,"line":1029},[831,1353,945],{"class":837},[831,1355,1050],{"class":841},[831,1357,884],{"class":844},[831,1359,848],{"class":837},[831,1361,852],{"class":851},[831,1363,1050],{"class":855},[831,1365,852],{"class":851},[831,1367,951],{"class":837},[831,1369,1065],{"class":954},[831,1371,958],{"class":837},[831,1373,1050],{"class":841},[831,1375,896],{"class":837},[831,1377,1378,1380,1382],{"class":833,"line":1035},[831,1379,1201],{"class":837},[831,1381,745],{"class":841},[831,1383,896],{"class":837},[831,1385,1386,1388,1390],{"class":833,"line":1045},[831,1387,958],{"class":837},[831,1389,742],{"class":841},[831,1391,896],{"class":837},[731,1393,1395,1398],{"id":1394},"k4-l6-m8-27",[171,1396,1397],{},"k=.4, l=.6, m=.8"," (27%)",[823,1400,1402],{"className":825,"code":1401,"language":827,"meta":453,"style":453},"\u003Csection>\n  # Our Pizza\n  \u003Cdiv>\n    ## Margherita\n    A simple classic:\n    \u003Cbutton>Add\u003C/button>\n    ## Capricciosa\n    A rich taste:\n    \u003Cbutton>Add\u003C/button>\n  
\u003C/div>\n\u003C/section>\n",[171,1403,1404,1412,1416,1424,1428,1433,1449,1453,1458,1474,1482],{"__ignoreMap":453},[831,1405,1406,1408,1410],{"class":833,"line":834},[831,1407,838],{"class":837},[831,1409,742],{"class":841},[831,1411,896],{"class":837},[831,1413,1414],{"class":833,"line":454},[831,1415,1285],{"class":954},[831,1417,1418,1420,1422],{"class":833,"line":942},[831,1419,901],{"class":837},[831,1421,745],{"class":841},[831,1423,896],{"class":837},[831,1425,1426],{"class":833,"line":965},[831,1427,1308],{"class":954},[831,1429,1430],{"class":833,"line":974},[831,1431,1432],{"class":954},"    A simple classic:\n",[831,1434,1435,1437,1439,1441,1443,1445,1447],{"class":833,"line":995},[831,1436,945],{"class":837},[831,1438,1050],{"class":841},[831,1440,951],{"class":837},[831,1442,1065],{"class":954},[831,1444,958],{"class":837},[831,1446,1050],{"class":841},[831,1448,896],{"class":837},[831,1450,1451],{"class":833,"line":1014},[831,1452,1344],{"class":954},[831,1454,1455],{"class":833,"line":1023},[831,1456,1457],{"class":954},"    A rich taste:\n",[831,1459,1460,1462,1464,1466,1468,1470,1472],{"class":833,"line":1029},[831,1461,945],{"class":837},[831,1463,1050],{"class":841},[831,1465,951],{"class":837},[831,1467,1065],{"class":954},[831,1469,958],{"class":837},[831,1471,1050],{"class":841},[831,1473,896],{"class":837},[831,1475,1476,1478,1480],{"class":833,"line":1035},[831,1477,1201],{"class":837},[831,1479,745],{"class":841},[831,1481,896],{"class":837},[831,1483,1484,1486,1488],{"class":833,"line":1045},[831,1485,958],{"class":837},[831,1487,742],{"class":841},[831,1489,896],{"class":837},[731,1491,1493,1496],{"id":1492},"k-l0-m-35",[171,1494,1495],{},"k→∞, l=0, ∀m"," (35%)",[823,1498,1500],{"className":825,"code":1499,"language":827,"meta":453,"style":453},"# Our Pizza\n## Margherita\nA simple classic: mozzarela, tomatoes, and basil.\nAn everyday choice!\n\u003Cbutton>Add\u003C/button>\n## Capricciosa\nA rich taste: mozzarella, ham, mushrooms, 
artichokes, and olives.\nA true favourite!\n\u003Cbutton>Add\u003C/button>\n",[171,1501,1502,1507,1512,1517,1522,1538,1543,1548,1553],{"__ignoreMap":453},[831,1503,1504],{"class":833,"line":834},[831,1505,1506],{"class":954},"# Our Pizza\n",[831,1508,1509],{"class":833,"line":454},[831,1510,1511],{"class":954},"## Margherita\n",[831,1513,1514],{"class":833,"line":942},[831,1515,1516],{"class":954},"A simple classic: mozzarela, tomatoes, and basil.\n",[831,1518,1519],{"class":833,"line":965},[831,1520,1521],{"class":954},"An everyday choice!\n",[831,1523,1524,1526,1528,1530,1532,1534,1536],{"class":833,"line":974},[831,1525,838],{"class":837},[831,1527,1050],{"class":841},[831,1529,951],{"class":837},[831,1531,1065],{"class":954},[831,1533,958],{"class":837},[831,1535,1050],{"class":841},[831,1537,896],{"class":837},[831,1539,1540],{"class":833,"line":995},[831,1541,1542],{"class":954},"## Capricciosa\n",[831,1544,1545],{"class":833,"line":1014},[831,1546,1547],{"class":954},"A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.\n",[831,1549,1550],{"class":833,"line":1023},[831,1551,1552],{"class":954},"A true favourite!\n",[831,1554,1555,1557,1559,1561,1563,1565,1567],{"class":833,"line":1029},[831,1556,838],{"class":837},[831,1558,1050],{"class":841},[831,1560,951],{"class":837},[831,1562,1065],{"class":954},[831,1564,958],{"class":837},[831,1566,1050],{"class":841},[831,1568,896],{"class":837},[11,1570,1571,1572,1574,1575,1577],{},"Asymptotic ",[171,1573,706],{}," (effectively 'infinite' ",[171,1576,706],{},") completely flattens the DOM, that is, leads to a full content linearisation similar to the reader views present in most browsers. Notably, it preserves all interactive elements like buttons – which are essential for a web agent.",[558,1579,1581],{"id":1580},"adaptived2snap",[570,1582,1583],{},"AdaptiveD2Snap",[11,1585,1586,1587,1589,1590,1592],{},"Fixed parameters might not be ideal for arbitrary DOMs – sourced from the whole landscape of web applications. 
We created ",[570,1588,1583],{}," – a wrapper for ",[570,1591,662],{}," that infers suitable parameters from a given DOM in order to hit a certain token budget.",[558,1594,1596],{"id":1595},"implementation-integration","Implementation & Integration",[11,1598,1599,1600,1602],{},"Picture an LLM-based web agent that is premised on DOM snapshots. Implementing ",[570,1601,662],{}," is simple: deep clone the DOM, and feed it to the algorithm. Then take the snapshot; that is, serialise the resulting DOM. Done.",[719,1604,1605],{},[11,1606,1607,1608,1612],{},"Read our ",[80,1609,1611],{"href":1610},"/blog/a-gentle-introduction-to-ai-agents-for-the-web","gentle introduction to AI agents for the web"," to get started with high-level web agent concepts.",[11,1614,1615,1616,1618,1619,1624],{},"The open source ",[570,1617,662],{}," API, available as a ",[80,1620,1623],{"href":1621,"rel":1622},"https://github.com/webfuse-com/D2Snap",[84],"package on GitHub",", provides the following signature:",[823,1626,1630],{"className":1627,"code":1628,"language":1629,"meta":453,"style":453},"type DOM = Document | Element | string;\ntype Options = {\n  assignUniqueIDs?: boolean; // false\n  debug?: boolean;           // true\n};\n\nD2Snap.d2Snap(\n  dom: DOM,\n  k: number, l: number, m: number,\n  options?: Options\n): Promise\u003Cstring>\n\nD2Snap.adaptiveD2Snap(\n  dom: DOM,\n  maxTokens: number = 4096,\n  maxIterations: number = 5,\n  options?: Options\n): Promise\u003Cstring>\n\n","ts",[171,1631,1632,1665,1677,1696,1710,1715,1720,1735,1747,1765,1775,1791,1795,1806,1814,1827,1839,1847],{"__ignoreMap":453},[831,1633,1634,1638,1642,1645,1649,1652,1655,1657,1661],{"class":833,"line":834},[831,1635,1637],{"class":1636},"s76yb","type",[831,1639,1641],{"class":1640},"sXbZB"," DOM ",[831,1643,848],{"class":1644},"s-_ek",[831,1646,1648],{"class":1647},"s-DR7"," Document",[831,1650,1651],{"class":837}," |",[831,1653,1654],{"class":1647}," 
Element",[831,1656,1651],{"class":837},[831,1658,1660],{"class":1659},"scrte"," string",[831,1662,1664],{"class":1663},"scGhl",";\n",[831,1666,1667,1669,1672,1674],{"class":833,"line":454},[831,1668,1637],{"class":1636},[831,1670,1671],{"class":1640}," Options ",[831,1673,848],{"class":1644},[831,1675,1676],{"class":1663}," {\n",[831,1678,1679,1683,1686,1689,1692],{"class":833,"line":942},[831,1680,1682],{"class":1681},"swl0y","  assignUniqueIDs",[831,1684,1685],{"class":837},"?:",[831,1687,1688],{"class":1659}," boolean",[831,1690,1691],{"class":1663},";",[831,1693,1695],{"class":1694},"sDmS1"," // false\n",[831,1697,1698,1701,1703,1705,1707],{"class":833,"line":965},[831,1699,1700],{"class":1681},"  debug",[831,1702,1685],{"class":837},[831,1704,1688],{"class":1659},[831,1706,1691],{"class":1663},[831,1708,1709],{"class":1694},"           // true\n",[831,1711,1712],{"class":833,"line":974},[831,1713,1714],{"class":1663},"};\n",[831,1716,1717],{"class":833,"line":995},[831,1718,1719],{"emptyLinePlaceholder":498},"\n",[831,1721,1722,1724,1728,1732],{"class":833,"line":1014},[831,1723,662],{"class":954},[831,1725,1727],{"class":1726},"s5FwJ",".",[831,1729,1731],{"class":1730},"sNstc","d2Snap",[831,1733,1734],{"class":954},"(\n",[831,1736,1737,1740,1744],{"class":833,"line":1023},[831,1738,1739],{"class":954},"  dom: ",[831,1741,1743],{"class":1742},"sqxXB","DOM",[831,1745,1746],{"class":1663},",\n",[831,1748,1749,1752,1755,1758,1760,1763],{"class":833,"line":1029},[831,1750,1751],{"class":954},"  k: number",[831,1753,1754],{"class":1663},",",[831,1756,1757],{"class":954}," l: number",[831,1759,1754],{"class":1663},[831,1761,1762],{"class":954}," m: number",[831,1764,1746],{"class":1663},[831,1766,1767,1770,1772],{"class":833,"line":1035},[831,1768,1769],{"class":954},"  options",[831,1771,1685],{"class":1644},[831,1773,1774],{"class":954}," Options\n",[831,1776,1777,1780,1784,1786,1789],{"class":833,"line":1045},[831,1778,1779],{"class":954},"): 
",[831,1781,1783],{"class":1782},"s8Irk","Promise",[831,1785,838],{"class":1644},[831,1787,1788],{"class":954},"string",[831,1790,896],{"class":1644},[831,1792,1793],{"class":833,"line":1074},[831,1794,1719],{"emptyLinePlaceholder":498},[831,1796,1797,1799,1801,1804],{"class":833,"line":1084},[831,1798,662],{"class":954},[831,1800,1727],{"class":1726},[831,1802,1803],{"class":1730},"adaptiveD2Snap",[831,1805,1734],{"class":954},[831,1807,1808,1810,1812],{"class":833,"line":1103},[831,1809,1739],{"class":954},[831,1811,1743],{"class":1742},[831,1813,1746],{"class":1663},[831,1815,1816,1819,1821,1825],{"class":833,"line":1121},[831,1817,1818],{"class":954},"  maxTokens: number ",[831,1820,848],{"class":1644},[831,1822,1824],{"class":1823},"sZ_Zo"," 4096",[831,1826,1746],{"class":1663},[831,1828,1829,1832,1834,1837],{"class":833,"line":1130},[831,1830,1831],{"class":954},"  maxIterations: number ",[831,1833,848],{"class":1644},[831,1835,1836],{"class":1823}," 5",[831,1838,1746],{"class":1663},[831,1840,1841,1843,1845],{"class":833,"line":1136},[831,1842,1769],{"class":954},[831,1844,1685],{"class":1644},[831,1846,1774],{"class":954},[831,1848,1849,1851,1853,1855,1857],{"class":833,"line":1142},[831,1850,1779],{"class":954},[831,1852,1783],{"class":1782},[831,1854,838],{"class":1644},[831,1856,1788],{"class":954},[831,1858,896],{"class":1644},[11,1860,1861,1862,1864,1865,1870,1871,1876],{},"Moreover, ",[570,1863,662],{}," is available on the ",[80,1866,1869],{"href":1867,"rel":1868},"https://dev.webfuse.com/automation-api",[84],"Webfuse Automation API",". 
",[80,1872,1875],{"href":1873,"rel":1874},"https://www.webfuse.com",[84],"Webfuse"," essentially is a proxy to seamlessly serve any existing web application with custom augmentations, such as a web agent widget.",[823,1878,1882],{"className":1879,"code":1880,"language":1881,"meta":453,"style":453},"language-js shiki shiki-themes catppuccin-latte night-owl","const domSnapshot = await browser.webfuseSession\n    .automation\n    .take_dom_snapshot({ modifier: 'downsample' })\n","js",[171,1883,1884,1889,1894],{"__ignoreMap":453},[831,1885,1886],{"class":833,"line":834},[831,1887,1888],{},"const domSnapshot = await browser.webfuseSession\n",[831,1890,1891],{"class":833,"line":454},[831,1892,1893],{},"    .automation\n",[831,1895,1896],{"class":833,"line":942},[831,1897,1898],{},"    .take_dom_snapshot({ modifier: 'downsample' })\n",[11,1900,1901,1902,1904],{},"Need precise control over the underlying ",[570,1903,662],{}," invocation? Configure it exactly how you want:",[823,1906,1908],{"className":1879,"code":1907,"language":1881,"meta":453,"style":453},"const domSnapshot = await browser.webfuseSession\n    .automation\n    .take_dom_snapshot({\n        modifier: {\n            name: 'D2Snap',\n            params: { hierarchyRatio: 0.6, textRatio: 0.2, attributeRatio: 0.8 }\n        }\n    })\n",[171,1909,1910,1914,1918,1923,1928,1933,1938,1943],{"__ignoreMap":453},[831,1911,1912],{"class":833,"line":834},[831,1913,1888],{},[831,1915,1916],{"class":833,"line":454},[831,1917,1893],{},[831,1919,1920],{"class":833,"line":942},[831,1921,1922],{},"    .take_dom_snapshot({\n",[831,1924,1925],{"class":833,"line":965},[831,1926,1927],{},"        modifier: {\n",[831,1929,1930],{"class":833,"line":974},[831,1931,1932],{},"            name: 'D2Snap',\n",[831,1934,1935],{"class":833,"line":995},[831,1936,1937],{},"            params: { hierarchyRatio: 0.6, textRatio: 0.2, attributeRatio: 0.8 }\n",[831,1939,1940],{"class":833,"line":1014},[831,1941,1942],{},"        
}\n",[831,1944,1945],{"class":833,"line":1023},[831,1946,1947],{},"    })\n",[558,1949,1951],{"id":1950},"performance-evaluation","Performance Evaluation",[11,1953,1954,1955,1957,1958,1960,1961,1963],{},"Now for the moment of truth: How does ",[570,1956,662],{}," stack up against the industry standard? We evaluated ",[570,1959,662],{}," in comparison to a grounded GUI snapshot baseline close to those used by ",[570,1962,540],{}," – coloured bounding boxes around visible interactive elements.",[11,1965,1966,1967,1972],{},"To evaluate snapshots isolated from specific agent logic, we crafted a dataset that spans all UI states that occur while solving a related task. We sampled our dataset from the existing ",[80,1968,1971],{"href":1969,"rel":1970},"https://github.com/OSU-NLP-Group/Online-Mind2Web",[84],"Online-Mind2Web"," dataset.",[18,1974],{":width":21,"alt":1975,"format":521,"loading":23,"src":1976},"Exemplary solution UI state trajectory of a defined web-based task","/blog/dom-downsampling-for-web-agents/3.png",[11,1978,1979],{},[585,1980,1981],{},"Exemplary solution UI state trajectory for the task: “View the pricing plan for 'Business'. Specifically, we have 100 users. We need a 1PB storage quota and a 50 TB transfer quota.”",[11,1983,1984],{},"These are our key findings...",[731,1986,1988],{"id":1987},"substantial-success-rates","Substantial Success Rates",[11,1990,1991,1992,1994],{},"The results exceeded our expectations. Not only did ",[570,1993,662],{}," meet the baseline's performance – our best configuration outperformed it by a significant margin. 
Full linearisation matches both the performance and the estimated model input token size order of the baseline.",[18,1996],{":width":1997,"alt":1998,"format":521,"loading":23,"src":1999},"550","Success rate per web agent snapshot subject evaluated across the dataset","/blog/dom-downsampling-for-web-agents/4.png",[585,2001,2002,2003,2010,2011,2013,2014,2017,2018,2021,2022,2025,2026,2029,2030,2033,2034,2037],{},"\n  Success rate per web agent snapshot subject evaluated across the dataset.\n  Labels: ",[171,2004,2005,2006],{},"GUI",[2007,2008,2009],"sub",{}," gr.",": Baseline, ",[171,2012,1743],{},": Raw DOM (cut-off at ~8K tokens), ",[171,2015,2016],{},"k (l m)",": Parameter values (e.g., ",[171,2019,2020],{},".9 .3 .6",", or ",[171,2023,2024],{},".4"," if equal). ",[171,2027,2028],{},"∞",": Linearisation, ",[171,2031,2032],{},"8192 / 32768",": via the respectively token-limited ",[2035,2036,1583],"i",{},".\n",[731,2039,2041],{"id":2040},"containable-token-and-byte-size","Containable Token and Byte Size",[11,2043,2044,2045,2047],{},"Even light downsampling delivers dramatic size reductions. Most ",[570,2046,662],{}," configurations average just one token order above the baseline – a massive improvement over raw DOM snapshots. Better yet, most DOMs from the dataset could actually be downsampled to the baseline order. 
And while image data balloons in file size, our text-based approach stays lean and efficient.",[18,2049],{":width":21,"alt":2050,"format":521,"loading":23,"src":2051},"Comparison of mean input size across and per subject","/blog/dom-downsampling-for-web-agents/5.png",[585,2053,2054,2055,2058,2059,2061],{},"\n  Left: Comparison of mean input size (tokens vs bytes) across and per subject.",[2056,2057],"br",{},"\n  Right: Estimated input token size across the dataset created by a single ",[2035,2060,662],{}," evaluation subject.\n",[731,2063,2065],{"id":2064},"hierarchy-actually-matters","Hierarchy Actually Matters",[11,2067,2068],{},"Which UI feature matters most for LLM web agent backend performance? We varied parameter configurations to find out. Interestingly, hierarchy emerged as the strongest of the three assessed features. Element extraction throws away hierarchy, which suggests that downsampling is a superior technique.",[742,2070,2073,2078],{"className":2071,"dataFootnotes":453},[2072],"footnotes",[27,2074,2077],{"className":2075,"id":602},[2076],"sr-only","Footnotes",[611,2079,2080,2094,2105,2116],{},[44,2081,2083,418,2087],{"id":2082},"user-content-fn-1",[80,2084,2085],{"href":2085,"rel":2086},"https://arxiv.org/abs/2210.03945",[84],[80,2088,2093],{"href":2089,"ariaLabel":2090,"className":2091,"dataFootnoteBackref":453},"#user-content-fnref-1","Back to reference 1",[2092],"data-footnote-backref","↩",[44,2095,2097,418,2100],{"id":2096},"user-content-fn-2",[80,2098,668],{"href":668,"rel":2099},[84],[80,2101,2093],{"href":2102,"ariaLabel":2103,"className":2104,"dataFootnoteBackref":453},"#user-content-fnref-2","Back to reference 2",[2092],[44,2106,2108,418,2111],{"id":2107},"user-content-fn-3",[80,2109,1621],{"href":1621,"rel":2110},[84],[80,2112,2093],{"href":2113,"ariaLabel":2114,"className":2115,"dataFootnoteBackref":453},"#user-content-fnref-3","Back to reference 
3",[2092],[44,2117,2119,418,2123],{"id":2118},"user-content-fn-4",[80,2120,2121],{"href":2121,"rel":2122},"https://aclanthology.org/W04-3252",[84],[80,2124,2093],{"href":2125,"ariaLabel":2126,"className":2127,"dataFootnoteBackref":453},"#user-content-fnref-4","Back to reference 4",[2092],[2129,2130,2131],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .s9rnR, html code.shiki .s9rnR{--shiki-default:#179299;--shiki-dark:#7FDBCA}html pre.shiki code .sY2RG, html code.shiki .sY2RG{--shiki-default:#1E66F5;--shiki-dark:#CAECE6}html pre.shiki code .swkLt, html code.shiki .swkLt{--shiki-default:#DF8E1D;--shiki-default-font-style:inherit;--shiki-dark:#C5E478;--shiki-dark-font-style:italic}html pre.shiki code .sbuKk, html code.shiki .sbuKk{--shiki-default:#40A02B;--shiki-dark:#D9F5DD}html pre.shiki code .sfrMT, html code.shiki .sfrMT{--shiki-default:#40A02B;--shiki-dark:#ECC48D}html pre.shiki code .s2kId, html code.shiki .s2kId{--shiki-default:#4C4F69;--shiki-dark:#D6DEEB}html pre.shiki code .s76yb, html code.shiki .s76yb{--shiki-default:#8839EF;--shiki-dark:#C792EA}html pre.shiki code .sXbZB, html code.shiki 
.sXbZB{--shiki-default:#DF8E1D;--shiki-default-font-style:italic;--shiki-dark:#D6DEEB;--shiki-dark-font-style:inherit}html pre.shiki code .s-_ek, html code.shiki .s-_ek{--shiki-default:#179299;--shiki-dark:#C792EA}html pre.shiki code .s-DR7, html code.shiki .s-DR7{--shiki-default:#DF8E1D;--shiki-default-font-style:italic;--shiki-dark:#FFCB8B;--shiki-dark-font-style:inherit}html pre.shiki code .scrte, html code.shiki .scrte{--shiki-default:#8839EF;--shiki-dark:#C5E478}html pre.shiki code .scGhl, html code.shiki .scGhl{--shiki-default:#7C7F93;--shiki-dark:#D6DEEB}html pre.shiki code .swl0y, html code.shiki .swl0y{--shiki-default:#4C4F69;--shiki-default-font-style:italic;--shiki-dark:#D6DEEB;--shiki-dark-font-style:inherit}html pre.shiki code .sDmS1, html code.shiki .sDmS1{--shiki-default:#7C7F93;--shiki-default-font-style:italic;--shiki-dark:#637777;--shiki-dark-font-style:italic}html pre.shiki code .s5FwJ, html code.shiki .s5FwJ{--shiki-default:#179299;--shiki-default-font-style:inherit;--shiki-dark:#C792EA;--shiki-dark-font-style:italic}html pre.shiki code .sNstc, html code.shiki .sNstc{--shiki-default:#1E66F5;--shiki-default-font-style:italic;--shiki-dark:#82AAFF;--shiki-dark-font-style:italic}html pre.shiki code .sqxXB, html code.shiki .sqxXB{--shiki-default:#4C4F69;--shiki-dark:#82AAFF}html pre.shiki code .s8Irk, html code.shiki .s8Irk{--shiki-default:#DF8E1D;--shiki-default-font-style:italic;--shiki-dark:#C5E478;--shiki-dark-font-style:inherit}html pre.shiki code .sZ_Zo, html code.shiki 
.sZ_Zo{--shiki-default:#FE640B;--shiki-dark:#F78C6C}",{"title":453,"searchDepth":454,"depth":454,"links":2133},[2134,2138,2139,2146],{"id":548,"depth":454,"text":549,"children":2135},[2136,2137],{"id":560,"depth":942,"text":561},{"id":590,"depth":942,"text":591},{"id":641,"depth":454,"text":642},{"id":659,"depth":454,"text":662,"children":2140},[2141,2142,2143,2144,2145],{"id":696,"depth":942,"text":697},{"id":817,"depth":942,"text":818},{"id":1580,"depth":942,"text":1583},{"id":1595,"depth":942,"text":1596},{"id":1950,"depth":942,"text":1951},{"id":602,"depth":454,"text":2077},"2025-08-18","We propose D2Snap – a first-of-its-kind downsampling algorithm for DOMs. D2Snap can be used as a pre-processing technique for DOM snapshots to optimise web agency context quality and token costs.",{"homepage":498,"relatedLinks":2150},[2151,2155,2158],{"text":2152,"href":2153,"description":2154},"What is a Website Snapshot?","/blog/snapshots-provide-llms-with-website-state","Learn what a website snapshot is and how to utilise it for web agents",{"text":2156,"href":1610,"description":2157},"What is a Web Agent?","Learn the basics of web agents",{"text":1869,"href":2159,"external":498,"description":2160},"https://dev.webfuse.com/automation-api#take_dom_snapshot","Check out the Webfuse Automation API","/blog/dom-downsampling-for-llm-based-web-agents",{"title":513,"description":2148},{"loc":2161},"blog/1012.dom-downsampling-for-llm-based-web-agents",[505,2166,2167,2168,2169,2170],"browser-agents","llms","llm-context","web-agents","web-automation","bGJtg_9k7O95O2CJswaRFj4ONGhX4hGr_8aL5dhDZms",{"id":2173,"title":2174,"authorId":514,"body":2175,"category":505,"created":2904,"description":2905,"extension":464,"faqs":482,"featurePriority":454,"head":482,"landingPath":482,"meta":2906,"navigation":498,"ogImage":482,"path":1610,"robots":482,"schemaOrg":482,"seo":2915,"sitemap":2916,"stem":2917,"tags":2918,"__hash__":2919},"blog/blog/1011.a-gentle-introduction-to-ai-agents-for-the-web.md","A 
Gentle Introduction to AI Agents for the Web",{"type":8,"value":2176,"toc":2885},[2177,2191,2194,2201,2207,2211,2214,2229,2233,2243,2247,2251,2264,2268,2272,2275,2280,2284,2293,2297,2308,2313,2317,2335,2339,2345,2449,2452,2685,2701,2705,2708,2713,2717,2720,2724,2741,2766,2773,2777,2815,2818,2829,2833,2836,2864,2868,2876,2882],[11,2178,2179,2180,530,2184,710,2187,2190],{},"In no time, AI became a natural part of modern web interfaces. AI agents for the web are enjoying recent hype, sparked by the likes of ",[80,2181,529],{"href":2182,"rel":2183},"https://openai.com/index/introducing-operator/",[84],[80,2185,535],{"href":533,"rel":2186},[84],[80,2188,540],{"href":538,"rel":2189},[84],". By now, it is within reach to automate arbitrary web-based tasks, such as booking the cheapest flight from Berlin to Amsterdam.",[27,2192,2156],{"id":2193},"what-is-a-web-agent",[11,2195,2196,2197,2200],{},"For starters, let us break down the term ",[31,2198,2199],{},"web AI agent",": An agent is an entity that autonomously acts on behalf of another entity. An artificially intelligent agent is an application that acts on behalf of a human. In contrast to non-AI computer agents, it solves complex tasks with at least human-grade effectiveness and efficiency. For a human-centric web, web agents have deliberately been designed to browse the web in a human fashion – through UIs rather than APIs.",[18,2202],{":width":2203,"alt":2204,"format":2205,"loading":23,"src":2206},"610","High-level agent description comparing human and computer agents","svg","/blog/a-gentle-introduction-to-ai-agents-for-the-web/1.svg",[558,2208,2210],{"id":2209},"the-role-of-frontier-llms","The Role of Frontier LLMs",[11,2212,2213],{},"Web agents have been a vague desire for a long time. AI agents used to rely on complete models of a problem domain in order to allow (heuristic) search through problem states. 
Such models would comprise the problem world (e.g., a chessboard), actors (pawns, rooks, etc.), possible actions per actor (rook moves straight), and constraints (i.a., max one piece per square). A heterogeneous space of web application UIs describes the problem domain of a web agent: how to understand a web page, and how to interact with it to solve the declared task?",[11,2215,2216,2217,2224,2225,2228],{},"Frontier LLMs disrupted the AI agent world: explicit problem domain models that were beyond feasibility can now be replaced by an LLM. The LLM thereby acts as an instantaneous domain model backend that can be consulted with twofold context: serialised problem state, such as a chess position code (",[570,2218,2219,2220,2223],{},"“",[831,2221,2222],{},"..."," e4 e5 2. Nc3 f5”","), and the respective task (",[570,2226,2227],{},"“What is the best move for White?”","). For web agents, problem state corresponds to the currently browsed web application's runtime state, for instance, a screenshot.",[558,2230,2232],{"id":2231},"generalist-web-agents","Generalist Web Agents",[11,2234,2235,2236,710,2239,2242],{},"Generalist web agents are supposed to solve arbitrary tasks through a web browser. Web-based tasks can be as diverse as ",[570,2237,2238],{},"“Find a picture of a cat.”",[570,2240,2241],{},"“Book the cheapest flight from Berlin to Amsterdam tomorrow afternoon (business class, window seat).”"," In reality, generalist agents still fail at uncommon or overly precise tasks. While they have been critically acclaimed, they mainly act as early proofs-of-concept. 
Tasks that are indeed solvable with a generalist agent promise great results with a corresponding specialist agent.",[18,2244],{":width":519,"alt":2245,"format":521,"loading":23,"src":2246},"Screenshot of a generalist web agent UI (Director)","/blog/a-gentle-introduction-to-ai-agents-for-the-web/2.png",[558,2248,2250],{"id":2249},"specialist-web-agents","Specialist Web Agents",[11,2252,2253,2254,2257,2258,2263],{},"Unlike generalist agents, specialist web agents are constrained to a certain task and application domain. Specialist agents bear the major share of commercial value. Most prominent are modal chat agents that provide users with on-page help. Picture a little floating widget that can be chatted to via text or voice input. In most cases, in fact, the term ",[570,2255,2256],{},"web (AI) agent"," refers to chat agents. Chat agents – text or voice – can be implemented on top of virtually any existing website. Frontier LLMs provide a lot of common sense out of the box. A ",[80,2259,2262],{"href":2260,"rel":2261},"https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/system-prompts",[84],"system prompt"," can, moreover, be leveraged to drive specialist agent quality for the respective problem domain.",[18,2265],{":width":519,"alt":2266,"format":521,"loading":23,"src":2267},"Screenshots of two modal specialist web agent UIs augmenting an underlying website's UI","/blog/a-gentle-introduction-to-ai-agents-for-the-web/3.png",[27,2269,2271],{"id":2270},"how-does-a-web-agent-work","How Does a Web Agent Work?",[11,2273,2274],{},"LLM-based web agents are premised on a more or less uniform architecture. 
The agent application acts as a mediator between a web browser (environment) and the LLM backend (model).",[18,2276],{":width":2277,"alt":2278,"format":2205,"loading":23,"src":2279},"480","High-level web agent architecture component view","/blog/a-gentle-introduction-to-ai-agents-for-the-web/4.svg",[558,2281,2283],{"id":2282},"the-agent-lifecycle","The Agent Lifecycle",[11,2285,2286,2287,2292],{},"To reduce a user's cognitive load, solving a web-based task is usually chunked into a sequence of UI states. Consider looking for rental apartments on ",[80,2288,2291],{"href":2289,"rel":2290},"https://www.redfin.com",[84],"redfin.com",": In the first step, you specify a location. Only subsequently are you provided with a grid of available apartments for that location.",[18,2294],{":width":519,"alt":2295,"format":521,"loading":23,"src":2296},"Example of separated UI states in a rental home search application","/blog/a-gentle-introduction-to-ai-agents-for-the-web/5.png",[11,2298,2299,2300,2307],{},"Web agent logic is iterative, not least due to the sequential web interaction model, but also due to the conversational agent interaction model. Browsing the web, human and computer agents represent users alike. That said, Norman's well-known ",[80,2301,2304],{"href":2302,"rel":2303},"https://mitpress.mit.edu/9780262640374/the-design-of-everyday-things/",[84],[570,2305,2306],{},"Seven Stages of Action",", which hierarchically model the human cognition cycle, transfer to the web agent lifecycle. For each UI state in a web browser (environment) and web-based task (action intention): decide where to click, type, etc. (action planning), and perform those clicks, etc. (action execution). Afterwards, perceive, interpret, and evaluate the results of those actions in the web browser (state). As long as there is a mismatch between the evaluated state and the declared goal state, repeat that cycle. 
If needed, prompt the user for additional information.",[18,2309],{":width":2310,"alt":2311,"format":2205,"loading":23,"src":2312},"580","Donald Norman's 'Seven Stages of Action' model of the human cognition cycle that transfers to non-human agents","/blog/a-gentle-introduction-to-ai-agents-for-the-web/6.svg",[558,2314,2316],{"id":2315},"web-context-for-llms","Web Context for LLMs",[11,2318,2319,2320,2322,2323,2326,2327,2330,2331,2334],{},"The gap between an agent and the environment, according to ",[570,2321,2306],{},", is known as the ",[570,2324,2325],{},"gulf of execution",". In real-world scenarios, how to act in the environment with respect to a planned sequence of actions might be difficult (e.g., how to actually open the trunk of a new car?). Arguably, web agents face a novel ",[570,2328,2329],{},"gulf of intention"," towards the action planning stage: how to serialise a currently browsed web page's runtime state for LLMs? ",[570,2332,2333],{},"Snapshot"," is a more comprehensive term to describe the serialisation of a web page's current runtime state. Screenshots, for instance, represent a type of snapshot that closely resembles how humans perceive a web page at a given point in time. But are they as accessible to LLMs?
Below is what an actuation schema presented to the LLM backend could look like:",[823,2346,2348],{"className":1627,"code":2347,"language":1629,"meta":453,"style":453},"type ActuationSchema = {\n    thought: string;\n    action: \"click\"\n        | \"scroll\"\n        | \"type\";\n    cssSelector: string;\n    data?: string;\n}[];\n",[171,2349,2350,2364,2376,2393,2405,2417,2428,2439],{"__ignoreMap":453},[831,2351,2352,2355,2358,2361],{"class":833,"line":834},[831,2353,2354],{"class":1636},"type",[831,2356,2357],{"class":1640}," ActuationSchema",[831,2359,2360],{"class":954}," = ",[831,2362,2363],{"class":1663},"{\n",[831,2365,2366,2369,2372,2374],{"class":833,"line":454},[831,2367,2368],{"class":954},"    thought",[831,2370,2371],{"class":837},":",[831,2373,1660],{"class":1659},[831,2375,1664],{"class":1663},[831,2377,2378,2381,2383,2386,2390],{"class":833,"line":942},[831,2379,2380],{"class":954},"    action",[831,2382,2371],{"class":837},[831,2384,2385],{"class":851}," \"",[831,2387,2389],{"class":2388},"sgAC-","click",[831,2391,2392],{"class":851},"\"\n",[831,2394,2395,2398,2400,2403],{"class":833,"line":965},[831,2396,2397],{"class":837},"        |",[831,2399,2385],{"class":851},[831,2401,2402],{"class":2388},"scroll",[831,2404,2392],{"class":851},[831,2406,2407,2409,2411,2413,2415],{"class":833,"line":974},[831,2408,2397],{"class":837},[831,2410,2385],{"class":851},[831,2412,1637],{"class":2388},[831,2414,852],{"class":851},[831,2416,1664],{"class":1663},[831,2418,2419,2422,2424,2426],{"class":833,"line":995},[831,2420,2421],{"class":954},"    cssSelector",[831,2423,2371],{"class":837},[831,2425,1660],{"class":1659},[831,2427,1664],{"class":1663},[831,2429,2430,2433,2435,2437],{"class":833,"line":1014},[831,2431,2432],{"class":954},"    
data",[831,2434,1685],{"class":837},[831,2436,1660],{"class":1659},[831,2438,1664],{"class":1663},[831,2440,2441,2444,2447],{"class":833,"line":1023},[831,2442,2443],{"class":1663},"}",[831,2445,2446],{"class":954},"[]",[831,2448,1664],{"class":1663},[11,2450,2451],{},"And a suggested actions response could, in turn, look as follows:",[823,2453,2457],{"className":2454,"code":2455,"language":2456,"meta":453,"style":453},"language-json shiki shiki-themes catppuccin-latte night-owl","[\n    {\n        \"thought\": \"Scroll newsletter cta into view\",\n        \"action\": \"scroll\",\n        \"cssSelector\": \"section#newsletter\"\n    },\n    {\n        \"thought\": \"Type email address to newsletter cta\",\n        \"action\": \"type\",\n        \"cssSelector\": \"section#newsletter > input\",\n        \"data\": \"user@example.org\"\n    },\n    {\n        \"thought\": \"Submit newsletter sign up\",\n        \"action\": \"click\",\n        \"cssSelector\": \"section#newsletter > button\"\n    }\n]\n","json",[171,2458,2459,2464,2469,2493,2512,2530,2535,2539,2558,2576,2595,2613,2617,2621,2640,2658,2675,2680],{"__ignoreMap":453},[831,2460,2461],{"class":833,"line":834},[831,2462,2463],{"class":1663},"[\n",[831,2465,2466],{"class":833,"line":454},[831,2467,2468],{"class":1663},"    {\n",[831,2470,2471,2475,2479,2481,2483,2485,2489,2491],{"class":833,"line":942},[831,2472,2474],{"class":2473},"srFR9","        \"",[831,2476,2478],{"class":2477},"s30W1","thought",[831,2480,852],{"class":2473},[831,2482,2371],{"class":1663},[831,2484,2385],{"class":851},[831,2486,2488],{"class":2487},"sCC8C","Scroll newsletter cta into 
view",[831,2490,852],{"class":851},[831,2492,1746],{"class":1663},[831,2494,2495,2497,2500,2502,2504,2506,2508,2510],{"class":833,"line":965},[831,2496,2474],{"class":2473},[831,2498,2499],{"class":2477},"action",[831,2501,852],{"class":2473},[831,2503,2371],{"class":1663},[831,2505,2385],{"class":851},[831,2507,2402],{"class":2487},[831,2509,852],{"class":851},[831,2511,1746],{"class":1663},[831,2513,2514,2516,2519,2521,2523,2525,2528],{"class":833,"line":974},[831,2515,2474],{"class":2473},[831,2517,2518],{"class":2477},"cssSelector",[831,2520,852],{"class":2473},[831,2522,2371],{"class":1663},[831,2524,2385],{"class":851},[831,2526,2527],{"class":2487},"section#newsletter",[831,2529,2392],{"class":851},[831,2531,2532],{"class":833,"line":995},[831,2533,2534],{"class":1663},"    },\n",[831,2536,2537],{"class":833,"line":1014},[831,2538,2468],{"class":1663},[831,2540,2541,2543,2545,2547,2549,2551,2554,2556],{"class":833,"line":1023},[831,2542,2474],{"class":2473},[831,2544,2478],{"class":2477},[831,2546,852],{"class":2473},[831,2548,2371],{"class":1663},[831,2550,2385],{"class":851},[831,2552,2553],{"class":2487},"Type email address to newsletter cta",[831,2555,852],{"class":851},[831,2557,1746],{"class":1663},[831,2559,2560,2562,2564,2566,2568,2570,2572,2574],{"class":833,"line":1029},[831,2561,2474],{"class":2473},[831,2563,2499],{"class":2477},[831,2565,852],{"class":2473},[831,2567,2371],{"class":1663},[831,2569,2385],{"class":851},[831,2571,1637],{"class":2487},[831,2573,852],{"class":851},[831,2575,1746],{"class":1663},[831,2577,2578,2580,2582,2584,2586,2588,2591,2593],{"class":833,"line":1035},[831,2579,2474],{"class":2473},[831,2581,2518],{"class":2477},[831,2583,852],{"class":2473},[831,2585,2371],{"class":1663},[831,2587,2385],{"class":851},[831,2589,2590],{"class":2487},"section#newsletter > 
input",[831,2592,852],{"class":851},[831,2594,1746],{"class":1663},[831,2596,2597,2599,2602,2604,2606,2608,2611],{"class":833,"line":1045},[831,2598,2474],{"class":2473},[831,2600,2601],{"class":2477},"data",[831,2603,852],{"class":2473},[831,2605,2371],{"class":1663},[831,2607,2385],{"class":851},[831,2609,2610],{"class":2487},"user@example.org",[831,2612,2392],{"class":851},[831,2614,2615],{"class":833,"line":1074},[831,2616,2534],{"class":1663},[831,2618,2619],{"class":833,"line":1084},[831,2620,2468],{"class":1663},[831,2622,2623,2625,2627,2629,2631,2633,2636,2638],{"class":833,"line":1103},[831,2624,2474],{"class":2473},[831,2626,2478],{"class":2477},[831,2628,852],{"class":2473},[831,2630,2371],{"class":1663},[831,2632,2385],{"class":851},[831,2634,2635],{"class":2487},"Submit newsletter sign up",[831,2637,852],{"class":851},[831,2639,1746],{"class":1663},[831,2641,2642,2644,2646,2648,2650,2652,2654,2656],{"class":833,"line":1121},[831,2643,2474],{"class":2473},[831,2645,2499],{"class":2477},[831,2647,852],{"class":2473},[831,2649,2371],{"class":1663},[831,2651,2385],{"class":851},[831,2653,2389],{"class":2487},[831,2655,852],{"class":851},[831,2657,1746],{"class":1663},[831,2659,2660,2662,2664,2666,2668,2670,2673],{"class":833,"line":1130},[831,2661,2474],{"class":2473},[831,2663,2518],{"class":2477},[831,2665,852],{"class":2473},[831,2667,2371],{"class":1663},[831,2669,2385],{"class":851},[831,2671,2672],{"class":2487},"section#newsletter > button",[831,2674,2392],{"class":851},[831,2676,2677],{"class":833,"line":1136},[831,2678,2679],{"class":1663},"    }\n",[831,2681,2682],{"class":833,"line":1142},[831,2683,2684],{"class":1663},"]\n",[719,2686,2687],{},[11,2688,2689,2694,2695,2700],{},[80,2690,2693],{"href":2691,"rel":2692},"https://platform.openai.com/docs/guides/function-calling",[84],"Function Calling"," and the ",[80,2696,2699],{"href":2697,"rel":2698},"https://modelcontextprotocol.io",[84],"Model Context Protocol"," represent two ends to outsource 
an explicit actuation model – server- and client-side, respectively.",[558,2702,2704],{"id":2703},"agentic-ui-augmentation","Agentic UI Augmentation",[11,2706,2707],{},"An agent represents yet another feature to integrate with an application and its UI. Discoverability and availability, however, are among the most fundamental requirements of a web agent. Evidently, when a user experiences UI/UX friction, at least the agent should be interactive. That said, a scrolling modal web agent UI has been the go-to approach, that is, a little floating widget on top of the underlying application's UI. It comes with a major advantage: the agent application can be decoupled from the underlying, self-contained application.",[18,2709],{":width":2710,"alt":2711,"format":2205,"loading":23,"src":2712},"360","Depiction of a web agent application augmenting an underlying application in an isolated layer","/blog/a-gentle-introduction-to-ai-agents-for-the-web/7.svg",[27,2714,2716],{"id":2715},"how-to-build-a-web-agent","How to Build a Web Agent?",[11,2718,2719],{},"Believe it or not: enhancing an existing web application with a purposeful agent is low-hanging fruit. The evolving agent ecosystem provides you with a spectrum of solutions: instantly use a pre-compiled agent, tweak a templated agent, or develop an agent from scratch. Either way, LLMs and web browsers exist for reuse, boiling down agent development to LLM context engineering and UI augmentation.",[558,2721,2723],{"id":2722},"develop-a-web-agent","Develop a Web Agent",[11,2725,2726,2727,2730,2731,710,2735,2740],{},"Opting for a ",[31,2728,2729],{},"pre-compiled agent"," does not necessarily involve any actual development step. Instead, pre-compiled agents allow for high-level configuration through an agent-as-a-service provider's interface. 
Popular agent-as-a-service providers are, i.a., ",[80,2732,344],{"href":2733,"rel":2734},"https://elevenlabs.io/conversational-ai",[84],[80,2736,2739],{"href":2737,"rel":2738},"https://www.intercom.com/drlp/ai-agent",[84],"Intercom",". Serviced agents hide LLM communication and potentially interaction with a web browser behind the configuration interface.",[11,2742,2743,2744,2747,2748,2753,2754,2759,2760,2765],{},"Using a ",[31,2745,2746],{},"templated agent"," resembles the agent-as-a-service approach on a lower level. Openly sourced from a ",[80,2749,2752],{"href":2750,"rel":2751},"https://github.com/webfuse-com/agent-extension-blueprint",[84],"code repository",", templated agents allow for any kind of development tweaks. Favourably, agent templates shortcut integration with ",[80,2755,2758],{"href":2756,"rel":2757},"https://openai.com/api/",[84],"LLM APIs"," and web ",[80,2761,2764],{"href":2762,"rel":2763},"https://developer.mozilla.org/en-US/docs/Web/API",[84],"browser APIs",". Using a templated agent usually represents the preferable, best-of-both-worlds approach; common- and best-practice code snippets are available from the beginning, but everything can be customised as desired.",[11,2767,2768,2769,2772],{},"Of course, developing an ",[31,2770,2771],{},"agent from scratch"," is always an option. It is preferable whenever agent requirements deviate to a large extent from what exists in the service or template landscape.",[558,2774,2776],{"id":2775},"deploy-a-web-agent","Deploy a Web Agent",[11,2778,2779,2780,345,2785,2790,2791,2796,2797,2802,2803,2808,2809,2814],{},"When web agent code lives side-by-side with the augmented application's code, agent deployment is covered by a generic pipeline. 
Something like: ",[80,2781,2784],{"href":2782,"rel":2783},"https://eslint.org",[84],"linting",[80,2786,2789],{"href":2787,"rel":2788},"https://prettier.io",[84],"formatting"," agent code, ",[80,2792,2795],{"href":2793,"rel":2794},"https://esbuild.github.io",[84],"transpiling and bundling"," agent modules, ",[80,2798,2801],{"href":2799,"rel":2800},"https://www.cypress.io",[84],"testing"," agent, ",[80,2804,2807],{"href":2805,"rel":2806},"https://pages.cloudflare.com",[84],"hosting"," agent bundle, and ",[80,2810,2813],{"href":2811,"rel":2812},"https://docs.github.com/en/actions/get-started/continuous-integration",[84],"triggering"," post-deployment events. In that case, an agent represents a modular feature component in the application, no different from, for instance, a sign-up component.",[11,2816,2817],{},"Web agent source code right inside the application codebase comes at a cost:",[41,2819,2820,2823,2826],{},[44,2821,2822],{},"Agent developers can manipulate the source code of the underlying application.",[44,2824,2825],{},"Agent functionality could introduce side effects on the underlying application.",[44,2827,2828],{},"Agent changes require deployment of the entire application.",[558,2830,2832],{"id":2831},"best-practices-of-agentic-ux","Best Practices of Agentic UX",[11,2834,2835],{},"When designing user experiences for agent-enhanced applications, there are a few things to consider:",[41,2837,2838,2839,2838,2848,2838,2856],{},"\n    ",[44,2840,2841,2842,2841,2845,2847],{},"\n        ",[31,2843,2844],{},"Stream input and output to reduce latency",[2056,2846],{},"\n        LLMs (re-)introduce noticeable communication round-trip time. 
To reduce wait time for the human user, stream chunks of data whenever they are available.\n    ",[44,2849,2841,2850,2841,2853,2855],{},[31,2851,2852],{},"Provide fine-grained feedback to bridge high latency",[2056,2854],{},"\n        Human attention is sensitive to several seconds of [system response time](https://www.nngroup.com/articles/response-times-3-important-limits/). Periodically provide agent _thoughts_ as feedback to perceptibly break down round-trip time.\n    ",[44,2857,2841,2858,2841,2861,2863],{},[31,2859,2860],{},"Always prompt the human user for consent to perform critical actions",[2056,2862],{},"\n        Some actions in a web application lead to irreversible or significant changes of state. Never have the agent perform such actions on behalf of the user without explicitly asking for permission.\n    ",[558,2865,2867],{"id":2866},"non-invasive-web-agents-with-webfuse","Non-Invasive Web Agents with Webfuse",[11,2869,2870,2875],{},[80,2871,2873],{"href":1873,"rel":2872},[84],[31,2874,1875],{}," is a configurable web proxy that lets you augment any web application. As pictured, web agents represent highly self-contained applications. Moreover, web agents and underlying applications communicate at runtime in the client. This does, in fact, open up opportunities to address the above-mentioned drawbacks with Webfuse: Develop web agents with a sandbox extension methodology, and deploy them through the low-latency proxy layer. On demand, seamlessly serve users with your agent-enhanced website. 
Benefit from information hiding, safe code, and fewer deployments.",[2877,2878],"article-signup-cta",{":demoAction":2879,"heading":2880,"subtitle":2881},"{\"text\":\"Read more\",\"showIcon\":false,\"href\":\"https://www.webfuse.com/blog/category/ai-agents\"}","Deploy Web Agents with Webfuse","Develop or deploy web agents in minutes; serve agent-enhanced websites through an isolated application layer.",[2129,2883,2884],{},"html pre.shiki code .s76yb, html code.shiki .s76yb{--shiki-default:#8839EF;--shiki-dark:#C792EA}html pre.shiki code .sXbZB, html code.shiki .sXbZB{--shiki-default:#DF8E1D;--shiki-default-font-style:italic;--shiki-dark:#D6DEEB;--shiki-dark-font-style:inherit}html pre.shiki code .s2kId, html code.shiki .s2kId{--shiki-default:#4C4F69;--shiki-dark:#D6DEEB}html pre.shiki code .scGhl, html code.shiki .scGhl{--shiki-default:#7C7F93;--shiki-dark:#D6DEEB}html pre.shiki code .s9rnR, html code.shiki .s9rnR{--shiki-default:#179299;--shiki-dark:#7FDBCA}html pre.shiki code .scrte, html code.shiki .scrte{--shiki-default:#8839EF;--shiki-dark:#C5E478}html pre.shiki code .sbuKk, html code.shiki .sbuKk{--shiki-default:#40A02B;--shiki-dark:#D9F5DD}html pre.shiki code .sgAC-, html code.shiki .sgAC-{--shiki-default:#40A02B;--shiki-default-font-style:italic;--shiki-dark:#ECC48D;--shiki-dark-font-style:inherit}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: 
var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .srFR9, html code.shiki .srFR9{--shiki-default:#7C7F93;--shiki-dark:#7FDBCA}html pre.shiki code .s30W1, html code.shiki .s30W1{--shiki-default:#1E66F5;--shiki-dark:#7FDBCA}html pre.shiki code .sCC8C, html code.shiki .sCC8C{--shiki-default:#40A02B;--shiki-dark:#C789D6}",{"title":453,"searchDepth":454,"depth":454,"links":2886},[2887,2892,2898],{"id":2193,"depth":454,"text":2156,"children":2888},[2889,2890,2891],{"id":2209,"depth":942,"text":2210},{"id":2231,"depth":942,"text":2232},{"id":2249,"depth":942,"text":2250},{"id":2270,"depth":454,"text":2271,"children":2893},[2894,2895,2896,2897],{"id":2282,"depth":942,"text":2283},{"id":2315,"depth":942,"text":2316},{"id":2337,"depth":942,"text":2338},{"id":2703,"depth":942,"text":2704},{"id":2715,"depth":454,"text":2716,"children":2899},[2900,2901,2902,2903],{"id":2722,"depth":942,"text":2723},{"id":2775,"depth":942,"text":2776},{"id":2831,"depth":942,"text":2832},{"id":2866,"depth":942,"text":2867},"2025-06-15","LLMs only recently enabled serviceable web agents: autonomous systems that browse web on behalf of a human. 
Get started with fundamental methodology, key design challenges, and technological opportunities.",{"homepage":498,"relatedLinks":2907},[2908,2909,2913],{"text":2152,"href":2153,"description":2154},{"text":2910,"href":2911,"description":2912},"Develop an AI Agent for Any Website with Webfuse","/blog/develop-an-ai-agent-for-any-website-with-webfuse","Learn how to develop and deploy a web agent for any website with Webfuse",{"text":1869,"href":2914,"external":498,"description":2160},"https://dev.webfuse.com/automation-api/",{"title":2174,"description":2905},{"loc":1610},"blog/1011.a-gentle-introduction-to-ai-agents-for-the-web",[505,2166,2167,2169,2170],"Ky-gggxmZkldeN3wb7OvPpBxNaP72MwefaxFypvbUzY",1777376332566]