MARK6582 · AI & Marketing · Georgetown University

Day 2 Study Guide

Chatbots, LLMs, and Customer Acquisition
Part 1

Economics of AI in Customer Acquisition

Customer Lifetime Value · Acquisition Cost · Funnel Stage · Conversion Rate · Breakeven · ROI · Incremental Revenue
Concept 1

Customer Lifetime Value as the Decision Lens

The question "Should we deploy an AI chatbot?" is ultimately an economics question, not a technology question. The answer depends on whether the financial benefit of deploying the chatbot exceeds its cost, accounting for the customer's full value to the business over time.

This is where customer lifetime value (CLV) enters. CLV is the total expected revenue a customer will generate minus the cost of acquiring and serving that customer. When deciding whether to deploy a chatbot at a given funnel stage, you must ask: What is the CLV of the customers at this stage, and can a cheaper AI system handle qualification or routing without destroying that value?

A chatbot that handles high-CLV prospects poorly is expensive: it destroys revenue. A chatbot that efficiently handles low-CLV prospects (whom you would barely assign a human to anyway) creates enormous value. This stage-dependent thinking is critical.

Examples
E-commerce: A customer browsing sneakers on a retail site (unknown value) is worth far less than a B2B procurement manager reading product specs (likely $100K+ customer). A chatbot can efficiently handle the first; the second needs human judgment.
SaaS: A free-trial user asking "How do I reset my password?" is low-CLV. A customer considering a $500K annual license asking "Does this integrate with our ERP?" is high-CLV. Different chatbot designs apply.
Banking: A customer opening an account needs efficiency (volume is enormous). A wealth management prospect needing to discuss portfolio strategy needs expertise. Different economics apply at each stage.
Why It Matters

This framework prevents the common mistake of thinking "we should use AI everywhere because it's cheaper." You should use AI where the economics make sense. That depends on the CLV of the segment and the cost of getting the interaction wrong. A 20% error rate on high-CLV prospects is catastrophic. A 20% error rate on low-CLV prospects might still be profitable.

Concept 2

The Funnel and Stage-Dependent Deployment

Most acquisition funnels have a common structure: large numbers of low-intent prospects at the top, progressively smaller numbers of higher-intent, higher-value prospects as you move down. This creates a fundamental mismatch: the biggest volume is where individual customer value is lowest, and vice versa.

This mismatch is where chatbots shine and where they fail. At the top of the funnel (TOFU), you have millions of visitors who are not ready for a salesperson and don't need one. A chatbot can efficiently identify who is interested enough to engage further. At the bottom of the funnel (BOFU), you have dozens or hundreds of prospects, each worth significant revenue. A chatbot's error rate becomes expensive; human judgment is valuable.

The key insight: the right AI solution depends on what stage you're deploying at. You're not asking "is the chatbot as good as a human?" You're asking "at this particular stage, with this particular customer value, is a chatbot that's slightly worse still worth the cost savings?"

Examples
E-commerce top of funnel: A customer lands on the site and doesn't know where to start. A chatbot recommends product categories, asks about use case, and suggests items. Cost per interaction is pennies. Human reps couldn't scale to millions of visitors. Bot error rate of 20% is acceptable because the cost of getting it wrong is low.
SaaS middle of funnel: A prospect has watched a demo and has specific questions about feature X and how it compares to Competitor Y. A chatbot can handle the FAQ-level questions (feature availability, pricing tiers). But if the question reveals a complex use case, escalation to a sales engineer is justified.
Enterprise bottom of funnel: A buyer for a $2M deal is in final negotiation. A chatbot handling "what are your contract terms?" is inappropriate. This needs a human executive. The CLV is so high that human time is cheap relative to the deal value.
Why It Matters

This explains why "replace all human reps with chatbots" is rarely the answer, and why "chatbots are useless" is also wrong. The right answer is stage-dependent. Understanding this prevents expensive mistakes in both directions: deploying chatbots where they will damage high-value relationships, or avoiding chatbots where they could efficiently handle volume.

Concept 3

Breakeven Math: When Does Cheaper Make Economic Sense?

Let's make this concrete with numbers. Suppose you're deciding whether to deploy a chatbot at a particular funnel stage. You know:

  • The chatbot costs $2,000/year to operate
  • A human rep costs $60,000/year (salary + benefits + overhead)
  • The human converts 12.5% of leads at this stage to the next stage
  • The chatbot converts 10% (it's worse, but cheaper)
  • Each converted customer is worth $5,000 to the business
  • You have 10,000 leads at this stage annually

Human scenario: 10,000 leads × 12.5% = 1,250 conversions × $5,000 = $6,250,000 revenue. Cost: $60,000. Net: $6,190,000.

Chatbot scenario: 10,000 leads × 10% = 1,000 conversions × $5,000 = $5,000,000 revenue. Cost: $2,000. Net: $4,998,000.

The human generates $1,250,000 more revenue, or $1,192,000 more net value after the higher cost. But suppose the chatbot's job isn't to convert leads, but to qualify them (route high-intent to humans, low-intent to nurture). Then the math changes entirely, because the chatbot's cost is much lower and its error rate on routing (not conversion) is what matters.

The trap: savings that look good on a spreadsheet can hide real losses in conversion quality or deal size. If you deploy a bot that costs 1/30th as much but converts 20% fewer high-CLV prospects, the math might still say "bot wins." But you have just destroyed lifetime value. This is the HubSpot question: are the cost savings worth the conversion loss? The answer depends entirely on where in the funnel you deploy it and what segment you are targeting.

The point: you must do this math carefully, with realistic conversion rates for each approach at each stage. Most deployments fail not because the economics were wrong, but because the assumed conversion rates were optimistic.
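
To make this math auditable, here is a minimal Python sketch of the comparison above. The inputs are the illustrative assumptions from this section, not benchmarks, and the last line computes the conversion rate at which the chatbot would break even with the human.

```python
def net_value(leads, conversion_rate, value_per_conversion, annual_cost):
    """Net contribution of one approach at this funnel stage for one year."""
    revenue = leads * conversion_rate * value_per_conversion
    return revenue - annual_cost

LEADS, VALUE = 10_000, 5_000
human = net_value(LEADS, 0.125, VALUE, 60_000)   # $6,190,000
bot = net_value(LEADS, 0.10, VALUE, 2_000)       # $4,998,000
print(f"Human net: ${human:,.0f}   Bot net: ${bot:,.0f}   Gap: ${human - bot:,.0f}")

# Breakeven: the bot conversion rate at which its net value matches the human's.
breakeven_rate = (human + 2_000) / (LEADS * VALUE)
print(f"Bot breaks even at a conversion rate of {breakeven_rate:.2%}")
```

If the chatbot's realistic conversion rate at this stage falls below that breakeven rate, the operating savings do not cover the lost conversions.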

Examples
Qualification vs. Conversion: A chatbot qualifying leads (routing to humans) can afford a higher error rate than a chatbot closing deals. If qualification has 30% error but costs 1/30th as much as humans doing it, the math works. If conversion has 20% error and costs 1/30th as much, it probably doesn't.
Volume play: A retail site with 1 million annual visitors and a 2% self-identification rate gets 20,000 leads. A one-percentage-point improvement in identification (from 2% to 3%) via a smarter chatbot generates 10,000 additional leads, worth millions. The chatbot's cost ($50K) becomes trivial.
Cost of error: A chatbot that escalates 40% of customer service inquiries to humans is expensive operationally, but it's cheaper than having 40% of customers leave angry. Sometimes "cost of error" is measured in retention, not immediate conversion.
Why It Matters

This is the framework that separates smart AI decisions from hype. If you can't write down the numbers and show that the chatbot's cost savings outweigh the conversion loss (or quality loss), then you don't have a case for deployment. Many organizations deploy AI tools without this analysis, then wonder why ROI is negative.

Resources for This Section

Article The Next Frontier of Customer Engagement: AI-Enabled Customer Service (McKinsey)

Strategic analysis of where AI creates value in customer engagement across the funnel, including economics of bot vs. human deployment at each stage.

Article The Contact Center Crossroads: Finding the Right Mix of Humans and AI (McKinsey)

Deep dive into the stage-dependent tradeoff between human and AI agents, with data on conversion rates, customer satisfaction, and cost-benefit by interaction type.

Case Study How Generative AI Transforms Customer Service (BCG)

Practical framework for where generative AI creates value in customer service, with examples of successful and failed deployments across funnel stages.

Part 2

Three Design Questions Every Chatbot Faces

Transparency · Disclosure · Voice and Tone · Personality · Escalation Mode · Clarity · User Experience

Every organization deploying a chatbot (whether rule-based or LLM-powered) faces three fundamental design questions. These are not technical questions. They are strategic questions about how the chatbot will interact with customers. The answers determine whether the chatbot creates value or destroys it.

Design Question 1

Disclose or Conceal? Should Customers Know They're Talking to AI?

There's no universally right answer to the transparency question. Disclosure builds trust and manages expectations, but it can also trigger skepticism. Concealment increases engagement initially, but creates outrage if customers later discover they were chatting with a bot.

Here's the catch: most real-world chatbots aren't purely AI. They're hybrid systems. Lyft's support bot handles routine questions, then hands off complex cases to humans. Amazon routes order issues to bots, but escalates disputes and exceptions to people. In some systems, humans review and edit AI drafts before sending them. In others, the handoff to a human is explicit mid-conversation. Either way, it's not a binary AI-or-human decision. It's both, shifting based on context.

Disclosure changes behavior in ways that aren't intuitive. When customers know they're talking to a bot, they may get skeptical, but they're also more forgiving when the bot misunderstands something. When they think they're talking to a person but discover it's a bot, that same misunderstanding feels like deception. When humans are involved, customers communicate differently—more formally, more carefully. They know someone is watching.

So the real question isn't "disclose or conceal?" It's "how do we set expectations clearly in a system where AI and humans share the work?" In regulated industries (financial services, healthcare), disclosure is often legally required. In e-commerce, it varies. In customer service, transparency is becoming the norm. The key is consistency. Customers are forgiving of AI if that's what they expect. They're not forgiving of being surprised.

Examples
Full disclosure approach: "Chat with our support bot. I can help with orders, returns, and account questions. For billing disputes or complex issues, I'll connect you with a specialist." Sets clear boundaries. When limits appear, customers aren't surprised.
Partial disclosure approach: "Get help now" (bot responds). Later: "Your issue has been escalated to our team." Humans are involved, but the AI role stays implicit. Customers perceive care without a jarring disclosure moment.
Hybrid handoff approach: Bot resolves a straightforward return request, then says "I'll confirm this with an agent to make sure everything's set." Transparent about the handoff, but the conversation feels seamless because the bot prepared the context.
Why It Matters

Disclosure is fundamentally a trust management strategy. The risk isn't in choosing AI or human—it's in creating mismatches between what customers expect and what they experience. A bot that's honest about its limits can earn trust. A bot that claims to do something and fails will destroy it, regardless of whether customers knew it was AI.

Design Question 2

Voice and Personality: What Tone Should the AI Use?

Personality design is subtle but powerful. A chatbot can be formal and corporate, warm and conversational, casual and playful, or urgent and action-oriented. Each personality attracts different users, builds different kinds of trust, and succeeds in different contexts.

Research in human-computer interaction shows that personality affects engagement, perceived competence, and willingness to use the tool again. A formal, professional tone conveys expertise but can feel cold. A warm, conversational tone builds rapport but can feel less credible. There is no objective "right" voice. Only the voice that fits your brand and your customer expectations.

The key is consistency. A chatbot that's casual in one turn and formal in the next confuses users. Once you pick a voice, it needs to be consistent across all interactions.

Examples
Formal/Professional (IBM Watson, government chatbots): "Please enter your account number to proceed." Conveys authority, but can feel distant.
Warm/Conversational (Intercom, Drift): "Hey! What brings you in today?" Encourages engagement, but can feel scripted if not done well.
Casual/Playful (Some Slack bots, brand mascots): "What's up? Let's get you sorted." Memorable, but only works if it matches brand identity.
Why It Matters

Voice affects whether customers engage with the chatbot at all, how much they reveal, and whether they trust its answers. A chatbot recommending a $10,000 purchase needs a different voice than one recommending a $10 impulse buy. Misalignment between tone and task creates friction.

Design Question 3

Clarity of Escalation: Can Customers Easily Reach a Human?

A chatbot that can't escalate effectively is a customer service disaster. No matter how good the bot is, there will be cases it can't handle. The design question is: how obvious is the path to a human, and how well does the chatbot execute the handoff?

Poor escalation design is common. Some chatbots bury the "talk to an agent" button. Others make customers repeat information to the human they're transferred to. Some escalate too aggressively (wasting agent time on routine questions) or too conservatively (frustrating customers who need help).

The best design does three things: (1) makes escalation easy to find, (2) preserves context across the bot-to-human handoff (the human knows what the bot already tried), and (3) sets customer expectations upfront ("I can help with X and Y, but if you need Z, I'll transfer you to an agent").

Examples
Good escalation: "I can help with order tracking and basic returns. For more complex issues, I can connect you with an agent who can see our conversation. Ready?"
Bad escalation (unclear path): Bot gives a generic error message with no clear way to reach a human. Customer has to start over or leave the site.
Bad escalation (context loss): Bot transfers to agent, but agent has no record of what bot already tried. Customer repeats entire story.
Why It Matters

Escalation design determines whether chatbots improve or degrade the customer experience. A chatbot that handles 90% of cases well but botches escalation for the remaining 10% has failed. Those 10% are often the highest-value or most frustrated customers. Escalation failures are memorable and drive customer churn.

Resources for This Section

Practical Guide Managing AI Agent Escalation, Guidance, and Rules (Intercom Help)

Practical documentation on designing chatbot escalation policies and guardrails, including transparency and timing recommendations.

Research Does Bot Personality Matter? (Computers in Human Behavior)

An empirical study on how chatbot personality affects user engagement and trust. Includes nuance on when personality matters and when it doesn't.

Practical Guide Conversational AI Experience Design (Intercom Help)

Practitioner guidance on designing conversational interfaces with attention to disclosure, tone, and user experience expectations.

Part 3

How Large Language Models Work

Tokenization · Embedding · Attention · Inference · Sampling · Temperature · Guardrails · Retrieval · Tool Calls

Large language models (LLMs) like GPT and Claude power modern chatbots. But most people have a cartoon understanding of how they work: "they predict the next word." That's technically true but misleading. Let's open the hood and see the machinery.

Concept 1

Tokenization: Text to Numbers

An LLM doesn't work with text directly. It works with numbers. The first step is tokenization: converting text into numerical tokens that the model can process.

A token is roughly a word or word fragment. "Hello" might be one token. "Electricity" might be two tokens (electr- and -icity) depending on the tokenizer. This matters because token boundaries affect how the model understands meaning. A word split across token boundaries loses some structural information compared to a word that fits in a single token.

For marketing purposes, the key insight is: the model doesn't read your text as a human would. It reads it as a sequence of token IDs. Unusual spellings, abbreviations, or brand names that don't fit tokenizer dictionaries are handled less efficiently than common words.
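
A quick way to build this intuition is to count tokens yourself. A minimal sketch, assuming the tiktoken package is installed; exact token counts vary by tokenizer and model, so treat the output as illustrative.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several recent OpenAI models

for text in ["customer", "hyperscalable", "ChatGPT", "DCD791D2"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} token(s) -> {pieces}")
```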

Examples
Common word: "customer" → 1 token. Processed efficiently.
Uncommon word: "hyperscalable" → 3-4 tokens. Processed less efficiently, loses structural coherence.
Brand name: "ChatGPT" → varies by tokenizer. Might be 1 token or 3, affecting how the model represents it internally.
Why It Matters

This explains why LLMs sometimes seem to struggle with specific brand names, product codes, or industry jargon. It is not that they don't "know" the term. It is that the term is tokenized inefficiently, so the model has to work harder to process it. For marketing applications, this means you may need to add jargon or brand names to your guardrails explicitly, rather than relying on the model to "just know" them.

Concept 2

Embeddings: Numbers to Meaning

Once text is tokenized, each token becomes a vector of numbers (an embedding). This vector captures the semantic meaning of the token in a high-dimensional space. Tokens that are similar in meaning have embeddings that point in similar directions.

The key point: meaning is not programmed into embeddings. It's learned from data. The model was trained on billions of words and learned statistical patterns about which words co-occur, which tokens substitute for each other, which tokens predict other tokens. These patterns are compressed into the embedding vectors.

This is where "understanding" enters the system, though "understanding" is a misnomer: the model doesn't understand the way humans do. It has learned statistical regularities about language that approximate understanding for many tasks.
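
The geometry is easy to see with cosine similarity. The vectors below are tiny, made-up illustrations rather than real learned embeddings, which typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: close to 1.0 = similar direction (related meaning)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings, invented purely for illustration.
emb = {
    "dog": np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "invoice": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(emb["dog"], emb["puppy"]))    # high: related meanings
print(cosine(emb["dog"], emb["invoice"]))  # low: unrelated meanings
```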

Examples
Semantic similarity: The embeddings for "dog," "puppy," and "canine" point in similar directions, so the model knows they're related. This is learned, not programmed.
Analogy: The embedding arithmetic roughly works: "king" - "man" + "woman" ≈ "queen". Again, learned from patterns.
Domain specificity: A model trained on Wikipedia and a model trained on medical literature will embed the same word differently, because the co-occurrence patterns (and thus the learned meaning) differ.
Why It Matters

This explains why LLMs work well for common language tasks (the training data is huge and representative) but can struggle with domain-specific jargon, technical terminology, or recent events (not well-represented in training data). It also explains why fine-tuning or retrieval-augmented generation (adding external knowledge) is necessary for specialized tasks. The base model's embeddings may not capture domain-specific meaning well.

Concept 3

Attention: Which Tokens Matter?

The attention mechanism is the core innovation that makes modern LLMs work. It asks: for the next token prediction, which previous tokens in the input are most relevant?

In a customer service chatbot handling a return request, attention might focus heavily on the product ID and the reason for return, and less on incidental context. In a brainstorming task, attention might spread more evenly across the entire conversation. The model learns to weight token importance dynamically based on the task and context.

This is why LLMs can handle long conversations and complex context better than older models that couldn't attend selectively to relevant information. Attention lets the model scale to longer inputs without losing coherence.
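
The mechanism itself is compact. Below is a minimal single-head scaled dot-product self-attention in NumPy with toy dimensions; production models stack many such heads and layers, and learn separate projections for queries, keys, and values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query token attends over all key tokens; weights sum to 1 per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                 # 5 toy tokens, 8-dimensional representations
X = rng.normal(size=(n_tokens, d_model))
output, attn = scaled_dot_product_attention(X, X, X)   # self-attention
print(attn.round(2))   # row i shows how much token i attends to every token
```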

Examples
Customer service: "I bought a DCD791D2 drill on 3/15/2024 and it broke today. Can I return it?" Attention focuses on product ID, date, and problem. Downweights "today" because the important date is purchase date.
Information synthesis: A long customer conversation about multiple products. Attention learns to focus on the most recent product mentioned and the current question, while downweighting older context.
Context length limits: Even with attention, there's a limit to how much context the model can handle (context window). Claude can handle ~200K tokens; GPT-4 handles ~128K. Long conversations may exceed this window.
Why It Matters

Attention is why LLMs can handle complex queries and long conversations without losing track. But it also has limits: if the relevant information is buried deep in a long conversation, or if the model's attention weights the wrong tokens, it can miss critical context. For customer service, this means the chatbot may lose track of important constraints if the conversation gets long, or may overfocus on recent statements at the expense of initial context.

Concept 4

Sampling and Temperature: Why LLMs Vary

At each step, the model doesn't just pick the highest-probability next token. It samples from a distribution of possible tokens, weighted by their probabilities. This sampling is controlled by a parameter called temperature.

High temperature (e.g., 1.0 or above): The model samples broadly across the distribution, including low-probability tokens. Responses are creative but can be incoherent.

Low temperature (e.g., 0.1): The model heavily favors the highest-probability token. Responses are more predictable but can be repetitive.

No sampling (argmax): The model always picks the single highest-probability token. Response is deterministic. Same input always gives same output.

The choice of temperature shapes the chatbot's behavior. A customer service chatbot might use low temperature (reliability over creativity). A brainstorming bot might use higher temperature (creativity over consistency).
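
Mechanically, temperature is just a rescaling of the model's output distribution before sampling. A minimal sketch with invented logits for four candidate next tokens:

```python
import numpy as np

def sample_next_token(logits, temperature, rng):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature <= 0:                        # treat 0 as deterministic argmax
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.5, 0.5, -1.0]              # hypothetical scores for four candidate tokens
rng = np.random.default_rng(42)
for t in (0.0, 0.3, 1.2):
    samples = [sample_next_token(logits, t, rng) for _ in range(10)]
    print(f"temperature={t}: {samples}")
```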

Examples
Temperature = 0.3 (low, for customer service): "Your return has been approved. Your label has been sent to your email."
Temperature = 0.7 (moderate, balanced): "Great news! Your return is approved. We've emailed you a label so you can drop it off at the nearest UPS location."
Temperature = 1.2+ (high, for creative tasks): "Awesome! We've got you covered. Check your inbox. We've sent a return label that will get your drill back to us quickly. No fuss, we promise."
Why It Matters

Temperature explains why the same chatbot, given the same input, might generate different outputs. It is not a bug. It is a feature. But for customer service, you probably want low temperature (consistency). For content generation, you might want higher temperature (variety). Understanding this tradeoff lets you tune the chatbot's behavior to match your goal.

Concept 5

Training Phase vs. Inference Phase

LLMs operate in two fundamentally different phases, and it's critical not to confuse them.

Training phase: The model processes billions of tokens from a large corpus (e.g., the entire internet, books, articles). For each position in the sequence, it tries to predict the next token. Its predictions are compared to the actual token, and errors are used to update the model's weights. This process takes weeks or months on massive clusters of GPUs, costs millions of dollars, and happens offline.

Inference phase: The trained model is fixed (not learning anymore). It processes user inputs and generates outputs in real time. Each token prediction takes milliseconds. This happens millions of times per day at near-zero marginal cost.

The gap between these two phases is important: the model was trained to predict the next token given any previous tokens. But in deployment, it's generating responses to customer queries. If there's a mismatch between how it was trained and how it's used, it will behave unpredictably.

Examples
Training to inference mismatch: A model trained on Wikipedia, where every sentence is factually accurate, will learn to predict confident, well-formed sentences. But those same prediction patterns, applied to customer service, may generate confident wrong answers (hallucination). The model isn't broken; it's operating the way it was trained.
Retrieval-augmented generation fixes this: By feeding the model relevant facts from a knowledge base at inference time (as part of the input), you change what the model predicts. Instead of predicting "the most likely next token given no external facts," it predicts "the most likely next token given these facts." The training hasn't changed, but inference is different.
Cost implications: Retraining a large model costs millions. But inference costs pennies per query. If the model isn't behaving right, you cannot "fix it" by retraining (too expensive). You fix it by changing how it is prompted, or adding retrieval, or guardrails. All inference-time interventions.
Why It Matters

This explains why raw LLMs are not products. They are components. Prompting, constraining, or augmenting an LLM at inference time is not the same as retraining it. Most chatbot improvements are inference-time changes (better prompts, retrieval, guardrails), not training-time changes. Understanding this prevents the misconception that "you need to retrain the model" to fix behavior. Usually you do not.

Concept 6

Guardrails, Retrieval, and Tool Calls: From Component to Product

A raw LLM that predicts tokens is not a customer service chatbot. It's a token predictor. To make it a product, you add three layers on top.

Guardrails: Rules that constrain what the model can output. Examples: "Don't answer questions about medical treatment," "Always offer escalation to a human," "Never make up facts." Guardrails are often implemented as classifiers that run after the LLM generates a response and either accept or reject (or re-prompt) the output.

Retrieval: A system that looks up relevant facts (from a knowledge base, CRM, product database, etc.) and passes them to the LLM as part of the input. Instead of the LLM generating from pure language patterns, it generates with context. This dramatically improves factual accuracy.

Tool calls: Allowing the LLM to call functions or APIs. Instead of just generating text, it can retrieve a customer record, look up an order, or trigger a refund. This makes the chatbot agent-like rather than purely generative.

Most successful chatbots use all three. The raw model handles language. Guardrails prevent misuse. Retrieval provides facts. Tool calls enable action. Together, they form a system.
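
A toy, self-contained sketch of how the layers fit together. Every component here is a stub standing in for a real knowledge base, order system, LLM API, and guardrail check; the names and data are hypothetical.

```python
KNOWLEDGE_BASE = {"return policy": "Items may be returned within 90 days of purchase."}
ORDERS = {"A123": {"item": "cordless drill", "days_since_purchase": 12}}

def retrieve(query):
    """Retrieval layer: pull facts from a source of truth, not model memory."""
    return [text for topic, text in KNOWLEDGE_BASE.items() if topic in query.lower()]

def lookup_order(order_id):
    """Tool-call layer: read from (or act on) a system of record."""
    return ORDERS.get(order_id)

def draft_reply(facts, order):
    """Stand-in for the LLM call; in production this is a prompted API request."""
    return (f"Your {order['item']} was purchased {order['days_since_purchase']} days ago. "
            f"{facts[0]}")

def guardrail_ok(reply, facts):
    """Guardrail layer: only release replies grounded in retrieved facts."""
    return any(fact in reply for fact in facts)

def handle(message, order_id):
    facts, order = retrieve(message), lookup_order(order_id)
    if not facts or order is None:
        return "Escalating to a human: missing context for a grounded answer."
    reply = draft_reply(facts, order)
    return reply if guardrail_ok(reply, facts) else "Escalating to a human: guardrail failed."

print(handle("What is your return policy?", "A123"))
```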

Examples
Guardrails only: A chatbot that relies on the model's RLHF training and basic output rules, with no retrieval or tools. Works okay for general chat, but in customer service, can hallucinate policies or make promises the company can't keep.
Guardrails + Retrieval: The chatbot retrieves the customer's order history and current return policies before responding. Accuracy improves dramatically. But it still cannot actually process a return. It can only tell the customer what to do.
Guardrails + Retrieval + Tool Calls: The chatbot retrieves order history, checks return eligibility (tool call to the returns policy engine), and initiates the return (tool call to the order system). It's now a functional agent, not just a conversational interface.
Why It Matters

This framework explains why "just fine-tune the LLM" is not the solution to most chatbot problems. If the problem is factual accuracy, retrieval helps more than fine-tuning. If the problem is lack of agency, tool calls are required. Understanding which layer to improve prevents wasted effort on the wrong fixes.

Core Principle

LLMs Are Components, Not Products

This distinction is critical and often missed. An LLM by itself is a language model. It predicts the next token based on patterns in training data. It is not a chatbot. It is not a customer service agent. It is a component that generates text.

A working chatbot is the LLM plus retrieval plus guardrails plus tool calls plus human oversight. Remove any one layer and the system fails in predictable ways. Remove retrieval and you get hallucinations. Remove guardrails and you get misuse. Remove tool calls and you have a chatbot that can only talk, not act. Remove human escalation and you have a system that will confidently give terrible advice at scale.

This matters for how you think about improvement. If your chatbot is making mistakes, the instinct is often "retrain the model." Retraining is expensive (millions of dollars for large models) and slow (weeks of compute time). But most chatbot failures are not at the model layer. They're at the retrieval layer (stale facts), the guardrails layer (missing rules), or the escalation layer (not handing off when uncertain). Fixing these is fast and cheap—you change prompts, update guardrails, adjust routing logic. All inference-time changes.

This is why companies like Anthropic and OpenAI spend more time on retrieval, tool-calling frameworks, and safety measures than on retraining. The model is table stakes. The system architecture is what determines whether it actually works.

Examples
Mistake: "Our chatbot is hallucinating returns policies. We need to retrain." Reality: You probably need better retrieval (pull from the live policy document) or a guardrail that checks generated policies against a reference.
Mistake: "We deployed a new LLM and it still gives bad advice." Reality: The LLM may be fine. The system is missing retrieval, guardrails, or escalation logic. The problem is architectural, not the model.
Right approach: "Our chatbot escalation rate is 40%. That's expensive but better than having the chatbot give wrong advice. Let's add retrieval to reduce escalations to 25%, keeping quality high." Here you're thinking in terms of system layers, not the model itself.
Why It Matters

Understanding that LLMs are components prevents the misconception that "better models solve everything." Better models matter. But so does everything built on top. A small model with excellent retrieval, guardrails, and escalation logic will outperform a large model with none of those things. The system is more important than the component.

Resources for This Section

Interactive OpenAI Tokenizer

Paste text and see how it tokenizes. Experiment with different words and phrases to develop intuition for why tokenization matters.

Video · 15 min Attention Is All You Need Explained (3Blue1Brown)

An animated walkthrough of the attention mechanism, the core innovation in modern LLMs. Dense but visually clear.

Classic Paper Attention Is All You Need (Vaswani et al., 2017)

The seminal paper introducing the Transformer architecture, which underlies all modern LLMs. Highly technical; read the abstract and introduction for conceptual understanding.

Article Contextual Retrieval (Anthropic)

Current approach to retrieval-augmented generation and how it improves factual accuracy and relevance in LLM applications.

Short Read Function Calling in the OpenAI API (OpenAI Help)

A practical guide to implementing tool calls in LLM applications.

Part 4

Three Failure Modes in LLM Systems

Stale Retrieval · Constraint Gap · Confident Wrongness · Hallucination · Edge Cases · Failure Modes

Rule-based chatbots failed loudly and obviously. When they couldn't handle something, they said so: "I don't understand that" or "Sorry, I can't help." LLM-based systems fail differently. They fail quietly. They generate fluent, confident responses that sound correct but are wrong. Understanding these three failure modes lets you design systems that can work around them.

Failure Mode 1

Stale Retrieval: Confidence About Outdated Information

Modern LLMs use retrieval: they look up facts from a knowledge base before generating. This is much better than pure generation (which can hallucinate). But retrieval systems have a flaw. They retrieve whatever matches the query best in their database, which may be outdated information.

Example: A customer asks "What's your return policy?" The retrieval system finds the return policy and passes it to the LLM. But the policy was last updated three months ago, and the company changed its 90-day window to 120 days yesterday. The retrieval system doesn't know what's "current"—it just knows what matches the query.

The LLM, given outdated information, generates a confident, fluent response with the wrong policy. The customer doesn't know it's wrong. The system appears to work perfectly.

This is harder to catch than hallucination because it's not the model's fault—it's the retrieval system's fault. But from the customer's perspective, the chatbot gave them bad information.

Examples
Retail return policy: Policy changed from 60 to 90 days, but the knowledge base wasn't updated. Customer gets told 60 days and can't return within the window.
Product specifications: A product was updated with new features, but the old product spec is still in the database. Chatbot confidently tells customer about old specs.
Pricing: A promotional price ended, but the price file wasn't updated. Chatbot confidently quotes the old price to a customer.
Why It Matters

This failure mode is insidious because it doesn't trigger alarms. The chatbot isn't confused or refusing to answer—it's answering with confidence, using plausible information. The failure only becomes visible when a customer acts on the wrong information and complains. By then, damage is done. Prevention requires rigorous data governance: version control on knowledge bases, recency metadata, and regular audits of what's actually current.
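
One concrete form of that governance is recency metadata on the knowledge base. A toy sketch (documents, dates, and the 90-day freshness window are all invented) of a retrieval step that prefers the newest verified document and signals the bot to escalate rather than answer from stale content:

```python
from datetime import date, timedelta

DOCS = [
    {"topic": "return policy", "text": "Returns accepted within 90 days.",
     "last_verified": date(2024, 1, 5)},
    {"topic": "return policy", "text": "Returns accepted within 120 days.",
     "last_verified": date(2024, 4, 2)},
]

def retrieve_current(topic, max_age_days=90, today=date(2024, 4, 10)):
    """Serve only documents verified within the freshness window; newest wins."""
    matches = [d for d in DOCS if d["topic"] == topic]
    fresh = [d for d in matches
             if today - d["last_verified"] <= timedelta(days=max_age_days)]
    if not fresh:
        return None   # signal: hedge or escalate instead of answering confidently
    return max(fresh, key=lambda d: d["last_verified"])

print(retrieve_current("return policy")["text"])   # -> the 120-day policy
```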

Failure Mode 2

Constraint Gap: Unhandled Edge Cases

No set of guardrails or training data can anticipate every edge case. Rules are written for typical scenarios. Guardrails are specified for common misuses. But real customers come up with situations that no one anticipated.

Example: A return policy says "Items must be returned within 90 days in original packaging." The guardrail is "Only approve returns within 90 days." But what about an item purchased 89 days ago that arrived damaged? Or an item purchased 91 days ago that the customer didn't open because it was a gift? These are edge cases. The guardrail doesn't address them.

When the chatbot encounters an edge case, it either escalates (the conservative choice) or applies the rule in a way that feels wrong to the customer. If it escalates to a human, it solves the problem but defeats the automation goal. If it applies the rule rigidly, it frustrates the customer.

The constraint gap happens because the person who wrote the guardrails didn't think of the edge case. This is not a technology problem—it's a specification problem.

Examples
Damaged in shipping: Return policy says "90-day window." Guardrail says "check date." But if the item arrived damaged, exception logic should apply. If not coded, the chatbot rejects a legitimate return.
Bulk/corporate returns: Return policy designed for retail customers. A contractor buying 50 units has different needs (partial returns, extended windows for project-based purchases). Guardrail doesn't address this.
Deceased account holder: Customer tries to process a return on a deceased relative's account. No guardrail contemplates this. Chatbot either refuses or escalates.
Why It Matters

Constraint gaps reveal the limits of rule-based systems at scale. You can't prewrite rules for every edge case. At some point, humans must make judgment calls. The chatbot's job is to handle the 95% of common cases well, then escalate the 5% edge cases confidently and with context preserved. Trying to make the chatbot handle 99% of cases often makes it worse at the 95%, because guardrails become overly complex.

Failure Mode 3

Confident Wrongness: Fluent Hallucination

This is the most famous LLM failure: the model generates text that is fluent, coherent, and completely false, all delivered with high confidence.

This happens because the LLM was trained to predict the next token, not to predict the truth. Given a question like "What are the main features of the DCD791D2 drill?", the model predicts the tokens most likely to follow that question in its training data. If it's seen many product descriptions, it knows the statistical patterns of how product descriptions are written. It can generate a description that sounds like a real product description—but the specific features it lists might be wrong or made up.

This is not a training failure—it's a fundamental property of how LLMs work. The training objective was "predict the next token well," not "predict true facts." More training data or better training doesn't eliminate this failure mode. It shifts where the failures occur, but doesn't eliminate them.

Examples
Made-up product features: Chatbot confidently describes a product feature that the product doesn't have, because the feature description follows normal patterns for similar products.
Hallucinated references: Chatbot cites a company policy or return window that doesn't exist, but the citation format and the reasoning are perfect.
Invented comparisons: Chatbot claims "the DCD791D2 is more durable than the DCD791D1" with high confidence, even though no evidence supports this—it's just a plausible next token.
Why It Matters

This is why retrieval is essential for customer-facing applications. If the model hallucinates, but you feed it the real facts via retrieval, the hallucinations are much less likely. But retrieval only helps with things you can retrieve. For comparisons, reasoning, or things not in your knowledge base, the model can still hallucinate. This means customer service applications need guardrails that catch hallucinations (by checking generated facts against a knowledge base) or conservatively escalate uncertain responses.

But Isn't This Just Poor Training?

It's tempting to think "if we had better data or better training, we could eliminate hallucination." But that misses the point. Hallucination is structural, not accidental. The model's training objective is "predict the next token," not "predict the truth." A model trained purely on real, accurate data can still hallucinate because accuracy wasn't the goal—next-token prediction was.

You can reduce hallucination by using retrieval (feeding facts as input) or guardrails (checking outputs against a knowledge base). But you can't eliminate it without changing what "good" means, which would require retraining. And retraining has its own costs and failure modes. The tradeoff is real.

Why Do These Failure Modes Differ From Rule-Based Systems?

Rule-based systems (like Motion AI in 2017) failed loudly: "I don't understand." LLM systems fail quietly: fluent wrong answer. This is not an accident—it's a consequence of the approach. Rule-based systems know their boundaries (the menu) and escalate at the boundary. LLM systems have no sense of boundaries; they generate regardless. This means LLM systems scale better (they handle novel phrasings) but require different guardrails, designed to catch overconfidence rather than to expand understanding.

Resources for This Section

Research Hallucination in Large Language Models (Research)

An empirical study of when and why LLMs hallucinate, with categorization of different hallucination types.

Article Reducing Hallucinations: Contextual Retrieval (Anthropic)

Practical strategies for mitigating hallucination in deployed systems, including retrieval and guardrails.

Short Read Why Language Models Hallucinate (OpenAI)

An explanation of the structural reasons why LLMs generate confident false information and how to design around it.

Part 5

Five Acquisition Funnel Decisions and Their ML Types

Top of Funnel · Middle of Funnel · Bottom of Funnel · Supervised Learning · Reinforcement Learning · Generative AI · Bandit Algorithm

The question is never "where can we add AI?" It is "where does our funnel leak? Where are we slow? Where do we miss leads?" Once you know the bottleneck, here is which ML type typically solves it. Each decision uses a different ML approach, optimizes for a different metric, and has different constraints. Understanding this structure prevents the mistake of treating all funnel decisions as identical.

Decision 1

Who Sees the Ad? (Supervised Learning — Targeting)

At the very top of the funnel, you have millions of potential customers, most of whom don't care about your product. The first ML decision: who should even see your ads?

This is a supervised learning problem. You train a classifier on historical data: "Which customers clicked/converted/engaged?" The model learns patterns that predict engagement and generalizes to new customers you haven't seen before.

The constraint here is cost. Ad impressions are cheap individually but expensive at scale. A 1% improvement in targeting accuracy (showing ads to slightly higher-intent users) can save millions in ad spend. Conversely, a 10% error rate (showing ads to the wrong audience) is expensive but manageable at this funnel stage.
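
A minimal sketch of that targeting step using scikit-learn on synthetic data; the features, the invented label rule, and the top-20% spend cutoff are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Synthetic history: [pages viewed, past purchases, visiting from a business domain]
X = np.column_stack([rng.poisson(3, n), rng.poisson(1, n), rng.integers(0, 2, n)])
# Invented ground-truth rule: engagement rises with all three signals, plus noise.
p = 1 / (1 + np.exp(-(0.3 * X[:, 0] + 0.8 * X[:, 1] + 1.0 * X[:, 2] - 3)))
y = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Score new prospects and reserve ad spend for the highest-intent slice.
scores = model.predict_proba(X_test)[:, 1]
target = scores > np.quantile(scores, 0.8)    # show ads to the top 20% only
print(f"Targeted {target.sum()} of {len(scores)} prospects")
```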

Examples
LinkedIn advertising: Target decision-makers in tech companies. Training data: "Which profiles visited this job posting?" Classifier predicts which new profiles are likely to be relevant.
E-commerce remarketing: Target users who visited product pages. Training data: "Which visitors ended up purchasing?" Classifier predicts which new visitors are ready to buy.
B2B lead generation: Target companies with sufficient IT budget. Training data: "Which companies responded to our sales outreach?" Classifier predicts company characteristics that predict responsiveness.
Why It Matters

Accuracy at this stage compounds through the funnel. A 5% improvement in who sees the ad means you reach a higher-quality audience, which makes the next decisions easier and more profitable. This is why large platforms (Google, Meta, LinkedIn) invest heavily in targeting. It's the highest-leverage decision in the funnel.

Decision 2

Which Creative to Show? (Reinforcement Learning — Bandit)

Once you've decided who to show ads to, you must decide which ad to show them. This is a reinforcement learning problem, typically solved with a multi-armed bandit algorithm.

The idea: you have several creative variants (different headlines, images, calls-to-action). Each user sees one variant. You observe whether they clicked/engaged (the reward). Over time, the algorithm learns which variants perform best and allocates more traffic to them.

The key tradeoff is exploration vs. exploitation. If you only show the current best-performing creative (exploitation), you might miss a better creative that hasn't been tested much yet (exploration). The bandit algorithm balances this tradeoff automatically.

The constraint here is learning speed. You need enough traffic through each variant to learn performance differences. If traffic is low, the algorithm takes longer to converge. If traffic is high, you can iterate quickly.
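
A minimal epsilon-greedy sketch of the mechanism. The click-through rates are invented, and production systems typically use Thompson sampling or contextual bandits rather than this bare-bones version.

```python
import numpy as np

rng = np.random.default_rng(7)
true_ctr = [0.030, 0.042, 0.025]   # invented click rates for three creative variants
clicks, shows = np.zeros(3), np.zeros(3)
epsilon = 0.1                      # fraction of traffic reserved for exploration

for _ in range(20_000):            # 20,000 impressions
    if rng.random() < epsilon or shows.min() == 0:
        arm = int(rng.integers(3))                 # explore: try a random creative
    else:
        arm = int(np.argmax(clicks / shows))       # exploit: best observed CTR so far
    shows[arm] += 1
    clicks[arm] += rng.random() < true_ctr[arm]    # did this impression click?

print("Impressions per creative:", shows.astype(int))
print("Observed CTRs:", np.round(clicks / shows, 4))
```

Typically most traffic ends up on the best-performing creative, while the others keep receiving a small exploratory share in case their performance changes.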

Examples
Email subject line testing: Test two subject lines: "50% Off Today" vs. "Last Chance: 50% Off." The bandit algorithm learns which gets more opens and gradually allocates more sends to the winner.
Social media image testing: Test different product images (lifestyle photo vs. product shot). Track engagement and allocate impressions to the winner.
Landing page headlines: Test three headlines on new traffic. The bandit algorithm learns which drives more conversions and reduces traffic to underperformers.
Why It Matters

This is where creativity and data meet. The algorithm doesn't create the creatives—humans do. But the algorithm figures out which humans created something good and accelerates it. Without this feedback loop, companies rely on gut feel or A/B tests that require huge sample sizes. With it, companies can run dozens of tests in parallel and learn quickly.

Decision 3

What Creative to Make? (Generative AI)

Creating ad creatives is expensive and slow: hire a designer, shoot photos, write copy, iterate. Generative AI can help you create variations at scale.

Instead of creating three ad variants and testing them, you can generate 50 variants and test them. Some will be better than the hand-crafted versions. Others will be worse. But at scale, more variants mean more learning.

The constraint here is brand consistency and legal compliance. Generative AI can create fluent-sounding product descriptions, but it can make claims that aren't true. It can generate images that work aesthetically but might inadvertently violate trademarks or copyrights. Guardrails are essential.

Examples
Email copy generation: Give an LLM the product name and key features, ask it to generate five email copy variants in different tones. Review them, pick the best, test with real traffic.
Social media captions: Given a product and target audience, generate multiple captions. Human reviewers pick the ones that fit brand voice. Test the others.
Ad image generation: Generate variations of product photos (different backgrounds, models, lighting). Use the best-performing ones in ads.
Why It Matters

Generative AI trades precision for speed. Hand-crafted creatives are often higher quality but slow and expensive to make. Generated creatives are fast and cheap but lower in average quality. The win: you can make so many variants that even though 80% are mediocre, the 20% that are good outperform the hand-crafted set in aggregate. This requires tight guardrails and human review, but the economics work.

Decision 4

Who to Retarget? (Supervised + RL)

Not everyone who sees an ad converts immediately. Some are interested but not ready. The retargeting decision: which users should you show ads to again, and how often?

This uses supervised learning (predict who is likely to convert if retargeted) and RL (learn what frequency/timing works best). The tradeoff is between winning back interested users and annoying them with too many ads.

The constraint here is customer experience. Show too many retargeting ads, and users get frustrated or block the advertiser. Show too few, and you leave money on the table. The sweet spot differs by product and audience, so the algorithm must learn it.

Examples
E-commerce cart abandonment: Classify users who abandoned carts as "likely to return if reminded." Send one retargeting email after 4 hours, another after 24 hours if the first didn't work. Track the frequency at which users get annoyed (unsubscribe) and back off.
B2B opportunity nurturing: Identify prospects who downloaded a whitepaper but didn't request a demo. Retarget them with case studies and product videos. Learn the optimal email cadence.
App reengagement: Identify inactive users (haven't logged in 30 days). Send push notifications. Learn the optimal frequency before users opt out.
Why It Matters

Retargeting is one of the highest-ROI marketing decisions because it's addressing warm prospects, not cold ones. But it's also high-risk if executed poorly (users hate being stalked with ads). The combination of supervised learning (who to target) and RL (how often) is necessary to get both the targeting and the frequency right.

Decision 5

How to Qualify the Lead? (All Three Types)

At the bottom of the funnel, you have prospects ready to talk to sales. The lead qualification decision requires all three ML types working together.

Supervised learning predicts which prospects are most likely to buy (lead scoring). Reinforcement learning learns the optimal escalation policy (when to escalate to a human vs. continue with automation). Generative AI generates personalized responses and follow-ups.

The constraint here is precision. A low-value prospect treated as high-value wastes sales time. A high-value prospect treated as low-value loses revenue. At this stage, accuracy matters more than at earlier stages.
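
A toy sketch of how a lead score and deal size might combine into a routing policy. The thresholds are invented; in practice they are tuned, often by the reinforcement learning layer, against observed conversion outcomes.

```python
def route_lead(score, deal_size):
    """score: predicted probability of purchase; deal_size: expected contract value."""
    expected_value = score * deal_size
    if expected_value > 50_000 or deal_size > 250_000:
        return "escalate_to_rep"      # high stakes: human time is cheap relative to the deal
    if score > 0.3:
        return "automated_nurture"    # warm, but not yet worth rep time
    return "low_touch_email"

print(route_lead(score=0.6, deal_size=500_000))   # -> escalate_to_rep
print(route_lead(score=0.4, deal_size=20_000))    # -> automated_nurture
```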

Examples
Chatbot qualification: A prospect arrives at the website and chats with a bot. Supervised learning scores the prospect. RL learns when to escalate to a human (high score = faster escalation, low score = try automated nurture). Generative AI personalizes the conversation.
Sales email sequencing: Supervised learning ranks prospects by likelihood to buy. RL learns optimal email timing (Monday vs. Friday, morning vs. evening). Generative AI personalizes content for each prospect.
Demo scheduling: Supervised learning identifies prospects most likely to show up and buy. RL learns optimal demo timing (next day vs. next week). Generative AI pre-qualifies questions and prepares the sales rep.
Why It Matters

At this stage, all three types of ML are relevant because the decision is complex. You're not just answering "who is interested?" (supervised) or "which message works best?" (RL). You're asking "which prospect is valuable and should get human attention?" That requires all three: prediction, learning, and personalization.

The Pattern: Start with Bottleneck, Not Tools

Across all five decisions, the pattern is the same. The marketing leaders who deploy AI successfully do not start by asking "which AI tool should we use?" They start by asking "where does our funnel break?" Maybe your targeting is too broad and wastes ad spend. Maybe your creative rotation is stale. Maybe your qualification process is 80% manual. Once you know the bottleneck, the ML type becomes obvious. And you avoid the mistake of treating all decisions as identical.

Resources for This Section

Article Google Ads AI Features Update (Google Blog)

Current overview of how AI powers targeting, creative selection, and optimization in Google Ads with recent feature updates.

Short Read Optimize Customer Engagement with Reinforcement Learning (AWS)

An intuitive introduction to how bandit algorithms work in advertising and experimentation.

Case Study Automatically Qualify Leads Using Workflows (Intercom Help)

A real example of supervised learning applied to lead qualification in B2B SaaS.

Part 6

The Core Principle: Constraint Specification and Reward Alignment

Constraint · Optimization Target · Reward Specification · Misalignment · Metric vs. Goal · Feedback Loop

Here is the deepest lesson from Day 2: the model optimizes what you measure, not what you care about. This principle runs through every AI decision. Get the measurement wrong, and the system succeeds by the metric you wrote down while failing at what you actually want.

Concept 1

Optimization ≠ Alignment

An AI system can be perfectly optimized (performing exactly as you told it to) while failing at what you actually care about. This happens when the metric you optimized for is not aligned with your real goal.

Example: You build a content recommendation system and optimize for "time on site." The algorithm learns which content keeps users scrolling longest. It discovers that outrage, fear, and misinformation keep users engaged longer than balanced, accurate information. Result: the system is optimized perfectly, but it's promoting garbage.

Why does this happen? Because "time on site" was easy to measure. "User satisfaction" or "user learning" is harder to measure. So you picked the easy metric. The system optimized it. And you got the unintended consequences.

This is not a failure of the algorithm. The algorithm did exactly what you asked. It's a failure of constraint specification—you specified the wrong constraint.

Examples
Social media engagement: Optimize for "likes and shares." Result: sensational, controversial content thrives. Real goal: meaningful conversation. Metrics are misaligned.
Customer service chatbot: Optimize for "first-response resolution rate." Result: the chatbot avoids escalating complex issues to humans, making them worse. Real goal: customer satisfaction. Metrics are misaligned.
Ad targeting: Optimize for "click-through rate." Result: the system learns to show ads to people who click on anything (easily distracted). Real goal: show ads to people likely to buy. Metrics are misaligned.
Why It Matters

This explains why many AI deployments "work" (the metrics improve) but create negative outcomes. The model isn't broken. The metric is wrong. Fixing this requires going upstream: before you build the model, define what you actually care about, and figure out how to measure it, even if imperfectly. This is a marketing question, not a data science question.

Concept 2

The Measurement Problem: What You Can vs. What You Care About

The fundamental problem: what you can measure easily is often not what you care about.

Easy to measure: clicks, views, time-on-page, immediate conversions, revenue per user this quarter.

Hard to measure: customer lifetime value, long-term satisfaction, learning, brand trust, whether the product actually solved the customer's problem, whether customers would recommend you a year from now.

Because hard-to-measure things are hard to measure, many systems optimize for easy-to-measure proxies. These proxies are often correlated with what you care about, but the correlation breaks down at scale. Optimize the proxy hard enough, and you decouple from the goal.

The solution is to measure imperfectly but honestly. Run surveys. Sample interactions. Manually review a subset of decisions. Use these imperfect measurements as your north star, even if they're noisy and expensive. Let the easy metrics be secondary checks, not primary goals.
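
One lightweight way to keep the honest metric in the loop is a divergence check: track the easy metric continuously and the sampled, human-rated metric periodically, and flag periods where they move in opposite directions. The numbers below are invented.

```python
# Easy metric: lead count. Honest metric: average quality of a hand-reviewed sample (1-5).
history = [
    {"month": "Jan", "leads": 1_800, "sampled_quality": 3.9},
    {"month": "Feb", "leads": 2_300, "sampled_quality": 3.6},
    {"month": "Mar", "leads": 2_900, "sampled_quality": 3.1},
]

for prev, cur in zip(history, history[1:]):
    if cur["leads"] > prev["leads"] and cur["sampled_quality"] < prev["sampled_quality"]:
        print(f"{cur['month']}: lead count up, sampled quality down -> review the targeting model")
```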

Examples
Lead quality: Easy to measure: number of leads. Hard to measure: quality of leads (will they buy?). Solution: measure both. Use lead count as output metric but sample leads monthly and manually rate quality. If quality drops while count rises, adjust the targeting model.
Customer support: Easy to measure: tickets resolved per agent. Hard to measure: customer satisfaction. Solution: measure resolution rate as output metric, but survey a sample of customers monthly. If satisfaction drops, it signals you're gaming the resolution metric.
Content personalization: Easy to measure: click-through rate. Hard to measure: whether the content was actually useful. Solution: measure CTR as output, but periodically ask customers whether recommended content was valuable. Adjust if opinion diverges from CTR.
Why It Matters

This is the guard against becoming a victim of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." If you only optimize for what is easy to measure, you will eventually optimize yourself into a corner. The solution is to keep harder, more honest measurements in the loop, even if they are expensive or noisy.

Concept 3

Constraint Specification: Making Your Goals Computable

The deepest question: how do you turn a goal (like "happy customers") into a constraint that an AI system can actually optimize?

Sometimes, the answer is "you can't, exactly." In those cases, you specify constraints that correlate with the goal and accept the imperfection.

Example: Your goal is "customers should only see products they can afford." How do you constrain an AI system to do this? You could set a hard rule: "Never recommend products above the customer's available budget." That's clear and computable. Or you could be softer: "Recommend a range: 20% below, current spend, 20% above." That allows some flexibility but still constrains the system.

The key is making the constraint explicit. If you don't specify constraints, the system will optimize for whatever metric you gave it, with no guardrails.
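
The budget example translates directly into a small, auditable filter. A toy sketch; the field names and the 20% band are the illustrative assumptions from this concept.

```python
def within_budget_band(products, current_spend, band=0.20):
    """Soft version of the constraint: only recommend items within +/-20% of current spend."""
    low, high = current_spend * (1 - band), current_spend * (1 + band)
    return [p for p in products if low <= p["price"] <= high]

catalog = [
    {"name": "entry drill", "price": 79},
    {"name": "mid-range drill", "price": 129},
    {"name": "pro drill kit", "price": 349},
]
print(within_budget_band(catalog, current_spend=120))
# -> only the mid-range drill; the constraint is explicit, inspectable, and easy to adjust
```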

Examples
Fairness constraint: Goal: "Lending decisions should be fair." Constraint: "Don't use gender, race, or protected variables as features." This is computable but incomplete (these variables can be inferred from proxies). You must monitor for proxy discrimination.
Frequency constraint: Goal: "Retargeting shouldn't annoy customers." Constraint: "Never show the same customer more than 3 ads per day." This is computable but arbitrary (what if the customer wants 5 ads?). Imperfect but better than no constraint.
Accuracy constraint: Goal: "Recommendations should be accurate." Constraint: "Only recommend products with similarity score > 0.8." This is computable but conservative (maybe 0.7 would work better). You must tune the threshold based on feedback.
Why It Matters

This is your job as a marketer or product manager. Data scientists can build the model. But you must specify what "good" means. What constraints matter? What is the tradeoff between optimization and guardrails? Getting this wrong is more costly than getting the algorithm wrong. Because an algorithm can be retuned. But bad constraints are baked into the system's behavior from day one.

Can't We Just Use Multiple Metrics?

Yes and no. You can optimize for multiple metrics simultaneously, but the tradeoffs are real. If you optimize for both "resolution speed" and "customer satisfaction," the system can learn to game both at once: resolve queries quickly and steer customers toward a positive survey response, even when the underlying problem isn't solved. You must weight the tradeoffs explicitly or the system will find unintended ways to satisfy both metrics without satisfying your real goal.

Shouldn't the AI System Figure Out What Matters?

No. The system can't know what matters to you. It can only optimize what you ask it to optimize. Even reinforcement learning systems that learn from feedback are learning what the feedback signal is, not what you actually care about. If you reward the system for metrics that aren't aligned with your goal, it will optimize those metrics perfectly while failing at what you care about.

Resources for This Section

Classic Goodhart's Law (Wikipedia)

The foundational principle: when a measure becomes a target, it ceases to be a good measure. Essential reading for anyone building metrics for AI systems.

Article Reward Hacking (Wikipedia)

How systems get warped when metrics diverge from goals, with examples of metric misalignment and optimization failures.

All Resources at a Glance

Topic · Resource · Format
AI-Enabled Customer Engagement · The Next Frontier of Customer Engagement (McKinsey) · Article
Human-AI Mix in Customer Service · The Contact Center Crossroads (McKinsey) · Article
Generative AI in Customer Service · How Generative AI Transforms Customer Service (BCG) · Case Study
AI Agent Escalation Design · Managing AI Agent Escalation (Intercom Help) · Practical Guide
Chatbot Personality · Does Bot Personality Matter? (Computers in Human Behavior) · Research
Conversational AI Experience Design · Conversational AI Experience Design (Intercom Help) · Practical Guide
Tokenization · OpenAI Tokenizer · Interactive
Attention Mechanism · Attention Is All You Need Explained (3Blue1Brown) · Video · 15 min
Transformer Architecture · Attention Is All You Need (Vaswani et al., 2017) · Classic Paper
Contextual Retrieval · Contextual Retrieval (Anthropic) · Article
Function Calling · Function Calling in the OpenAI API (OpenAI Help) · Short Read
Hallucination Survey · Hallucination in Large Language Models (Research) · Research
Reducing Hallucination · Contextual Retrieval (Anthropic) · Article
Why LLMs Hallucinate · Why Language Models Hallucinate (OpenAI) · Article
Google Ads AI Features · Google Ads AI Features Update (Google Blog) · Article
Reinforcement Learning for Engagement · Optimize Customer Engagement with RL (AWS) · Short Read
Lead Qualification Workflows · Automatically Qualify Leads Using Workflows (Intercom Help) · Practical Guide
Goodhart's Law · Goodhart's Law (Wikipedia) · Classic
Reward Hacking · Reward Hacking (Wikipedia) · Article