AI, ML, and Targeting: Why We Still Need Experiments
Machine learning can identify patterns in data and encode them as targeting rules. This is useful. But it does not replace the marketer's job of deciding who to target, why, and whether acting on those patterns creates actual value. And it does not replace the need for experiments. This section explains why.
The Algorithm Does the Learning. The Constraints Do the Governing.
From Day 1, an AI system is a pipeline: perception acquires raw input; representation encodes it into something computation can act on; a model defines a simplified structure; an algorithm is the procedure that produces outputs; constraints are the rules that limit which outputs are allowable; and behavior is what the system does in the world.
A supervised learning algorithm for targeting sits at the algorithm layer. It is a procedure that learns weights from labeled data (which customers converted, which did not) and uses those weights to score new customers. That is all it does. The algorithm has no access to the constraints layer. It does not know whether acting on its scores is appropriate, how many times a customer can be contacted, whether a discount will cannibalize a conversion that was already coming, or whether targeting a particular segment raises legal or ethical issues. Those rules have to be specified separately and encoded as constraints before the pipeline produces behavior.
A well-trained algorithm and a well-designed system are not the same thing. The algorithm layer can be strong while the constraints layer is empty, underspecified, or misaligned with actual business goals. Deciding who to target, on what basis, with what action, and subject to what limits are managerial decisions. They belong in the constraints layer, and they require human judgment to specify.
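The layer separation can be sketched in a few lines of code. Everything here is illustrative, not from the text: the feature names, weights, threshold, and business rules are invented stand-ins for a learned model and for human-specified constraints.

```python
# Illustrative pipeline: the algorithm layer scores, the constraints
# layer governs. Feature names, weights, and rules are all invented.

def score(customer):
    # Algorithm layer: stand-in for a learned model's output.
    return 0.4 * customer["recency"] + 0.6 * customer["frequency"]

def allowed(customer, max_contacts=3):
    # Constraints layer: rules the algorithm has no access to.
    if customer["contacts_this_month"] >= max_contacts:
        return False                      # contact-frequency cap
    if customer["opted_out"]:
        return False                      # consent / legal rule
    return True

def target(customers, threshold=0.5):
    # Behavior: algorithm output filtered through the constraints.
    return [c["id"] for c in customers if score(c) > threshold and allowed(c)]

customers = [
    {"id": 1, "recency": 0.9, "frequency": 0.8, "contacts_this_month": 1, "opted_out": False},
    {"id": 2, "recency": 0.9, "frequency": 0.8, "contacts_this_month": 5, "opted_out": False},
    {"id": 3, "recency": 0.2, "frequency": 0.1, "contacts_this_month": 0, "opted_out": False},
]
print(target(customers))  # → [1]: customer 2 scores identically but is capped out
```

A strong score is necessary but not sufficient: customer 2 is indistinguishable at the algorithm layer and is excluded only because someone specified the constraint.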
Prediction Is Not Causation: The Incremental Value Problem
A targeting algorithm trained to predict who will convert does not answer the question a marketer needs answered: who will convert because of the intervention? These are different questions. A customer who was going to purchase regardless of whether they received a promotion is correctly predicted as a converter. Sending them a discount does not generate a sale; it reduces the margin on a sale that was already coming.
The right objective for a targeting algorithm is not predicted conversion but predicted lift: the difference in outcome between receiving the treatment and not receiving it. This requires estimating a counterfactual for each customer. In the exercise, you used simple rules derived from the experimental data. In practice, methods like uplift modeling and causal forests are designed specifically for this task. Rather than predicting outcomes, they estimate individual-level treatment effects: for this customer, how much does the intervention change the probability of conversion?
The input those methods require is experimental data with randomized treatment assignment. Randomization is what makes the counterfactual estimable: because treatment was assigned by chance, the control group is a valid comparison for the treatment group, and you can attribute differences in outcomes to the intervention rather than to pre-existing differences between customers.
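A minimal sketch of why randomization makes lift estimable, on simulated data. All numbers are invented for illustration: only "engaged" customers respond to treatment (+0.15 conversion probability), and the segment-level comparison of treated vs control recovers that effect precisely because assignment was random.

```python
# Simulated Experiment 1 (all numbers invented): treatment is randomized,
# and only "engaged" customers actually respond to it (+0.15 conversion).
import random
random.seed(42)

def simulate(n=40000):
    rows = []
    for _ in range(n):
        engaged = random.random() < 0.5      # observable feature
        treated = random.random() < 0.5      # randomized assignment
        p = 0.10 + (0.15 if (treated and engaged) else 0.0)
        rows.append((engaged, treated, random.random() < p))
    return rows

def lift_by_segment(rows):
    # Within each feature value, compare treated vs control conversion.
    # Randomization is what makes this difference causal.
    out = {}
    for seg in (True, False):
        treated = [c for e, t, c in rows if e == seg and t]
        control = [c for e, t, c in rows if e == seg and not t]
        out[seg] = sum(treated) / len(treated) - sum(control) / len(control)
    return out

lift = lift_by_segment(simulate())
# lift[True] recovers roughly the 0.15 effect for engaged customers;
# lift[False] sits near zero, so a rule targeting only engaged
# customers captures essentially all the incremental value.
```

This per-segment difference is the simplest form of uplift estimation; uplift models and causal forests generalize the same idea to many features at once.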
The Two-Experiment Structure: Learning a Rule Is Not the Same as Validating a Policy
Developing and deploying a data-driven targeting policy requires two conceptually distinct experiments, not one. Understanding why requires being clear about what each experiment does and what it cannot tell you.
Experiment 1 runs before any targeting. Customers are randomly assigned to receive the intervention or not. This randomization serves two purposes simultaneously. First, it generates causal estimates of the treatment effect: because assignment is random, differences in outcomes between the treated and control groups can be attributed to the intervention. Second, it generates the training data for the targeting algorithm. Within this experimental dataset, you split training and validation sets to learn and tune a model of who responds. The result is a targeting rule: a specification of which customers to treat in a future campaign.
The oracle problem is what makes a second experiment necessary. Even after Experiment 1, you cannot observe both potential outcomes for any individual customer. You know what each person did under the condition they were assigned to; you do not know what they would have done under the other condition. This means the targeting rule you learned is a prediction of who will respond, not a guarantee. Applied to a new population at a future time, you do not actually know whether the customers your rule selects will produce lift, or whether they would have converted anyway.
Experiment 2 is how you answer that question. Once you have a targeting rule, you deploy it to a new population but randomly withhold the intervention from a subset of the customers the rule would have targeted. This creates a valid comparison: targeted customers who received the intervention versus targeted customers who did not. The difference in outcomes between these two groups is the actual incremental effect of the policy. Not the model's prediction of it. The measured outcome.
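The mechanics of Experiment 2 can be sketched under invented numbers: the rule, the 20% holdout rate, and the conversion probabilities below are assumptions, not figures from the text. The design is the point: withhold treatment at random from a subset of the customers the rule selects, then compare outcomes within that selected population.

```python
# Sketch of Experiment 2 (numbers invented): deploy the rule, withhold
# treatment from a random 20% of the customers it selects, compare outcomes.
import random
random.seed(1)

def rule(customer):
    # Targeting rule learned in Experiment 1 (illustrative stand-in).
    return customer["engaged"]

def run_experiment2(n=50000, holdout_rate=0.2):
    treated, holdout = [], []
    for _ in range(n):
        customer = {"engaged": random.random() < 0.5}
        if not rule(customer):
            continue                      # not selected by the rule: out of scope
        if random.random() < holdout_rate:
            # Would have been targeted; intervention randomly withheld.
            holdout.append(random.random() < 0.10)
        else:
            treated.append(random.random() < 0.25)
    # Measured incremental effect of the policy on the people it targets:
    return sum(treated) / len(treated) - sum(holdout) / len(holdout)

lift = run_experiment2()  # close to the simulated true effect (0.15) by construction
```

Note that the holdout comes from inside the targeted group. Comparing targeted-and-treated customers to customers the rule rejected would confound the rule's selection with the treatment's effect.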
Experiment 1 gives you a rule. Experiment 2 tells you whether the rule works.
Experiment 1 gives you average treatment effects and a model of heterogeneity. It does not tell you whether the targeting rule generalizes to a new population, a new time period, or a new context. Customer behavior shifts, and a rule learned on past data may select different people than intended when deployed. Experiment 2 measures the policy's actual performance in the deployment context, which is a different question than how well the model fit the training data.
Strong model fit does not make Experiment 2 unnecessary. Model fit measures how well the algorithm's predictions matched outcomes in the training data; it does not measure incremental lift in deployment. A model with high predictive accuracy can still produce zero lift if the customers it selects would have converted regardless. Accuracy and lift are independent quantities. Only an experiment that includes a holdout group can measure lift directly.
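A deterministic toy example of this independence, using an oracle's view of both potential outcomes (unobservable in practice, which is why the scenario is hypothetical): the model is perfectly accurate, yet every customer it selects is a "sure thing" who converts either way.

```python
# Oracle view (hypothetical): each customer's two potential outcomes,
#   (converts_if_treated, converts_if_untreated).
customers = [(1, 1), (1, 1), (1, 1), (0, 0), (0, 0)]

# A model that perfectly predicts who converts when treated:
predicted = [1, 1, 1, 0, 0]

actual_if_treated = [t for t, u in customers]
accuracy = sum(p == a for p, a in zip(predicted, actual_if_treated)) / len(customers)

# Incremental lift of treating everyone the model selects:
selected = [c for c, p in zip(customers, predicted) if p]
lift = sum(t - u for t, u in selected) / len(selected)

print(accuracy, lift)  # → 1.0 0.0
```

Perfect accuracy, zero lift: every discount goes to a sale that was already coming.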
Resources for This Section
A Refresher on A/B Testing (HBR). A clear primer on why controlled experiments are necessary to establish causal claims, including the logic of holdout groups and incrementality testing.
Uplift Modeling: A Guide from Booking.com. Explains the difference between predicting who will convert and predicting who will convert because of an intervention, which is the core question in incrementality-focused targeting.
Rubin Causal Model (Wikipedia). The formal framework behind the counterfactual concept. Introduces potential outcomes notation, the language used to define what an oracle would actually tell you.
Algorithmic Fairness: Should We Segment by Demographics?
Marketers are trained to segment. Finding groups of customers who differ in their preferences, behaviors, or responses to marketing actions is the foundation of how we target and personalize. But when machine learning systems are used to make these decisions at scale, differential treatment can become systematically unfair in ways that are not obvious, and that are not fixed by simply removing protected variables from the model. This section covers the three main criteria used to evaluate algorithmic fairness, the mathematical constraint that limits how you can apply them, and what these criteria do and do not capture.
The Segmentation Question: When Does Differentiation Become a Fairness Problem?
Marketing optimization naturally produces differential treatment. A targeting algorithm that identifies who gets a discount, a credit offer, or a job ad will, by design, treat different people differently. That is the point.
The fairness question is not whether to differentiate; it is on what basis and with what consequences. Some bases for differentiation are uncontroversial: targeting people who recently searched for a product is efficient and generally seen as acceptable. Other bases are legally protected and morally contested: targeting based on race, gender, or age is restricted in lending, hiring, and housing in most jurisdictions.
The practical challenge is that even when protected attributes are excluded from a model, they can enter through correlated variables. A model trained on zip code, purchase history, or name-based features may effectively differentiate by race or gender without ever using those variables directly. Removing a variable from a model does not remove its influence when that variable is correlated with variables that remain.
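A small simulation of the proxy problem. The variable names and distributions are invented; the mechanism is what matters: the model never sees the protected attribute, yet selection rates diverge sharply because a retained feature is correlated with it.

```python
# Simulated proxy effect (all names and distributions invented).
import random
random.seed(7)

rows = []
for _ in range(20000):
    protected = random.random() < 0.5          # protected attribute (never shown to model)
    # A retained feature (say, a zip-code-derived score) correlated
    # with the protected attribute:
    zip_score = random.gauss(1.0 if protected else 0.0, 0.5)
    rows.append((protected, zip_score))

# The model selects purely on zip_score:
selected = [(g, z > 0.5) for g, z in rows]
rate_1 = sum(s for g, s in selected if g) / sum(1 for g, s in selected if g)
rate_0 = sum(s for g, s in selected if not g) / sum(1 for g, s in selected if not g)
# rate_1 lands far above rate_0: excluding the protected attribute
# did not prevent sharply different selection rates across groups.
```

With these invented distributions, roughly 84% of one group clears the threshold versus roughly 16% of the other, even though group membership was never an input.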
Three Fairness Criteria
There is no single agreed-upon definition of algorithmic fairness. Researchers and practitioners have formalized three main criteria, each of which captures a different intuition about what "equal treatment" means. Understanding all three, including what each one requires and misses, is important before deploying any algorithm that makes decisions affecting people.
Independence
Equal outcomes across groups.
The algorithm's decisions are independent of group membership. The rate at which people receive a positive decision (a loan, a job ad, a coupon) is the same across groups.
What it requires: Selection rates that are equal across groups, regardless of any differences in the groups' underlying characteristics.
What it misses: Groups may genuinely differ in the characteristic being predicted. Forcing equal selection rates can reduce accuracy and may require overriding real signal.
Separation
Equal errors across groups.
The algorithm's error rates are equal across groups, conditional on the true outcome. The false positive rate and the false negative rate are the same for each group.
What it requires: Among people whose true outcome is positive, the same fraction in each group receives a positive decision; among those whose true outcome is negative, the same fraction does not.
What it misses: Requires knowing the true outcome, which may itself reflect historical bias.
Sufficiency
Equal meaning of scores across groups.
Among all people assigned the same score or decision, the actual outcome is the same regardless of group membership. The score means the same thing for everyone.
What it requires: Predictive accuracy is equally calibrated across groups.
What it misses: Can still produce very different selection rates across groups if base rates differ.
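All three criteria can be audited with a few lines of arithmetic. The toy dataset below is constructed by hand (it does not come from any real system) so that Sufficiency holds while Independence and Separation fail, which previews the tension discussed next.

```python
# Toy audit of the three group-fairness criteria. Each record is
# (group, selected, true_outcome); counts are chosen so base rates differ.
rows = (
    [("A", 1, 1)] * 40 + [("A", 1, 0)] * 10    # group A, selected
    + [("A", 0, 1)] * 10 + [("A", 0, 0)] * 40  # group A, not selected
    + [("B", 1, 1)] * 16 + [("B", 1, 0)] * 4   # group B, selected
    + [("B", 0, 1)] * 16 + [("B", 0, 0)] * 64  # group B, not selected
)

def audit(rows, group):
    g = [(s, y) for grp, s, y in rows if grp == group]
    sel = sum(s for s, y in g) / len(g)                  # Independence: selection rate
    pos = [s for s, y in g if y]
    neg = [s for s, y in g if not y]
    tpr, fpr = sum(pos) / len(pos), sum(neg) / len(neg)  # Separation: error rates
    chosen = [y for s, y in g if s]
    ppv = sum(chosen) / len(chosen)                      # Sufficiency: calibration
    return sel, tpr, fpr, ppv

sel_a, tpr_a, fpr_a, ppv_a = audit(rows, "A")
sel_b, tpr_b, fpr_b, ppv_b = audit(rows, "B")
# PPV is 0.8 for both groups (Sufficiency holds), but selection rates
# (0.5 vs 0.2) and TPRs (0.8 vs 0.5) differ, because base rates differ.
```

A positive decision "means" the same thing in both groups (80% of selected people truly convert), yet group B is selected far less often and its true positives are missed more often.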
The Impossibility Result: You Cannot Satisfy All Three at Once
A central and counterintuitive result in algorithmic fairness is that when base rates differ across groups (that is, when the underlying outcome being predicted is more common in one group than another), it is mathematically impossible to satisfy Independence, Separation, and Sufficiency simultaneously. You can satisfy at most two, and often only one.
This is not a design flaw that better engineering can fix. It is an arithmetic constraint. Chouldechova (2017) and Kleinberg et al. (2016) proved this formally. The intuition is that each criterion asks something different to be equal, and when the underlying distributions differ, you cannot equalize everything at once without contradiction.
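One way to see the arithmetic, using standard confusion-matrix quantities (this notation is not from the text): for a group with base rate $p$, the positive predictive value, true positive rate, and false positive rate are bound together by an identity.

```latex
% PPV (Sufficiency) in terms of TPR, FPR (Separation) and base rate p:
\mathrm{PPV} \;=\; \frac{\mathrm{TPR}\cdot p}{\mathrm{TPR}\cdot p + \mathrm{FPR}\cdot(1-p)}
```

If two groups share the same TPR and FPR (Separation) and the same PPV (Sufficiency), then for nondegenerate rates the identity pins down the same base rate $p$ for both. So when base rates differ, at least one of these quantities must differ across groups.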
The consequence is that choosing a fairness criterion is a normative decision, not a technical one. It is a choice about what should be equal: outcomes, errors, or the meaning of scores. There is no neutral default. Deploying an algorithm without specifying a criterion means the criterion was chosen by default, implicitly, by whoever set the objective function.
Partial satisfaction does not resolve the underlying conflict; it obscures it. If base rates differ, improving performance on one criterion moves you away from another. The tradeoff is real and cannot be engineered away. What you can do is make the tradeoff explicit: decide which criterion matters for this application, document the choice, and monitor the consequences.
Exclusion does not eliminate disparate impact when demographic variables are correlated with the features that remain. A model trained on purchase history, browsing behavior, or geographic data will often reproduce demographic patterns without ever seeing demographic inputs. Fairness cannot be achieved by omission alone; it requires actively choosing a criterion and designing the system accordingly.
What These Criteria Do and Do Not Cover
The three criteria above are all examples of group fairness: they ask whether decisions are equitable at the level of demographic groups. Group fairness is the framework most often referenced in regulatory and policy discussions, in part because it is measurable with aggregate data. But it is not the only framework.
Individual fairness asks a different question: are similar individuals treated similarly? Two people with nearly identical profiles should receive nearly identical decisions, regardless of their group membership. This is intuitively appealing but technically harder; it requires defining what "similar" means for individuals, which is itself a contested question.
Causal fairness asks whether group membership is a causal driver of the disparity, or whether it is a proxy for something legitimate. A model that produces different outcomes for men and women in a domain where gender genuinely predicts the outcome may or may not be unfair, depending on whether gender is driving the outcome causally or merely correlating with something else. Answering that requires causal modeling, not just statistical parity checks.
The three group fairness criteria are a starting point, not a complete framework. They are the right tools for auditing whether a deployed system produces systematically different outcomes across protected groups. They are not sufficient to answer every fairness question that a marketer or product manager will encounter.
Marketing optimization and algorithmic fairness pull in different directions. An algorithm optimized for revenue will target whoever responds reliably, and those segments may systematically exclude protected groups, not through intent but through the correlation structure of the data. Understanding the three criteria, the impossibility result, and the limits of each framework gives you the tools to ask the right questions before deployment, not after.
Resources for This Section
Fairness and Machine Learning (Barocas, Hardt & Narayanan). The standard reference on algorithmic fairness. Free online. Chapters 2 and 3 cover the three criteria and the impossibility results in detail, with formal proofs and applied examples.
Fair Prediction with Disparate Impact (Chouldechova, 2017). The formal proof that calibration and equal error rates cannot simultaneously hold when base rates differ. The paper that established the impossibility result most clearly for practitioners.
Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg et al., 2016). A parallel impossibility proof showing that Independence, Separation, and Sufficiency cannot all be satisfied simultaneously. Companion to Chouldechova's result.
Machine Bias (ProPublica). The investigation that sparked the modern public debate on algorithmic fairness, examining a recidivism risk tool used in U.S. courts. Illustrates the Independence vs. Sufficiency tension with real data.
All Resources at a Glance
| Topic | Resource | Format |
|---|---|---|
| A/B Testing & Incrementality | A Refresher on A/B Testing (HBR) | Article |
| Uplift Modeling | Uplift Modeling: A guide from Booking.com | Article |
| Counterfactual Reasoning | Rubin Causal Model (Wikipedia) | Classic |
| Algorithmic Fairness (full reference) | Fairness and Machine Learning (Barocas, Hardt & Narayanan) | Textbook (free) |
| Impossibility Result | Fair Prediction with Disparate Impact (Chouldechova, 2017) | Research |
| Impossibility Result | Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg et al., 2016) | Research |
| Fairness in Practice | Machine Bias (ProPublica) | Article |