MARK6582 · AI & Marketing · Georgetown University

Day 3A Study Guide

AI Targeting, Experimentation, and Algorithmic Fairness
Part 1

AI, ML, and Targeting: Why We Still Need Experiments

Targeting Policy · Targeting Rule · Incremental Effect · Lift · Counterfactual · Oracle · Experimental Evaluation · Supervised Learning

Machine learning can identify patterns in data and encode them as targeting rules. This is useful. But it does not replace the marketer's job of deciding who to target, why, and whether acting on those patterns creates actual value. And it does not replace the need for experiments. This section explains why.

Concept 1

The Algorithm Does the Learning. The Constraints Do the Governing.

From Day 1, an AI system is a pipeline: perception acquires raw input, representation encodes it into something computation can act on, a model defines a simplified structure, an algorithm is the procedure that produces outputs, constraints are the rules that limit which outputs are allowable, and behavior is what the system does in the world.

A supervised learning algorithm for targeting sits at the algorithm layer. It is a procedure that learns weights from labeled data (which customers converted, which did not) and uses those weights to score new customers. That is all it does. The algorithm has no access to the constraints layer. It does not know whether acting on its scores is appropriate, how many times a customer can be contacted, whether a discount will cannibalize a conversion that was already coming, or whether targeting a particular segment raises legal or ethical issues. Those rules have to be specified separately and encoded as constraints before the pipeline produces behavior.

A well-trained algorithm and a well-designed system are not the same thing. The algorithm layer can be strong while the constraints layer is empty, underspecified, or misaligned with actual business goals. Deciding who to target, on what basis, with what action, and subject to what limits are managerial decisions. They belong in the constraints layer, and they require human judgment to specify.

Example
Streaming service: A supervised learning algorithm trained on conversion data identifies that users who add three or more titles to their watchlist in their first week are high-probability converters. The algorithm layer is working. But the pipeline has no constraints layer yet. Nothing specifies whether to contact these users, what action to take, how many times per week they can be messaged, or whether offering them a discount reduces margin on a conversion that would have happened anyway at full price. Before this algorithm produces any behavior, those rules have to be written by a person.
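
To make the layer split concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `score_customer` stands in for a trained supervised model, and the `CONSTRAINTS` dict encodes the rules a person has to write before the pipeline is allowed to produce behavior.

```python
def score_customer(features):
    # Algorithm layer: a toy stand-in for a learned scoring model.
    # Here, watchlist adds in week one drive the conversion score.
    return min(1.0, features["watchlist_adds_week1"] / 5)

# Constraints layer: none of this is learned from data. A person
# specifies who may be contacted, how often, and with what action.
CONSTRAINTS = {
    "min_score": 0.6,            # only contact high-probability converters
    "max_contacts_per_week": 2,  # frequency cap
    "discount_allowed": False,   # avoid cutting margin on likely converters
}

def decide_action(features, contacts_this_week):
    """Apply the constraints to a score to produce allowable behavior."""
    score = score_customer(features)
    if score < CONSTRAINTS["min_score"]:
        return "no_contact"
    if contacts_this_week >= CONSTRAINTS["max_contacts_per_week"]:
        return "no_contact"
    return ("message_with_discount" if CONSTRAINTS["discount_allowed"]
            else "message_no_discount")
```

Note that changing any entry in `CONSTRAINTS` changes the system's behavior without retraining anything: the algorithm layer is untouched, which is the point of the separation.
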
Concept 2

Prediction Is Not Causation: The Incremental Value Problem

A targeting algorithm trained to predict who will convert does not answer the question a marketer needs answered: who will convert because of the intervention? These are different questions. A customer who was going to purchase regardless of whether they received a promotion is correctly predicted as a converter. Sending them a discount does not generate a sale; it reduces the margin on a sale that was already coming.

The right target for a targeting algorithm is not predicted conversion but predicted lift: the difference in outcome between receiving the treatment and not receiving it. This requires estimating a counterfactual for each customer. In the exercise, you used simple rules derived from the experimental data. In practice, methods like uplift modeling and causal forests are designed specifically for this task. Rather than predicting outcomes, they estimate individual-level treatment effects: for this customer, how much does the intervention change the probability of conversion?

The input those methods require is experimental data with randomized treatment assignment. Randomization is what makes the counterfactual estimable: because treatment was assigned by chance, the control group is a valid comparison for the treatment group, and you can attribute differences in outcomes to the intervention rather than to pre-existing differences between customers.

Example
Hotel loyalty program: A hotel chain trains a model on past booking data. The model identifies members who have recently searched for properties as highly likely to book, and it is right. The team sends them a bonus points offer. Bookings from this segment are strong and the campaign appears successful. But a holdout analysis finds the booking rate among untargeted members in the same segment is nearly identical. These customers were going to book anyway. The model was accurate. The lift was zero. Predicting conversion and predicting the effect of an intervention are different problems, and they require different methods.
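
The holdout arithmetic above can be sketched in a few lines. The records and field names (`treated`, `converted`) are made up for illustration; the point is that the same computation that reveals real lift also reveals its absence.

```python
def lift(records):
    """Conversion-rate difference, treatment minus control.
    A causal estimate only because treatment was randomized."""
    treated = [r for r in records if r["treated"]]
    control = [r for r in records if not r["treated"]]
    rate = lambda group: sum(r["converted"] for r in group) / len(group)
    return rate(treated) - rate(control)

# The hotel example in miniature: targeted members book at 80%,
# but untargeted members in the same segment book at 80% too.
same_either_way = (
    [{"treated": True,  "converted": c} for c in [1, 1, 1, 1, 0]] +
    [{"treated": False, "converted": c} for c in [1, 1, 1, 1, 0]]
)
print(lift(same_either_way))  # 0.0: accurate prediction, zero lift
```
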
Concept 3

The Two-Experiment Structure: Learning a Rule Is Not the Same as Validating a Policy

Developing and deploying a data-driven targeting policy requires two conceptually distinct experiments, not one. Understanding why requires being clear about what each experiment does and what it cannot tell you.

Experiment 1 runs before any targeting. Customers are randomly assigned to receive the intervention or not. This randomization serves two purposes simultaneously. First, it generates causal estimates of the treatment effect: because assignment is random, differences in outcomes between the treated and control groups can be attributed to the intervention. Second, it generates the training data for the targeting algorithm. Within this experimental dataset, you split the data into training and validation sets to learn and tune a model of who responds. The result is a targeting rule: a specification of which customers to treat in a future campaign.
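
A sketch of the Experiment 1 bookkeeping, under assumed names (`assign_randomly`, `train_validation_split`): randomize treatment first, then split the resulting experimental records to learn and tune the rule.

```python
import random

def assign_randomly(customer_ids, seed=0):
    """Coin-flip treatment assignment; this is what licenses causal claims."""
    rng = random.Random(seed)
    return {cid: rng.random() < 0.5 for cid in customer_ids}  # True = treated

def train_validation_split(records, frac_train=0.8, seed=0):
    """Split the experimental data to learn and tune the targeting rule."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]

# e.g., 5,000 customers: roughly half treated, then an 80/20 split
assignment = assign_randomly(range(5000))
train, valid = train_validation_split(list(assignment.items()))
```
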

The oracle problem is what makes a second experiment necessary. Even after Experiment 1, you cannot observe both potential outcomes for any individual customer. You know what each person did under the condition they were assigned to; you do not know what they would have done under the other condition. This means the targeting rule you learned is a prediction of who will respond, not a guarantee. Applied to a new population at a future time, you do not actually know whether the customers your rule selects will produce lift, or whether they would have converted anyway.

Experiment 2 is how you answer that question. Once you have a targeting rule, you deploy it to a new population but randomly withhold the intervention from a subset of the customers the rule would have targeted. This creates a valid comparison: targeted customers who received the intervention versus targeted customers who did not. The difference in outcomes between these two groups is the actual incremental effect of the policy. Not the model's prediction of it. The measured outcome.

Experiment 1 gives you a rule. Experiment 2 tells you whether the rule works.
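
The Experiment 2 design can be sketched as a single function. All names here are hypothetical; `outcome(customer, treated)` stands in for conversions observed in deployment.

```python
import random

def validate_policy(customers, targeting_rule, outcome, holdout_frac=0.5, seed=1):
    """Deploy the rule to a new population, but randomly withhold the
    intervention from a subset of the customers the rule selects."""
    rng = random.Random(seed)
    selected = [c for c in customers if targeting_rule(c)]
    coupon, holdout = [], []
    for c in selected:
        (holdout if rng.random() < holdout_frac else coupon).append(c)
    rate = lambda group, treated: sum(outcome(c, treated) for c in group) / len(group)
    # Both groups were ML-selected, so the difference is caused by the coupon.
    return rate(coupon, True) - rate(holdout, False)
```

The return value is the measured incremental effect of the policy, not the model's prediction of it: the only input that matters is what the selected customers actually did with and without the coupon.
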

EXPERIMENT 1 — LEARN THE RULE
- Customer population (e.g., 5,000 customers)
- Random assignment: each customer randomly assigned to treatment or control
- Treatment group receives the coupon; control group does not
- Outcomes observed; compare groups → causal estimate of lift and heterogeneity
- Train targeting algorithm on Experiment 1 data (train/validation split) → targeting rule: who to give the coupon

THE ORACLE PROBLEM
The algorithm learned a rule from Experiment 1. But you cannot observe what would have happened to each person under the other condition. The rule is a prediction, not a guarantee.

EXPERIMENT 2 — VALIDATE THE POLICY
- New customer population (e.g., 6,000 new customers)
- Targeting algorithm (from Experiment 1) scores the population and selects customers predicted to respond to the coupon
- Second random assignment among ML-selected customers: coupon or no coupon
  - ML-selected + coupon: algorithm said yes, randomly assigned to receive it
  - ML-selected + no coupon: algorithm said yes, randomly held back
- Measured lift = coupon rate − no-coupon rate: the actual incremental effect of the targeting policy. Both groups were ML-selected, so the difference is caused by the coupon.
Does Experiment 1 already answer this if it was well-designed?

Experiment 1 gives you average treatment effects and a model of heterogeneity. It does not tell you whether the targeting rule generalizes to a new population, a new time period, or a new context. Customer behavior shifts, and a rule learned on past data may select different people than intended when deployed. Experiment 2 measures the policy's actual performance in the deployment context, which is a different question than how well the model fit the training data.

Can a good enough model make the second experiment unnecessary?

No. Model fit measures how well the algorithm's predictions matched outcomes in the training data. It does not measure incremental lift in deployment. A model with high predictive accuracy can still produce zero lift if the customers it selects would have converted regardless. Accuracy and lift are independent quantities. Only an experiment that includes a holdout group can measure lift directly.
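
A toy illustration, with invented customers, of why accuracy and lift can diverge: the model perfectly identifies converters, but every customer it selects converts with or without the coupon.

```python
customers = [
    # (model_says_convert, converts_if_treated, converts_if_untreated)
    (True, 1, 1),   # sure thing: converts either way
    (True, 1, 1),
    (False, 0, 0),  # lost cause: never converts
    (False, 0, 0),
]

# Predictive accuracy against observed conversions: perfect.
correct = sum(pred == bool(y1) for pred, y1, y0 in customers)
accuracy = correct / len(customers)

# Lift among model-selected customers: treated vs. untreated outcome.
selected = [(y1, y0) for pred, y1, y0 in customers if pred]
lift = sum(y1 - y0 for y1, y0 in selected) / len(selected)
```

Here `accuracy` is 1.0 while `lift` is 0.0; the model fit the data and the campaign still created nothing.
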

Resources for This Section

Article A Refresher on A/B Testing (Harvard Business Review)

A clear primer on why controlled experiments are necessary to establish causal claims, including the logic of holdout groups and incrementality testing.

Article Uplift Modeling: A guide from Booking.com

Explains the difference between predicting who will convert and predicting who will convert because of an intervention, which is the core question in incrementality-focused targeting.

Classic Rubin Causal Model (Wikipedia)

The formal framework behind the counterfactual concept. Introduces potential outcomes notation, the language used to define what an oracle would actually tell you.

Part 2

Algorithmic Fairness: Should We Segment by Demographics?

Group Fairness · Individual Fairness · Causal Fairness · Independence · Separation · Sufficiency · Base Rate · Impossibility Result · Protected Attribute · Proxy Variable

Marketers are trained to segment. Finding groups of customers who differ in their preferences, behaviors, or responses to marketing actions is the foundation of how we target and personalize. But when machine learning systems are used to make these decisions at scale, differential treatment can become systematically unfair in ways that are not obvious, and that are not fixed by simply removing protected variables from the model. This section covers the three main criteria used to evaluate algorithmic fairness, the mathematical constraint that limits how you can apply them, and what these criteria do and do not capture.

Concept 1

The Segmentation Question: When Does Differentiation Become a Fairness Problem?

Marketing optimization naturally produces differential treatment. A targeting algorithm that identifies who gets a discount, a credit offer, or a job ad will, by design, treat different people differently. That is the point.

The fairness question is not whether to differentiate; it is on what basis and with what consequences. Some bases for differentiation are uncontroversial: targeting people who recently searched for a product is efficient and generally seen as acceptable. Others involve legally protected attributes and are morally contested: targeting based on race, gender, or age is restricted in lending, hiring, and housing in most jurisdictions.

The practical challenge is that even when protected attributes are excluded from a model, they can enter through correlated variables. A model trained on zip code, purchase history, or name-based features may effectively differentiate by race or gender without ever using those variables directly. Removing a variable from a model does not remove its influence when that variable is correlated with variables that remain.

Example
Credit scoring: A lender removes race from a credit scoring model. But the model uses zip code, which is highly correlated with race due to patterns of residential segregation. The model's decisions remain racially differentiated, even though race was never an input. This is called proxy discrimination: using a variable that carries the influence of a protected attribute through correlation.
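
A toy simulation of proxy discrimination (all numbers are assumptions chosen for illustration): race is never a model input, but because zip code is correlated with race, a model that scores on zip code alone still produces racially differentiated approval rates.

```python
import random

rng = random.Random(42)
population = []
for _ in range(10_000):
    race = rng.choice(["A", "B"])
    # Residential segregation, exaggerated: group A lands in zip 1
    # with probability 0.9, group B with probability 0.1.
    zip_code = 1 if (race == "A") == (rng.random() < 0.9) else 2
    population.append((race, zip_code))

def approve(zip_code):
    # The model never sees race; it scores on zip code alone.
    return zip_code == 1

def approval_rate(group):
    members = [z for r, z in population if r == group]
    return sum(approve(z) for z in members) / len(members)
```

Despite race being absent from the model, `approval_rate("A")` comes out near 0.9 and `approval_rate("B")` near 0.1: the correlation carries the protected attribute through.
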
Concept 2

Three Fairness Criteria

There is no single agreed-upon definition of algorithmic fairness. Researchers and practitioners have formalized three main criteria, each of which captures a different intuition about what "equal treatment" means. Understanding all three, including what each one requires and misses, is important before deploying any algorithm that makes decisions affecting people.

Criterion 1

Independence

Equal outcomes across groups.

The algorithm's decisions are independent of group membership. The rate at which people receive a positive decision (a loan, a job ad, a coupon) is the same across groups.

What it requires: The same selection rate in every group, even when other characteristics differ across groups.

What it misses: Groups may genuinely differ in the characteristic being predicted. Forcing equal selection rates can reduce accuracy and may require overriding real signal.

Criterion 2

Separation

Equal errors across groups.

The algorithm's error rates are equal across groups, conditional on the true outcome. The false positive rate and the false negative rate are the same for each group.

What it requires: Among people who should receive a positive decision, the same fraction does; among those who should not, the same fraction does not.

What it misses: Requires knowing the true outcome, which may itself reflect historical bias.

Criterion 3

Sufficiency

Equal meaning of scores across groups.

Among all people assigned the same score or decision, the actual outcome is the same regardless of group membership. The score means the same thing for everyone.

What it requires: Predictive accuracy is equally calibrated across groups.

What it misses: Can still produce very different selection rates across groups if base rates differ.

Examples Across Criteria
Independence in advertising: A job ad algorithm shows the ad to equal proportions of men and women. It satisfies independence. But if men and women have different rates of relevant qualifications, the algorithm is overriding real signal to achieve parity.
Separation in lending: A credit model has the same false rejection rate for qualified applicants regardless of race. It satisfies separation. But if historical data on defaults is itself biased (e.g., minority applicants were denied credit at higher rates regardless of creditworthiness), the "true outcome" the model is calibrated to is contaminated.
Sufficiency in risk scoring: A recidivism algorithm assigns a risk score that is equally predictive across racial groups (a score of 7 means the same probability of reoffending for both Black and white defendants). It satisfies sufficiency. But because base rates differ, the algorithm assigns high-risk scores to Black defendants at higher rates overall, producing disparate impact in outcomes despite equal predictive accuracy.
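
The three criteria can be audited from the same decision log. This sketch assumes hypothetical records of `(group, decision, true_outcome)` tuples, each field 0/1 where applicable, and reports the quantity each criterion asks to be equal: selection rate (Independence), error rates (Separation), and positive predictive value (Sufficiency).

```python
def rate(records, cond, event):
    """P(event | cond) over a list of records."""
    sub = [r for r in records if cond(r)]
    return sum(event(r) for r in sub) / len(sub) if sub else float("nan")

def audit(records):
    """Per-group fairness metrics from (group, decision, true_outcome) tuples."""
    by_group = {}
    for g in {r[0] for r in records}:
        grp = [r for r in records if r[0] == g]
        by_group[g] = {
            # Independence: selection rate P(decision = 1)
            "selection_rate": rate(grp, lambda r: True, lambda r: r[1] == 1),
            # Separation: FPR = P(decision=1 | outcome=0), FNR = P(decision=0 | outcome=1)
            "fpr": rate(grp, lambda r: r[2] == 0, lambda r: r[1] == 1),
            "fnr": rate(grp, lambda r: r[2] == 1, lambda r: r[1] == 0),
            # Sufficiency: PPV = P(outcome=1 | decision=1), what a positive decision means
            "ppv": rate(grp, lambda r: r[1] == 1, lambda r: r[2] == 1),
        }
    return by_group
```

Comparing these numbers across groups is the audit; the impossibility result below explains why they cannot all match when base rates differ.
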
Concept 3

The Impossibility Result: You Cannot Satisfy All Three at Once

A central and counterintuitive result in algorithmic fairness is that when base rates differ across groups (that is, when the underlying outcome being predicted is more common in one group than another), it is mathematically impossible to satisfy Independence, Separation, and Sufficiency simultaneously. You can satisfy at most two, and often only one.

This is not a design flaw that better engineering can fix. It is an arithmetic constraint. Chouldechova (2017) and Kleinberg et al. (2016) proved this formally. The intuition is that each criterion asks something different to be equal, and when the underlying distributions differ, you cannot equalize everything at once without contradiction.
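
The arithmetic can be seen in one identity. From PPV = p·TPR / (p·TPR + (1−p)·FPR), where p is a group's base rate, rearranging gives FPR = p/(1−p) · TPR · (1−PPV)/PPV. The numbers below (a shared PPV of 0.7 and TPR of 0.8, base rates of 50% and 30%) are assumptions chosen for illustration: holding calibration (Sufficiency) and the true positive rate fixed forces the groups' false positive rates apart, so Separation fails.

```python
def forced_fpr(base_rate, tpr, ppv):
    # FPR implied by the identity PPV = p*TPR / (p*TPR + (1-p)*FPR)
    return base_rate / (1 - base_rate) * tpr * (1 - ppv) / ppv

ppv, tpr = 0.7, 0.8            # assumed equal across groups
fpr_a = forced_fpr(0.5, tpr, ppv)  # group A: base rate 50%
fpr_b = forced_fpr(0.3, tpr, ppv)  # group B: base rate 30%
print(fpr_a, fpr_b)  # group A's forced FPR is higher
```

No amount of engineering changes this: as long as the base rates differ, equalizing PPV and TPR mathematically pins the false positive rates to different values.
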

The consequence is that choosing a fairness criterion is a normative decision, not a technical one. It is a choice about what should be equal: outcomes, errors, or the meaning of scores. There is no neutral default. Deploying an algorithm without specifying a criterion means the criterion was chosen by default, implicitly, by whoever set the objective function.

What if we just try to satisfy all three as best we can?

Partial satisfaction does not resolve the underlying conflict; it obscures it. If base rates differ, improving performance on one criterion moves you away from another. The tradeoff is real and cannot be engineered away. What you can do is make the tradeoff explicit: decide which criterion matters for this application, document the choice, and monitor the consequences.

What if we just exclude all demographic variables?

Exclusion does not eliminate disparate impact when demographic variables are correlated with the features that remain. A model trained on purchase history, browsing behavior, or geographic data will often reproduce demographic patterns without ever seeing demographic inputs. Fairness cannot be achieved by omission alone; it requires actively choosing a criterion and designing the system accordingly.

Note on Scope

What These Criteria Do and Do Not Cover

The three criteria above are all examples of group fairness: they ask whether decisions are equitable at the level of demographic groups. Group fairness is the framework most often referenced in regulatory and policy discussions, in part because it is measurable with aggregate data. But it is not the only framework.

Individual fairness asks a different question: are similar individuals treated similarly? Two people with nearly identical profiles should receive nearly identical decisions, regardless of their group membership. This is intuitively appealing but technically harder; it requires defining what "similar" means for individuals, which is itself a contested question.

Causal fairness asks whether group membership is a causal driver of the disparity, or whether it is a proxy for something legitimate. A model that produces different outcomes for men and women in a domain where gender genuinely predicts the outcome may or may not be unfair, depending on whether gender is driving the outcome causally or merely correlating with something else. Answering that requires causal modeling, not just statistical parity checks.

The three group fairness criteria are a starting point, not a complete framework. They are the right tools for auditing whether a deployed system produces systematically different outcomes across protected groups. They are not sufficient to answer every fairness question that a marketer or product manager will encounter.

Why It Matters for Marketers

Marketing optimization and algorithmic fairness pull in different directions. An algorithm optimized for revenue will target whoever responds reliably, and those segments may systematically exclude protected groups, not through intent but through the correlation structure of the data. Understanding the three criteria, the impossibility result, and the limits of each framework gives you the tools to ask the right questions before deployment, not after.

Resources for This Section

Textbook Fairness and Machine Learning (Barocas, Hardt & Narayanan)

The standard reference on algorithmic fairness. Free online. Chapters 2 and 3 cover the three criteria and the impossibility results in detail, with formal proofs and applied examples.

Research Inherent Trade-Offs in the Fair Determination of Risk Scores (Chouldechova, 2017)

The formal proof that calibration and equal error rates cannot simultaneously hold when base rates differ. The paper that established the impossibility result most clearly for practitioners.

Research Inherent Trade-Offs in Algorithmic Fairness (Kleinberg et al., 2016)

A parallel impossibility proof showing that Independence, Separation, and Sufficiency cannot all be satisfied simultaneously. Companion to Chouldechova's result.

Article Machine Bias (ProPublica)

The investigation that sparked the modern public debate on algorithmic fairness, examining a recidivism risk tool used in U.S. courts. Illustrates the Independence vs. Sufficiency tension with real data.

All Resources at a Glance

| Topic | Resource | Format |
| A/B Testing & Incrementality | A Refresher on A/B Testing (HBR) | Article |
| Uplift Modeling | Uplift Modeling: A guide from Booking.com | Article |
| Counterfactual Reasoning | Rubin Causal Model (Wikipedia) | Classic |
| Algorithmic Fairness (full reference) | Fairness and Machine Learning (Barocas, Hardt & Narayanan) | Textbook (free) |
| Impossibility Result | Inherent Trade-Offs in Fair Risk Scores (Chouldechova, 2017) | Research |
| Impossibility Result | Inherent Trade-Offs in Algorithmic Fairness (Kleinberg et al., 2016) | Research |
| Fairness in Practice | Machine Bias (ProPublica) | Article |