AI & Marketing
The situation
Alex Campbell, CEO of Artea, knows that 87% of website visitors never buy anything. She wants to know whether a 20% off coupon changes that.
She cannot just send coupons and see what happens.
She runs an A/B test.
Why does she need a control group?
2,500 customers → receive 20% coupon
↓
Outcome: Y_treatment
2,500 customers → receive nothing
↓
Outcome: Y_control
Lift = Y_treatment − Y_control
Without a concurrent control group, you cannot separate the effect of the coupon from seasonal variation, trends, or any other factor moving outcomes in the same period.
Historical data does not give you a counterfactual.
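The design above can be sketched in a few lines. This is an illustrative sketch, not the case's actual pipeline: `run_ab_test` and `observe_outcome` are invented names, and `observe_outcome` stands in for the real world, which reveals only one outcome per customer.

```python
import random

def run_ab_test(customer_ids, observe_outcome, n_treated=2500):
    """Randomize who gets the coupon, then compare average outcomes.

    observe_outcome(customer_id, treated) stands in for reality:
    it returns the one outcome we actually get to observe.
    """
    random.shuffle(customer_ids)                      # randomization is the whole point
    treated = customer_ids[:n_treated]
    control = customer_ids[n_treated:n_treated * 2]
    y_treatment = sum(observe_outcome(c, True) for c in treated) / len(treated)
    y_control = sum(observe_outcome(c, False) for c in control) / len(control)
    return y_treatment - y_control                    # the lift
```

Because assignment is random, the two groups face the same season, the same trends, the same everything, so the difference in means isolates the coupon's effect.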
Average outcomes by experimental condition
| | No Coupon | Coupon | Difference |
|---|---|---|---|
| Transactions | 0.126 | 0.152 | +0.026 |
| Revenue (USD) | $7.78 | $7.54 | −$0.24 |
5,000 customers. Half received the coupon.
The coupon increased transactions.
The coupon did not increase revenue.
Are you confident in both findings?
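One way to check the transactions finding is a two-proportion z-test on the rates in the table (0.126 vs. 0.152, roughly 2,500 customers per arm). This is a sketch from the summary numbers; the revenue comparison would need a t-test using per-customer variances, which the table does not report.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2):
    """z-statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled success rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Transactions: 0.126 (no coupon) vs 0.152 (coupon), 2,500 per arm
z = two_proportion_z(0.126, 2500, 0.152, 2500)
print(round(z, 2))  # 2.66, beyond 1.96, so significant at the 5% level
```

So the transactions lift clears the conventional bar; the revenue difference of −$0.24 cannot be judged from the averages alone.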
Why the pattern makes sense
The coupon creates two opposing forces on revenue:
- More customers complete a purchase (transactions rise).
- Every purchase made with the coupon is discounted 20% (revenue per transaction falls).
At scale, these forces roughly cancel out.
The average customer was going to spend around $65 whether or not they received the coupon. The discount moved some of them to buy — but it did not increase what they spent.
This does not mean the coupon failed. It means the average customer is the wrong unit of analysis.
Does the coupon effect differ by how the customer found Artea?
| Channel | Coupon effect on transactions | Coupon effect on revenue |
|---|---|---|
| Google | −0.020 (not significant) | −$2.12 (significant) |
| Facebook | +0.082 (p < 0.05) | +$3.20 (p < 0.1) |
| Instagram | +0.079 (p < 0.01) | +$3.28 (p < 0.05) |
| Referral | +0.074 (not significant) | +$2.46 (not significant) |
Treatment effects by acquisition channel, estimated from a regression with channel × treatment interaction terms (Google is the reference category). Source: AB_test data.
Facebook and Instagram customers respond. Google customers do not — and their revenue goes down.
Why might this pattern exist?
Think about purchase intent at acquisition:
Google search → customer is actively looking, likely to buy eventually anyway. Coupon cannibalizes a sale that was coming.
Social media (FB/Instagram) → customer was browsing, not searching. Coupon creates a reason to buy that did not previously exist.
This is the difference between converting existing intent and creating new intent.
For Google customers: the coupon is a discount on a sale you were going to make anyway.
Does having an abandoned cart change how customers respond?
| Shopping cart | Coupon effect on transactions | Coupon effect on revenue |
|---|---|---|
| No abandoned cart | +0.008 (not significant) | −$0.74 (not significant) |
| Abandoned cart | +0.069 to +0.085 (p < 0.01) | +$2.12 to +$3.00 (p < 0.05) |
29% of the 5,000 customers had abandoned a cart.
For customers with abandoned carts, the coupon works on both transactions and revenue. For everyone else, it barely moves the needle.
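Effects like those in the two tables above are, at bottom, within-subgroup differences in means between treated and control customers. A minimal sketch, assuming each customer is a dict with illustrative keys `treated`, `revenue`, and a grouping variable such as `channel` or `abandoned_cart`:

```python
def subgroup_effects(rows, group_key):
    """Treatment effect (treated mean minus control mean) within each subgroup."""
    effects = {}
    for g in {r[group_key] for r in rows}:
        treated = [r["revenue"] for r in rows if r[group_key] == g and r["treated"]]
        control = [r["revenue"] for r in rows if r[group_key] == g and not r["treated"]]
        if treated and control:  # skip subgroups missing one arm
            effects[g] = sum(treated) / len(treated) - sum(control) / len(control)
    return effects
```

A regression with interaction terms estimates the same quantities while also providing standard errors, which is where the p-values in the tables come from.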
The behavioral story
A customer who added an item to their cart and left is not a random visitor. They:
- wanted the product enough to put it in the cart, and
- had some reason (price, hesitation, friction) not to complete the purchase.
The coupon resolves that reason. It is not persuading someone to want something — it is removing a barrier for someone who already wanted it.
The coupon works as a nudge, not a pitch.
This is the highest-confidence finding in the data. The effect is large, consistent, and has a clean behavioral explanation.
Plant the seed: “Is there anything you know about cart abandoners as a group? What kinds of customers abandon carts more often?”
What you did — in two phases
Phase 1: Training
You had 5,000 customers with known outcomes. You used that data to find patterns — which customers responded to the coupon, and which didn’t.
The AB_test tab was your training data. You were the algorithm.
Phase 2: Inference
You took the rule you learned and applied it to 6,000 different customers — customers with no outcome data, because they were never in the experiment.
The Next_Campaign tab was your test set. You were making predictions, not observations.
This is the structure of every supervised learning problem: learn a function on labeled data, apply it to unlabeled data.
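The two phases can be sketched as code. Everything here is illustrative: the field name `abandoned_cart` is an assumption, not the spreadsheet's actual column, and the learned "rule" is hard-coded to the case's finding rather than fit by an algorithm.

```python
# Phase 1 (training): learn a targeting rule from labeled AB_test rows.
def learn_rule(labeled_rows):
    """In this sketch the rule is hard-coded to the case's finding:
    target cart abandoners. A real learner would fit this from the data."""
    return lambda row: 1 if row["abandoned_cart"] else 0

# Phase 2 (inference): apply the rule to unlabeled Next_Campaign rows.
def apply_rule(rule, unlabeled_rows):
    return [rule(r) for r in unlabeled_rows]

rule = learn_rule([])  # training data omitted in this sketch
targets = apply_rule(rule, [{"abandoned_cart": True}, {"abandoned_cart": False}])
print(targets)  # [1, 0]
```

The 1s and 0s you added to the Next_Campaign tab are exactly the output of `apply_rule`: predictions, made without ever observing an outcome.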
How the comparison table was built
The table you are about to see was not built from your predictions.
It was built from counterfactual outcomes that only the instructor version of the spreadsheet contains.
For each of the 6,000 Next_Campaign customers, that file records:
- the outcome if the customer receives the coupon, and
- the outcome if the customer does not.
Both. For the same person. At the same time.
In reality, a customer either gets the coupon or doesn’t — you can never observe both outcomes. This dataset was simulated to allow policy evaluation without running a new experiment.
This is called an oracle. You had the rule. The oracle had the ground truth.
In practice you would not have this. You would run another A/B test — or accept the uncertainty.
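With oracle counterfactuals in hand, evaluating a policy is just picking, for each customer, the outcome that matches the policy's decision. A sketch with hypothetical field names (`revenue_if_coupon`, `revenue_if_no_coupon`, `abandoned_cart`):

```python
def evaluate_policy(policy, customers):
    """Total revenue under a targeting policy, using oracle counterfactuals.

    Each customer dict carries the two simulated outcomes that reality
    never reveals together."""
    total = 0.0
    for c in customers:
        if policy(c):
            total += c["revenue_if_coupon"]
        else:
            total += c["revenue_if_no_coupon"]
    return total

# Example: policy 3a, target cart abandoners only
cart_only = lambda c: c["abandoned_cart"]
```

Each row in the ranking table that follows is one call to something like `evaluate_policy`, with a different `policy` function.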
Standard policies ranked by revenue
| Policy | Transactions | Revenue | % Targeted |
|---|---|---|---|
| 3a: shopping cart only | 804 | $43,823 | 29% |
| 3b: cart + FB/IG | 776 | $43,801 | 15% |
| 2b: Instagram only | 764 | $42,532 | 30% |
| 1a: no one | 678 | $41,694 | 0% |
| 3c: cart OR FB/IG | 821 | $41,666 | 67% |
| 2c: FB + IG | 795 | $41,644 | 52% |
| 4: no prior purchase | 664 | $40,979 | 33% |
| 2a: Facebook only | 707 | $40,806 | 22% |
| 1b: everyone | 807 | $39,680 | 100% |
| 1c: random | 712 | $39,459 | 49% |
Source: New Campaign Instructor tab, counterfactual outcomes.
Sending to everyone produces less revenue than sending to no one; only the random policy does worse.
When you give a coupon to a customer who was going to buy anyway, you do not gain a transaction. You lose 20% of the margin.
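The arithmetic of that margin loss, using the average basket size from earlier in the case:

```python
# A customer who was going to buy anyway, at the case's average basket size:
price, discount = 65.00, 0.20
revenue_without_coupon = price                    # the sale happens regardless
revenue_with_coupon = price * (1 - discount)
margin_given_away = round(revenue_without_coupon - revenue_with_coupon, 2)
print(margin_given_away)  # 13.0 handed back on a sale that needed no incentive
```

Multiply that $13 across every customer who would have bought at full price, and blanket couponing turns into the revenue drain the table shows.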
The two best policies target narrow groups: cart abandoners (29%), or cart abandoners from social (15%).
The tradeoff:
- Policy 3a targets 29% of customers and produces the most transactions of the two (804), at $43,823.
- Policy 3b targets only 15%, sends roughly half as many coupons, and gives up just $22 of revenue (and 28 transactions).
Neither is obviously right. The answer depends on whether Artea’s goal is conversion volume or margin protection.
Where does your team’s policy land?
What you had
AB_test — 5,000 customers. Behavioral data: acquisition channel, past purchases, cart status, browsing time. Outcomes recorded after the experiment. No demographics. This is what you used to find patterns.
Next_Campaign — 6,000 customers. Same behavioral variables. No outcomes. No demographics. You added a 1 or 0 to each row. This is what you applied your rule to.
What both datasets share:
Neither contained gender. Neither contained race or ethnicity. Not because Artea doesn’t have customers with those characteristics — but because that data was held separately and was not used for targeting.
What else existed
For the same 6,000 Next_Campaign customers, a separate dataset records demographic characteristics — gender and minority status — provided voluntarily by customers at signup.
This information was available to Artea.
It was not in the files you worked with.
You built your targeting rule entirely from behavioral signals. You made decisions about 6,000 people without knowing who they were.
This is not unusual. Most targeting systems are built from behavioral data. Demographic information is often held separately — for privacy reasons, regulatory reasons, or simply because it was never connected to the targeting pipeline.
What happens when you match the two back together?
Demographic composition of the Next_Campaign dataset (6,000 customers — the people your policy targeted)
| | Share of customer base |
|---|---|
| Female customers | 58% |
| Male customers | 42% |
| Non-minority customers | 79% |
| Minority customers | 21% |
Source: demographic data held separately from the targeting files.
These are not edge cases. Women are the majority of Artea’s customer base. Minority customers represent more than one in five.
Your targeting rule was applied to this population.
Why this baseline matters
Before asking whether your policy discriminated, you need to know who you were making decisions about.
58% of the customers in Next_Campaign are female. 21% are from a minority group.
Any targeting policy that departs significantly from those proportions is making a choice about which customers receive the offer.
Every policy ranked by revenue — and by who it actually reaches
| Policy | Revenue | % Targeted | % Female | % Minority |
|---|---|---|---|---|
| Baseline (all customers) | — | — | 58% | 21% |
| 3a: cart only | $43,823 | 29% | 42% ↓ | 20% |
| 3b: cart + FB/IG | $43,801 | 15% | 49% ↓ | 7% ↓↓ |
| 2b: Instagram only | $42,532 | 30% | 73% ↑↑ | 6% ↓↓ |
| 1a: no one | $41,694 | 0% | — | — |
| 3c: cart OR FB/IG | $41,666 | 67% | 58% ✓ | 12% ↓ |
| 2c: FB + IG | $41,644 | 52% | 64% ↑ | 7% ↓↓ |
| 4: no prior purchase | $40,979 | 33% | 57% ✓ | 21% ✓ |
| 1b: everyone | $39,680 | 100% | 58% ✓ | 21% ✓ |
↓ = undertargets relative to baseline. ↑ = overtargets. ✓ = roughly at baseline. No policy used gender or race as an input.
The business case
Artea’s customer base is 58% female and 21% minority.
The two best-performing policies systematically undertarget both groups — particularly minorities, who receive coupons at roughly one-third their population share under policy 3b.
These are not low-value customers to exclude. They are the majority of the customer base.
A promotional strategy that concentrates offers on a narrow demographic slice creates three business risks:
- it withholds offers from the majority of the customer base,
- it trades long-term customer value for short-term revenue lift, and
- it invites reputational and regulatory scrutiny if the disparity becomes visible.
Optimizing for short-term revenue lift is not the same as optimizing for long-term customer value.
Is this fair?
Your best policy gave coupons to 42% of female customers and 7% of minority customers.
The baseline: 58% female, 21% minority.
No one used gender or race as a variable. The algorithm optimized revenue.
You cannot answer whether this is fair without first answering: Fair in what sense?
There is no single definition. Before evaluating the algorithm, you must decide what kind of fairness you are asking about.
Notation used throughout:
- \(D\): the decision. Does the customer receive the coupon?
- \(A\): the protected attribute. Group membership such as gender or minority status.
- \(R\): the true outcome. Would this customer respond to the coupon?
Independence
\[D \perp A\]
Prediction is independent of group membership.
Plain language: Equal outcomes across groups.
Decision question: Is the probability of receiving a coupon the same regardless of gender or minority status?
Also called: Demographic parity
Separation
\[D \perp A \mid R\]
Prediction is independent of group, given true outcome.
Plain language: Equal errors across groups.
Decision question: Among customers who would have responded to the coupon, are they missed at the same rate regardless of group? Among those who would not respond, are they wrongly targeted at the same rate?
Also called: Equalized odds
Sufficiency
\[R \perp A \mid D\]
True outcome is independent of group, given the prediction.
Plain language: Scores mean the same thing for everyone.
Decision question: Among all customers assigned the same predicted response probability, do they actually respond at the same rate regardless of group?
Also called: Calibration
Independence \(D \perp A\)
\[P(D=1 \mid A=female) = P(D=1 \mid A=male)\]
A model satisfies independence if the probability of receiving the offer is the same across groups. If 58% of customers are female, this implies roughly 58% of coupons should go to female customers.
This is the only criterion you can evaluate without knowing who would have responded.
58% of customers are female. A fair policy sends 58% of coupons to female customers.
| Policy | Revenue | % ♀ | % ♂ | Indep? |
|---|---|---|---|---|
| Baseline | — | 58% | 42% | — |
| 3a: cart only | $43,823 | 42% | 58% | ✗ |
| 3b: cart + FB/IG | $43,801 | 49% | 51% | ✗ |
| 2b: Instagram | $42,532 | 73% | 27% | ✗ |
| 3c: cart OR FB/IG | $41,666 | 58% | 42% | ✓ |
Policy 3a sends 42% of coupons to female customers and 58% to male — the mirror image of the population.
No demographic variable was used. Independence is violated anyway.
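Independence is the one criterion you can audit from the policy's output alone: compute \(P(D=1 \mid A=a)\) for each group and compare. A sketch, with `group` and `coupon` as hypothetical field names:

```python
def coupon_share_by_group(rows):
    """P(D=1 | A=a): the share of each group that receives the coupon."""
    shares = {}
    for g in {r["group"] for r in rows}:
        members = [r for r in rows if r["group"] == g]
        shares[g] = sum(r["coupon"] for r in members) / len(members)
    return shares

# Independence holds when the shares are (approximately) equal across groups.
```

No outcome data, no experiment: just the decisions and the group labels, which is why this check is possible even before the campaign runs.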
Separation \(D \perp A \mid R\)
\[P(D=0 \mid R=1, A=female) = P(D=0 \mid R=1, A=male)\]
A model satisfies separation if, among customers who would respond to the offer (\(R=1\)), the share who were never given it is the same across groups.
\(R=1\) means: this customer would respond to the offer — they are a persuadable customer. This is the causal outcome: response to the treatment.
In Artea: Among customers who would respond to the coupon, the share who were never targeted should be equal across gender.
Requires an experiment — \(R\) is not observable in deployment data without randomization.
Among customers who would respond to the offer (\(R = 1\)), what share were never offered it?
| Policy | Revenue | FNR ♀ | FNR ♂ | Gap | Sep? |
|---|---|---|---|---|---|
| 3a: cart only | $43,823 | 55% | 32% | 23 pts | ✗ |
| 3b: cart + FB/IG | $43,801 | 66% | 55% | 11 pts | ✗ |
| 2b: Instagram | $42,532 | 48% | 70% | 22 pts | ✗ |
| 3c: cart OR FB/IG | $41,666 | 15% | 10% | 5 pts | ✗ |
Policy 3a misses 55% of female customers who would have responded — 23 points higher than for male customers.
Instagram (2b) flips the gap: it overtargets women so heavily that it misses more male responders (70%) than female (48%).
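The false-negative rates above are conditional frequencies: among would-be responders in each group, the share the policy never targeted. A sketch, assuming hypothetical keys `group`, `coupon` (the decision \(D\)), and `would_respond` (the causal outcome \(R\), available only from experimental or simulated counterfactual data):

```python
def fnr_by_group(rows):
    """P(D=0 | R=1, A=a): among would-be responders, the share never targeted."""
    fnr = {}
    for g in {r["group"] for r in rows}:
        responders = [r for r in rows if r["group"] == g and r["would_respond"]]
        if responders:
            fnr[g] = sum(1 for r in responders if not r["coupon"]) / len(responders)
    return fnr

# Separation (this half of it) holds when the FNRs are equal across groups.
```

The dependence on `would_respond` is exactly why separation cannot be audited from deployment data alone.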
Sufficiency \(R \perp A \mid D\)
\[P(R=1 \mid D=1, A=female) = P(R=1 \mid D=1, A=male)\]
A model satisfies sufficiency if, among customers who received the offer (\(D=1\)), the response rate is the same across groups.
When we condition on \(D=1\), \(R=1\) is directly observable as “actually responded” — no experiment needed. The offer either worked or it did not.
In Artea: Among customers who were offered the coupon, the share who actually bought should be the same for female and male customers.
Among customers who received the coupon (\(D=1\)), what share actually responded (\(R=1\))? (Directly observable — no experiment needed.)
| Policy | Revenue | Prec ♀ | Prec ♂ | Gap | Suff? |
|---|---|---|---|---|---|
| Population response rate | — | 12.3% | 9.3% | 3 pts | — |
| 3a: cart only | $43,823 | 26.7% | 15.7% | 11 pts | ✗ |
| 3b: cart + FB/IG | $43,801 | 33.8% | 23.1% | 11 pts | ✗ |
| 2b: Instagram | $42,532 | 16.8% | 14.3% | 2.5 pts | ✓ |
| 3c: cart OR FB/IG | $41,666 | 15.9% | 12.4% | 3.5 pts | ✗ |
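Unlike separation, the precision numbers in the table need only observed campaign data. A sketch with hypothetical keys `group`, `coupon`, and `responded`:

```python
def precision_by_group(rows):
    """P(R=1 | D=1, A=a): among coupon recipients, the share who responded.
    Needs only observed data: who got the coupon, and who then bought."""
    prec = {}
    for g in {r["group"] for r in rows}:
        targeted = [r for r in rows if r["group"] == g and r["coupon"]]
        if targeted:
            prec[g] = sum(1 for r in targeted if r["responded"]) / len(targeted)
    return prec

# Sufficiency holds when precision is equal across groups.
```

This is the audit a company can always run after a campaign, which is partly why sufficiency is the criterion deployed systems most often report.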
Even with no targeting at all, sufficiency would be violated — because female customers buy at 12.3% and male customers at 9.3% when offered the coupon. That difference is in the world, not in the algorithm. No targeting rule removes it.
Instagram (2b) comes closest to satisfying sufficiency at the aggregate level — its precision gap (2.5 pts) is the smallest of any high-revenue policy.
Female customers respond at 12.3%. Male customers at 9.3%. The base rates differ.
Sufficiency requires that a coupon means the same thing for everyone. But if women buy more often, sending coupons to women will always “work better” for women than for men.
Separation requires missing equal shares of buyers across groups. But a model that maximizes response will concentrate on groups with higher base rates — which mechanically creates unequal miss rates across groups.
Independence requires sending equal shares of coupons to each group. But forcing that equal split means ignoring who actually responds — which reopens the other two gaps.
Each fix creates a new violation. That is not a design flaw. It is arithmetic.
Chouldechova (2017) and Kleinberg et al. (2016) proved this formally. The Artea data shows it empirically.
Summary: which policies satisfy which criteria?
| Policy | Revenue | Indep | Sep | Suff |
|---|---|---|---|---|
| 3a: cart only | $43,823 | ✗ | ✗ | ✗ |
| 3b: cart + FB/IG | $43,801 | ✗ | ✗ | ✗ |
| 2b: Instagram | $42,532 | ✗ | ✗ | ✓ |
| 3c: cart OR FB/IG | $41,666 | ✓ | ✗ | ✗ |
No policy satisfies all three.
The policies with the highest revenue violate all three.
The base rate difference (women respond at 12.3%, men at 9.3%) makes simultaneous satisfaction of all three criteria impossible for any policy with real predictive power.
Each criterion asks what should be equal across groups:
- Independence: equal outcomes (who receives the coupon)
- Separation: equal errors (who is missed, who is wrongly targeted)
- Sufficiency: equal meaning (what the same score implies)
Fairness is not a property you discover in a model. It is a choice about which criterion matters — made explicitly or by default.
What happened
Goldman Sachs issued Apple Card in August 2019. The credit algorithm assigned limits based on individual credit history, income, and spending patterns — not gender.
In November 2019, programmer David Heinemeier Hansson tweeted that Apple Card had given him a credit limit 20× higher than his wife — despite filing joint taxes and her having the higher credit score.
Apple co-founder Steve Wozniak reported the same disparity.
The New York Department of Financial Services launched an investigation.
Goldman’s response:
“We do not know your gender during the Apple Card application process.”
Two different fairness claims
The complaint was an individual fairness claim.
Two people in the same household, filing joint taxes, one with the higher credit score — different limits. Similar individuals, different outcomes.
Goldman answered a group fairness (independence) claim.
We do not use gender, so the decision is independent of group membership. \(D \perp A\).
They were not answering the same question.
And even the group fairness claim failed
Independence requires not just that \(A\) is excluded from the model, but that no variable in the model carries information about \(A\).
Gender was not an input. But individual credit history was.
Women disproportionately held supplemental cards under a husband’s primary account — giving them thinner personal credit files.
Credit history → correlated with → gender
↓
D and A remain correlated
even without A in the model
Excluding a variable is not the same as removing its influence.
Independence failed — not through intent, but through correlation.
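A stylized illustration of the proxy mechanism (the numbers are invented, not Goldman's): the rule sees only credit-file thickness, never gender, yet its decisions split cleanly along gender lines because thickness and gender are correlated.

```python
# Invented population: credit-file thickness correlates with gender.
population = (
    [{"gender": "F", "thick_file": False}] * 70 + [{"gender": "F", "thick_file": True}] * 30 +
    [{"gender": "M", "thick_file": False}] * 30 + [{"gender": "M", "thick_file": True}] * 70
)

def high_limit(person):
    """The model: sees only the credit file, never gender."""
    return person["thick_file"]

for g in ("F", "M"):
    group = [p for p in population if p["gender"] == g]
    rate = sum(high_limit(p) for p in group) / len(group)
    print(g, rate)  # F 0.3, M 0.7: unequal, with gender never used
```

Dropping the `gender` field from the inputs changes nothing here; the correlation does the work.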
What happened
Hospitals and insurers used a widely deployed algorithm from Optum (UnitedHealth) to identify patients who needed high-risk care management.
The algorithm was race-blind. Race was not an input.
A study published in Science (Obermeyer et al., 2019) found that at identical risk scores, Black patients were 26% sicker than white patients — yet the algorithm ranked them as equally low-risk.
Estimated impact: the number of Black patients identified for additional care was reduced by more than half.
The complaint is a separation claim (\(D \perp A \mid R\)):
Among patients who genuinely needed care (\(R=1\)), Black patients were assigned low-risk scores at a higher rate. They were missed — not equally.
Model predicts: health care costs
What was needed: health need
Costs ≠ need. Black patients face systemic barriers to care. Less spending does not mean less sickness.
Optum’s defense is a sufficiency claim (\(R \perp A \mid D\)):
“The cost model was highly predictive of cost.”
Among all patients at a given risk score, the proportion with high actual costs was the same regardless of race.
Two different criteria.
Sufficiency was satisfied — on the wrong variable. Separation was violated on the variable that mattered.
Nobody decided which criterion to use before deployment. The model decided by default.
What happened
Beginning in 2018, Twitter used a saliency algorithm to automatically crop images in the timeline — showing the most “interesting” part of a photo without requiring users to click through.
In September 2020, users noticed the algorithm consistently favored white faces over Black faces in cropped previews. The thread went viral. Twitter investigated.
In May 2021, Twitter published its own findings and decommissioned the algorithm.
Yee, Tantipongpipat & Mishra (2021), Twitter ML Ethics team.
Twitter’s implicit claim:
Saliency models are trained on human eye-tracking data. Human attention is a neutral signal. Therefore the crop is fair.
The internal audit found two problems with that claim.
The first problem — who was in the training data:
Eye-tracking subjects:
disproportionately young,
Western, university students
↓
"Attention" patterns learned
reflect that population
The second problem — what the model reproduced:
Model deployed to 350 million users
↓
White faces and lighter skin
systematically preferred in crops
The model did exactly what it was trained to do. The problem was what it was trained on.
Twitter’s response:
“One of our conclusions is that not everything on Twitter is a good candidate for an algorithm, and in this case, how to crop an image is a decision best made by people.”
The only case where the company found it, published it, and ended it.
In Week 1 we built this framework:
Perception – raw input from the environment
Representation – how inputs are encoded for computation
Model – the function learned from data
Constraints – rules that limit allowable outputs
Behavior – what the system does in the world
Where each failure entered
| Stage | Failure type | Case | What went wrong |
|---|---|---|---|
| Representation | Proxy encoding | Apple Card | Credit history carried gender through historical financial norms |
| Representation + Constraints | Proxy encoding; no fairness criterion | Artea | Cart and channel carried demographics; policy had no output guardrail |
| Model | Wrong target variable | Optum | Trained to predict cost; should have predicted need |
| Model | Wrong training distribution | Twitter crop | Right target; training data drawn from unrepresentative population |
The representation defines what the model can see. What it cannot see, it can still learn through correlation. What the constraints do not specify, the policy will not enforce.
Do not ignore protected attributes
Excluding a variable does not remove its effect. Correlated behavior carries it back in.
Make the tradeoff explicit
Equal outcomes, equal errors, equal meaning. You cannot have all three.
Measure what you are doing
You cannot audit what you do not observe. Check who is helped — and who is missed.
Fix the right problem
Bias can come from the data, the target, or the policy — not just the model.
Further reading: Barocas, Hardt & Narayanan, Fairness and Machine Learning (free at fairmlbook.org)