Day 3 — Targeting, Experimentation & Algorithmic Bias

AI & Marketing

Simon Blanchard

This Lecture

  • Part 1: The Experiment — Why Artea Ran an A/B Test
  • Part 2: What the Data Shows — Overall Effects
  • Part 3: Heterogeneity — Who Actually Responds?
  • Part 4: Your Targeting Policies — Live Comparison
  • Part 5: One More Thing
  • Part 6: Algorithmic Bias as a General Phenomenon

Part 1: The Experiment

The situation

Alex Campbell, CEO of Artea, knows that 87% of website visitors never buy anything. She wants to know whether a 20%-off coupon changes that.

She cannot just send coupons and see what happens.

She runs an A/B test.

Why does she need a control group?

2,500 customers → receive 20% coupon
                    ↓
              Outcome: Y_treatment

2,500 customers → receive nothing
                    ↓
              Outcome: Y_control

Lift = Y_treatment − Y_control

Without a concurrent control group, you cannot separate the effect of the coupon from seasonal variation, trends, or any other factor moving outcomes in the same period.

Historical data does not give you a counterfactual.
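The lift computation is a difference in group means and nothing more. A minimal sketch in Python, with invented 0/1 transaction outcomes (five customers per arm instead of 2,500):

```python
# Toy A/B arms: 1 = customer transacted, 0 = did not.
treatment = [1, 0, 0, 1, 0]
control   = [0, 0, 1, 0, 0]

y_treatment = sum(treatment) / len(treatment)   # mean outcome, coupon arm
y_control   = sum(control) / len(control)       # mean outcome, control arm
lift = y_treatment - y_control
print(round(lift, 2))  # 0.4 - 0.2 = 0.2
```

Randomization is what makes this subtraction meaningful: both arms face the same season, trends, and shocks, so the difference isolates the coupon.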

Part 2: What the Data Shows — Overall Effects

Overall Effect: What Did the Coupon Do?

Average outcomes by experimental condition

                No Coupon   Coupon   Difference
Transactions        0.126    0.152       +0.026
Revenue (USD)       $7.78    $7.54       −$0.24

5,000 customers. Half received the coupon.

The coupon increased transactions.

The coupon did not increase revenue.

Are you confident in both findings?

Why the pattern makes sense

The coupon creates two opposing forces on revenue:

  • More customers transact → more revenue
  • Each transaction captures 20% less margin → less revenue per sale

At scale, these forces roughly cancel out.

The average customer was going to spend around $65 whether or not they received the coupon. The discount moved some of them to buy — but it did not increase what they spent.

This does not mean the coupon failed. It means the average customer is the wrong unit of analysis.
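The rough cancellation can be checked directly from the table above. A back-of-envelope sketch deriving the implied average order value (AOV) in each arm, using revenue per customer = transaction rate × AOV:

```python
# Implied average order value in each experimental arm.
aov_control = 7.78 / 0.126   # no coupon: about $61.75 per order
aov_coupon  = 7.54 / 0.152   # coupon:    about $49.61 per order

ratio = aov_coupon / aov_control
print(round(ratio, 2))  # 0.8: the 20% discount offsets the transaction lift
```

The coupon arm's order value is almost exactly 80% of the control arm's, which is why the two forces net out to roughly zero revenue effect.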

Part 3: Heterogeneity — Who Actually Responds?

Heterogeneity: Acquisition Channel

Does the coupon effect differ by how the customer found Artea?

Channel     Coupon effect on transactions   Coupon effect on revenue
Google      −0.020 (not significant)        −$2.12 (significant)
Facebook    +0.082 (p < 0.05)               +$3.20 (p < 0.1)
Instagram   +0.079 (p < 0.01)               +$3.28 (p < 0.05)
Referral    +0.074 (not significant)        +$2.46 (not significant)

Coefficients show treatment effect relative to Google baseline. Source: regression with interaction terms, AB_test data.

Facebook and Instagram customers respond. Google customers do not — and their revenue goes down.

Why might this pattern exist?

Think about purchase intent at acquisition:

  • Google search → customer is actively looking, likely to buy eventually anyway. Coupon cannibalizes a sale that was coming.

  • Social media (FB/Instagram) → customer was browsing, not searching. Coupon creates a reason to buy that did not previously exist.

This is the difference between converting existing intent and creating new intent.

For Google customers: the coupon is a discount on a sale you were going to make anyway.
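Numerically, each channel's interaction coefficient in a saturated regression is a difference of differences in group means: the treatment effect in that channel minus the treatment effect in the baseline channel. A toy sketch with invented rows and hypothetical field names:

```python
# Toy rows: each records (channel, experimental arm, transacted?).
data = [
    {"channel": "Google",    "coupon": 1, "transactions": 0},
    {"channel": "Google",    "coupon": 0, "transactions": 0},
    {"channel": "Instagram", "coupon": 1, "transactions": 1},
    {"channel": "Instagram", "coupon": 0, "transactions": 0},
]

def treatment_effect(rows, channel):
    """Mean outcome with coupon minus mean outcome without, within one channel."""
    t = [r["transactions"] for r in rows
         if r["channel"] == channel and r["coupon"] == 1]
    c = [r["transactions"] for r in rows
         if r["channel"] == channel and r["coupon"] == 0]
    return sum(t) / len(t) - sum(c) / len(c)

# The interaction term is the effect relative to the Google baseline.
print(treatment_effect(data, "Instagram") - treatment_effect(data, "Google"))
```

With real data the same computation would run over thousands of rows per channel; the structure is identical.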

Heterogeneity: Shopping Cart Status

Does having an abandoned cart change how customers respond?

Shopping cart       Coupon effect on transactions   Coupon effect on revenue
No abandoned cart   +0.008 (not significant)        −$0.74 (not significant)
Abandoned cart      +0.069 to +0.085 (p < 0.01)     +$2.12 to +$3.00 (p < 0.05)

29% of the 5,000 customers had abandoned a cart.

For customers with abandoned carts, the coupon works on both transactions and revenue. For everyone else, it barely moves the needle.

The behavioral story

A customer who added an item to their cart and left is not a random visitor. They:

  • Already identified a product they wanted
  • Had a reason not to complete the purchase

The coupon resolves that reason. It is not persuading someone to want something — it is removing a barrier for someone who already wanted it.

The coupon works as a nudge, not a pitch.

This is the highest-confidence finding in the data. The effect is large, consistent, and has a clean behavioral explanation.

Plant the seed: “Is there anything you know about cart abandoners as a group? What kinds of customers abandon carts more often?”

From Learning to Decision

What you did — in two phases

Phase 1: Training

You had 5,000 customers with known outcomes. You used that data to find patterns — which customers responded to the coupon, and which didn’t.

The AB_test tab was your training data. You were the algorithm.

Phase 2: Inference

You took the rule you learned and applied it to 6,000 different customers — customers with no outcome data, because they were never in the experiment.

The Next_Campaign tab was your test set. You were making predictions, not observations.

This is the structure of every supervised learning problem: learn a function on labeled data, apply it to unlabeled data.

How the comparison table was built

The table you are about to see was not built from your predictions.

It was built from counterfactual outcomes that only the instructor version of the spreadsheet contains.

For each of the 6,000 Next_Campaign customers, that file records:

  • y0 — transactions without a coupon
  • y1 — transactions with a coupon

Both. For the same person. At the same time.

In reality, a customer either gets the coupon or doesn’t — you can never observe both outcomes. This dataset was simulated to allow policy evaluation without running a new experiment.

This is called an oracle. You had the rule. The oracle had the ground truth.

In practice you would not have this. You would run another A/B test — or accept the uncertainty.
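Given both potential outcomes, evaluating any policy reduces to a single sum: count y1 where the policy targets and y0 where it does not. A minimal sketch with three invented customers and hypothetical field names:

```python
# Each simulated customer carries both counterfactual outcomes.
customers = [
    {"cart": True,  "y0": 0, "y1": 1},   # persuadable: buys only with coupon
    {"cart": False, "y0": 1, "y1": 1},   # sure thing: buys either way
    {"cart": False, "y0": 0, "y1": 0},   # lost cause: never buys
]

def transactions(policy, customers):
    """Total transactions if `policy(c)` decides who gets the coupon."""
    return sum(c["y1"] if policy(c) else c["y0"] for c in customers)

print(transactions(lambda c: c["cart"], customers))  # 2
print(transactions(lambda c: True, customers))       # 2, but the sure thing
                                                     # now buys at a discount
```

Both policies yield two transactions, but blanket targeting hands a 20% discount to the customer who would have paid full price: the margin logic behind the table on the next slide.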

Part 4: Your Targeting Policies — Live Comparison

All Targeting Policies Compared

Standard policies ranked by revenue

Policy                   Transactions   Revenue   % Targeted
3a: shopping cart only            804   $43,823          29%
3b: cart + FB/IG                  776   $43,801          15%
2b: Instagram only                764   $42,532          30%
1a: no one                        678   $41,694           0%
3c: cart OR FB/IG                 821   $41,666          67%
2c: FB + IG                       795   $41,644          52%
4: no prior purchase              664   $40,979          33%
2a: Facebook only                 707   $40,806          22%
1b: everyone                      807   $39,680         100%
1c: random                        712   $39,459          49%

Source: New Campaign Instructor tab, counterfactual outcomes.

Sending to everyone produces less revenue than sending to no one: $39,680 versus $41,694. Only random targeting does worse.

When you give a coupon to a customer who was going to buy anyway, you do not gain a transaction. You lose 20% of the margin.

The two best policies target narrow groups: cart abandoners (29%), or cart abandoners from social (15%).

The tradeoff:

  • 3a: more customers, higher transactions, tiny revenue edge
  • 3b: fewer customers, same revenue, lower margin erosion risk

Neither is obviously right. The answer depends on whether Artea’s goal is conversion volume or margin protection.

Where does your team’s policy land?

Three Datasets — One Decision

What you had

AB_test — 5,000 customers Behavioral data: acquisition channel, past purchases, cart status, browsing time. Outcomes after the experiment. No demographics. This is what you used to find patterns.

Next_Campaign — 6,000 customers Same behavioral variables. No outcomes. No demographics. You added a 1 or 0 to each row. This is what you applied your rule to.

What both datasets share:

Neither contained gender. Neither contained race or ethnicity. Not because Artea doesn’t have customers with those characteristics — but because that data was held separately and was not used for targeting.

What else existed

For the same 6,000 Next_Campaign customers, a separate dataset records demographic characteristics — gender and minority status — provided voluntarily by customers at signup.

This information was available to Artea.

It was not in the files you worked with.

You built your targeting rule entirely from behavioral signals. You made decisions about 6,000 people without knowing who they were.

This is not unusual. Most targeting systems are built from behavioral data. Demographic information is often held separately — for privacy reasons, regulatory reasons, or simply because it was never connected to the targeting pipeline.

What happens when you match the two back together?

Who Are Artea’s Customers?

Demographic composition of the Next_Campaign dataset (6,000 customers — the people your policy targeted)

Share of customer base
Female customers 58%
Male customers 42%
Non-minority customers 79%
Minority customers 21%

Source: demographic data held separately from the targeting files.

These are not edge cases. Women are the majority of Artea’s customer base. Minority customers represent more than one in five.

Your targeting rule was applied to this population.

Why this baseline matters

Before asking whether your policy discriminated, you need to know who you were making decisions about.

58% of the customers in Next_Campaign are female. 21% are from a minority group.

Any targeting policy that departs significantly from those proportions is making a choice about which customers receive the offer.

Part 5: One More Thing

Who Gets the Coupon?

Every policy ranked by revenue — and by who it actually reaches

Policy                     Revenue   % Targeted   % Female   % Minority
Baseline (all customers)                              58%         21%
3a: cart only              $43,823          29%     42% ↓         20%
3b: cart + FB/IG           $43,801          15%     49% ↓          7% ↓↓
2b: Instagram only         $42,532          30%     73% ↑↑         6% ↓↓
1a: no one                 $41,694           0%
3c: cart OR FB/IG          $41,666          67%     58% ✓         12% ↓
2c: FB + IG                $41,644          52%     64% ↑          7% ↓↓
4: no prior purchase       $40,979          33%     57% ✓         21% ✓
1b: everyone               $39,680         100%     58% ✓         21% ✓

↓ = undertargets relative to baseline. ↑ = overtargets. ✓ = roughly at baseline. No policy used gender or race as an input.

Why This Matters

The business case

Artea’s customer base is 58% female and 21% minority.

The two best-performing policies systematically undertarget both groups — particularly minorities, who receive coupons at roughly one-third their population share under policy 3b.

These are not low-value customers to exclude. They are the majority of the customer base.

A promotional strategy that concentrates offers on a narrow demographic slice creates three business risks:

  • Loyalty erosion among the majority who never receive offers
  • Brand-perception damage among customers who notice the pattern
  • Concentration of discount dependency in a narrow segment, increasing margin risk over time

Optimizing for short-term revenue lift is not the same as optimizing for long-term customer value.

Is this fair?

Under your two best policies, only 42% of coupon recipients were female (policy 3a) and as few as 7% were minority customers (policy 3b).

The baseline: 58% female, 21% minority.

No one used gender or race as a variable. The algorithm optimized revenue.

Part 6: Algorithmic Bias as a General Phenomenon

Is This Algorithm Fair?

Under your best policies, coupon recipients were only 42% female (3a) and as little as 7% minority (3b), against a baseline of 58% and 21%.

No demographic variable was used. The algorithm optimized revenue.

You cannot answer whether this is fair without first answering: Fair in what sense?

There is no single definition. Before evaluating the algorithm, you must decide what kind of fairness you are asking about.

Notation used throughout:

  • \(D\) — the decision (did the customer receive a coupon?)
  • \(R\) — the response (would the customer transact if given the coupon?)
  • \(A\) — group membership (gender, minority status)

What Should Be Equal?

Independence

\[D \perp A\]

Prediction is independent of group membership.

Plain language: Equal outcomes across groups.

Decision question: Is the probability of receiving a coupon the same regardless of gender or minority status?

Also called: Demographic parity

Separation

\[D \perp A \mid R\]

Prediction is independent of group, given true outcome.

Plain language: Equal errors across groups.

Decision question: Among customers who would have responded to the coupon, are they missed at the same rate regardless of group? Among those who would not respond, are they wrongly targeted at the same rate?

Also called: Equalized odds

Sufficiency

\[R \perp A \mid D\]

True outcome is independent of group, given the prediction.

Plain language: Scores mean the same thing for everyone.

Decision question: Among all customers assigned the same predicted response probability, do they actually respond at the same rate regardless of group?

Also called: Calibration

Independence in Artea

Independence \(D \perp A\)

\[P(D=1 \mid A=female) = P(D=1 \mid A=male)\]

A model satisfies independence if the probability of receiving the offer is the same across groups. If 58% of customers are female, this implies roughly 58% of coupons should go to female customers.

This is the only criterion you can evaluate without knowing who would have responded.

58% of customers are female. A fair policy sends 58% of coupons to female customers.

Policy              Revenue   % ♀   % ♂
Baseline                      58%   42%
3a: cart only       $43,823   42%   58%
3b: cart + FB/IG    $43,801   49%   51%
2b: Instagram       $42,532   73%   27%
3c: cart OR FB/IG   $41,666   58%   42%

Policy 3a sends 42% of coupons to female customers and 58% to male — the mirror image of the population.

No demographic variable was used. Independence is violated anyway.
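Checking independence requires only the decisions and the group labels. A minimal sketch on invented rows (field names hypothetical):

```python
# Toy decisions joined back to group membership.
rows = [
    {"female": True,  "targeted": True},
    {"female": True,  "targeted": False},
    {"female": False, "targeted": True},
    {"female": False, "targeted": True},
]

def coupon_rate(rows, female):
    """P(D=1 | A=group): share of the group that received the offer."""
    group = [r for r in rows if r["female"] == female]
    return sum(r["targeted"] for r in group) / len(group)

print(coupon_rate(rows, True), coupon_rate(rows, False))  # 0.5 1.0, violated
```

This is the only one of the three checks that needs no outcome data at all, which is why it can be run on any deployed policy immediately.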

Separation in Artea

Separation \(D \perp A \mid R\)

\[P(D=0 \mid R=1, A=female) = P(D=0 \mid R=1, A=male)\]

A model satisfies separation if, among customers who would respond to the offer (\(R=1\)), the share who were never given it is the same across groups.

\(R=1\) means: this customer would respond to the offer — they are a persuadable customer. This is the causal outcome: response to the treatment.

In Artea: Among customers who would respond to the coupon, the share who were never targeted should be equal across gender.

Requires an experiment — \(R\) is not observable in deployment data without randomization.

Among customers who would respond to the offer (\(R = 1\)), what share were never offered it?

Policy              Revenue   FNR ♀   FNR ♂   Gap
3a: cart only       $43,823     55%     32%   23 pts
3b: cart + FB/IG    $43,801     66%     55%   11 pts
2b: Instagram       $42,532     48%     70%   22 pts
3c: cart OR FB/IG   $41,666     15%     10%    5 pts

Policy 3a misses 55% of female customers who would have responded — 23 points higher than for male customers.

Instagram (2b) flips the gap: it overtargets women so heavily that it misses more male responders (70%) than female (48%).
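The separation check is a false-negative-rate comparison among would-be responders. A sketch on invented rows; in practice the responder flag (R = 1) has to come from a randomized experiment:

```python
# responder = would transact if offered the coupon (R = 1).
rows = [
    {"female": True,  "responder": True, "targeted": False},  # missed
    {"female": True,  "responder": True, "targeted": True},
    {"female": False, "responder": True, "targeted": True},
    {"female": False, "responder": True, "targeted": True},
]

def fnr(rows, female):
    """P(D=0 | R=1, A=group): share of responders the policy never reached."""
    responders = [r for r in rows if r["female"] == female and r["responder"]]
    return sum(not r["targeted"] for r in responders) / len(responders)

print(fnr(rows, True), fnr(rows, False))  # 0.5 0.0, separation violated
```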

Sufficiency in Artea

Sufficiency \(R \perp A \mid D\)

\[P(R=1 \mid D=1, A=female) = P(R=1 \mid D=1, A=male)\]

A model satisfies sufficiency if, among customers who received the offer (\(D=1\)), the response rate is the same across groups.

When we condition on \(D=1\), \(R=1\) is directly observable as “actually responded” — no experiment needed. The offer either worked or it did not.

In Artea: Among customers who were offered the coupon, the share who actually bought should be the same for female and male customers.

Among customers who received the coupon (\(D=1\)), what share actually responded (\(R=1\))? (Directly observable — no experiment needed.)

Policy                     Revenue   Prec ♀   Prec ♂   Gap
Population response rate              12.3%     9.3%     3 pts
3a: cart only              $43,823    26.7%    15.7%    11 pts
3b: cart + FB/IG           $43,801    33.8%    23.1%    11 pts
2b: Instagram              $42,532    16.8%    14.3%   2.5 pts
3c: cart OR FB/IG          $41,666    15.9%    12.4%   3.5 pts

Even with no targeting at all, sufficiency would be violated — because female customers buy at 12.3% and male customers at 9.3% when offered the coupon. That difference is in the world, not in the algorithm. No targeting rule removes it.

Instagram (2b) comes closest to satisfying sufficiency at the aggregate level — its precision gap (2.5 pts) is the smallest of any high-revenue policy.
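The sufficiency check compares realized response rates among targeted customers, which deployment data already contains. A sketch on invented rows:

```python
# responded = actually transacted after receiving the coupon (observed R = 1).
rows = [
    {"female": True,  "targeted": True, "responded": True},
    {"female": True,  "targeted": True, "responded": False},
    {"female": False, "targeted": True, "responded": False},
    {"female": False, "targeted": True, "responded": False},
]

def precision(rows, female):
    """P(R=1 | D=1, A=group): response rate among targeted group members."""
    targeted = [r for r in rows if r["female"] == female and r["targeted"]]
    return sum(r["responded"] for r in targeted) / len(targeted)

print(precision(rows, True), precision(rows, False))  # 0.5 0.0, unequal
```

Note that if the groups' base rates differ, this gap appears even with no targeting at all, exactly as the population row in the table shows.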

Why You Usually Cannot Have All Three

Female customers respond at 12.3%. Male customers at 9.3%. The base rates differ.

Sufficiency requires that a coupon means the same thing for everyone. But if women buy more often, sending coupons to women will always “work better” for women than for men.

Separation requires missing equal shares of buyers across groups. But a model that maximizes response will concentrate on groups with higher base rates — which mechanically creates unequal miss rates across groups.

Independence requires sending equal shares of coupons to each group. But forcing that equal split means ignoring who actually responds — which reopens the other two gaps.

Each fix creates a new violation. That is not a design flaw. It is arithmetic.

Chouldechova (2017) and Kleinberg et al. (2016) proved this formally. The Artea data shows it empirically.

Summary: which policies satisfy which criteria?

Policy              Revenue   Indep   Sep   Suff
3a: cart only       $43,823     ✗      ✗     ✗
3b: cart + FB/IG    $43,801     ✗      ✗     ✗
2b: Instagram       $42,532     ✗      ✗     ≈
3c: cart OR FB/IG   $41,666     ✓      ≈     ≈

✓ = satisfied, ≈ = close, ✗ = violated (read off the three tables above).

No policy satisfies all three.

The policies with the highest revenue violate all three.

No policy with meaningful predictive power satisfies all three criteria simultaneously. The base rates differ (women respond at 12.3%, men at 9.3%), and with differing base rates, simultaneous satisfaction is mathematically impossible.

Each criterion asks what should be equal across groups:

  • Independence → outcomes
  • Separation → errors
  • Sufficiency → meaning

Fairness is not a property you discover in a model. It is a choice about which criterion matters — made explicitly or by default.

Apple Card (2019)

What happened

Goldman Sachs issued Apple Card in August 2019. The credit algorithm assigned limits based on individual credit history, income, and spending patterns — not gender.

In November 2019, programmer David Heinemeier Hansson tweeted that Apple Card had given him a credit limit 20× higher than his wife's, despite the couple filing joint tax returns and her having the higher credit score.

Apple co-founder Steve Wozniak reported the same disparity.

The New York Department of Financial Services launched an investigation.

Goldman’s response:

“We do not know your gender during the Apple Card application process.”

Two different fairness claims

The complaint was an individual fairness claim.

Two people in the same household, filing joint taxes, one with the higher credit score — different limits. Similar individuals, different outcomes.

Goldman answered a group fairness (independence) claim.

We do not use gender, so the decision is independent of group membership. \(D \perp A\).

They were not answering the same question.

And even the group fairness claim failed

Independence requires not just that \(A\) is excluded from the model, but that no variable in the model carries information about \(A\).

Gender was not an input. But individual credit history was.

Women disproportionately held supplemental cards under a husband’s primary account — giving them thinner personal credit files.

Credit history → correlated with → gender
        ↓
D and A remain correlated
even without A in the model

Excluding a variable is not the same as removing its influence.

Independence failed — not through intent, but through correlation.

Optum Health Algorithm (2019)

What happened

Hospitals and insurers used a widely deployed algorithm from Optum (UnitedHealth) to identify patients who needed high-risk care management.

The algorithm was race-blind. Race was not an input.

A study published in Science (Obermeyer et al., 2019) found that at identical risk scores, Black patients were 26% sicker than white patients — yet the algorithm ranked them as equally low-risk.

Estimated impact: the number of Black patients identified for additional care was reduced by more than half.

The complaint is a separation claim (\(D \perp A \mid R\)):

Among patients who genuinely needed care (\(R=1\)), Black patients were assigned low-risk scores at a higher rate. They were missed — not equally.

Model predicts: health care costs
What was needed: health need

Costs ≠ need. Black patients face systemic barriers to care. Less spending does not mean less sickness.

Optum’s defense is a sufficiency claim (\(R \perp A \mid D\)):

“The cost model was highly predictive of cost.”

Among all patients at a given risk score, the proportion with high actual costs was the same regardless of race.

Two different criteria.

Sufficiency was satisfied — on the wrong variable. Separation was violated on the variable that mattered.

Nobody decided which criterion to use before deployment. The model decided by default.

Twitter Image Cropping (2020)

What happened

Twitter used a saliency algorithm from 2018 to automatically crop images in the timeline — showing the most “interesting” part of a photo without requiring users to click through.

In September 2020, users noticed the algorithm consistently favored white faces over Black faces in cropped previews. The thread went viral. Twitter investigated.

In May 2021, Twitter published its own findings and decommissioned the algorithm.

Yee, Tantipongpipat & Mishra (2021), Twitter ML Ethics team.

What the internal audit found:

  • 4% difference from parity favoring white individuals over Black individuals in cropped previews
  • 8% difference from parity favoring women over men
  • Evidence of “male gaze” bias: algorithm cropped to women’s bodies rather than faces in some images

Twitter’s implicit claim:

Saliency models are trained on human eye-tracking data. Human attention is a neutral signal. Therefore the crop is fair.

The first problem — who was in the training data:

Eye-tracking subjects:
disproportionately young,
Western, university students
    ↓
"Attention" patterns learned
reflect that population

The second problem — what the model reproduced:

Model deployed to 350 million users
    ↓
White faces and lighter skin
systematically preferred in crops

The model did exactly what it was trained to do. The problem was what it was trained on.

Twitter’s response:

“One of our conclusions is that not everything on Twitter is a good candidate for an algorithm, and in this case, how to crop an image is a decision best made by people.”

The only case where the company found it, published it, and ended it.

Every Failure Was Visible in Week 1

In Week 1 we built this framework:

Perception – raw input from the environment

Representation – how inputs are encoded for computation

Model – the function learned from data

Constraints – rules that limit allowable outputs

Behavior – what the system does in the world

Where each failure entered

  • Representation (Apple Card): proxy encoding. Credit history carried gender through historical financial norms.
  • Representation + Constraints (Artea): proxy encoding with no fairness criterion. Cart and channel carried demographics; the policy had no output guardrail.
  • Model (Optum): wrong target variable. Trained to predict cost; should have predicted need.
  • Model (Twitter): wrong training distribution. Right target, but training data drawn from an unrepresentative population.

The representation defines what the model can see. What it cannot see, it can still learn through correlation. What the constraints do not specify, the policy will not enforce.

What Managers Should Actually Do

Do not ignore protected attributes

Excluding a variable does not remove its effect. Correlated behavior carries it back in.

Make the tradeoff explicit

Equal outcomes, equal errors, equal meaning. You cannot have all three.

Measure what you are doing

You cannot audit what you do not observe. Check who is helped — and who is missed.

Fix the right problem

Bias can come from the data, the target, or the policy — not just the model.

Further reading: Barocas, Hardt & Narayanan, Fairness and Machine Learning (free at fairmlbook.org)