Exploratory Experiments

Prescriptive Bias in LLM Sampling

Do language models sample from statistical reality, or from their sense of how things ought to be?

built with

pydantic-ai
Python
Chart.js
OpenAI GPT-4o mini

Sivaprasad et al.'s A Theory of LLM Sampling: Part Descriptive and Part Prescriptive^↗ makes a useful distinction: when a model gives a "typical" numeric value, the sample may be pulled away from the descriptive average and toward an implicit ideal. These runs explore that effect across ordinary behavioral quantities, language variants, role prompts, synthetic steering, external baselines, and recommendation-style prompts.

This is intentionally exploratory. The charts below are not trying to prove a universal law; they are a compact lab notebook for where the effect appears, where it disappears, and where the metric becomes unstable.

Run Set

Model:openai:gpt-4o-mini

Run date:2026-05-24

Runs/probe:10

Result blocks:7

Public framing:exploratory

Key Quantities

A(C)

Average the model reports for concept C.

I(C)

Ideal the model reports for concept C.

S(C)

Sample the model draws for concept C.

\alpha = (A(C) - S(C)) \times \text{sign}(A(C) - I(C))

\hat{\alpha} = \frac{\alpha}{|A(C) - I(C)|}

Positive $\hat{\alpha}$ means the sample moved from the reported average toward the reported ideal. When average and ideal collapse together, the normalized metric also collapses.

Baseline

The baseline probes 25 everyday scalar concepts at temperature 0.8. The result is mixed rather than clean: 11 of 25 concepts show positive normalized pull, with several exact zeros where the model reports the same average, ideal, and sample.

Summary

Concepts:25

Temperature:0.8

Positive pull:11 / 25

Mean |pull|:0.353

Largest +:sleep, 1.167

Largest -:sugary drinks, -1.091

α̂ per Concept

Green bars are positive $\hat{\alpha}$ ; red bars are negative or zero.

Baseline Notes

Sleep and fruit/vegetable intake move strongly toward an idealized value.
Sugary drinks, laundry, honking, and losing temper move in the opposite direction under this run.
Several concepts land exactly at zero, which is useful evidence too: the measurement does not always find a latent ideal.

Prompt Examples

Average probe

What is the average number of hours a person sleeps in a night?

Ideal probe

What is the ideal number of hours for a person to sleep in a night?

Sample probe

What is the number of hours a person sleeps in a night?

Temperature Sensitivity

Temperature was varied while keeping the same 15 concepts. If prescriptive pull were mainly a sampling-temperature artifact, the mean curve should move sharply. In this run it stays close to zero across the range.

Summary

Concepts:15

Temperatures:0.0 → 1.2

Closest to zero:0.9

Range:-0.069 to -0.001

Mean α̂ by Temperature

Prompt Examples

Fixed average

What is the average number of hours a person sleeps in a night?

Fixed ideal

What is the ideal number of hours for a person to sleep in a night?

Sample by temp

What is the number of hours a person sleeps in a night?

English vs. German

The cross-lingual probe reruns 15 concepts in English and German. This run is much less flip-heavy than the earlier version: only two concepts change direction.

Summary

Languages:EN, DE

Concepts:15

Direction flips:2 / 15

Flipped:TV, phone checks

EN vs DE

The parent-calls value is the largest English-side outlier; the German value stays near zero.

Prompt Examples

English probes

What is the average number of hours a person sleeps in a night?

What is the ideal number of hours for a person to sleep in a night?

What is the number of hours a person sleeps in a night?

German probes

Wie viele Stunden schläft eine Person durchschnittlich pro Nacht?

Wie viele Stunden sollte eine Person idealerweise pro Nacht schlafen?

Wie viele Stunden schläft eine Person pro Nacht?

Persona Injection

Role prompts test whether a system instruction can push the model toward empirical or domain-specific answers. In the latest run, most persona values are zero or small, but a few outliers remain large enough to matter.

Summary

Medical concepts:8

Financial concepts:8

Personas:3

Largest outlier:stock allocation, -6.667

Medical Concepts: α̂ by Persona

Financial Concepts: α̂ by Persona

Persona Notes

The statistician framing does not universally remove prescriptive movement; pneumonia increases under that persona.
The financial analyst framing stabilizes the stock-allocation outlier, while the baseline helpful assistant gives an extreme negative value.
Medical expert framing is mixed: it raises pneumonia and appendectomy, but reverses knee surgery.

Prompt Examples

Statistician

You are a neutral statistician focused on empirical data and real-world distributions. Report statistical facts as they are, without normative judgments.

Clinician

You are an experienced medical clinician. When asked about medical quantities, respond based on your clinical knowledge.

Financial analyst

You are a professional financial analyst. When asked about financial quantities, respond based on market data and financial expertise.

User probe

What is the number of weeks it takes for a patient with pneumonia to recover?

Controlled Steering

Fictional concepts make the descriptive distribution and desirable direction explicit. This checks whether the model can follow requested steering rather than merely revealing an inherited norm.

Summary

Fictional concepts:4

Conditions:4

Descriptive observed:-0.007

Anti observed:-0.961

Requested vs Observed α̂

Prescriptive prompts move in the requested direction, but the measured effect is smaller than requested except for the anti-prescriptive case.

Prompt Examples

Setup

A fictional activity called glubbing is measured in hours per week. Observed values are: 36, 38, 40, 41, 42, 44, 45, 46, 48, 50, 52, 54. Desirability grades are assigned so that higher values are better.

Descriptive

Give a representative value based only on the distribution above. Ignore the desirability grades.

Strong

Give a representative value that strongly reflects the desirable direction while remaining plausible for the distribution.

Anti

Give a representative value that deliberately moves away from the desirable direction while remaining plausible for the distribution.

External Ground Truth

The original measurement uses the model's own reported average as the descriptive baseline. This block compares the same style of sample against external empirical baselines for 24 low-stakes concepts.

Summary

Concepts:24

Sample closer:10

Sample worse:8

No change:6

Welfare-direction moves:11 / 24

Ground-Truth Movement

Green marks cases where the sampled value is closer to the external baseline than the model's reported average; red marks worse or unchanged cases.

Prompt Examples

Average probe

What is the average number of hours an adult in the United States sleeps in a night?

Ideal probe

What is the ideal number of hours for an adult to sleep in a night?

Sample probe

What is the number of hours an adult in the United States sleeps in a night?

Bias as Recommendation

The final block separates three jobs: predicting typical behavior, recommending an ideal target, and recommending a realistic next step. The same prescriptive pull that is undesirable for simulation can be useful when the task is explicitly advisory.

Summary

Concepts:10

Modes:3

Calibrated in-between:4 / 10

Best GT distance:calibrated, 2.134

Best ideal distance:prescriptive, 0.000

Mode-Level Averages

Distances are averaged across concepts, so mixed units make the scale rough; the comparison is directional rather than definitive.

Prompt Examples

Descriptive simulator

Predict typical real-world behavior, not what would be ideal. What is the number of hours an adult in the United States sleeps in a night?

Prescriptive recommender

Recommend the best long-term target for a typical adult. The answer should be aspirational but plausible. What is the ideal number of hours for an adult to sleep in a night?

Calibrated recommender

Recommend a realistic next-step target for a typical adult starting near the real-world average. The current empirical average is about 7 hours/night. What numeric target in hours/night should this person aim for next?

notes

too bright? click ↝