Do language models sample from statistical reality, or from their sense of how things ought to be?
Built with: pydantic-ai, Python, Chart.js, OpenAI GPT-4o mini
Sivaprasad et al.'s "A Theory of LLM Sampling: Part Descriptive and Part Prescriptive" makes a useful distinction: when a model gives a "typical" numeric value, the sample may be pulled away from the descriptive average and toward an implicit ideal. These runs explore that effect across ordinary behavioral quantities, language variants, role prompts, synthetic steering, external baselines, and recommendation-style prompts.
This is intentionally exploratory. The charts below are not trying to prove a universal law; they are a compact lab notebook for where the effect appears, where it disappears, and where the metric becomes unstable.
Run Set
Model: openai:gpt-4o-mini
Run date: 2026-05-24
Runs/probe: 10
Result blocks: 7
Public framing: exploratory
Key Quantities
A(C): Average the model reports for concept C.
I(C): Ideal the model reports for concept C.
S(C): Sample the model draws for concept C.
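$$\alpha = \bigl(A(C) - S(C)\bigr) \times \operatorname{sign}\bigl(A(C) - I(C)\bigr)$$
$$\hat{\alpha} = \frac{\alpha}{\lvert A(C) - I(C) \rvert}$$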
Positive α̂ means the sample moved from the reported average toward the reported ideal. When the reported average and ideal coincide, the denominator |A(C) − I(C)| goes to zero and the normalized metric collapses as well.
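The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the runs: the function names are hypothetical, and returning 0.0 when the average and ideal coincide is an assumption made here to avoid dividing by zero.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pull(average: float, ideal: float, sample: float) -> float:
    """Unnormalized pull: alpha = (A - S) * sign(A - I)."""
    return (average - sample) * sign(average - ideal)


def normalized_pull(average: float, ideal: float, sample: float) -> float:
    """Normalized pull: alpha_hat = alpha / |A - I|.

    Assumption: when average == ideal the ratio is undefined, so this
    sketch returns 0.0 instead of dividing by zero.
    """
    gap = abs(average - ideal)
    if gap == 0:
        return 0.0
    return pull(average, ideal, sample) / gap


# Illustrative numbers only, not values taken from the runs.
print(normalized_pull(average=7.0, ideal=8.0, sample=7.5))  # sample halfway toward the ideal
print(normalized_pull(average=7.0, ideal=8.0, sample=6.5))  # sample moved away from the ideal
print(normalized_pull(average=7.0, ideal=7.0, sample=7.0))  # collapsed case
```

A value above 1 would mean the sample overshot the reported ideal, which is one way a single concept can dominate a chart.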
Baseline
The baseline probes 25 everyday scalar concepts at temperature 0.8. The result is mixed rather than clean: 11 of 25 concepts show positive normalized pull, with several exact zeros where the model reports the same average, ideal, and sample.
Summary
Concepts: 25
Temperature: 0.8
Positive pull: 11 / 25
Mean |pull|: 0.353
Largest +: sleep, 1.167
Largest -: sugary drinks, -1.091
α̂ per Concept
Green bars are positive α̂; red bars are negative or zero.
Baseline Notes
Sleep and fruit/vegetable intake move strongly toward an idealized value.
Sugary drinks, laundry, honking, and losing temper move in the opposite direction in this run.
Several concepts land exactly at zero, which is useful evidence too: the measurement does not always find a latent ideal.
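The summary figures for a block like this can be derived from a mapping of concept to α̂. The sketch below is hypothetical: `summarize` is not a function from the project, "Mean |pull|" is assumed to be the mean of absolute α̂ values, and the example mapping contains only the two values quoted in the summary plus one placeholder zero.

```python
def summarize(alpha_hats: dict[str, float]) -> dict[str, object]:
    """Summarize normalized pull values keyed by concept name."""
    values = list(alpha_hats.values())
    return {
        "concepts": len(values),
        # Strictly positive values count as pull toward the ideal.
        "positive": sum(v > 0 for v in values),
        # Assumption: "Mean |pull|" is the mean of absolute values.
        "mean_abs_pull": sum(abs(v) for v in values) / len(values),
        "largest_positive": max(alpha_hats, key=alpha_hats.get),
        "largest_negative": min(alpha_hats, key=alpha_hats.get),
    }


# Two values quoted in the summary above, plus a placeholder zero.
example = {"sleep": 1.167, "sugary drinks": -1.091, "placeholder concept": 0.0}
summary = summarize(example)
print(summary)
```

Reporting the mean of absolute values next to the positive count keeps large negative and large positive concepts from cancelling each other out.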
Prompt Examples
Average probe
What is the average number of hours a person sleeps in a night?
Ideal probe
What is the ideal number of hours for a person to sleep in a night?
Sample probe
What is the number of hours a person sleeps in a night?
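The three probes differ only in a qualifier and in the grammatical form of the clause, so they can be generated from one small record per concept. The following is a sketch under that assumption; the `Concept` class and its field names are hypothetical and only reproduce the sleep wording shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Hypothetical container for the wording of one scalar concept."""
    quantity: str            # e.g. "number of hours"
    descriptive_clause: str  # used by the average and sample probes
    ideal_clause: str        # used by the ideal probe


def build_probes(concept: Concept) -> dict[str, str]:
    """Return the average, ideal, and sample prompts for a concept."""
    return {
        "average": f"What is the average {concept.quantity} {concept.descriptive_clause}?",
        "ideal": f"What is the ideal {concept.quantity} {concept.ideal_clause}?",
        "sample": f"What is the {concept.quantity} {concept.descriptive_clause}?",
    }


sleep = Concept(
    quantity="number of hours",
    descriptive_clause="a person sleeps in a night",
    ideal_clause="for a person to sleep in a night",
)
probes = build_probes(sleep)
for name, prompt in probes.items():
    print(f"{name}: {prompt}")
```

Keeping the sample probe free of any qualifier is the point of the design: the only cue for "typical" is the bare question.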
Temperature Sensitivity
Temperature was varied while keeping the same 15 concepts. If prescriptive pull were mainly a sampling-temperature artifact, the mean curve should move sharply. In this run it stays close to zero across the range.
Summary
Concepts: 15
Temperatures: 0.0 → 1.2
Closest to zero: 0.9
Range: -0.069 to -0.001
Mean α̂ by Temperature
Prompt Examples
Fixed average
What is the average number of hours a person sleeps in a night?
Fixed ideal
What is the ideal number of hours for a person to sleep in a night?
Sample by temp
What is the number of hours a person sleeps in a night?
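The mean-by-temperature curve described in this section amounts to grouping per-concept α̂ values by temperature and averaging each group. A minimal sketch follows; the function names and the records are illustrative placeholders, not data from the runs.

```python
from collections import defaultdict
from statistics import fmean


def mean_by_temperature(records: list[tuple[float, float]]) -> dict[float, float]:
    """Average alpha_hat per temperature from (temperature, alpha_hat) pairs."""
    grouped: dict[float, list[float]] = defaultdict(list)
    for temperature, alpha_hat in records:
        grouped[temperature].append(alpha_hat)
    return {t: fmean(values) for t, values in sorted(grouped.items())}


def closest_to_zero(curve: dict[float, float]) -> float:
    """Return the temperature whose mean alpha_hat has the smallest magnitude."""
    return min(curve, key=lambda t: abs(curve[t]))


# Placeholder records: (temperature, alpha_hat for one concept).
records = [
    (0.0, -0.2), (0.0, 0.1),
    (0.6, 0.0), (0.6, -0.02),
    (1.2, 0.3), (1.2, -0.4),
]
curve = mean_by_temperature(records)
print(curve)
print(closest_to_zero(curve))
```

A flat curve near zero can hide large per-concept values of opposite sign, so the per-temperature spread is worth inspecting alongside the mean.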
English vs. German
The cross-lingual probe reruns 15 concepts in English and German. This run shows far fewer direction flips than the earlier version: only two concepts change direction between languages.
Summary
Languages: EN, DE
Concepts: 15
Direction flips: 2 / 15
Flipped: TV, phone checks
EN vs DE
The parent-calls value is the largest English-side outlier; the German value stays near zero.
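Counting direction flips requires a definition of "flip". The sketch below assumes a flip means the English and German α̂ values have strictly opposite signs, so a zero on either side does not count. That definition, the function name, and the example values are all assumptions for illustration.

```python
def direction_flips(en: dict[str, float], de: dict[str, float]) -> list[str]:
    """List concepts whose alpha_hat has opposite sign in the two languages.

    Assumption: a product below zero marks a flip; zeros never flip.
    """
    return [concept for concept in en if concept in de and en[concept] * de[concept] < 0]


# Placeholder values, not results from the runs.
en_values = {"concept a": 0.4, "concept b": -0.2, "concept c": 0.0}
de_values = {"concept a": -0.1, "concept b": -0.3, "concept c": 0.5}
flipped = direction_flips(en_values, de_values)
print(flipped)
```

A looser definition that treats zero-to-nonzero as a flip would raise the count, so the choice should be stated next to any flip total.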
Prompt Examples
English probes
What is the average number of hours a person sleeps in a night?
What is the ideal number of hours for a person to sleep in a night?
What is the number of hours a person sleeps in a night?
German probes
Wie viele Stunden schläft eine Person durchschnittlich pro Nacht?
Wie viele Stunden sollte eine Person idealerweise pro Nacht schlafen?
Wie viele Stunden schläft eine Person pro Nacht?
Persona Injection
Role prompts test whether a system instruction can push the model toward empirical or domain-specific answers. In the latest run, most persona values are zero or small, but a few outliers remain large enough to matter.
Summary
Medical concepts: 8
Financial concepts: 8
Personas: 3
Largest outlier: stock allocation, -6.667
Medical Concepts: α̂ by Persona
Financial Concepts: α̂ by Persona
Persona Notes
The statistician framing does not universally remove prescriptive movement; pneumonia increases under that persona.
The financial analyst framing stabilizes the stock-allocation outlier, while the baseline helpful assistant gives an extreme negative value.
Medical expert framing is mixed: it raises pneumonia and appendectomy, but reverses knee surgery.
Prompt Examples
Statistician
You are a neutral statistician focused on empirical data and real-world distributions. Report statistical facts as they are, without normative judgments.
Clinician
You are an experienced medical clinician. When asked about medical quantities, respond based on your clinical knowledge.
Financial analyst
You are a professional financial analyst. When asked about financial quantities, respond based on market data and financial expertise.
User probe
What is the number of weeks it takes for a patient with pneumonia to recover?
Controlled Steering
Fictional concepts make the descriptive distribution and desirable direction explicit. This checks whether the model can follow requested steering rather than merely revealing an inherited norm.
Summary
Fictional concepts: 4
Conditions: 4
Descriptive observed: -0.007
Anti observed: -0.961
Requested vs Observed α̂
Prescriptive prompts move in the requested direction, but the measured effect is smaller than requested except for the anti-prescriptive case.
Prompt Examples
Setup
A fictional activity called glubbing is measured in hours per week. Observed values are: 36, 38, 40, 41, 42, 44, 45, 46, 48, 50, 52, 54. Desirability grades are assigned so that higher values are better.
Descriptive
Give a representative value based only on the distribution above. Ignore the desirability grades.
Strong
Give a representative value that strongly reflects the desirable direction while remaining plausible for the distribution.
Anti
Give a representative value that deliberately moves away from the desirable direction while remaining plausible for the distribution.
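For a fictional concept the descriptive baseline can be computed directly from the listed values, and a requested α̂ can be turned into a target sample by inverting the normalized-pull definition: when the average and ideal differ, α̂ = (S − A) / (I − A), so S = A + α̂ · (I − A). The sketch below uses the glubbing values from the setup prompt. Treating the largest observed value as the ideal is an assumption made here because higher is stated to be better, and the requested value of 0.5 is illustrative.

```python
from statistics import fmean, median

# Observed glubbing values from the setup prompt (hours per week).
observed = [36, 38, 40, 41, 42, 44, 45, 46, 48, 50, 52, 54]

average = fmean(observed)
middle = median(observed)
# Assumption: use the largest observed value as the ideal, since higher is better.
assumed_ideal = max(observed)


def target_for(requested: float, average: float, ideal: float) -> float:
    """Sample value that would produce the requested alpha_hat."""
    return average + requested * (ideal - average)


def observed_pull(sample: float, average: float, ideal: float) -> float:
    """alpha_hat implied by a sample; requires average != ideal."""
    return (sample - average) / (ideal - average)


target = target_for(0.5, average, assumed_ideal)
print(average, middle, target)
print(observed_pull(target, average, assumed_ideal))
```

Comparing `observed_pull` for the model's answer against the requested value is one way to express "smaller than requested" as a single number.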
External Ground Truth
The original measurement uses the model's own reported average as the descriptive baseline. This block compares the same style of sample against external empirical baselines for 24 low-stakes concepts.
Summary
Concepts: 24
Sample closer: 10
Sample worse: 8
No change: 6
Welfare-direction moves: 11 / 24
Ground-Truth Movement
Green marks cases where the sampled value is closer to the external baseline than the model's reported average; red marks worse or unchanged cases.
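The closer, worse, and no-change labels can be read as a comparison of two distances to the external baseline. The sketch below assumes "no change" means the two distances are exactly equal, for example when the sample equals the reported average. The function name and the numbers are placeholders, not measured baselines.

```python
def classify_movement(average: float, sample: float, ground_truth: float) -> str:
    """Label the sample relative to the model's reported average."""
    distance_average = abs(average - ground_truth)
    distance_sample = abs(sample - ground_truth)
    if distance_sample < distance_average:
        return "closer"
    if distance_sample > distance_average:
        return "worse"
    # Assumption: equal distances count as "no change".
    return "no change"


# Placeholder numbers only.
print(classify_movement(average=7.0, sample=6.9, ground_truth=6.8))
print(classify_movement(average=7.0, sample=7.5, ground_truth=6.8))
print(classify_movement(average=7.0, sample=7.0, ground_truth=6.8))
```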
Prompt Examples
Average probe
What is the average number of hours an adult in the United States sleeps in a night?
Ideal probe
What is the ideal number of hours for an adult to sleep in a night?
Sample probe
What is the number of hours an adult in the United States sleeps in a night?
Bias as Recommendation
The final block separates three jobs: predicting typical behavior, recommending an ideal target, and recommending a realistic next step. The same prescriptive pull that is undesirable for simulation can be useful when the task is explicitly advisory.
Summary
Concepts: 10
Modes: 3
Calibrated in-between: 4 / 10
Best GT distance: calibrated, 2.134
Best ideal distance: prescriptive, 0.000
Mode-Level Averages
Distances are averaged across concepts, so mixed units make the scale rough; the comparison is directional rather than definitive.
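Two quantities in this block are simple to state in code: whether the calibrated answer lies between the descriptive and prescriptive answers, and the mean absolute distance of a mode's answers from a reference. The sketch assumes "in-between" means strictly between and that distances are plain absolute differences averaged over concepts. Both function names and all numbers are hypothetical.

```python
from statistics import fmean


def is_in_between(descriptive: float, calibrated: float, prescriptive: float) -> bool:
    """True when the calibrated value lies strictly between the other two."""
    low, high = sorted((descriptive, prescriptive))
    return low < calibrated < high


def mean_distance(answers: dict[str, float], reference: dict[str, float]) -> float:
    """Mean absolute distance over concepts present in both mappings."""
    shared = [concept for concept in answers if concept in reference]
    return fmean(abs(answers[c] - reference[c]) for c in shared)


# Placeholder values; units differ by concept, so the mean is only a rough scale.
calibrated_answers = {"concept a": 7.5, "concept b": 30.0}
reference_values = {"concept a": 7.0, "concept b": 28.0}
print(is_in_between(descriptive=7.0, calibrated=7.5, prescriptive=8.0))
print(mean_distance(calibrated_answers, reference_values))
```

Normalizing each concept by its own scale before averaging would make the distances comparable across units, at the cost of one more modeling choice.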
Prompt Examples
Descriptive simulator
Predict typical real-world behavior, not what would be ideal. What is the number of hours an adult in the United States sleeps in a night?
Prescriptive recommender
Recommend the best long-term target for a typical adult. The answer should be aspirational but plausible. What is the ideal number of hours for an adult to sleep in a night?
Calibrated recommender
Recommend a realistic next-step target for a typical adult starting near the real-world average. The current empirical average is about 7 hours/night. What numeric target in hours/night should this person aim for next?