Short answer: No. Replacing the max by “sampling with probability proportional to Q” does not give a non-expansive map, hence the corresponding Bellman operator is not a contraction in general (except in very special parameter ranges such as very small discount).
Why your inequality does not apply
- Your inequality |Eρ[f] − Eρ[g]| ≤ Eρ[|f − g|] holds only when the sampling distribution ρ is the same for both expectations.
- In your proposed update, the sampling distribution itself depends on the Q-function. When you compare two Q-functions f and g you get
Eρ(f)[f] − Eρ(g)[g] = [Eρ(f)[f − g]] + [Eρ(f)[g] − Eρ(g)[g]].
The first term is bounded by ∥f − g∥∞, but the second term is a policy-change term that you did not bound, and it can be large. This is exactly where non-expansiveness can fail.
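This decomposition can be checked with a small numerical sketch (the vectors f, g and the normalization are illustrative assumptions, using sampling proportional to the entries):

```python
# Decomposition E_{rho(f)}[f] - E_{rho(g)}[g] = E_{rho(f)}[f - g] + (E_{rho(f)}[g] - E_{rho(g)}[g])
# for sampling probabilities proportional to the (nonnegative) entries.

def rho(x):
    s = sum(x)
    return [v / s for v in x]

def expect(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))

f = [1.0, 0.0, 0.0]
g = [0.9, 0.1, 0.1]

term1 = expect(rho(f), [a - b for a, b in zip(f, g)])  # bounded by ||f - g||_inf = 0.1
term2 = expect(rho(f), g) - expect(rho(g), g)          # policy-change term, unbounded by ||f - g||_inf
total = expect(rho(f), f) - expect(rho(g), g)

print(term1, term2, total)  # total exceeds ||f - g||_inf = 0.1
```

Here the policy-change term already makes the total difference (about 0.245) larger than the sup-norm distance 0.1, so the naive bound cannot go through.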
A concrete counterexample (linear normalization by Q)
Assume for a next-state s′ you draw actions with probabilities proportional to their Q-values, i.e., for a vector x ∈ Rn with nonnegative entries,
ρ_i(x) = x_i / (∑j x_j), and the “merge” map is
M(x) = Eρ(x)[x] = ∑i ρ_i(x) x_i = (∑i x_i^2) / (∑j x_j).
Consider two action-value vectors with n ≥ 2 actions:
- x = (1, 0, 0, …, 0)
- y = (1 − ε, ε, ε, …, ε) with ε ∈ (0, 1).
Then
M(x) = 1,
M(y) = [ (1 − 2ε + n ε^2) ] / [ 1 + (n − 2)ε ].
Hence
|M(x) − M(y)| = [ n ε (1 − ε) ] / [ 1 + (n − 2)ε ].
The sup-norm difference is ∥x − y∥∞ = ε, so
|M(x) − M(y)| / ∥x − y∥∞ = [ n (1 − ε) ] / [ 1 + (n − 2)ε ] → n as ε → 0.
Thus the Lipschitz constant of M with respect to the sup-norm is at least n (> 1). Therefore this aggregation is not non-expansive. The full Bellman operator with discount γ has modulus at least γ n, which is ≥ 1 for typical γ and n ≥ 2. It would only be a contraction if γ < 1/n.
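The counterexample is easy to verify numerically (a sanity check, not a proof; n and ε match the text):

```python
# M(x) = (sum_i x_i^2) / (sum_j x_j): expected value under sampling
# with probability proportional to x.
def M(x):
    s = sum(x)
    return sum(v * v for v in x) / s

n = 4
eps = 1e-4
x = [1.0] + [0.0] * (n - 1)
y = [1.0 - eps] + [eps] * (n - 1)

sup_diff = max(abs(a - b) for a, b in zip(x, y))  # equals eps
ratio = abs(M(x) - M(y)) / sup_diff               # approaches n as eps -> 0

print(ratio)  # ~3.999 for n = 4, eps = 1e-4
```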
Remarks and consequences
- Well-posedness: Sampling “proportional to Q” is not even defined if some Q-values are negative. You would need to shift or otherwise transform Q (e.g., add a large constant or use softmax), which further complicates the analysis. The non-expansiveness issue remains.
- Fixed stochastic policy: If the sampling distribution were a fixed policy π independent of Q, then the Bellman expectation operator T^π is a γ-contraction in ∥·∥∞. The problem arises precisely because the distribution depends on Q.
- Softmax weighting: Using π(a|s) ∝ exp(Q(s,a)/τ) and taking the expected Q under that π also fails to be non-expansive in general. Computing the gradient of this map, one finds its ℓ1 norm can exceed 1 unless τ is large relative to the spread of the Q-values; in fact it grows without bound as τ → 0 or as the Q-values spread out.
- A known contraction alternative: The “soft” optimality operator based on log-sum-exp,
smaxτ(x) = τ log ∑i exp(x_i/τ),
is 1-Lipschitz in the sup-norm (its gradient is the softmax, which sums to 1). Thus the corresponding entropy-regularized Bellman operator is a γ-contraction. Note this is different from taking the expected Q under a softmax policy.
- Convergence of stochastic-policy control: Standard convergence results for control with a stochastic behavior policy (e.g., SARSA with GLIE) do not rely on a single time-invariant contraction when the policy depends on Q; instead they combine stochastic-approximation arguments with a policy that becomes greedy in the limit. Keeping a permanently stochastic policy (e.g., proportional to Q) generally converges to the action-values of that policy, not to the optimal Q*, and may fail to converge at all without additional conditions.
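The contrast between the softmax-expectation and the log-sum-exp operators can be checked with a finite-difference sketch (τ and the test vectors are illustrative choices):

```python
import math

# (1) Expected Q under a softmax policy can expand sup-norm distances;
# (2) the log-sum-exp "soft max" never does (its gradient is the softmax,
#     which has l1 norm exactly 1).

def softmax_expect(x, tau):
    """Expected value of x under pi(i) proportional to exp(x_i / tau)."""
    m = max(x)
    w = [math.exp((v - m) / tau) for v in x]
    return sum(wi * vi for wi, vi in zip(w, x)) / sum(w)

def smax(x, tau):
    """tau * log sum_i exp(x_i / tau), computed stably."""
    m = max(x)
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in x))

tau = 0.1
x = [0.0, -0.3]            # gap of 3*tau between the two values
h = 1e-4
y = [x[0] + h, x[1] - h]   # sup-norm perturbation of size h

exp_ratio = abs(softmax_expect(y, tau) - softmax_expect(x, tau)) / h
lse_ratio = abs(smax(y, tau) - smax(x, tau)) / h

print(exp_ratio)  # ~1.18 > 1: expansion, so no gamma-contraction for gamma near 1
print(lse_ratio)  # <= 1: the soft operator is non-expansive
```

The expansion for the softmax-expectation appears exactly when some Q-value sits more than τ below the softmax mean, which is why small τ (a near-greedy policy) is the problematic regime.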
Summary
- The operator that replaces max with “sample according to weights proportional to Q” is not non-expansive; its Lipschitz constant can be as large as the number of actions. Therefore, the associated Bellman operator is not a contraction for typical γ, and your inequality (which uses a fixed distribution) does not establish contraction.
- If you want a smooth alternative to max that preserves contraction, use the log-sum-exp (“softmax value”) in the target, not the expected Q under a Q-dependent policy.