Short answer: No. Replacing the max by “sampling with probability proportional to Q” does not give a non-expansive map, hence the corresponding Bellman operator is not a contraction in general (except in very special parameter ranges such as very small discount).
Why your inequality does not apply
- Your inequality |Eρ[f] − Eρ[g]| ≤ Eρ[|f − g|] holds only when the sampling distribution ρ is the same for both expectations.
- In your proposed update, the sampling distribution itself depends on the Q-function. When you compare two Q-functions f and g you get
Eρ(f)[f] − Eρ(g)[g] = [Eρ(f)[f − g]] + [Eρ(f)[g] − Eρ(g)[g]].
The first term is bounded by ∥f − g∥∞, but the second term is a policy-change term that you did not bound, and it can be large. This is exactly where non-expansiveness can fail.
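This decomposition can be checked with a small numerical sketch (the vectors f, g and the normalization are illustrative assumptions, using sampling proportional to the entries):

```python
# Decomposition E_{rho(f)}[f] - E_{rho(g)}[g] = E_{rho(f)}[f - g] + (E_{rho(f)}[g] - E_{rho(g)}[g])
# for sampling probabilities proportional to the (nonnegative) entries.

def rho(x):
    s = sum(x)
    return [v / s for v in x]

def expect(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))

f = [1.0, 0.0, 0.0]
g = [0.9, 0.1, 0.1]

term1 = expect(rho(f), [a - b for a, b in zip(f, g)])  # bounded by ||f - g||_inf = 0.1
term2 = expect(rho(f), g) - expect(rho(g), g)          # policy-change term, unbounded by ||f - g||_inf
total = expect(rho(f), f) - expect(rho(g), g)

print(term1, term2, total)  # total exceeds ||f - g||_inf = 0.1
```

Here the policy-change term already makes the total difference (about 0.245) larger than the sup-norm distance 0.1, so the naive bound cannot go through.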
A concrete counterexample (linear normalization by Q)
Assume for a next-state s′ you draw actions with probabilities proportional to their Q-values, i.e., for a vector x ∈ Rn with nonnegative entries,
ρ_i(x) = x_i / (∑j x_j), and the “merge” map is
M(x) = Eρ(x)[x] = ∑i ρ_i(x) x_i = (∑i x_i^2) / (∑j x_j).
Consider two action-value vectors with n ≥ 2 actions:
- x = (1, 0, 0, …, 0)
- y = (1 − ε, ε, ε, …, ε) with ε ∈ (0, 1).
Then
M(x) = 1,
M(y) = [ (1 − 2ε + n ε^2) ] / [ 1 + (n − 2)ε ].
Hence
|M(x) − M(y)| = [ n ε (1 − ε) ] / [ 1 + (n − 2)ε ].
The sup-norm difference is ∥x − y∥∞ = ε, so
|M(x) − M(y)| / ∥x − y∥∞ = [ n (1 − ε) ] / [ 1 + (n − 2)ε ] → n as ε → 0.
Thus the Lipschitz constant of M with respect to the sup-norm is at least n (> 1). Therefore this aggregation is not non-expansive. The full Bellman operator with discount γ has modulus at least γ n, which is ≥ 1 for typical γ and n ≥ 2. It would only be a contraction if γ < 1/n.
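The counterexample is easy to verify numerically (a sanity check, not a proof; n and ε match the text):

```python
# M(x) = (sum_i x_i^2) / (sum_j x_j): expected value under sampling
# with probability proportional to x.
def M(x):
    s = sum(x)
    return sum(v * v for v in x) / s

n = 4
eps = 1e-4
x = [1.0] + [0.0] * (n - 1)
y = [1.0 - eps] + [eps] * (n - 1)

sup_diff = max(abs(a - b) for a, b in zip(x, y))  # equals eps
ratio = abs(M(x) - M(y)) / sup_diff               # approaches n as eps -> 0

print(ratio)  # ~3.999 for n = 4, eps = 1e-4
```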
Remarks and consequences
- Well-posedness: Sampling “proportional to Q” is not even defined if some Q-values are negative. You would need to shift or otherwise transform Q (e.g., add a large constant or use softmax), which further complicates the analysis. The non-expansiveness issue remains.
- Fixed stochastic policy: If the sampling distribution were a fixed policy π independent of Q, then the Bellman expectation operator T^π is a γ-contraction in ∥·∥∞. The problem arises precisely because the distribution depends on Q.
- Softmax weighting: Using π(a|s) ∝ exp(Q(s,a)/τ) and taking the expected Q under that π also fails to be non-expansive in general. Computing the gradient of this map, one finds its ℓ1 norm can exceed 1 unless τ is large relative to the spread of the Q-values; in fact it grows without bound as τ → 0 or as the Q-values spread out.
- A known contraction alternative: The “soft” optimality operator based on log-sum-exp,
smaxτ(x) = τ log ∑i exp(x_i/τ),
is 1-Lipschitz in the sup-norm (its gradient is the softmax, which sums to 1). Thus the corresponding entropy-regularized Bellman operator is a γ-contraction. Note this is different from taking the expected Q under a softmax policy.
- Convergence of stochastic-policy control: Standard convergence results for control with a stochastic behavior policy (e.g., SARSA with GLIE) do not rely on a single time-invariant contraction when the policy depends on Q; instead they combine stochastic-approximation arguments with a policy that becomes greedy in the limit. Keeping a permanently stochastic policy (e.g., proportional to Q) generally converges to the action-values of that policy, not to the optimal Q*, and may fail to converge at all without additional conditions.
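The contrast between the softmax-expectation and the log-sum-exp operators can be checked with a finite-difference sketch (τ and the test vectors are illustrative choices):

```python
import math

# (1) Expected Q under a softmax policy can expand sup-norm distances;
# (2) the log-sum-exp "soft max" never does (its gradient is the softmax,
#     which has l1 norm exactly 1).

def softmax_expect(x, tau):
    """Expected value of x under pi(i) proportional to exp(x_i / tau)."""
    m = max(x)
    w = [math.exp((v - m) / tau) for v in x]
    return sum(wi * vi for wi, vi in zip(w, x)) / sum(w)

def smax(x, tau):
    """tau * log sum_i exp(x_i / tau), computed stably."""
    m = max(x)
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in x))

tau = 0.1
x = [0.0, -0.3]            # gap of 3*tau between the two values
h = 1e-4
y = [x[0] + h, x[1] - h]   # sup-norm perturbation of size h

exp_ratio = abs(softmax_expect(y, tau) - softmax_expect(x, tau)) / h
lse_ratio = abs(smax(y, tau) - smax(x, tau)) / h

print(exp_ratio)  # ~1.18 > 1: expansion, so no gamma-contraction for gamma near 1
print(lse_ratio)  # <= 1: the soft operator is non-expansive
```

The expansion for the softmax-expectation appears exactly when some Q-value sits more than τ below the softmax mean, which is why small τ (a near-greedy policy) is the problematic regime.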
Summary
- The operator that replaces max with “sample according to weights proportional to Q” is not non-expansive; its Lipschitz constant can be as large as the number of actions. Therefore, the associated Bellman operator is not a contraction for typical γ, and your inequality (which uses a fixed distribution) does not establish contraction.
- If you want a smooth alternative to max that preserves contraction, use the log-sum-exp (“softmax value”) in the target, not the expected Q under a Q-dependent policy.