No – replacing the max operator by a “draw an action with probability proportional to its current Q value’’ rule destroys the non-expansiveness that makes the usual Bellman optimality operator a γ–contraction.
The idea looks harmless – we only average the Q values instead of
taking their maximum – but the weighting term itself depends on Q, so
a perturbation of Q changes both the values being averaged and the
weights, and the two effects can amplify differences between two Q–tables.
For one next state s′ define, for any vector q = (q1, …, qk) of non-negative action values with positive sum,
πq(i) := qi / Σj qj    (probability of drawing action i)
F(q) := Σi πq(i) qi = Σi qi² / Σi qi    (expected Q under that policy)
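These two definitions are easy to check numerically; a minimal sketch (the function names `pi_q` and `F` are mine, not from any library):

```python
import numpy as np

def pi_q(q):
    """Policy that draws action i with probability proportional to q_i."""
    q = np.asarray(q, dtype=float)
    return q / q.sum()

def F(q):
    """Expected Q under the proportional policy: sum(q_i^2) / sum(q_i)."""
    q = np.asarray(q, dtype=float)
    return np.dot(q, q) / q.sum()
```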
For an MDP with transition function P and reward R the update in expectation is
(TQ)(s,a) = R(s,a) + γ Es′~P[ F(Q(s′,·)) ].
The question is whether T is a γ–contraction in the ∞–norm.
Take the degenerate MDP that has
• a single state s,
• two actions a1, a2,
• deterministic self transition and zero reward.
In this MDP (TQ)(s,a) = γ F(Q(s,·)) for both actions, so contraction of T (with γ<1) would have to hold for the map F alone:
|F(q) – F(q′)| ≤ ||q – q′||∞ for all non-negative q,q′∈ℝ² with positive sum. (★)
Choose
q = (1 , 0.1) , q′ = (1.1 , 0).
Then ||q–q′||∞ = 0.1 whereas
F(q) = (1² + 0.1²) / (1 + 0.1) = 1.01 / 1.1 = 0.91818…,
F(q′) = (1.1² + 0²) / (1.1 + 0) = 1.21 / 1.1 = 1.10,
and therefore
|F(q) − F(q′)| = 0.18182… > 0.1 = ||q − q′||∞.
Inequality (★) is violated; F is expansive around these points, with local expansion ratio 0.18182…/0.1 ≈ 1.82. Consequently
||TQ – TQ′||∞ = γ |F(q) – F(q′)| > γ ||Q – Q′||∞,
and for any discount factor γ > 1/1.82 ≈ 0.55 the Bellman operator T is already not a contraction at these points (for γ = 0.9 the local Lipschitz constant is ≈ 1.64).
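The counterexample can be verified in a few lines; the numbers below are exactly those used above:

```python
import numpy as np

def F(q):
    """F(q) = sum(q_i^2) / sum(q_i), the expected Q under the proportional policy."""
    q = np.asarray(q, dtype=float)
    return np.dot(q, q) / q.sum()

q  = np.array([1.0, 0.1])
qp = np.array([1.1, 0.0])

gap  = abs(F(q) - F(qp))       # |F(q) - F(q')| ≈ 0.1818
dist = np.max(np.abs(q - qp))  # ||q - q'||_inf = 0.1

assert gap > dist              # F is expansive here
ratio = gap / dist             # local Lipschitz ratio ≈ 1.82
```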
With the maximum operator the weighting (pick the largest coordinate) is independent of the numerical values of the two Q-tables being compared, which yields the familiar bound
|max f – max g| ≤ max |f – g|.
For the stochastic operator the weights are the Q values themselves; both the values and the weights change when Q is perturbed, and they can reinforce each other instead of cancelling. The inequality you wrote,
|Eπq[f] – Eπq[g]| ≤ Eπq|f – g|,
is true, but the two expectations must be taken under the same distribution. Here the two distributions πq and πq′ are different, so the bound does not apply.
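One way to see exactly where the bound breaks is to split F(q) − F(q′) into the part the inequality does control and the part it ignores. Writing F(q) − F(q′) = Eπq[q − q′] + (Eπq − Eπq′)[q′], the first term obeys the ∞-norm bound, while the second “policy shift’’ term is extra. A numeric check with the vectors from the counterexample (variable names are mine):

```python
import numpy as np

q, qp = np.array([1.0, 0.1]), np.array([1.1, 0.0])
pi, pip = q / q.sum(), qp / qp.sum()   # the two proportional policies

# Decompose F(q) - F(q') into a value-shift and a policy-shift term.
value_shift  = pi @ (q - qp)    # |.| <= ||q - q'||_inf: the bound that does hold
policy_shift = (pi - pip) @ qp  # extra term caused by the changed distribution
total = value_shift + policy_shift   # equals F(q) - F(q')
```

Here both terms have the same sign, so they add up instead of cancelling, and the total exceeds ||q − q′||∞.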
Because the operator is not a contraction there is, in general, no guarantee that iterating it (or running a stochastic approximation scheme such as Q-learning with this target) converges to a unique fixed point. Additional assumptions – for instance forcing the policy to become greedy in the limit, or using a temperature-controlled soft-max whose temperature goes to zero – are required to recover the classical convergence guarantees.
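For illustration, the temperature-controlled soft-max mentioned above can be sketched as follows; as the temperature goes to zero the policy approaches the greedy (max) choice, which is what recovers the standard guarantees (this is a generic sketch, not a full convergence scheme):

```python
import numpy as np

def softmax_policy(q, temperature):
    """Boltzmann policy over action values q; approaches argmax as temperature -> 0."""
    q = np.asarray(q, dtype=float)
    z = (q - np.max(q)) / temperature  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

q = np.array([1.0, 0.9, 0.2])
p_hot  = softmax_policy(q, 10.0)   # high temperature: nearly uniform
p_cold = softmax_policy(q, 0.01)   # low temperature: nearly greedy
```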
So, in short: sampling actions with probabilities proportional to their current Q values makes the induced Bellman operator expansive, and the standard contraction–based proof of convergence no longer goes through.
version: o3-pro-2025-06-10