Stack Exchange · 8 votes

Is the Bellman equation that uses sampling weighted by the Q values (instead of max) a contraction?

Posted on 7/23/2020
UQ ID:399

It is proved that the Bellman update (\ref{1}) is a contraction.

Here is the Bellman update that is used for Q-learning:

$$(TQ)(s,a) \;=\; R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a') \tag{1}\label{1}$$

The proof that (\ref{1}) is a contraction relies on several facts, the one relevant to this question being that the max operation is non-expansive; that is:

$$\Bigl|\max_{a} f(a) - \max_{a} g(a)\Bigr| \;\le\; \max_{a} \bigl|f(a) - g(a)\bigr| \tag{2}\label{2}$$

This is also proved in many places and is fairly intuitive.
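As a quick numerical sanity check of this non-expansiveness (my own sketch, not part of the original post), one can compare the two sides of (\ref{2}) on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    f = rng.normal(size=5)
    g = rng.normal(size=5)
    # |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|
    assert abs(f.max() - g.max()) <= np.abs(f - g).max() + 1e-12
print("max is non-expansive on all 1000 sampled pairs")
```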

Consider the following Bellman update:

$$(T_{\mathrm{S}}Q)(s,a) \;=\; R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, Q\bigl(s', \operatorname{SAMPLE}(Q, s')\bigr) \tag{3}\label{3}$$

where $\operatorname{SAMPLE}(Q, s')$ samples an action in state $s'$ with probability proportional to its Q-value, i.e. action $a'$ is drawn with probability $Q(s',a') \big/ \sum_{a''} Q(s',a'')$.

Is this new Bellman operation still a contraction?

Is the SAMPLE operation non-expansive? It is, of course, possible to generate individual samples that do not satisfy equation (\ref{2}); what I am asking is whether the operation is non-expansive in expectation.
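For what it's worth, this can be probed numerically. The sketch below (my own construction; `weighted_value` and the specific numbers are illustrative, not from the post) compares the change in the Q-weighted expectation against the sup-norm change in Q for a lopsided Q-vector:

```python
import numpy as np

def weighted_value(q):
    # Expected next-state value under the Q-proportional policy:
    # E_{a ~ pi_q}[q_a], where pi_q(a) = q_a / sum(q) (assumes q > 0).
    p = q / q.sum()
    return float(p @ q)

# Two Q-vectors for the same next state; no entry differs by more than 0.05.
q1 = np.array([10.00, 0.10])
q2 = np.array([10.05, 0.05])

gap_inf = np.abs(q1 - q2).max()                        # sup-norm gap: 0.05
gap_val = abs(weighted_value(q1) - weighted_value(q2))

print(gap_inf, gap_val)  # gap_val ~ 0.0985, roughly twice gap_inf
```

Here the expectation moves by about 0.0985 even though no single Q-value changed by more than 0.05, so the expected-value map is expansive for this pair.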

My approach is to show that the SAMPLE operation is non-expansive in expectation:

$$\Bigl|\,\mathbb{E}_{a'\sim\rho}\bigl[Q_1(s',a')\bigr] - \mathbb{E}_{a'\sim\rho}\bigl[Q_2(s',a')\bigr]\,\Bigr| \;\le\; \max_{a'} \bigl|Q_1(s',a') - Q_2(s',a')\bigr| \tag{4}\label{4}$$

Equivalently, writing $f = Q_1(s',\cdot)$ and $g = Q_2(s',\cdot)$:

$$\bigl|\,\mathbb{E}_{\rho}[f] - \mathbb{E}_{\rho}[g]\,\bigr| \;\le\; \max |f - g|$$

(\ref{4}) is true since:

$$\bigl|\,\mathbb{E}_{\rho}[f] - \mathbb{E}_{\rho}[g]\,\bigr| \;=\; \bigl|\,\mathbb{E}_{\rho}[f - g]\,\bigr| \;\le\; \mathbb{E}_{\rho}\bigl[\,|f - g|\,\bigr] \;\le\; \max |f - g|$$

But I am not sure whether proving (\ref{4}) proves the theorem. Do you think this is a legitimate proof that (\ref{3}) is a contraction?
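Inequality (\ref{4}) can indeed be sanity-checked numerically when a single, fixed distribution is used on both sides (my own sketch; the Dirichlet draw is just a convenient way to generate random distributions):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    rho = rng.dirichlet(np.ones(4))   # one fixed distribution for both sides
    f, g = rng.normal(size=4), rng.normal(size=4)
    # |E_rho[f] - E_rho[g]| <= max |f - g|
    assert abs(rho @ f - rho @ g) <= np.abs(f - g).max() + 1e-12
print("inequality (4) holds for a shared, fixed distribution")
```

Note that this check fixes one ρ for both expectations; whether that matches the situation in (\ref{3}), where the sampling distribution itself depends on the Q-table being evaluated, is exactly the crux of the question.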

(If so, this would mean that stochastic-policy Q-learning theoretically converges, and that we could have stochastic policies with regular Q-learning; this is why I am interested.)

Both intuitive answers and mathematical proofs are welcome.

Keywords & Tags
reinforcement-learning
q-learning
proofs
convergence
bellman-equations
Model Answers (9)
Submitted answers from various AI models
Unsolved

K2-Think

Pending UQ Validation
Needs Human Verification
MBZUAI-IFM
Qwen2.5-32B
Sep 18, 2025, 01:50 PM

The Bellman update using sampling weighted by Q values (equation 3) is not guaranteed to be a contraction in expectation. Here's the detailed reasoning:

Key Background: Contraction Mappings in RL

The standard Q-learning update (equation 1) leverages the max operator, which is non-expansive. This property ensures that the Bellman operator ( T ) (defined as ( T(Q)(s,a) = r(s,a) + \gamma ...
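To make the contraction property concrete, here is a toy sketch (my own, not part of the submitted answer; the 2-state, 2-action MDP is invented for illustration) showing that the max-based operator shrinks the sup-norm distance between any two Q-tables by at least a factor of γ per application:

```python
import numpy as np

gamma = 0.9
R = np.array([[0.0, 1.0],                     # R[s, a]: rewards
              [2.0, 0.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[s, a, s']: transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])

def T(Q):
    # (TQ)(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a')
    return R + gamma * (P @ Q.max(axis=1))

rng = np.random.default_rng(0)
Q1, Q2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
for _ in range(5):
    d_before = np.abs(Q1 - Q2).max()
    Q1, Q2 = T(Q1), T(Q2)
    d_after = np.abs(Q1 - Q2).max()
    assert d_after <= gamma * d_before + 1e-12  # gamma-contraction in sup-norm
print("sup-norm gap shrank by at least gamma on every step")
```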


GPT-5

UQ Validated
Needs Human Verification
OpenAI
Sep 2, 2025, 06:56 AM

Short answer: No. Replacing the max by “sampling with probability proportional to Q” does not give a non-expansive map, hence the corresponding Bellman operator is not a contraction in general (except in very special parameter ranges such as very small discount).

Why your inequality does not apply

  • Your inequality |Eρ[f] − Eρ[g]| ≤ Eρ[|f − g|] holds only when the sampling distribution ρ is the s...

o3 Pro

UQ Validated
Needs Human Verification
OpenAI
Aug 1, 2025, 12:00 AM

No: replacing the max operator by a “draw an action with probability proportional to its current Q value” rule destroys the non-expansiveness that makes the usual Bellman optimality operator a γ-contraction.
The idea looks harmless, since we only average the Q-values instead of taking their maximum, but the weighting term itself depends on Q and can amplify differences between two Q-tables.

----------...
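The amplification effect described above can be quantified with a small sketch (my own construction, not part of the submitted answer): for q = (a, b) with a much larger than b, perturbing the two entries by ±ε moves the Q-weighted mean by roughly 2(a − b)/(a + b) · ε, i.e. nearly twice the sup-norm perturbation:

```python
import numpy as np

def weighted_mean(q):
    # E_{a ~ pi_q}[q_a], with pi_q proportional to q (assumes q > 0)
    return float(q @ q) / q.sum()

eps = 1e-4
for a in [1.0, 10.0, 100.0]:
    q = np.array([a, 0.1])
    q_pert = q + np.array([eps, -eps])   # sup-norm perturbation of size eps
    ratio = abs(weighted_mean(q_pert) - weighted_mean(q)) / eps
    print(a, ratio)  # ratio tends toward 2 as the gap between entries grows
```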


o4 mini

Failed UQ Validation
Needs Human Verification
OpenAI
Aug 1, 2025, 12:00 AM

First, let us restate precisely what you are comparing:

  1. The optimality Bellman operator
    (T⋆Q)(s,a) = R(s,a) + γ ∑_{s′} P(s′|s,a) max_{a′} Q(s′,a′)
    is a γ-contraction in the sup-norm because
    ∥T⋆Q₁ − T⋆Q₂∥∞
    ≤ γ ∥max_{a′} Q₁(·,a′) − max_{a′} Q₂(·,a′)∥∞
    ≤ γ ∥Q₁ − Q₂∥∞.

  2. The policy-evaluation Bellman operator for a fixed policy π
    (T^πQ)(s...


o3

Failed UQ Validation
Needs Human Verification
OpenAI
Aug 1, 2025, 12:00 AM

Notation
For one state transition $s \rightarrow s'$, write the vector of action-values in the next state as

$$q = \bigl(Q(s',1), \ldots, Q(s',n)\bigr) \in \mathbb{R}^{n}_{+}.$$

With ordinary Q-learning we use

$$\max(q) = \max_{i} q_i,$$

while in the rule you propose we first turn the vector itself into a probability distribution

$$\pi_q(i) = \frac{q_i}{\sum_{j} q_j} \qquad (i = 1, \dots, n)$$

and...


DeepSeek R1

Failed UQ Validation
Needs Human Verification
DeepSeek
deepseek-ai/DeepSeek-R1
Aug 1, 2025, 12:00 AM
We are given the standard Q-learning Bellman update (equation 1) and know that the max operator is non-expansive (equation 2). The question is about a variant (equation 3) where instead of taking the max over the next state actions, we sample an action from the next state proportional to the Q-values. Specifically, we define:

Gemini 2.5 Pro

Failed UQ Validation
Needs Human Verification
Google
Aug 1, 2025, 12:00 AM

Excellent question. This gets to the heart of the convergence proofs for value-based reinforcement learning.

The short answer is no, the Bellman operator using a weighted sample is not a contraction in the max-norm (∥·∥∞), and your proof attempt has a subtle but critical flaw.

Let's break down why.

1. Formalizing the Operator

First, let's formalize the operator you're proposing. ...


Claude 3.7 Sonnet

Failed UQ Validation
Needs Human Verification
Anthropic
Aug 1, 2025, 12:00 AM

Understanding the Bellman Equation with Q-value Weighted Sampling

Your question investigates whether a modified Bellman update that uses sampling weighted by Q-values (instead of the max operation) is a contraction mapping, which would guarantee convergence.

The Issue with Your Proof Approach

Your inequality approach correctly shows:


Claude Opus 4

Failed UQ Validation
Needs Human Verification
Anthropic
Aug 1, 2025, 12:00 AM

This is an interesting question about whether a modified Bellman operator that uses sampling instead of max remains a contraction. Let me analyze this carefully.

The Issue with Your Current Approach

Your proof attempt has a fundamental problem. You're trying to show:

However, this doesn't ...
