Comment by isaacimagine 6 months ago

TFP: https://arxiv.org/abs/2506.04168

The DT citation [10] appears on a single line, in a paragraph listing prior work, essentially as an "and more". Another paper that uses DTs [53] is cited in the same way. The authors neither test nor discuss DTs.

> hardness of the credit assignment ... data sparsity.

That is true, but it's not the point I'm making. "Bypassing credit assignment", in the context of long-horizon task modeling, is a statement about using attention to allocate long-horizon reward without a horizon-shrinking discount, not about the choice of architecture.

To expand: suppose an environment has a key that unlocks a door thousands of steps later. Q-Learning may fail to propagate the reward signal from opening the door back to the moment of picking up the key, because the discount shrinks future reward terms toward zero over such a long horizon. A decision transformer, however, can attend to the moment of picking up the key while opening the door, which bypasses the problem of establishing that long-horizon causal connection.

(Of course, attention cannot assign reward if the moment the key was picked up is beyond the extent of the context window.)
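
For a rough sense of scale, here's a toy sketch (the horizon length and discount value are made up, not from the paper):

    # Reward observed N steps after the key pickup. Under a discounted
    # Bellman backup, the signal reaching the pickup state is scaled by
    # gamma**N; a DT instead conditions every step on the undiscounted
    # return-to-go.
    gamma, N, reward = 0.99, 3000, 1.0

    discounted_signal = (gamma ** N) * reward   # ~8e-14, effectively zero
    return_to_go = reward                       # DT token at pickup time: 1.0

    print(discounted_signal, return_to_go)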

highd 6 months ago

You can do Q-Learning with a transformer. You simply define the state space as the observation sequence. This is in fact natural to do in partially observed settings. So your distinction does not make sense.
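
A minimal sketch of what that looks like, assuming PyTorch and a discrete action space (class and variable names here are illustrative, not from any particular codebase):

    import torch
    import torch.nn as nn

    class HistoryQNet(nn.Module):
        # Q(o_1..o_t, a): the "state" is the entire observation sequence.
        def __init__(self, obs_dim=16, n_actions=4, d_model=64):
            super().__init__()
            self.embed = nn.Linear(obs_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.q_head = nn.Linear(d_model, n_actions)

        def forward(self, obs_seq):                # (batch, t, obs_dim)
            h = self.encoder(self.embed(obs_seq))  # causal mask omitted for brevity
            return self.q_head(h[:, -1])           # Q-values at the current step

    # The TD target is the usual discounted bootstrap, whatever the architecture:
    #   target = r_t + gamma * max_a' Q(o_1..o_{t+1}, a')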

  • isaacimagine 6 months ago

    The distinction is DT's reward-to-go conditioning vs. Q-Learning's Bellman backup with its discount, not the choice of architecture for the policy (a toy side-by-side is sketched below). You could also build DTs with RNNs, though those come with their own memory problems.

    Apologies if we're talking past one another.
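
    To make that concrete, a toy side-by-side of the two training signals (random placeholder tensors, just to show the shape of each objective):

        import torch
        import torch.nn.functional as F

        B, n_actions, gamma = 32, 4, 0.99

        # Decision transformer: plain supervised action prediction, conditioned
        # on the undiscounted return-to-go token -- no bootstrapping, no discount.
        dt_action_logits = torch.randn(B, n_actions)   # stand-in for DT output
        actions_taken = torch.randint(0, n_actions, (B,))
        dt_loss = F.cross_entropy(dt_action_logits, actions_taken)

        # Q-Learning: regress Q(s_t, a_t) toward a discounted bootstrap target.
        q_values = torch.randn(B, n_actions)           # Q(s_t, .)
        q_next = torch.randn(B, n_actions)             # Q_target(s_{t+1}, .)
        rewards = torch.randn(B)
        td_target = rewards + gamma * q_next.max(dim=-1).values
        q_taken = q_values.gather(1, actions_taken.unsqueeze(1)).squeeze(1)
        q_loss = F.mse_loss(q_taken, td_target)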
