isaacimagine a day ago

No mention of Decision Transformers or Trajectory Transformers? Both are offline approaches that tend to do very well at long-horizon tasks, as they bypass the credit assignment problem by virtue of having an attention mechanism.

Most RL researchers consider these approaches not to be "real RL", as they can't assign credit outside the context window, and therefore can't learn infinite-horizon tasks. With 1m+ context windows, perhaps this is less of an issue in practice? Curious to hear thoughts.

DT: https://arxiv.org/abs/2106.01345

TT: https://arxiv.org/abs/2106.02039

highd a day ago

TFP cites decision transformers. Just using a transformer does not bypass the credit assignment problem. Transformers are an architecture for solving sequence modeling problems, e.g. the credit assignment problem as it arises in RL. There have been many other such architectures.

The hardness of the credit assignment problem is a statement about data sparsity. Architecture choices do not "bypass" it.

  • isaacimagine a day ago

    TFP: https://arxiv.org/abs/2506.04168

    The DT citation [10] is used on a single line, in a paragraph listing prior work, as an "and more". Another paper that uses DTs [53] is also cited in a similar way. The authors do not test or discuss DTs.

    > hardness of the credit assignment ... data sparsity.

    That is true, but it's not the point I'm making. "Bypassing credit assignment", in the context of long-horizon task modeling, is a statement about using attention to allocate long-horizon reward without a horizon-reducing discount; it is not a claim about architecture choice.

    To expand: if I have an environment with a key that unlocks a door thousands of steps later, Q-Learning may never propagate the reward for opening the door back to the moment of picking up the key, because that signal has to travel backward through thousands of Bellman updates and is discounted at every step along the way. A decision transformer, however, can attend directly to the key-pickup step while opening the door, which bypasses the problem of establishing this long-horizon causal connection.

    (Of course, attention cannot assign reward if the moment the key was picked up is beyond the extent of the context window.)
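
    A rough numeric sketch of that discount point (toy numbers; the 0.99 discount and the 1000-step gap are made up for illustration, not taken from TFP):

        # Toy numbers: gamma and the key->door gap are assumptions.
        gamma = 0.99
        gap = 1000          # steps between picking up the key and opening the door
        door_reward = 1.0

        # Q-Learning: credit reaching the key-pickup step is discounted by gamma**gap.
        print(door_reward * gamma ** gap)   # ~4.3e-05

        # Decision transformer: each timestep is conditioned on the undiscounted
        # return-to-go (suffix sum of future rewards), so the key-pickup token
        # still "sees" the full door reward.
        rewards = [0.0] * gap + [door_reward]
        returns_to_go = [sum(rewards[t:]) for t in range(len(rewards))]
        print(returns_to_go[0])             # 1.0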

    • highd a day ago

      You can do Q-Learning with a transformer. You simply define the state space as the observation sequence. This is in fact natural to do in partially observed settings. So your distinction does not make sense.
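
      Rough sketch of what I mean, in case it helps (module sizes, names, and batch layout are all made up for illustration, not from TFP): the "state" is the whole observation sequence, a transformer encodes it, and the usual discounted TD target sits on top.

          import torch
          import torch.nn as nn

          # Sketch only: dimensions and names are assumptions.
          class SeqQNet(nn.Module):
              def __init__(self, obs_dim, n_actions, d_model=64):
                  super().__init__()
                  self.embed = nn.Linear(obs_dim, d_model)
                  layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
                  self.encoder = nn.TransformerEncoder(layer, num_layers=2)
                  self.q_head = nn.Linear(d_model, n_actions)

              def forward(self, obs_seq):                  # (batch, T, obs_dim)
                  h = self.encoder(self.embed(obs_seq))    # (batch, T, d_model)
                  return self.q_head(h[:, -1])             # Q-values from the last token

          def td_loss(net, target_net, obs_seq, action, reward, next_obs_seq, gamma=0.99):
              q = net(obs_seq).gather(1, action.unsqueeze(1)).squeeze(1)
              with torch.no_grad():
                  target = reward + gamma * target_net(next_obs_seq).max(dim=1).values
              return nn.functional.mse_loss(q, target)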

      • isaacimagine a day ago

        The distinction I'm drawing is DT's reward-to-go conditioning vs. QL's Bellman target with its discount, not the choice of architecture for the policy. You could also do DTs with RNNs (though those come with their own problems around memory).

        Apologies if we're talking past one another.
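
        Concretely, a toy sketch of the two targets I mean (nothing here is from either paper):

            # DT: supervised action prediction conditioned on the *undiscounted*
            # return-to-go; no bootstrapping, no gamma.
            def returns_to_go(rewards):
                rtg, total = [], 0.0
                for r in reversed(rewards):
                    total += r
                    rtg.append(total)
                return rtg[::-1]          # model learns a_t given (rtg_t, s_t, ...)

            # QL: bootstrapped Bellman target, with a discount that shrinks
            # long-horizon credit geometrically.
            def q_target(reward, next_q_max, gamma=0.99):
                return reward + gamma * next_q_max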

    • [removed] a day ago
      [deleted]