Comment by b0a04gl
Does the DiT here actually capture cross-token attention the same way as the full SD 3.5, or is it simplified for clarity?