Comment by jsnell

I know there are a lot of rebuttals to this statement already, but I think there's a simpler way of showing it is incorrect:

Figure 2 in the paper shows the LMArena score of whatever model is used for the "median" Gemini query. That score is consistent with Gemini Flash (probably 2.0, given that the numbers are from May), not a "tiny model" used for summaries nobody is asking for.