Comment by solresol

I tried... It started with the idea that log loss might not be the best option for training, and that maybe the loss should be related to how wrong the predicted word was. Predicting "dog" instead of "cat" should be penalised less than predicting "running".
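
To make that concrete, here is a rough sketch of one such loss. It is not the code from the linked repo. The three-word hierarchy, the `PATHS` table, and the `2 ** -depth` scaling are all made up for illustration: the penalty shrinks the deeper the two words' paths stay together in the tree.

```python
# Hypothetical toy hierarchy: each word is its path from the root down to the leaf.
PATHS = {
    "dog": ("entity", "animal", "mammal", "dog"),
    "cat": ("entity", "animal", "mammal", "cat"),
    "running": ("entity", "action", "motion", "running"),
}


def shared_depth(a, b):
    """Number of leading path components two paths have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def tree_loss(predicted, target):
    """0 for an exact match, otherwise 2 ** -(shared depth) (assumed scaling)."""
    if predicted == target:
        return 0.0
    return 2.0 ** -shared_depth(PATHS[predicted], PATHS[target])


print(tree_loss("dog", "cat"))      # share entity/animal/mammal
print(tree_loss("running", "cat"))  # share only the root
```

With these made-up paths, "dog" for "cat" costs less than "running" for "cat", and the distance satisfies the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), which is what makes it an ultrametric.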

That turns out to be an ultrametric loss, and the derivative of an ultrametric loss is zero in a large region around any local minimum, so it can't be trained by gradient descent -- it has to be trained by search.
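
A tiny illustration of the flat-gradient problem, under heavy assumptions: a hypothetical one-parameter model that outputs a single discrete word, scored against a made-up table of tree distances to the target "cat". None of these names or numbers come from the linked repo.

```python
# Hypothetical tree distances from each candidate word to the target "cat".
LOSS_VS_CAT = {"cat": 0.0, "dog": 0.125, "running": 0.5}
WORDS = ["cat", "dog", "running"]


def predict(theta):
    """Assumed toy model: pick the word with the highest score."""
    scores = [1.0 - theta, theta, -1.0]
    return WORDS[scores.index(max(scores))]


def loss(theta):
    return LOSS_VS_CAT[predict(theta)]


def finite_diff(theta, eps=1e-6):
    """Central-difference estimate of d(loss)/d(theta)."""
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)


# At theta = 0.9 the model says "dog": the loss is nonzero but the slope is zero,
# so a gradient step has nothing to follow.
print(loss(0.9), finite_diff(0.9))

# A crude search over candidate parameters still finds a zero-loss setting.
best = min((t / 10 for t in range(11)), key=loss)
print(best, loss(best))
```

The loss only changes when the predicted word flips, so it is a step function of the parameter. That is the sense in which the sketch has to fall back on search rather than gradient descent.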

Punchline: it's about one million times less effective than a more traditional architecture. https://github.com/solresol/ultratree-results