Comment by raincole 3 months ago

> Another issue is that some of the words are segmented very unnaturally

I immediately noticed that too. Are the "gaps" generated by an LLM? I think the model might not understand Japanese very well.

yorwba 3 months ago

It's a bit like segmenting "don't see" into "don't" and "see." ません is the negative of the auxiliary ます just as "don't" is the negative of the auxiliary "do." If you have to split Japanese text into words and want to be principled about it, treating ません as a separate word is not a bad way to go about it.

But of course there are other ways, so a "fill in the blank" question with two gaps right next to each other is generally a bad idea.

raincole 3 months ago

The point is not that you can't cut みません into み and ません. The point is that it should be one single gap in the first place.
It's like cutting gaps out of an English sentence like this: I'm [go][ing] to beat the shit out of that guy. Sure, we know the logical way to break down 'going' is 'go' and '-ing', but it should be one single gap anyway.
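
To make the "one single gap" idea concrete, here is a minimal sketch, not how the site actually builds its questions. It assumes the segmenter hands back `(surface, is_gap)` pairs; that token shape and the names `merge_adjacent_gaps` and `render` are hypothetical, made up for illustration.

```python
from itertools import groupby


def merge_adjacent_gaps(tokens):
    """Collapse each run of consecutive gap tokens into a single gap.

    `tokens` is a list of (surface, is_gap) pairs. This shape is an
    assumption for the sketch, not the output of any real segmenter.
    """
    merged = []
    # groupby yields maximal runs of tokens that share the same is_gap flag.
    for is_gap, run in groupby(tokens, key=lambda tok: tok[1]):
        surfaces = [surface for surface, _ in run]
        if is_gap:
            # Neighbouring gaps become one gap covering the whole run.
            merged.append(("".join(surfaces), True))
        else:
            # Non-gap tokens are passed through unchanged.
            merged.extend((surface, False) for surface in surfaces)
    return merged


def render(tokens):
    """Show gaps as bracketed blanks, one underscore per hidden character."""
    return "".join(
        "[" + "_" * len(surface) + "]" if is_gap else surface
        for surface, is_gap in tokens
    )


segmented = [("テレビを", False), ("み", True), ("ません", True), ("。", False)]
print(render(segmented))
print(render(merge_adjacent_gaps(segmented)))
```

The segmenter can keep splitting み and ません however it likes; the merge only changes what the learner is asked to fill in.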

- johnisgood 3 months ago
  
  Damn, where did that example come from? :P
  
owenpalmer 3 months ago

+1 this definitely makes sense, since you're gonna have a million verbs ending in "masen", just make it a separate word and understand that it's just part of the conjugation.
