Comment by hephaes7us

Comment by hephaes7us 18 hours ago

4 replies

Why do you even necessarily think that wouldn't happen?

As I understand it, we'd essentially be relying on something like an mp3 compression algorithm to fail to capture a particular, subtle transient -- the lossy nature itself is the only real protection.

I agree that it's vanishingly unlikely if one person includes a sensitive document in their context, but what if a company has a project context which includes the same document in 10,000 chats? Maybe then it's more much likely that whatever private memo could be captured in training...

simonw 18 hours ago

I did get an answer from a senior executive at one AI lab who called this the "regurgitation problem" and said that they pay very close attention to it, to the point that they won't ship model improvements if they are demonstrated to cause this.

  • nprateem 18 hours ago

    Lol and that was enough for you? You really think they can test every single prompt before release to see if it regurgitates stuff? Did this exec work in sales too :-D

    • TeMPOraL 15 hours ago

      They have a clear incentive to do exactly as said - regurgitation is a problem, because it indicates the model failed to learn from the data, and merely memorized it.

    • simonw 16 hours ago

      I think they can run benchmarks to see how likely it is for prompts to return exact copies of their training data and use those benchmarks to help tune their training procedures.