Comment by nickpsecurity

“Yeah ours is guaranteed ok, as we wrote code to generate it basically just from plain torch ops.”

This is where there might be legal claims. It already sounds safer than training on copyrighted works. The only remaining risk would be the data counting as a derivative work because parts of copyrighted works were reused in your process.

So, I’m curious about how you produced the specifications that the data was generated from. In my case, I was going to just use open versions of all kinds of equations that I’d hand-convert to internal representations. Others might be fair use if my description were high level enough that it wasn’t close to theirs. Some I couldn’t use at all because they were patented and independent versions are prohibited by law.

Did you all also derive your causal models from real-world formulas and data sets? If so, did you have a rule about putting distance between your representation and theirs? Or was it an entirely random search process across endless configurations? (I have a hard time imagining the latter would work.)