Comment by voiper1

Comment by voiper1 2 days ago

Surely there's AI usage that's not morally reprehensible.

For example, models that are trained only on public domain material, used for value-add purposes and not simply marketing or gamification gimmicks...

qingcharles 2 days ago

How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.

[0] I think the data can be licensed, and not just public domain; e.g. if the creators are suitably compensated for their data to be ingested

Eisenstein 2 days ago

> How many models are only trained on legal[0] data?
None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:
1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159

- austinjp 2 days ago
  
  Nice, I hadn't heard of this. For convenience, here are the Dolma dataset and the HuggingFace models trained on it:
  https://huggingface.co/datasets/allenai/dolma
  https://huggingface.co/models?dataset=dataset:allenai/dolma