Comment by Eisenstein 2 days ago

> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:

1. Common Crawl

2. GitHub

3. Wikipedia, Wikibooks

4. Reddit (pre-2023)

5. Semantic Scholar

6. Project Gutenberg

* https://arxiv.org/pdf/2402.00159

austinjp 2 days ago

Nice, I hadn't heard of this. For convenience, here are the Dolma dataset on HuggingFace and the models trained on it:

https://huggingface.co/datasets/allenai/dolma

https://huggingface.co/models?dataset=dataset:allenai/dolma
