LeoPanthera 2 days ago

Congratulations on "shipping". I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list while researching mine:

  2017: Invention of transformer architecture
  June 2018: GPT-1
  February 2019: GPT-2
  June 2020: GPT-3
  March 2022: GPT-3.5
  November 2022: ChatGPT

You may want to add Kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.

  • jgrahamc a day ago

    I was taking "Release of ChatGPT" as the Trinity date.

VyseofArcadia 2 days ago

Clever name. I like the analogy.

  • freilanzer 2 days ago

    I don't seem to get it.

    • ziddoap 2 days ago

      Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.

      The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.

      https://en.wikipedia.org/wiki/Low-background_steel

      >Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]

      • umvi 2 days ago

        > and only available from pre-war / pre-atomic sources.

        From the same wiki you linked:

        "Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"

        and

        "For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"

    • AlphaAndOmega0 2 days ago

      It's a reference to the practise of scavenging steel that was produced before nuclear testing began, as any steel produced afterwards is contaminated with radioactive isotopes from the fallout. The sources are mostly shipwrecks, and WW2 means there are plenty of those. The pun in question is that his project tries to source text that hasn't been contaminated with AI-generated material.

      https://en.m.wikipedia.org/wiki/Low-background_steel

    • ms512 2 days ago

      After the detonation of the first nuclear weapons, any newly produced steel has a low dose of nuclear fallout.

      For applications that need to avoid the background radiation (like physics research), pre-atomic-age steel is salvaged, for example from old shipwrecks.

      https://en.m.wikipedia.org/wiki/Low-background_steel

    • GreenWatermelon 2 days ago

      From the blog

      > Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.

    • voytec 2 days ago

      To whoever downvoted the parent: please don't act against people brave enough to state that they don't know something.

      This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.

      • umvi 2 days ago

        I feel like there's a minimum "due diligence" bar to meet before asking, though. Otherwise it comes across as "I'm too lazy to google the reference and connect the dots myself, but can someone just go ahead and distill a nice summary for me?"

    • KeplerBoy 2 days ago

      Steel made before atmospheric tests of nuclear bombs were a thing is referred to as low-background steel and is invaluable for some applications.

      LLMs pollute the internet like atomic bombs polluted the environment.

astennumero 2 days ago

That's exactly the opposite of what the author wanted, IMO. The author no longer wants to be a part of this mess. Aggregating these sources would just make it so much easier for the tech giants to scrape more data.

  • rovr138 2 days ago

    The sources are just aggregated. The source doesn't change.

    The new stuff generated does (and this is honestly already captured).

    This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.

    Their research and projects are great.

  • iak8god 2 days ago

    The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.

    Making resources like wordfreq more visible won't exacerbate any of these concerns.

Der_Einzige a day ago

FYI: My two datasets, DebateSum and OpenDebateEvidence/OpenCaseList, qualify for this in their current forms, as they end in 2022 at the latest.

  • jgrahamc a day ago

    You can either add them to the site yourself via Tumblr or send them to me via email (jgc@cloudflare).

imhoguy 2 days ago

I am not sure we should trust a site contaminated by AI graphics. /s

  • gorkish 2 days ago

    The buildings and shipping containers that store low background steel aren't built out of the stuff either.

  • whywhywhywhy 2 days ago

    Yeah, pay an illustrator if this is important to you.

    I see a lot of people who are upset about AI still using AI image generation because it's not in their field, so they feel less strongly about it and can't create art themselves anyway. It's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.

    • imhoguy 2 days ago

      I have updated my comment with "/s" as that is closer to what I meant. Seriously, though, from an ethical point of view, it is unlikely that illustrators were asked or compensated for their work being used to train the AI that produced the image.

      • heckelson 2 days ago

        I thought the header image was a symbol of AI slop contamination because it looked really off-putting.

ClassyJacket 2 days ago

:'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.

Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.

  • thebruce87m a day ago

    I too realised this parallel and frequently tell people about it.

    Edit: just the first one