Comment by jgrahamc

jgrahamc a year ago

I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.

LeoPanthera a year ago

Congratulations on "shipping", I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list, in research for mine:

  2017: Invention of transformer architecture
  June 2018: GPT-1
  February 2019: GPT-2
  June 2020: GPT-3
  March 2022: GPT-3.5
  November 2022: ChatGPT

You may want to add Kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.

jgrahamc a year ago

I was taking "Release of ChatGPT" as the Trinity date.

VyseofArcadia a year ago

Clever name. I like the analogy.

freilanzer a year ago

I don't seem to get it.

- ziddoap a year ago
  
  Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.
  The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.
  https://en.wikipedia.org/wiki/Low-background_steel
  >Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]
  
  umvi a year ago
  
  > and only available from pre-war / pre-atomic sources.
  From the same wiki you linked:
  "Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"
  and
  "For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"
  
  swyx a year ago
  
  and I applied it to LLMs here: https://www.latent.space/p/nov-2023
  
- AlphaAndOmega0 a year ago
  
  It's a reference to the practise of scavenging steel from sources that were produced before nuclear testing began, as any steel produced afterwards is contaminated with nuclear isotopes from the fallout. Mostly shipwrecks, and WW2 means there are plenty of those. The pun in question is that his project tries to source text that hasn't been contaminated with AI-generated material.
  https://en.m.wikipedia.org/wiki/Low-background_steel
  
- ms512 a year ago
  
  After the detonation of the first nuclear weapons, any newly produced steel has a low dose of nuclear fallout.
  For applications that need to avoid the background radiation (like physics research), pre-atomic-age steel is recovered, for example from old shipwrecks.
  https://en.m.wikipedia.org/wiki/Low-background_steel
  
- GreenWatermelon a year ago
  
  From the blog
  > Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.
  
- voytec a year ago
  
  To whomever downvoted parent: please don't act against people brave enough to state that they don't know something.
  This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.
  
  umvi a year ago
  
  I feel like there's a minimum "due diligence" bar to meet though before asking, otherwise it comes across as "I'm too lazy to google the reference and connect the dots myself, but can someone just go ahead and distill a nice summary for me"
  
- KeplerBoy a year ago
  
  Steel made before atmospheric tests of nuclear bombs were a thing is referred to as low-background steel and is invaluable for some applications.
  LLMs pollute the internet like atomic bombs polluted the environment.
  
- cdman a year ago
  
  https://en.wikipedia.org/wiki/Low-background_steel
  

astennumero a year ago

That's exactly the opposite of what the author wanted IMO. The author no longer wants to be a part of this mess. Aggregating these sources would just make it that much easier for the tech giants to scrape more data.

rovr138 a year ago

The sources are just aggregated. The source doesn't change.
The new stuff generated does (and this is honestly already captured).
This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.
Their research and projects are great.

iak8god a year ago

The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.
Making resources like wordfreq more visible won't exacerbate any of these concerns.

Der_Einzige a year ago

FYI: My two datasets, DebateSum and OpenDebateEvidence/OpenCaseList in their current forms qualify for this, as they end at latest in 2022.

jgrahamc a year ago

You can either add them to the site yourself via Tumblr or send them to me via email (jgc@cloudflare).

imhoguy a year ago

I am not sure we should trust a site contaminated by AI graphics. /s

gorkish a year ago

The buildings and shipping containers that store low background steel aren't built out of the stuff either.

whywhywhywhy a year ago

Yeah, pay an illustrator if this is important to you.
I see a lot of people who are upset about AI still using AI image generation because it's not in their field, so they feel less strongly about it and can't create art themselves anyway. It's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.

- imhoguy a year ago
  
  I have updated my comment with "/s" as that is closer to what I meant. However, seriously, from an ethical point of view, it is unlikely that illustrators were asked or compensated for their work being used to train the AI that produced the image.
  
  heckelson a year ago
  
  I thought the header image was a symbol of AI slop contamination because it looked really off-putting
  
ClassyJacket a year ago

:'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.

Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.

thebruce87m a year ago

I too realised this parallel and frequently tell people about it.
Edit: just the first one

