Comment by echelon 2 days ago

This whole thing is pointless.

OpenAI Atlas defeats all of this by being the user's web browser. They get between you and the user you're trying to serve content to, and they slurp up everything the user browses to feed it back into training.

The firewall is now moot.

The bigger AI company, Google, has already been doing this for decades. They've been the middleman between your reader and you, and that position is unassailable. Without them, you don't have readers.

At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.

OpenAI and Google want you to block everybody else.

happyopossum 2 days ago

> Google, has already been doing this for decades

Do you have any proof, or even circumstantial evidence, pointing to this being the case?

If Chrome actually scraped every site you ever visited and sent it off to Google, it'd be trivially simple to find some indication of that in network traffic, or heck, even in the Chromium code.

  • echelon 2 days ago

    Sorry, I mean they sit in the middle of the customer relationship.

    Who would dare block Google Search from indexing their site?

    The relationship is adversarial, but necessary.

    • ranger_danger a day ago

      > Who would dare block Google Search from indexing their site?

      People who don't want to be indexed. Or found at all.

Dylan16807 2 days ago

Is it confirmed that site loads go into the training database?

But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.

  • heavyset_go 2 days ago

    > Is it confirmed that site loads go into the training database?

    Would you trust OpenAI if they told you it doesn't?

    If you would, would you also trust Meta to tell you if its multibillion dollar investment was trained on terabytes of pirated media the company downloaded over BitTorrent?

    • viraptor 2 days ago

      We don't have to trust them or not. If there's such a claim, surely someone can point to at least a pcap file with an unknown connection, or to some decompiled code. Otherwise it's just a conspiracy theory.
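
      For instance, a quick scapy sketch along these lines (capture file and allowlist are made up for illustration) could surface DNS lookups for domains the user never visited:

        # Sketch: flag DNS queries in a capture that aren't on an
        # allowlist of domains the user actually visited.
        # Filenames and the allowlist are hypothetical; needs `pip install scapy`.
        from scapy.all import rdpcap
        from scapy.layers.dns import DNSQR

        VISITED = {"news.ycombinator.com.", "example.org."}  # assumed allowlist

        for pkt in rdpcap("browsing-session.pcap"):  # hypothetical capture file
            if pkt.haslayer(DNSQR):
                qname = pkt[DNSQR].qname.decode()
                if qname not in VISITED:
                    print("unexpected lookup:", qname)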

      • _flux 2 days ago

        Surely the data must go to OpenAI's servers; how else would they run LLMs on it? What we can't see is whether that data ends up in the training set.

        Personally, I would just believe what they say for the time being; there would be backlash if they did otherwise, possibly a legal one.

        • viraptor 2 days ago

          I think the original claim was about something different. "Is it confirmed that site loads..." - I read it as the author talking about general browsing, not just explicit questions asked with the context of the page.

      • heavyset_go 2 days ago

        Whatever is included in context is in OpenAI's control from that point forward, and you just have to trust them not to do anything with it.

        That isn't a conspiracy theory; it's fundamentally how interfacing with third-party hosted LLMs works.
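
        As a rough sketch of that interface (model name and page text are placeholders for illustration, using the openai Python client, not anything Atlas-specific):

          # Sketch: whatever page text goes into the prompt is transmitted
          # to the provider's servers; nothing client-side can verify what
          # happens to it afterwards. Model and content are placeholders.
          from openai import OpenAI

          client = OpenAI()  # reads OPENAI_API_KEY from the environment
          page_text = "<full text of whatever page the agent just loaded>"

          response = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[
                  {"role": "system", "content": "Summarize this page."},
                  {"role": "user", "content": page_text},
              ],
          )
          print(response.choices[0].message.content)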

seba_dos1 2 days ago

The "LLM firewall" is usually there so AI companies don't take the server down, not to prevent model training (that's just an acceptable side effect).

_flux 2 days ago

As I understand it, the main point of Anubis is to reduce the costs caused by (AI company) bots, and agent-generated load is still a lot less than simply spidering the complete web site; it might actually be quite close to what a user would browse manually.

Unless the user asks something that requires visiting many pages, I suppose. For example, Google Gemini was pretty helpful in finding out the typical price ranges and dishes of the coffee shops in a local shopping centre, as the information was far from being on a single page.

masklinn 2 days ago

> This whole thing is pointless.

It's definitely pointless if you completely miss the point of it.

> OpenAI Atlas defeats all of this by being the user's web browser. They get between you and the user you're trying to serve content to, and they slurp up everything the user browses to feed it back into training.

Cool. Anubis' fundamental purpose is not to prevent all bot access tho, as clearly spelled out in its overview:

> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.

OpenAI Atlas piggybacking on the user's normal browsing is not within the remit of Anubis, because it's not going to take a small site down or dramatically increase hosting costs.
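
The mechanism matches that purpose: a browser-side proof-of-work challenge that is cheap for one visitor but expensive at crawler scale. A simplified sketch of the general SHA-256 scheme (difficulty and challenge string are illustrative, not Anubis's actual parameters):

  # Simplified sketch of an Anubis-style SHA-256 proof-of-work check.
  # Values here are illustrative, not taken from Anubis itself.
  import hashlib
  from itertools import count

  DIFFICULTY = 4  # required number of leading zero hex digits

  def solve(challenge: str) -> int:
      """Client side: burn CPU until a nonce meets the difficulty."""
      for nonce in count():
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
          if digest.startswith("0" * DIFFICULTY):
              return nonce

  def verify(challenge: str, nonce: int) -> bool:
      """Server side: a single hash to check the submitted answer."""
      digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
      return digest.startswith("0" * DIFFICULTY)

  nonce = solve("per-session-challenge")
  assert verify("per-session-challenge", nonce)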

> At this point, the only people you're keeping out with LLM firewalls are the smaller players

Oh no, who will think of the small assholes?