Nepenthes is a tarpit to catch AI web crawlers

714 points by blendergeek a year ago

bflesch a year ago

Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests from the ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd but I really wonder what would happen when ChatGPT crawler interacts with this tarpit several times per second. As ChatGPT crawler is using various Azure IP ranges I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

For legal reasons, I don't recommend exploiting this vulnerability.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-...

hassleblad23 a year ago

I am not surprised that OpenAI is not interested in fixing this.

- bflesch a year ago
  
  Their security.txt email address replies and asks you to go to BugCrowd. BugCrowd staff are unwilling (or too incompetent) to run a bash curl command to reproduce the issue, while also refusing to forward it to OpenAI.
  support@openai.com waits an hour before replying with a ChatGPT-generated answer.
  Issues raised on GitHub directly with their engineers were not answered.
  Also, Microsoft CERT and the Azure security team do not reply or care to respond to such things (maybe due to lack of demonstrated impact).
  
  permo-w a year ago
  
  why try this hard for a private company that doesn't employ you?
  
JohnMakin a year ago

Nice find, I think one of my sites actually got hit by something like this recently. And yeah, this kind of thing should be trivially preventable if they cared at all.

- zanderwohl a year ago
  
  IDK, I feel that if you're doing 5000 HTTP calls to another website it's kind of good manners to fix that. But OpenAI has never cared about the public commons.
  
  chefandy a year ago
  
  Nobody in this space gives a fuck about anyone outside of the people paying for their top-tier services, and even then, they only care about them when their bill is due. They don't care about their regular users, don't care about the environment, don't care about the people that actually made the "data" they're re-selling... nobody.
  
  marginalia_nu a year ago
  
  Yeah, even beyond common decency, there's pretty strong incentives to fix it, as it's a fantastic way of having your bot's fingerprint end up on Cloudflare's shitlist.
  
- dewey a year ago
  
  > And yea, this kind of thing should be trivially preventable if they cared at all.
  Most of the time when someone says something is "trivial" without knowing anything about the internals, it's never trivial.
  As someone working close to the B2C side of a business, I can’t count the number of times I've heard that something should be trivial while it's something we've thought about for years.
  
  bflesch a year ago
  
  The technical flaws are quite trivial to spot, if you have the relevant experience:
  - urls[] parameter has no size limit
  - urls[] parameter is not deduplicated (but their cache is deduplicating, so this security control was there at some point but is ineffective now)
  - their requests to the same website / DNS name / victim IP address rotate through all available Azure IPs, which puts them at risk of being blocked by other hosters. They should come from the same IP address. I noticed them changing to other Azure IP ranges several times, most likely because they got blocked/rate limited by Hetzner or other counterparties from which I was playing around with this vulnerability.
  But if their team is too limited to recognize security risks, there is nothing one can do. Maybe they were occupied last week with the office gossip around the sexual assault lawsuit against Sam Altman. Maybe they still had holidays or there was another, higher-risk security vulnerability.
  Having interacted with several bug bounties in the past, it feels like OpenAI is not very mature in that regard. Also, why do they choose BugCrowd when HackerOne is much better in my experience?
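
To make the first two points concrete, here is a minimal sketch of the kind of input check that appears to be missing. The cap `MAX_URLS` and the helper names `normalize` and `sanitize_urls` are assumptions for illustration only, not anything taken from OpenAI's code:

```python
from urllib.parse import urlsplit, urlunsplit

MAX_URLS = 10  # hypothetical cap; the real limit is a product decision


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop the fragment so trivial variants collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def sanitize_urls(urls: list[str], max_urls: int = MAX_URLS) -> list[str]:
    """Reject oversized lists, then deduplicate while preserving order."""
    if len(urls) > max_urls:
        raise ValueError(f"too many URLs: {len(urls)} > {max_urls}")
    seen: set[str] = set()
    unique: list[str] = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

This only collapses case, empty-path, and fragment variants. URLs that differ in the query string would still pass, so a per-host cap would be needed on top of it.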
  
  grahamj a year ago
  
  If you’re unable to throttle your own outgoing requests you shouldn’t be making any
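
For what it's worth, a per-host throttle is not much code. A minimal sketch, assuming a single process and a fixed minimum interval per host (the class name `PerHostThrottle` and the one-second default are made up for this example):

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Allow at most one outgoing request per host every `min_interval` seconds."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._next_allowed: dict[str, float] = {}

    def allow(self, url: str) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        now = self.clock()
        if now < self._next_allowed.get(host, 0.0):
            return False  # caller should queue or drop the request
        self._next_allowed[host] = now + self.min_interval
        return True
```

A crawler spread over many machines would need shared state for this, which is presumably where it stops being a ten-line job.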
  
  jillyboel a year ago
  
  now try to reply to the actual content instead of some generalizing grandstanding bullshit
  
michaelbuckbee a year ago

What is the https://chatgpt.com/backend-api/attributions endpoint doing (or responsible for, when not crushing websites)?

- bflesch a year ago
  
  When ChatGPT cites web sources in its output to the user, it will call `backend-api/attributions` with the URL and the API will return what the website is about.
  Basically it makes an HTTP request to fetch the HTML `<title>` tag.
  They don't check length of supplied `urls[]` array and also don't check if it contains the same URL over and over again (with minor variations).
  It's just bad engineering all around.
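
As a rough illustration of how small the job is, here is a sketch of pulling a `<title>` out of a page while looking at only a bounded prefix of the body. This is not how the endpoint is implemented (that is unknown); `MAX_BYTES`, `TitleParser`, and `extract_title` are names invented for the example:

```python
from html.parser import HTMLParser

MAX_BYTES = 64 * 1024  # hypothetical cap on how much of a page is inspected


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> str:
        return "".join(self._chunks).strip()


def extract_title(body: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Parse at most `max_bytes` of the body and return the title, or '' if none."""
    parser = TitleParser()
    parser.feed(body[:max_bytes].decode("utf-8", errors="replace"))
    return parser.title
```

With a cap like this, a multi-gigabyte response costs the fetcher no more than a small one, provided the download itself is also cut off at the same limit.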
  
  bentcorner a year ago
  
  Slightly weird that this even exists - shouldn't the backend generating the chat output know what attribution it needs, and just ask the attributions api itself? Why even expose this to users?
  
  JohnMakin a year ago
  
  Even if you were unwilling to change this behavior in the application layer or on the server side, you could add a directive in the proxy to reject such large payloads as an immediate mitigation, unless they seriously need that parameter to accept an unlimited number of URLs (I'm guessing they have it set to some default like 2 MB and it will break at some limit, but I am afraid to play with this too much). Somehow I doubt they need that? I don't know though.
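
In nginx the directive in question would be `client_max_body_size`. The same idea written as application code, as a minimal sketch assuming a WSGI stack (the class name `BodySizeLimit` and the 16 KiB limit are assumptions, not a recommendation):

```python
MAX_BODY_BYTES = 16 * 1024  # hypothetical limit; tune to the largest legitimate payload


class BodySizeLimit:
    """WSGI middleware that rejects requests whose declared body is too large."""

    def __init__(self, app, max_bytes: int = MAX_BODY_BYTES):
        self.app = app
        self.max_bytes = max_bytes

    def __call__(self, environ, start_response):
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = self.max_bytes + 1  # treat an unparsable length as oversized
        if length > self.max_bytes:
            start_response("413 Content Too Large", [("Content-Type", "text/plain")])
            return [b"request body too large\n"]
        return self.app(environ, start_response)
```

This only looks at the declared `Content-Length`, so chunked uploads without that header would need a separate check while reading the body.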
  
andai a year ago

Is 5000 a lot? I'm out of the loop but I thought c10k was solved decades ago? Or is it about the "burstiness" of it?
(That all the requests come in simultaneously -- probably SSL code would be the bottleneck.)

- bflesch a year ago
  
  I'm not a DDOS expert and didn't test out the limits due to potential harm to OpenAI.
  Based on my experience I recognized it as potential security risk and framed it as DDOS because there's a big amplification factor: 1 API request via Cloudflare -> 5000 incoming requests from OpenAI
  - their requests come in simultaneously from different IPs
  - each request downloads up to 10 MB of random data (tested with a multi-GB file)
  - the requests come from different Azure IP ranges, either because they kept switching them or because of different geolocations.
  - if you block them on the firewall their requests still hammer your server (it's not like the first request notices it can't establish connection and then the next request TO SAME IP would stop)
  I tried to get it recognized and fixed, and now apparently HN did its magic because they've disabled the API :)
  Previously, their engineers might have argued that this is a feature and not a bug. But now that they have disabled it, it shows that this clearly isn't intended behavior.
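
To put a number on the amplification, here is the back-of-the-envelope arithmetic using the figures above. It is a worst-case ceiling, assuming every one of the 5000 requests pulls the full 10 MB (taken here as 10 * 10**6 bytes); the function name is just for illustration:

```python
REQUESTS_PER_CALL = 5000              # crawler requests triggered by one API call
MAX_BYTES_PER_REQUEST = 10 * 10**6    # "up to 10 MB" downloaded per request


def worst_case_bytes(api_calls: int = 1) -> int:
    """Upper bound on bytes served by the victim for a number of API calls."""
    return api_calls * REQUESTS_PER_CALL * MAX_BYTES_PER_REQUEST


print(f"{worst_case_bytes() / 10**9:.0f} GB")  # prints "50 GB"
```

So one small request on the attacker's side could translate into up to roughly 50 GB of outbound traffic for the target.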
  
- hombre_fatal a year ago
  
  c10k is about efficiently scheduling socket connections. it doesn’t make sense in this context nor is it the same as 10k rps.
  
anthony42c a year ago

Where does the 5000 HTTP request limit come from? Is that the limit of the URLs array?
I was curious to learn more about the endpoint, but can't find any online API docs. The docs ChatGPT suggests are defined for api.openai.com, rather than chatgpt.com/backend-api.
I wonder if it's reasonable (from a functional perspective) for the attributions endpoint not to place a limit on the number of URLs used for attribution. I guess potentially ChatGPT could reference hundreds of sites and thousands of web pages when answering a complex question that covered a range of different interrelated topics? Or do I misunderstand the intended usage of that endpoint?

smokel a year ago

Am I correct in understanding that you waited at most one week for a reply?
In my experience with large companies, that's rather short. Some nudging may be required every now and then, but expecting a response so fast seems slightly unreasonable to me.

pabs3 a year ago

Could those 5000 HTTP requests be made to go back to the ChatGPT API?

nurettin a year ago

They don't care. You are just raising their costs which they will in return charge their customers.

dangoodmanUT a year ago

has anyone tested this working? I get a 301 in my terminal trying to send a request to my site

- bflesch a year ago
  
  Hopefully they have fixed it by now. The magic of HN exposure...
  
soupfordummies a year ago

Try it and let us know :)

mitjam a year ago

How can it reach localhost or is this only a placeholder for a real address?

- bflesch a year ago
  
  The code in the github repo has some errors to prevent script kiddies from directly copy/pasting it.
  Obviously the proof-of-concept shared with OpenAI/BugCrowd didn't have such errors.
  
  mitjam a year ago
  
  Ah ok, thanks, that makes sense.
  Btw the ChatGPT Web App (haven’t tested with the Desktop App) can find info from local/private sites with the search tool, I assume they browse with a client side function.
  
  bflesch a year ago
  
  Yeah I first wanted to use this bug to scan their IP ranges and figure out their internal network (e.g. make requests to 10.0.0.1, 10.0.0.2, and so on). But then I realized that it will hallucinate an answer for every IP it is given :)
  So it would just come up with titles of random router admin panel websites.
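
On the defensive side, the usual guard against a fetcher being pointed at addresses like 10.0.0.1 is to refuse non-public targets before connecting. A minimal sketch using the standard `ipaddress` module; the function name `is_internal_target` is an assumption for this example, and it only covers literal IPs and localhost names:

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """Return True if the URL's host is localhost or a literal IP in a non-public range."""
    host = urlsplit(url).hostname or ""
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname; a real check must also resolve it and test the result
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

A hostname that resolves to an internal address would slip past this, so a real fetcher would have to repeat the check on the resolved IP at connection time.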
  
m3047 a year ago

Having first run a bot motel in I think 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days; and you could tell that eventually some human would come back and try to suss the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure it's an arms race but just like security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too, this one in particular reflects my personal sentiments:

> the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do
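
For anyone who hasn't seen one, the core of a bot motel fits in a few lines: every page links only to more generated pages, so a crawler that follows links never runs out. A minimal sketch, not Nepenthes's actual implementation; `tarpit_page` and `LINKS_PER_PAGE` are names made up for this example:

```python
import hashlib
import html

LINKS_PER_PAGE = 5  # hypothetical; any small number works


def tarpit_page(path: str) -> str:
    """Return an HTML page whose links are derived deterministically from `path`."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    base = path.rstrip("/")
    # Slice the hash into 8-character chunks to get stable child paths.
    links = [f"{base}/{digest[i * 8:(i + 1) * 8]}" for i in range(LINKS_PER_PAGE)]
    items = "\n".join(
        f'<li><a href="{html.escape(link, quote=True)}">{html.escape(link)}</a></li>'
        for link in links
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

Because the links come from a hash of the path, the maze needs no storage and the same URL always returns the same page. A real tarpit would also serve each page slowly to waste the crawler's time.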