Comment by bobbiechen 10 hours ago

>We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.

+1000 I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally driven. Some people hate these things so much that they're willing to cut off their nose to spite their face.

bayindirh 8 hours ago

My view on this is simple:

If you're a bot that will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.

No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.

That's all, thanks.

  • beeflet 7 hours ago

    I think the problem is that despite the effort, you will still end up in the dataset. So it's futile.

  • warkdarrior 8 hours ago

    How can you tell a bot will ignore all your content licenses?

    • bayindirh 8 hours ago

      Currently, all AI companies argue that the content they use falls under fair use, and they disregard all licenses. This means any future bots that respect these licenses need to be whitelisted.
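      (A deny-by-default whitelist like this is commonly expressed in robots.txt: block everything, then allow specific agents believed to honor licensing. The agent name below is illustrative, and robots.txt is only advisory, since it relies on crawlers choosing to obey it.)

```
# Deny all crawlers by default
User-agent: *
Disallow: /

# Whitelist a crawler known to respect content licenses
# (name is a placeholder, not a real bot)
User-agent: ExampleLicenseAwareBot
Allow: /
```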

      • diggan 8 hours ago

        How do you know that bot belongs to one of those AI companies? Maybe it's my personal bot you're blocking; should I also be denied (indirect) access to the content?

Vegenoid 8 hours ago

I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.

The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.

jitl 10 hours ago

Really? If I’m an unsophisticated blog not using a CDN, and I get a $1000 bill for bandwidth overage or something, I’m gonna google a solution and slap it on there, because I don’t want to pay another $1000 for Big Basilisk. I don’t think that’s an emotional response; it’s common sense.

  • marginalia_nu 9 hours ago

    Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

    Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.

    Git forges, like the one TFA is discussing, are also fairly expensive to serve, especially as crawlers traverse historical states. A poorly implemented crawler will get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler because of this.
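    (One common mitigation, sketched below: filter out URLs that look like a forge's per-commit history pages before fetching them. The patterns are illustrative; real crawlers tune these per forge, and `should_crawl` is a hypothetical helper, not any particular crawler's API.)

```python
import re

# Illustrative patterns for git-forge "historical state" pages:
# paths addressing a specific commit hash, or GitLab-style /-/ routes.
GIT_FORGE_PATTERNS = re.compile(
    r"/(commit|commits|blame|raw|blob|tree)/[0-9a-f]{7,40}\b"
    r"|/-/(commit|commits|blame|raw|blob|tree)/"
)

def should_crawl(url: str) -> bool:
    """Return False for URLs that walk a repo's historical states."""
    return GIT_FORGE_PATTERNS.search(url) is None
```

    A filter like this keeps a crawler from enumerating every file at every commit, which is where the "stuck forever" behavior comes from.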

    • mtlynch 8 hours ago

      >Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.

      I actually find this surprisingly difficult to find.

      I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.

      There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

      • diggan 8 hours ago

        > There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

        Yeah, that's true; surprisingly, there aren't many "I give you money and HTML, you host it" services out there. Probably the most mature, cheapest, and most reliable one today would be good ol' neocities.org (run by HN user kyledrake), which basically gives you 3TB/month for $5. Pretty good deal :)

        Sometimes when I miss StumbleUpon I go to https://neocities.org/browse?sort_by=random which gives a fun little glimpse of the hobby/curiosity/creative web.

      • marginalia_nu 8 hours ago

        If you just want to host HTML for personal use github pages is free (and works with a custom domain). There are bandwidth limitations, but they definitely won't pull an AWS on you and send a bill that would cover a new car because a crawler acted up.

      • thaumaturgy 7 hours ago

        Interesting, I was under the impression this was more common than maybe it is. I know the hosting market has gotten pretty bad.

        So, I'm currently building pretty much this. After doing it on the side for clients for years, it's now my full-time effort. I have a solid and stable infrastructure, but not yet an API or web frontend. If somebody wants basically ssh, git, and static (or even not static!) hosting that comes with a sysadmin's contact information for a small number of dollars per month, I can be reached at sysop@biphrost.net.

        Environment is currently Debian-in-LXC-on-Debian-on-DigitalOcean.

      • ctoth 7 hours ago

        > There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.

        Dreamhost! They're still around and still lovely after how many years? I even find their custom control panel charming.

        • hobs 7 hours ago

          I really like DH (though I am still mad about the cloudatcost shenanigans) and use them, but if you use 200x the resources the other shared sites consume, you're getting the boot just like anyone.

  • phantompeace 9 hours ago

    Wouldn't it be easier to put the unsophisticated blog behind Cloudflare?

    • mhuffman 8 hours ago

      As much as I like to shit on Cloudflare at every opportunity, it would obviously be easier to put it behind CF than to install bot detection plugins.