Comment by llbbdd 2 days ago

21 replies

Yeah, APIs exist because computers used to require very explicitly structured data; with LLMs a lot of the ambiguity of HTML disappears as far as a scraper is concerned.

swatcoder 2 days ago

> LLMs a lot of the ambiguity of HTML disappears as far as a scraper is concerned

The more effective way to think about it is that "the ambiguity" silently gets blended into the data. It might disappear from superficial inspection, but it's not gone.

The LLM is essentially just doing educated guesswork without leaving a consistent or thorough audit trail. This is a fairly novel capability and there are times where this can be sufficient, so I don't mean to understate it.

But it's a different thing than making ambiguity "disappear" when it comes to systems that actually need true accuracy, specificity, and non-ambiguity.

Where it matters, there's no substitute for "very explicit structured data" and never really can be.

  • llbbdd 2 days ago

    "Disappear" might be too strong a word here, but yeah, as you said, as the delta closes between what a human user and an AI user are able to interpret from the same text, it becomes good enough for some nines of cases. Even if on paper it became mathematically "good enough" for high-risk cases like medical or government data, structured data will still have a lot of value. I just think more and more structured data is going to be cleaned up from unstructured data, except for those higher-precision cases.

dmitrygr 2 days ago

"computers used to require"

please do not write code. ever. Thinking like this is why people now think that 16GB RAM is too little and 4 cores is the minimum.

API -> ~200,000 cycles to get data, RAM O(size of data), precise result

HTML -> LLM -> ~30,000,000,000 cycles to get data, RAM O(size of LLM weights), results partially random and unpredictable
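
To put the two cycle counts above side by side, here is a minimal arithmetic sketch. The cycle figures are the rough estimates quoted in this comment; the 3 GHz clock is purely an assumption for illustration, and real costs depend heavily on hardware and model size.

```python
# Rough figures quoted in the comment above (estimates, not measurements).
API_CYCLES = 200_000
LLM_CYCLES = 30_000_000_000

# Assumption for illustration only: a single core running at 3 GHz.
ASSUMED_CLOCK_HZ = 3_000_000_000


def cost_ratio(llm_cycles: int, api_cycles: int) -> float:
    """How many times more cycles the LLM path uses than the API path."""
    return llm_cycles / api_cycles


def seconds(cycles: int, clock_hz: int) -> float:
    """Wall-clock time for a cycle count at a given clock rate."""
    return cycles / clock_hz


if __name__ == "__main__":
    print(f"ratio: {cost_ratio(LLM_CYCLES, API_CYCLES):,.0f}x")
    print(f"API:   {seconds(API_CYCLES, ASSUMED_CLOCK_HZ) * 1e6:.1f} microseconds")
    print(f"LLM:   {seconds(LLM_CYCLES, ASSUMED_CLOCK_HZ):.1f} seconds")
```

Under those assumptions the gap is about five orders of magnitude per request, before counting the difference in RAM.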

  • hartator 2 days ago

    If the API doesn’t have the data you want, this point is moot.

    • dotancohen 2 days ago

      Not GP, but I disagree. I've written successful, robust web scrapers without LLMs for decades.

      What do you think the E in perl stands for?

      • llbbdd 2 days ago

        This is probably just a parallel discussion. I've written plenty of successful web scrapers without LLMs, but in the last couple of years I've written a lot more where I didn't need to look at the web markup for more than a few seconds first, if at all. Often you can just copy-paste an example page into the LLM and have it generate accurate, consistent selectors. It's not much different from integrating with a formal API, except that the API usually has more explicit usage rules, and APIs will also often restrict data that can very obviously be used competitively.
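
        A minimal sketch of that workflow's second half, assuming the selector has already been produced (by an LLM or by hand): once you have it, extraction is ordinary deterministic code with no model in the loop. The sample markup, the class names, and the helper names `ClassTextExtractor` and `extract` are all hypothetical, and this only handles a single class-name selector using the standard library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []     # one string per matching element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # nested element inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract(html: str, css_class: str) -> list:
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page fragment; "price" stands in for a generated selector.
SAMPLE = """
<ul>
  <li class="item"><span class="name">Widget</span> <span class="price">$4.00</span></li>
  <li class="item"><span class="name">Gadget</span> <span class="price">$7.50</span></li>
</ul>
"""

if __name__ == "__main__":
    print(extract(SAMPLE, "price"))
```

        The selector is the only part that came from the model; if the site's markup changes, it breaks loudly (empty results) rather than guessing.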

      • llbbdd a day ago

        Sorry for double-posting, but the more I read this the less it makes sense. The parent reply was talking about data that was straight-up not available via the API; how does perl help with that?

  • llbbdd 2 days ago

    A lot of software engineering is recognizing the limitations of the domain that you're trying to work in, and adapting your tools to that environment, but thank you for your contribution to the discussion.

    EDIT: I hemmed and hawed about responding to your attitude directly, but do you talk to people anywhere but here? Is this the attitude you would bring to normal people in your life?

    Dick Van Dyke is 100 years old today. Do you think the embittered and embarrassing way you talk to strangers on the internet is positioning your health to enable you to live that long, or do you think the positive energy he brings to life has an effect? Will you readily die to support your animosity?

  • shadowgovt 2 days ago

    On the other hand, I already have an HTML parser, and your bespoke API would require a custom tool to access.

    Multiply that by every site, and that approach does not scale. Parsing HTML scales.

    • swiftcoder 2 days ago

      You already have JSON and XML parsers too, and the website offers standardised APIs in both of those.

      • shadowgovt 2 days ago

        Not standardized enough; I can't guarantee the format of an API is RESTful, I can't know a priori what the response format is (arbitrary servers on the internet can't be trusted to be setting content type headers properly), or how to crawl it given the response data, etc. We ultimately never solved the problem of universal self-describing APIs, so a general crawling service can't trust that they work.

        In contrast, I can always trust that whatever is returned to be consumed by the browser is in a format that a browser can consume, because if it isn't, the site isn't a website. HTML is pretty much the only format guaranteed to be working.
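
        To illustrate the content-type problem described above, here is a rough sketch of the kind of guessing a general crawler ends up doing when it cannot trust the declared type: it inspects the body itself. The function name `sniff_format` and the classification rules are assumptions for illustration, not how any particular crawler works.

```python
import json
import xml.etree.ElementTree as ET


def sniff_format(body: str) -> str:
    """Guess a response format from the body alone, ignoring any declared type.

    Returns "json", "xml", "html", or "unknown".
    """
    text = body.lstrip()
    if not text:
        return "unknown"

    # Anything that parses as JSON is treated as JSON.
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass

    if text.startswith("<"):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            # Markup that is not well-formed XML: assume browser-style HTML.
            return "html"
        # Well-formed markup: XHTML still counts as HTML.
        local_name = root.tag.rsplit("}", 1)[-1].lower()
        return "html" if local_name == "html" else "xml"

    return "unknown"


if __name__ == "__main__":
    for sample in ('{"a": 1}', "<feed><entry/></feed>",
                   "<html><body><p>hi<br></body></html>", "plain text"):
        print(repr(sample), "->", sniff_format(sample))
```

        Even when this guess is right, it only tells you the syntax, not what the fields mean or which URLs to follow next, which is the gap being pointed at here. Note also that the standard library XML parser is not hardened against maliciously constructed input, so a real crawler would need more care than this.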

    • dmitrygr 2 days ago

      parsing html -> lazy but ok

      using an llm to parse html -> please do not

      • llbbdd 2 days ago

        > Lazy but ok

        You're absolutely welcome on your own free time to waste it on whatever feels right

        > using an llm to parse html -> please do not

        have you used any of these tools with a beginner's mindset in like, five years?

  • venturecruelty 2 days ago

    Weeping and gnashing of teeth because RAM is expensive, and then you learn that people buy 128 GB for their desktops so they can ask a chatbot how to scrape HTML. Amazing.

    • llbbdd a day ago

      The more I've thought about it, the RAM part is hardly the craziest bit. Where the fuck do you even buy a computer with less than 4 cores in 2025? Pawn shop?

    • llbbdd 2 days ago

      Isn't it ridiculous? This is Hacker News. Nobody with the spare time to post here is living on the street. Buy some RAM or rent it. I honestly can't believe how many people I see on here bemoaning the fact that they haven't upgraded their laptops in 20 years, as if it's somehow anyone else's problem.

    • lechatonnoir 2 days ago

      it's kind of hard to tell what your position is here. should people not ask chatbots how to scrape html? should people not purchase RAM to run chatbots locally?

    • shadowgovt 2 days ago

      I may be out of the loop; is system RAM key for LLMs? I thought they were mostly graphics RAM constrained.