pegasus 2 days ago

The same way you test any system - you find a sampling of test subjects, have them interact with the system and then evaluate those interactions. No system is guaranteed to never fail, it's all about degree of effectiveness and resilience.

The thing is (and maybe this is what parent meant by non-determinism, in which case I agree it's a problem), in this brave new technological use-case, the space of possible interactions dwarfs anything machines have dealt with before. And it seems inevitable that the space of possible misunderstandings which can arise during these interactions will balloon similarly. Simply because of the radically different nature of our AI interlocutor, compared to what (actually, who) we're used to interacting with in this world of representation and human life situations.
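
To make "sample interactions and evaluate them" concrete, here's a rough Python sketch of an eval harness. `run_chatbot` and `grade` are placeholders for the system under test and whatever scoring you use (human review, string checks, an LLM judge); the test case borrows the NYC-benefits example from further down the thread.

    # Hypothetical eval harness: replay a sampled set of interactions and score them.
    test_cases = [
        {"prompt": "Does NYC provide disability benefits? If so, for how long?",
         "must_mention": "26 weeks"},
        # ... more sampled interactions go here
    ]

    def run_chatbot(prompt: str) -> str:
        raise NotImplementedError  # placeholder for the system under test

    def grade(reply: str, case: dict) -> bool:
        # Placeholder scorer: could be human review, string checks, or an LLM judge.
        return case["must_mention"].lower() in reply.lower()

    results = [grade(run_chatbot(c["prompt"]), c) for c in test_cases]
    print(f"Pass rate: {sum(results)}/{len(results)}")  # a degree of effectiveness, never a guarantee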

drillsteps5 2 days ago

Does knowing the system architecture not help you with defining things like happy path vs edge case testing? I guess it's much less applicable for overall system testing, but in "normal" systems you test components separately before you test the whole thing, which is not the case with LLMs.

By "non-deterministic" I meant that it can give you different output for the same input. Ask the same question, get a different answer every time, some of which can be accurate, some... not so much. Especially if you ask the same question in the same dialog (so question is the same but the context is not so the answer will be different).

EDIT: More interestingly, if I find an issue, what do I even DO? If it's not related to integrations or your underlying data, and the black box just gave nonsensical output, what would I do to resolve it?

  • bhadass a day ago

    > EDIT: More interestingly, if I find an issue, what do I even DO? If it's not related to integrations or your underlying data, and the black box just gave nonsensical output, what would I do to resolve it?

    Lots of stuff you could do: adjust the system prompt, add guardrails/filters (catch mistakes and loop back to the LLM to try again), improve the RAG pipeline (assuming there is one), fine-tune (if necessary), etc.
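
    For example, a guardrail can be as simple as a validate-and-retry loop around the model call. Rough sketch only; `call_llm` and `validate` stand in for whatever client and checks you actually use:

        def call_llm(messages):
            raise NotImplementedError  # placeholder for your chat-completion client

        def answer_with_guardrail(question, validate, max_retries=3):
            """Ask the LLM, check the reply, and re-prompt with the failure reason if it fails."""
            messages = [{"role": "user", "content": question}]
            for _ in range(max_retries):
                reply = call_llm(messages)
                ok, reason = validate(reply)  # e.g. schema check, banned-phrase filter, citation check
                if ok:
                    return reply
                messages += [
                    {"role": "assistant", "content": reply},
                    {"role": "user", "content": f"That answer failed a check ({reason}). Please try again."},
                ]
            return "Sorry, I couldn't produce a reliable answer."  # fall back instead of returning nonsense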

datsci_est_2015 2 days ago

> The same way you test any system - you find a sampling of test subjects, have them interact with the system and then evaluate those interactions.

That’s not strictly how I test my systems. I can release with confidence because of a litany of SWE best practices learned and borrowed from decades of my own and other people’s experiences.

> No system is guaranteed to never fail, it's all about degree of effectiveness and resilience.

It seems like the product space for services built on generative AI is diminishing by the day with respect to “effectiveness and resilience”. I was just laughing with some friends about how terrible most of the results are when using Apple’s new Genmoji feature. Apple, the company with one of the largest market caps in the world.

I can definitely use LLMs and other generative AI directly, and understand the caveats, and even get great results from them. But so far every service I’ve interacted with that was a “white label” repackaging of generative AI has been absolute dogwater.

themafia 2 days ago

> radically different nature of our AI interlocutor

It's the training data that matters. Your "AI interlocutor" is nothing more than a lossy compression algorithm.

  • pegasus 2 days ago

    Yet it won't be easy not to anthropomorphize it, expecting it to just know what we mean, as any human would. And most of the time it will, but once in a while it will betray its unthinking nature, taking the user by surprise.

    • themafia 2 days ago

      > taking the user by surprise.

      And surprise is really what you want in computing. ;)

  • sebastiennight 2 days ago

    Most AI Chatbots do not rely on their training data, but on the data that is passed to them through RAG. In that sense they are not compressing the data, just searching and rewording it for you.
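
    Roughly like this (a sketch; `search_documents` and `call_llm` are placeholders for the retriever and the model client):

        def search_documents(question: str) -> list[str]:
            raise NotImplementedError  # placeholder: Postgres full-text search, a vector store, etc.

        def call_llm(prompt: str) -> str:
            raise NotImplementedError  # placeholder chat-completion call

        def answer(question: str) -> str:
            fragments = search_documents(question)  # retrieval: find relevant source text at run time
            prompt = ("You are a helpful chatbot. Answer the question using only the data provided.\n\n"
                      f"Data: {fragments}\n\nQuestion: {question}")
            return call_llm(prompt)  # the model rewords the retrieved data rather than recalling it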

    • themafia 2 days ago

      > and rewording it

      Using the probabilities encoded in the training data.

      > In that sense they are not compressing the data

      You're right. In this case they're decompressing it.

      • sebastiennight 6 hours ago

        It feels like you're being pedantic to defend your original claim, which was inaccurate.

            User input: Does NYC provide disability benefits? if so, for how long?

            RAG pipeline: 1 result found in Postgres, here's the relevant fragment: "In New York City, disability benefits provide cash assistance to employees who are unable to work due to off-the-job injuries or illnesses, including disabilities from pregnancies. These benefits are typically equal to 50% of the employee's average weekly wage, with a maximum of $170 per week, and are available for up to 26 weeks within a 52-week period."

            LLM scaffolding: "You are a helpful chatbot. Given the question above and the data provided, reply to the user in a kind helpful way".

        The LLM here is only "using the probabilities encoded in the training data" to know that after "Yes, it does" it should output the token "!".

        However, it is not "decompressing" its "training data" to write

            the maximum duration, however, is 26 weeks within a 52-week period!
        
        It is just getting this from the data provided at run-time in the prompt, not from training data.