tech_ken 2 months ago

I would say that the vast majority of type problems in data science/stats workflows come from data tables "trojan-horsing" type or missing data issues, rather than type problems strictly at the code level. Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

  • dragonwriter 2 months ago

    > Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

    IME with both Python and JS/TS, it helps a lot (which is different from completely solving the problem), for reasons which should generalize to other typing add-ons/supersets for untyped languages. Typing your code forces validations at the boundaries, which obviously doesn't stop upstream sources from messing with formats, but it does mean that you are much more likely to catch it at the boundary rather than having weird breakages deep in your code that you have to trace back to bad upstream data.

    • tech_ken 2 months ago

      Is the idea that if my year_quarter parser is properly typed then it should detect the format change and throw an error? (kind of a silly example, just trying to be illustrative)

      • Nadya 2 months ago

        Yes. Your type can encode what the proper format for a string should be, and if a string is passed that does not meet that format, it will throw an error, allowing you to make any necessary adjustments to handle the new year_quarter format.

        e.g. ``type DateString = `${number}/${number}/${number}` ``

        A super naïve check for using "/" instead of "-" as the separator character for a date formatted as a string. If a date is provided with some other separator character, it will throw an error. If my function takes a DateString, the string must be formatted correctly to pass the type check. Obviously this isn't enough (YYYY/MM/DD is different from DD/MM/YYYY), but the intention was to show a way to enforce something via types: rather than validating a string to check that you have a DateString, you can simply enforce that you have one.
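
        A minimal sketch of how that could look in practice. TypeScript types are erased at runtime, so strings arriving from an upstream source still need an explicit check before they can be treated as a DateString. The names `isDateString` and `toDateString` are hypothetical, and the regex is deliberately stricter than `${number}` (digits only):

```typescript
// Template literal type: three numeric segments separated by "/".
type DateString = `${number}/${number}/${number}`;

// Hypothetical runtime guard (an assumption, not part of any library).
// It narrows a plain string to DateString only when the digits-and-slashes
// pattern matches.
function isDateString(s: string): s is DateString {
  return /^\d+\/\d+\/\d+$/.test(s);
}

// Hypothetical boundary function: throws if the upstream format has changed.
function toDateString(s: string): DateString {
  if (!isDateString(s)) {
    throw new Error(`Unexpected date format: ${JSON.stringify(s)}`);
  }
  return s;
}

// A literal is checked at compile time:
const literalDate: DateString = "2024/03/31";
// const badLiteral: DateString = "2024-03-31"; // rejected by the compiler

// A value only known at runtime goes through the guard instead:
const parsedDate: DateString = toDateString("2024/03/31");
```

        With this shape, a separator change from "/" to "-" in upstream data fails in `toDateString` at the boundary instead of somewhere deeper in the code.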

      • dllthomas 2 months ago

        "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations, but you can structure your code such that that won't happen accidentally: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

        The idea is that checking should be the only way of making a value of the type. That prevents you from forgetting to check when you turn some broader type (say, string) into the more narrow one (date, in this case).
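
        A rough sketch of that structure in TypeScript, using a branded type. The names `YearQuarter`, `parseYearQuarter`, and `quarterOf` are hypothetical, and the "YYYY-Qn" format is an assumption chosen for illustration:

```typescript
// Branded type: a plain string is not assignable to YearQuarter without a cast.
type YearQuarter = string & { readonly __brand: "YearQuarter" };

// Hypothetical parser, intended to be the only place a YearQuarter is created.
// Assumes the expected upstream format is "YYYY-Qn", e.g. "2024-Q3".
function parseYearQuarter(raw: string): YearQuarter {
  if (!/^\d{4}-Q[1-4]$/.test(raw)) {
    throw new Error(`Unrecognized year-quarter: ${JSON.stringify(raw)}`);
  }
  return raw as YearQuarter;
}

// Downstream code asks for YearQuarter, so callers holding a plain string
// have to go through the parser first.
function quarterOf(yq: YearQuarter): number {
  return Number(yq.slice(-1));
}

const quarter = quarterOf(parseYearQuarter("2024-Q3"));
// quarterOf("2024-Q3"); // rejected by the compiler: string is not YearQuarter
```

        A deliberate `as YearQuarter` cast elsewhere can still bypass the check, so this guards against forgetting to validate, not against someone choosing to skip it.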

        • dragonwriter 2 months ago

          > "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations

          Yeah, of course you can cheat the typechecking in the code at the boundary in several ways, or convert from wire format to internal types in a way which plugs in type-valid defaults for bad data rather than erroring, or just use too-broad internal types to start with (you can have "stringly-typed code"), and fail to help with the problem. But if you use the types that make sense internally for what the code is doing, then conversion including validation at the boundary becomes the path of least resistance in most cases. "Forces" is not strictly true, but my experience is that adding types does create a strong push for boundary validation.

rscho 2 months ago

No, it will get in the way more than anything else. As has been said elsewhere in the thread, what we need to ensure in R is mostly runtime constraints (array shape, number in a specific interval, etc.). Expressing those in types would require a super heavy and complex type system, with at least refinement types and probably full dependent types. It would be too complex to use for most people and use cases. A contract system would be far more practical and useful. See CHECK and other constraints in SQLite. That's exactly what we need.
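
To make the contract idea concrete, here is a minimal sketch, written in TypeScript rather than R to match the other code in this thread. The helper `requireThat` and the `weightedMean` example are hypothetical; the point is only that shape and interval constraints are checked at runtime, in the spirit of a CHECK constraint:

```typescript
// Hypothetical contract helper: throws when a runtime condition is violated.
function requireThat(condition: boolean, message: string): void {
  if (!condition) {
    throw new Error(`Contract violated: ${message}`);
  }
}

// Hypothetical example: the inputs must satisfy shape and interval
// constraints that a plain `number[]` annotation cannot express.
function weightedMean(values: number[], weights: number[]): number {
  requireThat(values.length > 0, "values must be non-empty");
  requireThat(
    values.length === weights.length,
    "values and weights must have equal length",
  );
  requireThat(
    weights.every((w) => w >= 0 && w <= 1),
    "each weight must lie in [0, 1]",
  );
  const total = weights.reduce((a, b) => a + b, 0);
  requireThat(total > 0, "weights must not sum to zero");
  return values.reduce((acc, v, i) => acc + v * weights[i], 0) / total;
}

const mean = weightedMean([1, 3], [0.5, 0.5]);
```

The checks sit next to the function they protect and need no extra type machinery, which is roughly the trade-off being argued for here.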

ellisv 2 months ago

It is probably helpful in some cases and unhelpful in others. R uses multiple dispatch, so calling `foo` on different types can produce different output. It isn't clear to me how Vapour handles this. In general though, folks are passing around data.frame or similar objects.

levocardia 2 months ago

Not really, because honestly a lot of us who came into programming via research never learned typed languages or unit tests or any of those best practices. We were just hacking around in MATLAB, R, or Python from the start. What I really need is a seamless and easy way to run statistical models that can only be fit in R, but from Python or Node. There are several categories of statistical modeling where R completely blows Python out of the water, and it's incredibly wasteful (and error-prone) to try to re-implement these yourself in Python.