Comment by tech_ken 10 hours ago

5 replies

I would say that the vast majority of type problems in data science/stats workflows come from data tables "trojan-horsing" type or missing-data issues, rather than from type problems strictly at the code level. Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

dragonwriter 9 hours ago

> Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.

IME with both Python and JS/TS, it helps a lot (which is different from completely solving the problem), for reasons which should generalize to other typing add-ons/supersets for untyped languages. Typing your code forces validations at the boundaries, which obviously doesn't stop upstream sources from messing with formats, but it does mean that you are much more likely to catch a format change at the boundary rather than having weird breakages deep in your code that you have to trace back to bad upstream data.

  • tech_ken 9 hours ago

    Is the idea that if my year_quarter parser is properly typed then it should detect the format change and throw an error? (kind of a silly example, just trying to be illustrative)

    • Nadya 8 hours ago

      Yes. Your type can encode what the proper format for a string should be, and if a string is passed that does not meet that format it will throw an error, allowing you to make any necessary adjustments to handle the new year_quarter format.

      e.g. type DateString = `${number}/${number}/${number}`

      A super naïve check for using "/" instead of "-" as the separator character for a date formatted as a string. If a date is provided with some other separator character it will throw an error. If my function takes a DateString, the string must be formatted correctly to pass the type check. Obviously this isn't enough (YYYY/MM/DD is different from DD/MM/YYYY), but the intention was to show a way to enforce something via types: rather than validating a string to check that you have a DateString, you can simply enforce that you have one.
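
      A minimal TypeScript sketch of the DateString idea, with one caveat: the template literal type is checked at compile time and erased at runtime, so a string arriving from an upstream source only becomes a DateString through an explicit check. The names isDateString, toDateString, and yearOf are hypothetical, and the digits-only regex is an assumption (it is stricter than `${number}`, which also admits things like "1e3").

```typescript
// Compile-time shape: three numeric segments separated by "/".
type DateString = `${number}/${number}/${number}`;

// Hypothetical runtime guard. Types are erased at runtime, so upstream data
// needs an explicit check before it can be treated as a DateString.
function isDateString(s: string): s is DateString {
  return /^\d+\/\d+\/\d+$/.test(s);
}

// Hypothetical boundary function: throws if the upstream format has changed.
function toDateString(raw: string): DateString {
  if (!isDateString(raw)) {
    throw new Error(`Unexpected date format: ${raw}`);
  }
  return raw;
}

// Downstream code accepts only the narrow type, never a bare string.
function yearOf(d: DateString): number {
  return Number(d.split("/")[0]);
}
```

      With this in place, yearOf(toDateString("2024/01/15")) gives 2024, while toDateString("2024-01-15") throws at the boundary, and passing an unchecked string variable to yearOf is rejected by the type checker.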

    • dllthomas 8 hours ago

      "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations, but you can structure your code such that that won't happen accidentally: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

      The idea is that checking should be the only way of making a value of the type. That prevents you from forgetting to check when you turn some broader type (say, string) into the more narrow one (date, in this case).
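
      A rough TypeScript sketch of "checking is the only way of making a value of the type". Everything here is an assumption for illustration: the YearQuarter class, the label function, and the "YYYYQn" wire format (e.g. "2024Q3").

```typescript
class YearQuarter {
  // Private fields make the class nominal: an object literal with the same
  // public shape is not assignable to YearQuarter.
  private readonly y: number;
  private readonly q: number;

  // Private constructor: code outside the class cannot call `new YearQuarter`.
  private constructor(y: number, q: number) {
    this.y = y;
    this.q = q;
  }

  // The only way to obtain a YearQuarter. Assumed format: "YYYYQn".
  static parse(raw: string): YearQuarter {
    const m = /^(\d{4})Q([1-4])$/.exec(raw);
    if (m === null) {
      throw new Error(`Not a year-quarter string: ${JSON.stringify(raw)}`);
    }
    return new YearQuarter(Number(m[1]), Number(m[2]));
  }

  get year(): number {
    return this.y;
  }

  get quarter(): number {
    return this.q;
  }
}

// Hypothetical consumer: it cannot receive an unchecked string.
function label(p: YearQuarter): string {
  return `${p.year}-Q${p.quarter}`;
}
```

      Here label(YearQuarter.parse("2024Q3")) returns "2024-Q3", and YearQuarter.parse("Q3 2024") throws, so a silent upstream format change surfaces at the parse call rather than somewhere downstream.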

      • dragonwriter 6 hours ago

        > "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations

        Yeah, of course you can cheat the typechecking in the code at the boundary in several ways, or convert from wire format to internal types in a way which plugs in type-valid defaults for bad data rather than erroring, or just use too-broad internal types to start with (you can have "stringly-typed code"), and so fail to help with the problem. But if you use the types that make sense internally for what the code is doing, then conversion including validation at the boundary becomes the path of least resistance in most cases. "Forces" is not strictly true, but my experience is that adding types does create a strong push for boundary validation.
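
        To illustrate the contrast, here is a hedged TypeScript sketch of two conversions from wire format to an internal type. The Row shape, the field names year_quarter and revenue, the "YYYYQn" format, and both function names are assumptions made up for the example.

```typescript
// Hypothetical internal type the rest of the code works with.
interface Row {
  year: number;
  quarter: number;
  revenue: number;
}

const YEAR_QUARTER = /^(\d{4})Q([1-4])$/; // assumed wire format, e.g. "2024Q3"

// Strict conversion: bad upstream data fails loudly at the boundary.
function parseRow(wire: unknown): Row {
  if (typeof wire !== "object" || wire === null) {
    throw new Error("row is not an object");
  }
  const rec = wire as Record<string, unknown>;
  const period = rec["year_quarter"];
  const revenue = rec["revenue"];
  if (typeof period !== "string") {
    throw new Error("year_quarter is not a string");
  }
  const m = YEAR_QUARTER.exec(period);
  if (m === null) {
    throw new Error(`unexpected year_quarter format: ${period}`);
  }
  if (typeof revenue !== "number" || !Number.isFinite(revenue)) {
    throw new Error("revenue is not a finite number");
  }
  return { year: Number(m[1]), quarter: Number(m[2]), revenue };
}

// Lenient conversion: type-checks fine, but plugs in defaults for bad data,
// so a format change turns into silently wrong rows.
function lenientRow(wire: any): Row {
  const m = YEAR_QUARTER.exec(String(wire?.year_quarter));
  return {
    year: m ? Number(m[1]) : 0,
    quarter: m ? Number(m[2]) : 1,
    revenue: Number(wire?.revenue) || 0,
  };
}
```

        Both functions satisfy the type checker. Given { year_quarter: "2024-Q3", revenue: 1 }, parseRow throws, while lenientRow quietly returns a row with year 0.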