Comment by clircle 2 months ago
Statisticians and researchers, is this helpful?
> Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.
IME with both Python and JS/TS, it helps a lot (which is different than completely solving the problem), for reasons which should generalize to other typing add-ons/supersets for untyped languages. Typing your code forces validations at the boundaries, which obviously doesn't stop upstream sources from messing with formats but it does mean that you are much more likely to catch it at the boundary rather than having weird breakages deep in your code that you have to trace back to bad upstream data.
Yes. Your type can encode the proper format for a string, and if a string that does not meet that format is passed, the type checker reports an error, which lets you make whatever adjustments are needed to handle the new year-quarter format.
e.g. `` type DateString = `${number}/${number}/${number}` ``
This is a super naïve check that "/" rather than "-" is used as the separator in a date formatted as a string. A date written with any other separator fails the type check, so if my function takes a DateString, the string must be formatted correctly to be accepted. Obviously this isn't enough (YYYY/MM/DD is different from DD/MM/YYYY), but the intention was to show a way to enforce something via types: rather than validating a string to check that you have a DateString, you can simply require one.
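A minimal sketch of how that template literal type could be paired with a runtime check, so that string literals are checked at compile time and upstream data is checked when it arrives. `isDateString`, `yearOf`, and `handleUpstream` are hypothetical names, and the regex is an assumption about what "number/number/number" should mean:

```typescript
// Compile-time shape: three numeric segments separated by "/".
type DateString = `${number}/${number}/${number}`;

// Hypothetical type guard: narrows an arbitrary string to DateString at run time.
// The regex (digits only) is stricter than the type; that is a choice of this sketch.
function isDateString(value: string): value is DateString {
  return /^\d+\/\d+\/\d+$/.test(value);
}

// Hypothetical consumer that only accepts the narrowed type.
// Assumes the year is the first segment (YYYY/MM/DD).
function yearOf(date: DateString): number {
  return Number(date.split("/")[0]);
}

// Boundary function: raw upstream strings must pass the guard before use.
function handleUpstream(raw: string): number {
  if (!isDateString(raw)) {
    throw new Error(`unexpected date format: ${raw}`);
  }
  return yearOf(raw); // raw is a DateString here
}

console.log(yearOf("2024/12/31")); // a literal in the right shape type-checks
// yearOf("2024-12-31");           // rejected by the type checker
```

The type alone only constrains strings the compiler can see; the guard is what catches a format change in data that shows up at run time.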
"Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations, but you can structure your code such that that won't happen accidentally: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
The idea is that checking should be the only way of making a value of the type. That prevents you from forgetting to check when you turn some broader type (say, string) into the more narrow one (date, in this case).
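A rough sketch of that idea using a branded type, so that the parsing function is the only ordinary way to obtain a value of the narrower type. The `YearQuarter` name, the brand field, and the "YYYY-Qn" format are all assumptions made for illustration:

```typescript
// A branded type: a string at run time, but not assignable from a plain string.
type YearQuarter = string & { readonly __brand: "YearQuarter" };

// The only place where a string is turned into a YearQuarter.
function parseYearQuarter(raw: string): YearQuarter {
  if (!/^\d{4}-Q[1-4]$/.test(raw)) {
    throw new Error(`not a year-quarter string: ${JSON.stringify(raw)}`);
  }
  return raw as YearQuarter; // the single, checked cast
}

// Downstream code asks for YearQuarter, so it cannot be handed an unchecked string.
function quarterNumber(yq: YearQuarter): number {
  return Number(yq.slice(-1));
}

console.log(quarterNumber(parseYearQuarter("2023-Q4")));
// quarterNumber("2023-Q4"); // compile error: string is not assignable to YearQuarter
```

Someone can still write `as YearQuarter` elsewhere, but that has to be done deliberately rather than by forgetting a check.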
> "Typing your code forces validations at the boundaries" was too strong because of course you can type your code without actually doing the validations
Yeah, of course you can cheat the typechecking in the code at the boundary in several ways, or convert from wire format to internal types in a way that plugs in type-valid defaults for bad data rather than erroring, or just use too-broad internal types to start with (you can have "stringly-typed code"), and then the types won't help with the problem. But if you use the types that make sense internally for what the code is doing, then conversion including validation at the boundary becomes the path of least resistance in most cases. "Forces" is not strictly true, but my experience is that adding types does create a strong push for boundary validation.
No, it will get in the way more than anything else. As has been said elsewhere in the thread, what we need to ensure in R is mostly runtime constraints (array shape, a number in a specific interval, etc.). Expressing those in types would require a super heavy and complex type system, with at least refinement types and probably fully dependent types. It would be too complex to use for most people and use cases. A contract system would be far more practical and useful. See CHECK and other constraints in SQLite. That's exactly what we need.
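To make the contract idea concrete, here is a small sketch of runtime contracts for the two constraints mentioned (a number in an interval, an array of a given shape). It is written in TypeScript only to match the other snippet in the thread; `Contract`, `inInterval`, `hasShape`, and `check` are hypothetical helpers, not an existing library:

```typescript
// A contract returns null when it holds, or a message describing the violation.
type Contract<T> = (value: T) => string | null;

const inInterval = (lo: number, hi: number): Contract<number> =>
  (x) => (x >= lo && x <= hi ? null : `${x} is outside [${lo}, ${hi}]`);

const hasShape = (rows: number, cols: number): Contract<number[][]> =>
  (m) =>
    m.length === rows && m.every((r) => r.length === cols)
      ? null
      : `expected a ${rows}x${cols} matrix`;

// Run a contract and fail loudly, similar in spirit to a SQL CHECK constraint.
function check<T>(value: T, contract: Contract<T>): T {
  const problem = contract(value);
  if (problem !== null) throw new Error(`contract violated: ${problem}`);
  return value;
}

const probability = check(0.25, inInterval(0, 1));
const matrix = check([[1, 2], [3, 4]], hasShape(2, 2));
console.log(probability, matrix.length);
```

The static types stay simple (`number`, `number[][]`); the interval and shape constraints live entirely in the runtime checks.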
It is probably helpful in some cases and unhelpful in others. R generics dispatch on argument class (S3 on the first argument, S4 on several), so calling `foo` on different types can produce different output. It isn't clear to me how Vapour handles this. In general though, folks are passing around data.frame or similar objects.
Not really, because honestly a lot of us who came into programming via research never learned typed languages or unit tests or any of those best practices. We were just hacking around in MATLAB, R, or Python from the start. What I really need is a seamless and easy way to run statistical models that can only be fit in R, but from Python or Node. There are several categories of statistical modeling where R completely blows Python out of the water, and it's incredibly wasteful (and error-prone) to try to re-implement these yourself in Python.
rpy2 can be used to call R from Python: https://rviews.rstudio.com/2022/05/25/calling-r-from-python-...
reticulate works for going in the other direction: https://rstudio.github.io/reticulate/
With the good interoperability these days, let's stop rewriting functionality in other languages. If the interoperability is no good, work on fixing that, please.
I would say that the vast majority of type problems in data science/stats workflows come from data tables "trojan-horsing" type or missing data issues, rather than type problems strictly at the code level. Type annotations won't help you when your upstreams decide they want to change the format of their year-quarter strings without telling you.
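As an illustration of the kind of problem described here, a sketch of a load-time scan that reports rows with missing values or an unexpected year-quarter format. This is a runtime check on the data, not something annotations provide; `RawRow`, `scanTable`, and the "YYYY-Qn" format are assumptions made for the example:

```typescript
// Hypothetical row shape; real rows would come from a CSV or database reader.
type RawRow = { period: string | null; value: number | null };

interface TableReport {
  missing: number[];   // row indices where a field is null or NaN
  malformed: number[]; // row indices where period is not "YYYY-Qn"
}

function scanTable(rows: RawRow[]): TableReport {
  const report: TableReport = { missing: [], malformed: [] };
  rows.forEach((row, i) => {
    if (row.period === null || row.value === null || isNaN(row.value)) {
      report.missing.push(i);
    } else if (!/^\d{4}-Q[1-4]$/.test(row.period)) {
      report.malformed.push(i);
    }
  });
  return report;
}

const tableReport = scanTable([
  { period: "2023-Q4", value: 1.2 },
  { period: "2024Q1", value: 1.3 },   // upstream changed the format
  { period: "2024-Q2", value: null }, // missing value hiding in the table
]);
console.log(tableReport);
```

Collecting row indices instead of throwing on the first problem makes it easier to see whether a single record is bad or the whole feed has changed format.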