Comment by simsla a day ago

This relates to one of my biggest pet peeves.

People interpret "statistically significant" to mean "notable"/"meaningful". I detected a difference, and statistics say that it matters. That's the wrong way to think about things.

Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Whether the measured difference is significant in the sense of "meaningful" is a value judgement that we / stakeholders should impose on top of that, usually based on the magnitude of the measured difference, not the statistical significance.

It sounds obvious, but this is one of the most common fallacies I observe in industry and a lot of science.

For example: "This intervention causes an uplift in [metric] with p<0.001. High statistical significance! The uplift: 0.000001%." Meaningful? Probably not.

mustaphah a day ago

You're spot on that significant ≠ meaningful effect. But I'd push back slightly on the example. A very low p-value doesn't always imply a meaningful effect, but it's not independent of effect size either. A p-value comes from a test statistic that's basically:

(effect size) / (noise / sqrt(n))

Note that a bigger test statistic means a smaller p-value.

So very low p-values usually come from bigger effects or from very large sample sizes (n). That's why you can technically get p<0.001 with a microscopic effect, but only if you have astronomical sample sizes. In most empirical studies, though, p<0.001 does suggest the effect is going to be large because there are practical limits on the sample size.
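
A back-of-the-envelope sketch of that arithmetic (illustrative numbers, mine): plugging a microscopic effect into the test statistic above only pushes p below 0.001 once n runs into the billions, while a moderate effect gets there with n = 100.

    import numpy as np
    from scipy import stats

    def two_sided_p(effect, noise, n):
        """Two-sided p-value for a z-test of a mean `effect` vs 0, with sd `noise`."""
        z = effect / (noise / np.sqrt(n))          # the test statistic above
        return 2 * stats.norm.sf(abs(z))

    for n in (10_000, 100_000_000, 10_000_000_000):
        print(f"n = {n:>14,}: p = {two_sided_p(0.0001, 1.0, n):.3g}")
    # microscopic effect (0.0001 sd): p ~ 0.99, 0.32, 1.5e-23 -- significance only in the billions

    print(f"n = 100, 0.5 sd effect: p = {two_sided_p(0.5, 1.0, 100):.3g}")  # ~6e-07 already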

  • specproc a day ago

    The challenge is that datasets are just much bigger now. These tools grew up in a world where n=2000 was considered pretty solid. I do a lot of work with social science types, and that's still a decent sized survey.

    I'm regularly working with datasets in the hundreds of thousands to millions, and that's small fry compared with what's out there.

    For me, at least, the use of regression isn't about getting that p-gotcha for a paper; it's a posh pivot table that accounts for all the variables at once.

    • refactor_master a day ago

      There’s a common misconception that high throughput methods = large n.

      For example, I’ve encountered the belief that merely recording something at ultra-high temporal resolution gives you “millions of datapoints”. This then (seemingly) has all sorts of consequences for the statistics and hypothesis testing.

      In reality, the replicability of the entire setup, the day it was performed, the person doing it, etc. means the n for the day is probably closer to 1. So to ensure replicability you’d have to at least do it on separate days, with separately prepared samples. Otherwise, how can you eliminate the chance that your ultra finicky sample just happened to vibe with that day’s temperature and humidity?

      But statistics courses don’t really teach you what “n” means, probably because a hundred years ago it was much more literal: n = 100 meant you had counted 100 mice, 100 peas, or 100 surveys.
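
      As a toy illustration of that point (my own made-up numbers, not anyone's real setup): when day-to-day variation dominates, thousands of readings from a handful of days behave statistically like n = number of days, and pretending every reading is independent inflates false positives.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        n_sims, n_days, per_day = 2000, 3, 1000
        naive = daymeans = 0

        for _ in range(n_sims):
            # Two conditions with NO true difference; only a random day effect
            # (sd = 1) plus within-day noise (sd = 0.2).
            a = rng.normal(0, 1, (n_days, 1)) + rng.normal(0, 0.2, (n_days, per_day))
            b = rng.normal(0, 1, (n_days, 1)) + rng.normal(0, 0.2, (n_days, per_day))
            naive += stats.ttest_ind(a.ravel(), b.ravel()).pvalue < 0.05
            daymeans += stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue < 0.05

        print("every reading counted as n:", naive / n_sims)     # far above 0.05
        print("n = number of days:        ", daymeans / n_sims)  # close to 0.05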

      • clickety_clack a day ago

        I learned about experiment design in statistics, so I wouldn’t blame statisticians for this.

        There’s a lot of folks out there, though, who learned the mechanics of linear regression in a bootcamp or something without gaining an appreciation for the underlying theories, and those folks are looking for a low p-value; as long as they get it, it’s good enough.

        I saw this link yesterday and could barely believe it, but I guess these folks really live among us.

        https://stats.stackexchange.com/questions/185507/what-happen...

      • ImageXav 20 hours ago

        This is an interesting point. I've been trying to think about something similar recently but don't have much of an idea how to proceed. I'm gathering periodic time series data and am wondering how to factor the frequency of my sampling into the statistical tests. I'm not sure how to assess the difference between 50Hz and 100Hz sampling on the outcome, given that my periods are significantly longer. Would you have an idea of how to proceed? The person I'm working with currently just bins everything into hour-long buckets and uses the mean for comparison between time series, but this seems flawed to me.

  • pebbly_bread a day ago

    Depending on the nature of the study, there are lots of scientific disciplines where it's trivial to get populations in the millions. I got to see a fresh new student's poster where they had a p-value in the range of 10^-146 because every cell in their experiment was counted as its own sample.

amelius a day ago

https://pmc.ncbi.nlm.nih.gov/articles/PMC3444174/

> Using Effect Size—or Why the P Value Is Not Enough

> Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude –not just, does a treatment affect people, but how much does it affect them.

– Gene V. Glass

tryitnow a day ago

Agreed. However, I think you're being overly charitable in calling it a "pet peeve"; it's more like a pathological misunderstanding of stats that leads to a lot of bad outcomes, especially in popular wellness media.

As an example, read just about any health or nutrition research article referenced in popular media and there's very often a pretty weak effect size even though they've achieved "statistical significance." People then end up making big changes to their lifestyles and habits based on research that really does not justify those changes.

jpcompartir a day ago

^

And if we increase N enough we will be able to find these 'good measurements' and 'statistically significant differences' everywhere.

Worse still if we didn't agree in advance on what hypotheses we were testing and instead go looking back through historical data to find 'statistically significant' correlations.
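
A small sketch of that failure mode (hypothetical metrics, my own numbers): test enough unrelated hypotheses against historical data and some will clear p < 0.05 purely by chance.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_metrics, n_users = 200, 500

    # A random "treatment" flag and 200 metrics with no real relationship to it.
    treated = rng.integers(0, 2, size=n_users).astype(bool)
    metrics = rng.normal(size=(n_metrics, n_users))

    pvals = [stats.ttest_ind(m[treated], m[~treated]).pvalue for m in metrics]
    hits = sum(p < 0.05 for p in pvals)
    print(f"{hits} of {n_metrics} metrics look 'significant'")  # expect ~0.05 * 200 = 10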

  • ants_everywhere a day ago

    Which means that statistical significance is really a measure of whether N is big enough.

    • kqr a day ago

      This has been known ever since the beginning of frequentist hypothesis testing. Fisher warned us not to place too much emphasis on the p-value he asked us to calculate, specifically because it is mainly a measure of sample size, not clinical significance.

      • ants_everywhere a day ago

        Yes, the whole thing has been a bit of a tragedy IMO. A minor tragedy all things considered, but a tragedy nonetheless.

        One interesting thing to keep in mind is that Ronald Fisher did most of his work before the publication of Kolmogorov's probability axioms (1933). There's a real sense in which the statistics used in social sciences diverged from mathematics before the rise of modern statistics.

        So there's a lot of tradition going back to the 19th century that's misguided, wrong, or maybe just not best practice.

    • energy123 a day ago

      It's not; that would be quite the misunderstanding of statistical power.

      N being big means that small real effects can plausibly be detected as being statistically significant.

      It doesn't mean that a larger proportion of measurements are falsely identified as being statistically significant. That will still occur at a 5% frequency or whatever your alpha value is, unless your null is misspecified.
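
      A quick sanity check of this (toy simulation, my own numbers): with the null actually true, the share of tests coming out "significant" hovers near alpha no matter how big n gets.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        alpha, reps = 0.05, 2000

        for n in (20, 200, 2000, 20000):
            # Both groups drawn from the SAME distribution: every rejection is a false positive.
            hits = sum(
                stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue < alpha
                for _ in range(reps)
            )
            print(f"n = {n:>6}: false-positive rate ~ {hits / reps:.3f}")  # all near 0.05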

      • ants_everywhere a day ago

        It's standard to set the null hypothesis to be a measure zero set (e.g. mu = 0 or mu1 = mu2). So the probability of the null hypothesis is 0 and the only question remaining is whether your measurement is good enough to detect that.

        But even though you know the measurement can't be exactly 0.000 (with infinitely many decimal places) a priori, you don't know if your measurement is any good a priori or whether you're measuring the right thing.

V__ a day ago

I really like this video [1] from 3blue1brown, where he proposes thinking of significance as a way to update a probability. One positive test (or, in this analogy, one study) updates the probability by X%, and thus you nearly always need more tests (or studies) for a 'meaningful' judgment.

[1] https://www.youtube.com/watch?v=lG4VkPoG3ko
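
A back-of-the-envelope version of that framing (the numbers below are mine, not the video's): a single positive result only shifts the probability; it rarely takes you from "unlikely" to "near-certain" on its own.

    def update(prior, sensitivity, false_positive_rate):
        """Posterior P(effect is real | positive result), by Bayes' rule."""
        true_pos = prior * sensitivity
        false_pos = (1 - prior) * false_positive_rate
        return true_pos / (true_pos + false_pos)

    p = 0.10                   # prior: 10% of hypotheses like this are real effects
    p = update(p, 0.80, 0.05)  # one positive study (power 0.8, alpha 0.05): p ~ 0.64
    p = update(p, 0.80, 0.05)  # an independent replication:                 p ~ 0.97
    print(p)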

kqr a day ago

To add nuance, it is not that bad. Given reasonable levels of statistical power, experiments cannot show meaningless effect sizes with statistical significance. Of course, some people design experiments at power levels way beyond what's useful, and this is perhaps even more true when it comes to things where big data is available (like website analytics), but I would argue the problem is the unreasonable power level, rather than a problem with statistical significance itself.
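
For instance (a standard normal-approximation sizing sketch; the effect sizes below are placeholders): if you power the experiment for the smallest effect you would actually act on, the required n stays modest, and effects far below that threshold will rarely clear significance at that n.

    from scipy import stats

    def n_per_group(min_meaningful_d, alpha=0.05, power=0.80):
        """Two-sample t-test sample size per group, normal approximation."""
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_power = stats.norm.ppf(power)
        return 2 * ((z_alpha + z_power) / min_meaningful_d) ** 2

    print(round(n_per_group(0.2)))   # smallest effect worth acting on: d = 0.2 -> ~393 per group
    print(round(n_per_group(0.01)))  # chasing d = 0.01 instead         -> ~157,000 per group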

When wielded correctly, statistical significance is a useful guide both to what's a real signal worth further investigation, and it filters out meaningless effect sizes.

A bigger problem, even when statistical significance is used right, is publication bias. If, out of 100 experiments on mostly null effects, we only get to see the 7 that were significant, then with roughly 5 false positives expected at alpha = 0.05 we already have a false:true ratio of about 5:2 in the results we see – even though all are presented as true.

ants_everywhere a day ago

> Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Significance does not tell you this. The p-value can be arbitrarily close to 0 while the probability of the null hypothesis being true is simultaneously arbitrarily close to one.
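
A concrete sketch of how that can happen (Lindley's paradox; the prior below is my choice, purely illustrative): with a huge n, the same data can be "significant" at p < 0.05 while a Bayesian comparison still puts almost all the posterior weight on the null.

    import numpy as np
    from scipy import stats

    n, sigma = 1_000_000, 1.0
    xbar = 2.0 * sigma / np.sqrt(n)     # a sample mean sitting exactly at z = 2

    # Frequentist: two-sided p-value for H0: mu = 0
    p = 2 * stats.norm.sf(abs(xbar) / (sigma / np.sqrt(n)))

    # Bayesian: P(H0) = 0.5; under H1, mu ~ N(0, 1)
    m0 = stats.norm.pdf(xbar, loc=0, scale=sigma / np.sqrt(n))         # marginal likelihood under H0
    m1 = stats.norm.pdf(xbar, loc=0, scale=np.sqrt(1 + sigma**2 / n))  # marginal likelihood under H1
    posterior_h0 = m0 / (m0 + m1)

    print(f"p = {p:.3f}, P(H0 | data) = {posterior_h0:.3f}")   # p ~ 0.046, P(H0 | data) ~ 0.99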

  • wat10000 a day ago

    Right. The meaning of the p-value is: in a world where there is no effect, what is the probability of getting a result at least as extreme as the one you got purely by random chance? It doesn’t directly tell you anything about whether this is such a world or not.

tomrod a day ago

This is sort of the basis of econometrics, as well as a driving thought behind causal inference.

Econometrics cares not only about statistical significance but also about economic significance/usefulness.

Causal inference builds on base statistics and ML, but its strength lies in how it uses design and assumptions to isolate causality. Tools like sensitivity analysis, robustness checks, and falsification tests help assess whether the causal story holds up. My one beef is that these tools still lean heavily on the assumption that the underlying theoretical model is correctly specified. In other words, causal inference helps stress-test assumptions, but it doesn’t always provide a clear way to judge whether one theoretical framework is more valid than another!

taneq a day ago

I’d say rather that “statistical significance” is a measure of surprise. It’s saying “If this default (the null hypothesis) is true, how surprised would I be to make these observations?”

  • kqr a day ago

    Maybe you can think of it as saying "should I be surprised" but certainly not "how surprised should I be". The magnitude of the p-value is a function of sample size. It is not an odds ratio for updating your beliefs.

prasadjoglekar a day ago

For all the shit that HN gives to MBAs, one thing they instill in you during the Managerial Stats class is that Stat Sig is not the same as Managerial Sig.