Comment by awkward 6 days ago
Pure, disinterested A/B testing, where the goal is simply to find the best way to do something and there is enough leverage and traffic to make funding the test worthwhile, is rare.
More frequently, A/B testing is a political technology that allows teams to move forward with changes to core, vital services of a site or app. By putting a new change behind an A/B test, the team technically derisks the change by allowing it to be undone rapidly, and politically derisks the change by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users. The change was already judged to be valuable when development effort went into it, whether for technical, branding, or other reasons.
In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements. Two-path tests solve the more common problem of wanting to make major changes to critical workflows without killing the platform.
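To make the mechanism concrete, here's a minimal sketch of that kind of gate (my own illustration, with made-up flow names, not anyone's production code): the risky new path sits behind a deterministic flag, so "undoing" the change is a config flip rather than an emergency redeploy.

```python
# Minimal feature-flag style gate: illustrative only, names are hypothetical.
import hashlib

ROLLOUT_PERCENT = 50  # set to 0 to pull everyone back to the old path instantly

def in_treatment(user_id: str, rollout_percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically bucket a user so they always see the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def checkout(user_id: str) -> str:
    if in_treatment(user_id):
        return new_checkout_flow(user_id)  # the change the team wants to ship
    return old_checkout_flow(user_id)      # the battle-tested existing path

# Placeholder implementations so the sketch runs on its own:
def new_checkout_flow(user_id: str) -> str: return f"new flow for {user_id}"
def old_checkout_flow(user_id: str) -> str: return f"old flow for {user_id}"

if __name__ == "__main__":
    print(checkout("user-123"))
```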
> politically derisks the change by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users.
I just want to drop here the anecdata that I've worked for a total of about 10 years in startups that proudly called themselves "data-driven" and worshipped "A/B testing." One of them hired a data science team that actually did some decently rigorous analysis on our tests and advised on things like when we had reached statistical significance, how many impressions we needed, and so on. The other did not, and just had someone looking at very simple comparisons in Optimizely.
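For reference, the "how many impressions do we need" arithmetic that team ran looks roughly like this; this is just a sketch of the standard two-proportion sample-size approximation with made-up numbers, not their actual tooling.

```python
# Rough sketch of the standard two-proportion sample-size approximation.
# Rates and lifts here are illustrative, not from any real test.
import math
from statistics import NormalDist

def required_n_per_arm(baseline_rate: float, min_detectable_lift: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect an absolute lift of
    `min_detectable_lift` over `baseline_rate` at the given alpha and power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # desired power
    p_avg = baseline_rate + min_detectable_lift / 2
    variance = 2 * p_avg * (1 - p_avg)
    return math.ceil(variance * (z_alpha + z_beta) ** 2
                     / min_detectable_lift ** 2)

# Detecting a 4% -> 5% conversion lift needs roughly 6,700 users per arm:
print(required_n_per_arm(0.04, 0.01))
```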
In both cases, the influential management people who ultimately owned the decisions would simply rig every "test" to fit the story they already believed, by doing things like running the test just until the results looked "positive," rather than until they were statistically significant. Or by measuring several metrics and deciding after the fact to base the call on whichever one happened to be positive at the time. Or by skipping testing entirely and saying we'd just "used a pre/post comparison" to prove it out. Or even by just dismissing a 'failure,' saying we would do it anyway because it's foundational to X, Y, and Z, which really will improve (insert metric). The funny part is that none of these people thought they were playing dirty; they believed they were making their decisions scientifically!
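To show concretely why "run it until it looks positive" is rigging, here's a rough simulation (mine, with arbitrary numbers): both variants have identical conversion rates, so every declared winner is a false positive, yet checking a z-test after every batch and stopping at the first p < 0.05 "finds" a winner far more often than the nominal 5%.

```python
# Simulation of "peeking": both arms have the SAME conversion rate, so any
# declared winner is a false positive. All numbers here are arbitrary.
import math
import random

def p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided p-value of a two-proportion z-test with n users per arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return 1.0
    z = abs(conv_a - conv_b) / n / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def experiment(rate: float = 0.05, batch: int = 200, batches: int = 20,
               peek: bool = True) -> bool:
    """Return True if the test is ever (peeking) or finally (fixed N) 'significant'."""
    conv_a = conv_b = n = 0
    for _ in range(batches):
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if peek and p_value(conv_a, conv_b, n) < 0.05:
            return True  # stop early and ship the "winner"
    return p_value(conv_a, conv_b, n) < 0.05

if __name__ == "__main__":
    random.seed(1)
    trials = 500
    peeked = sum(experiment(peek=True) for _ in range(trials)) / trials
    fixed = sum(experiment(peek=False) for _ in range(trials)) / trials
    print(f"false-positive rate, stopping when it 'looks good': {peeked:.0%}")
    print(f"false-positive rate, fixed sample size: {fixed:.0%}")
```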
Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.