Experiment Driven Development

TL;DR

I tried to answer the following questions that most companies with a mature product will ask:

  • how do we prioritise a big bag of ideas?
  • how do we avoid opinion-based decisions (also known as the HiPPO effect) and focus on data?
  • how do we bring innovation back from the sidelines to being the thing your company does every day?

Just download the canvas and start experimenting, or keep on reading if you want a more in-depth explanation.

Why?

A few years back I was introduced to the Lean Canvas and I instantly fell in love with the idea. Not with the canvas per se (although it’s really good), but with yet another example of the power of visualisation. When we think about complex ideas, we have to deal with multiple dimensions and variables, and our brain has to switch between concepts over and over, losing focus and falling back on categorisation with all its biases. The end result relies chiefly on the mental discipline of the people involved in the analysis and brainstorming. In my experience, the moment you leave the idea session everyone has a different picture in their head, and the race towards the ‘oops’ moment has begun – alignment is finally achieved, but only after all the blood, sweat and tears that could have been prevented.

I have recently worked a lot with A/B testing on different platforms, and it always lacked a systematic, scientific approach. Most of the theory behind it (at least to me) feels very shallow and creates the feeling that A/B testing is no more than shooting blindly until you hit something. In the end, you don’t know why you’re shooting in the first place, why in that direction, what you hit, why it was there and whether it will be there again. The magic seems to be in the numbers – the more you shoot, the more you hit, they say – but in terms of product development it means you’re focusing on features and more features. The holistic product view is gone, what you’re building up is a pile of everything and nothing, and you have no clue why it works or when it will stop working. Here’s my beef with A/B testing, in bullet points (so it must be true):

  • people use it as an excuse to stop doing research and/or talking to users
  • it doesn’t answer the question ‘why’, so your solution is not proven to be sustainable
  • it allows you to test without thinking, creating a backlog of ‘tests’ that leads to serial execution without validated learning
  • it distorts the big picture by chasing small, short-term numbers behind features
  • it focuses only on the metric behind the feature being tested, ignoring its surroundings

Enter Scientific Research Methods

It’s not a new concept to test your ideas with split groups. A/B testing is a (really) dumbed-down version of the research methods used in science to prove or disprove a given hypothesis. Just like any new ‘cool’ process or tool in software development, it has been misused and abused, following Larman’s Laws of Organizational Behavior. In my quest to improve the way we work at Marktplaats.nl, I tried to bring the value of scientific research methodologies to product development. Does it work? Yes, it does.

Experiment Design Canvas

The canvas itself is a set of questions based on scientific research methodologies, namely two aspects of it – the True Experiment and the Empirical Cycle. The purpose of the canvas is to ask the same questions for every idea, so that decisions stay grounded in data rather than opinion and everyone leaves the session with the same picture in their head.

What the canvas is not:

  • a process of any sort
  • an experiment registration form to be debated by a committee
  • an all-fields-required form – fill in what you need/know
  • an excuse to do something others don’t agree with, just because the canvas is filled in.

If you need to change the canvas, it’s a good idea to run an evaluation phase within your company to see if it still serves the needs mentioned above. If not, you’re dealing with a different tool. I have run such evaluations myself, and indeed 18 versions have been produced to date based on feedback from my colleagues.

And here it is:

What’s what?

  1. Motivation
    1. Theory – ask yourself: why am I doing this? No, not in the sense of ‘because I want to make money’, but rather: what tells me that this is a good idea? User research? Hard data from the current implementation? Conversations with users? Gut feeling? The bottom line is: what proof do you have beforehand that this may work, and how strong is it?
    2. Experience – at this stage many people will choke and go back to their GA console to find that data. Hopefully you can see that this is the first line of defence against opinion-based discussions; I have seen people come back saying, ‘You know what? It’s not a good idea.’
  2. Hypothesis
    1. Theory – simply state your hypothesis. The biggest problem with A/B testing is that people never set a hypothesis; they merely expect a better (unquantified – see Minimum success in point 6) result. In science, changing the hypothesis during or after the experiment nullifies all the findings (this is known as HARKing). You experiment to (dis)prove something, not to find data that fits your thinking.
    2. Experience – while you explore the canvas, your hypothesis will keep changing, to the point where it can come out as something completely different. That’s only natural, as you keep uncovering assumptions and complexities of the experiment design. But once you start experimenting, refrain from bending the hypothesis to your findings. It’s better to restart.
  3. Cause-effect
    1. Theory – can you separate cause and effect? Can you define the cause in such a way that it’s manipulable? If there’s no cause-effect, there’s no experiment. Period.
    2. Experience – at this stage people find out that the way they stated the hypothesis does not fit a cause-effect statement, or that the cause is something like ‘having a blue button’. That’s wrong – dig deeper: what’s the cause in relation to user behaviour? This opens up discussions like ‘why blue?’, which may push you back to the motivation phase. Another issue here is causation vs. correlation. If you say (and people did) that the more users use feature X, the more active they are, you can also say that the more active they are, the more they use feature X. If you can reverse cause and effect and it still makes sense, you are dealing with correlation, not causation, and you should not attempt the experiment until you solve this issue (run two experiments to find causation, or refine the hypothesis).
  4. Variables
    1. Theory – can you quantify the cause and the effect? Look around the environment in which the experiment will run – will you influence other variables, or vice versa?
    2. Experience – the effect is usually the easiest to measure, and then everyone asks how to measure the cause. I tend to bring it down to the fact of exposure. If you run your A/B test on 100 users with the feature below the fold or hidden in a menu, you cannot claim that 100 users are in the experiment. You should measure exposure to the cause (see the first sketch after this list). This is yet another pitfall of the happy-trigger approach to A/B testing. Lastly, if users are supposed to engage with the new feature, they may stop engaging with other features. Identify those features and measure their engagement to see whether you created new engagement or just shifted it.
  5. Design Decisions (this part is under revision)
    1. Theory – this is the part of the canvas that can differ based on your experimental setup. But first, a word on the randomisation process. In science, random selection means that every member of the population has the same chance of being selected. In practice that means you are able to randomly select from the whole population – all of statistical theory is based on this assumption. Here comes another flaw of A/B testing: if you run your experiment at 12 o’clock on a Monday, you will get a specific set of users coming to your website – office workers with medium to high income. They do not represent your whole population, yet all your samples will be skewed towards their behaviour. That means you cannot extend your findings to the whole population (see External Validity).
    2. Experience – one way to gain confidence in your selection process is to run a randomisation check. Before you run A/B, you run A/A, measure and compare the groups’ behaviour (hint: it should not differ much), then switch one of the A’s to B and see if applying the cause created the effect you expected, while observing A for any similar changes (see the second sketch after this list). This also counters issues such as seasonal trends, where the whole population changes; however, it does not solve the random-selection issue. Another part is the manipulation check, which basically means measuring exposure (see point 4). Lastly, you will be asked to describe your population (e.g. sellers only) and the expected matching sample and its size. Platforms are simply there to note where you will run the experiment.
  6. When to stop?
    1. Theory – experimenting is an act of induction: we continuously search for data to (dis)prove our hypothesis. But we need to stop at some point, whether we like the data or not, and we should also not stop before the agreed time just because we like what we see – that is cherry-picking, a behaviour condemned in science. Negative effect is a circuit breaker: think of any scenario that would completely break your experience or put your product in danger – that would justify ending the experiment before the agreed time has passed. Finally, Minimum success is the minimum value of the effect variable that would justify full implementation of the experiment.
    2. Experience – choose a duration that takes seasonal trends into account. For us, for example, it’s at least a week, as Mondays and weekends tend to be different from other days. We also see a strong weather influence on user engagement (bless rainy Netherlands!). In practice, when asked ‘What is your minimum success?’, people will answer e.g. ‘I want 10%’. But it’s not about what you hope to achieve, it’s about what you can live with. If you get 5%, is that enough to complicate your product, keep it maintained, engage development, etc.? When do you start to get ROI from this experiment? It is very important to agree up front that if the result is below that threshold, you get rid of the feature (see Traynor’s reasons why we build things, again!). A small stopping-rule sketch follows this list as well.
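To make the exposure point from Variables (point 4) concrete, here is a minimal Python sketch. It is not part of the canvas, and the field names (assigned, saw_feature, converted) are made up for illustration; the idea is simply that the effect is reported only over users who were actually exposed to the cause, not over everyone who was bucketed.

```python
# Minimal sketch: measure exposure to the cause, not mere assignment to a bucket.
# Field names are hypothetical and stand in for whatever your tracking provides.

users = [
    {"assigned": "B", "saw_feature": True,  "converted": True},
    {"assigned": "B", "saw_feature": False, "converted": False},  # never scrolled to the feature
    {"assigned": "B", "saw_feature": True,  "converted": False},
    {"assigned": "A", "saw_feature": False, "converted": True},
]

assigned_b = [u for u in users if u["assigned"] == "B"]
exposed_b = [u for u in assigned_b if u["saw_feature"]]  # exposure to the cause

print(f"assigned to B: {len(assigned_b)}, actually exposed: {len(exposed_b)}")

# Report the effect only over users who were exposed to the cause.
conversion = sum(u["converted"] for u in exposed_b) / len(exposed_b)
print(f"conversion among exposed users: {conversion:.0%}")
```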
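And a hypothetical sketch of the A/A randomisation check from Design Decisions (point 5), assuming a simple conversion metric: two buckets that received the same experience are compared with a two-proportion z-test, and a large p-value suggests the buckets behave alike, so one of the A’s can then be switched to B.

```python
# Hypothetical A/A randomisation check: compare two buckets that got the SAME experience.
# If their behaviour differs significantly, the selection/bucketing process is suspect.
from math import sqrt, erf

def two_proportion_z(conv_a1: int, n_a1: int, conv_a2: int, n_a2: int) -> float:
    """Two-sided two-proportion z-test; returns the p-value."""
    p1, p2 = conv_a1 / n_a1, conv_a2 / n_a2
    pooled = (conv_a1 + conv_a2) / (n_a1 + n_a2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a1 + 1 / n_a2))
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Made-up numbers: 1000 users per A bucket, 120 vs. 131 conversions.
p = two_proportion_z(120, 1000, 131, 1000)
print(f"A/A p-value: {p:.3f}")  # a large p-value -> buckets behave alike, proceed to A/B
```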
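Finally, a sketch of the stopping rules from point 6, with made-up thresholds: the experiment only ends early if the negative-effect circuit breaker fires, it is never stopped early just because we like what we see, and the feature only ships if the lift reaches the agreed minimum success.

```python
# Minimal stopping-rule sketch; all thresholds below are illustrative assumptions.
MIN_SUCCESS_LIFT = 0.05       # smallest lift that still justifies building and maintaining the feature
AGREED_DAYS = 14              # covers two full weekly cycles to absorb seasonality
CIRCUIT_BREAKER_DROP = -0.20  # an effect this bad ends the experiment early

def decide(days_running: int, observed_lift: float) -> str:
    if observed_lift <= CIRCUIT_BREAKER_DROP:
        return "stop early: negative-effect circuit breaker"
    if days_running < AGREED_DAYS:
        return "keep running: agreed time has not passed (no cherry-picking)"
    return "ship it" if observed_lift >= MIN_SUCCESS_LIFT else "get rid of it"

print(decide(days_running=6, observed_lift=0.12))   # keep running
print(decide(days_running=14, observed_lift=0.03))  # below minimum success -> remove
```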

One thing I’m always asked is: ‘But it takes time!’ Well, on average it takes around 30 minutes to fill in such a canvas with a group of 6-10 people, given that people are aware of the problem beforehand. How much time does it take before you realise you can’t measure something, can’t build something, or are just moving engagement around? Hint – waaaaaay more. But don’t take my word for it – you can read this article multiple times, but nothing beats trying the canvas out. And remember – it’s not about the canvas, it’s about asking the questions out loud and getting that uncomfortable feeling that you don’t have the answer – that is what will push you to take action.

Our analyst team has created a matching canvas for the statistical analysis and evaluation of an experiment, which is a brilliant way to compare apples to apples and streamline the execution of experiments. But that’s a story for another time.

On a final note – the canvas has been changed 18 times and I plan to simplify it even more, but bear in mind that the people who have already used it keep the ‘why’ behind it in their heads, so always introduce new people to its full version, with the explanation and examples.
