Measuring Search: A Human Approach
In my first post on search I discussed two strategies for improving search: a human-centered approach and a metrics-centered one. Now I’m going to introduce you to a concept that is a bit of both: how to measure search.
There are two main ways to measure search relevance: online (log-based) and offline (human-rated). Each has distinct advantages and disadvantages, but it’s important to know that these approaches complement each other. It’s not one or the other; to really measure search (and improve it) you need both approaches!
When it comes to improving our products in a data-driven way, A/B testing is considered the gold standard. After all, the alternative might be just updating our product on a hunch [shudder]. A/B testing, for the uninitiated, is the process of dividing your user traffic into equal parts, randomly assigning them to a group, giving each group a different “treatment” or version of the product, and measuring the difference in some metric of interest. Basically, we are running an experiment. That enables us to draw causal inference from our changes. Aww yeah.
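If you’ve never set one up, here’s a minimal sketch of the mechanics in Python. Everything in it is invented for illustration (the `bucket_for_user` helper, the toy `logs` list, the click-through metric); real experiment platforms are more sophisticated, but the core idea is the same: split users into stable random buckets, then compare a metric between buckets.

```python
import hashlib

def bucket_for_user(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id + experiment name gives a stable, roughly 50/50
    split without storing any assignment state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

def click_through_rate(rows: list[dict]) -> float:
    """Fraction of search impressions that received at least one click."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["clicked"]) / len(rows)

# Toy log data: each row is one search impression.
logs = [
    {"user_id": "u1", "clicked": True},
    {"user_id": "u2", "clicked": False},
    {"user_id": "u3", "clicked": True},
]

groups = {"control": [], "treatment": []}
for row in logs:
    groups[bucket_for_user(row["user_id"], "new-ranker")].append(row)

for name, rows in groups.items():
    print(name, click_through_rate(rows))
```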
Here’s the thing about A/B testing in search: while online measurement and A/B testing will tell you what your users are up to, they won’t tell you why. For that you need some human interpretation. It’s that old qual+quant magic again!
We need both online and offline measures for search. Why? Because they tell you different things.
Online search measurement (A/B testing and log-based metrics) measures how people are interacting with your service. Great! But there’s a fly in the ointment: those measures are only as good as the metrics you choose. If you choose click-through rate, for example, you might not get the most relevant results.
Users are fickle. They have jobs to do, a task to accomplish on your site. Once they find something that satisfices, they may leave. They may get distracted by pictures of cute dogs.
People click on results for all sorts of reasons. It’s easy to get fooled into thinking your results are great, when in reality you are just promoting your most clickable results (like that pug, he’s just sooooo cuuuuute!).
Human judgment to the rescue
Human judgment, human relevance testing, human rating, relevance judgment, or whatever you call it, is a method where you get actual humans to rate content. The methodology may differ from company to company, but basically it works like this: you give raters a query and a document, and ask how relevant that document is to the query. Then all you have to do is take those ratings and turn them into scores. Easy!

The great thing about raters is that they are a captive audience. They have to look at the query-document pairs that you show them! They don’t look at the cute dog picture and move on; they stick around and rate all the results! This is a huge boon for getting useful metrics. The other advantage is that raters are focused on relevance: they aren’t swayed by the soulful gaze of the pug, and they know the dude in the sweater is much more likely to be relevant.

Another bonus is that raters can tell you why they consider something relevant. They don’t just bounce from the site; they are there to hang out and discuss with you if you want (and you do). Getting the why as well as the what is critical to merging your human-centered and metrics-driven approaches to search.
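As a rough illustration of the “turn ratings into scores” step (the rating labels, gain values, and data below are all made up; your scale and methodology will differ), a simple approach is to average raters per query-document pair, then average pairs per query:

```python
from collections import defaultdict
from statistics import mean

# Graded relevance scale; labels and gains are illustrative only.
GAIN = {"bad": 0, "fair": 1, "good": 2, "excellent": 3}

# Each row is one rater's judgment of one (query, document) pair.
ratings = [
    {"query": "dog bite", "doc": "chew-v-gates", "label": "excellent"},
    {"query": "dog bite", "doc": "chew-v-gates", "label": "good"},
    {"query": "dog bite", "doc": "cute-pug-photo", "label": "bad"},
]

# Average the raters for each pair...
pair_gains: dict[tuple[str, str], list[int]] = defaultdict(list)
for r in ratings:
    pair_gains[(r["query"], r["doc"])].append(GAIN[r["label"]])

# ...then average the pairs for each query.
query_scores: dict[str, list[float]] = defaultdict(list)
for (query, _doc), gains in pair_gains.items():
    query_scores[query].append(mean(gains))

for query, scores in query_scores.items():
    print(query, round(mean(scores), 2))
```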
It’s not all sunshine and roses though. Human raters are humans, so … they need guidance, they can make mistakes, they take time, and they cost money. Most of all, though, they are only a proxy for the user. They can’t be in the head of the actual user who issued the query in the first place. Remember how I said that users have a task to accomplish? Relevance evaluators don’t know the tasks or the motivations of the user. The tradeoff is that you can understand whether Chew v. Gates or Priebe v. Nelson is the better result for “Dog Bite” on your case law search engine, because the raters can give you that feedback.
Many people will dismiss human rating as an unnecessary step in developing or tuning a search algorithm. I assure you, it is not. You need to understand what relevance truly looks like within your search application, and that means having people look at some search results (I’ll discuss how to do that in a later post). Without human judgment, you might be tuning based on the wrong metric, or tuning to the lowest common denominator. Human ratings are an invaluable check against going down the wrong path with your relevance improvements.
Now that you are persuaded that human relevance ratings are important, how do you combine them with online relevance scoring?
Well, every search algorithm change needs a launch review.
A launch review is a meeting where your team discusses the pros and cons of a particular algorithm change. The discussion might go like this: we’ve seen human-rated DCG (discounted cumulative gain) go up but engagement-based DCG go down, should we ship this? Or maybe: we’re seeing across-the-board increases in engagement, but raters are telling us that we aren’t doing as well on ambiguous queries. Sometimes it’s hard to find the corner cases in log metrics, but our judges can catch them.
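If you haven’t met DCG before, here’s a toy sketch of how the same ranked list can produce two different DCG numbers, one fed by human rating gains and one fed by a click-based engagement signal. The gain values and positions are invented for illustration, not taken from any real experiment.

```python
import math

def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: the gain at each rank,
    discounted by log2 of the position (rank 1 -> log2(2), etc.)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

# One ranked result list for a single query, scored two ways.
human_gains = [3, 2, 0, 1]       # raters liked positions 1-2, not 3
engagement_gains = [1, 0, 1, 0]  # but position 3 (the pug) got the clicks

print("human DCG:     ", round(dcg(human_gains), 3))
print("engagement DCG:", round(dcg(engagement_gains), 3))
```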
The purpose of the launch review is to look at the overall metrics, the A/B test results, and the human ratings, and to understand the impact of the algorithm change. Is it solving the user need we thought it was? Are we seeing the metrics increase we expected? If not, it’s back to the drawing board, but we need to know why.
Human evaluation is critical in getting that understanding.