Measuring Search: Metrics Matter

James Rubinstein
14 min read · Jul 6, 2020


What gets measured gets managed - Peter Drucker

We’ve all heard the aphorism “What gets measured gets managed.” It’s a truism in today’s “data driven” culture. But what exactly are you measuring, and what are you managing when you measure it? With search, it’s not as straightforward as you might think. Without a clear conception of what goes into a search metric, you may find yourself optimizing for the wrong user experience. Read on to see how different metrics can optimize for different things in your search experience.

First, a (very) brief primer on how search engines work. Search engines start with a corpus, a repository of documents. These can be text, websites, products, or even images. The corpus is a collection of these documents that you want to search through. The next piece is an index, which is a list of words in each document of the corpus. This index tells the search engine which words are in which document. There’s a lot more to it than that, but for now, that’s what we need to know. The user interface for most search engines is a search box where the user puts in a query, the terms that they are looking for. The search engine then goes through the index, finds the documents that best match those query terms, and returns those documents from the corpus in a results set. That results set is a rank-ordered list of the documents the search engine thinks match the user’s query. The user can then engage with the documents they think are relevant.
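If you like seeing things in code, here’s a toy sketch of that corpus-index-results flow. The three “documents” and the crude count-the-matching-terms scoring below are made up purely for illustration; a real engine does far more work at both indexing and ranking time.

```python
# Toy corpus: doc IDs mapped to their text.
from collections import defaultdict

corpus = {
    "doc1": "cnn breaking news and top stories",
    "doc2": "chinese restaurants near me with good reviews",
    "doc3": "cnn wikipedia article about the cable news network",
}

# Build the index: for each word, remember which documents contain it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return a rank-ordered results set, scored by how many query terms each doc matches."""
    terms = query.lower().split()
    scores = {doc_id: sum(doc_id in index[term] for term in terms) for doc_id in corpus}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if scores[doc_id] > 0]

print(search("cnn news"))  # ['doc1', 'doc3']: the results set for this query
```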

Philosoraptor wants to know about relevance

Another key concept here is relevance. Relevance is not a metric. Relevance is a concept. It’s how well a document answers the information need of the searcher. Relevance is not how many terms match, how many users engage, PageRank, or popularity. We can only use those metrics to try to measure this nebulous concept.

If there is one basic set of metrics that all search scientists need to know, it’s Recall and Precision. Recall and Precision are the bedrock metrics of information retrieval as a science. So, what are they? Recall is the proportion of relevant documents in the corpus that are in the results set.

Recall = relevant documents retrieved / all relevant documents in the corpus (the recall formula I swiped from Wikipedia)

Precision is the peanut butter to recall’s jelly. Precision is the proportion of relevant documents in the results set.

Precision = relevant documents retrieved / all documents retrieved (the precision formula)

Here’s the thing: recall and precision are at odds with one another. If you chase perfect recall (return everything that might be relevant), your precision will be poor, and chasing perfect precision means missing a lot of potentially relevant documents. Usually, that tradeoff will take the form of a curve like this:

Typical precision-recall tradeoff curve

So, already, we have to make a choice. What do we care more about, completeness or effort? Completeness means we are giving the user everything that might satisfy their information need, effort means the user may have to weed through more potentially irrelevant documents to find the relevant ones. This choice has major implications for how we think about our searcher’s task and how we choose to address it, and therefore what metrics we choose.

We could split the difference, and use something like an F-measure, which is the harmonic mean of precision and recall.

F1 = 2 × (precision × recall) / (precision + recall) (guess where I stole this from … Wikipedia!)
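As a quick sanity check, here’s a minimal Python sketch of all three, assuming we magically know the full set of relevant documents (which, in practice, is the hard part). The document IDs are made up.

```python
# Recall, precision, and F1 for a single query, given binary relevance.
def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def f1(retrieved, relevant):
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if (p + r) else 0.0

relevant = {"d1", "d2", "d3", "d4"}           # every relevant doc in the corpus
retrieved = ["d1", "d2", "d9", "d10", "d11"]  # what the engine returned

print(recall(retrieved, relevant))     # 2/4 = 0.5
print(precision(retrieved, relevant))  # 2/5 = 0.4
print(f1(retrieved, relevant))         # harmonic mean, about 0.44
```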

Here’s why none of what I just told you about precision and recall actually matters, though…

No one looks at the whole results set

That’s right. No one. In fact, most users don’t view anything past the top 5 results. Here’s a click-through rate by position chart for Google.

Via Backlinko.com. Notice that a very, very small proportion of clicks comes from the second page or beyond.

“That’s Google,” you say, “they have the best search results in the biz. Of course no one ever has to go to the second page.” Well, I could show you pretty similar trends for Pinterest, LexisNexis, or eBay, but some infosec ninja would probably kill me in my sleep for it, so just trust me. The first page of search results is all that matters.

In a way, this is quite liberating: it means we don’t have to get every query/document pair in the corpus rated, like some giant TREC project. However, it does mean that we have to make more choices about our metrics and how we use them.

Metrics that matter

The first metric to discuss is P@K. This means Precision at Rank K. Rank K is just a way of saying “some arbitrary rank.” Usually, you choose a rank K that is in line with the user experience, like P@10, so we are evaluating the precision of documents over the first 10 results. So, instead of precision meaning “number of relevant results / number of returned documents,” we fix the denominator to 10. The formula is now “number of relevant documents / 10.” Easy-peasy lemon squeezy.
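In code it’s exactly as easy as it sounds. A minimal sketch, assuming a list of binary relevance labels (1 = relevant, 0 = not) in result order; the labels are just an example:

```python
# P@K: precision over the top k results only.
def precision_at_k(labels, k):
    return sum(labels[:k]) / k

labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # one relevant doc, at position 4
print(precision_at_k(labels, 10))  # 1/10 = 0.1
print(precision_at_k(labels, 5))   # 1/5 = 0.2
```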

Here’s the thing, though. What if we don’t have 10 relevant documents? If the user’s task is to find the one perfect document or navigational result, there’s no need to have 9 other results. Here’s an example. Type “cnn” into Google. The first result is cnn.com, because that’s where I want to go. The other 372,999,999 really don’t matter. The query “cnn” is navigational, an attempt to get a singular document, website, product, etc.

Google tryna impress us with 372,999,999 irrelevant results 🙄

Perhaps the thing we want to measure, then, is how close to the top of the search results our best result is? Well, for that we’d want a metric called MRR. MRR stands for Mean Reciprocal Rank. Basically, we take the rank of the first engaged (or most relevant) search result and use its reciprocal: 1 divided by that rank. So, for our CNN example, the rank is 1, so the reciprocal rank is 1/1 = 1, a perfect score! What if we were doing a research project about CNN, so when we typed “cnn” into Google we were looking for Wikipedia, which is in position 4? Our reciprocal rank would then be 1/4 = 0.25. If we average those two outcomes, we get (1/1 + 1/4)/2 = 0.625, our Mean Reciprocal Rank.
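Here’s that arithmetic as a tiny Python sketch, using the two CNN outcomes above:

```python
# Reciprocal rank for each query, then the mean across queries (MRR).
def reciprocal_rank(rank_of_first_relevant):
    return 1.0 / rank_of_first_relevant

ranks = [1, 4]  # cnn.com at rank 1 for one searcher, Wikipedia at rank 4 for the other
reciprocal_ranks = [reciprocal_rank(r) for r in ranks]  # [1.0, 0.25]
print(sum(reciprocal_ranks) / len(reciprocal_ranks))    # (1 + 0.25) / 2 = 0.625
```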

The challenge with MRR is that it’s very biased towards a use case of getting that one right document as close to the top of our results list as possible, and that might not be what our user is doing. For deeper research, the user might want to evaluate many documents, so P@K might be a better choice. For example, when I go on Yelp and type “Chinese restaurants,” I want options. It’s all about that user’s task. The only problem is that we often don’t know what a user’s task is …

If we don’t know how deep to set our K for P@K, and we aren’t sure we’re in the business of delivering one right answer, what are we to do?

Well, we could let the users tell us!

Mean Average Precision is a metric that lets the user set the K: we take the precision at each relevant (or engaged) document, up through the last one, and average those values. If the user clicks on 1 document, K is set by that document. If that document is in position 2, a pretty good average precision will result; if they engage with document 200, we’ll have a bad average precision. If the user engages with 10 documents, we’ll take the precision at each of those 10 documents. If the docs are 1–10, we’ll have good AP; if the documents are 190–200, AP will be bad. Perhaps it’s time for an example across those three metrics.

Let’s say we have 1 relevant document in our top 10, and it’s in position 4

We can see that our Reciprocal Rank is 1/4 or 0.25. We have 1 relevant document in our top 5, so P@5 is 1/5 = 0.20; likewise, we have 1/10 for P@10. Our Average Precision is a little trickier. Since the only relevant doc is at position 4, where the precision is 1/4, our average precision is 1/4 or 0.25.

Now let’s add a second relevant doc at position 9

Our Reciprocal Rank doesn’t change, because it’s set by the first relevant document, so it’s still 1/4. P@5 doesn’t change because we still have 1 relevant doc in the top 5, 1/5. P@10 increases because now we have two relevant docs in the top 10, 2/10 = 0.20. Average precision now becomes the average of the precision at each of the two relevant documents: (1/4 + 2/9)/2 = (0.25 + 0.22)/2 ≈ 0.24. See, it’s the average precision. The “mean” part comes when we take the average AP across queries.

Now, let’s add one more relevant document at position 1

MRR now becomes 1/1, P@5 is 2/5, P@10 is 3/10, and AP is (1/1 + 2/4 + 3/9)/3 = (1 + 0.5 + 0.33)/3 ≈ 0.61.
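Here’s a small Python sketch that reproduces that last example from a list of binary relevance labels (1 = relevant, 0 = not), just to make the arithmetic concrete:

```python
# RR, P@K, and AP for one query, from binary labels in result order.
def reciprocal_rank(labels):
    for position, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / position
    return 0.0

def precision_at_k(labels, k):
    return sum(labels[:k]) / k

def average_precision(labels):
    # Average the precision measured at each relevant position.
    hits, precisions = 0, []
    for position, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # relevant docs at positions 1, 4, and 9
print(reciprocal_rank(labels))      # 1.0
print(precision_at_k(labels, 5))    # 2/5 = 0.4
print(precision_at_k(labels, 10))   # 3/10 = 0.3
print(average_precision(labels))    # (1/1 + 2/4 + 3/9) / 3, about 0.61
```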

Personally, I really like MAP as a measure because I don’t have to assume what the user is doing, I can see that she is opening lots of documents (indicating a research task) or opening one (indicating a navigational task).

Graded relevance metrics

The metrics we have looked at up until now have been binary relevance metrics. That means they assume a document can be relevant or irrelevant, with no in-betweens. In real life, though, relevance has shades of grey.

Dogs are good at understanding nuance, it’s all shades of grey to them

Graded relevance metrics extend the binary relevant/irrelevant judgment to levels like “completely irrelevant”, “somewhat irrelevant”, “somewhat relevant”, and “very relevant”. This can be done in online or offline evaluation. In online evaluation, we might assign scores to different levels of engagement, e.g. 1 for a click, 2 for a download, and 3 for sharing with a friend. Documents with no engagement get 0s.
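In code, that online mapping might look something like this. The event names and the 0–3 weights are just the illustrative scheme above, not any kind of standard:

```python
# Turn the deepest engagement we observed on each result into a graded score.
engagement_scores = {"none": 0, "click": 1, "download": 2, "share": 3}

observed = ["click", "none", "share", "none", "download"]  # one query's results, in rank order
graded = [engagement_scores[event] for event in observed]
print(graded)  # [1, 0, 3, 0, 2]
```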

The first graded relevance metric is Cumulative Gain. CG is built on the notion that every new result in the results set should add some value for the searcher.

CG_p = rel_1 + rel_2 + … + rel_p, which is just fancy stats notation to say CG is the sum of relevance scores over the top p documents. Stolen again from Wikipedia.

“But,” you say, “not all positions are equal, because documents in higher positions are much more likely to be seen and engaged with!” To which, I’d say “you’re absolutely right, that’s why we have DCG!”

DCG is Discounted Cumulative Gain, where we take the sum of relevance scores, but we discount them by position so that a relevant document in position 5 is worth less than a relevant document in position 3, which is less valuable than a doc in position 1. This way, we are focusing on getting our most relevant documents as close to the top of our results list as possible.

DCG_p = the sum, over positions i = 1 to p, of (2^rel_i - 1) / log2(i + 1). More fancy notation to say that DCG is the sum of 2 to the relevance-scoreth power (minus 1) over the log of the position, which makes up the discount function.

Here’s what that looks like using a previous example:

Now you can see we have a “relevance (graded)” column, which has scores from 0–3. Our first “relevant” document is still in position 4, so none of the binary metrics have changed. Now we can see the CG is 5 (out of a potential 30, so not great) and our DCG is 3.40, out of a potential 31.8.

Let’s look at how that discount function works a little more closely. If we give the first document a 3 and keep the rest 0, then the first document will contribute (2^3 - 1)/log2(2) = 7/1 = 7, so the total DCG will be 7.

DCG when document 1 = 3 (very relevant)

When document 2 gets the rating of 3 instead, however, DCG becomes 7/log2(3) ≈ 4.42, because position 2 is “worth less” than p1.

A relevant document at p2 is worth less than p1

And if we get down to p10?

Our DCG is now only 7/log2(11) ≈ 2.02, because the logarithmic discount function shrinks the value of that score of 3 the further down the list it sits. Here, you can really see the difference between Cumulative Gain and Discounted Cumulative Gain.
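Here’s a minimal Python sketch of CG and DCG using that gain and discount, which reproduces the 7, 4.42, and 2.02 walkthrough above:

```python
# CG ignores position; DCG discounts each gain by log2(position + 1).
from math import log2

def cumulative_gain(rels):
    return sum(rels)

def dcg(rels):
    return sum((2 ** rel - 1) / log2(position + 1)
               for position, rel in enumerate(rels, start=1))

only_p1 = [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
only_p2 = [0, 3, 0, 0, 0, 0, 0, 0, 0, 0]
only_p10 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

print(dcg(only_p1))   # 7.0
print(dcg(only_p2))   # about 4.42
print(dcg(only_p10))  # about 2.02
print(cumulative_gain(only_p1), cumulative_gain(only_p10))  # 3 3: CG can't tell them apart
```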

CG and DCG are really useful where we aren’t sure how many relevant documents we should have, but we know we want them ranked near the top. They’re similar to MAP, but with a fixed K number of documents (usually 5 or 10) and with graded scores. DCG, in particular, is useful as a check of good ranking, but it can fall down where we have a navigational query with only one “right” result, because it has a fixed K. For example, which of these two is better?

First example: the first result is perfect, and the rest of the top 10 are 0s.
Second example: all of the results are “somewhat relevant” or “somewhat irrelevant”; none are “great”.

From these examples we can see that CG and DCG give a better score to the second example, even though none of the results are “great”. Is that the right answer? It depends. If we care about getting the perfect ranking, no, it’s not right.

To look strictly at ranking, we need a new metric ….

Enter nDCG

nDCG is Normalized Discounted Cumulative Gain (don’t ask me why the ’n’ is usually lowercased). nDCG is primarily a metric of ranking quality. It takes the actual DCG and divides it by an ideal DCG, which gives a score between 0 and 1.

nDCG_p = DCG_p / IDCG_p, i.e. actual DCG divided by ideal DCG (via Wikipedia)

The ideal DCG is just the DCG of the results set if we had ranked it perfectly.

For example if we had a document with a relevance score of 3 in p10 and all others were 0, then the ordered list would be 0,0,0,0,0,0,0,0,0,3. The ideal ordered list would be 3,0,0,0,0,0,0,0,0,0

Showing the iDCG and nDCG for a results set with 1 relevant item in p10

In this case, we can see that the actual DCG is 2.02, the ideal DCG (where the item with the score of 3 was in p1) is 7, so our nDCG is low, 0.29.
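Here’s that single-results-set version as a short Python sketch, using the same gain and discount as before:

```python
# nDCG: actual DCG divided by the DCG of the same scores in their ideal order.
from math import log2

def dcg(rels):
    return sum((2 ** rel - 1) / log2(position + 1)
               for position, rel in enumerate(rels, start=1))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

rels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]  # the lone good doc is stuck at p10
print(dcg(rels))                        # about 2.02
print(dcg(sorted(rels, reverse=True)))  # ideal DCG = 7.0
print(ndcg(rels))                       # about 0.29
```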

The thing to know about nDCG is that it’s based on the results set as given. We can only compare to what we have, so if we have a document with a score of 1 in p1, but the rest of the results set are 0s, then we’ll get a perfect nDCG.

Metrics for when the only document with any relevance is in p1, but it’s still low

As you can see in the example, the CG, DCG, MAP, etc. are all low, but nDCG is perfect. We ranked our documents as well as we could.

It’s for this reason that I don’t like to use nDCG except as a comparison metric. Here’s what I mean by that:

Let’s say we have two versions of our algorithm, Alpha and Beta. Alpha and Beta both return 5 documents, as shown below:

Alpha and Beta results sets

Alpha returns A, B, C, D, E for scores 2, 1, 0, 0, 0
Beta returns F, G, A, H, I for scores 1, 2, 2, 3, 3

If we compare Alpha only to Alpha’s results set, its nDCG is 1.0, perfect! Beta’s nDCG was only 0.80, not so great. So we should ship Alpha, right? Not so fast! Alpha clearly doesn’t return as many relevant documents as Beta did, so how can it have a better nDCG? Remember, all nDCG cares about is how we rank our docs compared to an ideal set. So, what we can do is expand the ideal set.

The way to expand the ideal set is to have the documents from Alpha and Beta in a superset or union of all the distinct results. If we do that, we get an ideal ranking of H, I, A, G, B, F, C, D, E with scores of 3, 3, 2, 2, 1, 1, 0, 0, 0. That is the results set we should compare to, which gives an ideal DCG of 14.95. Calculating the nDCG for Alpha, we get 0.35, and Beta gets 0.76, showing Beta was doing the better job.

This is why I say that nDCG is useful as a comparison metric, but as a standalone metric of relevance, it’s not that useful. We can use nDCG to compare two versions of the algo to see which is better at ranking or compare ourselves to competitors, by creating that superset of documents and seeing how close we get to returning all the best documents in our top K.
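Here’s a rough Python sketch of that union-ideal comparison. The letter-coded docs and grades come from the example above; the gain function is left as a parameter because the absolute nDCG numbers depend on which DCG variant your spreadsheet or library uses.

```python
# Compare two algorithms against the ideal ranking of everything either one returned.
from math import log2

def dcg(rels, gain=lambda rel: 2 ** rel - 1):
    return sum(gain(rel) / log2(position + 1)
               for position, rel in enumerate(rels, start=1))

judgments = {"A": 2, "B": 1, "C": 0, "D": 0, "E": 0,
             "F": 1, "G": 2, "H": 3, "I": 3}
alpha = ["A", "B", "C", "D", "E"]
beta = ["F", "G", "A", "H", "I"]

# The ideal set: every distinct doc either engine returned, sorted by judgment.
ideal = sorted(set(alpha) | set(beta), key=judgments.get, reverse=True)
ideal_dcg = dcg([judgments[doc] for doc in ideal])  # about 14.95 with the 2^rel - 1 gain

for name, results in [("Alpha", alpha), ("Beta", beta)]:
    print(name, dcg([judgments[doc] for doc in results]) / ideal_dcg)

# With the 2^rel - 1 gain this prints roughly 0.24 vs 0.68; with the raw grade as the
# gain (gain=lambda rel: rel) you get the roughly 0.35 vs 0.76 quoted above. Either
# way, Beta is clearly doing the better job.
```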

Conclusion

If it seems like I’ve gone on about “understanding the user” or “jobs to be done” or “user experience” it’s because I have. The number one job of a search scientist is to deeply understand what the user’s task is, build a system to help them complete that task, and measure that we have been successful. Understanding the user is the cornerstone of the work we do. Once we understand that, we can use different measurement techniques and metrics to best represent the user’s goal.

If what we care about is having the best possible ranking within our results set, then nDCG makes sense. MRR will get you similar results. If we want the one best document as close to the top as possible for a navigational task, nDCG or MRR are a great choice, depending on whether you have binary or graded relevance.

If what we are trying to accomplish is getting the most relevant results in our top 10, then P@10 or CG@10 may make the most sense. This is useful for research where the user wants options.

If we care about some combination of good ranking and good recall in the top K documents, DCG@K is a great choice.

If we don’t know what kind of task we have, MAP might be the way to go, since it doesn’t rely on assumptions about the top K documents or the kind of task. However, MAP does rely on binary relevance judgments, meaning selecting the criteria for relevance is critical. Is a click as good as a share or download? That’s up to you to decide … again, based on your knowledge of the user experience.

There are many choices to make, and understanding how those choices affect our metrics (and how metrics affect choices within our system) is critical to selecting the right metric, optimizing the system for that metric, and giving users what they need.

Appendix
I would be remiss if I didn’t mention that much of my thinking on search metrics has been influenced by Ali Dasdan’s excellent slides from WWW2009, which are available on his site. Also, a big shout-out is due to my old friend and mentor Andy Edmonds. Doug Rosenoff, Doug Turnbull, Michael Cole, Tito Sierra, and Daniel Tunkelang also deserve some mentions for helping me hone my thinking over the years.

Of course, this post barely scratches the surface, so if you want more, I recommend looking at SIGIR proceedings over the years. There are also a few really interesting papers and I’ll try to link some below as I get around to it :)

Finally, if you’d like to download the search metrics excel file, you can grab that from my github repo and play around. https://github.com/jrubinstein/search_metrics_demo

Interesting search metrics papers

Here or There: Preference Judgments for Relevance, Carterette et al., https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1058&context=cs_faculty_pubs

Anything Sue Dumais is on the author list … https://www.microsoft.com/en-us/research/people/sdumais/publications/

The Relationship between IR Effectiveness Measures and User Satisfaction, Al-Maskari, Sanderson, & Clough, http://www.marksanderson.org/publications/my_papers/SIGIR2007-b.pdf

Beyond DCG: User Behavior as a Predictor of a Successful Search, Hassan, Jones & Klinkner, http://www.wsdm-conference.org/2010/proceedings/docs/p221.pdf



James Rubinstein

Search nerd, data nerd, and all-around nerd-nerd. He has worked at eBay, Apple, and Pinterest, and currently leads the Product Analytics team at LexisNexis.