Setting up a relevance evaluation program

James Rubinstein
13 min read · May 29, 2020


In my first couple of posts, I described the need for human- and metrics-centered approaches to relevance improvement and how we can evaluate relevance in online and offline approaches. This post will focus on the offline approach to relevance evaluation and describe how to set up a human judgment program to effectively evaluate search relevance*.

As I mentioned in the previous post, human-rated relevance judgments are critical to measuring (and therefore improving) search relevance. However, many search teams don’t know where to start when it comes to creating a program of relevance evaluation.

The first step in establishing a relevance program is understanding what kind of task your users are engaging with. This will drive your metrics and task design. Why are your users on your product? What are they doing while they are there? Is their task navigational? Are they on your site to kill time, get inspired, do some shopping, or gather information? To understand how best to measure, it’s vital to understand what you are attempting to measure.

Let’s make the assumption that your application supports an information gathering task. That will inform a lot of choices about what kind of task we want to replicate and understand. Information gathering, in this context, means that as a searcher, I want to get the most information possible about a topic. That might be an academic research topic or perhaps (as in my current team) legal research.

If the goal of the searcher/user is to gather as much relevant information on a topic as possible, our job as a search product is to ensure the relevance of the documents in the results set for a query.

This is where things get complicated…

There are multiple ways that we can evaluate a set of search results. I’ll describe a few of these ways to measure search result quality, along with some strengths and weaknesses of each.

Result Set Preference
Probably the simplest way to measure search results, this approach takes a user query and runs it through two versions of your search algorithm. You then show both results sets to some raters and ask which they prefer, A or B. Now you just add up the preferences for each version and express them as percentages. 55% of people prefer the B results set? Ship it.

Result set preference interface taken from https://scholarworks.umass.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1058&context=cs_faculty_pubs a classic of the genre.

The challenge with this methodology is that you don’t get much fine-grained feedback on which documents are better or worse. You can probably make some inferences based on the symmetric difference of the two results sets, but it’s just an inference. Another difficulty is that you don’t really get to save your work. You can’t re-use a preference judgment because the sets are likely to be different over time. Also, the results sets might not differ much, so it can be difficult to make a judgment call. To get around that, you could focus on only returning the differences between set A and set B, but that might overly magnify algorithm changes. Finally, you don’t get a measure of search performance over time, because it’s a comparison, not an independent score of each algorithm.

So the value of this methodology is that it’s dead simple and it’s very fast to get ratings, but there are some tradeoffs.
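
To make the tallying concrete, here is a minimal sketch of the two calculations mentioned above: the share of raters preferring each variant, and the symmetric difference between the two results sets. The votes, document IDs, and function names are all made up for illustration.

```python
from collections import Counter


def preference_share(votes):
    """Return the fraction of votes each variant received."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def symmetric_difference(results_a, results_b):
    """Documents that appear in exactly one of the two results sets."""
    return set(results_a) ^ set(results_b)


# Hypothetical votes from 20 raters for a single query.
votes = ["B"] * 11 + ["A"] * 9
shares = preference_share(votes)

# Hypothetical document IDs returned by each variant.
results_a = ["d1", "d2", "d3", "d4"]
results_b = ["d1", "d2", "d5", "d6"]
only_in_one = symmetric_difference(results_a, results_b)
```

With only a handful of raters, a 55/45 split can easily be noise, so it is worth looking at how many votes sit behind the percentage before shipping.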

Document preference
Another preference based approach to human evaluation. In this case, we take a query, and a pair of documents, show those to our intrepid raters and see which they prefer. Then we do a bit of fancy math and we can create a rank-ordered list of preferred documents. Once we have an ordered list of documents, we can compare that to the ordered lists that our search engine provides. A little rank correlation, and you can see how close you are to “perfect” ranking. Easy peasy, lemon squeezy!
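
As a rough sketch of that pipeline: the "fancy math" is stood in for here by simply counting pairwise wins (a real program might fit a proper preference model instead), and the rank correlation is Kendall's tau written out by hand. The preferences, document IDs, and engine ranking are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rank_by_wins(preferences):
    """Order documents by how many pairwise comparisons they won.

    `preferences` is a list of (winner, loser) pairs. Win counting is a
    simple stand-in for a real preference model.
    """
    wins = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        wins[loser] += 0  # make sure documents that never win still appear
    return sorted(wins, key=lambda doc: (-wins[doc], doc))


def kendall_tau(ideal, actual):
    """Kendall's tau between two orderings of the same documents (no ties)."""
    position = {doc: i for i, doc in enumerate(actual)}
    concordant = discordant = 0
    for first, second in combinations(ideal, 2):
        # `first` is ahead of `second` in the ideal order; check the other order.
        if position[first] < position[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rater preferences for one query: (preferred, not preferred).
preferences = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
ideal_order = rank_by_wins(preferences)

# Hypothetical ranking returned by the search engine for the same query.
engine_order = ["d2", "d1", "d3"]
tau = kendall_tau(ideal_order, engine_order)
```

A tau of 1 means the engine's order matches the raters' order exactly, and -1 means it is completely reversed.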

Of course, this has some issues as well. For one, there is a combinatorial explosion of pairwise comparisons as you increase the number of documents. With 2 documents, you get 1 comparison. With 3 documents there are 3. With 10 documents there are 45, and so on: n documents need n(n-1)/2 comparisons. To manage the combinatorial explosion, you could employ the transitive property to say that if document 1 > document 2 and document 2 > document 3, then document 1 > document 3, so no need to test that combination! Unfortunately, much like in sports, it doesn’t always work. Make sure to test before you go applying the transitive property in your testing.

JMU has not beat Alabama
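
Both points can be sketched in a few lines: counting the comparisons needed for a given number of documents, and checking a set of collected preferences for cycles where transitivity fails. The function names and example data are hypothetical.

```python
from itertools import combinations
from math import comb


def pair_count(n_docs):
    """Number of distinct document pairs among n_docs documents."""
    return comb(n_docs, 2)


def intransitive_triples(preferences):
    """Find cycles where x beat y, y beat z, and z beat x.

    `preferences` is a list of (winner, loser) pairs.
    """
    beats = set(preferences)
    docs = sorted({doc for pair in preferences for doc in pair})
    cycles = []
    for a, b, c in combinations(docs, 3):
        # A triple can form a cycle in two directions; check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if (x, y) in beats and (y, z) in beats and (z, x) in beats:
                cycles.append((x, y, z))
    return cycles


# Hypothetical judgments containing one rock-paper-scissors style cycle.
preferences = [("d1", "d2"), ("d2", "d3"), ("d3", "d1")]
cycles = intransitive_triples(preferences)
```

If a pilot batch of fully judged triples turns up many cycles, that is a hint that inferring the third comparison from the other two is not safe for your content.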

Document pairwise comparisons have some advantages, too. Comparing two documents is relatively easy for humans (relative to assigning an absolute score). Pairwise judgments also don’t need to be re-rated until there is a new document in the result set, but if there is a new document, you may have to make multiple comparisons to find where it should live in the ideal rated ranking.

One tweak to this approach is to combine it with whole set preferences, allowing raters to see an entire set of search results and rank them according to their preferences. Think of it as ranked-choice voting for search results.

Binary relevance judgments
This methodology is probably one of the most widely used relevance scoring methods, with roots going back to the early 1960s and the Cranfield Paradigm. In this approach, judges are given a query and a document and asked to judge the document as relevant or irrelevant to the query. These judgments can then be turned into a variety of useful and interesting scores†.

This method of scoring relevance is pretty easy and straightforward. Get queries, get documents, get judgments, make scores and “Robert’s your mother’s brother.” However, it’s not without its nuances. For one, the ratings need to be provided by people who are familiar with the subject. In TREC document collections the raters are often subject matter experts. The main issue though is treating relevance as a binary. Is something relevant/irrelevant or are there shades of grey? Clearly, some documents are more relevant than others.

A binary preference judgment from the homies at the Pin Factory

Graded relevance judgments
Finally we come to what is probably the most widely used method of assessing search result quality: graded relevance judgments. In this paradigm, raters are given a query and a document, but they are then asked to rate the document’s relevance to the query on a scale. The scale can be numeric (e.g. 1–4) or categorical (relevant, somewhat relevant, irrelevant). The ratings can then be averaged, majority-voted, or otherwise combined to create a single score. That score can then be turned into a metric†.

A graded relevance interface from the current gang at LexisNexis
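
Here is a small sketch of combining several raters' grades for one query/document pair into a single score, by averaging or by majority vote. The 1–4 scale, the tie-breaking rule, and the example ratings are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def mean_rating(ratings):
    """Average grade across raters."""
    return mean(ratings)


def majority_rating(ratings):
    """Most common grade; ties go to the lower (more conservative) grade."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(grade for grade, n in counts.items() if n == top)


# Hypothetical grades on a 1-4 scale, keyed by (query, document ID).
judgments = {
    ("motion to dismiss", "doc-17"): [4, 4, 3],
    ("motion to dismiss", "doc-42"): [1, 2, 2],
}

scores = {
    pair: {"mean": mean_rating(grades), "majority": majority_rating(grades)}
    for pair, grades in judgments.items()
}
```

Averaging keeps more of the nuance when raters disagree, while majority voting gives you a clean label on the original scale.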

This approach has the advantage of giving document level scores that can be used to benchmark the search algo over time or compare two algos. If your content set doesn’t change too much, you can also re-use the ratings, which means you might not need to get documents re-rated at all! That sounds great, but you may find that your rating set grows stale over time, which can be a problem. Should you get new queries, new documents, or new ratings on query document pairs? It’s considerably more expensive in time and treasure if you do, so often teams will keep ratings around for longer than they should.

These are just a few of the ways to get relevance judgments; I’m sure there are others. Each has its advantages and disadvantages, so the right choice depends on your goals. If you are just starting out, I recommend going with result set preference judgments, as they are fast and easy, or a graded relevance setup if you have the resources. For an information gathering task, graded relevance probably makes the most sense, as you want to ensure that your top documents are adding real value for the searcher.

Now that we know what kind of task we want our raters to rate, we need to get them some material. Let’s start with the queries.

Queries
I’ll use the term “query” a bit loosely here: basically a query is a user input that is run through an algorithm to produce a results set. In search, a query is a string of text. Queries can also be spoken (for your digital assistant), or they can be a product if you are in a shopping recommendation scenario. For the rest of this post, I’ll be using query to refer to a string of text entered into a search input box.

Queries can have multiple forms, even within text. While most of us think of the “natural language” version of queries, where we ask Google “what is the capital of West Virginia”, there are also Boolean queries that leverage a syntax with operators like AND, OR, or NOT. An example of this might be “motion to dismiss” AND (adversarial OR repeat OR serial) litigant. This query tells the search engine that the searcher wants documents about a motion to dismiss where the litigant is adversarial or repeat or serial. Most of the time, though, we care about natural language queries, because in most consumer internet applications that’s what most queries are.

Where do you find queries? From your user logs, of course. If you aren’t logging queries and search performance online … well no better time to start than now! To get a list of queries, you want to make sure that the list is representative. “A representative sample?” you say, “then I should take a random sample!” Indeed, you are a smart cookie. BUT there be dragons. What is the best way to sample queries? Simple random? Sure but you’ll likely get a bunch of uncommon queries because there are many more uncommon queries than common ones. It’s a long-tailed distribution.

So what is the right approach? I usually go for a random weighted sample where the weights are the number of people issuing a query. This enables me to get a representative sample of the kinds of queries people issue without overweighting that one user who searches for “cake recipes” 300 times per day.
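
A minimal sketch of that sampling scheme, assuming your logs can be reduced to (user ID, query) rows. Each query is weighted by its number of distinct users, and distinct queries are drawn without replacement using random keys of the form u^(1/weight), which is one common way to do weighted sampling without replacement. The log rows and function names are hypothetical.

```python
import random
from collections import defaultdict


def users_per_query(log_rows):
    """Count distinct users per query from (user_id, query) log rows."""
    users = defaultdict(set)
    for user_id, query in log_rows:
        users[query].add(user_id)
    return {query: len(ids) for query, ids in users.items()}


def weighted_sample(weights, k, seed=None):
    """Draw k distinct queries, favoring those with higher weights.

    Each query gets the key u ** (1 / weight) for a uniform random u;
    the k largest keys win.
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / w), query) for query, w in weights.items()]
    keyed.sort(reverse=True)
    return [query for _, query in keyed[:k]]


# Hypothetical log: one user hammering "cake recipes", others searching normally.
log_rows = [("u1", "cake recipes")] * 300 + [
    ("u1", "capital of west virginia"),
    ("u2", "capital of west virginia"),
    ("u3", "capital of west virginia"),
    ("u2", "motion to dismiss"),
    ("u3", "motion to dismiss"),
]
weights = users_per_query(log_rows)
sample = weighted_sample(weights, k=2, seed=7)
```

Because the weight is distinct users rather than raw query count, the 300 "cake recipes" searches count once.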

Now we need to figure out how many queries we need to accurately represent our total population of queries. Fortunately, there are online calculators for that sort of thing, and that’s a good starting place. A survey sample-size calculator gives a pretty reasonable estimate of how many queries you need to detect a particular effect size at a particular confidence level. Want to be able to detect a 2% effect size on a population of 10K queries with 95% confidence? You’ll need about 4,900 queries. Yeah, it’s a lot. If you want to detect a 10% change, then you only need 370 queries. Basically, the margin of error you enter should be half the change you want to detect, so that the change falls outside the “margin of error” for both the A and B variants; e.g. if you want to detect a 5% change, your margin of error needs to be 2.5%.
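
If you would rather compute this than use an online calculator, here is a sketch using the standard survey sample-size formula with a finite population correction. I am assuming that is roughly what those calculators do; the function name and defaults (z = 1.96 for 95% confidence, worst-case proportion of 0.5) are my own choices.

```python
import math


def sample_size(population, margin_of_error, z=1.96, proportion=0.5):
    """Survey-style sample size with finite population correction.

    z=1.96 corresponds to 95% confidence; proportion=0.5 is the
    worst-case (largest sample) assumption.
    """
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# Margin of error is half the change you want to detect.
queries_for_2_percent = sample_size(10_000, margin_of_error=0.01)
queries_for_10_percent = sample_size(10_000, margin_of_error=0.05)
```

For a population of 10,000 queries this gives about 4,900 queries for a 1% margin and 370 for a 5% margin, in the same ballpark as the figures quoted above.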

Now that we have our queries sampled, it’s time to get some documents.

Results (documents)
To get documents, plug your queries into your search engine and record the results. Now get the document for each result. Now you can show that query/document pair to the raters and get their ratings in your tool of choice.

Raters
The single most important part of a human relevance evaluation program is the humans.
Surprising, I know. It’s the single most difficult part of the program as well. Getting good raters is a challenge, and training them appropriately takes time. There are services out there that can help with some or all of that, depending on your time and budget. On the “low end” there are crowdsourcing platforms such as Amazon Mechanical Turk that just give you a platform. You have to manage the Turkers, their pay, and the task all on your own. Still, it’s a good option if you have a simple task and you have the time. Appen, having acquired Figure Eight, seems to be in both the platform and the consulting business. On the higher end, for a more full-service approach, there is Appen’s consulting arm, as well as companies like the pleasingly alliterative Search Strategy Solutions. For more professional classes, where you may need to find SMEs, services like Fiverr might be a good option.

Where you get your raters and ratings from is a huge choice. This will depend a lot on the task you are trying to emulate. Do you need special knowledge to rate legal documents and queries? Then you’ll need to find some lawyers (or at least people with some legal training). If you want ratings on makeup tutorials, you probably need people who wear makeup. If you want ratings on music, maybe break your query set into genres and find people who love each genre. The goal is finding people with as much expertise in the area as you can afford. If you have to make a choice, get fewer queries or documents per query in favor of higher quality raters.

There are two main categories of judges here: internal and external. Internal means employees or contractors working for the company; rating can be their full-time job or something they do on the side. The advantage of internal raters is that you can engage with them more readily if there is a question about a particular document. Often this is the highest quality source of ratings. That said, internal raters tend to be more expensive and are less scalable.

External raters can be further broken down into two groups: consulting and crowd. Crowd raters are sourced through platforms like MTurk. You may not have any relationship with any of the raters, so it’s difficult to gather feedback or have a conversation beyond the score the raters give. It’s important, therefore, to build a very robust tool to ensure quality from raters who are getting paid per judgment and therefore have an incentive to deliver as many ratings per hour as possible, not necessarily the greatest quality.

Consulting groups like Samasource or Appen will have people who are dedicated or semi-dedicated to a particular job. This means that there can be feedback loops, but perhaps not with the same rater relationship that you would get with in-house raters.

As you have probably gathered, there are different marketplaces for different kinds of raters and different time investments. To make the best use of these, you need a tool, a platform to manage the raters and ratings.

Tools & Task
How you present the rating task to the raters is vitally important. You can ask if a document is relevant and use a 1–22 scale, but does that make sense? Does a 22 point scale really help? For my money a 4 point scale works pretty darn well. The question you ask and the task you set up are also really important. You need to have good, clear, concise instructions that make it clear what you mean by ‘relevance’. If you are using internal judges, then you’ll be able to use a more … involved… set of guidelines. If you are using the crowd, however, you’ll want to keep the guidelines to a minimum — it’s hard to get someone to read a 50 page document on rating search results when they are getting 5¢ per rating. Instead, use “gold”, known answer questions, to provide a “teachable moment” that gets to the distinction between good and great or bad and acceptable.
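
On the quality-control side, gold units also give you a simple way to score raters. This is a minimal sketch, assuming ratings arrive as (rater ID, unit ID, rating) tuples and that you pick your own accuracy cutoff; the 0.7 threshold and all the names here are hypothetical.

```python
from collections import defaultdict


def gold_accuracy(ratings, gold):
    """Share of gold units each rater answered correctly.

    `ratings` is a list of (rater_id, unit_id, rating) tuples;
    `gold` maps unit_id to the known-correct rating.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for rater_id, unit_id, rating in ratings:
        if unit_id in gold:  # only score units with a known answer
            seen[rater_id] += 1
            correct[rater_id] += int(rating == gold[unit_id])
    return {rater: correct[rater] / seen[rater] for rater in seen}


# Hypothetical gold units and ratings on a 4 point scale.
gold = {"q1-d1": 4, "q2-d7": 1}
ratings = [
    ("rater_a", "q1-d1", 4), ("rater_a", "q2-d7", 1), ("rater_a", "q3-d2", 3),
    ("rater_b", "q1-d1", 2), ("rater_b", "q2-d7", 1),
]
accuracy = gold_accuracy(ratings, gold)

MIN_ACCURACY = 0.7  # assumed cutoff; tune it to your task
trusted = sorted(r for r, acc in accuracy.items() if acc >= MIN_ACCURACY)
```

In practice you would probably want a minimum number of gold units per rater before trusting the accuracy figure.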

The task you choose is also important. If you are running an ecommerce business, then asking about “relevance” might not be the right question to ask. A better question is “would you buy this item” (side note: look at baby James!). You have to make the task relevant to what your users are doing. The judges aren’t in the head of your users, but you can put them in the closest possible frame of mind.

The tool you use should also help you manage quality. A good platform will help you source workers, find incorrect ratings and remove them, and if necessary remove bad actors on the system. Good quality ratings mean that you need good raters, so finding those high-quality raters is important. A platform should help you find and reward people who are doing well, so that you can keep them on your jobs. Another approach to increase quality is to have multiple tiers of raters. Getting “gold” raters to disambiguate or check the “bronze” raters can add a great deal of additional quality. Basically, if three people agree on a rating, stop asking, but if there’s disagreement on the classification of a document, it can be sent to a “gold rater.” This query/document pair will also make a nice gold unit to test and train against. Finally, you can use your gold raters to check the quality of bronze raters and use that delta to model “correct” answers without input from the gold raters.
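
The "stop asking when three people agree" rule might look something like this sketch. The cap of five ratings before escalating to a gold rater is my own assumption, as are the function and label names.

```python
from collections import Counter


def next_step(ratings, agree_needed=3, max_ratings=5):
    """Decide what to do with a query/document pair given its ratings so far.

    `agree_needed` raters giving the same rating closes the unit;
    `max_ratings` is an assumed cap before escalating to a gold rater.
    """
    counts = Counter(ratings)
    if counts and max(counts.values()) >= agree_needed:
        return "done"
    if len(ratings) >= max_ratings:
        return "escalate_to_gold"
    return "ask_another_rater"


# Hypothetical rating histories for three query/document pairs.
agreed = next_step([3, 3, 3])
still_open = next_step([3, 2])
disputed = next_step([1, 2, 3, 4, 2])
```

Pairs that end up escalated are good candidates for new gold units once the gold rater has settled them.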

Conclusion
It’s difficult to go into any great detail on this topic in a Medium post, but it’s a vitally important topic if you want to measure the search experience. Understanding relevance requires understanding your users, then building tools and tasks, getting data, and finding raters to emulate your users’ task as closely as possible. It’s not easy, and there are balances that have to be struck between cost, time investment, and quality. However, when you get it right, you’ll have a powerful tool for understanding what is going on with search and why.

*A good human evaluation program can be used for more than search. It can be used to train and evaluate all sorts of machine-learned algorithms; search is just the one I’m describing here, but any ML algo needs training data. If you have more than just search, make sure you keep these other training and evaluation use cases in mind!

† subject of an upcoming post :)

Written by James Rubinstein

Search nerd, data nerd, and all-around nerd-nerd. He has worked at eBay, Apple, and Pinterest, and currently leads the Product Analytics team at LexisNexis
