Thinking about Cases and Books of Judgements

Cases and Books share many, maybe too many, similarities, and can be thought of as two different lenses on the same data.

However, they are distinct, interconnected systems for handling search evaluation, and they differ quite a bit in purpose and design!

Cases: Live Search Experimentation

Cases are Quepid's dynamic, real-time search testing environment. Think of a Case as your active laboratory where you're constantly tweaking and experimenting with search configurations. Here's how they work:

Data Storage

Cases store their search queries in the Queries table, with each Query containing just the query_text, information_need, notes, and options. The key thing is that Cases don't store search results - they fetch them live from search engines every time you run a query. The only thing they persist about results is Ratings via the Ratings table, which links a query_id to a doc_id with a rating value.
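
To make that shape concrete, here is a minimal sketch of those two records as Python dataclasses. The names mirror the tables described above, but this is an illustration, not Quepid's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Query:
    # What a Case persists about each query: text and annotations,
    # but never the search results themselves.
    query_text: str
    information_need: Optional[str] = None
    notes: Optional[str] = None
    options: dict = field(default_factory=dict)

@dataclass
class Rating:
    # The only persisted fact about a result: this doc, for this
    # query, earned this rating value.
    query_id: int
    doc_id: str
    rating: float
```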

Search Engine Interaction

This is where Cases really shine. They're designed for live interaction with search engines through Tries and Search Endpoints. Each Case has multiple Tries that represent different configurations (think of them as saved parameter sets), and each Try connects to a Search Endpoint that knows how to talk to Solr, Elasticsearch, OpenSearch, or even custom APIs via the SearchAPI mapper. When you run a query, it hits the live search engine with your current parameters and gets fresh results.
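
As a rough illustration of what happens under the hood, a Try's saved parameters are merged with your query text and sent to the live engine. The sketch below assumes a Solr-style select endpoint; the URL and parameters are placeholders, and the real plumbing lives in Quepid's Search Endpoint and SearchAPI mapper code.

```python
import requests

def run_query(endpoint_url: str, try_params: dict, query_text: str) -> list[dict]:
    # Merge the Try's saved parameter set with the query under test,
    # then fetch fresh results from the live engine.
    params = {**try_params, "q": query_text, "wt": "json"}
    response = requests.get(endpoint_url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# Example (hypothetical endpoint and parameters):
# run_query("http://localhost:8983/solr/tmdb/select",
#           {"defType": "edismax", "qf": "title overview"},
#           "star wars")
```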

Evaluation

Cases use a real-time rating system where you manually rate individual search results on the fly. These Ratings get aggregated into Scores using configurable Scorers, and you get immediate feedback on how your search tuning is performing. We provide all the standard IR metrics for you, and you can write your own custom Scorer.
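
For a flavor of what a Scorer does, here is the idea behind Precision@k in Python. This is only a sketch of the concept, not the code of Quepid's built-in Scorers.

```python
def precision_at_k(rated_docs: list[float], k: int = 10, threshold: float = 1.0) -> float:
    # Fraction of the top k results whose rating meets the relevance
    # threshold. The threshold of 1.0 is an assumption for this sketch.
    top_k = rated_docs[:k]
    if not top_k:
        return 0.0
    return sum(1 for r in top_k if r >= threshold) / len(top_k)

# precision_at_k([2, 0, 3, 1, 0], k=5)  ->  0.6
```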

Books: Offline Judgment Collection

Books are the complete opposite - they're designed for systematic, offline evaluation of search quality. Think of them as standardized test datasets.

Data Storage

Books store everything up front in the Query Doc Pairs table. Unlike Cases, Books actually persist both the queries AND the data that makes up each search result (as document_fields JSON). Each Query Doc Pair represents a query-document combination that needs to be judged. This is crucial because Books need to maintain consistency - you can't have the underlying search results changing while people are making judgments.
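
Continuing the earlier sketch, a Query Doc Pair might look like this; again, the field names are illustrative rather than the exact schema.

```python
from dataclasses import dataclass

@dataclass
class QueryDocPair:
    # A Book snapshots the document content itself (as JSON) so that
    # every judge sees the same, stable result.
    query_text: str
    doc_id: str
    document_fields: dict  # e.g. {"title": "...", "description": "..."}
```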

Every time a Case is loaded, all of the returned data from the search engine is queued up as an update to the associated Book and processed in the background. We control what data is stored in the document_fields JSON attribute by looking at the field specification for the Case. If you define your field specification as id, title:name, description, then we will store the description in document_fields for the Query Doc Pair. If you later change the field specification to id, title:short_name, description, url, thumb:image and run the Case, then the Book will be updated so that the new short_name value is used for the title, and the document_fields attribute will be updated to store the url and the thumb (sourced from image) as well.
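
A hypothetical sketch of that mapping logic: the function below is not Quepid's actual parser, but it shows how alias:field entries in a field specification could select and rename what lands in document_fields.

```python
def apply_field_spec(field_spec: str, doc: dict) -> dict:
    # Each entry is either "field" or "alias:field"; the alias becomes
    # the key stored in document_fields.
    stored = {}
    for entry in field_spec.replace(",", " ").split():
        if ":" in entry:
            alias, source = entry.split(":", 1)
        else:
            alias = source = entry
        if source in doc:
            stored[alias] = doc[source]
    return stored

doc = {"id": "42", "short_name": "Rambo", "description": "Action classic",
       "url": "https://example.com/42", "image": "rambo.jpg"}
apply_field_spec("id, title:short_name, description, url, thumb:image", doc)
# -> {"id": "42", "title": "Rambo", "description": "Action classic",
#     "url": "https://example.com/42", "thumb": "rambo.jpg"}
```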

Now, if a Query Doc Pair in a Book is no longer being returned by the associated Case, then of course no content updates will happen for that pair.

TL;DR: The data stored in the Book is driven by the search results returned by the Case!

Search Engine Interaction

Books don't directly interact with search engines during evaluation. Instead, they're populated from a Case every time that Case is run, which captures a snapshot of search results at that point in time. After that, it's all offline - no more live searching. This means you should run the Case periodically to keep the set of Query Doc Pairs in the Book up to date, so it represents what your users are actually seeing.

Evaluation

Books use a Judgements system where multiple users can independently rate the same query-document pairs. Each Judgement is stored separately (unlike Cases, where a new Rating overwrites the old one), and Judgements support sophisticated features like explanations, "judge later" flags, and "unrateable" markings.
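
A minimal sketch of what one Judgement row carries, assuming illustrative field names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Judgement:
    # One row per judge per Query Doc Pair; a second opinion never
    # overwrites the first.
    query_doc_pair_id: int
    user_id: int
    rating: Optional[float] = None  # None while flagged "judge later"
    unrateable: bool = False
    explanation: Optional[str] = None
```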

Books also support an LLM-as-Judge capability that can scale up the judgement process.

They do not provide any measurement of search quality; that is the job of a Case.

The Key Differences and Why They Matter

Ratings vs Judgements

This is the big philosophical difference. Cases use Ratings - quick, single-value assessments meant for rapid iteration. Books use Judgements - thoughtful, often collaborative evaluations meant for creating gold-standard datasets.

We bring these two worlds together by linking a Case to a Book. As judging happens in the Book, the various Judgements are aggregated into a single Rating to be used in the Case. Learn more in How Judgements are Averaged into a Rating in a Case.
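
As a conceptual sketch (the authoritative rules are in that document), the aggregation boils down to averaging the usable Judgements for each query-doc pair:

```python
from typing import Optional

def judgements_to_rating(ratings: list[Optional[float]]) -> Optional[float]:
    # Average the usable Judgements for one query-doc pair into the
    # single Rating a Case consumes; None entries stand in for
    # "judge later" or "unrateable" Judgements and are skipped.
    usable = [r for r in ratings if r is not None]
    if not usable:
        return None
    return sum(usable) / len(usable)

# judgements_to_rating([3, 2, None, 3])  ->  2.666...
```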

Scale

Books have configurable scales (like 0-3 or 0-1) with detailed scoring guidelines built right in. Cases also have a scale, tied to the specific Scorer being used. If your Book has a 0-3 graded scale and your Case has a binary Scorer like Precision, then the Case automatically maps the graded scale to a binary one.
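
One plausible way to picture that mapping (the midpoint cutoff here is an assumption, not necessarily Quepid's exact rule):

```python
def to_binary(graded_rating: float, max_grade: int = 3) -> int:
    # Collapse a graded rating (e.g. 0-3) onto the binary scale a
    # Scorer like Precision expects; anything above the midpoint
    # counts as relevant. The cutoff choice is illustrative.
    return 1 if graded_rating > max_grade / 2 else 0

# to_binary(3) -> 1, to_binary(2) -> 1, to_binary(1) -> 0, to_binary(0) -> 0
```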

Data Flow

The integration is brilliant - you can populate a Book from a Case (capturing live search results), collect Judgements from multiple people offline, then refresh that data back into a Case to continue live experimentation as Judgements flow in. You do need to manually reload the Case screen to see the new aggregated Rating data. Likewise, if you DO create a Rating in the Case, it is recorded as a Judgement by you for the corresponding Query Doc Pair in the Book.

Collaboration

Cases are more individual-focused (though they can be shared via Teams), with Search Relevance Engineers as the intended audience, while Books are designed from the ground up for collaborative evaluation by multiple Subject Matter Experts.

Data Models Summary

In short: a Case persists Queries and Ratings and fetches results live through a Try and its Search Endpoint, while a Book persists Query Doc Pairs (including the document_fields snapshot) and the individual Judgements made against them.

Workflow Integration

The typical workflow shows the power of both systems working together:

  1. Development Phase: Use a Case to experiment with search configurations, tune parameters, and get quick feedback through live search results
  2. Evaluation Phase: Populate a Book from the Case to capture a snapshot of results for systematic evaluation
  3. Judgment Collection: Multiple users provide Judgements on the standardized dataset in the Book
  4. Integration Phase: Refresh the Book data back into Cases to continue development with improved baseline Ratings

Can I use them Individually?

Of course! For many small relevance investigations, you don't need a Book. A Case with 6 or 8 Queries that you quickly rate yourself is often enough to investigate a problem.

Likewise, you may find that you want to source your raw data for evaluation from a different system and import it into your Book. Books have complete APIs that let you manage a Book or import the Query Doc Pairs yourself for evaluation. You can also export the data for use in non-Quepid evaluation systems.
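
For example, importing a Query Doc Pair over HTTP might look roughly like this. The endpoint path, payload shape, and auth header are assumptions for illustration; check your Quepid instance's API documentation for the real contract.

```python
import requests

QUEPID = "https://quepid.example.com"  # hypothetical instance URL
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}  # assumed auth scheme

pair = {
    "query_text": "star wars",
    "doc_id": "tt0076759",
    "document_fields": {"title": "Star Wars", "description": "..."},
}

# Hypothetical endpoint path for adding a pair to Book 123.
resp = requests.post(f"{QUEPID}/api/books/123/query_doc_pairs",
                     json=pair, headers=headers)
resp.raise_for_status()
```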

Summary

It's really a perfect example of designing different tools for different jobs - Cases for the rapid iteration and experimentation phase of search development, and Books for the rigorous evaluation and benchmarking phase. The fact that data flows seamlessly between them means you get the best of both worlds!