Skip to content

Instantly share code, notes, and snippets.

@surister
Last active August 28, 2024 10:24
Show Gist options
  • Select an option

  • Save surister/f4ded7123fa8ec9d5a2a625385bc615e to your computer and use it in GitHub Desktop.

Select an option

Save surister/f4ded7123fa8ec9d5a2a625385bc615e to your computer and use it in GitHub Desktop.
Dissecting hybrid search SQL queries (CTEs)

In https://cratedb.com/blog/hybrid-search-explained we learned about Hybrid Search and how to do it in pure SQL, the resulting query can be hard to understand if you don't have too much experience doing Common table expressions (CTEs), in this piece we will dive deeper into CTEs and the smaller details of the query.

Recap

In the last chapter, we learned that Hybrid Search is pretty much doing some queries that capture different meanings and combine them, don't forget about this as we will see how CTEs are very similar.

##Common Table Expressions.

CTEs are subqueries that can be referenced from a main query, they are temporal, meaning that outside of this main query, they do not exist.

Let's look at the most basic CTE:

WITH a AS (SELECT 1)
SELECT a FROM a

+---+
| 1 |
+---+
| 1 |
+---+

WITH x as is where we define the name of the subquery, inside the brackets (query) is the actual subquery, and after the definition of the subqueries, is where the main query goes.

We can do as many subqueries as needed

WITH 
Q AS (SELECT 1),
Q2 AS (SELECT 2),
Q3 AS (SELECT 3)

SELECT * FROM Q, Q2, Q3
 
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+  

See the pattern now? To top it off let's see a practical example. Let's build a query that reports the name of our tables in CrateDB, their health, and how many rows there are, for our new shiny dashboard.

In CrateDB there are several tables that have metadata about the database itself, some can be found in the sys schema. In sys.health there is information about the health of the tables.

SELECT
  table_name,
  CASE
    WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED'
    WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW'
    ELSE 'GREEN'
  END AS overall_health
FROM
  sys.health
GROUP BY
  table_name,
  table_schema
 
+---------------+----------------+
| table_name    | overall_health |
+---------------+----------------+ 
| "foo"         | "GREEN"        |
| "tbl1"        | "GREEN"        |
| "restaurants" | "GREEN"        |
| "taxi"        | "GREEN"        |
| "data"        | "GREEN"        |
+---------------+----------------+

Now for the second part, the row count information can be found in sys.shards split into different records, each one representing one shard. If we sum all the records from the primary shards of a table, we will get the total existing records for that table.

SELECT
  table_name,
  SUM(num_docs) as total_records
FROM
  sys.shards
WHERE
  primary = TRUE
GROUP BY
  table_name,
  schema_name
 
+---------------+---------------+
| table_name    | total_records |
+---------------+---------------+
| "foo"         | 4             |
| "taxi"        | 10            |
| "tbl1"        | 2             |
| "restaurants" | 25359         |
| "data"        | 149           |
+---------------+---------------+

Now, we can do both queries as subqueries in a CTE. We will join both subqueries using an inner join, on the table name.

WITH health as (
  SELECT
    table_name,
    CASE
      WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED'
      WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW'
      ELSE 'GREEN'
    END AS overall_health
  FROM
    sys.health
  GROUP BY
    table_name,
    table_name
),
row_count as (
  SELECT
    table_name,
    SUM(num_docs) as total_records
  FROM
    sys.shards
  WHERE
    primary = TRUE
  GROUP BY
    table_name,
    schema_name
)
SELECT
  *
FROM
  health h,
  row_count r
WHERE
  h.table_name = r.table_name
  
+--------------+------------------+---------------+
|  table_name  |  overall_health  | total_records |
+--------------+------------------+---------------+
| foo	       | GREEN	          | 4             |
| taxi	       | GREEN	          | 10            |
| tbl1	       | GREEN	          | 2             | 
| restaurants  | GREEN	          | 25359         |
| data	       | GREEN	          | 149           |
+--------------+------------------+---------------+

Does it ring a bell with our Hybrid Search queries? We did the same thing, made a vector search query, a full-text query and joined the results using some special weighted combination (convex or RRF), let's have a look at the convex combination query.

Convex combination

WITH bm25 as (
  SELECT
    (_score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER ()) as _score,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", 'knn search')
  ORDER BY
    _score DESC
),
vector as (
  SELECT
    _score,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(
      xs,
      [vector...],
      15
    )
)
SELECT
  a * bm25._score + (1 - a) * vector._score as hybrid_score,
  bm25._score as bm25_score,
  vector._score as vector_score,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY
  hybrid_score DESC

Two subqueries are defined bm25 and vector and later on in the main query, they are joined by title and the data we want selected, being one of them the actual convex combination that gives us the hybrid score.

Looking at the query you might be wondering what are we doing with _score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER () in the bm25 subquery, we are normalizing the values.

Normalizing bm25 _score.

bm25 or full-text searches in CrateDB return a _score column, the higher the scores are the more the searched tokens are valued by the algorithm, the scores are not normalized between 0 and 1, e.g:

+-----------+-----------------+-----------------+------------+
| _score    | city            | country         | population |
|-----------|-----------------|-----------------|------------|
| 2.4358468 | "New York"      | "United States" | 18908608   |
| 2.4358468 | "New Orleans"   | "United States" | 932759     |
+-----------+-----------------+-----------------+------------+

See how the first row's _score is 2.4358468.

A Vector search _score is how similar two vectors are, even though we might use the euclidian distance to calculate how similar they are, the actual distance is not returned by our KNN_MATCH function, only how related they are. We usually can divide 1 / distance and normalize between 0 and 1 to get the aforementioned value.

We should not sum a 0-1 normalized value with another value that can reach high values like 14 for example; a 'good' vector match might give us 0.89 (while 1 is a perfect match) but a bad full-text match might give a score of 3 if a higher score exists within the same result set. It would bias the final score towards the bad full-text search result and get a very inaccurate result.

Note that _scores of different queries cannot be meaningfully compared, as their values depend on the context of the result set.

By normalizing bm25 value between 0 and 1 we can more accurately combine it with other normalized values. In this case using a common normalize method, min-max. The only special thing here is that we use window functions MIN(_score) OVER () and MAX(_score) OVER () to calculate the min and max.

There are several methods to normalize values, using one or another might depend on your data and experimentation, although one thing you could do is use a simple min-max normalization method and fix the bias in the convex combination weights.

RRF query.

WITH bm25 as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", 'knn search')
  ORDER BY
    _score DESC
),
vector as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(
      xs,
      [vector...],
      15
    )
)
SELECT
  TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank,
  bm25.rank as bm25_rank,
  vector.rank as vector_rank,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY final_rank DESC

At this point you are well aware of the CTE structure and we can skip that part. The only new thing we do in this query is using the RANK OVER () window function to assign a rank to our result set, let's see an example re-using the health-total_records query we built before in the article, for clarity we will omit the subqueries.

WITH health as (...),
row_count as (...)
SELECT
  *
FROM
  health h,
  row_count r
WHERE
  h.table_name = r.table_name
ORDER BY
  total_records DESC

We will assign a rank, in descending order, to the tables that have more records, with RANK() OVER () we need to fill in the OVER () the condition for the ranks to be assigned, in this case OVER( ORDER BY SUM(NUM_DOCS) DESC ).

WITH health as (...),
row_count as (...)
SELECT
  *,
  RANK() OVER (ORDER BY SUM(num_docs) DESC) as rank
FROM
  health h,
  row_count r
WHERE
  h.table_name = r.table_name
ORDER BY
  total_records DESC

+--------------+------------------+---------------+------+
|  table_name  |  overall_health  | total_records | RANK |
+--------------+------------------+---------------+------+
| restaurants	 | GREEN	          | 25359         | 1    |
| data	       | GREEN	          | 149           | 2    |
| taxi	       | GREEN	          | 10            | 3    |
| foo          | GREEN	          | 4             | 4    |
| tbl1	       | GREEN	          | 2             | 5    |
+--------------+------------------+---------------+------+

Besides RANK the only new thing we do is TRUNC the final_rank because the formula returns long decimals, at some point they carry nothing meaningful for us and can be safely truncated, making it easier to read and fit in consoles/screen, you can of course, not truncate at all, it's up to you.

Overview

We looked at what CTEs are, how they are constructed, and how we used this to build powerful hybrid-search queries. Also, we had a look at smaller details of the hybrid search queries such as how and why we normalized the scores inbm25 queries, how RANK() works, and why we TRUNC in RRF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment