Tutorial - Simple Indexed Search in PostgreSQL

5 minute read

This is a simple tutorial on creating queries that match specific database entries given a string or list of strings in PostgreSQL.

ℹ️ Notice: You need some basic knowledge of tables and SQL to properly ingest the information in this post.

Goal of this Tutorial

Business logic in certain applications often requires developers to create search engines that conform to specific given parameters. For example, you may be building a messaging application like Discord or Slack, and you may find yourself in a situation where you have to find some information that matches a given query. There are open source frameworks like ElasticSearch dedicated to this matter, but they can be overkill in certain scenarios.

As such, this tutorial is going to focus on a simple approach that only requires the usage of your existing PostgreSQL database. If you need help setting up a Postgres database, you can find some information about setting it up here. The documentation is always the best place to start!

The Background

Let’s shove ourselves into the developer seat and assume we’re an employee of Crusty Ham Inc., your average development company that abuses the Agile model and makes you, the developer, work on 3 to 4 projects at a time. One of these projects is “Batty 🦇”, a chatting application that focuses on communication during night hours.

Today, the Project Manager decided to inform the team working on Batty 🦇, which includes you, that its message searching system is of urgent priority. To make matters worse, your deadline is tomorrow.

Your task: You need to make a query that given a keyword, will find and return the most relevant messages matching that keyword.

Normally, this feature would require planning and effort. However, you have to finish it during this work day. How?

The Details

Assume that you are managing the columns USER_ID and USER_MESSAGE through the table USER_DATA below:

USER_DATA
+--------------+---------------------------+
|   USER_ID    |       USER_MESSAGE        |
+--------------+---------------------------+
| 4fb9da4481ce | hunter2                   |
| 226723b7c690 | See? I only view ******'s |
| 4fb9da4481ce | oh, ok.                   |
+--------------+---------------------------+

The USER_ID is some arbitrarily generated value and the primary key of our table. USER_MESSAGE is a string that shows the message a user has sent. Note how we can see the USER_ID twice in the example table; this means that user 4fb9da4481ce has sent two messages: hunter2 and oh, ok.

Step 1: Indices

Indices are an essential part of database design. You can imagine them as a specialized list that is used to search through database entries in a quick manner. Under the hood, the database uses data structures fit for this task. Read more on database indexing starting with this elaborate StackOverflow answer here.

As such, we are going to start by creating an index on our USER_MESSAGE column. There are many choices available, but we are going to use something simple for this tutorial:

CREATE INDEX message_idx ON USER_DATA(USER_MESSAGE);

This index is created on just one column, but for your use case there may be other columns in need of an index. Refer to the documentation for more information.

Note that do not need to create an index on USER_ID, as Primary Key columns are automatically indexed by the PostgreSQL on creation.

⚠️ Warning: Indices are not the “to-go” deus-ex machina solution for every slow query you may encounter. Study your use case carefully before manually adding an index.

Step 2: TSVector

With our indices up and running, we can get to building the structure for our search. PostgreSQL has this really neat data structure called tsvector which is heavily optimized for text searching.

What you need to remember is that you can “convert” your message into an instance of this data structure by using the function to_tsvector(). This will be useful to us later as we can use the pre-made functions of the data structure to construct our query.

Step 3: TSQuery

A tsvector needs to be queried in a special way. As such, the Postgres development team has created tsquery. This is also another data structure, but it is used as a query for our tsvector The to_tsquery() function can be used to convert plain strings into a tsquery.

Step 4: Ranking

After we have retrieved the information from our tsvector, we have to rank it by relevance. In different contexts, “relevance” can mean a lot of things. However, for our example we can assume that the “cover density” of our query is our relevance. The ts_rank_cd is a function that does exactly that, so we will use it in combination with that was mentioned above so that we can order our results by relevance.

ℹ️ Notice: Make sure to look into the other option ts_rank too, which orders on frequency of lexemes rather than cover density.

Step 5: Building the Query

Let’s begin with a SELECT statement:

SELECT ts_rank_cd(to_tsvector(USER_MESSAGE),
  to_tsquery(:query)) AS relevance
FROM USER_DATA

There’s a lot to unpack here. Let’s take it step by step and see what the query is doing:

Selecting the USER_MESSAGE column from the table USER_DATA
Converting USER_MESSAGE entries to a tsvector
Taking in some keyword :query (note the colon) and converting it into a tsquery
Wrapping this vector and query with ts_rank_cd for ranking purposes

🚫 Danger: In this state, our query has a big problem! Currently it just selects the relevance from every row in the table without actually doing any filtering. Luckily, this is where the match operator @@ comes in.

The match operator @@ has a simple purpose: it returns true if a tsvector matches a tsquery. Thus, we can use it to filter our results by adding this condition to our query:

WHERE USER_MESSAGE @@ to_tsquery(:query)

To clean up our query some more, we can tell it to return only the top 5 most relevant results:

ORDER BY relevance DESC
FETCH FIRST 5 ROWS ONLY;

Now we have all the parts we need to complete our query. Let’s put it together:

SELECT ts_rank_cd(to_tsvector(USER_MESSAGE),
  to_tsquery(:query)) relevance, USER_ID
FROM USER_DATA
WHERE USER_MESSAGE @@ to_tsquery(:query)
ORDER BY relevance DESC
FETCH FIRST 5 ROWS ONLY;

And of course, don’t forget to replace :query with your query, or inject it using whatever method works for you.

Closing Thoughts

And just like that, we’ve completed the Project Manager’s request with only a bit of effort! Batty 🦇’s clients will now remain happy, until the next fun idea of course.

There’s a lot of fine-tuning you can do, but if you want a simple method to search quickly without refactoring a large part of your system, this is one of simplest ways you can do it if your application is running on Postgres.

Refer to the documentation links above for extra reading material, as there is too much to cover in one article.

Share on

Twitter Facebook LinkedIn

Daniel Stefani