How to use the SEARCH function in BigQuery

Jash Bhatt
5 min readOct 29, 2023

Introduction

Have you ever wondered how to find a needle in a haystack of data? Or how to quickly locate the rows of data that match your search criteria? Maybe you’ve bumped into this issue before and took a crack at fixing it with stuff like regular expressions, LIKE operators, or JSON functions. But those fixes can be kinda slow, not very efficient, or just a real pain to decipher. Wouldn’t it be cool if there was a simpler and classier way to handle this?

You know what? BigQuery’s got you covered! BigQuery has a feature that can help you with this task: search indexes. Search indexes let you use GoogleSQL to easily find unique data elements that are buried in unstructured text and semi-structured JSON data, without having to know the table schemas in advance. And the best part is, BigQuery stores and manages the indexes for you, so you don’t have to worry about the maintenance or performance overhead.

What are search indexes?

Search indexes are special structures that BigQuery creates and maintains for your tables or columns that contain text or JSON data. Search indexes enable efficient row lookups when you need to find individual rows of data that match a set of search terms. A common use case is log analytics. For example, you might want to identify the rows of data associated with a user for GDPR reporting, or to find specific error codes in a text payload.

Search indexes work with a function called SEARCH, which takes two arguments: the data to search over, and the query to search for. The SEARCH function returns TRUE if all tokens in the query appear in the data based on the tokenization described in the text analyzer, and FALSE otherwise. If the query is NULL or doesn’t contain any search terms, an error is thrown.

The SEARCH function is designed to work with search indexes to optimize point lookups. Although the SEARCH function works for columns of a table that are not indexed, its performance will be greatly improved with a search index.

How to create a search index?

To create a search index, you need to use the CREATE INDEX statement in GoogleSQL. The syntax for creating a search index is:

CREATE INDEX index_name ON table_name (column_name [,…]) [OPTIONS (option_list)]

where index_name is the name of the index, table_name is the name of the table where you want to create the index, column_name is the name of the column that contains text or JSON data, and option_list is a comma-separated list of options for the index.

The options for creating a search index are:

analyzer:

  • This option specifies the text analyzer to use for tokenizing and normalizing the text data. You can choose between NO_OP_ANALYZER and LOG_ANALYZER.
  • The NO_OP_ANALYZER does not apply any tokenization or normalization to the text, and is useful when you have pre-processed data that you want to match exactly.
  • The LOG_ANALYZER applies some rules for tokenizing and normalizing the text, such as lowercasing, removing punctuation, splitting on whitespace, etc.
  • The default analyzer is LOG_ANALYZER.

json_scope:

  • This option specifies the scope of JSON data to be searched. You can choose between JSON_VALUES, JSON_KEYS, and JSON_KEYS_AND_VALUES.
  • The JSON_VALUES option searches only the JSON values, the JSON_KEYS option searches only the JSON keys, and the JSON_KEYS_AND_VALUES option searches both the JSON keys and values.
  • The default scope is JSON_VALUES.

refresh_interval:

  • This option specifies how often BigQuery refreshes the index with new data from the table.
  • You can choose between MANUAL and AUTO. The MANUAL option requires you to manually refresh the index using the REFRESH INDEX statement whenever new data is added or updated in the table.
  • The AUTO option automatically refreshes the index at regular intervals based on BigQuery’s internal logic.
  • The default interval is AUTO.

Here is an example of creating a search index on a table called logs that contains text and JSON columns:

CREATE INDEX logs_index ON 
logs (message, payload)
OPTIONS (analyzer='LOG_ANALYZER', json_scope='JSON_KEYS_AND_VALUES');

This statement creates an index called logs_index on the message and payload columns of the logs table, using the LOG_ANALYZER and searching both JSON keys and values.

How to use the SEARCH function?

To use the SEARCH function, you need to write a GoogleSQL query that calls the SEARCH function with two arguments: the data to search over, and the query to search for. The data can be a string literal, a list of columns, or a table reference. A table reference is evaluated as a struct whose fields are the columns of the table. A column reference must be one of the following types: ARRAY<STRING>, ARRAY<STRUCT>, JSON, STRING, or STRUCT. The query must be a string literal that represents the terms of the search query.

Examples of using the SEARCH function

Here are some examples of using the SEARCH function with different data types and queries:

Find the rows that contain the word ‘error’ in the message column -

SELECT * FROM logs WHERE SEARCH(message, 'error');

Find the rows that contain the words ‘user’ and ‘123’ in any of the columns -

SELECT * FROM logs WHERE SEARCH(logs, 'user 123');

Find the rows that contain the key ‘status’ and the value ‘200’ in the payload column -

SELECT * FROM logs WHERE SEARCH(payload, 'status 200');

Find the rows that contain the exact phrase ‘Hello world’ in the message column -

SELECT * FROM logs WHERE SEARCH(message, '"Hello world"', analyzer=>'NO_OP_ANALYZER');

The SEARCH function returns TRUE or FALSE for each row, depending on whether it matches the query or not. You can use the SEARCH function in any clause that accepts a boolean expression, such as WHERE, HAVING, CASE, etc.

Why should you use search indexes and the SEARCH function?

Now that you know what search indexes and the SEARCH function are and how to use them, you may wonder why you should use them in your queries. Here are some reasons why these features can make your life easier:

  • They are simple and concise — You can write a single line of code instead of a complex regular expression or JSON function.
  • They are expressive and intuitive — You can clearly state what you want to find in your data without worrying about how to find it.
  • They are flexible and powerful — You can use them with any data type and any query term. You can also specify different options for the index and the analyzer.
  • They are fast and efficient — You can leverage BigQuery’s search indexes to speed up your point lookups and reduce your query costs.

Conclusion

In this story, we learned about the search indexes and the SEARCH function in BigQuery, which are exclusive features that let you easily find unique data elements that are buried in unstructured text and semi-structured JSON data, without having to know the table schemas in advance. We saw some examples of how to create a search index and how to use the SEARCH function with different data types and queries. We also discussed some reasons why these features can make your life easier when writing SQL queries.

I hope you liked this story and picked up something fresh. If you’re into diving deeper into BigQuery and other Google Cloud stuff, just head over to their official docs. To catch more stories like this, make sure to follow me on Medium. And if you’ve got any questions or thoughts, don’t hesitate to drop a comment down below. Keep searching 🔍, and oh, don’t forget to connect with me on LinkedIn!

--

--