Exploring BigQuery alternatives for the LIKE statement

Jash Bhatt
8 min readMar 30, 2023

Introduction

When it comes to string matching in BigQuery, the LIKE statement is a commonly used operator. However, it can be slow and inefficient when working with large datasets. Fortunately, BigQuery provides several functions that can be used as alternatives to the LIKE statement. In this story, we’ll explore some BigQuery alternatives to the LIKE statement that can help you speed up your queries and get more accurate results. We will see examples for each function for better understanding.

String Search

CONTAINS_SUBSTR

This function in BigQuery is used to test whether a string contains a substring. The function returns a boolean value (true or false) depending on whether the substring is found or not. It is a case-insensitive function.

Syntax

CONTAINS_SUBSTR(string, substring)

Parameters

  • string: The input string to be searched.
  • substring: The substring to be matched.

Following are the examples:

Example 1

SELECT CONTAINS_SUBSTR('Hello, World!', 'world') AS match_found

In this example, we’re testing whether the string ‘Hello, World!’ contains the substring ‘World’. The output of this query would be: TRUE.

Example 2

SELECT *
FROM products
WHERE CONTAINS_SUBSTR(description, 'organic');

In this example, we’re filtering the rows of the products table where the description column contains the substring "organic".

STARTS_WITH

This function in BigQuery is used to test whether a string starts with a specific prefix. The function returns a boolean value (true or false) depending on whether the prefix is found at the beginning of the string or not. It is a case-sensitive function.

Syntax

STARTS_WITH(string, prefix)

Parameters

  • string: the input string to search
  • prefix: the prefix to search for in the input string

Following are the examples:

Example 1

SELECT STARTS_WITH('Hello, World!', 'Hello') AS match_found

In this example, we’re testing whether the string ‘Hello, World!’ starts with the prefix ‘Hello’. The output of this query would be: TRUE.

Example 2

SELECT *
FROM users
WHERE STARTS_WITH(username, 'admin');

In this example, we’re filtering the rows of the users table where the username column starts with the substring "admin".

ENDS_WITH

This function in BigQuery is used to test whether a string ends with a specific suffix. The function returns a boolean value (true or false) depending on whether the suffix is found at the end of the string or not. It is a case-sensitive function.

Syntax

STARTS_WITH(string, suffix)

Parameters

  • string: the input string to search
  • suffix: the suffix to search for in the input string

Following are the examples:

Example 1

SELECT ENDS_WITH('Hello, World!', 'World!') AS match_found

In this example, we’re testing whether the string ‘Hello, World!’ ends with the suffix ‘World!’. The output of this query would be: TRUE.

Example 2

SELECT *
FROM orders
WHERE ENDS_WITH(order_id, '2022');

In this example, we’re filtering the rows of the orders table where the order_id column ends with the substring "2022".

CONTAINS_SUBSTR, STARTS_WITH and ENDS_WITH is generally faster than the LIKE statement for large datasets since it does not use wildcards and pattern matching.

REGEXP_CONTAINS

This function in BigQuery is used to test whether a string contains a regular expression pattern. The function returns a boolean value (true or false) depending on whether the pattern is found or not.

Syntax

REGEXP_CONTAINS(string, regex)

Parameters

  • string: The input string to be searched.
  • regex: The regular expression to be matched.

Following are the examples:

Example 1

SELECT REGEXP_CONTAINS('Hello, World!', r'^Hello.*$') AS match_founds

Here is the breakdown of the function for better understanding.

  • 'Hello, World!': is the input string that we want to check for a match against the regular expression pattern.
  • r'^Hello.*$': is the regular expression pattern that we are using to check whether the input string starts with the substring "Hello" and contains any number of characters thereafter.

The regular expression pattern ^Hello.*$ has the following meaning:

  • ^: matches the beginning of the input string.
  • Hello: matches the substring "Hello".
  • .*: matches any number of characters (except newline characters).
  • $: matches the end of the input string.

Therefore, if the input string matches the regular expression pattern, the function returns TRUE. Otherwise, it returns FALSE.

Example 2

SELECT REGEXP_CONTAINS('123456789', r'^\d+$') AS match_found

In this example, we’re testing whether the string ‘123456789’ contains only digits. The regular expression pattern ‘^\d+$’ matches one or more digits at the beginning and end of the string. The output of this query would be: TRUE

REGEXP_EXTRACT_ALL

This function in BigQuery is used to extract all occurrences of a regular expression pattern from a string. The function returns an array of strings containing all the matches.

Syntax

REGEXP_EXTRACT_ALL(string, regex)

Parameters

  • string: The input string to be searched.
  • regex: The regular expression to be matched.

Following are the examples:

Example 1

SELECT REGEXP_EXTRACT_ALL('apples, oranges, bananas', r'\w+') AS fruits

In this example, we’re extracting all the words from the string ‘apples, oranges, bananas’. The regular expression pattern ‘\w+’ matches one or more word characters (letters, digits, or underscores).

The output of this query would be:

fruits
[“apples”, “oranges”, “bananas”]

Example 2

SELECT REGEXP_EXTRACT_ALL('My email is john@example.com and my coworker\'s email is jane@example.com', r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}') as email_address

In this example, we’re extracting all occurrences of email addresses in the given string using a regular expression pattern that matches typical email address formats.

Here is the breakdown of the function for better understanding.

  • 'My email is john@example.com and my coworker\'s email is jane@example.com': is the string that we are searching for occurrences of the regular expression pattern.
  • r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}': is the regular expression pattern that we are searching for in the string.

The regular expression pattern is used to match email addresses. Here is a breakdown of each part of the pattern:

  • [A-Za-z0-9._%+-]+: matches one or more characters that are letters, digits, or any of the special characters '.', '_', '%', '+', or '-'.
  • @: matches the '@' symbol.
  • [A-Za-z0-9.-]+: matches one or more characters that are letters, digits, or any of the special characters '.', or '-'.
  • \.: matches a period '.' character.
  • [A-Z|a-z]{2,}: matches two or more characters that are either upper or lowercase letters.

The r before the pattern specifies that it is a "raw" string, which means that backslashes are treated as literal backslashes instead of being used to escape special characters.

The output of this query would be:

email_address
[“john@example.com”, “jane@example.com”]

REGEXP_EXTRACT

This function in BigQuery is used to extract a specific string format from the given input string using a regular expression pattern that matches that format. The function returns a string containing the first match.

Syntax

REGEXP_EXTRACT(string, regexp[, position[, occurrence]])

Parameters

  • string: The input string to be searched.
  • regex: The regular expression to be matched.
  • position: This optional parameter specifies the position within the stringto start the search. If not provided, the search starts at the beginning of the string.
  • occurrence: This optional parameter specifies occurrence of the regexp in string. The default value is 1, which means the first occurrence. If occurrence is greater than the number of matches found, NULL is returned. For occurrence > 1, the function searches for additional occurrences beginning with the character following the previous occurrence.

Following are the examples:

Example 1

SELECT REGEXP_EXTRACT('Hello, World!', r'^Hello, (\w+)!') AS name

In this example, we’re extracting the name from the string ‘Hello, World!’ using the regular expression pattern ‘^Hello, (\w+)!’, which matches the word following the greeting. The output of this query would be: World.

Example 2

SELECT REGEXP_EXTRACT('Product: Apple iPhone 12 Pro Max', r'Apple iPhone \d{1,2} Pro Max')

In this example, the function will return the string “Apple iPhone 12 Pro Max” because it is the first occurrence of the pattern in the input string “Product: Apple iPhone 12 Pro Max”.

Here is the breakdown of the function for better understanding.

  • 'Product: Apple iPhone 12 Pro Max': is the string that we are searching for an occurrence of the regular expression pattern.
  • r'Apple iPhone \d{1,2} Pro Max': is the regular expression pattern that we are searching for in the string.

The regular expression pattern is used to match a specific string format. Here is a breakdown of each part of the pattern:

  • Apple iPhone: matches the literal string "Apple iPhone".
  • \d{1,2}: matches one or two digits.
  • Pro Max: matches the literal string "Pro Max".

Therefore, the regular expression pattern matches a string that starts with “Apple iPhone”, followed by one or two digits, and then followed by the string “Pro Max”.

Example 3

SELECT REGEXP_EXTRACT('video videoo and videooo', 'v?ideo+', 1, 2) second_occurrence; 

SELECT REGEXP_EXTRACT('video videoo and videooo', 'v?ideo+', 1, 3) third_occurrence;

In the above query, two regular expression patterns are used with the REGEXP_EXTRACT function to extract the second and third occurrences of the pattern in the input string.

The regular expression'v?ideo+'matches any string that has zero or one v followed by one or more i, d, e, and ocharacters. The ? sign indicates that the preceding v character is optional, and the + sign indicates that the pattern i, d, e, and oshould occur one or more times.

The position parameter is set to 1, which means the function will start searching for the pattern from the beginning of the string. The occurrence parameter is set to 2, which means the function will return the second occurrence of the pattern. The output of this query will be the second occurrence of the pattern in the input string, which is 'videoo'.

The second pattern 'v?ideo+' is similar to the first pattern and matches the same letters, but the occurrence parameter is set to 3. The output of this query will be the third occurrence of the pattern in the input string, which is 'videooo'.

Overall, this example demonstrates how the position and occurrence parameters of the REGEXP_EXTRACT function can be used to extract specific occurrences of a regular expression pattern from a string.

Conclusion

When working with string matching functions in BigQuery, it’s important to keep in mind that the performance of these functions can be affected by the size of the dataset being queried. To optimize query performance, it’s recommended to use functions like CONTAINS_SUBSTR, ENDS_WITH, and STARTS_WITH that are optimized for performance when working with large datasets. Additionally, using regular expressions can be a powerful tool for more complex string matching, but it’s important to balance the complexity of the regular expression with the performance impact on the query.

Finally, when using string matching functions in BigQuery, it’s important to test the performance of your queries and adjust your approach as needed to ensure the best possible performance. Each of these functions has its own strengths and weaknesses, and the choice of which function to use will depend on the specific requirements of the query. By understanding these functions and their capabilities, you can optimize your queries and improve their performance.

If you want to learn more about BigQuery and its features, you can check out my other stories here:

https://medium.com/@jashbhatt776/data-time-machine-unearthing-history-with-bigquerys-time-travel-feature-5ab9a6707dcf

https://medium.com/@jashbhatt776/optimizing-performance-in-gcp-bigquery-bfd66647c0e7

--

--