Mastering Regular Expressions in BigQuery

Jash Bhatt
5 min readOct 22, 2023

Introduction

Regular expressions, or regex for short, are a powerful way to search, match, and manipulate text data in BigQuery. Regex can help you perform tasks such as validating email addresses, extracting phone numbers, replacing words, and more. In this article, I will explain the basic syntax of regex and show you some common use cases in BigQuery.

Syntax of Regex

The syntax of regex consists of two parts: a pattern and a modifier. A pattern is a sequence of characters that defines the search criteria. A modifier is an optional flag that modifies the behavior of the pattern. For example, the modifier i makes the pattern case-insensitive, while the modifier g makes the pattern global, meaning it will match all occurrences of the pattern in the input.

A pattern can contain literal characters, such as abc, or special characters, such as . or *. Literal characters match themselves in the input, while special characters have special meanings and functions. For example, the special character . matches any single character, while the special character * matches zero or more occurrences of the preceding character. You can use parentheses () to group characters together and form subpatterns. You can also use brackets [] to create a character class, which matches any one of the characters inside the brackets. For example, [a-z] matches any lowercase letter from a to z.

Functions for Regex in BigQuery

BigQuery provides several functions that allow you to use regex in your queries. These functions are:

  • REGEXP_CONTAINS(value, regexp): This function returns true if the value matches the regexp pattern, and false otherwise.
  • REGEXP_EXTRACT(value, regexp): This function returns the first substring of the value that matches the regexp pattern. If no match is found, it returns null.
  • REGEXP_EXTRACT_ALL(value, regexp): This function returns an array of all substrings of the value that match the regexp pattern. If no match is found, it returns an empty array.
  • REGEXP_INSTR(value, regexp): This function returns the position of the first occurrence of the regexp pattern in the value. If no match is found, it returns 0.
  • REGEXP_REPLACE(value, regexp, replacement): This function replaces all occurrences of the regexp pattern in the value with the replacement string. If no match is found, it returns the original value.
  • REGEXP_SUBSTR(value, regexp): This function is a synonym for REGEXP_EXTRACT and works in the same way.

Examples of Regex in BigQuery

To illustrate how to use these functions, let’s look at some examples using a sample table called customers, which contains information about customers’ names, emails, and phone numbers.

Table : customers

Validating Emails

Suppose we want to validate if all emails in the table have a valid format. We can use a regex pattern that matches a typical email format: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. This pattern consists of several parts:

  • [a-zA-Z0-9._%+-]+: This part matches one or more alphanumeric characters or symbols such as ., _, %, +, or -. This represents the local part of an email address before the @ sign.
  • @: This part matches the literal character @.
  • [a-zA-Z0-9.-]+: This part matches one or more alphanumeric characters or symbols such as ., or -. This represents the domain part of an email address after the @ sign.
  • \.: This part matches the literal character ., which is escaped by a backslash \ because it is a special character in regex.
  • [a-zA-Z]{2,}: This part matches two or more alphabetic characters. This represents the top-level domain part of an email address after the last dot.

We can use this pattern with the REGEXP_CONTAINS function to check if each email in the table is valid or not. The query would look like this:

SELECT name, email,
REGEXP_CONTAINS(email,'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') AS valid_email
FROM customers;

The result would be:

As you can see, all emails in the table are valid according to the regex pattern.

Extracting and Formatting Phone Numbers

Suppose we want to extract the phone numbers from the table and remove the country code and the parentheses. We can use a regex pattern that matches a typical phone number format: \+\d \(\d{3}\) \d{3}-\d{4}. This pattern consists of several parts:

  • \+: This part matches the literal character +, which is escaped by a backslash \ because it is a special character in regex.
  • \d: This part matches any single digit from 0 to 9.
  • : This part matches a single space character.
  • \( and \): These parts match the literal characters ( and ), which are escaped by backslashes \ because they are special characters in regex.
  • \d{3}: This part matches exactly three digits. The curly braces {} specify the exact number of repetitions of the preceding character.
  • -: This part matches the literal character -.
  • \d{4}: This part matches exactly four digits.

We can use this pattern with the REGEXP_EXTRACT function to extract the phone numbers from the table. We can also use parentheses () to create capture groups, which are sub-patterns that can be referenced later. For example, we can use (\d{3}) to capture the area code of the phone number, and use $1 to refer to it in the replacement string. The query would look like this:

SELECT name, phone,
REGEXP_EXTRACT(phone,'\+\d \((\d{3})\) (\d{3}-\d{4})') AS extracted_phone,
REGEXP_REPLACE(phone,'\+\d \((\d{3})\) (\d{3}-\d{4})','$1-$2') AS formatted_phone
FROM customers;

The result would be:

As you can see, we have extracted and formatted the phone numbers using regex.

Conclusion

These are just some examples of how to use regex in BigQuery. There are many more possibilities and applications of regex that you can explore.

I hope this article has helped you understand and master regular expressions in BigQuery. Regex can be a very useful tool for data analysis and manipulation, so I encourage you to practice and experiment with it. Happy querying! 😊

--

--