If you work with strings a lot, you most likely will use substring
function quite frequently for your data manipulation tasks. PostgreSQL provides a very versatile substring function that can handle a lot of string manipulation scenarios. In this blog, we’ll dive into real examples to show you how to make the most of substrings in PostgreSQL 16.
PostgreSQL Substr With Index and Offset Values
Index (one-based) and offset are the simplest ways to extract a substring. Useful for strings with fixed lengths. PostgreSQL provides several methods to specify index bounds.
Set Index Bounds as Arguments
Start index and offset values can be provided to substring function as argument 2 and 3 where offset is an optional argument. For example:
Returns the substring starting at index 1 to 4:
SELECT substring('techbuddies', 1, 4); substring ----------- tech (1 row)
Returns the substring from index 5 to the end:
SELECT substring('techbuddies', 5); substring ----------- buddies (1 row)
Set Index Bounds with FROM and FOR Clauses
The index bounds can also be set using FROM
and FOR
clauses; Both are optional of each other, meaning that you can specify FROM
without FOR
or FOR
without FROM
. Otherwise they work the same way as arguments. It is also possible to use LENGTH()
operator to return the length of the input string, which is handy if you need to compute an index from the end of the string. For example:
Extracts the substring starting at index 1 to 4.
SELECT substring('techbuddies' FROM 1 FOR 4); substring ----------- tech (1 row)
Returns the first 5 characters as substring:
SELECT substring('techbuddies.io' FOR 5); substring ----------- techb (1 row)
Extracts the last 7 characters from the string and grabs the first 4 as substring
SELECT substring('techbuddies.io' FROM LENGTH('techbuddies.io') - 6 for 4); substring ----------- dies (1 row)
PostgreSQL Substr With Regular Expressions
PostgreSQL supports substring extraction using POSIX regular expressions metacharacters. A metacharacter is a special character or character sequence that have specific meanings and functions to define patterns to match and manipulate text. The following list outlines the most commonly used metacharacters:
- . (Dot): Matches any single character except a newline character. For example, the pattern b.t would match “bat,” “bit,” “bot,” etc.
- * (Asterisk): Matches zero or more occurrences of the preceding character or subexpression. For example, ab*c would match “ac,” “abc,” “abbc,” etc.
- + (Plus): Matches one or more occurrences of the preceding character or subexpression. For example, ab+c would match “abc,” “abbc,” “abbbc,” etc.
- ? (Question Mark): Matches zero or one occurrence of the preceding character or subexpression. For example, ab?c would match “ac” and “abc.”
- [] (Character Class): Defines a character class and matches any single character from the specified set. For example, [aeiou] would match any vowel character.
- [^] (Negation): Defines a negated character class and matches any single character not in the specified set. For example, [^0-9] would match any non-digit character.
- () (Grouping): Groups characters or subexpressions together and allows you to apply quantifiers to the entire group. For example, (ab)+ would match “ab,” “abab,” “ababab,” etc.
- | (Alternation): Specifies alternation between two or more patterns. For example, cat|dog would match “cat” or “dog.”
- ^ (Caret): Matches the start of a line or string. For example, ^abc would match “abc” at the start of a line.
- $ (Dollar Sign): Matches the end of a line or string. For example, abc$ would match “abc” at the end of a line.
Match Words with One or More Dots
Extracts a substring starting with b
plus 10 characters following it:
SELECT substring('techbuddies.io', 'b.........'); substring ------------ buddies.io (1 row)
Matches a substring starting with t
and ending with h
inclusive with arbitrary number of characters in between:
SELECT substring('techbuddies.io', 't.*h'); substring ----------- tech (1 row)
Use * and + Operators to Match Word Repetition
Extracts a substring that matches techbud
with zero or more
repetition of d
following after. Returns full string even if input string has only one d
in techbuddies.io
.
select substring('techbudies.io', 'techbudd*.*'); substring --------------- techbudies.io (1 row)
Returns a substring that matches techbud
with one or more
repetition of d
following after. Returns NULL because input string has only one d
in techbudies.io
.
select substring('techbudies.io', 'techbudd+.*'); substring ----------- (1 row)
Set Match Constraints with $ and ^ Operators
Prints the full string if the string starts with tech
select substring('techbuddies.io', '^tech.*'); substring ----------------- techbuddies.io (1 row)
Returns io
plus 8 chars before if the string ends with io
.
select substring('techbuddies.io', '........io$'); substring ------------ buddies.io (1 row)
Set Character Constraints with [abcd] and [^abcd] Operators
[abcd]
and [^abcd]
can match or un-match any character or range of characters (separated by a hyphen) within the square brackets. It also support digits.
Return substring starting at the first lowercase or uppercase letters and stops until the next non-letter character.
select substring('techbuddies.io', '[a-zA-Z]+'); substring ------------- techbuddies (1 row)
Return substring starting at the first non-letter character and stops until the next the next character.
postgres=# select substring('techbuddies.io', '[^a-zA-Z]+'); substring ----------- . (1 row)
Chain Multiple Match Conditions with | Operator
The |
operator can be used to define a conditional match utilizing other operators discussed so far. Please note that the first
or longest
match of the string is always returned regardless the order of conditions separated by |
.
Return substring if the string contains either tech
or buddies
select substring('techbuddies.io', '(tech|buddies)'); substring ----------- tech (1 row)
Return first digit-character as substring or if the string matches one or more occurrence of bud
or return substring starting at the first lowercase or uppercase letters and ends before a non-letter character is met.
select substring('techbuddies.io', '^[1-9]|bud+|tech|[a-z]+'); substring ------------- techbuddies (1 row)
Extract a Sub Expression with () Operator
The ()
brackets can be used to extract a sub-expression to fine-tune substring output. This is very handy in many substring scenarios.
Without brackets, the function returns the whole string:
select substring('techbuddies.io', '^tech.*dies'); substring ------------- techbuddies (1 row)
With brackets, the function returns the substring inside the brackets:
select substring('techbuddies.io', '^tech(.*)dies'); substring ----------- bud (1 row)
Other Examples
Extract the website name from a URL string:
select substring('https://techbuddies.io/home', '://(.*)/'); substring ---------------- techbuddies.io (1 row)
Match a value after certain character:
select substring('a=techbuddies.io', '=(.*)buddies'); substring ----------- tech (1 row)
Obtain the value from a JSON key-value pair:
select substring('"key":"value"', '"key":"(.*)"'); substring ----------- value (1 row)
PostgreSQL Substr With Regular Expressions Class-Shorthand Escapes
PostgreSQL also supports substring extraction using class-shorthand escapes, which are very similar to common regular expressions described above but using special escape character sequence to represent. The tables below are taken directly from PostgreSQL documentation to showcase all supported escape character sequences.
Escape | Description |
---|---|
\d | matches any digit, like [[:digit:]] |
\s | matches any whitespace character, like [[:space:]] |
\w | matches any word character, like [[:word:]] |
\D | matches any non-digit, like [^[:digit:]] |
\S | matches any non-whitespace character, like [^[:space:]] |
\W | matches any non-word character, like [^[:word:]] |
Escape | Description |
---|---|
\A | matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^ ) |
\m | matches only at the beginning of a word |
\M | matches only at the end of a word |
\y | matches only at the beginning or end of a word |
\Y | matches only at a point that is not the beginning or end of a word |
\Z | matches only at the end of the string (see Section 9.7.3.5 for how this differs from $ ) |
Match with Character Constraint Escapes
Extracts the first word character sequence and stops until the next non-word character (such as a space or a digit):
SELECT substring('welcome 123 techbuddies.io 456' FROM '(\w+)'); substring ----------- welcome (1 row)
Extracts the first digit character sequence and stops until the next non-digit character (such as a space or a word):
SELECT substring('welcome 123 techbuddies.io 456' FROM '(\d+)'); substring ----------- 123 (1 row)
Un-Match with Character Constraint Escapes
Returns the first non-digit character sequence as substring
SELECT substring('welcome 123 techbuddies.io 456' FROM '(\D+)'); substring ----------- welcome (1 row)
Returns the first non-word and non-space character sequence as substring.
SELECT substring('welcome 123 techbuddies.io 456' FROM '(\W\S+)'); substring ----------- 123 (1 row)
Summary
PostgreSQL substring is a simple but yet versatile function that can save you a lot of development efforts on string manipulations. I hope the examples used in this blog can provide you with the necessary information to simply your string manipulation efforts.
Reference
- PostgreSQL: Documentation: 9.1: String Functions and Operators
- PostgreSQL: Documentation: 16: 9.7. Pattern Matching
- Pattern-matching meta characters – IBM Documentation
Hi, this is Cary, your friendly tech enthusiast, educator and author. Currently working as a software architect at Highgo Software Canada. I enjoy simplifying complex concepts, diving into coding challenges, unraveling the mysteries of software. Most importantly, I like sharing and teaching others about all things tech. Find more blogs from me at highgo.ca
Pingback: Cary Huang: Explore Practical PostgreSQL Substring Use Cases With Examples - RSSFeedsCloud
Why do you use “.*?” somewhere in examples. Isn’t ? useless?
Thanks for the comment, yeah the ? is not necessary there. Corrected. Thank you!
You missed an important quantifier in your example : the brackets {} quantifier. The example that you gave of matching words with one or more dots:
SELECT substring(‘techbuddies.io’, ‘b………’);
would normally be more succinctly expressed as SELECT substring(‘techbuddies.io’, ‘b.{10}’);
That is much less error-prone than stringing a bunch of dots together.
Your next example states:
Matches a substring starting with t and ending with h inclusive with exactly 2 characters in between:
SELECT substring(‘techbuddies.io’, ‘t.*h’);
That is just wrong. The asterick, as stated in the documentation you posted, matches zero or more characters. Perhaps you meant SELECT substring(‘techbuddies.io’, ‘t..h’);
Thank you for pointing out. You are right, SELECT substring(‘techbuddies.io’, ‘b.{10}’); is a much better way to do the substring. Yes, this example “SELECT substring(‘techbuddies.io’, ‘t.*h’);” does not match the description. I changed the example but forgot to update the description :p
I ran “SELECT substring(‘techbuddies.io’, ‘b.{10}’);” on PostgreSQL 16 but it returns empty instead. Am I missing something here?
I also tried “SELECT substring(‘techbuddies.io’, strpos(‘techbuddies.io’, ‘b’), 10);” which returns the same thing, but the command is quite lengthy…
I’m not convinced this is 100% correct:
“Matches a substring starting with t and ending with h inclusive with exactly 2 characters in between:”
SELECT substring(‘techbuddies.io’, ‘t.*h’);
substring
———–
tech
(1 row)
According to regex101.com, .* will match any character, 0 to unlimited times. See here for examples: https://regex101.com/r/FBTZLy/1
yeah, that example does not match the description. I will correct it. Thank you!