GIANT ROBOTS SMASHING INTO OTHER GIANT ROBOTS

Written by thoughtbot

Back to Basics: Regular Expressions

Regular expressions have been around since the early days of computer science. They gained widespread adoption with the introduction of Unix. A regular expression is a notation for describing sets of character strings. They are used to identify patterns within strings. There are many useful applications of this functionality, most notably string validations, find and replace, and pulling information out of strings.

Regular expressions are just strings themselves. Each character in a regular expression can either be part of a code that makes up a pattern to search for, or it can represent a letter, character or word itself. Let’s take a look at some examples.

Basics

First let’s look at an example of a regular expression that is made up of only actual characters and none of the special characters or patterns that generally make up regular expressions.

To get started let’s fire up irb and create our regular expression:

> regex = /back to basics/
 => /back to basics/

Notice we create a regular expression by entering a pattern between two forward slashes. The pattern we’ve used here will only match strings that contain the string ‘back to basics’. Let’s use the match method, which gives us information about the first match it finds, to look at some examples of what matches and what doesn’t:

> regex.match('basics to back')
 => nil

We’re getting close, but nothing in this string matches our regular expression, so we get nil.

> regex.match('i enjoyback to basics')
 => <MatchData "back to basics">

After an unsuccessful attempt we have a match. Notice that our regular expression matched even though there are no spaces between the pattern and the words before it.
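When all we need is a yes-or-no answer or a position, Ruby has lighter-weight alternatives to match. The snippet below is a small sketch rather than part of the session above, and it assumes Ruby 2.4 or newer, where match? is available:

```ruby
regex = /back to basics/

# match? returns true or false without building a MatchData object
found   = regex.match?('i enjoyback to basics')   # => true
missing = regex.match?('basics to back')          # => false

# =~ returns the index where the match starts, or nil when nothing matches
position    = 'i enjoyback to basics' =~ regex    # => 7
no_position = 'basics to back' =~ regex           # => nil
```

Reach for match when you need details about the match itself, and for match? or =~ when you only need to know whether, or where, the pattern occurs.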

MatchData

The object returned from the Regexp object’s match method is of type MatchData. This object can tell us all sorts of things about a particular match. Let’s take a look at some of the information we can get about our match.

We can use the begin method to find out the offset of the beginning of our match in the original string:

> match = regex.match('i enjoyback to basics')
 => <MatchData "back to basics">

> match.begin(0)
 => 7

> 'i enjoyback to basics'[7]
 => "b"

The argument we pass to begin specifies which capture we’re asking about (captures are covered below); 0 refers to the entire match. In the example above, begin tells us that our match starts at index 7 in our string. As we can see from the code, the 8th character in the string (index 7) is ‘b’, the first letter of our match.

Similarly we can get the index of the character following the end of our match using the end method:

> match.end(0)
 => 21

> 'i enjoyback to basics'[21]
 => nil

In this case we get nil since the end of our match is also the end of our string.

We can also use the to_s method to print our match:

> match.to_s
 => "back to basics"
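MatchData has a few more methods in the same spirit. As a quick sketch using the same string as above, offset returns the begin and end values together, while pre_match and post_match return the text on either side of the match:

```ruby
match = /back to basics/.match('i enjoyback to basics')

both_offsets = match.offset(0)    # => [7, 21], begin and end in one call
before       = match.pre_match    # => "i enjoy", the text before the match
after        = match.post_match   # => "", nothing follows the match here
```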

Patterns

The regular expression’s real power becomes obvious when we introduce patterns. Let’s take a look at some examples.

Metacharacters

A metacharacter is any character that has a special meaning within a regular expression. Let’s start with something simple: say we want to find out if our string contains a number. This requires our first pattern, \d, a metacharacter that matches any digit:

> string_to_match = 'back 2 basics'

> regex = /\d/
 => /\d/

> regex.match(string_to_match)
 => <MatchData "2">

Our regular expression matches the number 2 in our string.

Character Classes

Let’s say we wanted to find out if any of the letters from ‘k’ to ‘s’ were in our string. This will require we use a character class. A character class lets us specify a list of characters or patterns that we’re looking for:

> string_to_match = 'i enjoy making stuff'

> regex = /[klmnopqrs]/
 => /[klmnopqrs]/

> regex.match(string_to_match)
 => <MatchData "n">

In this example we can see we entered all the letters of the alphabet we were interested in between the brackets, and the first instance of any of those characters results in a match. We can simplify the above regular expression by using a range. This is done by entering two characters or numbers separated by a -:

> string_to_match = 'i enjoy making stuff'

> regex = /[k-s]/
 => /[k-s]/

> regex.match(string_to_match)
 => <MatchData "n">

As expected, we get the same results with our simplified regular expression.

It’s also possible to invert a character class. This is done by adding a ^ to the beginning of the class. If we wanted to look for the first character not in between ‘k’ and ‘s’ we would use the pattern /[^k-s]/:

> string_to_match = 'i enjoy making stuff'

> regex = /[^k-s]/
 => /[^k-s]/

> regex.match(string_to_match)
 => <MatchData "i">

Since ‘i’ isn’t in our range the first letter in our string meets the criteria our regular expression specified.

Another thing worth noting is the \d character we used above is an alias for the character class [0-9].
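To see the equivalence for ourselves, here is a small sketch with a made-up sample string. It borrows scan, which returns every match, and String#[] with a regular expression, which returns the first matched text:

```ruby
sample = 'Room 42, floor 7'   # hypothetical sample text

digits_shorthand = sample.scan(/\d/)      # => ["4", "2", "7"]
digits_class     = sample.scan(/[0-9]/)   # => ["4", "2", "7"]

# \D is the inverse of \d, just as [^0-9] is the inverse of [0-9]
first_non_digit = sample[/\D/]            # => "R"
```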

Modifiers

We have the ability to set a regular expression’s matching mode via modifiers. In Ruby this is done by appending characters after the regular expression pattern is defined. A particularly useful matching modifier is the case insensitive modifier i. Let’s take a look:

> string_to_match = 'BACK to BASICS'

> regex = /back to basics/i
 => /back to basics/i

> regex.match(string_to_match)
 => <MatchData "BACK to BASICS">

The regular expression matches our string in spite of the fact that the cases are clearly not the same. We’ll look at another common modifier later on in the blog.
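Modifiers can also be supplied when a pattern is built from a string. As a minimal sketch, Regexp.new takes the pattern source plus option constants such as Regexp::IGNORECASE, which corresponds to the i modifier:

```ruby
literal = /back to basics/i
built   = Regexp.new('back to basics', Regexp::IGNORECASE)

same_source   = literal.source == built.source   # => true
literal_fold  = literal.casefold?                # => true
built_fold    = built.casefold?                  # => true
built_matches = built.match?('BACK to BASICS')   # => true
```

This form is handy when the pattern text isn’t known until runtime.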

Repetitions

Repetitions give us the ability to look for repeated patterns. We can search broadly for patterns that repeat an indeterminate number of times, or we can get as granular as the exact number of repetitions we’re looking for.

Let’s try to identify all the numbers in a string again:

> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'

> regex = /\d/
 => /\d/

> regex.match(string_to_match)
 => <MatchData "2">

Because we used only a single \d we only got the first digit, in this case ‘2’. What we’re actually looking for is the entire number, not just the first digit. We can fix this by modifying our pattern. We need to specify a pattern that will say find any group of contiguous digits. For this we can use the + metacharacter. This tells the regular expression engine to find one or more of the character or characters that match the previous pattern. Let’s take a look:

> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'

> regex = /\d+/
 => /\d+/

> regex.match(string_to_match)
 => <MatchData "21">
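Keep in mind that match only ever reports the first match. If we really want every number in a string, scan (covered near the end of this post) returns all of them. The sketch below uses made-up sample strings and also shows two sibling metacharacters of +, namely ? for zero or one and * for zero or more:

```ruby
score_line = 'Mavericks 113, Spurs 92'   # hypothetical sample text

first_number = score_line[/\d+/]         # => "113"
all_numbers  = score_line.scan(/\d+/)    # => ["113", "92"]

# ? makes the preceding character optional
spellings = 'color colour'.scan(/colou?r/)     # => ["color", "colour"]

# * allows the preceding character to appear any number of times, including zero
zero_or_more = 'ac abc abbc'.scan(/ab*c/)      # => ["ac", "abc", "abbc"]
```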

We could also look for an exact number of repetitions. Let’s say we only wanted to look for three-digit numbers, those between 100 and 999. One way we could do that would be using the {n} pattern, where n indicates the number of repetitions we’re looking for:

> string_to_match = 'In 30 years the San Francisco Giants have had two 100 win seasons.'

> regex = /\d{3}/
 => /\d{3}/

> regex.match(string_to_match)
 => <MatchData "100">

Our pattern doesn’t match 30, but does match 100 because we told it only three repeating digit characters constituted a match.
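One caveat: \d{3} is also happy to match the first three digits of a longer number. A possible refinement, sketched here with made-up sample strings, is to wrap the pattern in \b word boundaries so that only standalone three-digit numbers qualify:

```ruby
sentence = 'Founded in 2003 with 100 people'   # hypothetical sample text

loose  = sentence[/\d{3}/]        # => "200", the first three digits of 2003
strict = sentence[/\b\d{3}\b/]    # => "100", exactly three digits on their own

# Requiring [1-9] for the first digit also rules out values such as 007
no_leading_zero = 'agent 007 has 150 gadgets'[/\b[1-9]\d{2}\b/]   # => "150"
```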

Let’s look for words that are at least five characters long. This will require a new metacharacter, \w, which matches any word character. Then we’ll use the {n,} pattern, which says look for n or more of the previous pattern:

> string_to_match = 'we are only looking for long words'

> regex = /\w{5,}/
 => /\w{5,}/

> regex.match(string_to_match)
 => <MatchData "looking">

You can also specify an upper bound with {,m} (at most m repetitions) and a range with {n,m} (between n and m repetitions).
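Here is a quick sketch of those two forms. The \b word boundaries in the last pattern are an addition of ours so that short words are matched whole rather than as pieces of longer words:

```ruby
letters = 'aaaaa'

at_most_two = letters[/a{,2}/]     # => "aa"
two_to_four = letters[/a{2,4}/]    # => "aaaa"

# Short words only: one to three word characters between word boundaries
short_words = 'we are only looking for long words'.scan(/\b\w{1,3}\b/)
# => ["we", "are", "for"]
```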

Grouping

Grouping gives us the ability to combine several patterns into one single cohesive unit. This can be very useful when combined with repetitions. Earlier we looked at using repetitions with a single metacharacter \d, but rarely will that be enough to satisfy our needs. Let’s look at how we could define a more complex pattern we expect to see repeated.

Let’s look at how we might create a more complicated regular expression that matches phone numbers in several different formats. We’ll use groups and repetitions to do this:

> phone_format_one = '5125551234'
 => "5125551234"

> phone_format_two = '512.555.1234'
 => "512.555.1234"

> phone_format_three = '512-555-1234'
 => "512-555-1234"

> regex = /(\d{3,4}[\.-]{0,1}){3}/
 => /(\d{3,4}[\.-]{0,1}){3}/

> regex.match(phone_format_one)
 => <MatchData "5125551234" 1:"234">

> regex.match(phone_format_two)
 => <MatchData "512.555.1234" 1:"1234">

> regex.match(phone_format_three)
 => <MatchData "512-555-1234" 1:"1234">

We have successfully created our regular expression, but there is a lot going on there. Let’s break it down. First we define that our pattern will be made up of groups of three or four digits with \d{3,4}. Next we indicate that we want to allow a ‘-’ or ‘.’ separator without requiring one, using [\.-]{0,1}. We escape the ‘.’ because outside a character class it is a metacharacter that acts as a wildcard; inside the brackets the escape is optional but harmless. Finally we say we need three of this group of patterns by grouping the previous two patterns together and applying a repetition of three: (\d{3,4}[\.-]{0,1}){3}. The 1: in the output shows the text captured by the last repetition of the group; captures are covered below.
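If the goal is to validate a whole string rather than find a number somewhere inside it, one option is to anchor the pattern at both ends. The sketch below is one way to do that, not the only one: PHONE and phone_like? are hypothetical names, ? is shorthand for {0,1}, and \A and \z anchor to the start and end of the string.

```ruby
# Same groups as above, anchored so the entire input must fit the pattern
PHONE = /\A(\d{3,4}[.-]?){3}\z/

# Hypothetical helper: true when the whole string looks like a phone number
def phone_like?(text)
  PHONE.match?(text)
end

plain    = phone_like?('5125551234')     # => true
dotted   = phone_like?('512.555.1234')   # => true
dashed   = phone_like?('512-555-1234')   # => true
too_much = phone_like?('call 512-555')   # => false
```

The pattern is still permissive: it would accept a trailing separator or twelve digits in a row, so treat it as a starting point rather than a complete validator.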

Lazy and Greedy

Repetitions in regular expressions are greedy by default, which means they’ll match as much text as possible. Often that isn’t the behavior we’re looking for. When creating our patterns it’s possible to tell Ruby we’re looking for a lazy match, one that stops at the first point where the rest of the pattern is satisfied.

Let’s look at an example. Let’s say we wanted to parse out the timestamp of a log entry. We’ll start out just trying to grab everything in between the square brackets, since we know our log is configured to output the date inside them. In this pattern we’ll use a new metacharacter: the ., a wildcard that matches any character except a newline:

> string_to_match = '[2014-05-09 10:10:14] An error occured in your application. Invalid input [foo] received.'

> regex = /\[.+\]/
 => /\[.+\]/

> regex.match(string_to_match)
 => <MatchData "[2014-05-09 10:10:14] An error occured in your application. Invalid input [foo]">

Instead of matching just the text in between the first two square brackets it grabbed everything between the first instance of an opening square bracket and the last instance of a closing square bracket. We can fix this by telling the regular expression to be lazy using the ? metacharacter. Let’s take another shot:

> string_to_match = '[2014-05-09 10:10:14] An error occured in your application. Invalid input [foo] received.'

> regex = /\[.+?\]/
 => /\[.+?\]/

> regex.match(string_to_match)
 => <MatchData "[2014-05-09 10:10:14]">

Notice that we added our ? after our repetition metacharacter. This tells the regular expression engine to keep looking for the next part of the pattern only until it finds a match; not until it finds the last match.
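A lazy repetition isn’t the only way to stop at the first closing bracket. As a sketch of an alternative, a negated character class that matches anything except ‘]’ gets the same result here, and scan with the lazy pattern shows that each bracketed section is now found on its own:

```ruby
log_line = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'

lazy    = log_line[/\[.+?\]/]       # => "[2014-05-09 10:10:14]"
negated = log_line[/\[[^\]]+\]/]    # => "[2014-05-09 10:10:14]"

# Each bracketed section is matched separately
sections = log_line.scan(/\[.+?\]/)   # => ["[2014-05-09 10:10:14]", "[foo]"]
```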

Assertions

Assertions are parts of regular expressions that do not add any characters to a match. They just assert that certain patterns are present, or that a match occurs at a certain place within a string. There are two types of assertions. Let’s take a closer look.

Anchors

The simplest type of assertion is an anchor. Anchors are metacharacters that let us specify positions in our patterns. The thing that makes these metacharacters different is that they don’t match characters, only positions.

Let’s look at how we can determine if a line starts with Back to Basics using the ^ anchor, which denotes the beginning of a line:

> multi_line_string_to_match = <<-STRING
"> I hope Back to Basics is fun to read.
"> Back to Basics is fun to write.
"> STRING
 => "I hope Back to Basics is fun to read.\nBack to Basics is fun to write.\n"

> regex = /^Back to Basics/
 => /^Back to Basics/

> match = regex.match(multi_line_string_to_match)
 => <MatchData "Back to Basics">

> match.begin(0)
 => 38

Looking at where our match begins we can see it’s the second instance of the string “Back to Basics” we’ve matched. Another thing to take note of is the ^ anchor doesn’t only match the beginning of a string, but the beginning of a line within a string.

There are many anchors available. I encourage you to review the Regexp documentation and check out some of the others.
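Two of those other anchors are worth a quick sketch because they are easy to confuse with ^ and $. In Ruby, \A matches only at the very start of the string and \z only at the very end, while ^ and $ match at the start and end of every line:

```ruby
text = "I hope Back to Basics is fun to read.\nBack to Basics is fun to write.\n"

line_start   = text =~ /^Back to Basics/    # => 38, the start of the second line
string_start = text =~ /\ABack to Basics/   # => nil, the string starts with "I hope"

# $ matches at the end of any line, \z only at the very end of the string
line_end   = text =~ /read\.$/              # => 32
string_end = text =~ /read\.\z/             # => nil
```

When validating user input, \A and \z are usually the safer choice, since ^ and $ can be satisfied by a single line inside a multi-line string.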

Lookarounds

The second type of assertion is called a lookaround. Lookarounds allow us to provide a pattern that must be matched in order for a regular expression to be satisfied, but that will not be included in a successful match. These are called lookahead and lookbehind patterns.

Let’s say we had a comma-delimited list of companies and the years they were founded. Let’s match the year that thoughtbot was founded. In this case we only want the year: we’re not interested in including the company in the match, but we’re only interested in thoughtbot, not the other two companies. To do this we’ll use a positive lookbehind. This means we’ll provide a pattern we expect to appear before the pattern we want to match.

> string_to_match = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'

> regex = /(?<=.thoughtbot: )\d{4}/
 => /(?<=.thoughtbot: )\d{4}/

> regex.match(string_to_match)
 => <MatchData "2003">

Even though the pattern asserting that thoughtbot precedes our match appears in our regular expression, it isn’t included in our match data. This is exactly the behavior we were looking for.

To specify a positive lookbehind we use (?<=pattern). If we wanted a negative lookbehind, meaning the match we want isn’t preceded by some particular text, we would use (?<!pattern).

To do a positive lookahead we use (?=pattern). A negative lookahead is achieved using (?!pattern).
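Here is a short sketch of the other three lookarounds. The prices string is made up for illustration, and the \b word boundaries are there so that a number is never matched in pieces:

```ruby
founded = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'

# Negative lookbehind: four-digit years NOT preceded by "thoughtbot: "
other_years = founded.scan(/(?<!thoughtbot: )\b\d{4}/)   # => ["1984", "1976"]

prices = '100 USD, 200 EUR'   # hypothetical sample text

# Positive lookahead: digits that ARE followed by " EUR"
euros = prices[/\d+(?= EUR)/]                 # => "200"

# Negative lookahead: whole numbers NOT followed by " EUR"
not_euros = prices.scan(/\b\d+\b(?! EUR)/)    # => ["100"]
```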

Captures

Another useful tool is called a capture. This gives us the ability to match on a pattern, but capture only the parts of the pattern that are of interest to us. We accomplish this by surrounding the part of the pattern we intend to capture with parentheses, which is also how we specify a group. Let’s look at how we might pull the quantity and price for an item off of an invoice:

> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'

> regex = /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/
 => /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/ 

> match = regex.match(string_to_match)
 => <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" 1:"1" 2:"2000.00"> 

> match[0]
 => "Mac Book Pro - Quantity: 1 Price: 2000.00" 

> match[1]
 => "1"

> match[2]
 => "2000.00"

Notice we can index into the match data like an array. The first element is the entire match and the next two are our captures. We indicate we want something to be captured by surrounding it in parentheses.

We can make working with captures simpler by using what is called a named capture. Instead of using the match data array we can provide a name for each capture and access the values out of the match data as a hash of those names after the match has occurred. Let’s take a look:

> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'

> regex = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/
 => /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/

> match = regex.match(string_to_match)
 => <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" quantity:"1" price:"2000.00">

> match[:quantity]
 => "1"

> match[:price]
 => "2000.00"
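Captured values always come back as strings, so they usually need converting before we can do anything numeric with them. A small sketch, assuming Ruby 2.4 or newer for named_captures:

```ruby
line_item = 'Mac Book Pro - Quantity: 1 Price: 2000.00'
regex = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/

match = regex.match(line_item)

capture_names = match.names            # => ["quantity", "price"]
captures      = match.named_captures   # a Hash keyed by capture name
price_text    = captures['price']      # => "2000.00"

# Convert the strings before doing arithmetic with them
total = Integer(match[:quantity]) * Float(match[:price])   # => 2000.0
```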

Strings

There are also some useful functions that take advantage of regular expressions in the String class. Let’s take a look at some of the things we can do.

sub and gsub

The sub and gsub methods both allow us to provide a pattern and a string to replace instances of that pattern with. The difference between the two methods is that gsub will replace all instances of the pattern, while sub will only replace the first instance.

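The gsub method gets its name from global substitution. Languages such as Perl enable this behavior with a g modifier on the regular expression; Ruby has no g modifier, so the choice between replacing the first match and replacing every match is made by calling sub or gsub.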

Let’s take a look at some examples.

> string_to_match = "My home number is 5125551234, so please call me at 5125551234."
 => "My home number is 5125551234, so please call me at 5125551234."

> string_to_match.sub(/5125551234/, '(512) 555-1234')
 => "My home number is (512) 555-1234, so please call me at 5125551234."

When we use sub we can see we still have one instance of our phone number that isn’t formatted. Let’s use gsub to fix it.

> string_to_match.gsub(/5125551234/, '(512) 555-1234')
 => "My home number is (512) 555-1234, so please call me at (512) 555-1234."

As expected gsub replaces both instances of our phone number.

While our previous example demonstrates the way the functions work it isn’t a particularly useful regular expression. If we were trying to format all the phone numbers in a large document we obviously couldn’t make our pattern the number in each case, so let’s revisit our example and see if we can make it more useful.

> string_to_match = "My home number is 5125554321. My office number is 5125559876."
 => "My home number is 5125554321. My office number is 5125559876." 

> string_to_match.gsub(/(?<area_code>\d{3})(?<exchange>\d{3})(?<subscriber>\d{4})/, '(\k<area_code>) \k<exchange>-\k<subscriber>')
 => "My home number is (512) 555-4321. My office number is (512) 555-9876."

Now our regular expression will format any phone number in our string. Notice that we take advantage of named captures in our regular expression and use them in our replacement by using \k.
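Both methods also accept a block in place of the replacement string, which helps when the replacement needs real Ruby logic. In this sketch of that form, the block’s return value is used as the replacement for each match:

```ruby
text = 'My home number is 5125554321. My office number is 5125559876.'

formatted = text.gsub(/(\d{3})(\d{3})(\d{4})/) do
  # Regexp.last_match holds the MatchData for the match being replaced
  m = Regexp.last_match
  "(#{m[1]}) #{m[2]}-#{m[3]}"
end
# => "My home number is (512) 555-4321. My office number is (512) 555-9876."

# sub with an equivalent block only rewrites the first number
first_only = text.sub(/(\d{3})(\d{3})(\d{4})/) { "(#{$1}) #{$2}-#{$3}" }
# => "My home number is (512) 555-4321. My office number is 5125559876."
```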

scan

The scan method lets us pull all regular expression matches out of a string. Let’s look at an example.

> string_to_scan = "I've worked in TX and CA so far in my career."
 => "I've worked in TX and CA so far in my career." 

> string_to_scan.scan(/[A-Z]{2}/)
 => ["TX", "CA"]
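The shape of the result changes once the pattern contains captures. As a sketch with a made-up sample string, scan then returns one array of captures per match instead of the matched text:

```ruby
history = 'I worked in TX (2010) and CA (2014).'   # hypothetical sample text

# No capture groups: scan returns the matched text
states = history.scan(/[A-Z]{2}/)                # => ["TX", "CA"]

# With capture groups: scan returns one array of captures per match
pairs = history.scan(/([A-Z]{2}) \((\d{4})\)/)   # => [["TX", "2010"], ["CA", "2014"]]
```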

Using a regular expression we pull out all the state codes in our string.

One thing to keep in mind as you continue to learn is to pay close attention to the assorted metacharacters available and how their meanings change depending on context. Just in this introductory post we saw multiple meanings for both the ^ and ? characters, and we didn’t even cover all of the possible meanings of those two. Sorting out when each metacharacter means what is one of the more difficult parts of mastering regular expressions.

Regular expressions are one of the most powerful tools we have at our disposal with Ruby. Keep them in mind as you code and you’ll be surprised how often they can provide a nice clean solution to an otherwise daunting task!
