Regular Expression

1. Introduction to Examples

Open the regular expression testing tool provided by Open Source China http://tool.oschina.net/regex/, input the text to be matched, and then select a commonly used regular expression to get the corresponding matching result. For example, here is the input text to be matched:

Hello, my phone number is 010-86432100 and email is cqc@cuiqingcai.com, and my website is https://cuiqingcai.com.

This string contains a phone number, an email address, and a URL. Next, let's try to extract them using regular expressions. For URLs, you can use the following regular expression:

[a-zA-Z]+://[^\s]*

When this regular expression is applied to a string, any URL-like text the string contains is extracted.

This regular expression may look messy, but it follows specific syntax rules. For example, a-z matches any lowercase letter, \s matches any whitespace character, and * matches zero or more occurrences of the preceding expression. The whole regular expression is a combination of several such matching rules.

Once the regular expression is written, it can be used to match and search within a long string. Regardless of what is in the string, as long as it conforms to the rules we wrote, it can all be found. For web pages, if you want to find out how many URLs are in the source code of a web page, you can match it using the regular expression for URLs.
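As a minimal sketch of that idea, the URL rule above can be tried in Python against the sample sentence. The re module and its findall function are covered later in this section; the variable names here are illustrative only.

```python
import re

# Sample sentence from the text above
text = ('Hello, my phone number is 010-86432100 and email is '
        'cqc@cuiqingcai.com, and my website is https://cuiqingcai.com.')

# Letters, then "://", then any run of non-whitespace characters
url_pattern = r'[a-zA-Z]+://[^\s]*'

urls = re.findall(url_pattern, text)
print(urls)  # ['https://cuiqingcai.com.']
```

Note that the sentence's final period is captured as well, because [^\s]* accepts any non-whitespace character. A stricter rule would be needed to exclude trailing punctuation.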

We mentioned several matching rules above, and Table 3-2 lists commonly used matching rules.

Table 3-2 Commonly Used Matching Rules

Pattern	Description
\w	Matches letters, digits, and underscores
\W	Matches characters that are not letters, digits, or underscores
\s	Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit, equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
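\Z	Matches the end of the string. In Python's re module it matches only at the very end; in Perl-style engines it also matches just before a trailing newline
\z	Matches only at the very end of the string, even after a trailing newline (Perl-style engines; older versions of Python's re module do not accept this escape)
\G	Matches the position where the last match finished (Perl-style engines; not supported by Python's re module)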
\n	Matches a newline character
\t	Matches a tab character
^	Matches the start of a line of text
$	Matches the end of a line of text
.	Matches any character except a newline; when the re.DOTALL flag is specified, it can match any character including newlines
[...]	Represents a set of characters, listed individually, e.g., [amk] matches a, m, or k
[^...]	Characters not in [], e.g., [^abc] matches any character except a, b, or c
*	Matches 0 or more occurrences of the preceding expression
+	Matches 1 or more occurrences of the preceding expression
?	Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (e.g., *? or +?), it makes that quantifier non-greedy
{n}	Matches exactly n occurrences of the preceding expression
{n,m}	Matches between n and m occurrences of the preceding expression, greedy
a|b	Matches a or b
( )	Matches the expression within parentheses, also represents a group
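
A few of these rules can be combined to pull the phone number and email address out of the sample sentence. This is a sketch only: both patterns are deliberately simplified assumptions (a 3-digit area code followed by 8 digits, and an email address made of word characters), not general-purpose validators.

```python
import re

text = 'Hello, my phone number is 010-86432100 and email is cqc@cuiqingcai.com'

# \d{3} = exactly three digits, then a literal hyphen, then eight digits
phone = re.search(r'\d{3}-\d{8}', text)

# \w+ = one or more word characters; \. = a literal dot
email = re.search(r'\w+@\w+\.\w+', text)

print(phone.group())  # 010-86432100
print(email.group())  # cqc@cuiqingcai.com
```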

2. match

We start with the first commonly used matching method, match. Given a regular expression and a string, it checks whether the regular expression matches the string.

The match method attempts to match the regular expression from the start of the string. If it matches, it returns the successful match result; if it does not match, it returns None. The example is as follows:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)
print(result)
print(result.group())
print(result.span())

The output is as follows:

41
<_sre.SRE_Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)

Here, we first declare a string that contains English letters, whitespace characters, digits, etc. Next, we write a regular expression:

^Hello\s\d\d\d\s\d{4}\s\w{10}

We use it to match the long string. The leading ^ anchors the match at the start of the string, so the string must begin with Hello. Next, \s matches the space in the target string, and three \d in a row match 123. Another \s matches the following space. The next part, 4567, could be matched with four \d, but that is cumbersome, so we write \d{4} instead, where {4} repeats the preceding rule 4 times. After one more whitespace character, \w{10} matches 10 characters that are letters, digits, or underscores. Note that the pattern does not cover the entire target string; the match still succeeds, and the result is simply shorter than the string.

In the match method, the first parameter passes the regular expression, and the second parameter passes the string to be matched.

By printing the output, we can see that the result is an SRE_Match object, which shows that the match was successful. Two of its methods are used here: group returns the matched content, Hello 123 4567 World_This, which is exactly what the regular expression matched; span returns the matching range, (0, 25), which is the position of the matched text within the original string.

Through the above example, we have a basic understanding of how to use regular expressions in Python to match a piece of text.
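
Because match returns None when the pattern does not match at the start of the string, calling group on the result without checking it first can raise an AttributeError. The following sketch guards against that; the helper name leading_greeting is hypothetical and used only for illustration.

```python
import re


def leading_greeting(text):
    """Return the text matched at the start of the string, or None."""
    result = re.match(r'Hello\s\d{3}', text)
    if result is None:  # no match at the beginning of the string
        return None
    return result.group()


print(leading_greeting('Hello 123 4567 World_This'))  # Hello 123
print(leading_greeting('Say Hello 123'))              # None
```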

Matching Target

Just now, we used the match method to get the matched string content, but what if we want to extract only part of the content? For instance, as in the earlier example, we might want to extract an email address or a phone number from a piece of text.

Here we can use parentheses () to enclose the substring we want to extract. The parentheses actually mark the start and end positions of a sub-expression, and each marked sub-expression corresponds to a group, allowing us to obtain the extraction result by calling the group method with the index of the group. The example is as follows:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Here we want to extract 1234567 from the string, so we can enclose the regular expression for the digits in parentheses, and then call group(1) to get the matching result.

The output is as follows:

<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)

We can see that we successfully obtained 1234567. Here we used group(1), which differs from group(): group() returns the complete match, while group(1) returns the content matched by the first pair of parentheses. If the regular expression contains further parenthesized groups, group(2), group(3), and so on return their contents.
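
To illustrate group numbering, here is a small sketch with two groups. The second group, (\w+), is an addition for demonstration and is not part of the example above; groups and span(1) are standard methods of the match object.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

print(result.group(1))  # 1234567
print(result.group(2))  # World_This
print(result.groups())  # ('1234567', 'World_This')
print(result.span(1))   # (6, 13), the position of the first group
```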

General Matching

The regular expression we just wrote is rather laborious: every whitespace character needs a \s and every digit needs a \d. This is unnecessary, because there is a universal match we can use, .* (dot star). Here . (dot) matches any character except a newline, and * (star) repeats the preceding expression zero or more times, so together they match any sequence of characters. With this, we don't have to match each character one by one.

Continuing from the previous example, we can rewrite the regular expression:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

Here we directly omitted the middle part and replaced it all with .* and added a trailing string. The output is as follows:

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)

We can see that the group method returned the entire string, meaning the regular expression matched all of the target string; the span method returned (0, 41), which covers the full length of the string.

Therefore, we can use .* to simplify the writing of regular expressions.

Greedy and Non-Greedy

When using the general match .* above, sometimes the matched result may not be what we want. Look at the following example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we still want to get the number in the middle, so we still write (\d+) in the middle. However, since the content on both sides of the number is quite messy, we want to omit it and write it all as .* . Finally, it forms ^He.*(\d+).*Demo$, which seems fine. Let's check the output:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7

A strange thing happened; we only got the number 7. What happened?

This involves the issue of greedy matching versus non-greedy matching. In greedy matching, .* will match as many characters as possible. In the regular expression, .* is followed by \d+, which means at least one digit, and does not specify how many digits, so .* matches as many characters as possible, leaving \d+ with one digit that satisfies the condition, which is 7, resulting in only the number 7.

But this clearly brings us a lot of inconvenience. Sometimes, the matching result may inexplicably lack a part of the content. In fact, we just need to use non-greedy matching. The non-greedy matching is written as .*?, which adds a ?, so what effect can it achieve? Let's see with an example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

Here we just changed the first .* to .*?, turning it into non-greedy matching. The result is as follows:

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567

Now we can successfully obtain 1234567. The reason is clear: greedy matching tries to match as many characters as possible, while non-greedy matching tries to match as few as possible. Once .*? has matched up to and including the whitespace character after Hello, the characters that follow are digits, which \d+ can match, so .*? stops there and hands the digits over to \d+. Thus .*? matches as few characters as possible, and the result of \d+ is 1234567.

Therefore, when matching in the middle of a string, it is advisable to prefer non-greedy matching, that is, to use .*? instead of .*, to avoid losing part of the result.
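
The difference can also be seen side by side on a shorter input. The HTML-like string below is made up for this sketch:

```python
import re

text = '<b>one</b><b>two</b>'

greedy = re.search(r'<b>(.*)</b>', text)   # .* runs to the last </b>
lazy = re.search(r'<b>(.*?)</b>', text)    # .*? stops at the first </b>

print(greedy.group(1))  # one</b><b>two
print(lazy.group(1))    # one
```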

However, it should be noted that if the matching result is at the end of the string, .*? may not match any content because it will match as few characters as possible. For example:

import re

content = 'http://weibo.com/comment/kEraCN'
result1 = re.match('http.*?comment/(.*?)', content)
result2 = re.match('http.*?comment/(.*)', content)
print('result1', result1.group(1))
print('result2', result2.group(1))

The output is as follows:

result1 
result2 kEraCN

We can observe that .*? did not match anything, while .* matched as much content as possible and successfully obtained the result.
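
When a non-greedy group has to sit at the end of the pattern, one possible remedy (an addition here, not something the example above uses) is to anchor the pattern with $, which forces .*? to keep expanding until the end of the string:

```python
import re

content = 'http://weibo.com/comment/kEraCN'

# Without an anchor, the lazy group is satisfied by zero characters
unanchored = re.match(r'http.*?comment/(.*?)', content)
# With $, the lazy group must extend to the end of the string
anchored = re.match(r'http.*?comment/(.*?)$', content)

print(repr(unanchored.group(1)))  # ''
print(anchored.group(1))          # kEraCN
```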

Modifiers

Regular expressions accept optional flags, called modifiers, that control how matching is performed. Let's look at an example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))

Similar to the previous example, we added a newline character in the string, and the regular expression remains the same, used to match the number within it. Let's check the output:

AttributeError Traceback (most recent call last)
<ipython-input-18-c7d232b39645> in <module>()
      5 '''
      6 result = re.match(r'^He.*?(\d+).*?Demo$', content)
----> 7 print(result.group(1))

AttributeError: 'NoneType' object has no attribute 'group'

The execution directly reports an error, indicating that the regular expression did not match this string, returning None, and we called the group method, resulting in an AttributeError.

So why did adding a newline character cause the match to fail? This is because . matches any character except a newline, so when .*? encounters the newline it cannot continue, and the match fails. We just need to add the modifier re.S to fix this error:

result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

This modifier allows . to match all characters, including newline characters. Now the output is as follows:

1234567

The re.S modifier is often used when matching web pages, because HTML nodes frequently contain newlines; with it, a pattern can match across the line breaks inside and between nodes.

Additionally, there are some other modifiers that can be used when necessary, as shown in Table 3-3.

Table 3-3 Modifiers

Modifier	Description
re.I	Makes matching case-insensitive
re.L	Performs locale-aware matching
re.M	Multi-line matching, affecting ^ and $
re.S	Makes . match all characters, including newlines
re.U	Parses characters according to the Unicode character set. This flag affects \w, \W, \b, and \B
re.X	Verbose mode: whitespace in the pattern is ignored and comments are allowed, so the regular expression can be written in a more readable format

In web matching, re.S and re.I are commonly used.
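
The following sketch shows a few of these modifiers in action, including how several flags can be combined with the | operator. The sample text and the date pattern are made up for illustration.

```python
import re

text = 'first line\nSecond line\nthird line'

# Without re.M, ^ matches only at the start of the whole string
plain = re.findall(r'^\w+', text)
# With re.M, ^ also matches at the start of every line
multi = re.findall(r'^\w+', text, re.M)
# Flags are combined with the bitwise OR operator
combined = re.findall(r'^second\s\w+', text, re.I | re.M)

print(plain)     # ['first']
print(multi)     # ['first', 'Second', 'third']
print(combined)  # ['Second line']

# re.X ignores whitespace in the pattern and allows comments
date_pattern = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X)
date_parts = date_pattern.search('2016-12-15').groups()
print(date_parts)  # ('2016', '12')
```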

Escape Matching

We know that regular expressions define many matching patterns, such as . matching any character except newline. But what if the target string contains a .?

Here we need to use escape matching, as shown in the example:

import re

content = '(百度)www.baidu.com'
result = re.match(r'\(百度\)www\.baidu\.com', content)
print(result)

When encountering special characters used for regular matching patterns, just add a backslash to escape them. For example, . can be matched with \. and ( with \(. The output is as follows:

<_sre.SRE_Match object; span=(0, 17), match='(百度)www.baidu.com'>

We can see that we successfully matched the original string.

These are several commonly used knowledge points for writing regular expressions. Mastering them will be very helpful for writing regular expression matches later.
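
When the literal text is long or not known in advance, escaping every special character by hand is error-prone. The standard library's re.escape function adds the backslashes automatically. The sketch below applies it to the Baidu string and also shows what an unescaped . would let through; the string wwwxbaidu.com is a made-up counterexample.

```python
import re

content = '(百度)www.baidu.com'

# re.escape adds backslashes before characters that have a special meaning
pattern = re.escape(content)
result = re.match(pattern, content)

print(result.group())  # (百度)www.baidu.com
print(result.span())   # (0, 17)

# An unescaped dot matches any character, so it accepts too much
loose = re.match(r'www.baidu.com', 'wwwxbaidu.com')
strict = re.match(r'www\.baidu\.com', 'wwwxbaidu.com')
print(loose is not None)   # True
print(strict is not None)  # False
```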

3. search

As mentioned earlier, the match method matches from the beginning of the string, and if the beginning does not match, the entire match fails. Let's look at the following example:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

Here the string starts with Extra, but the regular expression starts with Hello. Although the regular expression matches part of the string, match only tries at the beginning, so the match fails. The output is as follows:

None

Because match only succeeds when the pattern matches from the very beginning of the string, it is inconvenient for extracting content. It is better suited to checking whether a string conforms to a given regular expression.

Here we have another method called search, which scans the entire string during matching and returns the first successful match result. This means that the regular expression can be part of the string, and during matching, the search method will scan the string sequentially until it finds the first string that meets the rules, then returns the matching content. If it finishes searching and still hasn't found anything, it returns None.

Let's change match to search in the code above, also print result.group(1), and look at the output:

<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
1234567

Now we have obtained the matching result.

Therefore, search is usually the more convenient choice for matching.
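
Written out in full, the comparison between the two methods looks like this. It is a sketch that reuses the string and pattern from the example above:

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

# match only tries at position 0, where the string says "Extra"
matched = re.match(pattern, content)
print(matched)  # None

# search scans forward until the pattern matches somewhere
found = re.search(pattern, content)
print(found.group(1))  # 1234567
print(found.span())    # (13, 53)
```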

Next, let's look at a few examples to see how to use the search method.

First, here is a piece of HTML text to be matched. Next, we will write several regular expression examples to extract the corresponding information:

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''

We can observe that the ul node contains many li nodes, some of which contain a nodes, while others do not. The a nodes also carry attributes: a hyperlink (href) and a singer name (singer).

First, we will try to extract the singer name and song name contained in the hyperlink of the li node with class active. At this point, we need to extract the singer attribute and text of the a node under the third li node.

The regular expression can start with <li, followed by the marker active, with the part in between matched by .*?. Next, we need to extract the value of the singer attribute, so we write singer="(.*?)", where the part to be extracted is enclosed in parentheses so that the group method can retrieve it; it is bounded on both sides by double quotes. Then we also need to match the text of the a node, whose left boundary is > and whose right boundary is </a>. The target content is again matched with (.*?), so the final regular expression becomes:

<li.*?active.*?singer="(.*?)">(.*?)</a>

Then we call the search method, which will search the entire HTML text and find the first content that matches the regular expression.

Additionally, since the code contains newlines, we need to pass re.S as the third parameter. The entire matching code is as follows:

result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S) 
if result:  
    print(result.group(1), result.group(2))

Since the singer and song names we need to obtain are already enclosed in parentheses, we can use the group method to get them.

The output is as follows:

齐秦 往事随风

We can see that this is exactly the singer name and song name contained in the hyperlink of the li node with class active.

What if we remove active from the regular expression (i.e., matching the content of nodes without the class active)? Let's modify the code by removing active from the regular expression:

result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:  
    print(result.group(1), result.group(2))

Since the search method returns the first matching target that meets the conditions, the result changes:

任贤齐 沧海一声笑

With active removed, the match starts at the first li node, and .*? expands until it reaches the first singer attribute, which belongs to the a node inside the second li node. The result therefore changes to the content of the second li node.

Note that in both of the above matches, the third parameter of the search method is re.S, which allows .*? to match newlines, so the li nodes containing newlines were matched. If we remove it, what will the result be? The code is as follows:

result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
if result:  
    print(result.group(1), result.group(2))

The output is as follows:

beyond 光辉岁月

We can see that the result has become the content of the fourth li node. This is because the second and third li nodes both contain newline characters, and without re.S, .*? can no longer match newline characters, so the regular expression does not match the second and third li nodes, while the fourth li node does not contain newline characters, so it matches successfully.

Since most HTML texts contain newline characters, it is advisable to always add the re.S modifier to avoid matching issues.

4. findall

We previously introduced the usage of the search method, which can return the first content that matches the regular expression. But what if we want to get all the content that matches the regular expression? In this case, we need to use the findall method. This method searches the entire string and returns all the content that matches the regular expression.

Using the same HTML text, if we want to get all the hyperlinks, singers, and song names of the a nodes, we can replace the search method with the findall method. If there are return results, it will be of list type, so we need to iterate through it to obtain each group of content. The code is as follows:

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)  
print(type(results))  
for result in results:  
    print(result)  
    print(result[0], result[1], result[2])

The output is as follows:

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
<class 'list'>
('/2.mp3', '任贤齐', '沧海一声笑')
/2.mp3 任贤齐 沧海一声笑
('/3.mp3', '齐秦', '往事随风')
/3.mp3 齐秦 往事随风
('/4.mp3', 'beyond', '光辉岁月')
/4.mp3 beyond 光辉岁月
('/5.mp3', '陈慧琳', '记事本')
/5.mp3 陈慧琳 记事本
('/6.mp3', '邓丽君', '但愿人长久')
/6.mp3 邓丽君 但愿人长久

We can see that each element in the returned list is of tuple type, and we can extract them using the corresponding index.

If we only want to get the first content, we can use the search method. When we need to extract multiple contents, we can use the findall method.
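
The element type of the list returned by findall depends on how many groups the pattern contains. A minimal sketch on a made-up string:

```python
import re

text = 'a1 b2 c3'

# No groups: each element is the whole match
no_group = re.findall(r'[a-z]\d', text)
# One group: each element is that group's content
one_group = re.findall(r'[a-z](\d)', text)
# Several groups: each element is a tuple of the groups
two_groups = re.findall(r'([a-z])(\d)', text)

print(no_group)    # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```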

5. sub

In addition to using regular expressions to extract information, sometimes we also need to use them to modify text. For example, if we want to remove all digits from a string of text, using the string's replace method would be too cumbersome. In this case, we can use the sub method. The example is as follows:

import re

content = '54aK54yr5oiR54ix5L2g'
content = re.sub(r'\d+', '', content)
print(content)

The output is as follows:

aKyroiRixLg

Here we pass \d+ as the first parameter to match all digits, the second parameter is the replacement string (an empty string removes the matches), and the third parameter is the original string.
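
The replacement does not have to be empty. As a sketch beyond the example above, the replacement string can be any text, the optional count argument limits the number of substitutions, and backreferences such as \1 reuse captured groups. The date string is made up for illustration.

```python
import re

content = '54aK54yr5oiR54ix5L2g'

removed = re.sub(r'\d+', '', content)              # delete every run of digits
masked = re.sub(r'\d', '#', content)               # replace each digit with '#'
first_only = re.sub(r'\d+', '', content, count=1)  # replace the first run only

print(removed)     # aKyroiRixLg
print(masked)      # ##aK##yr#oiR##ix#L#g
print(first_only)  # aK54yr5oiR54ix5L2g

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2016-12-15')
print(reordered)   # 15/12/2016
```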

In the above HTML text, if we want to get all the song names of the li nodes, directly using regular expressions to extract them may be cumbersome. For example, it could be written like this:

results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
for result in results:
    print(result[1])

The output is as follows:

一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久

At this point, using the sub method would be simpler. We can first use the sub method to remove the a nodes, leaving only the text, and then use findall to extract it:

html = re.sub('<a.*?>|</a>', '', html)
print(html)
results = re.findall('<li.*?>(.*?)</li>', html, re.S)
for result in results:
    print(result.strip())

The output is as follows:

<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            沧海一声笑
        </li>
        <li data-view="4" class="active">
            往事随风
        </li>
        <li data-view="6">光辉岁月</li>
        <li data-view="5">记事本</li>
        <li data-view="5">
            但愿人长久
        </li>
    </ul>
</div>
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久

We can see that the a nodes have been removed by the sub method, after which findall can extract the song names directly. Used at the right moment, sub can simplify extraction considerably.

6. compile

The methods discussed earlier are used to process strings. Finally, let's introduce the compile method, which compiles a regular expression string into a regular expression object that can be reused in subsequent matches. The example code is as follows:

import re

content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)

For example, here are 3 dates, and we want to remove the time from each of them. We can use the sub method. The first parameter of this method is the regular expression, but there is no need to repeatedly write the same regular expression three times. At this point, we can use the compile method to compile the regular expression into a regular expression object for reuse.

The output is as follows:

2016-12-15  2016-12-17  2016-12-22

Additionally, compile can also accept modifiers, such as re.S, so that in search, findall, and other methods, you do not need to pass them again. Therefore, the compile method can be said to encapsulate the regular expression for better reuse.
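
As a sketch of that last point, a pattern compiled with re.S can be reused through its own findall and search methods without repeating the flag. The short HTML snippet and the name item_pattern are made up for illustration.

```python
import re

html = '<ul>\n<li>\nfirst\n</li>\n<li>second</li>\n</ul>'

# The flag is supplied once, at compile time
item_pattern = re.compile(r'<li.*?>(.*?)</li>', re.S)

# The compiled object offers the same methods as the re module
items = [item.strip() for item in item_pattern.findall(html)]
first = item_pattern.search(html)

print(items)                   # ['first', 'second']
print(first.group(1).strip())  # first
```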