Regexes for daily use
In this post, we discuss some common code snippets for extracting phrases from text.
Alternatives exist (for instance, using spaCy and tokenization); for this post, we will use re, the regular expression module in Python's standard library.
Comments
- There is also this fantastic tool/site regex101 that is quite useful for testing out one’s regexes on sample strings, with explanations etc.
- While noting down my common regexes, I also came across this great resource: regular-expressions.info, which has the same purpose: to list, and reason about, the most commonly used regexes. However, the list below is tailored to my own use cases, for reference during the day's work, so I am keeping this article.
Basics
There are four main functions in the re module: match, search, findall and finditer (there are many other interesting functions such as sub and subn; for details, look at the source code).
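As a quick illustration of how the four functions differ, here is a minimal sketch on a toy string (the string and variable names are made up for this example):

```python
import re

text = "abc and abc"

# match only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"abc", text)       # a Match object with span (0, 3)
not_at_start = re.match(r"and", text)   # None, since "and" is not at the start

# search returns the first match found anywhere in the string.
first_and = re.search(r"and", text)
print(first_and.span())                 # (4, 7)

# findall returns the matched strings themselves.
all_strings = re.findall(r"abc", text)
print(all_strings)                      # ['abc', 'abc']

# finditer yields Match objects, which also carry the spans.
all_spans = [m.span() for m in re.finditer(r"abc", text)]
print(all_spans)                        # [(0, 3), (8, 11)]
```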
The function findall gives us all the matches of the pattern in the actual string, but it is the function finditer that also gives us the spans of the matches. This is often necessary in actual work, where we want to modify the spans (say, we want to create an HTML span where the span of the pattern is highlighted).
Sample code snippets follow.
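To make the highlighting use case concrete, here is one possible sketch built on finditer. The function name highlight and the CSS class hl are assumptions made for this example, not part of any library:

```python
import html
import re

def highlight(pattern, text):
    """Wrap every match of pattern in a <span> tag (the class name "hl" is hypothetical)."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        start, end = m.span()
        # Text between the previous match and this one, escaped for HTML.
        pieces.append(html.escape(text[last:start]))
        # The matched text itself, wrapped in the highlighting span.
        pieces.append('<span class="hl">' + html.escape(text[start:end]) + '</span>')
        last = end
    # Whatever remains after the last match.
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)

highlighted = highlight(r"\babc\b", "This is the work of abc & abc company")
print(highlighted)
# This is the work of <span class="hl">abc</span> &amp; <span class="hl">abc</span> company
```

The spans are what let us stitch the untouched text and the wrapped matches back together in order; findall alone would not tell us where each match sits.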
Keyphrases
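This is quite mundane; the only two things that need to be noted here are:
- the use of \b to demarcate word boundaries;
- how we frame the input keyphrase, which is often a variable in the program, as a pattern: the use of the raw-string prefix r'..', so that \b reaches the regex engine as a word boundary rather than being read by Python as a backspace character.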
import re
def extract_kp_from_sentence(kp, sentence):
    pattern = r'\b' + kp + r'\b'
    for match in re.finditer(pattern, sentence):
        print(match.span())
A sample run:
kp = "abc"
sentence = "This is the work of abc company"
extract_kp_from_sentence(kp, sentence)
# (20, 23)
assert(sentence[20:23] == 'abc')
sentence = "This is the work of abc & abc company"
extract_kp_from_sentence(kp, sentence)
# (20, 23)
# (26, 29)
If you want to ignore the case of the keyphrase while detecting a match, use re.IGNORECASE:
kp = 'ABC'
sentence = "This is the work of abc & abc company"
pattern = r'\b' + kp + r'\b'
list(re.finditer(pattern, sentence, re.IGNORECASE))
# returns
# [<re.Match object; span=(20, 23), match='abc'>,
# <re.Match object; span=(26, 29), match='abc'>]
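One caveat with building the pattern by string concatenation: if the keyphrase contains regex metacharacters (such as a dot), they are interpreted as regex syntax. A possible remedy, sketched below, is to pass the keyphrase through re.escape first. The helper name find_kp_spans is made up for this example, and it returns the spans instead of printing them:

```python
import re

def find_kp_spans(kp, sentence, flags=0):
    """Return the spans of kp in sentence, treating kp as literal text."""
    # re.escape neutralises metacharacters such as '.', '+' or '(' in kp.
    pattern = r'\b' + re.escape(kp) + r'\b'
    return [m.span() for m in re.finditer(pattern, sentence, flags)]

sentence = "abc and a.c"

# Without escaping, the dot in "a.c" matches any character, so "abc" matches too.
unescaped_spans = [m.span() for m in re.finditer(r'\b' + "a.c" + r'\b', sentence)]
print(unescaped_spans)   # [(0, 3), (8, 11)]

# With escaping, only the literal "a.c" is found.
escaped_spans = find_kp_spans("a.c", sentence)
print(escaped_spans)     # [(8, 11)]
```

Note that \b only behaves as expected when the keyphrase starts and ends with a word character; a keyphrase such as "C++" would need a different boundary check.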
Emails
Here is a regex for emails, with the username and the domain name separated out.
(Of course, for the following examples, if we need to demarcate the word boundary, then we should use \b before and after the base regex.)
import re
email_pattern = r'(\w+)\@([\w+.]+)'
string = 'This is my email = abc@def.co.in'
matches = re.findall(email_pattern, string)
This results in the following:
matches = [('abc', 'def.co.in')]
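This is because of the grouping effected by (..): (\w+) for the first group and ([\w+.]+) for the second group. Note that inside the character class [\w+.], both + and . are literal characters, so the domain part is any run of word characters, plus signs and dots.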
If you want the entire email itself without the username and domain name separated, then just remove the parentheses:
import re
email_pattern = r'\w+\@[\w+.]+'
string = 'This is my email = abc@def.co.in'
matches = re.findall(email_pattern, string)
# which results in:
# matches = ['abc@def.co.in']
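The patterns above are deliberately simple, and it helps to know where they fall short. As a sketch: \w+ does not accept dots in the username, and the domain part happily swallows a trailing full stop. The broader pattern below is one possible alternative of my own, named broader_email_pattern for this example; it is still far from a complete email validator.

```python
import re

# The simple pattern from above.
simple_email_pattern = r'\w+\@[\w+.]+'

# An assumed, somewhat broader alternative: dots, plus signs and hyphens are
# allowed in the username, and the domain must be dot-separated labels.
broader_email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

string = 'Write to john.doe@example.com.'

simple_matches = re.findall(simple_email_pattern, string)
print(simple_matches)    # ['doe@example.com.']

broader_matches = re.findall(broader_email_pattern, string)
print(broader_matches)   # ['john.doe@example.com']
```

The group in the broader pattern is written as (?:...) so that it is non-capturing; with a capturing group, findall would return only the group's contents instead of the whole email.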
GUIDs
Let us consider sample guids as generated by the online guid generator.
Here is an example guid: 386940a3-f37d-48c4-b3a0-92ab6859e04c
The structure of a guid is pretty clear; also refer to this resource for information. A guid string is composed of
- first 8 hexadecimal digits,
- then three groups of 4 hexadecimal digits,
- and finally a group of 12 hexadecimal digits.
Crucial to note is that the parts are hexadecimal digits, and not just arbitrarily alphanumeric.
We can catch hexadecimal with [0-9a-f].
A sample on regex101 is here.
import re
guid_pattern = r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}'
string = 'Here is a sample guid = 386940a3-f37d-48c4-b3a0-92ab6859e04c'
matches = re.findall(guid_pattern, string)
# matches = ['386940a3-f37d-48c4-b3a0-92ab6859e04c']
Of course, it would be great to make this regex compact, without having to repeat the [a-f0-9]{4}- block thrice.
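One way to get that compactness, as a sketch, is to wrap the repeated block in a non-capturing group and apply the {3} quantifier to it. The name compact_guid_pattern is just for this example:

```python
import re

# (?:[a-f0-9]{4}-){3} repeats the "4 hex digits plus hyphen" block three times.
compact_guid_pattern = r'[a-f0-9]{8}-(?:[a-f0-9]{4}-){3}[a-f0-9]{12}'

string = 'Here is a sample guid = 386940a3-f37d-48c4-b3a0-92ab6859e04c'
compact_matches = re.findall(compact_guid_pattern, string)
print(compact_matches)   # ['386940a3-f37d-48c4-b3a0-92ab6859e04c']

# Guids are sometimes written in uppercase; re.IGNORECASE covers that case.
upper_matches = re.findall(compact_guid_pattern, string.upper(), re.IGNORECASE)
print(upper_matches)     # ['386940A3-F37D-48C4-B3A0-92AB6859E04C']
```

The group is non-capturing on purpose: with a plain (...) group, findall would return only the last repetition of the group instead of the full guid.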