In this post, we discuss some common code snippets for extracting phrases from text. Alternatives exist (for instance, using spaCy and tokenization); for this post, we will use re, the regular-expression module in Python's standard library.

Comments

There is also this fantastic tool/site regex101 that is quite useful for testing out one’s regexes on sample strings, with explanations of each part of the pattern.
While noting down my common regexes, I also came across another great resource, regular-expressions.info, which has the same purpose: to list and explain the most commonly used regexes. However, the list below is tailored to my own use cases, for reference during the day’s work, so I am keeping this article.

Basics

There are four main functions in the re module: match, search, findall and finditer (there are other useful functions such as sub and subn; for details, look at the module documentation or source code).
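
As a quick orientation, here is a minimal sketch of how the four functions differ on the same input. The sample string is made up for illustration.

```python
import re

text = "cat bat cat"

# match succeeds only if the pattern matches at the very start of the string.
print(re.match(r"bat", text))           # None
print(re.match(r"cat", text).span())    # (0, 3)

# search scans forward and returns the first match anywhere in the string.
print(re.search(r"bat", text).span())   # (4, 7)

# findall returns the matched substrings as a list.
print(re.findall(r"cat", text))         # ['cat', 'cat']

# finditer yields one match object per match, each carrying its span.
print([m.span() for m in re.finditer(r"cat", text)])  # [(0, 3), (8, 11)]
```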

The function findall returns all the matches of the pattern in the string as plain strings, whereas finditer yields match objects, which also give us the spans (start and end offsets) of the matches. The spans are often necessary in actual work, where we want to modify the matched regions (say, we want to create an HTML span in which each occurrence of the pattern is highlighted).
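
The highlighting use case could be sketched roughly as follows. The helper name highlight and the CSS class "hit" are hypothetical, chosen only for this illustration; the text outside and inside each match is HTML-escaped with html.escape.

```python
import re
from html import escape

def highlight(pattern, text, css_class="hit"):
    """Wrap every match of pattern in text in an HTML <span> (illustrative helper)."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        start, end = m.span()
        # Text between the previous match and this one, escaped for HTML.
        pieces.append(escape(text[last:start]))
        pieces.append(f'<span class="{css_class}">{escape(text[start:end])}</span>')
        last = end
    # Whatever remains after the final match.
    pieces.append(escape(text[last:]))
    return "".join(pieces)

html_out = highlight(r"\babc\b", "This is the work of abc & abc company")
print(html_out)
# This is the work of <span class="hit">abc</span> &amp; <span class="hit">abc</span> company
```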

Sample code snippets follow.

Keyphrases

This is quite mundane. The only two things that need to be noted here are:

The use of \b to demarcate word boundaries.
How we frame the input keyphrase, which is often a variable in the program, as a pattern: the use of raw strings, r'...'.

import re

def extract_kp_from_sentence(kp, sentence):
    # Surround the keyphrase with word boundaries so that only whole words match
    pattern = r'\b' + kp + r'\b'
    for match in re.finditer(pattern, sentence):
        print(match.span())

A sample run:

kp = "abc"
sentence = "This is the work of abc company"

extract_kp_from_sentence(kp, sentence)
# (20, 23)
assert sentence[20:23] == 'abc'

sentence = "This is the work of abc & abc company"
extract_kp_from_sentence(kp, sentence)
# (20, 23)
# (26, 29)
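
One caveat when the keyphrase is a variable: if it contains regex metacharacters such as a dot, they are interpreted as part of the pattern. A minimal sketch of a literal-matching variant follows, using re.escape; the helper name find_literal_spans and the sample sentence are assumptions made for this illustration.

```python
import re

def find_literal_spans(kp, sentence):
    """Return spans where kp occurs as a whole word, treating kp as literal text."""
    pattern = r"\b" + re.escape(kp) + r"\b"
    return [m.span() for m in re.finditer(pattern, sentence)]

sentence = "We use node.js, not nodeXjs"

# Unescaped: the dot matches any character, so "nodeXjs" is matched too.
unescaped = [m.span() for m in re.finditer(r"\bnode.js\b", sentence)]
print(unescaped)  # [(7, 14), (20, 27)]

# Escaped: only the literal "node.js" is matched.
escaped = find_literal_spans("node.js", sentence)
print(escaped)    # [(7, 14)]
```

Note that \b only helps when the keyphrase starts and ends with a word character; a keyphrase such as "C++" followed by a space has no word boundary after the final +.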

If you want to ignore the case of the keyphrase while detecting a match, use re.IGNORECASE:

kp = 'ABC'
sentence = "This is the work of abc & abc company"
pattern = r'\b' + kp + r'\b'
list(re.finditer(pattern, sentence, re.IGNORECASE))
# returns
# [<re.Match object; span=(20, 23), match='abc'>,
# <re.Match object; span=(26, 29), match='abc'>]

Emails

Here is a regex for emails; with the username and the domain name separated out. (Of course for the following examples, if we need to demarcate the word boundary, then we should use \b before and after the base regex.)

import re
email_pattern = r'(\w+)\@([\w+.]+)'

string = 'This is my email = abc@def.co.in'
matches = re.findall(email_pattern, string)

This results in the following:

matches = [('abc', 'def.co.in')]

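This is because of the grouping effected by (..): (\w+) is the first group and ([\w+.]+) is the second. When a pattern contains more than one group, findall returns one tuple of group values per match. Note that inside the square brackets, + and . are literal characters, so the second group matches any run of word characters, plus signs and dots.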

If you want the entire email itself without the username and domain name separated, then just remove the parentheses:

import re
email_pattern = r'\w+\@[\w+.]+'
string = 'This is my email = abc@def.co.in'
matches = re.findall(email_pattern, string)

# which results in:
# matches = ['abc@def.co.in']
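
The pattern above is deliberately simple, and two of its limits are worth seeing: \w+ stops at a dot in the username, and [\w+.]+ swallows a trailing full stop. Below is a sketch of a somewhat tighter variant, with \b on both sides. The name tight and the exact character classes are my own assumptions for illustration, not a complete email validator.

```python
import re

loose = r"\w+\@[\w+.]+"
# Assumed tighter variant: dots, plus signs and hyphens are allowed in the
# username, and the domain must consist of dot-separated labels.
tight = r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"

string = "Write to first.last@def.co.in."

loose_matches = re.findall(loose, string)
tight_matches = re.findall(tight, string)
print(loose_matches)  # ['last@def.co.in.']
print(tight_matches)  # ['first.last@def.co.in']
```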

GUIDs

Let us consider sample GUIDs as generated by the online guid generator. Here is an example GUID: 386940a3-f37d-48c4-b3a0-92ab6859e04c. The structure of a GUID is pretty clear; also refer to this resource for information. A GUID string is composed of

first 8 hexadecimal digits,
then three groups of 4 hexadecimal digits,
and finally a group of 12 hexadecimal digits.

It is crucial to note that the parts are hexadecimal digits, not arbitrary alphanumeric characters. We can match a lowercase hexadecimal digit with [0-9a-f]; for GUIDs written in uppercase, use [0-9a-fA-F] or pass re.IGNORECASE.

A sample on regex101 is here.

import re
guid_pattern = r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}'
string = 'Here is a sample guid = 386940a3-f37d-48c4-b3a0-92ab6859e04c'
matches = re.findall(guid_pattern, string)

# matches = ['386940a3-f37d-48c4-b3a0-92ab6859e04c']

Of course, it would be great to make this regex compact, without having to repeat the [a-f0-9]{4}- block thrice.
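
One possible way to do this is to repeat a non-capturing group three times. The sketch below assumes the same GUID layout as above; the name compact_guid_pattern is only illustrative, and re.IGNORECASE is added so that uppercase GUIDs match as well.

```python
import re

# (?:...) is a non-capturing group, so findall still returns whole matches;
# {3} repeats the "four hex digits plus hyphen" block three times.
compact_guid_pattern = r"[a-f0-9]{8}-(?:[a-f0-9]{4}-){3}[a-f0-9]{12}"

string = "Here is a sample guid = 386940a3-f37d-48c4-b3a0-92ab6859e04c"
matches = re.findall(compact_guid_pattern, string)
print(matches)  # ['386940a3-f37d-48c4-b3a0-92ab6859e04c']

# The same pattern with re.IGNORECASE also accepts uppercase hex digits.
upper = re.findall(compact_guid_pattern, string.upper(), re.IGNORECASE)
print(upper)    # ['386940A3-F37D-48C4-B3A0-92AB6859E04C']
```

The group is written as (?:...) rather than (...) because, with a capturing group, findall would return the group's contents instead of the full GUID.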
