Discussion 4 Regex Supplemental Notebook

In [1]:
import re
import pandas as pd
In [2]:
class NotStringError(Exception):
    pass

def find_matches(text_list, regex):
    # Return the first substring of each string in text_list that matches the regex (None if there is no match).
    if not isinstance(text_list, list):
        text_list = [text_list]
    
    for i, text in enumerate(text_list):
        if not isinstance(text, str):
            raise NotStringError(f'Item {i} is not a string.')
    
    matches = [re.search(regex, text) for text in text_list]
        
    return [None if match is None else match[0] for match in matches]

Question 1

Which strings contain a match for the following regular expression, "1+1$"?

Remember: "+" matches the preceding literal or sub-expression one or more times, and "$" matches the position at the end of a string.

In [3]:
texts = ['What is 1+1?', 'Make a wish at 11:11', '111 Ways to Succeed']
regex10 = r'1+1$'
find_matches(texts, regex10)
Out[3]:
[None, '11', None]
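
To see why only the second string matches, it can help to look at where the match sits in the string. The sketch below reuses the strings and pattern from the cell above; the helper name match_span is introduced here for illustration only and is not part of the original notebook.

```python
import re

texts = ['What is 1+1?', 'Make a wish at 11:11', '111 Ways to Succeed']

def match_span(text, regex):
    # Hypothetical helper: return (matched text, start, end) or None.
    # `end` is exclusive, following Python's slicing convention.
    m = re.search(regex, text)
    return None if m is None else (m[0], m.start(), m.end())

spans = [match_span(t, r'1+1$') for t in texts]
print(spans)
```

In the first string the "+" is a literal character in the text, and the string ends in "?", so nothing can match at the end. In the third string the run of 1s is at the start, not the end. Only the second string ends in two or more 1s.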

In [4]:
"fas11" " askldfh 1111" "aasdfa111111111"
Out[4]:
'fas11 askldfh 1111aasdfa111111111'

Question 2

Given the text in the cell below, which of the following patterns match exactly the email addresses (including the angle brackets)?

In [5]:
records = [
    '<record> Josh Hug <hug@cs.berkeley.edu> Faculty </record>',
    '<record> Manana Hakobyan <manana.hakobyan@berkeley.eud> TA </record>',
    '<@>'
]
regex5 = r'<.*@.*>'
regex6 = r'<[^<]+@[^>]+>'
regex7 = r'<.*@\w+\..*>'
In [6]:
find_matches(records, regex6)
Out[6]:
['<hug@cs.berkeley.edu>', '<manana.hakobyan@berkeley.eud>', None]
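
Running all three candidate patterns side by side may make the difference clearer. The sketch below copies the records and patterns from the cell above; the names patterns, first_match, and results are assumptions made for this example.

```python
import re

records = [
    '<record> Josh Hug <hug@cs.berkeley.edu> Faculty </record>',
    '<record> Manana Hakobyan <manana.hakobyan@berkeley.eud> TA </record>',
    '<@>',
]
patterns = {
    'regex5': r'<.*@.*>',
    'regex6': r'<[^<]+@[^>]+>',
    'regex7': r'<.*@\w+\..*>',
}

def first_match(regex, text):
    # Return the text of the first match, or None if there is none.
    m = re.search(regex, text)
    return None if m is None else m[0]

results = {name: [first_match(rx, rec) for rec in records]
           for name, rx in patterns.items()}
for name, found in results.items():
    print(name, found)
```

Because ".*" is greedy and can cross angle brackets, regex5 and regex7 match each full record from the first "<" to the last ">". The negated character classes in regex6 keep the match inside a single pair of brackets and require at least one character on each side of the "@", which is why it rejects '<@>'.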

Question 3

Skip for now and come back later; if there is time, review the regex syntax.

For each pattern specify the starting and ending position of the first match in the string. The index starts at zero and we are using closed intervals (both endpoints are included).

Pattern      abcdefg    abcs!    ab abc    abc, 123
abc*         [0,2]      .        .         .
[^\s]+       .          .        .         .
ab.*c        .          .        .         .
[a-z1,9]+    .          .        .         .

Note: Try using the https://regex101.com/ tool to understand more about regular expressions!

In [7]:
# Save all of our strings and regular expressions as string objects.
# Raw strings (r'...') keep backslashes such as \s from being treated as string escapes.
string1 = 'abcdefg'
string2 = 'abcs!'
string3 = 'ab abc'
string4 = 'abc, 123'
regex1 = r'abc*'
regex2 = r'[^\s]+'
regex3 = r'ab.*c'
regex4 = r'[a-z1,9]+'
In [8]:
# Build one row per regex and one column per string.
strings = [string1, string2, string3, string4]
regexes = [regex1, regex2, regex3, regex4]
q3data = [[find_matches(string, regex) for string in strings] for regex in regexes]

q3table = pd.DataFrame(q3data, columns=strings, index=regexes)

q3table

# The table shows the matched text only; the start and end positions still have to be looked up in the original strings.
Out[8]:
             abcdefg      abcs!      ab abc      abc, 123
abc*         [abc]        [abc]      [ab]        [abc]
[^\s]+       [abcdefg]    [abcs!]    [ab]        [abc,]
ab.*c        [abc]        [abc]      [ab abc]    [abc]
[a-z1,9]+    [abcdefg]    [abcs]     [ab]        [abc,]
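
The table above shows the matched text, while the question asks for positions. A minimal sketch of one way to get them, assuming the closed-interval convention from the question: a match object's start() is the first index, and end() is one past the last index, so the closed interval is [start, end - 1]. The helper name closed_span is made up for this example.

```python
import re

strings = ['abcdefg', 'abcs!', 'ab abc', 'abc, 123']
regexes = [r'abc*', r'[^\s]+', r'ab.*c', r'[a-z1,9]+']

def closed_span(regex, text):
    # Hypothetical helper: closed interval [first index, last index]
    # of the first match, or None if there is no (non-empty) match.
    m = re.search(regex, text)
    if m is None or m.end() == m.start():
        return None
    return [m.start(), m.end() - 1]

spans = {rx: [closed_span(rx, s) for s in strings] for rx in regexes}
for rx, row in spans.items():
    print(rx, row)
```

Note that "[a-z1,9]" is a character class containing the lowercase letters, the digits 1 and 9, and a comma; it is not a range from 1 to 9, which is why the match on "abc, 123" stops after the comma.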

Question 4

Write a regular expression that matches strings (including the empty string) that contain only lowercase letters and numbers.

In [9]:
regex11 = r'^[a-z0-9]*$'
string6 = ['adsf04RTS!','asdfa342','RA43','adsfa!']
find_matches(string6,regex11)
Out[9]:
[None, 'asdfa342', None, None]
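
A character class that lists only the allowed characters tends to be safer than one that excludes a few disallowed characters, because anything not excluded (spaces, underscores, other punctuation) would still be accepted. The sketch below compares the two approaches using re.fullmatch, which requires the whole string to match. The extra test strings and the variable names are assumptions added for this example.

```python
import re

allow_list = r'[a-z0-9]*'   # only lowercase letters and digits
deny_list = r'[^A-Z!]*'     # anything except uppercase letters and "!"

candidates = ['adsf04RTS!', 'asdfa342', 'RA43', 'adsfa!', '', 'abc def', 'abc_1']

accepted_allow = [s for s in candidates if re.fullmatch(allow_list, s)]
accepted_deny = [s for s in candidates if re.fullmatch(deny_list, s)]

print(accepted_allow)
print(accepted_deny)
```

Both patterns accept 'asdfa342' and the empty string, but the deny-list version also lets 'abc def' and 'abc_1' through, which contain characters other than lowercase letters and numbers.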

Question 5

Come back to this question after Question 3.

Write a regular expression that matches strings that contain exactly 5 vowels.

Remember:

  • ^ matches the position at the beginning of a string (unless used for negation as in "[^]").
  • * matches the preceding literal or sub-expression zero or more times.
  • [ ] matches any one of the characters inside of the brackets.
  • { } indicates the {minimum, maximum} number of matches.
  • $ matches the position at the end of a string.
In [10]:
regex8 = ''
string5 = ['fabulous', 'berkeley', 'go uc berkeley', 'GO UC Berkeley', 'vowels are fun', 'vowels are great']
find_matches(string5,regex8)
Out[10]:
['', '', '', '', '', '']
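
The cell above leaves the pattern blank, and an empty pattern matches the empty string at the start of every input. One possible way to fill it in, offered as a sketch rather than the official solution: repeat the group "any number of non-vowels followed by one vowel" exactly five times, then allow only non-vowels up to the end. This assumes uppercase vowels count as vowels and "y" does not.

```python
import re

# Five repetitions of (non-vowels, then one vowel), then trailing non-vowels.
five_vowels = r'^([^aeiouAEIOU]*[aeiouAEIOU]){5}[^aeiouAEIOU]*$'

words = ['fabulous', 'berkeley', 'go uc berkeley', 'GO UC Berkeley',
         'vowels are fun', 'vowels are great']

# Keep only the strings that contain exactly five vowels.
matched = [w for w in words if re.search(five_vowels, w)]
print(matched)
```

The "^" and "$" anchors matter here: without them, the pattern would also match inside strings with more than five vowels.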

Question 6

Given that address is a string, use re.sub to replace all vowels with the lowercase letter "o". For example, "123 Orange Street" would be changed to "123 orongo Stroot".

In [11]:
address = "123 Orange Street"
regex12 = r""
re.sub(regex12, "o", address)
Out[11]:
'o1o2o3o oOoroaonogoeo oSotoroeoeoto'
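
The output above comes from the blank pattern: an empty regex matches at every position, so re.sub inserts an "o" between all characters. A sketch of one possible answer, assuming both uppercase and lowercase vowels should be replaced, is a character class listing the vowels:

```python
import re

address = "123 Orange Street"

# Replace each vowel (either case) with a lowercase "o".
replaced = re.sub(r'[aeiouAEIOU]', 'o', address)
print(replaced)
```

Every vowel is replaced, including the "a" in "Orange", so the result is '123 orongo Stroot'.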

Question 7

Given that sometext is a string, use re.sub to replace every cluster of non-vowel characters with a single period. For example, "a big moon, between us..." would be changed to "a.i.oo.e.ee.u.".

In [12]:
sometext = "a big moon, between us..."
regex9 = r""
re.sub(regex9, ".", sometext)
Out[12]:
'.a. .b.i.g. .m.o.o.n.,. .b.e.t.w.e.e.n. .u.s.......'
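
As in the previous question, the blank pattern matches at every position, which explains the output above. One possible answer, sketched under the assumption that vowels of either case should be kept, is a negated character class followed by "+", so that each run of consecutive non-vowels is replaced as a single unit:

```python
import re

sometext = "a big moon, between us..."

# "[^...]+" matches one or more consecutive characters that are not vowels.
collapsed = re.sub(r'[^aeiouAEIOU]+', '.', sometext)
print(collapsed)
```

Dropping the "+" would replace each non-vowel character separately and produce several periods in a row.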

Question 8

Given sometext = "I've got 10 eggs, 20 gooses, and 30 giants.", use re.findall to extract all the items and quantities from the string. The result should look like ['10 eggs', '20 gooses', '30 giants']. You may assume that a space separates quantity and type, and that each item ends in s.

In [13]:
sometext = "I've got 10 eggs, 20 gooses, and 30 giants."
regex13 = r""
re.findall(regex13, sometext)
Out[13]:
['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
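
The long list of empty strings above is what re.findall returns for a blank pattern: one empty match per position in the string. A sketch of one pattern that satisfies the stated assumptions (digits, a single space, then a word ending in "s"):

```python
import re

sometext = "I've got 10 eggs, 20 gooses, and 30 giants."

# \d+  one or more digits (the quantity)
# ' '  the single space separating quantity and item
# \w+s one or more word characters, the last of which is "s"
items = re.findall(r'\d+ \w+s', sometext)
print(items)
```

The greedy "\w+" first consumes the whole word and then backtracks by one character so that the final "s" can match.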

Question 9

Given the text in the variable log defined in the cell below, fill in the regular expression in the variable pattern so that, after the cell executes, day is 26, month is Jan, and year is 2014.

In [15]:
log = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"'
pattern = r""
matches = re.findall(pattern, log)
day, month, year = matches[0]
[day, month, year]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-b7115837b43e> in <module>
      2 pattern = r""
      3 matches = re.findall(pattern, log)
----> 4 day, month, year = matches[0]
      5 [day, month, year]

ValueError: not enough values to unpack (expected 3, got 0)
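
The error above occurs because the blank pattern has no capture groups, so matches[0] is an empty string and cannot be unpacked into three values. When a pattern contains several groups, re.findall returns a list of tuples with one entry per group. A sketch of one pattern that could work here, assuming the date always follows the opening square bracket in day/month/year form:

```python
import re

log = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
       '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
       '"http://anson.ucdavis.edu/courses/"')

# \[ matches a literal "["; each pair of parentheses captures one field.
pattern = r'\[(\d+)/(\w+)/(\d+)'
matches = re.findall(pattern, log)
day, month, year = matches[0]
print([day, month, year])
```

The captured values are strings, so day is '26' rather than the integer 26; int(day) would convert it if a number is needed.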
