import re
import pandas as pd


class NotStringError(Exception):
    pass


def find_matches(text_list, regex):
    # Return the first part of each string in text_list that matches the regex
    # (None where a string has no match). A single string is wrapped in a list.
    if not isinstance(text_list, list):
        text_list = [text_list]
    for i, text in enumerate(text_list):
        if not isinstance(text, str):
            raise NotStringError(f'The {i}th item is not a string.')
    matches = [re.search(regex, text) for text in text_list]
    return [None if match is None else match[0] for match in matches]
Which strings contain a match for the following regular expression, "1+1$"?
Remember: "+" matches the preceding literal or sub-expression one or more times, and "$" matches the position at the end of a string.
texts = ['What is 1+1?', 'Make a wish at 11:11', '111 Ways to Succeed']
regex10 = r'1+1$'
find_matches(texts, regex10)
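As a quick sanity check of how "+" and "$" combine, the sketch below runs the same pattern on the three texts above plus two extra strings that end in a run of 1s. The helper name first_match is illustrative and not part of the original notebook.

```python
import re


def first_match(pattern, text):
    # Hypothetical helper: return the matched substring, or None if no match.
    m = re.search(pattern, text)
    return None if m is None else m[0]


pattern = r'1+1$'  # one or more 1s followed by a final 1 at the end of the string
samples = ['What is 1+1?', 'Make a wish at 11:11', '111 Ways to Succeed',
           'fas11', ' askldfh 1111']
results = [first_match(pattern, s) for s in samples]
print(results)  # [None, '11', None, '11', '1111']
```

Only strings whose last characters are at least two consecutive 1s produce a match, because "$" anchors the final 1 to the end of the string.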
"fas11" " askldfh 1111" "aasdfa111111111"
Given the text in the cell below, which of the following patterns matches exactly the email addresses (including the angle brackets)?
records = [
    '<record> Josh Hug <hug@cs.berkeley.edu> Faculty </record>',
    '<record> Manana Hakobyan <manana.hakobyan@berkeley.eud> TA </record>',
    '<@>'
]
regex5 = r'<.*@.*>'
regex6 = r'<[^<]+@[^>]+>'
regex7 = r'<.*@\w+\..*>'
find_matches(records, regex6)
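To see why the three candidates behave differently, the sketch below applies each one to a real record and to the degenerate '<@>' string. It is a minimal comparison, assuming re.search semantics (first match, greedy quantifiers); the helper first_match and the dictionary name candidates are illustrative.

```python
import re


def first_match(pattern, text):
    # Hypothetical helper: return the matched substring, or None if no match.
    m = re.search(pattern, text)
    return None if m is None else m[0]


records = [
    '<record> Josh Hug <hug@cs.berkeley.edu> Faculty </record>',
    '<@>',
]
candidates = {
    'regex5': r'<.*@.*>',
    'regex6': r'<[^<]+@[^>]+>',
    'regex7': r'<.*@\w+\..*>',
}

# Map each pattern name to its first match in every record.
results = {name: [first_match(p, r) for r in records]
           for name, p in candidates.items()}
for name, found in results.items():
    print(name, found)
```

The greedy ".*" in regex5 and regex7 runs across the inner angle brackets and swallows the whole record, while the negated classes in regex6 stop at the nearest "<" and ">" and also reject the empty address '<@>'.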
For each pattern specify the starting and ending position of the first match in the string. The index starts at zero and we are using closed intervals (both endpoints are included).
. | abcdefg | abcs! | ab abc | abc, 123 |
---|---|---|---|---|
abc* | [0,2] | . | . | . |
[^\s]+ | . | . | . | . |
ab.*c | . | . | . | . |
[a-z1,9]+ | . | . | . | . |
Note: Try using the https://regex101.com/ tool to understand more about regular expressions!
# Save all of our strings and regular expressions as string objects.
string1 = 'abcdefg'
string2 = 'abcs!'
string3 = 'ab abc'
string4 = 'abc, 123'
# Raw strings keep backslashes such as \s from being read as string escapes.
regex1 = r'abc*'
regex2 = r'[^\s]+'
regex3 = r'ab.*c'
regex4 = r'[a-z1,9]+'
# One row per regex, one column per string; each cell is find_matches' one-item list.
strings = [string1, string2, string3, string4]
regexes = [regex1, regex2, regex3, regex4]
q3data = [[find_matches(s, r) for s in strings] for r in regexes]
q3table = pd.DataFrame(q3data, columns=strings, index=regexes)
q3table
# Need to query the original strings to get index start and end positions
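One way to get those positions is to read them off the match object. The sketch below is a minimal example, assuming the closed-interval convention from the table; closed_span is a hypothetical helper name, not part of the original notebook.

```python
import re


def closed_span(pattern, text):
    # Hypothetical helper: [start, end] of the first match, both inclusive,
    # or None when there is no match.
    m = re.search(pattern, text)
    if m is None:
        return None
    # Match.end() is exclusive, so subtract 1 to get a closed interval.
    return [m.start(), m.end() - 1]


print(closed_span(r'abc*', 'abcdefg'))  # [0, 2]
```

This reproduces the [0,2] entry already filled in for "abc*" on "abcdefg"; the remaining cells can be checked the same way.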
Write a regular expression that matches strings (including the empty string) that contain only lowercase letters and numbers.
regex11 = r'^[a-z0-9]*$'
string6 = ['adsf04RTS!','asdfa342','RA43','adsfa!']
find_matches(string6,regex11)
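When checking a candidate pattern for this question, it helps to include edge cases beyond the four test strings. The sketch below assumes a positive character class, which is one possible answer, and adds the empty string and a string containing a space; the names pattern, samples, and accepted are illustrative.

```python
import re

# Assumed candidate: allow only lowercase letters and digits, zero or more times.
pattern = r'^[a-z0-9]*$'
samples = ['adsf04RTS!', 'asdfa342', 'RA43', 'adsfa!', '', 'a b']

# Keep only the strings that the pattern accepts.
accepted = [s for s in samples if re.search(pattern, s) is not None]
print(accepted)  # ['asdfa342', '']
```

A negated class such as "[^A-Z!]" gives the same answer on the four original strings but would also accept 'a b', since a space is neither an uppercase letter nor "!".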
Write a regular expression that matches strings that contain exactly 5 vowels.
Remember:
"^" matches the position at the beginning of a string (unless used for negation as in "[^]").
"*" matches the preceding literal or sub-expression zero or more times.
"[ ]" matches any one of the characters inside of the brackets.
"{ }" indicates the {minimum, maximum} number of matches.
"$" matches the position at the end of a string.
regex8 = ''
string5 = ['fabulous', 'berkeley', 'go uc berkeley', 'GO UC Berkeley', 'vowels are fun', 'vowels are great']
find_matches(string5,regex8)
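One possible approach, sketched below, is to repeat the group "any non-vowels, then one vowel" exactly five times and then allow only non-vowels up to the end. This assumes that uppercase vowels count and that "y" is not a vowel; the names pattern, samples, and matching are illustrative.

```python
import re

# Assumption: vowels are a, e, i, o, u in either case; "y" is not counted.
pattern = r'^([^aeiouAEIOU]*[aeiouAEIOU]){5}[^aeiouAEIOU]*$'
samples = ['fabulous', 'berkeley', 'go uc berkeley', 'GO UC Berkeley',
           'vowels are fun', 'vowels are great']

# Keep the strings that contain exactly five vowels.
matching = [s for s in samples if re.search(pattern, s) is not None]
print(matching)  # ['go uc berkeley', 'GO UC Berkeley', 'vowels are fun']
```

The anchors matter here: without "^" and "$", a string with six vowels would still contain a five-vowel substring and would match.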
Given that address is a string, use re.sub to replace all vowels with a lowercase letter "o". For example, "123 Orange Street" would be changed to "123 orongo Stroot".
address = "123 Orange Street"
regex12 = r""
re.sub(regex12, "o", address)
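A minimal sketch of one possible answer, assuming that vowels means a, e, i, o, u in both cases (the variable name vowel_pattern is illustrative):

```python
import re

address = "123 Orange Street"
# Assumption: match any single vowel, uppercase or lowercase.
vowel_pattern = r"[aeiouAEIOU]"
replaced = re.sub(vowel_pattern, "o", address)
print(replaced)  # 123 orongo Stroot
```

Because the class matches one character at a time, each vowel is replaced individually and the length of the string is unchanged.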
Given that sometext is a string, use re.sub to replace all clusters of non-vowel characters with a single period. For example, "a big moon, between us..." would be changed to "a.i.oo.e.ee.u.".
sometext = "a big moon, between us..."
regex9 = r""
re.sub(regex9, ".", sometext)
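One way to treat a whole cluster as a single unit is a negated character class followed by "+". The sketch below assumes lowercase vowels only, so an uppercase vowel would be treated as a non-vowel; cluster_pattern is an illustrative name.

```python
import re

sometext = "a big moon, between us..."
# Assumption: a cluster is one or more consecutive characters that are not a, e, i, o, u.
cluster_pattern = r"[^aeiou]+"
collapsed = re.sub(cluster_pattern, ".", sometext)
print(collapsed)  # a.i.oo.e.ee.u.
```

Dropping the "+" would replace every non-vowel character separately and produce runs of periods instead of one per cluster.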
Given sometext = "I've got 10 eggs, 20 gooses, and 30 giants.", use re.findall to extract all the items and quantities from the string. The result should look like ['10 eggs', '20 gooses', '30 giants']. You may assume that a space separates quantity and type, and that each item ends in s.
sometext = "I've got 10 eggs, 20 gooses, and 30 giants."
regex13 = r""
re.findall(regex13, sometext)
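A possible pattern, sketched under the stated assumptions (digits, one space, then a word ending in "s"); item_pattern is an illustrative name:

```python
import re

sometext = "I've got 10 eggs, 20 gooses, and 30 giants."
# One or more digits, a single space, then word characters ending in "s".
item_pattern = r"\d+ \w+s"
items = re.findall(item_pattern, sometext)
print(items)  # ['10 eggs', '20 gooses', '30 giants']
```

With no capturing groups in the pattern, re.findall returns the full matched substrings.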
Given the text in the variable log below, fill in the regular expression in the variable pattern so that, after the code executes, day is 26, month is Jan, and year is 2014.
log = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"'
pattern = r""
matches = re.findall(pattern, log)
day, month, year = matches[0]
[day, month, year]
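One way to fill in the pattern is to anchor on the opening square bracket and capture the three slash-separated fields. The sketch below is an assumption about the intended answer, not the only valid one; date_pattern is an illustrative name.

```python
import re

log = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"'
# Literal "[", then three capture groups: day digits, month letters, year digits.
date_pattern = r"\[(\d+)/(\w+)/(\d+)"
found = re.findall(date_pattern, log)
day, month, year = found[0]
print([day, month, year])  # ['26', 'Jan', '2014']
```

With several capture groups, re.findall returns a list of tuples, which is why the first element unpacks into three values. All three are strings; wrap day and year in int() if numbers are needed.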