Regex Q&A

In [3]:
import re

Q

can regex be used to parse through dictionaries and other objects of similar typing?

A

re.findall() and re.sub() take a string as input. You will have to iterate over other structures and feed re the strings one at a time.
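
A minimal sketch of that iteration, assuming a hypothetical dictionary `pets` whose values are strings (the names and data here are made up for illustration):

```python
import re

# Hypothetical data; the re functions only accept strings (or bytes).
pets = {'alice': 'cat and rat', 'bob': 'dog', 'carol': 'bat'}

# Feed each value to re.findall one at a time and keep the matches per key.
matches = {key: re.findall(r'\b\w+at\b', value) for key, value in pets.items()}
print(matches)
```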


Q

Can you search for terms by length?

A

Yes. A quantifier in braces fixes how many times the preceding element repeats. E.g. re.findall('[a-z]{2}',"the cat sat on the mat.") returns every non-overlapping run of exactly two lower-case letters, including pieces of longer words ('th', 'ca', ...). If you specify {1,3} instead, it matches runs of 1, 2, or 3 characters matching the expression (lower case a-z in this example). To match whole words of a given length, add a word boundary \b on each side, as in r'\b[a-z]{2}\b'.

In [6]:
re.findall(r'\b[a-z]{2}\b','The cat sat on the mat in the doorway.')
Out[6]:
['on', 'in']

Q

How do I create an expression that properly uses 'or'? As in, either of two

A

The vertical bar: 'a|b' matches text that satisfies either expression a or expression b.

In [8]:
re.findall('color|colour','Growing up in Ireland, I tend to write "colour" instead of "color".')
Out[8]:
['colour', 'color']
In [9]:
re.findall('colou?r','Growing up in Ireland, I tend to write "colour" instead of "color".')
Out[9]:
['colour', 'color']

Q

How do we encrypt passwords using regex?

A

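This is not really what regex is for. Regular expressions find patterns in text; they do not encrypt anything. For passwords, use a library built for the job, such as the key-derivation functions in Python's hashlib.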


Q

example, finding 'ar' in 'I love Richard.'

A

re.findall("ar", ...)
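
Filling in the string from the question, the call might look like this:

```python
import re

# findall returns every non-overlapping occurrence of the pattern.
found = re.findall('ar', 'I love Richard.')
print(found)
```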


Q

Delete all words in a dictionary with a vowel?

A

Not really a pattern. Better done with pure Python.
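
A pure-Python sketch, assuming "dictionary" means a Python dict keyed by word (the dict `counts` is hypothetical):

```python
# Hypothetical dictionary keyed by word.
counts = {'cat': 3, 'hymn': 1, 'dog': 2, 'rhythm': 5}
vowels = set('aeiou')

# Keep only the entries whose key shares no letter with the vowel set.
no_vowels = {word: n for word, n in counts.items() if not vowels & set(word.lower())}
print(no_vowels)
```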


Q

Do regular expressions (re.findall) work well on a document with hundreds of thousands of words (for instance, a textbook)? Is the runtime good?

A

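It depends on the pattern. Python's re is not the fastest regex engine, but it handles a few hundred thousand words comfortably for simple patterns. Patterns with nested or ambiguous quantifiers can backtrack heavily and become very slow. If you reuse a pattern many times, compile it once with re.compile().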


Q

word with two letters in front of 'at', without the words with one letter in

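A

Require exactly two letters before 'at' and put a word boundary on each side: r'\b[a-z]{2}at\b'. In 'The flat cat spat on the mat.' this matches 'flat' and 'spat' but not 'cat' or 'mat'.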

Q

items validates my expression if they exist in the string?

A

In [10]:
if re.search(r'\bh[aeiou]*ck\b','what the heck?!'):
    print('Got a match!')
Got a match!

Q

how to find a list of key words? how to identify that the key words are not only being mentioned but actually being implemented?

A

We can use a for loop to search for each keyword
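
A minimal sketch of such a loop, with a made-up text and keyword list. It only shows whether each keyword is mentioned; deciding whether something is actually "implemented" needs more than pattern matching.

```python
import re

# Hypothetical text and keywords for illustration.
text = 'We import numpy and pandas, then plot with matplotlib.'
keywords = ['numpy', 'pandas', 'scipy']

found = {}
for kw in keywords:
    # re.escape guards against keywords that contain regex metacharacters.
    found[kw] = bool(re.search(r'\b' + re.escape(kw) + r'\b', text))
print(found)
```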


Q

Can you parse through a string and search for terms by length?

A

Yes, you can use a for loop to go through the string word by word, isolating each term and taking its length with len().
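
One way that loop might look, as a sketch that groups words by length (the punctuation stripped here is an assumption about what counts as part of a word):

```python
sentence = 'The cat sat on the mat in the doorway.'

lengths = {}
for word in sentence.split():
    word = word.strip('.,;:!?')  # drop surrounding punctuation
    lengths.setdefault(len(word), []).append(word)
print(lengths)
```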


Q

Is every other letter of a string a vowel?

A

JR: I'd be tempted to do this with pure Python. Something like this:

In [11]:
s = 'mississippi'
# s[1::3] takes every third letter starting at index 1; use s[1::2] to test every other letter
set(s[1::3]).issubset( set('aeiou') )
Out[11]:
True

Q

How do you use regex on data base?

A

Some databases have their own implementation of regular expressions. It is certainly possible in SQL, but the syntax differs between database systems, so spelling it out here would be lengthy. If by "data base" you mean something like a data frame, you could define a lambda expression containing the regular expression you need and apply it to a specific column of your dataframe with the map() function.
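
For the SQL case, here is a sketch using the sqlite3 module from the standard library. SQLite has a REGEXP operator but ships no implementation for it, so the helper `regexp` below is supplied by us; the table `pets` and its rows are made up.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(':memory:')
conn.create_function('REGEXP', 2, regexp)
conn.execute('CREATE TABLE pets (name TEXT)')
conn.executemany('INSERT INTO pets VALUES (?)', [('cat',), ('dog',), ('bat',)])

# Select the names that end in 'at'.
rows = conn.execute(
    'SELECT name FROM pets WHERE name REGEXP ? ORDER BY name', ('at$',)
).fetchall()
conn.close()
print(rows)
```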


Q

when we use '.*' to find the longest string, how does python decide that it is the end of the string? Does it use the period sign? Could we customize it?

A

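'.' matches any character except a newline and '*' is greedy, so '.*' first runs to the end of the line and then backs up until the rest of the pattern matches. The code is therefore not explicitly searching for the end of the string, but for the last instance of whatever follows '.*' in the pattern, which in our case is close to the end of the string. The period sign in the text plays no special role. To customize this, use '.*?' for the shortest match, or replace '.' with a character class such as '[^.]*' to stop at a period.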
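
A small comparison of the greedy and non-greedy forms, using a sentence like the ones elsewhere in these notes:

```python
import re

s = 'The cat sat on the mat.'

# Greedy: '.*' grabs as much as it can, so the match ends at the LAST 'at'.
greedy = re.findall(r'c.*at', s)

# Non-greedy: '.*?' grabs as little as it can, so the match ends at the FIRST 'at'.
lazy = re.findall(r'c.*?at', s)
print(greedy, lazy)
```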


Q

How do you look for a decimal number (ex: 0.5) in regex?

A

\d+\.\d+ (the dot must be escaped; an unescaped . matches any character)

In [12]:
re.findall(r'\d+\.\d+','Is 3.14159 > 2.71?')
Out[12]:
['3.14159', '2.71']

Q

with the findall example we used, how do we find the word 'at' and not part of a word

A

In [13]:
re.findall( r'\bat\b', 'The cat spat at the mat.' )  # \b for word boundary
Out[13]:
['at']

Q

How to find if an expression is repeated in the same word in a string?

A

Choose your expression, then count how many times it matches within each word, e.g. with len(re.findall(expr, word)). A count above 1 means it is repeated.
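
A sketch of that count, with a made-up expression and text:

```python
import re

expr = 'an'  # hypothetical expression to look for
text = 'banana pan Mississippi'

# Keep the words in which the expression matches more than once.
repeated = [word for word in text.split() if len(re.findall(expr, word)) > 1]
print(repeated)
```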


Q

can we use this to create a method that find and replace words in document

A

Yes, re.sub() does this: it replaces every match of a pattern with the replacement string you give it.
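
For instance, a minimal find-and-replace on a made-up document string:

```python
import re

document = 'The cat sat on the mat. A cat is a cat.'

# \b keeps the replacement to whole words, so 'category' would be left alone.
updated = re.sub(r'\bcat\b', 'dog', document)
print(updated)
```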


Q

can you ignore certain strings/chars when looking for something? ie ignore word 'cat' when searching for 'at'.

A

In [16]:
re.findall( '[^c]at', 'The cat spat at the mat.' )  
Out[16]:
['pat', ' at', 'mat']

Q

how to find a specific symbol (like \) in Regular Expression?

A

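By typing another '\' before your symbol. Writing the pattern as a raw string (r'...') keeps Python from interpreting the backslashes first.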
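
A short sketch with a made-up Windows-style path, escaping both a backslash and a dot:

```python
import re

path = r'C:\temp\notes.txt'  # hypothetical example string

backslashes = re.findall(r'\\', path)  # '\\' in the pattern matches one literal backslash
dots = re.findall(r'\.', path)         # '\.' matches a literal dot
print(len(backslashes), dots)
```

For longer literal strings, re.escape() adds the backslashes for you.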


Q

Can we use RegEx to find emojis in a text?

A

Possibly, but not simply.
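
A partial sketch, assuming it is enough to cover a single Unicode block (Emoticons, U+1F600 to U+1F64F). Many emoji live outside this range or are built from several code points, which is why the general case is not simple.

```python
import re

# Assumption: only the Unicode "Emoticons" block is covered here.
emoticons = re.compile('[\U0001F600-\U0001F64F]')

found = emoticons.findall('Nice work \U0001F600 see you soon \U0001F609')
print(found)
```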


Q

What is the difference between using this method and using split()?

A

The difference is that the split function can be a little difficult to work with, and can lead to cluttered substrings. For example, suppose I was looping through a directory and extracting specific file names that began with 'Patient[X]', where X can vary from A-Z, and ended with '.svs'. While it's true that we can split on Patient, we would then have to do more work to ensure that the file we are checking ends with .svs instead of something like .txt. Using this library, on the other hand, makes the code far easier, more efficient, and cleaner to write and read. On top of this, if you needed something in the middle of a huge string (say 10,000 characters), you would not have a fun time trying to find the specific substring you want by hand; it would likely be more laborious than using the regex library.

JR: I'd say regex is good when you are looking to match one of a number of possibilities. Here's a good example that arose in the first names project. The task is to find all the names that are "like" Dmitri, which we interpret to mean of the form D-m-t-r-, where the blanks are vowels or sequences of vowels.

In [11]:
import glob
import re

names = set()
for file in glob.glob('names2019/yob*.txt'):
    with open(file) as f:
        # each line unpacks as name,gender,count; skip blank lines
        for name, gender, count in [line.split(',') for line in f.read().split('\n') if len(line) > 2]:
            names.add(name)
print( len(names) )
s = ' '.join(names)
# D-m-t-r- with optional runs of vowels (including y) in the blanks
re.findall(r'\bD[aeiouy]*?m[aeiouy]*?t[aeiouy]*?r[aeiouy]*?\b', s)
98400
Out[11]:
['Domitri',
 'Demeter',
 'Dmitri',
 'Demitrio',
 'Dmetri',
 'Dimitri',
 'Demetria',
 'Dmitry',
 'Dametra',
 'Demitra',
 'Demitria',
 'Dimetri',
 'Demeteria',
 'Damitri',
 'Demitre',
 'Demetrey',
 'Dimitar',
 'Dimitre',
 'Dametria',
 'Demetrie',
 'Dmitriy',
 'Dimitriy',
 'Dimitry',
 'Demitry',
 'Dimitrie',
 'Demetre',
 'Dimitra',
 'Dametri',
 'Demetry',
 'Demetra',
 'Dametre',
 'Dimitria',
 'Demitri',
 'Demeatra',
 'Dimetra',
 'Demitrie',
 'Demetriu',
 'Dmytro',
 'Damitra',
 'Demetri',
 'Demetrio',
 'Demetree']

Q

Is it possible to find out error in sentence. For example find out other language in the content that cannot be read.

A

Define a list or dictionary containing the invalid occurrences ('errors') you expect in sentences, then loop over it, passing each entry to a regex function as the pattern, to report these occurrences.

JR: Regex knows nothing about human languages or meaning. It only deals with patterns in character strings.


Q

How would you specify the number of occurrences a pattern is allowed?

A

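JR: Put a count in braces after the element: {3} means exactly three repetitions, and {2,10} means between 2 and 10. (Inside a character class a comma is a literal comma, so write [A-Za-z], not [A-Z,a-z].)

In [17]:
re.findall( '[A-Za-z]{2,10}at','The cat spat on the mat.')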
Out[17]:
['spat']

Q

How to find a word of different Length containing a specific expression ?

A

r'\b\w*[Yy]\w*\b' With re.findall, this grabs every word containing the expression you are looking for, in this case a capital or lower-case y. Because \w* allows any number of word characters on either side, it matches words of any length containing your expression.
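
A quick check of the pattern r'\b\w*[Yy]\w*\b' on a made-up sentence:

```python
import re

# \w* on both sides lets the word be any length, as long as it contains y or Y.
words = re.findall(r'\b\w*[Yy]\w*\b', 'Yesterday my yellow canary sang happily.')
print(words)
```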


Q

Can we find exact capital letters in regular expressions

A

Yes. Patterns are case-sensitive by default, so an upper-case letter in the pattern matches only that upper-case letter. To search case-insensitively, add the flag re.IGNORECASE.

In [22]:
re.findall(r'J.*?\b','Javier y Jesús') 
Out[22]:
['Javier', 'Jesús']
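
To see what the re.IGNORECASE flag changes, here is a sketch on a variant of the same string with one name deliberately lower-cased:

```python
import re

s = 'Javier y jesús'  # second name lower-cased on purpose

case_sensitive = re.findall(r'\bj\w*', s)                         # only lower-case j
case_insensitive = re.findall(r'\bj\w*', s, flags=re.IGNORECASE)  # j or J
print(case_sensitive, case_insensitive)
```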

Q

How to get the word which contains some letters which is between the words?For

A

re.findall("ar", ...)


Q

How to find a word with specified length in a string contains the expression

A


Q

How can we incorporate regular expressions into flow control and conditionals.

A

In [23]:
if re.search(r'\bh[aeiou]*ck\b','what the heck?!'):
    print('Got a match!')
Got a match!

Q

If I have the sentence 'The flat cat sat on the mat'. How do I just extract the front of 'at'. So, in this case just the word flat. ('..at' shows all the words with a space added for 3 letter words)

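A

Ask for exactly two word characters before 'at', with word boundaries so that no leading space is picked up: re.findall(r'\b\w{2}at\b', 'The flat cat sat on the mat') returns ['flat'].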

Q

suppose you find expressions with a particular string, how do you then report back only the part of the string that was not in the specific search. i.e. if were search for .*at, how do we report only the part that is not 'at'? for cat that would just be c or The c depending on where it is in the string.

A

JR: Wrap the part you want in parentheses () to make it a captured group; findall then returns only the group.

In [24]:
re.findall( '([A-Za-z]*?)at','The cat spat on the mat.')  # captured group ()
Out[24]:
['c', 'sp', 'm']

Q

Does the re library have a way of detecting synonyms?

A

JR: No. This requires knowledge of the natural language. Regex only concerns patterns in character strings.


Q

How would I parse through a directory and get files that begin with 'Case_x' where x can be either A, B, or C and end with '-rendered_img.svs'.

A

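JR: Iterate over filenames and check for a match with re.search( r'^Case_[ABC].*-rendered_img\.svs$' , filename ). The ^ and $ anchor the start and end of the name, and the dot before svs is escaped so it matches only a literal dot.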
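
A sketch of that filter on a hypothetical list of file names; in practice the list could come from os.listdir() on your directory.

```python
import re

# Hypothetical file names for illustration.
filenames = [
    'Case_A-rendered_img.svs',
    'Case_B_slide2-rendered_img.svs',
    'Case_D-rendered_img.svs',
    'Case_C-rendered_img.txt',
    'notes.txt',
]

# fullmatch requires the whole name to fit, so no ^ or $ is needed.
pattern = re.compile(r'Case_[ABC].*-rendered_img\.svs')
selected = [name for name in filenames if pattern.fullmatch(name)]
print(selected)
```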


Q

how to delete arbitrary number of spaces between characters? for instance, "asdf gghdh dhdh. dhhd.

A

In [25]:
re.sub( ' +', ' ', 'The  cat spat   at the      rat.' ) 
Out[25]:
'The cat spat at the rat.'

Q

What if I want to search the string for thing A OR thing B? For example, if I made a todo list and the topic of the todo could be a string like "Research" OR a class title, like in your document, how could I tell it to do that?

A

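A|B, or (A|B) when it is part of a longer pattern. Note that [A|B] is a character class: it matches a single character that is 'A', '|', or 'B', not thing A or thing B.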
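
A sketch for the todo example. The todo strings and the class-title format (2 to 4 capital letters, a space, three digits) are assumptions made up for illustration.

```python
import re

# Hypothetical todo items.
todos = ['Research: read paper', 'STAT 101: homework', 'Groceries: milk']

# Either the literal word Research OR something shaped like a class title.
pattern = re.compile(r'^(?:Research|[A-Z]{2,4} \d{3})\b')
matching = [todo for todo in todos if pattern.search(todo)]

# For contrast: square brackets make a character class, not an alternation.
char_class = re.findall('[A|B]', 'A|B')
print(matching, char_class)
```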
