A Brief Review: Regular Expressions (in Linux)

If you’re like me, you learned regular expressions in Theory of Computation and quickly forgot about them. Regular expressions are a powerful tool on the command line and can save you copious amounts of time when searching for patterns.

A quick side note about regular expressions: Ken Thompson first implemented them in his version of the editor QED. Later, he took the same idea and put it into the Unix editor ed, and that editor is where the command ‘grep’ was born. In ‘ed’, ‘g/re/p’ meant Global search for a Regular Expression and Print the matching lines.

There are two main sets of POSIX regular expressions: basic and extended. Most programs in Linux use the basic type. Here’s a list of some of the programs that use the basic format and the extended format (taken from http://www.grymoire.com/Unix/Regular.html):

Basic: vi, sed, grep, csplit, dbx, dbxtool, more, ed, expr, lex, pg, nl, rdist
Extended: awk, nawk, egrep

The main difference between basic and extended concerns the way in which metacharacters are expressed. With basic regular expressions, several metacharacters (such as ‘{’ and ‘(’) need to be escaped with a backslash in order to be perceived as such. In extended regular expressions, characters such as ‘{’ are assumed to be metacharacters unless they are escaped with a backslash (there are some implicit ways of detecting that a character is not a metacharacter, but that is for another time).
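As a minimal sketch of that difference, the snippet below runs the same interval expression through both flavors of grep. The sample strings and variable names are made up for illustration.

```shell
# Sample input, invented for this example
sample='aa'

# Basic: the braces must be escaped to act as an interval expression
basic_match=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# Extended (-E): unescaped braces are already metacharacters
extended_match=$(printf '%s\n' "$sample" | grep -E 'a{2}')

# Basic with unescaped braces searches for the literal text "a{2}"
literal_match=$(printf '%s\n' 'a{2}' | grep 'a{2}')

echo "basic:    $basic_match"
echo "extended: $extended_match"
echo "literal:  $literal_match"
```

The first two commands find the same line; only the spelling of the pattern changes between the two flavors.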

Here are some of the meta chars used by extended regular expressions:

‘^’ - Match epsilon (empty string) at the beginning of the line [also available in basic regular expressions]
‘$’ - Match epsilon (empty string) at the end of the line [also available in basic regular expressions]
‘.’ - Match any one character [also available in basic regular expressions]
‘[a-m]’ - Match any character a through m lowercase
‘[^a-m]’ - Match any character that is not a through m lowercase
‘?’ - Match one occurrence or none of the preceding object.
‘*’ - Kleene Star [remember this?]. Match zero or arbitrarily many of the preceding object.
‘+’ - The preceding object will be matched one or more times.
a{n} - Match ‘a’ exactly n times.
a{n,} - Match ‘a’ at least n times.
a{n,m} - Match ‘a’ at least n times, but not more than m times.
‘|’ - This is the logical or symbol: match the expression on either side of it.
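The operators in the list above are easy to try out with grep -E. Here is a small sketch that runs a few of them against a word list; the words and variable names are invented for the example.

```shell
# Hypothetical word list, one word per line
words=$(printf '%s\n' color colour colouur cat hat rat)

# '?' makes the preceding 'u' optional
optional_u=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# '+' requires one or more 'u'
one_or_more_u=$(printf '%s\n' "$words" | grep -E '^colou+r$')

# '{2}' requires exactly two 'u'
exactly_two_u=$(printf '%s\n' "$words" | grep -E '^colou{2}r$')

# Bracket expressions: first letter inside, then outside, the range a-m
a_to_m=$(printf '%s\n' "$words" | grep -E '^[a-m]at$')
not_a_to_m=$(printf '%s\n' "$words" | grep -E '^[^a-m]at$')

# '|' picks either alternative inside the group
either=$(printf '%s\n' "$words" | grep -E '^(cat|rat)$')

echo "$optional_u"
```

The ‘^’ and ‘$’ anchors keep each pattern tied to the whole line; without them, grep would report any line that merely contains a match.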

Now check this out:

$ ls -la /etc/ | grep -E 'profile$'

This command will match all the files in /etc/ whose names end with profile.
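To see what the ‘$’ anchor contributes, here is a small sketch that runs the same kind of filter against a scratch directory. The directory and file names are invented for illustration, and plain ls is used instead of ls -la to keep the output predictable.

```shell
# Scratch directory with invented file names, so the result is predictable
demo_dir=$(mktemp -d)
touch "$demo_dir/profile" "$demo_dir/bash_profile" \
      "$demo_dir/profile.d" "$demo_dir/profiles"

# With the anchor, only names that END in "profile" survive the filter
ends_with_profile=$(ls "$demo_dir" | grep -E 'profile$')

# Without the anchor, every name containing "profile" is counted
contains_count=$(ls "$demo_dir" | grep -cE 'profile')

echo "$ends_with_profile"
echo "names containing profile: $contains_count"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Dropping the anchor lets profile.d and profiles through as well, which is usually not what you want when hunting for one specific file.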

See if you can use this to find the meaning of the site’s subtitle.

For more information visit:
http://www.ibm.com/developerworks/aix/library/au-speakingunix9/
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/posix.html
And finally, always consult your man pages when in doubt. There are some good instructions in the man pages for grep, especially about the differences between basic and extended regular expressions.

Found Objects

find | egrep '\.(c|cpp|java|sh)$'
