Real World Regexes
Dan mentioned that he wasn’t that knowledgeable about regular expressions (a topic I am intimately familiar with), so I figured I’d put up some examples from code I’ve actually written, along with the text they’re actually supposed to match.
To begin with, here are the general rules for regexes. In this list, “operator” refers to any of these (so \s+, [A-Z], (Word), etc). Greedy means it’ll continue matching as far as possible: if the operator/character you want to stop at occurs more than once in the string, it’ll run past the first one and only stop matching at the last one.
. Match any character
\w Match “word” character (alphanumeric plus “_”)
\W Match non-word character
\s Match whitespace character
\S Match non-whitespace character
\d Match digit character
\D Match non-digit character
\t Match tab
\n Match newline
\r Match return
\f Match formfeed
\a Match alarm (bell, beep, etc)
\e Match escape
^ Beginning of the line
$ End of the line
+ matches the preceding operator one or more times (greedy)
* matches the preceding operator zero or more times (greedy)
? matches the preceding operator once if it exists, but it doesn’t have to be there. Mostly used to stop greedy operators (*? or +?, for instance) at the match you want.
() is used for grouping (either to use later as a backreference or to exclude)
(?<name>) (or (?P<name>) in Python and maybe others) is used for a named backreference. There’ll be some examples of that.
| is used as a logical or
{n} is used to match the preceding operator exactly n times
{n,m} matches n to m times
{n,} matches n or more times ({1,} is the same thing as +)
[A-Za-z] is used to match whatever is in the middle, but it only counts as one character (so [A-Za-z] would match any of those characters ONCE. Useful if you want [a-f] or [0-5]+ or something).
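[^] is used to exclude things. [^word] matches any ONE character that isn’t “w”, “o”, “r”, or “d”. Groups don’t work inside brackets, so [^(word)] just adds literal parentheses to the excluded set. To exclude a run of characters, put a quantifier after the brackets, like [^word]+.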
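Here’s a quick sketch of the rules that trip people up most (greedy vs. non-greedy, {n,}, and negated character classes), using Python’s re module. The sample strings are made up for illustration, they aren’t from any real log.

```python
import re

text = 'say "one" and "two"'

# Greedy: .* runs all the way to the LAST closing quote.
greedy = re.search(r'"(.*)"', text).group(1)

# Non-greedy: .*? stops at the FIRST closing quote.
lazy = re.search(r'"(.*?)"', text).group(1)

# {2,} means "two or more" of the preceding operator.
two_plus = re.findall(r'\d{2,}', '7 42 1234')

# A negated class excludes each listed character individually,
# and matches exactly one character per use.
not_in_word = re.findall(r'[^word]', 'sword!')

print(greedy)
print(lazy)
print(two_plus)
print(not_in_word)
```

The greedy search captures one" and "two, while the non-greedy one captures just one.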
Sound confusing? It is, which is why I’ll put up real examples. FYI, these are PCRE (Perl Compatible Regular Expressions) rather than the POSIX basic regular expressions that sed uses by default, but Dan’ll almost certainly never use the sed flavor (which doesn’t have a plain ? operator, among other things).
Using a backreference later depends on the language. .NET uses ${n} where n is the reference number (note that they start from 1, as the entire string you matched is ${0}), Perl (and a lot of others) use $n, and Ruby uses \1. Python uses \1 too, but like .NET it needs a prefix on the string literal so the backslash survives (.NET is @, Python is r); otherwise you have to write \\1. The language reference is your best bet here.
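For the Python side of that, a small sketch of the three ways to write a backreference in a replacement string. The date string and the group names are made up for the example.

```python
import re

date = "2008-10-07"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# Numbered backreferences in a raw string.
numbered = re.sub(pattern, r'\3/\2/\1', date)

# Named backreferences use \g<name>.
named = re.sub(pattern, r'\g<day>/\g<month>/\g<year>', date)

# Without the r prefix, every backslash has to be doubled.
no_raw = re.sub(pattern, '\\3/\\2/\\1', date)

print(numbered, named, no_raw)
```

All three come out as 07/10/2008.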
First example.
(Oct6 0423z) Dec4100: C, was acknowledged by, ek
string regexPattern = @".*?\)\s(?<system>\S+?):\s(?<tape>\w).*,\s(?<initials>.*)";
Regex re = new Regex(regexPattern, RegexOptions.ExplicitCapture);
It eats everything up until the right parenthesis (escaped so the regex parser doesn’t try to interpret it) followed by a space, then it gets all non-whitespace characters until the colon as the system name. Ignores the colon and a space, then grabs a single word character (\w, so [A-Za-z0-9_]) as the tape name. Ignores zero or more matches of any character (the “.”) until it finds the last comma followed by a space, then yanks the rest of the line as the initials.
C is the tape name.
ek are the initials.
This means Dec4100 is available as ${system} (if doing Regex.Replace) or m.Groups["system"].Value if you matched the regex with Match m = re.Match(logfilestring);
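If you’d rather poke at this one without firing up .NET, here’s a rough Python translation of the same pattern. The only changes are (?P<name>) in place of (?<name>) and the variable names, which are mine.

```python
import re

line = "(Oct6 0423z) Dec4100: C, was acknowledged by, ek"

# Same pattern as the .NET version, with Python's named-group syntax.
ack_re = re.compile(
    r'.*?\)\s(?P<system>\S+?):\s(?P<tape>\w).*,\s(?P<initials>.*)'
)

m = ack_re.match(line)
system = m.group('system')
tape = m.group('tape')
initials = m.group('initials')

print(system, tape, initials)
```

The greedy .* before the comma is what makes it skip “, was acknowledged by” and land on the last comma.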
Another example:
<form action="http://www.climate.weatheroffice.ec.gc.ca/climateData/Interform.cfm" method="post" name="stnRequest1"> <input type="Hidden" name="hlyRange" value="N/A"> <input type="Hidden" name="dlyRange" value="1998-4-1|2007-11-30"> <input type="Hidden" name="mlyRange" value="1998-4-1|2007-11-1"> <input type="Hidden" name="StationID" value="10700"> <input type="Hidden" name="prov" value="CA"> <input type="Hidden" name="urlExtension" value="_e.html"> <tr id="dataTableOddRow"> <td id="dataTableRowHeader">(AE) BOW SUMMIT</td> <td id="dataTableRowHeader"><abbr title="ALBERTA">ALTA</abbr></td> <td> <select name="timeframe" size="1" class="formElement75w" onChange="elementChange(document.stnRequest1,1)"> <option value="2">Daily</option><option value="3">Monthly</option><option value="4">Almanac</option> </select> </td> <td> <select name="day" size="1" class="formElement" disabled><option value="1" >1</option><option value="2" >2</option><option value="3" >3</option><option value="4" >4</option><option value="5" >5</option><option value="6" >6</option><option value="7" >7</option><option value="8" >8</option><option value="9" >9</option><option value="10" >10</option><option value="11" >11</option><option value="12" >12</option><option value="13" >13</option><option value="14" >14</option><option value="15" >15</option><option value="16" >16</option><option value="17" >17</option><option value="18" >18</option><option value="19" >19</option><option value="20" >20</option><option value="21" >21</option><option value="22" >22</option><option value="23" >23</option><option value="24" >24</option><option value="25" >25</option><option value="26" >26</option><option value="27" >27</option><option value="28" >28</option><option value="29" >29</option><option value="30" Selected>30</option><option value="31" >31</option> </select> </td> <td> <select name="month" size="1" class="formElement" onChange="elementChange(document.stnRequest1,1)" ><option value="1" >Jan</option><option 
value="2" >Feb</option><option value="3" >Mar</option><option value="4" >Apr</option><option value="5" >May</option><option value="6" >Jun</option><option value="7" >Jul</option><option value="8" >Aug</option><option value="9" >Sep</option><option value="10" >Oct</option><option value="11" Selected>Nov</option><option value="12" >Dec</option> </select> </td> <td> <select name="year" size="1" class="formElement" onChange="elementChange(document.stnRequest1,1)"><option value="1998" >1998</option><option value="1999" >1999</option><option value="2000" >2000</option><option value="2001" >2001</option><option value="2002" >2002</option><option value="2003" >2003</option><option value="2004" >2004</option><option value="2005" >2005</option><option value="2006" >2006</option><option value="2007" Selected>2007</option> </select> </td> <td> <input type="submit" name="stnSubmit" value="Go" class="formElement"> </td> </form>
And the parser:
if ($chunk =~ /.*StationID.*?"(\d+)".*?prov.*?"(\w+).*?TableRowHeader">(.*?)<.*abbr title.*?>(\w+).*?/s) {
    my $stationid = $1;
    my $province = $2;
    my $name = $3;
    my $abbrprov = $4;
}
This regex runs across multiple lines, hence the //s, which lets . match newlines (modifiers stack: //g is global, //i is case insensitive, //gi is both g and i, etc). It’s also a good example of non-greedy matching. It snags everything up until StationID, then the next quotation mark followed by numbers, and captures those numbers. It comes out as “10700”.
Does the same thing following “prov” up until the next word characters after a quotation mark, and captures those. As .* rather than .*?, it would have grabbed “data”, which precedes TableRowHeader (inside the same quotation marks). Comes out as “CA”.
Grabs everything from TableRowHeader"> until the next <. Comes out as “(AE) BOW SUMMIT”.
Drops everything up until the next > after “abbr title”, then captures all word characters. “ALTA”
These are all assigned to variables via backreferences. $1, $2, $3, $4 are the groups in order. It’s worth noting that (at least in .NET) unnamed groups are numbered first, left to right, and named groups get their numbers AFTER them. So (?<a>a)(b)(?<c>c)(d) matched against “abcd” gives b, d, a, c as ${1}${2}${3}${4}, with the whole match as ${0}.
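To see the greedy/non-greedy difference in the Perl parser for yourself, here’s a sketch that replays it in Python against a trimmed-down copy of the form above. re.DOTALL stands in for //s; the trimmed chunk and the variable names are my own, and the second pattern is the same thing with the .*? after prov swapped for .*.

```python
import re

# Trimmed copy of the form, keeping only the lines the pattern cares about.
chunk = '''<input type="Hidden" name="StationID" value="10700">
<input type="Hidden" name="prov" value="CA">
<tr id="dataTableOddRow">
<td id="dataTableRowHeader">(AE) BOW SUMMIT</td>
<td id="dataTableRowHeader"><abbr title="ALBERTA">ALTA</abbr></td>'''

# re.DOTALL is Python's version of Perl's //s: "." also matches newlines.
lazy_re = re.compile(
    r'.*StationID.*?"(\d+)".*?prov.*?"(\w+).*?TableRowHeader">(.*?)<'
    r'.*abbr title.*?>(\w+)',
    re.DOTALL,
)
station_id, province, name, abbr_prov = lazy_re.match(chunk).groups()

# Same pattern, but greedy after "prov": it overshoots the value attribute.
greedy_re = re.compile(
    r'.*StationID.*?"(\d+)".*?prov.*"(\w+).*?TableRowHeader">(.*?)<'
    r'.*abbr title.*?>(\w+)',
    re.DOTALL,
)
greedy_province = greedy_re.match(chunk).group(2)

print(station_id, province, name, abbr_prov)
print(greedy_province)
```

The non-greedy version pulls out 10700, CA, (AE) BOW SUMMIT and ALTA; the greedy one ends up with data where the province should be.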
Another example:
04:26:23 [2] Error creating WLAAAP06.FS8 = 1 : Unrecognized KGFXENG Error Code
And the parser:
match = re.match(r'^(?P<time>.*?)\s+\[(?P<engine>\d+)\]\s+(?P<error>.*?(KGFXENG|LeadTools).*)', line)
Grabs everything from the beginning of the line until the first whitespace as “time”. Comes out as “04:26:23”.
Then skips whitespace and a bracket (escaped with \[) and grabs one or more numbers (\d+) as "engine". Comes out as "2", of course. Skips a space, then captures anything which contains "KGFXENG" or "LeadTools" as "error". Basically, the rest of the line.
This line, for instance, wouldn't match, and nothing in the regex would be captured:
00:15:18 [1] Error producing WPATAZ00.FSD = F088 : Error while saving the graphic
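A self-contained version you can run, using the two log lines above. The names LOG_RE, good_line and bad_line are just for this sketch.

```python
import re

LOG_RE = re.compile(
    r'^(?P<time>.*?)\s+\[(?P<engine>\d+)\]\s+'
    r'(?P<error>.*?(KGFXENG|LeadTools).*)'
)

good_line = ("04:26:23 [2] Error creating WLAAAP06.FS8 = 1 : "
             "Unrecognized KGFXENG Error Code")
bad_line = ("00:15:18 [1] Error producing WPATAZ00.FSD = F088 : "
            "Error while saving the graphic")

match = LOG_RE.match(good_line)
time_stamp = match.group('time')
engine = match.group('engine')
error = match.group('error')

# No KGFXENG or LeadTools in this one, so re.match returns None.
no_match = LOG_RE.match(bad_line)

print(time_stamp, engine, error)
print(no_match)
```

Since a failed match is None rather than an empty match object, check for it before calling .group().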
These are used later with this:
message = "ERROR: %s %s: %s" % (re.sub(r'.*?([A-Za-z]+Engine[A-Za-z]*?)(Errors)?.*', r'\1', logfilename), engine, match.group('error'))
“logfilename” is something like “2008_Oct_07__ProductEngineErrors.log”. This skips everything up until a run of letters (A through Z, uppercase or lowercase, one or more times) followed by Engine, optionally followed by more letters (the *?, though ? would have worked if I’d written r'Engine([A-Za-z]+)?'). It stops on Errors, if it exists (the question mark afterwards), and replaces the entire name with the first backreference (“ProductEngine” in this case).
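Here’s just the re.sub part on its own so you can watch it work. The second filename is a made-up one with no “Engine” in it, to show what happens when nothing matches.

```python
import re

ENGINE_RE = r'.*?([A-Za-z]+Engine[A-Za-z]*?)(Errors)?.*'

logfilename = "2008_Oct_07__ProductEngineErrors.log"
engine_name = re.sub(ENGINE_RE, r'\1', logfilename)

# Hypothetical filename without "Engine": re.sub finds nothing to replace
# and hands the string back untouched.
other_name = re.sub(ENGINE_RE, r'\1', "2008_Oct_07__Misc.log")

print(engine_name)
print(other_name)
```

The leading .*? and trailing .* are what make the match cover the whole filename, so the replacement leaves only the captured group behind.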
Last example is a nested bitch of increasingly complicated rules:
#Match plain ol' timezones
if ($brpos =~ /^\[(\w+)\](.*)/) {
    $DateZone = $1;
    $newname = $2;
}
#Match timezones with a day modification, and grab that along with the +/-
elsif ($brpos =~ /^\[(\w+)(\S\d+)\](.*)/) {
    $DateZone = $1;
    $TempDay2 = ONE_DAY * $2;
    $newname = $3;
}
#Check for a delete flag
elsif ($brpos =~ /^(\d)\[.*/) {
    $DeleteFilesStatus = $1;
    #If the status is one, we want to capture everything after the timezone as the DeleteName
    if ($DeleteFilesStatus == 1) {
        if ($brpos =~ /^(\d)\[(\w+)\](.*)/) {
            $DeleteFilesStatus = $1;
            $DateZone = $2;
            $DeleteFilesNames = $3;
            $newname = $3;
        }
        elsif ($brpos =~ /^(\d)\[(\w+)(\S\d+)\](.*)/) {
            $DeleteFilesStatus = $1;
            $DateZone = $2;
            $TempDay2 = ONE_DAY * $3;
            $DeleteFilesNames = $4;
            $newname = $4;
        }
    }
    #Otherwise, the DeleteName is in more brackets
    elsif ($DeleteFilesStatus == 2) {
        #Grab it all, but without a time modification
        if ($brpos =~ /^(\d)\[(\w+)\]\[(.*\.\w+)\](.*)/) {
            $DeleteFilesStatus = $1;
            $DateZone = $2;
            $DeleteFilesNames = $3;
            $newname = $4;
        }
        #Grab it with a time modification
        elsif ($brpos =~ /^(\d)\[(\w+)(\S\d+)\]\[(.*\.\w+)\](.*)/) {
            $DeleteFilesStatus = $1;
            $DateZone = $2;
            $TempDay2 = ONE_DAY * $3;
            $DeleteFilesNames = $4;
            $newname = $5;
        }
    }
}
Examples of what I’m catching (hopefully in order). The stuff in brackets later is filled in for date/time stamps:
[EDT]DOV-F-[MM][dd][yy][hh].csv
[CST-1][MM][dd].act
1[PDT]Actual[yy][MM][dd][hh][mm].csv
1[EST-3]KLGA[yy][MM][dd].mtx
2[EDT][WBD*.txt]WBD[yy][MM][dd]05.txt
2[MST+2][WSM*.txt]WBD[yyyy][MM].txt
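For comparison, here’s a loose sketch of how those branches might collapse into one pattern with optional groups, in Python. This is not the code I actually run: TEMPLATE_RE, parse_template and the group names are invented for the sketch, it uses [+-] where the Perl uses \S, and it doesn’t copy the rest of the name into the delete names for status 1 the way the Perl does.

```python
import re

# Hypothetical single-pattern restatement of the Perl branches.
TEMPLATE_RE = re.compile(
    r'^(?P<delete>\d)?'                        # optional delete flag
    r'\[(?P<zone>\w+)'                         # timezone, e.g. EDT
    r'(?P<offset>[+-]\d+)?\]'                  # optional day offset, e.g. -1
    r'(?:\[(?P<delete_names>[^\]]*\.\w+)\])?'  # optional [WBD*.txt] block
    r'(?P<newname>.*)'                         # whatever is left
)

def parse_template(template):
    """Return the named pieces as a dict, or None if nothing matched."""
    m = TEMPLATE_RE.match(template)
    return m.groupdict() if m else None

samples = [
    "[EDT]DOV-F-[MM][dd][yy][hh].csv",
    "[CST-1][MM][dd].act",
    "1[PDT]Actual[yy][MM][dd][hh][mm].csv",
    "1[EST-3]KLGA[yy][MM][dd].mtx",
    "2[EDT][WBD*.txt]WBD[yy][MM][dd]05.txt",
    "2[MST+2][WSM*.txt]WBD[yyyy][MM].txt",
]

for sample in samples:
    print(parse_template(sample))
```

The delete-names group insists on a dot followed by word characters inside the brackets, which is what keeps it from swallowing a date placeholder like [MM] in the second sample.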
Sadly, I’m out of work for the night, but these matches aren’t that complicated. Lots of escaping brackets, and use of the \S character to match “-” or “+”, then grabbing the rest of them. I may write more tomorrow…