Schema
Let me know if I’m missing anything.
The image is a little ginormous, but the SVG scaling looks like ass in Firefox.

If you want the SVG, here’s a link. WARNING: Firefox’s SVG rendering sucks.
Firstly, I’m starting P90X tomorrow. Should be interesting. Secondly, I miss you guys :/ I’m living with somebody who asked me what the Dead Sea Scrolls are this morning, since it was on the news that they’re coming to the Science Museum.
By the way, ever planning on touching your blogs again (Sewpbox and Rattributes not included)?
So I’m migrating Heather’s Palm Desktop crap to Google Calendar (I have no idea why no tool exists to do this). Google Calendar doesn’t really like the CSV I massaged out of it (only importing about half the records), and I’m starting to see why. Half the records are fucking duplicates in every way but one. I wrote a Python script to do it for me anyway.
The long and short of it amounts to this:
If you want the easy way, export the Palm data to a .mda, import it into Yahoo Calendar, then into Google Calendar from there. Otherwise, export it to a CSV, and hit it with this script:
    #!/usr/bin/ruby
    #
    require 'csv'

    input = "export.csv"
    output = "gcal.csv"

    csvfile = File.open(input) {|f| f.read}
    puts "Parsing..."
    csv = CSV::parse(csvfile)
    fields = csv.shift

    puts "Writing..."
    File.open(output, "w") do |f|
      f.print "Subject, Start Date, Start Time, End Date, End Time\n"
      csv.each do |line|
        startdate, starttime = Time.at(line[6].to_i).strftime("%m/%d/%Y,%I:%M:%S %p").split(',')
        enddate, endtime = Time.at(line[7].to_i).strftime("%m/%d/%Y,%I:%M:%S %p").split(',')
        f.print "\"#{line[11]}\",#{startdate},#{starttime},#{enddate},#{endtime}\n"
      end
    end
    puts "Done."
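For anyone who'd rather not do the quoting by hand, here's a rough Python sketch of the same conversion. It assumes the same column layout as the script above (start time in column 6, end time in column 7, summary in column 11, counting from zero). It formats times in UTC to keep the output predictable, whereas Ruby's Time.at uses the local zone, and it writes the header without spaces after the commas, which is a guess at what the importer prefers. The function names are made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone

HEADER = ["Subject", "Start Date", "Start Time", "End Date", "End Time"]

def split_epoch(seconds):
    """Turn epoch seconds into (MM/DD/YYYY, HH:MM:SS AM/PM) strings, in UTC."""
    stamp = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return stamp.strftime("%m/%d/%Y"), stamp.strftime("%I:%M:%S %p")

def convert(rows, out):
    """Write rows (lists of Palm export fields) to out as calendar CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        start_date, start_time = split_epoch(row[6])
        end_date, end_time = split_epoch(row[7])
        # csv.writer handles commas and quotes inside the summary.
        writer.writerow([row[11], start_date, start_time, end_date, end_time])

# A fake export row: only columns 6, 7 and 11 matter here.
sample = [""] * 12
sample[6], sample[7], sample[11] = "1231437600", "1231441200", 'lunch, "maybe"'
buffer = io.StringIO()
convert([sample], buffer)
print(buffer.getvalue())
```

Letting the csv module do the quoting means a summary with a comma or a quotation mark in it doesn't shift the columns.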
If you don’t feel like exporting, and are running on Windows:
    #!/usr/bin/ruby
    #
    #
    require 'win32ole'
    #require 'dbi' #Only needed for the DBI alternative below

    class Access
      attr_accessor :mdb, :conn, :data, :fields

      def initialize(mdb=nil)
        @mdb = mdb
        @conn = nil
        @data = nil
        @fields = nil
      end

      def open
        connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=#{@mdb}"
        @conn = WIN32OLE.new('ADODB.Connection')
        @conn.Open(connstring)
      end

      def query(sql)
        set = WIN32OLE.new('ADODB.Recordset')
        set.Open(sql, @conn)
        @fields = []
        set.Fields.each do |field|
          @fields << field.Name
        end
        @data = set.GetRows.transpose
        set.Close
      end

      def close
        @conn.Close
      end
    end

    output = "gcal.csv"
    rows = Array.new

    db = Access.new('c:\path\to\mdb')
    db.open
    db.query("SELECT * FROM Main;")
    names = db.fields
    rows = db.data
    db.close

    #Alternatively, go through DBI instead of the Access class above.
    #Don't run both, or every row gets written twice.
    #DBI.connect("DBI:ODBC:driver=Microsoft Access Driver (*.mdb);"+"dbq=c:/path/to/mdb") do |dbh|
    #  dbh.select_all('select * from Main') {|row| rows << row}
    #end

    puts "Writing..."
    File.open(output, "w") do |f|
      f.print "Subject, Start Date, Start Time, End Date, End Time\n"
      rows.each do |line|
        startdate, starttime = Time.at(line[6].to_i).strftime("%m/%d/%Y,%I:%M:%S %p").split(',')
        enddate, endtime = Time.at(line[7].to_i).strftime("%m/%d/%Y,%I:%M:%S %p").split(',')
        f.print "\"#{line[11]}\",#{startdate},#{starttime},#{enddate},#{endtime}\n"
      end
    end
    puts "Done."
If you want the details…
Essentially, Palm’s Datebook dumps everything into an Access database. No keys or relations (granted, only 3 tables, but still), and no idea what most of the columns do. Tools for working with Jet on Linux are minimal, and I didn’t feel like going through win32ole just to get to Jet, plus this sort of thing is nicer to do in downtime at work. So, I exported it via ODBC to a Postgres database on my Solaris box. Not pretty.
    access=# \d main
                 Table "public.main"
         Column     |          Type          | Modifiers
    ----------------+------------------------+-----------
     record_id      | bigint                 | not null
     status         | integer                |
     placement      | bigint                 |
     private        | smallint               |
     category       | character varying(20)  |
     start_time     | bigint                 |
     end_time       | bigint                 |
     untimed        | smallint               |
     time_zone      | character varying(40)  |
     location       | character varying(255) |
     summary        | text                   |
     alarm_advance  | character varying(10)  |
     alarm_unit     | character varying(10)  |
     repeated_event | character varying(255) |
     alarm          | smallint               |
     note           | character varying(100) |

    access=#
Ok, so record_id seems to be some sort of key, and Heather doesn’t bother with notes or alarms, so this doesn’t seem like it’d be so bad. To figure out why Google is only taking some of the records, though:
    access=$ SELECT count(*) FROM main;
     count
    -------
      5094
    (1 row)

    access=$ SELECT count(DISTINCT record_id) FROM main;
     count
    -------
      5074
    (1 row)

    access=$ SELECT count(DISTINCT start_time) FROM main;
     count
    -------
      2488
    (1 row)

    access=$ SELECT count(DISTINCT end_time) FROM main;
     count
    -------
      2490
    (1 row)

    access=$ SELECT count(DISTINCT summary) FROM main;
     count
    -------
      2264
    (1 row)

    access=$ SELECT record_id, start_time, end_time, summary FROM main
             WHERE record_id IN (SELECT record_id FROM main GROUP BY record_id HAVING count(*)>1);
     record_id | start_time |  end_time  | summary
    -----------+------------+------------+-------------------------------------------------------------------------
             0 | 1231437600 | 1231441200 | tammy
             0 | 1231869600 | 1231873200 | nb chanber lunch
             0 | 1229642100 | 1229645700 | tammy and joe photos st claire broiler
             0 | 1231959600 | 1231963200 | dr hunt
             0 | 1230505200 | 1230508800 | tilsen photos
             0 | 1230568200 | 1230571800 | meet gary at studio
             0 | 1230571800 | 1230584400 | bri and kids
             0 | 1230744600 | 1230748200 | tilsen, and sandy order y membership mail
             0 | 1230681600 | 1230681600 | Dan, missy and the kids.
             0 | 1231610400 | 1231614000 | james j hill houseOngoing Daily 11/15/08 - 2/22/09 m-sat 10-4 sun 1-4
             0 | 1230663600 | 1230667200 | tammys house glasses shopping
             0 | 1229727600 | 1229731200 | ryan help at studio
             0 | 1231889400 | 1231893000 |
             0 | 1231889400 | 1231903800 | EMS
             0 | 1237161600 | 1237161600 | spring break
             0 | 1229983200 | 1229986800 | msp with the girls
             0 | 1241049600 | 1241049600 | DISH
             0 | 1232233200 | 1232244000 | jordan senior photos excel and studio
             0 | 1230055200 | 1230058800 | paige studio
             0 | 1230314400 | 1230318000 | amanda tg
             0 | 1229968800 | 1229972400 | sara and nolan
    (21 rows)

    access=$ SELECT record_id, start_time, end_time, summary FROM main ORDER BY start_time ASC LIMIT 10;
     record_id | start_time | end_time | summary
    -----------+------------+----------+---------
       7128069 |   31449600 | 31449600 | c
       7128068 |   31449600 | 31449600 | a
       7123605 |   31449600 | 31449600 | a
       7128070 |   31449600 | 31449600 | 3
       7124866 |   31449600 | 31449600 | c
       7124107 |   31449600 | 31449600 | 3
       7124145 |   31449600 | 31449600 | o
       7124141 |   31449600 | 31449600 | ;
       7128072 |   31449600 | 31449600 | ;
       7128071 |   31449600 | 31449600 | o
    (10 rows)

    access=$ SELECT record_id, start_time, end_time, summary FROM main ORDER BY start_time DESC LIMIT 10;
     record_id | start_time |  end_time  | summary
    -----------+------------+------------+-----------------------------
       7127485 | 1256774400 | 1256774400 | lawerance wedding
       7125815 | 1256774400 | 1256774400 | lawerance wedding
       7128114 | 1244167200 | 1244170800 | NB senior all night party
       7125941 | 1242489600 | 1242493200 | nyquist edding
       7125827 | 1242489600 | 1242493200 | nyquist edding
             0 | 1241049600 | 1241049600 | DISH
       7128073 | 1238079600 | 1238083200 | books in the woods
       7125623 | 1238079600 | 1238083200 | books in the woods
       7125697 | 1238025600 | 1238025600 | gunflint books in the woods
       7126175 | 1238025600 | 1238025600 | gunflint books in the woods
    (10 rows)

    access=$
Oh, yeah! What I’ve gathered:
A working solution:
    access=$ SELECT DISTINCT a.start_time, a.end_time, a.summary
             INTO holdkey
             FROM main a
             WHERE EXISTS (
                 SELECT 'x' FROM main b
                 WHERE a.start_time = b.start_time
                   AND a.end_time = b.end_time
                   AND a.summary = b.summary)
             ORDER BY a.start_time DESC;
    SELECT

    access=$ SELECT count(*) FROM holdkey;
     count
    -------
      2597
    (1 row)

    access=$ DELETE FROM main USING holdkey
             WHERE main.start_time = holdkey.start_time
               AND main.end_time = holdkey.end_time
               AND main.summary = holdkey.summary;
    DELETE 5085

    access=$ SELECT record_id, start_time, end_time, summary FROM main;
     record_id | start_time |  end_time  | summary
    -----------+------------+------------+---------
       5280360 |   31536000 |   31536000 |
       5280298 |   31536000 |   31536000 |
       5280429 |   31536000 |   31536000 |
       7125497 | 1193437800 | 1193437800 |
       7128378 |   31536000 |   31536000 |
       7128376 |   31536000 |   31536000 |
       7128374 |   31536000 |   31536000 |
       7127620 | 1193437800 | 1193437800 |
             0 | 1231889400 | 1231893000 |
    (9 rows)

    access=$ DROP TABLE main;
    DROP TABLE
    access=$ SELECT * INTO main FROM holdkey;
    SELECT
That works. Of course there’s the quick and dirty way which doesn’t involve munging about with temp tables:
    access=$ DELETE FROM main t1 USING main
             WHERE EXISTS (
                 SELECT * FROM main t2
                 WHERE t1.start_time = t2.start_time
                   AND t1.end_time = t2.end_time
                   AND t1.summary = t2.summary
                   AND t1.record_id < t2.record_id);
    DELETE 2488

    access=$ SELECT count(*) FROM test;
     count
    -------
      2606
    (1 row)
It gives a slightly different result, but operates under the assumption that Palm’s record_id means something (it may not, for all I know). On the upside, it preserves all the columns in case they’re useful for something (doubtful). I could order by start_time and select into another table, add an index, and do the same thing, but it’s easier the quick and dirty way. There’s probably a trivial way to do this with joins, but I couldn’t think of one, and it leaves 9 records with a record_id of 0.
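The same "keep one copy of each event" idea can be sketched outside the database. This is only an illustration in Python, assuming the rows have already been pulled into dictionaries with the column names from the table above; dedupe is a made-up name, and the sample rows are copied from the query output earlier.

```python
def dedupe(rows):
    """Keep one row per (start_time, end_time, summary), preferring the highest record_id."""
    best = {}
    for row in rows:
        key = (row["start_time"], row["end_time"], row["summary"])
        if key not in best or row["record_id"] > best[key]["record_id"]:
            best[key] = row
    # Sort so the result is stable regardless of input order.
    return sorted(best.values(), key=lambda r: (r["start_time"], r["record_id"]))

rows = [
    {"record_id": 7127485, "start_time": 1256774400, "end_time": 1256774400, "summary": "lawerance wedding"},
    {"record_id": 7125815, "start_time": 1256774400, "end_time": 1256774400, "summary": "lawerance wedding"},
    {"record_id": 7125941, "start_time": 1242489600, "end_time": 1242493200, "summary": "nyquist edding"},
    {"record_id": 7125827, "start_time": 1242489600, "end_time": 1242493200, "summary": "nyquist edding"},
    {"record_id": 0, "start_time": 1241049600, "end_time": 1241049600, "summary": "DISH"},
]

for row in dedupe(rows):
    print(row["record_id"], row["summary"])
```

One difference from the SQL worth knowing about: in Postgres a NULL summary never equals another NULL, so those rows are never treated as duplicates, while in Python None equals None and they'd be collapsed into one.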
Here’s the code which it turns out I didn’t need, but it might be useful to somebody:
    #Rips data from Palm Desktop. Uploads it to Google Calendar
    #Written with Python 2.5 (though imports should work anyway)
    #
    #Currently, the Access MDB Palm Datebook uses has been exported to a
    #PostgreSQL server via ODBC, so I'll be connecting to that
    #
    #There's code in here for getting through Access also, but I haven't tested it.
    #Use at your own risk (kinda like Access).
    #
    #This is mostly due to the Postgres ODBC driver, and the fact that I didn't
    #want to bother with quoting all the queries for Postgres to allow spaces
    #
    #Usage: python migrate.py username password
    try:
        from xml.etree import ElementTree #Python 2.5, probably 2.6/3.0 also
    except ImportError:
        from elementtree import ElementTree #Python <2.5
    import gdata.calendar.service
    import gdata.service
    import atom.service
    import gdata.calendar
    import atom
    import getopt
    import sys
    import string
    import time
    import psycopg2 #Talk to Postgres

    class Struct:
        def __init__(self, *args, **kwargs):
            for k, v in kwargs.items():
                setattr(self, k, v)

    class GCalMigrate:
        def __init__(self, username, password):
            self.username = username
            self.password = password
            self.conn = None
            self.cur = None
            self.calendar = None
            self.records = []

        #Each step hands off to the next and passes the batch response back up
        def connect(self):
            try:
                self.conn = psycopg2.connect("dbname='whatever' user='yournamehere' host='server'")
            except psycopg2.Error:
                print "Can't connect to the database!\n"
                sys.exit(1)
            self.cur = self.conn.cursor()
            return self.query()

        def accessconnect(self, mdbpath):
            import odbc
            self.conn = odbc.odbc("driver=Microsoft Access Driver (*.mdb);DBQ=%s" % mdbpath)
            self.cur = self.conn.cursor()
            return self.queryaccess()

        def queryaccess(self):
            self.cur.execute("SELECT Main.[Start Time], Main.[End Time], Main.[Summary] FROM Main")
            rows = self.cur.fetchall()
            self.conn.close()
            return self.parserows(rows)

        def query(self):
            rows = []
            try:
                self.cur.execute("SELECT start_time, end_time, summary FROM main")
                rows = self.cur.fetchall()
            except psycopg2.Error:
                print "Couldn't query the database.\n"
            self.conn.close()
            return self.parserows(rows)

        def parserows(self, rows):
            for row in rows:
                starttime = time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime(row[0]))
                endtime = time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime(row[1]))
                title = row[2]
                record = Struct(start_time=starttime, end_time=endtime, title=title)
                self.records.append(record)
            return self.login(self.username, self.password)

        def login(self, username, password):
            self.calendar = gdata.calendar.service.CalendarService()
            self.calendar.email = username
            self.calendar.password = password
            self.calendar.source = "Palm_Desktop_Migrator"
            self.calendar.ProgrammaticLogin()
            return self.batchsubmit()

        def batchsubmit(self):
            feed = gdata.calendar.CalendarEventFeed()
            for record in self.records:
                insertme = gdata.calendar.CalendarEventEntry()
                insertme.title = atom.Title(text=record.title)
                insertme.content = atom.Content(text="")
                insertme.when.append(gdata.calendar.When(start_time=record.start_time, end_time=record.end_time))
                insertme.batch_id = gdata.BatchId(text='Palm_Migration')
                feed.AddInsert(entry=insertme)
            response = self.calendar.ExecuteBatch(feed, gdata.calendar.service.DEFAULT_BATCH_URL)
            return response

    if __name__ == "__main__":
        runner = GCalMigrate(sys.argv[1], sys.argv[2])
        responses = runner.connect()
        for entry in responses.entry:
            print "Batch ID: %s" % entry.batch_id.text
            print "Status: %s" % entry.batch_status.code
            print "Reason: %s" % entry.batch_status.reason
Given that I’ve deleted my Facebook account, and I’ll be deleting my MySpace account shortly (they never get used, and I don’t see the point), I decided to look into the possibility of importing posts from MySpace to Wordpress. As it turns out, the Wordpress developers are apparently really ridiculously lazy, or just don’t give a shit about importing. Somebody had written a Perl script which pulled MySpace blogs into RSS, but bringing RSS into Wordpress doesn’t get comments with it.
After looking over the formats Wordpress -could- import from, I threw out everything with [!CDATA] tags in XML (almost every format). Fortunately, the Movable Type developers don’t see a need to dump binary blobs in XML, or use XML at all. Their format is refreshingly simple. Hence, a parser that runs through every blog post on somebody’s MySpace, pulls out data that matters (title, date, post, commenters and their comments), then puts those into Movable Type format. You’ll note that I now have way more posts on here than I did before, some of those being from before this website existed.
Code:
    #!/usr/bin/ruby
    require 'open-uri'
    require 'time'

    class Comment
      attr_accessor :author, :datewritten, :comment

      def initialize (author, datewritten, comment)
        @author, @datewritten, @comment = author, datewritten, comment
      end
    end

    class Post
      attr_accessor :author, :title, :datewritten, :body, :comments

      def initialize (author, title, datewritten, body)
        @author, @title, @datewritten, @body = author, title, datewritten, body
        @comments = []
      end

      def addcomment (author, datewritten, comment)
        @comments.push(Comment.new(author, datewritten, comment))
      end
    end

    class Ripper
      def initialize
        @pages = []
        @posts = Array.new
      end

      def get (uri)
        connection = open(uri)
        content = connection.read
        return content
      end

      def parse (uri)
        content = get(uri)
        #blogContentInfo points to links to posts
        links = content.scan(/class="blogContentInfo">.*?<a href=".*?">/m)
        links.each do |link|
          #Strip out the bullshit amazon links
          unless link =~ /amazon/
            #Pull the URL out of the link
            link = (/.*<a href="(.*)">/m).match(link)[1]
            @pages.push(link)
          end
        end
        #Checking if there are any older pages with a hyperlink
        if content =~ /\[.*?<a href="(.*?)">Older<\/a>/
          #If so, call itself recursively to pull out the rest
          #Myspace breaks the URI standard. Replace the spaces with real escapes
          parse($1.gsub(/\s/, "%20"))
        else
          #Edge case to break out of the loop for when there aren't any more older
          parsepages()
        end
      end

      def parsepages()
        @pages.each do |uri|
          #Replace with yourself, if you want
          author = "Ryan"
          content = get(uri)
          #Pull out the fields I want
          title = (/blogSubject">(.*?)\n/m).match(content)[1]
          body = (/blogContent">(.*?)<table/m).match(content)[1]
          datewritten = (/blogTimeStamp">(.*?)<\/p>/m).match(content)[1].gsub(/(^\s+|\n+)/, "")
          time = (/blogContentInfo"><b>.*?(\d+:\d+)/m).match(content)[1]
          datewritten = datewritten + " #{time}:00"
          #Parse the time, and force it into something Wordpress can deal with
          t = Time.parse(datewritten)
          datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
          puts "Title: #{title}\n"
          #Create a new Post object
          post = Post.new(author, title, datewritten, body)
          #Pull out an array of all the comment blocks
          comments = content.scan(/id="blogComments.*?commentSpacer/m)
          #Pass off the post object along with the list of comments
          parsecomments(comments, post)
        end
      end

      def parsecomments(comments, post)
        comments.each do |com|
          author = (/profileLinks">(.*?)</m).match(com)[1]
          puts "Author: #{author}\n"
          #MySpace decided to make the CSS ids identical here, except that the
          #actual comment doesn't have "Posted" after the closing tag
          #Filter it as such
          comment = (/blogCommentsContent">(.*?)<\/p>/m).match(com)[1]
          datewritten = (/blogCommentsContent">Posted by.*?> on(.*?)<b/m).match(com)[1].gsub(/\n|\t|\r/, "")
          t = Time.parse(datewritten)
          #The same datetime munging as before
          datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
          #Commit each comment to our post object
          post.addcomment(author, datewritten, comment)
        end
        #Push them all into our class array
        @posts.push(post)
      end

      def print(file)
        @posts.each do |post|
          #Using Movable Type's export syntax, so I don't need to mess with XML
          #It's documented here: http://www.sixapart.com/moveabletype/docs/mtimport#example
          #Basically, 5 hyphens separates the categories
          #Eight hyphens separate each post
          file.puts "TITLE: #{post.title}"
          file.puts "AUTHOR: #{post.author}"
          file.puts "DATE: #{post.datewritten}"
          #Change this, too, if you want
          file.puts "CATEGORY: MySpace"
          file.puts "-----"
          #Get rid of empty lines and fucking Windows ^M newlines, plus convert &nbsp; to " "
          file.puts "BODY:\n#{post.body.gsub(/^(\s+|\t+|\n+)$/, "").gsub(/\015/, "").gsub(/&nbsp;/, " ")}"
          file.puts "-----"
          post.comments.each do |com|
            #More stuff is possible here, but isn't necessary
            file.puts "COMMENT:"
            file.puts "AUTHOR: #{com.author}"
            file.puts "DATE: #{com.datewritten}"
            file.puts "#{com.comment}"
            file.puts "-----"
          end
          file.puts "--------"
        end
      end
    end

    #Instantiate it
    ripper = Ripper.new
    #Parse my blog (substitute whatever yours is here)
    ripper.parse("http://blog.myspace.com/lykurgos")
    #Output it
    output = File.open("posts.txt", "a")
    ripper.print(output)
    puts "Done!\n"
    #Import into Wordpress!
Maybe somebody will actually find it useful.
Dan mentioned that he wasn’t that knowledgeable about regular expressions (a topic I am intimately familiar with), so I figured I’d put up some examples from code I’ve actually written, along with the text they’re actually supposed to match.
To begin with, here are the general rules for regexes. “Operator” refers to any of these (so \s+, [A-Z], (Word), etc). Greedy means it’ll continue matching as far as possible: if the operator/character you want to stop at occurs more than once in the string, it’ll eat the first one and only stop matching at the last one.
. Match any character
\w Match “word” character (alphanumeric plus “_”)
\W Match non-word character
\s Match whitespace character
\S Match non-whitespace character
\d Match digit character
\D Match non-digit character
\t Match tab
\n Match newline
\r Match return
\f Match formfeed
\a Match alarm (bell, beep, etc)
\e Match escape
^ Beginning of the line
$ End of the line
+ matches the preceding operator one or more times (greedy)
* matches the preceding operator zero or more times (greedy)
? matches the preceding operator once if it exists, but it doesn’t have to be there. Mostly used to stop greedy operators (*? or +?, for instance) at the match you want.
() is used for grouping (either to use later as a backreference or to exclude)
(?<name>) (or (?P<name>) in Python and maybe others) is used for a named backreference. There’ll be some examples of that.
| is used as a logical or
{n} is used to match the preceding operator exactly n times
{n,m} matches n to m times (no space after the comma)
{n,} matches n or more times ({1,} is the same as +)
[A-Za-z] is used to match whatever is in the middle, but it only counts as one character (so [A-Za-z] would match any of those characters ONCE. Useful if you want [a-f] or [0-5]+ or something).
[^] is used to exclude things. [^word] matches any single character that isn’t “w”, “o”, “r”, or “d”, and like any other character class it only matches ONCE. Groups don’t work inside brackets, so [^(word)] just excludes the parentheses as well; to exclude a whole word you need a negative lookahead like (?!word).
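A few of the rules above are easier to see running than to read about. Here's a small Python sketch (Python's re module follows the same Perl-style syntax); the sample strings are made up.

```python
import re

text = "<b>bold</b> and <b>bolder</b>"

# Greedy .* runs to the LAST </b>; lazy .*? stops at the first one.
greedy = re.search(r"<b>.*</b>", text).group(0)
lazy = re.search(r"<b>.*?</b>", text).group(0)
print(greedy)
print(lazy)

# {2,} means "two or more", so the lone 7 is skipped.
two_plus = re.findall(r"\d{2,}", "7 42 1999")
print(two_plus)

# A negated class matches one character that is NOT in the set.
not_word = re.findall(r"[^word]", "sword!")
print(not_word)
```

The lazy version is the one you almost always want when pulling a single tag or field out of a longer line.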
Sound confusing? It is, which is why I’ll put up real examples. FYI, these are PCRE (Perl Compatible Regular Expressions) rather than the POSIX basic regular expressions sed uses by default, but Dan’ll almost certainly never use the sed flavor (which doesn’t have a plain ? operator, among other things).
Using a backreference later depends on the language. .NET uses ${n} where n is the reference number (note that they start from 1, as the entire string you matched is ${0}), Perl (and a lot of others) use $n, Ruby uses \1 (as does Python, but Python {like .NET} needs a prefix in front of the string to make it raw {.NET is @, Python is r}, otherwise it’s \\1). Language reference is your best bet here.
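To make the backreference syntax concrete, here's what it looks like in Python specifically. This is just a sketch with made-up sample strings; other languages spell the references differently, as described above.

```python
import re

# Numbered backreferences in the replacement: swap "last, first" around.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")
print(swapped)

# Named groups are referenced as \g<name> in the replacement string.
named = re.sub(r"(?P<last>\w+), (?P<first>\w+)", r"\g<first> \g<last>", "Hopper, Grace")
print(named)

# A backreference inside the pattern itself: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)
print(doubled)
```

Without the r prefix, each of those \1 and \2 would have to be written as \\1 and \\2.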
First example.
(Oct6 0423z) Dec4100: C, was acknowledged by, ek
    string regexPattern = @".*?\)\s(?<system>\S+?):\s(?<tape>\w).*,\s(?<initials>.*)";
    Regex re = new Regex(regexPattern, RegexOptions.ExplicitCapture);
It eats everything up until the right parenthesis (escaped so the regex parser doesn’t try to interpret it) followed by a space, then it gets all non-whitespace characters until the colon as the system name. Ignores the colon and a space, then grabs a single word character ([A-Za-z0-9_]) as the tape name. Ignores zero or more matches of any character (the “.”) until it finds the last comma followed by a space, then yanks the rest of the line as the initials.
C is the tape name.
ek are the initials.
This means Dec4100 is available as ${system} (if doing Regex.Replace) or m.Groups["system"] if you matched the regex with m = re.Match(logfilestring);
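For comparison, here's roughly the same thing in Python, where named groups are spelled (?P<name>...) and show up as match.group("name"), or as \g<name> in a replacement. It's a sketch against the sample line above, with the pattern written without any spaces between the pieces.

```python
import re

LOG_LINE = "(Oct6 0423z) Dec4100: C, was acknowledged by, ek"

PATTERN = re.compile(
    r".*?\)\s(?P<system>\S+?):\s(?P<tape>\w).*,\s(?P<initials>.*)"
)

match = PATTERN.match(LOG_LINE)
print(match.group("system"), match.group("tape"), match.group("initials"))

# The same groups used in a replacement string.
summary = PATTERN.sub(r"\g<system> tape \g<tape> (\g<initials>)", LOG_LINE)
print(summary)
```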
Another example:
<form action="http://www.climate.weatheroffice.ec.gc.ca/climateData/Interform.cfm" method="post" name="stnRequest1"> <input type="Hidden" name="hlyRange" value="N/A"> <input type="Hidden" name="dlyRange" value="1998-4-1|2007-11-30"> <input type="Hidden" name="mlyRange" value="1998-4-1|2007-11-1"> <input type="Hidden" name="StationID" value="10700"> <input type="Hidden" name="prov" value="CA"> <input type="Hidden" name="urlExtension" value="_e.html"> <tr id="dataTableOddRow"> <td id="dataTableRowHeader">(AE) BOW SUMMIT</td> <td id="dataTableRowHeader"><abbr title="ALBERTA">ALTA</abbr></td> <td> <select name="timeframe" size="1" class="formElement75w" onChange="elementChange(document.stnRequest1,1)"> <option value="2">Daily</option><option value="3">Monthly</option><option value="4">Almanac</option> </select> </td> <td> <select name="day" size="1" class="formElement" disabled><option value="1" >1</option><option value="2" >2</option><option value="3" >3</option><option value="4" >4</option><option value="5" >5</option><option value="6" >6</option><option value="7" >7</option><option value="8" >8</option><option value="9" >9</option><option value="10" >10</option><option value="11" >11</option><option value="12" >12</option><option value="13" >13</option><option value="14" >14</option><option value="15" >15</option><option value="16" >16</option><option value="17" >17</option><option value="18" >18</option><option value="19" >19</option><option value="20" >20</option><option value="21" >21</option><option value="22" >22</option><option value="23" >23</option><option value="24" >24</option><option value="25" >25</option><option value="26" >26</option><option value="27" >27</option><option value="28" >28</option><option value="29" >29</option><option value="30" Selected>30</option><option value="31" >31</option> </select> </td> <td> <select name="month" size="1" class="formElement" onChange="elementChange(document.stnRequest1,1)" ><option value="1" >Jan</option><option 
value="2" >Feb</option><option value="3" >Mar</option><option value="4" >Apr</option><option value="5" >May</option><option value="6" >Jun</option><option value="7" >Jul</option><option value="8" >Aug</option><option value="9" >Sep</option><option value="10" >Oct</option><option value="11" Selected>Nov</option><option value="12" >Dec</option> </select> </td> <td> <select name="year" size="1" class="formElement" onChange="elementChange(document.stnRequest1,1)"><option value="1998" >1998</option><option value="1999" >1999</option><option value="2000" >2000</option><option value="2001" >2001</option><option value="2002" >2002</option><option value="2003" >2003</option><option value="2004" >2004</option><option value="2005" >2005</option><option value="2006" >2006</option><option value="2007" Selected>2007</option> </select> </td> <td> <input type="submit" name="stnSubmit" value="Go" class="formElement"> </td> </form>
And the parser:
    if ($chunk =~ /.*StationID.*?"(\d+)".*?prov.*?"(\w+).*?TableRowHeader">(.*?)<.*abbr title.*?>(\w+).*?/s) {
        my $stationid = $1;
        my $province = $2;
        my $name = $3;
        my $abbrprov = $4;
    }
This is a multi-line regex (hence the //s, which lets . match newlines; like //g is global, //i is case insensitive, //gi is both g and i, etc), and a good example of non-greedy matching. It snags everything up until StationID, then the next quotation mark followed by numbers, and captures those numbers. It comes out as “10700”.
Does the same thing following “prov” up until the next word characters in quotation marks, and captures those. As .* rather than .*?, it would have grabbed “data”, which precedes TableRowHeader (inside the same quotation marks). Comes out as “CA”.
Grabs everything from TableRowHeader”> until the next <. Comes out as “(AE) BOW SUMMIT”.
Drops everything up until the next > after “abbr title”, then captures all word characters. Comes out as “ALTA”.
These are all assigned to variables via backreferences. $1, $2, $3, $4 are the groups in order. It’s worth noting that (at least in .NET), named groups are assigned numbers AFTER the unnamed ones. So with (?<a>a)(b)(?<c>c)(d), ${1}${2}${3}${4} would be bdac, and ${0} is still the whole match.
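Here's the same pattern run from Python against a cut-down copy of the form above, as a sketch. re.S is Python's spelling of //s. The last two lines show how Python numbers groups, which is strictly left to right whether or not they're named, so don't assume the numbering carries over between languages.

```python
import re

chunk = '''<input type="Hidden" name="StationID" value="10700">
<input type="Hidden" name="prov" value="CA">
<tr id="dataTableOddRow">
<td id="dataTableRowHeader">(AE) BOW SUMMIT</td>
<td id="dataTableRowHeader"><abbr title="ALBERTA">ALTA</abbr></td>'''

STATION = re.compile(
    r'.*StationID.*?"(\d+)".*?prov.*?"(\w+).*?TableRowHeader">(.*?)<.*abbr title.*?>(\w+)',
    re.S,  # let . match newlines, like Perl's //s
)

station_id, province, name, abbr = STATION.match(chunk).groups()
print(station_id, province, name, abbr)

# Python numbers named and unnamed groups together, left to right.
mixed = re.match(r"(?P<a>a)(b)(?P<c>c)(d)", "abcd").groups()
print(mixed)
```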
Another example:
04:26:23 [2] Error creating WLAAAP06.FS8 = 1 : Unrecognized KGFXENG Error Code
And the parser:
    match = re.match(r'^(?P<time>.*?)\s+\[(?P<engine>\d+)\]\s+(?P<error>.*?(KGFXENG|LeadTools).*)', line)
Grabs everything from the beginning of the line until the first space as “time”. Comes out as “04:26:23”.
Then skips whitespace and a bracket (escaped with \[) and grabs one or more numbers (\d+) as "engine". Comes out as "2", of course. Skips a space, then captures anything which contains "KGFXENG" or "LeadTools" as "error". Basically, the rest of the line.
This line, for instance, wouldn't match, and nothing in the regex would be captured:
00:15:18 [1] Error producing WPATAZ00.FSD = F088 : Error while saving the graphic
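Both sample lines can be checked against the pattern with a few lines of Python. This is a sketch; parse_error is a made-up helper name, and it returns None for lines the pattern rejects.

```python
import re

LINE_RE = re.compile(
    r"^(?P<time>.*?)\s+\[(?P<engine>\d+)\]\s+(?P<error>.*?(KGFXENG|LeadTools).*)"
)

def parse_error(line):
    """Return (time, engine, error) for a matching line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("time"), match.group("engine"), match.group("error")

matching = "04:26:23 [2] Error creating WLAAAP06.FS8 = 1 : Unrecognized KGFXENG Error Code"
skipped = "00:15:18 [1] Error producing WPATAZ00.FSD = F088 : Error while saving the graphic"

print(parse_error(matching))
print(parse_error(skipped))
```

Checking for None before touching the groups matters, since calling .group() on a failed match raises an AttributeError.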
These are used later with this:
message = "ERROR: %s %s: %s" % (re.sub(r'.*?([A-Za-z]+Engine[A-Za-z]*?)(Errors)?.*', r'\1', logfilename), engine, match.group('error'))
"logfilename" is something like "2008_Oct_07__ProductEngineErrors.log". This grabs everything up until A through Z (uppercase or lowercase) one or more times followed by Engine, optionally followed by something else (*, though ? would have worked if I said r'Engine([A-Za-z]+)?'). It stops on Errors, if it exists (the question mark afterwards), and replaces the entire name with the first backreference ("ProductEngine" in this case).
Last example is a nested bitch of increasingly complicated rules:
    #Match plain ol' timezones
    if ($brpos =~ /^\[(\w+)\](.*)/) {
        $DateZone = $1;
        $newname = $2;
    }
    #Match timezones with a day modification, and grab that along with the +/-
    elsif ($brpos =~ /^\[(\w+)(\S\d+)\](.*)/) {
        $DateZone = $1;
        $TempDay2 = ONE_DAY * $2;
        $newname = $3;
    }
    #Check for a delete flag
    elsif ($brpos =~ /^(\d)\[.*/) {
        $DeleteFilesStatus = $1;
        #If the status is one, we want to capture everything after the timezone as the DeleteName
        if ($DeleteFilesStatus == 1) {
            if ($brpos =~ /^(\d)\[(\w+)\](.*)/) {
                $DeleteFilesStatus = $1;
                $DateZone = $2;
                $DeleteFilesNames = $3;
                $newname = $3;
            }
            elsif ($brpos =~ /^(\d)\[(\w+)(\S\d+)\](.*)/) {
                $DeleteFilesStatus = $1;
                $DateZone = $2;
                $TempDay2 = ONE_DAY * $3;
                $DeleteFilesNames = $4;
                $newname = $4;
            }
        }
        #Otherwise, the DeleteName is in more brackets
        elsif ($DeleteFilesStatus == 2) {
            #Grab it all, but without a time modification
            if ($brpos =~ /^(\d)\[(\w+)\]\[(.*\.\w+)\](.*)/) {
                $DeleteFilesStatus = $1;
                $DateZone = $2;
                $DeleteFilesNames = $3;
                $newname = $4;
            }
            #Grab it with a time modification
            elsif ($brpos =~ /^(\d)\[(\w+)(\S\d+)\]\[(.*\.\w+)\](.*)/) {
                $DeleteFilesStatus = $1;
                $DateZone = $2;
                $TempDay2 = ONE_DAY * $3;
                $DeleteFilesNames = $4;
                $newname = $5;
            }
        }
    }
Examples of what I'm catching (hopefully in order). The stuff in brackets later is filled in for date/time stamps:
    [EDT]DOV-F-[MM][dd][yy][hh].csv
    [CST-1][MM][dd].act
    1[PDT]Actual[yy][MM][dd][hh][mm].csv
    1[EST-3]KLGA[yy][MM][dd].mtx
    2[EDT][WBD*.txt]WBD[yy][MM][dd]05.txt
    2[MST+2][WSM*.txt]WBD[yyyy][MM].txt
Sadly, I'm out of work for the night, but these matches aren't that complicated. Lots of escaping brackets, and use of the \S character to match "-" or "+", then grabbing the rest of them. I may write more tomorrow...
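For what it's worth, most of that nesting can be folded into one pattern with optional groups. Here's a Python sketch of the idea, not a drop-in replacement for the Perl: parse_entry and the dictionary keys are made-up names, it matches the sign with [+-] rather than \S, and it returns the day offset as a plain number instead of multiplying by ONE_DAY.

```python
import re

# Optional delete flag, [zone] with an optional signed day offset, then the rest.
ENTRY = re.compile(
    r"^(?P<status>\d)?\[(?P<zone>\w+)(?P<offset>[+-]\d+)?\](?P<rest>.*)$"
)
# For status 2, the names to delete sit in a second pair of brackets.
DELETE_NAMES = re.compile(r"^\[(?P<names>.*\.\w+)\](?P<rest>.*)$")

def parse_entry(text):
    """Return a dict describing one entry, or None if it doesn't fit."""
    match = ENTRY.match(text)
    if match is None:
        return None
    status = int(match.group("status") or 0)
    rest = match.group("rest")
    delete_names = None
    if status == 1:
        # Everything after the timezone doubles as the delete name.
        delete_names = rest
    elif status == 2:
        inner = DELETE_NAMES.match(rest)
        if inner is not None:
            delete_names, rest = inner.group("names"), inner.group("rest")
    return {
        "status": status,
        "zone": match.group("zone"),
        "day_offset": int(match.group("offset") or 0),
        "delete_names": delete_names,
        "new_name": rest,
    }

for sample in ["[EDT]DOV-F-[MM][dd][yy][hh].csv",
               "1[EST-3]KLGA[yy][MM][dd].mtx",
               "2[MST+2][WSM*.txt]WBD[yyyy][MM].txt"]:
    print(parse_entry(sample))
```

Optional groups that don't take part in the match come back as None, which is why the status and offset fall back to 0.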
Fixed the nesting problem. Fixed item parsing. Item stats for nested units show up now. As with the Ruby parser, throw different combinations at it and see what happens.
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;
using System.Xml.Schema;
using System.Xml.XPath;

namespace abparser
{
    class Program
    {
        static void Main(string[] args)
        {
            RosterParserTest.XmlParser parser = new RosterParserTest.XmlParser();
            parser.ParseRoster(@"C:\Temp\de7th.rst", @"C:\Temp\output.xml");
            Console.ReadLine();
        }
    }
}

namespace RosterParserTest
{
    class XmlParser
    {
        static XmlDocument Roster = new XmlDocument();
        static XmlElement rootElement = Roster.CreateElement("", "Army", "");

        public static string RemoveWhitespace(string str)
        {
            try
            {
                //Ryan's Regex
                return new Regex(@"(\s+|\{.*?\}|\(.*?\)|\/+|\.+)").Replace(str, String.Empty);
            }
            catch (Exception)
            {
                return str;
            }
        }

        public void ParseNestedXML(XmlElement thisElement, XmlElement rosterElement)
        {
            bool linkUnitStatsDone = false; //This is a dirty hack.
            string replaceMe = thisElement.GetAttribute("name").ToString();
            replaceMe = RemoveWhitespace(replaceMe);
            XmlElement baseElement;
            XmlNodeList linkUnitStatNodeList = thisElement.SelectNodes("./link | ./unitstat");
            /*Grab the last of the PascalCase names.
             * HarGanethExecutioners becomes Executioners
             * SupremeSorceress becomes Sorceress, etc. Replace the rest of the name with a backreference */
            string regexMatcher = Regex.Replace(replaceMe, @".*?([A-Z][a-z]+)$", "${1}");
            //This way, it'll actually parse the NodeList for stats in nested things.
            if (Regex.IsMatch(rosterElement.Name.ToString(), regexMatcher)) //So that I don't get duplicate empty nodes.
            {
                baseElement = rosterElement; //Adding to the previous node in the tree.
                foreach (XmlElement parseElement in thisElement)
                {
                    if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
                    {
                        ParseNestedXML(parseElement, baseElement); //Parsing out nested.
                    } //if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
                    else if (!linkUnitStatsDone)
                    {
                        ParseLinkUnitStats(linkUnitStatNodeList, baseElement);
                        linkUnitStatsDone = true; //Hack implemented.
                    } //else if (!linkUnitStatsDone)
                } //foreach (XmlElement parseElement in thisElement)
            }
            else
            {
                baseElement = Roster.CreateElement(replaceMe);
                foreach (XmlElement parseElement in thisElement)
                {
                    if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
                    {
                        ParseNestedXML(parseElement, baseElement); //Whee recursion.
                    } //if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
                    else
                    {
                        ParseLinkUnitStats(parseElement, baseElement); //This has always worked.
                    } //else
                    rosterElement.AppendChild(baseElement); //Add to the local node.
                    rootElement.AppendChild(rosterElement); //Add to the Army node.
                } //foreach (XmlElement parseElement in thisElement)
            } //else
        } //public void ParseNestedXML(XmlElement thisElement, XmlElement rosterElement)

        public void ParseRoster(string path, string output)
        {
            XmlDocument parsingRoster = new XmlDocument();
            parsingRoster.Load(path);
            XmlNodeList parsingElements = parsingRoster.SelectNodes("/document/squad");
            foreach (XmlElement thisElement in parsingElements)
            {
                XmlElement rosterElement = Roster.CreateElement("Unit");
                ParseNestedXML(thisElement, rosterElement);
            } //foreach (XmlElement thisElement in parsingElements)
            Roster.AppendChild(rootElement);
            Roster.Save(output);
        } //public void ParseRoster(string path, string output)

        public void ParseLinkUnitStats(XmlElement parseElement, XmlElement baseElement)
        {
            foreach (XmlElement correctElement in parseElement)
            {
                if (correctElement.HasAttribute("name"))
                {
                    string subReplaceMe = correctElement.GetAttribute("name").ToString();
                    subReplaceMe = RemoveWhitespace(subReplaceMe);
                    XmlElement addElement = Roster.CreateElement(subReplaceMe);
                    if (!Regex.Match(subReplaceMe, @"(Left|Worker|Helper|Pts|Coun|Group)").Success)
                    {
                        if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
                        {
                            //Console.WriteLine("Found an item (XmlElement)");
                            ParseNestedXML(addElement, correctElement);
                        }
                        else if (correctElement.HasAttribute("description"))
                        {
                            addElement.InnerText = correctElement.GetAttribute("description").ToString();
                        } //if correctElement.HasAttribute("description"))
                        else if (correctElement.HasAttribute("value") && (Regex.IsMatch(correctElement.GetAttribute("value"), @"[^0|-]")))
                        {
                            addElement.InnerText = RemoveWhitespace(correctElement.GetAttribute("value").ToString());
                            baseElement.AppendChild(addElement);
                        } //else if (correctElement.HasAttribute("value"))
                    } //else if (parseElement.HasAttribute("basename"))
                    {
                        /*It's a non-dwarf item. Whee! They don't show up in the XmlNodeList one.
                         * Get rid of newlines and periods at the end, then set it as the InnerText
                         * This doesn't catch cases where the item has other properties inside it, but I haven't seen those */
                        baseElement.InnerText = Regex.Replace(parseElement.GetAttribute("itemsummary"), @"(\\n|\.)", String.Empty);
                    }
                } //if (correctElement.HasAttribute("name")
            } //foreach (XmlElement correctElement in parseElement)
        } //public void ParseLinkUnitStats(XmlElement parseElement, XmlElement baseElement)

        public void ParseLinkUnitStats(XmlNodeList parseNodeList, XmlElement baseElement)
        {
            foreach (XmlElement correctElement in parseNodeList)
            {
                if (correctElement.HasAttribute("name"))
                {
                    string subReplaceMe = correctElement.GetAttribute("name").ToString();
                    subReplaceMe = RemoveWhitespace(subReplaceMe);
                    if (!Regex.Match(subReplaceMe, @"(Left|Worker|Helper|Pts|Coun|Group)").Success)
                    {
                        XmlElement addElement = Roster.CreateElement(subReplaceMe);
                        if (correctElement.HasChildNodes && correctElement.InnerXml.Contains("entity"))
                        {
                            //Console.WriteLine("Found an item (XmlNodeList)");
                            ParseNestedXML(addElement, correctElement);
                        }
                        if (correctElement.HasAttribute("description"))
                        {
                            addElement.InnerText = correctElement.GetAttribute("description").ToString();
                            baseElement.AppendChild(addElement);
                        } //if (correctElement.HasAttribute("description"))
                        else if (correctElement.HasAttribute("value") && (Regex.IsMatch(correctElement.GetAttribute("value"), @"[^0|-]")))
                        {
                            addElement.InnerText = RemoveWhitespace(correctElement.GetAttribute("value").ToString());
                            baseElement.AppendChild(addElement);
                        } //else if (correctElement.HasAttribute("value"))
                    } //else
                } //if (correctElement.HasAttribute("name"))
            } //foreach (XmlElement correctElement in parseNodeList)
        } //public void ParseLinkUnitStats(XmlNodeList parseNodeList, XmlElement baseElement)
    } //class XmlParser
} //namespace RosterParserTest
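For what it's worth, the PascalCase trick in ParseNestedXML is easier to see in isolation. Here's a rough Ruby sketch of the same regex (last_pascal_word is a made-up helper name, not something from the parser):

```ruby
# Ruby translation of the "last PascalCase word" regex from the C# above.
# last_pascal_word is a hypothetical helper, only here for illustration.
def last_pascal_word(name)
  # Lazily skip everything up to the final capitalized word,
  # then replace the whole match with just that word.
  name.sub(/.*?([A-Z][a-z]+)$/, '\1')
end

puts last_pascal_word("HarGanethExecutioners")
puts last_pascal_word("SupremeSorceress")
```

Names that don't end in a capitalized word come back unchanged, since the pattern simply doesn't match.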
I’m already not that fond of working with XML in .NET. Here are a couple of fixes:
public static string RemoveWhitespace(string str)
{
    try
    {
        return new Regex(@"(\s+|\{.*?\}|\(.*?\)|\/+|\.+)").Replace(str, String.Empty);
    }
    catch (Exception)
    {
        return str;
    }
}
That actually gets rid of the crap in the braces, parentheses, etc. (as well as getting rid of slashes and periods).
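The same pattern drops straight into Ruby, if that's easier to poke at. A minimal sketch, assuming the same alternation as the C# regex; clean_name and the sample names (including the {dwHW} tag) are invented for the example:

```ruby
# Same alternation as the C# RemoveWhitespace regex, as a Ruby literal:
# whitespace, {braced} chunks, (parenthesized) chunks, slashes, periods.
STRIP_PATTERN = /\s+|\{.*?\}|\(.*?\)|\/+|\.+/

# clean_name is a hypothetical helper name.
def clean_name(raw)
  raw.gsub(STRIP_PATTERN, '')
end

puts clean_name("Unit St.")
puts clean_name("Hand Weapon {dwHW} (x2)")
```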
Secondly, I loathe empty nodes (stats, etc).
replaceMe = RemoveWhitespace(replaceMe);
Console.WriteLine(replaceMe);
if (replaceMe != String.Empty)
{
    XmlElement baseElement = Roster.CreateElement(replaceMe);
    foreach (XmlElement parseElement in thisElement)
    {
        if (parseElement.HasChildNodes && parseElement.InnerXml.Contains("entity"))
        {
            ParseNestedXML(parseElement, baseElement);
        }
        else
        {
            foreach (XmlElement correctElement in parseElement)
            {
                if (correctElement.HasAttribute("name"))
                {
                    string subReplaceMe = correctElement.GetAttribute("name").ToString();
                    subReplaceMe = RemoveWhitespace(subReplaceMe);
                    XmlElement addElement = Roster.CreateElement(subReplaceMe);
                    if (correctElement.HasAttribute("description"))
                    {
                        addElement.InnerText = correctElement.GetAttribute("description").ToString();
                        baseElement.AppendChild(addElement);
                    }
                    //Bye, stats with a value of zero or a hyphen!
                    else if (correctElement.HasAttribute("value") && (Regex.Match(correctElement.GetAttribute("value").ToString(), @"[^0|-]").Success))
                    {
                        addElement.InnerText = correctElement.GetAttribute("value").ToString();
                        baseElement.AppendChild(addElement);
                    }
                }
            }
        }
        rosterElement.AppendChild(baseElement);
        rootElement.AppendChild(rosterElement);
    }
}
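The "[^0|-]" check keeps any value that has at least one character other than a zero, a pipe, or a hyphen. Flipped around and anchored, the "is this stat blank?" test could look something like this in Ruby. It's a sketch, not what the parser does; blank_stat? is a hypothetical name, and treating nil as blank is my assumption:

```ruby
# A value counts as blank when it is missing, empty, or made up of
# nothing but zeros and hyphens. Anchoring with \A and \z means a
# value like "10" is NOT treated as blank just because it contains a 0.
def blank_stat?(value)
  value.nil? || value.strip.match?(/\A[0\-]*\z/)
end

%w[0 - 10 3+ 4/8].each do |v|
  puts "#{v}: #{blank_stat?(v)}"
end
```

The anchors are the important part: an unanchored /(0|-)/ would also throw away anything that merely contains a zero.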
I find it kind of ironic that recursion is used after bitching about recursion. I’ll probably take a look at the nesting problems and whatnot this weekend, assuming I have any time.
I wonder if it’s possible to get a job doing nothing but writing regular expressions…
Fixed the nested <item> blocks. Ran into another problem where it didn’t parse nested characters and their items properly, then yet another where some Gifts of Khaine (and probably other things I haven’t seen in either list) are essentially nested worthlessness. Fixed code for it:
def parseitem(d, addto)
  added = addto.add_element(d.attributes["name"].gsub(/\s+/, '')).add_text(d.attributes["description"].gsub(/\./, ''))
  d.elements.each('link') do |ele|
    unless ele.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
      if !ele.attributes["description"].nil?
        added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text(ele.attributes["description"])
      else
        #This is necessary for some Gifts of Khaine, apparently
        if added.text == ele.attributes["name"]
          print "Found duplicated #{added.text}!\n\n"
          added.text = ''
        end
        added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text("True")
      end
    end
  end
end

def parsenested(process, addto)
  #Try to guess if it's a champion, character in the unit, or item
  process.elements.each('entity') do |d|
    if d.attributes["statset"] =~ /Normal/
      #It's a character, crew, or mount. Figure out which
      if d.attributes["totalcost"] !~ /^0/
        #It's a champion or character
        adder = addto.add_element("champion")
        puts "Found champ\n"
        parse(d, adder)
      else
        #It's crew or the like
        adder = addto.add_element("crew")
        puts "Found crew\n"
        parse(d, adder)
      end
    else
      #It's an item
      puts "Found item\n"
      if addto.elements["item"].nil?
        @adder = addto.add_element("item")
      end
      parseitem(d, @adder)
    end
  end
end
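Stripped of the REXML plumbing, the guessing in parsenested comes down to two attribute checks. A rough sketch with a plain hash standing in for the element's attributes (classify_entity and the sample hashes are hypothetical, not taken from a real roster):

```ruby
# Mirrors the branching in parsenested: a "Normal" statset means a model
# of some kind, and a totalcost starting with 0 means crew rather than a
# champion or character. Anything else is assumed to be an item.
def classify_entity(attrs)
  if attrs["statset"] =~ /Normal/
    attrs["totalcost"] !~ /^0/ ? :champion : :crew
  else
    :item
  end
end

puts classify_entity("statset" => "Normal", "totalcost" => "12")
puts classify_entity("statset" => "Normal", "totalcost" => "0")
puts classify_entity("totalcost" => "25")
```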
Ok, boring. I didn’t spend as much time working on it tonight as I intended to, but it parses the Dwarf roster, at least, fine. It does not parse the Dark Elf roster properly (namely, it doesn’t pull the description out of items or Gifts of Khaine, and it doubles up the <item> tag for reasons I’m not sure of), but that’ll get fixed when I’m at work tomorrow.
Ruby code:
#!/usr/bin/ruby
require "rexml/document"
require "pp"
require "rexml/formatters/default"
require "rexml/formatters/pretty"
include REXML

inputxml = File.read('dwarfroster.rst')
@roster = Document.new inputxml
@army = Document.new.add_element("army")

def parsenested(process, addto)
  #Try to guess if it's a champion, character in the unit, or item
  process.elements.each('entity') do |p|
    #puts p
    if p.elements["link"].has_elements?
      #Recursively run through these to figure out what the hell it is
      #(String#any? only exists in Ruby 1.8, so check for empty instead)
      if !p.elements["link/entity"].attributes["itemsummary"].to_s.empty?
        adder = addto.add_element("item")
        puts "Found nested\n"
        parsenested(p.elements["link"], adder)
      else
        #This is really just stubbed out, since I haven't seen it
      end
    elsif p.attributes["statset"] =~ /Normal/
      #It's a character, crew, or mount. Figure out which
      if p.attributes["totalcost"] !~ /^0/
        #It's a champion or character
        adder = addto.add_element("champion")
        puts "Found champ\n"
        parse(p, adder)
      else
        #It's crew or the like
        adder = addto.add_element("crew")
        puts "Found crew\n"
        parse(p, adder)
      end
    else
      #It's an item
      puts "Found item\n"
      if addto.elements["item"].nil?
        @adder = addto.add_element("item")
      end
      added = @adder.add_element(p.attributes["name"].gsub(/\s+/, ''))
      p.elements.each('link') do |ele|
        unless ele.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
          added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text(ele.attributes["description"])
        end
      end
    end
  end
end

def parse(s, addto)
  #In some cases, the basename differs (i.e. Supreme Sorc vs. High Sorc)
  #Also, it'll pick up whether there's a champion in the unit by the diff
  #of base and count
  %w[basename count base].each do |b|
    if !s.attributes[b].to_s.empty?
      addto.add_element(b).add_text(s.attributes[b])
    end
  end
  stats = addto.add_element("stats")
  s.elements.each('unitstat') do |a|
    #unit.fetch(:stats) { |el| unit[el] = {}}
    #I don't want blank stats
    #(note the unanchored regex also drops anything containing a 0 or a hyphen, like 10)
    if !a.attributes["value"].to_s.empty? && (a.attributes["value"] !~ /(0|-)/)
      stats.add_element(a.attributes["name"].gsub(/\s+/, '')).add_text(a.attributes["value"])
    end
  end
  s.elements.each('link') do |link|
    if link.has_elements?
      #Figure out what the hell it is
      parsenested(link, addto)
    else
      #unitatt = unit.add_element("attributes")
      #Rip out the name if it doesn't have "Helper, Worker, Points Left, or Cost"
      unless link.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
        #Get rid of the stuff in braces AB puts in
        if addto.elements["attributes"].nil?
          @unitatt = addto.add_element("attributes")
        end
        @unitatt.add_element(link.attributes["name"].gsub(/\{.*?\}/, '').gsub(/\s+/, '')).add_text('true')
      end
    end
  end
end

@roster.elements.each('document/roster') do |ele|
  info = @army.add_element("info")
  #Pick out the race, army name, total points, used points, canonical race name
  %w[race size activesize racename].each do |attr|
    info.add_element(attr).add_text(ele.attributes[attr])
  end
  #@army.push(info)
end

@roster.elements.each('document/squad') do |ele|
  @unit = @army.add_element("unit")
  #Pick out the name of the model and its cost, plus how many models
  %w[name modelcount totalcost].each do |attr|
    @unit.add_element(attr).add_text(ele.attributes[attr])
  end
  ele.elements.each('entity') do |s|
    #Parse it out
    parse(s, @unit)
  end
end

#pp @army
prettyprint = REXML::Formatters::Pretty.new
output = String.new
puts prettyprint.write(@army, output)
And the XML output:
<?xml version="1.0" encoding="ISO-8859-1"?> <army> <info> <race>Dwarf</race> <size>1500</size> <activesize>1499.</activesize> <racename>Dwarfs</racename> </info> <unit> <name>Thane</name> <modelcount>1</modelcount> <totalcost>134</totalcost> <basename>Thane</basename> <count>1</count> <base>1</base> <stats> <Ld>9</Ld> <Mv>3</Mv> <Save>3+</Save> <St>4/8</St> <To>5</To> <UnitSt.>1</UnitSt.> <WS>6</WS> <Wo>2</Wo> <At>3</At> <BS>4</BS> <In>3</In> <ItemPts>75</ItemPts> </stats> <attributes> <General>true</General> <HandWeapon>true</HandWeapon> <GreatWeapon>true</GreatWeapon> <GromrilArmor>true</GromrilArmor> </attributes> <item> <RunicWeapon> <MasterRuneofKraggtheGrim>Allows other runes to be placed on a Great Weapon.</MasterRuneofKraggtheGrim> <RuneofCleaving>+1 Strength</RuneofCleaving> </RunicWeapon> <RunicArmor> <RuneofStone>+1 Armor Save</RuneofStone> </RunicArmor> </item> </unit> <unit> <name>Thane</name> <modelcount>1</modelcount> <totalcost>132</totalcost> <basename>Thane</basename> <count>1</count> <base>1</base> <stats> <In>3</In> <ItemPts>75</ItemPts> <Ld>9</Ld> <Mv>3</Mv> <Save>2+/1+</Save> <St>4/7</St> <To>5</To> <UnitSt.>1</UnitSt.> <WS>6</WS> <Wo>2</Wo> <At>3</At> <BS>4</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <GromrilArmor>true</GromrilArmor> <Shield>true</Shield> </attributes> <item> <RunicWeapon> <RuneofCleaving>+1 Strength</RuneofCleaving> </RunicWeapon> <RunicArmor> <RuneofStone>+1 Armor Save</RuneofStone> </RunicArmor> </item> </unit> <unit> <name>Thane</name> <modelcount>1</modelcount> <totalcost>95</totalcost> <basename>Thane</basename> <count>1</count> <base>1</base> <stats> <In>3</In> <ItemPts>75</ItemPts> <Ld>9</Ld> <Mv>3</Mv> <Save>3+</Save> <St>4</St> <To>5</To> <UnitSt.>1</UnitSt.> <WS>6</WS> <Wo>2</Wo> <At>3</At> <BS>4</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <GromrilArmor>true</GromrilArmor> <BattleStandardBearer>true</BattleStandardBearer> </attributes> <item> <RunicArmor> <RuneofStone>+1 Armor 
Save</RuneofStone> </RunicArmor> </item> </unit> <unit> <name>Dwarf Warriors</name> <modelcount>20</modelcount> <totalcost>205</totalcost> <basename>Dwarf Warriors</basename> <count>19</count> <base>20</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>4+/3+</Save> <St>3</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>4</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> </stats> <champion> <basename>Veteran</basename> <count>1</count> <base>1</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>4+/3+</Save> <St>3</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>4</WS> <Wo>1</Wo> <At>2</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <HeavyArmor>true</HeavyArmor> <Shield>true</Shield> </attributes> </champion> <attributes> <Musician>true</Musician> <StandardBearer>true</StandardBearer> <HandWeapon>true</HandWeapon> <HeavyArmor>true</HeavyArmor> <Shield>true</Shield> </attributes> </unit> <unit> <name>Quarellers</name> <modelcount>10</modelcount> <totalcost>110</totalcost> <basename>Quarrellers</basename> <count>10</count> <base>10</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>6+</Save> <St>3</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>4</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <Crossbow>true</Crossbow> <LightArmor>true</LightArmor> </attributes> </unit> <unit> <name>Quarellers</name> <modelcount>10</modelcount> <totalcost>110</totalcost> <basename>Quarrellers</basename> <count>10</count> <base>10</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>6+</Save> <St>3</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>4</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <Crossbow>true</Crossbow> <LightArmor>true</LightArmor> </attributes> </unit> <unit> <name>Ironbreakers</name> <modelcount>14</modelcount> <totalcost>237</totalcost> <basename>Ironbreakers</basename> <count>13</count> <base>14</base> <stats> <Ld>9</Ld> <Mv>3</Mv> <Save>3+/2+</Save> <St>4</St> <To>4</To> <UnitSt.>1</UnitSt.> 
<WS>5</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> <In>2</In> </stats> <champion> <basename>Ironbeard</basename> <count>1</count> <base>1</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>3+/2+</Save> <St>4</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>5</WS> <Wo>1</Wo> <At>2</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <GromrilArmor>true</GromrilArmor> <Shield>true</Shield> </attributes> </champion> <attributes> <Musician>true</Musician> <StandardBearer>true</StandardBearer> <HandWeapon>true</HandWeapon> <GromrilArmor>true</GromrilArmor> <Shield>true</Shield> </attributes> <item> <RunicStandard> <RuneofStoicism>The unit counts as double its actual Unit Strength.</RuneofStoicism> </RunicStandard> </item> </unit> <unit> <name>Hammerers</name> <modelcount>18</modelcount> <totalcost>246</totalcost> <basename>Hammerers</basename> <count>17</count> <base>18</base> <stats> <Ld>9</Ld> <Mv>3</Mv> <Save>5+</Save> <St>4/6</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>5</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> <In>2</In> </stats> <champion> <basename>Gate Keeper</basename> <count>1</count> <base>1</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>5+</Save> <St>4/6</St> <To>4</To> <UnitSt.>1</UnitSt.> <WS>5</WS> <Wo>1</Wo> <At>2</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <GreatWeapon>true</GreatWeapon> <HeavyArmor>true</HeavyArmor> </attributes> </champion> <attributes> <Musician>true</Musician> <StandardBearer>true</StandardBearer> <HandWeapon>true</HandWeapon> <GreatWeapon>true</GreatWeapon> <HeavyArmor>true</HeavyArmor> <Stubborn>true</Stubborn> </attributes> </unit> <unit> <name>Artillery Battery</name> <modelcount>4</modelcount> <totalcost>45</totalcost> <basename>Bolt Thrower</basename> <count>1</count> <base>1</base> <stats> <To>7</To> <UnitSt.>3</UnitSt.> <Wo>3</Wo> </stats> <crew> <basename>Crew</basename> <count>3</count> <base>3</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>6+</Save> <St>3</St> <To>4</To> <WS>4</WS> 
<Wo>1</Wo> <At>1</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <LightArmor>true</LightArmor> </attributes> </crew> <attributes> <BoltThrower>true</BoltThrower> </attributes> </unit> <unit> <name>Artillery Battery</name> <modelcount>4</modelcount> <totalcost>45</totalcost> <basename>Bolt Thrower</basename> <count>1</count> <base>1</base> <stats> <To>7</To> <UnitSt.>3</UnitSt.> <Wo>3</Wo> </stats> <crew> <basename>Crew</basename> <count>3</count> <base>3</base> <stats> <In>2</In> <Ld>9</Ld> <Mv>3</Mv> <Save>6+</Save> <St>3</St> <To>4</To> <WS>4</WS> <Wo>1</Wo> <At>1</At> <BS>3</BS> </stats> <attributes> <HandWeapon>true</HandWeapon> <LightArmor>true</LightArmor> </attributes> </crew> <attributes> <BoltThrower>true</BoltThrower> </attributes> </unit> <unit> <name>Airborne Assault</name> <modelcount>1</modelcount> <totalcost>140</totalcost> <basename>Gyrocopter</basename> <count>1</count> <base>1</base> <stats> <In>2</In> <Ld>9</Ld> <Save>4+</Save> <St>4</St> <To>5</To> <UnitSt.>3</UnitSt.> <WS>4</WS> <Wo>3</Wo> <At>2</At> </stats> <attributes> <Flyer>true</Flyer> </attributes> </unit> </army>
I’ll probably screw with the code so it outputs something more easily parsed by Dan (for the items and attributes, mainly) at the same time as I fix the Dark Elf parsing (which should also hit the Anvil of Doom problem). Right now it comes out like this:
<item>
  <SacrificialDagger/>
  <PearlofinfiniteBleakeness/>
  <BlackDragonEgg/>
</item>
<item>
  <RuneofKhaine/>
  <TouchofDeath>
    <KillingBlow/>
  </TouchofDeath>
</item>
As you can see, Touch of Death somehow added the name as a subelement, yet I don’t see any substantial differences between the DE roster and the dwarf roster. Still, it’ll get fixed tomorrow (and $deity willing, converted to .NET).
Well, the bailout failed. Not that I was really for it, and all estimates indicated that they needed a lot more than $700 billion, but I’m still not sure how to react to this. I know the Fed just released another $630 billion into the Term Auction Facility (emergency lending) and currency swaps markets today, but it seems that the Fed’s balance sheet is just about tapped anyway. So, as it stands, the Dow is down 7%, and the S&P is down 12%. It’s not even the end of trading.
It’s entirely possible that they’ll attempt to craft a new bill which has more… populist leanings, in the hopes of getting more Democratic support (since they almost certainly won’t get the Republicans). It might pass. It might be a better bill. It might not be. It seems that, in any case, they’re going to sacrifice something that might have worked for ideology and political expediency. The Asians were already leery of lending us more money. Germany and France flat-out said no. Britain’s in the middle of their own crisis. What the fuck happens now?
The US dollar disappears as a reserve currency? Commodities markets start to be denominated in something other than US dollars? Our economy crashes and burns even more? It should be obvious (and should have been obvious) to people that we cannot march inexorably upwards. People in the US (myself included) don’t (or can’t) save any substantial amounts of money. However, at some point, things have to go down when the bubble bursts. It doesn’t look like we’re going to end up creating a new one like we did when the .com bubble burst (hello, housing), thus the incredible amount of leveraged debt US companies are holding has to crash. We’ve already had the largest bank failure in US history in the last week, Wachovia’s gone (Citigroup, Chase, and Bank of America are now huge entities), most of the investment banks are gone, and there’s still a shitload of debt hanging around.
The Republican plan of cutting the capital gains tax is idiotic. By definition, a company taking losses doesn’t have any gains, but they’re going to fiddle while Rome burns, so to speak. Nobody seems to know what’s going to happen. There may be another bill. There may not be. Bankers are finally offing themselves. It remains to be seen how the markets do for the rest of the day, I guess, but the TED and Libor are way fucking higher than the Fed’s nominal rates on T-bills. Welcome to a new depression?
It would be nice to see government being proactive rather than reactive. We’re going down, and it looks like a lot of the world might follow for a while (Hong Kong, Russia, Japan, and London stock exchanges are down precipitously over the last week also). Seriously. Propose legislation for a new Civilian Conservation Corps (e.g. rework the Job Corps). Bring back the regulation you abolished over the last 20 years. Take over FHA-backed mortgages, and do… something with them (forced refinancing to current value, with the government pocketing any equity they may acquire until such time as they’re paid off or something). Raise the prime rate to stave off inflation (it’s clear that cutting it isn’t going to let us buy our way out of a recession through liquidity at this point). Actually try to solve the problems which are inevitably going to come up before you have to.
Failing that, torches and pitchforks, maybe.
I’ll be honest. I don’t like XML. I don’t like SOAP (REST is far nicer in my opinion), since it manipulates the HTTP spec to do things it was never meant to do. Raw sockets and bit twiddling seem like a more logical extension; it’s just that port 80 happens to be open on most corporate firewalls, so SOAP and CORBA have taken off. As much as I may dislike XML, though, it has its uses. It works for representing a datastream for sets where CSV doesn’t really make sense and YAML isn’t available, and it’s not that hard to deal with.
The ArmyBuilder developers seem to have squeezed SGML into an XML doctype somehow, and the roster files are littered with references I can’t quite make out. Yes, ArmyBuilder can export to XML, I guess, but it leads to XSL from hell. In some ways, I would have preferred to rip apart a binary format with a hex editor, as long as the data was formatted logically.
This, for instance:
<link id="dwWarCrew" count="1" actual="1" script="0" sequence="106" pseudo="no" totalcost="0" name="Crew" category="Equip" visible="no" sourceid="dwBoltThrw" sourceindex="1"></link>
Or this:
<ruleset context="dwSubtype" ruleset="dwDwarves" contextname="Army Subtype" rulesetname="Dwarf Army"/>
Is not formatted logically. The second record, as you can see, uses XML attributes rather than nodes for everything, which kinda defeats the point. XSL to parse ArmyBuilder’s XML output? Ahh…
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="html" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:variable name="newlinefeed"><![CDATA[ ]]></xsl:variable> <xsl:variable name="statCountGlobal" select="count(document/definition/stat_def)"/> <xsl:template match="/"> <xsl:variable name="namedModelCount" select="document/composition/@model_count"/> <xsl:variable name="actualModelCount" select="sum(//regiment/@model_count)"/> <xsl:variable name="actualPoints" select="sum(//regiment/@cost[not(contains(.,'['))])"/> <xsl:value-of select="concat(/document/summary/@race_name,': ',$actualPoints, /document/definition/@points_abbrev,' - ',$actualModelCount,' ')"/> Models<xsl:value-of select="$newlinefeed"/> <xsl:for-each select="/document/composition/comp_entry"> <xsl:variable name="groupName" select="@group_name"/> <xsl:if test="/document/roster/top_level/regiment[@composition = $groupName]"> <xsl:variable name="unit" select="@group_name"/> <xsl:for-each select="/document/roster/top_level/regiment[@composition = $unit]"> <xsl:apply-templates select="."
mode="top_level"> <xsl:with-param name="regDepth"> <xsl:choose> <xsl:when test="position()=1"><xsl:value-of select="count(/document/roster/top_level[regiment/@composition = $unit]//regiment)"/></xsl:when> <xsl:otherwise>0</xsl:otherwise> </xsl:choose> </xsl:with-param> </xsl:apply-templates> </xsl:for-each> </xsl:if> </xsl:for-each> </xsl:template> <xsl:template match="regiment" mode="top_level"> <xsl:param name="regDepth"> <xsl:value-of select="count(..//regiment)"/> </xsl:param> <xsl:variable name="statCountLocal" select="count(stat)"/> <xsl:variable name="fsib" select="preceding-sibling::node()"/> <xsl:variable name="composition" select="@composition"/> <xsl:if test="$regDepth > 0"> <xsl:choose> <xsl:when test="not($composition = $fsib/@composition)"> <xsl:value-of select="$composition"/>(<xsl:value-of select="/document/composition/comp_entry[@group_name = $composition]/@percentage"/>)<xsl:value-of select="$newlinefeed"/> </xsl:when> </xsl:choose> </xsl:if> <xsl:value-of select="concat('[',@model_count,'] ')"/> <xsl:variable name="itemcost"> <xsl:if test="@cost"><xsl:value-of select="@cost"/></xsl:if> </xsl:variable> <xsl:variable name="retinuecost"> <xsl:call-template name="getRetinueCost"> <xsl:with-param name="retinuecostSum" select="0"/> <xsl:with-param name="current" select="regiment[position()=1]"/> <xsl:with-param name="rest" select="regiment[position()!=1]"/> </xsl:call-template> </xsl:variable> <xsl:variable name="transportcost"> <xsl:if test="regiment[@stat_set = 0]/@cost"><xsl:value-of select="substring-after(substring-before(regiment[@stat_set = 0]/@cost,']'),'[')"/></xsl:if> </xsl:variable> <xsl:choose> <xsl:when test="$retinuecost and @model_count=1 and @composition='HQ'"> <xsl:value-of select="concat('[',$itemcost - $retinuecost,'] ')"/> </xsl:when> <xsl:when test="$transportcost != ''"> <xsl:value-of select="concat('[',$itemcost - $transportcost,'] ')"/> </xsl:when> <xsl:otherwise> <xsl:value-of select="concat('[',$itemcost,'] ')"/> 
</xsl:otherwise> </xsl:choose> <xsl:value-of select="substring-before(name,' (')"/> <xsl:if test="@model_count=1 and @composition='HQ'"> <xsl:text disable-output-escaping="yes">(IC)</xsl:text> </xsl:if> <xsl:text disable-output-escaping="yes">: </xsl:text> <xsl:for-each select="item"> <xsl:variable name="namedItem" select="name"/> <xsl:choose> <xsl:when test="count(../item[name=$namedItem]) > 1"><xsl:value-of select="concat($namedItem,'(x',count(../item/name[.=$namedItem]),');')"/></xsl:when> <xsl:otherwise><xsl:value-of select="concat(name,';')"/></xsl:otherwise> </xsl:choose> </xsl:for-each> <xsl:for-each select="choice"><xsl:value-of select="concat(name,';')"/></xsl:for-each> <xsl:value-of select="$newlinefeed"/> <xsl:for-each select=".//regiment[not(../@category = 'Wargear Item')] "> <xsl:apply-templates select="." mode="regiment" /> </xsl:for-each> <xsl:value-of select="$newlinefeed"/> </xsl:template> <xsl:template match="regiment" mode="regiment"> <xsl:variable name="statCountLocal" select="count(stat)"/> <xsl:variable name="depth"> <xsl:choose> <xsl:when test="../../@stat_count=1"><xsl:value-of select="number(@depth)-1" /></xsl:when> <xsl:otherwise><xsl:value-of select="@depth" /></xsl:otherwise> </xsl:choose> </xsl:variable> <xsl:value-of select="concat('[',@model_count,'] ')"/> <xsl:variable name="itemcost"> <xsl:if test="@cost"><xsl:value-of select="substring-after(substring-before(@cost,']'),'[')"/></xsl:if> </xsl:variable> <xsl:variable name="transportcost"> <xsl:if test="regiment[@stat_set = 0]/@cost"><xsl:value-of select="substring-after(substring-before(regiment[@stat_set = 0]/@cost,']'),'[')"/></xsl:if> </xsl:variable> <xsl:choose> <xsl:when test="((../@composition='HQ' and @depth = 1) or (@depth = 0)) and (regiment[@stat_set = 0])"> <xsl:value-of select="concat('[',$itemcost - $transportcost,'] ')"/> </xsl:when> <xsl:when test="@stat_set = 0"> <xsl:value-of select="concat('[',$itemcost,'] ')"/> </xsl:when> <xsl:when test="../@composition = 'HQ' and 
../@stat_count > 1"> <xsl:value-of select="concat('[',$itemcost,'] ')"/> </xsl:when> </xsl:choose> <xsl:call-template name="formatName"><xsl:with-param name="strName" select="concat(name,': ')"/></xsl:call-template> <xsl:for-each select="item[not(name = preceding-sibling::item/name)]"> <xsl:variable name="namedItem" select="name"/> <xsl:choose> <xsl:when test="count(../item[name=$namedItem]) > 1"><xsl:value-of select="concat($namedItem,'(x',count(../item[name=$namedItem]),');')"/></xsl:when> <xsl:otherwise><xsl:value-of select="concat(name,';')"/></xsl:otherwise> </xsl:choose> </xsl:for-each> <xsl:for-each select="choice"><xsl:value-of select="concat(name,';')"/></xsl:for-each> <xsl:value-of select="$newlinefeed"/> </xsl:template> <xsl:template name="getRetinueCost"> <xsl:param name="retinuecostSum" /> <xsl:param name="current" /> <xsl:param name="rest" /> <xsl:variable name="curCost"> <xsl:choose> <xsl:when test="contains($current/@cost,'[')"><xsl:value-of select="substring-after(substring-before($current/@cost,']'),'[')" /></xsl:when> <xsl:otherwise><xsl:value-of select="$current" /></xsl:otherwise> </xsl:choose> </xsl:variable> <xsl:choose> <xsl:when test="$current"> <xsl:call-template name="getRetinueCost"> <xsl:with-param name="retinuecostSum" select="$retinuecostSum + $curCost"/> <xsl:with-param name="current" select="$rest[position()=1]"/> <xsl:with-param name="rest" select="$rest[position()!=1]"/> </xsl:call-template> </xsl:when> <xsl:otherwise> <xsl:value-of select="$retinuecostSum" /> </xsl:otherwise> </xsl:choose> </xsl:template> <xsl:template name="halfCost"> <xsl:param name="itemCost"/> <xsl:value-of select="round($itemCost div 2)" /> </xsl:template> <xsl:template name="formatName"> <xsl:param name="strName"/> <xsl:choose> <xsl:when test="contains($strName,' (')"><xsl:value-of select="substring-before($strName,' (')"/></xsl:when> <xsl:otherwise><xsl:value-of select="$strName"/></xsl:otherwise> </xsl:choose> </xsl:template> <xsl:template 
name="doReplaceCar"> <xsl:param name="text"/> <xsl:param name="replace"/> <xsl:param name="by"/> <xsl:choose> <xsl:when test="contains($text, $replace)"> <xsl:value-of select="substring-before($text, $replace)" disable-output-escaping="yes"/> <xsl:value-of select="$by" disable-output-escaping="yes"/> <xsl:call-template name="doReplaceCar"> <xsl:with-param name="text" select="substring-after($text, $replace)"/> <xsl:with-param name="replace" select="$replace"/> <xsl:with-param name="by" select="$by"/> </xsl:call-template> </xsl:when> <xsl:otherwise> <xsl:value-of select="$text" disable-output-escaping="yes"/> </xsl:otherwise> </xsl:choose> </xsl:template> </xsl:stylesheet>
No, I am not writing a stylesheet like that again, and the one to parse the roster files would be far more complicated.
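For comparison, and purely as a sketch, pulling a couple of attributes out of one of those link elements with REXML is only a few lines. The link below is a trimmed copy of the crew example from earlier, and the wanted list is just an assumed pick of attributes, not a claim about which ones matter:

```ruby
require "rexml/document"

# Trimmed copy of the dwWarCrew link; most attributes left out for brevity.
xml = '<link id="dwWarCrew" count="1" name="Crew" category="Equip" visible="no" sourceid="dwBoltThrw"></link>'
link = REXML::Document.new(xml).root

# Hypothetical list of attributes worth keeping.
wanted = %w[name category]
picked = wanted.to_h { |key| [key, link.attributes[key]] }

puts picked.inspect
```

Everything being an attribute is annoying to read, but at least it's one hash lookup per value.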
The problem with the roster file, fundamentally, is that it’s too tightly linked with ArmyBuilder. That makes sense, in a way, but is still irksome. The <link> elements don’t have any nodes under them, just assloads of attributes, and it’s not easy to figure out which ones I am interested in:
<link id="HeavyArmor" count="1" actual="1" script="0" sequence="26" pseudo="no" totalcost="0" name="Heavy Armor" category="Equip" abbrev="Hv" description="5+ Armor Save" equipment="yes" footnote="yes" sourceid="dwWarrVet" sourceindex="5"></link>
Versus ones I’m not interested in:
<link id="ItemCost" count="1" actual="1" script="0" sequence="28" pseudo="no" totalcost="0" name="Item Cost Worker" category="Equip" visible="no" sourcetype="3" sourceid="Globals" sourceindex="1"></link>
There’s no way to separate the two without passing a long hashlist of element.attribute[$thing] values, or specifically excluding anything with “Helper” or “Worker” or whatever in the name, etc. Not to mention it’s formatted as:
<document>
  <squad>
    <!-- unit name and cost is here -->
    <entity>
      <!-- unit stats are here, along with composition and whatnot -->
      <link>
        <!-- sometimes there's nothing of note in the link tags -->
        <entity>
          <!-- this might be a magic item, warmachine crew, magic banner, champion, and probably other stuff, but is not easily identified, and there may be more than one -->
          <link>
            <!-- might be info for whatever is in entity, might be a helper which I don't want -->
          </link>
          <unitstat>
            <!-- if it's crew, champion, whatever, stats would be here, but this node may not exist -->
          </unitstat>
        </entity>
      </link>
    </entity>
  </squad>
</document>
The problem with some of these is that, by the logic of whoever wrote ArmyBuilder, champions fall into the “Equip” category. There is, in fact, an “isunit” attribute, but it isn’t set to “yes” anywhere. It’s only ever set to “no”, and only on items, which I can’t figure out (unless there’s some kind of magic item which qualifies as a unit you can add? I don’t know).
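To make the filtering problem concrete, here is a minimal sketch of the exclusion approach using REXML. It is not the actual parser. The sample document is trimmed down from the two <link> examples, the NOISE pattern and the method names are my own guesses, and treating visible="no" as "skip this" is an assumption based on the one uninteresting link shown, not something ArmyBuilder documents.

```ruby
require 'rexml/document'

# Hypothetical sample, trimmed from the two <link> examples in the post.
SAMPLE = <<~XML
  <document>
    <squad>
      <entity>
        <link id="HeavyArmor" name="Heavy Armor" category="Equip" description="5+ Armor Save" equipment="yes"></link>
        <link id="ItemCost" name="Item Cost Worker" category="Equip" visible="no" sourceid="Globals"></link>
      </entity>
    </squad>
  </document>
XML

# Guesswork: name fragments that seem to mark ArmyBuilder bookkeeping links.
NOISE = /Helper|Worker/

# True when a <link> looks like something worth showing on a roster.
def interesting?(link)
  return false if link.attributes['visible'] == 'no'
  return false if link.attributes['name'].to_s =~ NOISE
  true
end

# Collect interesting links at any depth, since <entity> and <link> nest.
def interesting_links(doc)
  REXML::XPath.match(doc, '//link').select { |link| interesting?(link) }
end

doc = REXML::Document.new(SAMPLE)
names = interesting_links(doc).map { |link| link.attributes['name'] }
puts names.inspect
```

Searching with //link sidesteps the nesting, but it also throws away which entity each link belongs to, so a real parser would still have to walk the tree to tell a unit's own equipment from a champion's.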
I’ve got a working parser written in Ruby, but I haven’t converted it to C# yet. Also, I haven’t tested it against anything that might have a more complicated schema than dwarves: mounted units, chariots, embedded assassins, magic, et al. It also needs to check whether a unit is undead/daemon/greenskin and whether special rules apply (since not everything in the army is guaranteed to be). Sadly, the only roster I’ve got at work is for dwarves, so I’ll have to dump some more output from ArmyBuilder and run the parser against it to see how it copes.
Any other niche cases either of you can think of that may have specific rules? I’m going to try to stabilize the parser and get it to properly validate every army type, then move it to .NET.
Also, thinking about it, I’m utterly convinced that snapping things to some kind of grid is the only really feasible solution. Querying the object via System.Drawing or GDI might work, but I’m not sure how accurate the pixel mapping is. At any rate, for things like the Lance Formation, line of sight on skirmishers, determining base contact for embedded champions/characters, reforming the unit, and templates, a grid seems like the only way to go without doing occlusion detection (for the templates). Convert inches to millimeters, and make it 1mm x 1mm squares or something.
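A rough sketch of what the grid idea could look like, in Ruby for now since that’s what the parser is in. Everything here is hypothetical: the Base struct, the method names, and especially the assumption that bases are axis-aligned rectangles, which stops being true the moment a unit wheels. It also treats corner-to-corner as not being in contact, which is a rules call I haven’t checked.

```ruby
MM_PER_INCH = 25.4

# Convert a tabletop distance in inches to whole grid cells (1 cell = 1 mm).
def inches_to_cells(inches)
  (inches * MM_PER_INCH).round
end

# Hypothetical footprint of a rectangular base: top-left cell plus size in mm.
Base = Struct.new(:x, :y, :width, :height) do
  def right
    x + width
  end

  def bottom
    y + height
  end
end

# Two axis-aligned bases are in contact when they share part of an edge:
# zero gap on one axis and a positive overlap on the other.
def base_contact?(a, b)
  overlap_x = [a.right, b.right].min - [a.x, b.x].max
  overlap_y = [a.bottom, b.bottom].min - [a.y, b.y].max
  (overlap_x == 0 && overlap_y > 0) || (overlap_y == 0 && overlap_x > 0)
end

champion = Base.new(0, 0, 20, 20)
enemy    = Base.new(20, 0, 20, 20)
puts base_contact?(champion, enemy)
puts inches_to_cells(4)
```

With integer cells the contact test is exact comparison instead of floating-point fuzz, which is most of the appeal over asking System.Drawing where the pixels ended up.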