NOVEMBER 21ST, 2008
By RYAN
Given that I’ve deleted my Facebook account, and I’ll be deleting my MySpace account shortly (they never get used, and I don’t see the point), I decided to look into importing my posts from MySpace into WordPress. As it turns out, the WordPress developers are apparently really ridiculously lazy, or just don’t give a shit about importing. Somebody had written a Perl script which pulls MySpace blogs into RSS, but importing RSS into WordPress doesn’t bring the comments along.
After looking over the formats WordPress -could- import from, I threw out everything that wraps its content in CDATA sections in XML (almost every format). Fortunately, the Movable Type developers don’t see a need to dump binary blobs in XML, or use XML at all. Their format is refreshingly simple. Hence, a parser that runs through every blog post on somebody’s MySpace, pulls out the data that matters (title, date, post, commenters and their comments), then writes it out in Movable Type format. You’ll note that I now have way more posts on here than I did before, some of them from before this website existed.
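To show how little there is to the Movable Type format, here is a minimal sketch of writing one entry. The `mt_entry` helper and its keyword arguments are made up for illustration and are not part of the script; the field names and separators are the same ones the script emits.

```ruby
# Hypothetical helper (not in the original script): renders one post and its
# comments in Movable Type's plain-text import format.
def mt_entry(title:, author:, date:, body:, category: "MySpace", comments: [])
  lines = []
  lines << "TITLE: #{title}"
  lines << "AUTHOR: #{author}"
  lines << "DATE: #{date}"          # MM/DD/YYYY HH:MM:SS, as the script formats it
  lines << "CATEGORY: #{category}"
  lines << "-----"                  # five hyphens close a section
  lines << "BODY:"
  lines << body
  lines << "-----"
  comments.each do |c|
    lines << "COMMENT:"
    lines << "AUTHOR: #{c[:author]}"
    lines << "DATE: #{c[:date]}"
    lines << c[:text]
    lines << "-----"
  end
  lines << "--------"               # eight hyphens close the whole entry
  lines.join("\n") + "\n"
end

entry = mt_entry(
  title: "Hello", author: "Ryan", date: "11/21/2008 18:30:00",
  body: "First post.",
  comments: [{ author: "Someone", date: "11/22/2008 09:00:00", text: "Nice." }]
)
puts entry
```

Several entries can simply be concatenated into one file, since each one ends with its own eight-hyphen line.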
Code:
#!/usr/bin/ruby
require 'open-uri'
require 'time'
class Comment
attr_accessor :author, :datewritten, :comment
def initialize (author, datewritten, comment)
@author, @datewritten, @comment = author, datewritten, comment
end
end
class Post
attr_accessor :author, :title, :datewritten, :body, :comments
def initialize (author, title, datewritten, body)
@author, @title, @datewritten, @body = author, title, datewritten, body
@comments = []
end
def addcomment (author, datewritten, comment)
@comments.push(Comment.new(author, datewritten, comment))
end
end
class Ripper
def initialize
@pages = []
@posts = Array.new
end
def get (uri)
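#Kernel#open stopped fetching URLs in Ruby 3.0; URI.open is open-uri's entry point
#The block form also closes the connection once the read is done
URI.open(uri) { |connection| connection.read }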
end
def parse (uri)
content = get(uri)
#blogContentInfo points to links to posts
links = content.scan(/class="blogContentInfo">.*?<a href=".*?">/m)
links.each do |link|
#Strip out the bullshit amazon links
unless link =~ /amazon/
#Pull the URL out of the link
link = (/.*<a href="(.*)">/m).match(link)[1]
@pages.push(link)
end
end
#Checking if there are any older pages with a hyperlink
if content =~ /\[.*?<a href="(.*?)">Older<\/a>/
#If so, call itself recursively to pull out the rest
#Myspace breaks the URI standard. Replace the spaces with real escapes
parse($1.gsub(/\s/, "%20"))
else
#Edge case to break out of the loop for when there aren't any more older
parsepages()
end
end
def parsepages()
@pages.each do |uri|
#Replace with yourself, if you want
author = "Ryan"
content = get(uri)
#Pull out the fields I want
title = (/blogSubject">(.*?)\n/m).match(content)[1]
body = (/blogContent">(.*?)<table/m).match(content)[1]
datewritten = (/blogTimeStamp">(.*?)<\/p>/m).match(content)[1].gsub(/(^\s+|\n+)/, "")
time = (/blogContentInfo"><b>.*?(\d+:\d+)/m).match(content)[1]
datewritten = datewritten + " #{time}:00"
#Parse the time, and force it into something Wordpress can deal with
t = Time.parse(datewritten)
datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
puts "Title: #{title}\n"
#Create a new Post object
post = Post.new(author, title, datewritten, body)
#Pull out an array of all the comment blocks
comments = content.scan(/id="blogComments.*?commentSpacer/m)
#Pass off the post object along with the list of comments
parsecomments(comments, post)
end
end
def parsecomments(comments, post)
comments.each do |com|
author = (/profileLinks">(.*?)</m).match(com)[1]
puts "Author: #{author}\n"
#MySpace decided to make the CSS ids identical here, except that the
#actual comment doesn't have "Posted" after the closing tag
#Filter it as such
comment = (/blogCommentsContent">(.*?)<\/p>/m).match(com)[1]
datewritten = (/blogCommentsContent">Posted by.*?> on(.*?)<b/m).match(com)[1].gsub(/\n|\t|\r/, "")
t = Time.parse(datewritten)
#The same datetime munging as before
datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
#Commit each comment to our post object
post.addcomment(author, datewritten, comment)
end
#Push them all into our class array
@posts.push(post)
end
def print(file)
@posts.each do |post|
#Using Movable Type's export syntax, so I don't need to mess with XML
#It's documented here: http://www.sixapart.com/moveabletype/docs/mtimport#example
#Basically, five hyphens separate the sections within a post
#Eight hyphens separate each post
file.puts "TITLE: #{post.title}"
file.puts "AUTHOR: #{post.author}"
file.puts "DATE: #{post.datewritten}"
#Change this, too, if you want
file.puts "CATEGORY: MySpace"
file.puts "-----"
#Get rid of empty lines and fucking Windows ^M newlines, plus convert &nbsp; to " "
file.puts "BODY:\n#{post.body.gsub(/^(\s+|\t+|\n+)$/, "").gsub(/\015/, "").gsub(/&nbsp;/, " ")}"
file.puts "-----"
post.comments.each do |com|
#More stuff is possible here, but isn't necessary
file.puts "COMMENT:"
file.puts "AUTHOR: #{com.author}"
file.puts "DATE: #{com.datewritten}"
file.puts "#{com.comment}"
file.puts "-----"
end
file.puts "--------"
end
end
end
#Instantiate it
ripper = Ripper.new
#Parse my blog (substitute whatever yours is here)
ripper.parse("http://blog.myspace.com/lykurgos")
#Output it
#The block form closes (and flushes) the file when it's done
File.open("posts.txt", "a") do |output|
ripper.print(output)
end
puts "Done!\n"
#Import into Wordpress!
Maybe somebody will actually find it useful.
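If you want to adapt it, the link-scraping step is easy to try against canned HTML before pointing it at a live blog. The sketch below reuses the same regexes as `parse`; the sample markup, the example.com URLs, and the helper names `extract_post_links` and `older_page` are made up for illustration, and the real MySpace pages may well look different.

```ruby
# Hypothetical page fragment, shaped like the markup the regexes expect.
SAMPLE_PAGE = <<~HTML
  <p class="blogContentInfo"><b>10:15 PM</b>
    <a href="http://blog.example.com/post/1">View</a></p>
  <p class="blogContentInfo"><b>9:00 AM</b>
    <a href="http://www.amazon.com/some-product">Buy</a></p>
  <p class="blogContentInfo"><b>8:00 AM</b>
    <a href="http://blog.example.com/post/2">View</a></p>
  [ <a href="http://blog.example.com/page 2">Older</a> ]
HTML

# Collect the post URLs, skipping the Amazon links.
def extract_post_links(html)
  html.scan(/class="blogContentInfo">.*?<a href=".*?">/m)
      .reject { |chunk| chunk =~ /amazon/ }
      .map { |chunk| chunk[/<a href="(.*?)">/m, 1] }
end

# Return the "Older" page URL with spaces escaped, or nil on the last page.
def older_page(html)
  url = html[/\[.*?<a href="(.*?)">Older<\/a>/, 1]
  url && url.gsub(/\s/, "%20")
end

links = extract_post_links(SAMPLE_PAGE)
older = older_page(SAMPLE_PAGE)
puts links
puts older
```

Keeping the scraping in small functions that take a string means the network fetch can stay out of the way while the regexes get tuned.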
OCTOBER 2ND, 2008
By RYAN
Fixed the nested <item> blocks. Ran into another problem where it didn’t parse nested characters and their items properly, then yet another where some Gifts of Khaine (and probably other things I haven’t seen in either list) are essentially nested worthlessness. Fixed code for it:
def parseitem(d, addto)
added = addto.add_element(d.attributes["name"].gsub(/\s+/, '')).add_text(d.attributes["description"].gsub(/\./, ''))
d.elements.each('link') do |ele|
unless ele.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
if !ele.attributes["description"].nil?
added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text(ele.attributes["description"])
else
#This is necessary for some Gifts of Khaine, apparently
if added.text == ele.attributes["name"]
print "Found duplicated #{added.text}!\n\n"
added.text = ''
end
added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text("True")
end
end
end
end
def parsenested(process, addto)
#Try to guess if it's a champion, character in the unit, or item
process.elements.each('entity') do |d|
if d.attributes["statset"] =~ /Normal/
#It's a character, crew, or mount. Figure out which
if d.attributes["totalcost"] !~ /^0/
#It's a champion or character
adder = addto.add_element("champion")
puts "Found champ\n"
parse(d, adder)
else
#It's crew or the like
adder = addto.add_element("crew")
puts "Found crew\n"
parse(d, adder)
end
else
#It's an item
puts "Found item\n"
if addto.elements["item"].nil?
@adder = addto.add_element("item")
end
parseitem(d, @adder)
end
end
end
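The duplicate handling in `parseitem` is easier to see on a tiny input. This is a simplified sketch, not the roster format itself: the `<entity>` below is invented, and `copy_item` is a hypothetical stand-in that follows the same rules (skip helper links, blank the parent's text when a description-less link repeats it, and mark such links as "True").

```ruby
require "rexml/document"

# Invented sample: the entity's description repeats the name of a link that
# has no description of its own, which is the case the fix targets.
GIFT_XML = <<~XML
  <entity name="Sample Gift" description="Inner Rule.">
    <link name="Inner Rule"/>
    <link name="Other Rule" description="Some effect"/>
    <link name="Points Cost" description="25"/>
  </entity>
XML

def copy_item(entity, parent)
  node = parent.add_element(entity.attributes["name"].gsub(/\s+/, ""))
  node.add_text(entity.attributes["description"].to_s.gsub(/\./, ""))
  entity.elements.each("link") do |link|
    name = link.attributes["name"]
    next if name =~ /(Worker|Helper|Cost|Left)/
    description = link.attributes["description"]
    if description.nil?
      # The parent's text only repeats this link's name, so drop it
      node.text = "" if node.text == name
      description = "True"
    end
    node.add_element(name.gsub(/\s+/, "")).add_text(description)
  end
  node
end

item = REXML::Element.new("item")
copy_item(REXML::Document.new(GIFT_XML).root, item)
puts item
```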
OCTOBER 1ST, 2008
By RYAN
Ok, boring. I didn’t spend as much time working on it tonight as I intended to, but it parses the Dwarf roster, at least, fine. It does not parse the Dark Elf roster properly (namely, it doesn’t pull the description out of items or Gifts of Khaine, and it doubles up the <item> tag for reasons I’m not sure of), but that’ll get fixed when I’m at work tomorrow.
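The general shape of the conversion is: read attributes off the roster, copy the non-blank ones into a new tree, and pretty-print the result. Here is a minimal sketch of just that, assuming a made-up one-line roster; `build_info` and `pretty` are hypothetical helper names, and setting `compact` on the formatter is my choice to keep short text on one line.

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

# Invented roster fragment; the attribute names are the ones the script reads.
ROSTER_XML = '<document><roster race="Dwarf" size="1500" activesize="" racename="Dwarfs"/></document>'

# Copy the roster-level attributes into an <info> element, skipping blanks.
def build_info(roster_xml)
  roster = REXML::Document.new(roster_xml)
  army = REXML::Element.new("army")
  roster.elements.each("document/roster") do |ele|
    info = army.add_element("info")
    %w[race size activesize racename].each do |attr|
      value = ele.attributes[attr].to_s   # nil-safe: a missing attribute becomes ""
      next if value.empty?
      info.add_element(attr).add_text(value)
    end
  end
  army
end

# Render an element tree as indented XML in a string.
def pretty(element)
  formatter = REXML::Formatters::Pretty.new
  formatter.compact = true                # keep short text on the same line as its tags
  out = String.new
  formatter.write(element, out)
  out
end

puts pretty(build_info(ROSTER_XML))
```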
Ruby code:
#!/usr/bin/ruby
require "rexml/document"
require "pp"
require "rexml/formatters/pretty"
include REXML
inputxml = File.read('dwarfroster.rst')
@roster = Document.new inputxml
@army = Document.new.add_element("army")
def parsenested(process, addto)
#Try to guess if it's a champion, character in the unit, or item
process.elements.each('entity') do |p|
#puts p
if p.elements["link"].has_elements?
#Recursively run through these to figure out what the hell it is
#String#any? is gone since Ruby 1.9, so check for a non-empty string instead
if !p.elements["link/entity"].attributes["itemsummary"].to_s.empty?
adder = addto.add_element("item")
puts "Found nested\n"
parsenested(p.elements["link"], adder)
else
#This is really just stubbed out, since I haven't seen it
end
elsif p.attributes["statset"] =~ /Normal/
#It's a character, crew, or mount. Figure out which
if p.attributes["totalcost"] !~ /^0/
#It's a champion or character
adder = addto.add_element("champion")
puts "Found champ\n"
parse(p, adder)
else
#It's crew or the like
adder = addto.add_element("crew")
puts "Found crew\n"
parse(p, adder)
end
else
#It's an item
puts "Found item\n"
if addto.elements["item"].nil?
@adder = addto.add_element("item")
end
added = @adder.add_element(p.attributes["name"].gsub(/\s+/, ''))
p.elements.each('link') do |ele|
unless ele.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
added.add_element(ele.attributes["name"].gsub(/\s+/, '')).add_text(ele.attributes["description"])
end
end
end
end
end
def parse(s, addto)
#In some cases, the basename differs (i.e. Supreme Sorc vs. High Sorc)
#Also, it'll pick up whether there's a champion in the unit by the diff
#of base and count
%w[basename count base].each do |b|
#to_s covers a missing attribute; String#any? is gone since Ruby 1.9
if !s.attributes[b].to_s.empty?
addto.add_element(b).add_text(s.attributes[b])
end
end
stats = addto.add_element("stats")
s.elements.each('unitstat') do |a|
#unit.fetch(:stats) { |el| unit[el] = {}}
#I don't want blank stats
if !a.attributes["value"].to_s.empty? && (a.attributes["value"] !~ /(0|-)/)
stats.add_element(a.attributes["name"].gsub(/\s+/, '')).add_text(a.attributes["value"])
end
end
s.elements.each('link') do |link|
if link.has_elements?
#Figure out what the hell it is
parsenested(link, addto)
else
#unitatt = unit.add_element("attributes")
#Rip out the name if it doesn't have "Helper, Worker, Points Left, or Cost"
unless link.attributes["name"] =~ /(Worker|Helper|Cost|Left)/
#Get rid of the stuff in braces AB puts in
if addto.elements["attributes"].nil?
@unitatt = addto.add_element("attributes")
end
@unitatt.add_element(link.attributes["name"].gsub(/\{.*?\}/, '').gsub(/\s+/, '')).add_text('true')
end
end
end
end
@roster.elements.each('document/roster') do |ele|
info = @army.add_element("info")
#Pick out the race, army name, total points, used points, canonical race name
%w[race size activesize racename].each do |attr|
info.add_element(attr).add_text(ele.attributes[attr])
end
#@army.push(info)
end
@roster.elements.each('document/squad') do |ele|
@unit = @army.add_element("unit")
#Pick out the name of the model and its cost, plus how many models
%w[name modelcount totalcost].each do |attr|
@unit.add_element(attr).add_text(ele.attributes[attr])
end
ele.elements.each('entity') do |s|
#Parse it out
parse(s, @unit)
end
end
#pp @army
prettyprint = REXML::Formatters::Pretty.new
output = String.new
#write fills the output string, so print that rather than the return value
prettyprint.write(@army, output)
puts output
And the XML output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<army>
<info>
<race>Dwarf</race>
<size>1500</size>
<activesize>1499.</activesize>
<racename>Dwarfs</racename>
</info>
<unit>
<name>Thane</name>
<modelcount>1</modelcount>
<totalcost>134</totalcost>
<basename>Thane</basename>
<count>1</count>
<base>1</base>
<stats>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>3+</Save>
<St>4/8</St>
<To>5</To>
<UnitSt.>1</UnitSt.>
<WS>6</WS>
<Wo>2</Wo>
<At>3</At>
<BS>4</BS>
<In>3</In>
<ItemPts>75</ItemPts>
</stats>
<attributes>
<General>true</General>
<HandWeapon>true</HandWeapon>
<GreatWeapon>true</GreatWeapon>
<GromrilArmor>true</GromrilArmor>
</attributes>
<item>
<RunicWeapon>
<MasterRuneofKraggtheGrim>Allows other runes to be placed on a Great Weapon.</MasterRuneofKraggtheGrim>
<RuneofCleaving>+1 Strength</RuneofCleaving>
</RunicWeapon>
<RunicArmor>
<RuneofStone>+1 Armor Save</RuneofStone>
</RunicArmor>
</item>
</unit>
<unit>
<name>Thane</name>
<modelcount>1</modelcount>
<totalcost>132</totalcost>
<basename>Thane</basename>
<count>1</count>
<base>1</base>
<stats>
<In>3</In>
<ItemPts>75</ItemPts>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>2+/1+</Save>
<St>4/7</St>
<To>5</To>
<UnitSt.>1</UnitSt.>
<WS>6</WS>
<Wo>2</Wo>
<At>3</At>
<BS>4</BS>
</stats>
<attributes>
<HandWeapon>true</HandWeapon>
<GromrilArmor>true</GromrilArmor>
<Shield>true</Shield>
</attributes>
<item>
<RunicWeapon>
<RuneofCleaving>+1 Strength</RuneofCleaving>
</RunicWeapon>
<RunicArmor>
<RuneofStone>+1 Armor Save</RuneofStone>
</RunicArmor>
</item>
</unit>
<unit>
<name>Thane</name>
<modelcount>1</modelcount>
<totalcost>95</totalcost>
<basename>Thane</basename>
<count>1</count>
<base>1</base>
<stats>
<In>3</In>
<ItemPts>75</ItemPts>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>3+</Save>
<St>4</St>
<To>5</To>
<UnitSt.>1</UnitSt.>
<WS>6</WS>
<Wo>2</Wo>
<At>3</At>
<BS>4</BS>
</stats>
<attributes>
<HandWeapon>true</HandWeapon>
<GromrilArmor>true</GromrilArmor>
<BattleStandardBearer>true</BattleStandardBearer>
</attributes>
<item>
<RunicArmor>
<RuneofStone>+1 Armor Save</RuneofStone>
</RunicArmor>
</item>
</unit>
<unit>
<name>Dwarf Warriors</name>
<modelcount>20</modelcount>
<totalcost>205</totalcost>
<basename>Dwarf Warriors</basename>
<count>19</count>
<base>20</base>
<stats>
<In>2</In>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>4+/3+</Save>
<St>3</St>
<To>4</To>
<UnitSt.>1</UnitSt.>
<WS>4</WS>
<Wo>1</Wo>
<At>1</At>
<BS>3</BS>
</stats>
<champion>
<basename>Veteran</basename>
<count>1</count>
<base>1</base>
<stats>
<In>2</In>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>4+/3+</Save>
<St>3</St>
<To>4</To>
<UnitSt.>1</UnitSt.>
<WS>4</WS>
<Wo>1</Wo>
<At>2</At>
<BS>3</BS>
</stats>
<attributes>
<HandWeapon>true</HandWeapon>
<HeavyArmor>true</HeavyArmor>
<Shield>true</Shield>
</attributes>
</champion>
<attributes>
<Musician>true</Musician>
<StandardBearer>true</StandardBearer>
<HandWeapon>true</HandWeapon>
<HeavyArmor>true</HeavyArmor>
<Shield>true</Shield>
</attributes>
</unit>
<unit>
<name>Quarellers</name>
<modelcount>10</modelcount>
<totalcost>110</totalcost>
<basename>Quarrellers</basename>
<count>10</count>
<base>10</base>
<stats>
<In>2</In>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>6+</Save>
<St>3</St>
<To>4</To>
<UnitSt.>1</UnitSt.>
<WS>4</WS>
<Wo>1</Wo>
<At>1</At>
<BS>3</BS>
</stats>
<attributes>
<HandWeapon>true</HandWeapon>
<Crossbow>true</Crossbow>
<LightArmor>true</LightArmor>
</attributes>
</unit>
<unit>
<name>Quarellers</name>
<modelcount>10</modelcount>
<totalcost>110</totalcost>
<basename>Quarrellers</basename>
<count>10</count>
<base>10</base>
<stats>
<In>2</In>
<Ld>9</Ld>
<Mv>3</Mv>
<Save>6+</Save>
<St>3</St>
<To>4</To>
<UnitSt.>1</UnitSt.>
<WS>4</WS>
<Wo>1</Wo>
<At>1</At>
<BS>3</BS>
</stats>
<attributes>
<HandWeapon>true</HandWeapon>
<Crossbow>true</Crossbow>
<LightArmor>true</LightArmor>
</attributes>
</unit>
<unit>
<name>Ironbreakers</name>
<modelcount>14</modelcount>
<totalcost>237</totalcost>
<basename>Ironbreakers</basename>
<count>13</count>
<base>14</base>
<stats>
<Ld>9</Ld>
<Mv>3