Importing From MySpace
Given that I’ve deleted my Facebook account, and I’ll be deleting my MySpace account shortly (they never get used, and I don’t see the point), I decided to look into the possibility of importing posts from MySpace to WordPress. As it turns out, the WordPress developers are apparently really ridiculously lazy, or just don’t give a shit about importing. Somebody had written a Perl script which pulled MySpace blogs into RSS, but importing RSS into WordPress doesn’t bring the comments along.
After looking over the formats WordPress -could- import from, I threw out everything with <![CDATA[ ]]> sections in XML (almost every format). Fortunately, the Movable Type developers don’t see a need to dump binary blobs in XML, or use XML at all. Their format is refreshingly simple. Hence this parser, which runs through every blog post on somebody’s MySpace, pulls out the data that matters (title, date, post, commenters and their comments), then writes it out in Movable Type format. You’ll note that I now have way more posts on here than I did before, some of them from before this website existed.
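To give a feel for how simple the target format is, here is a minimal sketch of one entry being written out. It only mirrors the layout the script below emits, not the whole Movable Type spec. The `mt_entry` helper, the `Struct` stand-ins and the sample values are all made up for illustration.

```ruby
# Hypothetical stand-ins for the Post and Comment classes in the real script.
Comment = Struct.new(:author, :datewritten, :comment)
Post = Struct.new(:author, :title, :datewritten, :body, :comments)

# Render one post: metadata first, then BODY and COMMENT sections, each
# closed by five hyphens, with eight hyphens ending the whole entry.
def mt_entry(post)
  lines = [
    "TITLE: #{post.title}",
    "AUTHOR: #{post.author}",
    "DATE: #{post.datewritten}",
    "CATEGORY: MySpace",
    "-----",
    "BODY:",
    post.body,
    "-----"
  ]
  post.comments.each do |com|
    lines.push("COMMENT:",
               "AUTHOR: #{com.author}",
               "DATE: #{com.datewritten}",
               com.comment,
               "-----")
  end
  lines.push("--------")
  lines.join("\n") + "\n"
end

# Sample data, invented for the example.
post = Post.new("Ryan", "Example title", "03/12/2007 14:35:00", "Example body.", [
  Comment.new("Somebody", "03/12/2007 15:02:00", "Example comment.")
])
puts mt_entry(post)
```

Since every section is just labelled lines and hyphen separators, there is nothing to escape and no XML to get wrong.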
Code:
#!/usr/bin/ruby
require 'open-uri'
require 'time'

class Comment
  attr_accessor :author, :datewritten, :comment

  def initialize(author, datewritten, comment)
    @author, @datewritten, @comment = author, datewritten, comment
  end
end

class Post
  attr_accessor :author, :title, :datewritten, :body, :comments

  def initialize(author, title, datewritten, body)
    @author, @title, @datewritten, @body = author, title, datewritten, body
    @comments = []
  end

  def addcomment(author, datewritten, comment)
    @comments.push(Comment.new(author, datewritten, comment))
  end
end

class Ripper
  def initialize
    @pages = []
    @posts = []
  end

  def get(uri)
    #URI.open comes from open-uri; a bare open(uri) no longer fetches URLs on Ruby 3
    URI.open(uri).read
  end

  def parse(uri)
    content = get(uri)
    #blogContentInfo points to links to posts
    links = content.scan(/class="blogContentInfo">.*?<a href=".*?">/m)
    links.each do |link|
      #Strip out the bullshit amazon links
      unless link =~ /amazon/
        #Pull the URL out of the link
        link = (/.*<a href="(.*)">/m).match(link)[1]
        @pages.push(link)
      end
    end
    #Checking if there are any older pages with a hyperlink
    if content =~ /\[.*?<a href="(.*?)">Older<\/a>/
      #If so, call itself recursively to pull out the rest
      #Myspace breaks the URI standard. Replace the spaces with real escapes
      parse($1.gsub(/\s/, "%20"))
    else
      #No more older pages: stop recursing and go parse the collected posts
      parsepages()
    end
  end

  def parsepages()
    @pages.each do |uri|
      #Replace with yourself, if you want
      author = "Ryan"
      content = get(uri)
      #Pull out the fields I want
      title = (/blogSubject">(.*?)\n/m).match(content)[1]
      body = (/blogContent">(.*?)<table/m).match(content)[1]
      datewritten = (/blogTimeStamp">(.*?)<\/p>/m).match(content)[1].gsub(/(^\s+|\n+)/, "")
      time = (/blogContentInfo"><b>.*?(\d+:\d+)/m).match(content)[1]
      datewritten = datewritten + " #{time}:00"
      #Parse the time, and force it into something Wordpress can deal with
      t = Time.parse(datewritten)
      datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
      puts "Title: #{title}\n"
      #Create a new Post object
      post = Post.new(author, title, datewritten, body)
      #Pull out an array of all the comment blocks
      comments = content.scan(/id="blogComments.*?commentSpacer/m)
      #Pass off the post object along with the list of comments
      parsecomments(comments, post)
    end
  end

  def parsecomments(comments, post)
    comments.each do |com|
      author = (/profileLinks">(.*?)</m).match(com)[1]
      puts "Author: #{author}\n"
      #MySpace decided to make the CSS ids identical here, except that the
      #actual comment doesn't have "Posted" after the closing tag
      #Filter it as such
      comment = (/blogCommentsContent">(.*?)<\/p>/m).match(com)[1]
      datewritten = (/blogCommentsContent">Posted by.*?> on(.*?)<b/m).match(com)[1].gsub(/\n|\t|\r/, "")
      t = Time.parse(datewritten)
      #The same datetime munging as before
      datewritten = t.strftime("%m/%d/%Y %H:%M:%S")
      #Commit each comment to our post object
      post.addcomment(author, datewritten, comment)
    end
    #Push them all into our class array
    @posts.push(post)
  end

  def print(file)
    @posts.each do |post|
      #Using Movable Type's export syntax, so I don't need to mess with XML
      #It's documented here: http://www.sixapart.com/moveabletype/docs/mtimport#example
      #Basically, five hyphens separate the sections within a post
      #Eight hyphens separate each post
      file.puts "TITLE: #{post.title}"
      file.puts "AUTHOR: #{post.author}"
      file.puts "DATE: #{post.datewritten}"
      #Change this, too, if you want
      file.puts "CATEGORY: MySpace"
      file.puts "-----"
      #Get rid of empty lines and fucking Windows ^M newlines, plus convert &nbsp; to " "
      file.puts "BODY:\n#{post.body.gsub(/^(\s+|\t+|\n+)$/, "").gsub(/\015/, "").gsub(/&nbsp;/, " ")}"
      file.puts "-----"
      post.comments.each do |com|
        #More stuff is possible here, but isn't necessary
        file.puts "COMMENT:"
        file.puts "AUTHOR: #{com.author}"
        file.puts "DATE: #{com.datewritten}"
        file.puts "#{com.comment}"
        file.puts "-----"
      end
      file.puts "--------"
    end
  end
end

#Instantiate it
ripper = Ripper.new
#Parse my blog (substitute whatever yours is here)
ripper.parse("http://blog.myspace.com/lykurgos")
#Output it; the block form closes the file when it's done
File.open("posts.txt", "a") { |output| ripper.print(output) }
puts "Done!\n"
#Import into Wordpress!
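If you want to poke at the date munging without hitting MySpace at all, it can be pulled out into a tiny sketch like this. The `mt_date` helper name and the sample date string are assumptions for illustration (I'm not claiming that's exactly what MySpace prints), but the cleanup regex and the `strftime` pattern are the same ones the script uses.

```ruby
require 'time'

# Hypothetical helper: strip stray whitespace from the scraped date, tack on
# the HH:MM time with zeroed seconds, and reformat as MM/DD/YYYY hh:mm:ss.
def mt_date(date_text, clock)
  cleaned = date_text.gsub(/(^\s+|\n+)/, "")
  Time.parse("#{cleaned} #{clock}:00").strftime("%m/%d/%Y %H:%M:%S")
end

# Sample input, invented for the example.
puts mt_date("  Monday, March 12, 2007\n", "14:35")
```

Time.parse is forgiving about what it accepts, which is why the script can hand it whatever MySpace prints and still get a predictable timestamp out the other side.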
Maybe somebody will actually find it useful.