Importing WordPress content to Jekyll (moving from WordPress, part 4)
- Part 1 - An overview and discussion about Static Site Generators
- Part 2 - Setting up the infrastructure on Microsoft Azure
- Part 3 - Day-to-day management and writing content
In Part 1 of this series, I talked about static site generators and their benefits. In Part 2, I talked about setting things up in Azure, and in Part 3, I discussed how I manage the blog day-to-day. This post, Part 4, finally gets to all the fun details about how I moved the content from WordPress. There’s some C# code, some PowerShell, and some Ruby coming up!!
Converting WordPress content for use in Jekyll
After some digging on my Windows PC, I found the code and some reminders of how I converted the content from WordPress to something I could use in Jekyll.
The high-level steps:
- Create a new Jekyll site
- Download the WordPress backup XML file and put it in the root of the new site
- Install the jekyll-import gem
- Write a bit of code to use Jekyll import, pulling in the backup from step 2
- Fix issues found in the imported files
- Profit?
Create a new Jekyll site
I’ve covered this in previous posts, but here it is again:
jekyll new mysite
That gives you a basic website with a basic config and a single placeholder post. If you run
bundle exec jekyll serve
you can open http://localhost:4000 in your browser and see what it looks like.
Grab the backup from your WordPress site
WordPress makes it really easy to grab backups of your content, including any media you’ve uploaded.
While I chose the ‘Export All’ option, you can pick and choose which fields to export.
The export took a few seconds, and then I received an email with the download link.
Clicking the link downloaded a zip file that contained a single XML file. I unzipped that file into the new Jekyll site I created and renamed it to “wordpress.xml” to make it easier to run my processing utilities.
I also downloaded the media library, which was compressed in a tar file. I un-tar’d the file, and the resulting folders could be copied directly into the assets folder within the new Jekyll site.
Install the jekyll-import gem
The XML file is interesting
Because there’s no way I was going to write code to parse that XML file myself, I did a bit of searching and found the jekyll-import gem. This makes it incredibly easy to pull in all of the content from the site and convert it to a format usable by Jekyll (mostly).
gem install jekyll-import
I then found this snippet of code that I dropped into a file I named import.rb in the root of my new Jekyll site:
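require "jekyll-import"

# Import the WordPress.com export that was renamed to wordpress.xml earlier.
JekyllImport::Importers::WordpressDotCom.run({
  "source" => "wordpress.xml",   # path to the export file
  "no_fetch_images" => false,    # false means images are downloaded
  "assets_folder" => "assets"    # where downloaded images are saved
})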
I then ran
ruby .\import.rb
Depending on how much content you have, this could take a long time. I was only importing a couple hundred posts, so it didn’t take too long. It did download the images it could, but there were some errors with a few of them. No big deal, because I had also downloaded the media library and just extracted those files into my assets folder.
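If you want to know which images didn’t make it, a quick check along these lines can list them. This is only a minimal sketch: the `missing_assets` name is made up for illustration, and it assumes the imported posts live in `_posts` as HTML files whose `<img>` tags point at `/assets/...`, optionally prefixed with `{{ site.baseurl }}`. Your imported files may look different.

```ruby
# Hypothetical helper: lists asset paths referenced by <img> tags in the
# imported posts that do not exist on disk under the site root.
# Assumption: src values start with "/assets/" or "{{ site.baseurl }}/assets/".
IMG_SRC = /<img[^>]+src=["'](?:\{\{\s*site\.baseurl\s*\}\})?(\/assets\/[^"']+)["']/i

def missing_assets(site_root, posts_glob: "_posts/*.html")
  referenced = Dir.glob(File.join(site_root, posts_glob)).flat_map do |post|
    # scan returns one single-element array per match, so flatten it
    File.read(post).scan(IMG_SRC).flatten
  end

  # Keep only the paths that have no matching file under the site root
  referenced.uniq.reject { |src| File.exist?(File.join(site_root, src)) }
end
```

Running `puts missing_assets(".")` from the site root would print one path per missing image, which you can then compare against the extracted media library.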
Find and fix the problems
The typical front matter block for my blog posts looks like this - it’s short and only has the elements I care about.
---
layout: single
classes: wide
title: "Importing WordPress content to Jekyll (moving from WordPress, part 4)"
header:
date: 2024-07-20 07:00:00.000000000 -04:00
type: post
published: true
comments: true
categories: article
tags:
- leadership
- technology
- wordpress
- blogging
- dotnet
- ruby
- powershell
excerpt: The one where I talk about moving away from WordPress and show how I converted my WordPress content using some Ruby, some C#, and even a little Powershell.
---
The jekyll-import gem brought over every single bit of metadata from WordPress and included it in the front matter block, which meant lots of attributes. Here’s an example from one of the posts:
---
layout: post
title: Interesting links of the week (2022-1)
date: 2022-01-03 08:00:00.000000000 -05:00
type: post
parent_id: '0'
published: true
password: ''
status: publish
categories:
- Business
- Communication
- General
- Programming
tags: []
meta:
_last_editor_used_jetpack: block-editor
_rest_api_client_id: '43452'
_rest_api_published: '1'
_publicize_job_id: '67239321633'
timeline_notification: '1641215226'
_publicize_done_external: a:1:{s:7:"twitter";a:1:{i:18590267;s:54:"https://twitter.com/mjeaton/status/1269963968925233152";}}
_publicize_done_18776930: '1'
_wpas_done_18590267: '1'
publicize_twitter_user: mjeaton
_coblocks_attr: ''
_coblocks_dimensions: ''
_coblocks_responsive_height: ''
_coblocks_accordion_ie_support: ''
spay_email: ''
jetpack_anchor_podcast: ''
jetpack_anchor_episode: ''
jetpack_anchor_spotify_show: ''
_wpas_is_tweetstorm: ''
_wpas_feature_enabled: '1'
publicize_linkedin_url: ''
_publicize_done_24085812: '1'
_wpas_done_27278088: '1'
author:
login: mjeaton
email: mjeaton@gmail.com
display_name: mike
first_name: Michael
last_name: Eaton
permalink: "/2022/01/03/interesting-links-of-the-week-2020-23/"
excerpt: Here are some interesting articles and blog posts I’ve run into over the
last week (December 29, 2021 - January 2, 2022).
---
Cleaning Up with C#
It contained lots of stuff I didn’t want, so I wrote some C# to strip the crap out. It’s not production-level robust, but it did the job. I made sure I created backups of each file I touched so I could rerun as needed while I was developing it.
The gist of it is that I iterated over all the files generated by the import utility, created a temp file based on each filename, and opened the “real” file for input, reading it line-by-line and skipping the things I didn’t want in the final output. I stripped out what I called “WordPress cruft” from the content, too.
Here’s the whole thing:
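For readers who would rather stay in Ruby, the same line-by-line idea can be sketched as follows. This is not the C# code described above, just a minimal illustration of the approach. The `DROP_KEYS` list and the `strip_front_matter_keys` name are assumptions picked for the example, and it only handles front matter, not cruft in the post body.

```ruby
# Hypothetical list of top-level front matter keys to throw away.
DROP_KEYS = %w[parent_id password status meta author permalink].freeze

# Returns the post text with the unwanted top-level front matter keys removed,
# along with any nested or continuation lines that belong to them.
def strip_front_matter_keys(text, drop_keys = DROP_KEYS)
  lines = text.lines
  return text unless lines.first&.strip == "---"

  closing = lines[1..].index { |line| line.strip == "---" }
  return text if closing.nil?
  closing += 1 # convert to an index into the whole file

  skipping = false
  kept = lines[1...closing].select do |line|
    # An unindented "key:" line starts a new top-level entry; indented lines
    # and "- item" lines belong to whichever entry came before them.
    if (m = line.match(/\A([A-Za-z_]\w*):/))
      skipping = drop_keys.include?(m[1])
    end
    !skipping
  end

  (["---\n"] + kept + lines[closing..]).join
end
```

To apply it, you could loop over `Dir.glob("_posts/*.html")`, copy each file to a backup first, and then write the result of `strip_front_matter_keys(File.read(path))` back to the same path.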
Additional cleanup
At some point, I realized that I wanted every post to use the ‘single’ layout instead of the ‘post’ layout. I don’t recall WHY I used PowerShell. Maybe I felt like punishing myself, but this is what I came up with to make that particular change.
This grabs all HTML files recursively, replaces that particular string, and saves each file.
$posts = Get-ChildItem -Path . -Filter *.html -Recurse
foreach ($file in $posts)
{
    (Get-Content $file.PSPath) |
        ForEach-Object { $_ -replace "layout: post", "layout: single" } |
        Set-Content $file.PSPath
}
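One caveat with a blanket replace is that it also rewrites any body text that happens to contain the string "layout: post". If that matters for your content, the change can be limited to the front matter block. Here’s a minimal Ruby sketch of that idea; the `swap_layout` name and the `FRONT_MATTER` pattern are assumptions for illustration, and it expects files with LF line endings.

```ruby
# Matches the leading front matter block: an opening "---" line, everything up
# to the next line that is just "---", and that closing line itself.
FRONT_MATTER = /\A---[ \t]*\n.*?^---[ \t]*$/m

# Hypothetical helper: changes the layout value inside the front matter only,
# leaving the body of the post untouched.
def swap_layout(text, from: "post", to: "single")
  text.sub(FRONT_MATTER) do |front|
    front.sub(/^layout:[ \t]*#{Regexp.escape(from)}[ \t]*$/, "layout: #{to}")
  end
end
```

Files without a front matter block come back unchanged, so it should be safe to run over every HTML file the importer produced.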
Final Thoughts
That’s it! All the other changes I made related to the site layout, not the content.
You’ll notice that jekyll-import generated HTML files, because that’s how the content was created in WordPress. Much of that content remains HTML, but all of my current posts are written in Markdown because Markdown is sooooooo much better than HTML for content creation. I have considered going back and converting all of them, but it’s a manual process, and there’s very little benefit in touching 3- or 4-year-old files.
During this whole process, I was watching the site come together in my browser, making tweaks to layouts, menus, and categories. Jekyll makes those changes simple, but I may try to write one more post just about that topic.
Thanks for reading! If you ever feel like moving from WordPress to Jekyll, I hope these posts will help.