Importing WordPress content to Jekyll (moving from WordPress, part 4)


Part 1 - An overview and discussion about Static Site Generators
Part 2 - Setting up the infrastructure on Microsoft Azure
Part 3 - Day-to-day management and writing content

In Part 1 of this series, I talked about static site generators and their benefits. In Part 2, I talked about setting things up in Azure, and in Part 3, I discussed how I manage the blog day-to-day. This post, Part 4, finally gets into the fun details of how I moved the content from WordPress. There’s some C# code, some PowerShell, and some Ruby coming up!

Converting WordPress content for use in Jekyll

After some digging on my Windows PC, I found the code and some reminders of how I converted the content from WordPress to something I could use in Jekyll.

The high-level steps:

  1. Create a new Jekyll site
  2. Download the WordPress backup XML file and put it in the root of the new site
  3. Install the jekyll-import gem
  4. Write a bit of code that uses jekyll-import to pull in the backup from step 2
  5. Fix issues found in the imported files
  6. Profit?

Create a new Jekyll site

I’ve covered this in previous posts, but here it is again:

jekyll new mysite

That gives you a basic website with a basic config and a single placeholder post. If you run

bundle exec jekyll serve

you can open http://localhost:4000 in a browser and see what it looks like.

Grab the backup from your WordPress site

WordPress makes it really easy to grab backups of your content, including any media you’ve uploaded.

The WordPress export content page

While I chose the ‘Export All’ option, you can pick and choose which fields to export.

The WordPress export content page, expanded

When I chose the ‘Export All’ option, it took a few seconds and then I received an email with the download link:

The WordPress-generated email when I exported all of the content

Clicking that link downloaded a zip file containing a single XML file. I unzipped that file into the new Jekyll site I created and renamed it to “wordpress.xml” to make it easier to run my processing utilities.

I also downloaded the media library, which was compressed in a tar file. I un-tar’d the file and this is what it looked like. These folders could be copied directly into the assets folder within the new Jekyll site.

The WordPress media export page

Install the jekyll-import gem

The XML file is interesting:

The WordPress content export XML file

Because there’s no way I was going to write code to parse that XML file myself, I did a bit of searching and found the jekyll-import gem. This makes it incredibly easy to pull in all of the content from the site and convert it to a format that is (mostly) usable by Jekyll.

gem install jekyll-import

I then found this snippet of code that I dropped into a file I named import.rb in the root of my new Jekyll site:

require "jekyll-import"
JekyllImport::Importers::WordpressDotCom.run({
    "source" => "wordpress.xml",
    "no_fetch_images" => false,
    "assets_folder" => "assets"
})

I then ran

ruby .\import.rb

Depending on how much content you have, this could take a long time. I was only importing a couple hundred posts, so it didn’t take too long. It did download the images it could, but there were some errors with a few of them. No big deal because I also downloaded the image library and just unzipped those files into my assets folder.
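
To see which image references did not resolve after the import, a quick check along these lines can help. This is a minimal Ruby sketch and not part of the original process: the helper names `image_sources` and `missing_images` are hypothetical, and it assumes the imported posts reference images with site-relative `src` values (such as `/assets/...`) that map onto files under the site root.

```ruby
# Hypothetical helpers, not part of jekyll-import.

# Returns the src values of all <img> tags in an HTML string.
def image_sources(html)
  html.scan(/<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/i).flatten
end

# Returns the site-relative image paths with no matching file under site_root.
# Remote URLs (http://, https://, //) are skipped rather than checked.
def missing_images(html, site_root)
  image_sources(html)
    .reject { |src| src.match?(%r{\A(?:https?:)?//}i) }
    .reject { |src| File.exist?(File.join(site_root, src)) }
end
```

One way to use it would be to loop over the generated files (for example with `Dir.glob("_posts/*.html")`, assuming that is where the importer wrote them), call `missing_images(File.read(path), ".")`, and print anything it returns.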

Find and fix the problems

The typical front matter block for my blog posts looks like this - it’s short and only has the elements I care about.

---
layout: single
classes: wide
title: "Importing WordPress content to Jekyll (moving from WordPress, part 4)"
header:
date: 2024-07-20 07:00:00.000000000 -04:00
type: post
published: true
comments: true
categories: article
tags:
- leadership
- technology
- wordpress
- blogging
- dotnet
- ruby
- powershell
excerpt: The one where I talk about moving away from WordPress and show how I converted my WordPress content using some Ruby, some C#, and even a little Powershell.
---

Jekyll-import brought over every single bit of metadata from WordPress and included it in the front matter block - lots of attributes. Here’s an example from one of the posts:

---
layout: post
title: Interesting links of the week (2022-1)
date: 2022-01-03 08:00:00.000000000 -05:00
type: post
parent_id: '0'
published: true
password: ''
status: publish
categories:
- Business
- Communication
- General
- Programming
tags: []
meta:
  _last_editor_used_jetpack: block-editor
  _rest_api_client_id: '43452'
  _rest_api_published: '1'
  _publicize_job_id: '67239321633'
  timeline_notification: '1641215226'
  _publicize_done_external: a:1:{s:7:"twitter";a:1:{i:18590267;s:54:"https://twitter.com/mjeaton/status/1269963968925233152";}}
  _publicize_done_18776930: '1'
  _wpas_done_18590267: '1'
  publicize_twitter_user: mjeaton
  _coblocks_attr: ''
  _coblocks_dimensions: ''
  _coblocks_responsive_height: ''
  _coblocks_accordion_ie_support: ''
  spay_email: ''
  jetpack_anchor_podcast: ''
  jetpack_anchor_episode: ''
  jetpack_anchor_spotify_show: ''
  _wpas_is_tweetstorm: ''
  _wpas_feature_enabled: '1'
  publicize_linkedin_url: ''
  _publicize_done_24085812: '1'
  _wpas_done_27278088: '1'
author:
  login: mjeaton
  email: mjeaton@gmail.com
  display_name: mike
  first_name: Michael
  last_name: Eaton
permalink: "/2022/01/03/interesting-links-of-the-week-2020-23/"
excerpt: Here are some interesting articles and blog posts I’ve run into over the
  last week (December 29, 2021 - January 2, 2022).
---

Cleaning Up with C#

It contained lots of stuff I didn’t want, so I wrote some C# to strip the crap out. It’s not production-level robust, but it did the job. I made sure I created backups of each file I touched so I could rerun as needed while I was developing it.

The gist of it is that I iterated over all the files generated by the import utility, created a temp file based on the filename, and read the “real” file line-by-line, skipping the things I didn’t want in the final output. I stripped out what I called “WordPress cruft” from the content, too.

Here’s the whole thing:

var startDir = args[0];

Console.WriteLine($"Starting in {startDir}");

var backupFolder = $"backup/{DateTime.Now.Ticks}";
if(!Directory.Exists(backupFolder))
{
  Directory.CreateDirectory(backupFolder);
}

string [] fileEntries = Directory.GetFiles(startDir);
foreach(string fileName in fileEntries)
{
  processFile(fileName, backupFolder);
}

void processFile(string fileName, string backupFolder) 
{
  Console.WriteLine(fileName);
  // make a backup of the original file
  backupFile(fileName, backupFolder);

  // open temp file for writing
  var tempFile = $"{fileName}.tmp";
  var sw = new StreamWriter(tempFile);

  var lines = File.ReadAllLines(fileName);
  foreach(var line in lines) 
  {
    var lineToProcess = line;
    var skip = hasHeaderField(lineToProcess) || hasMetaField(lineToProcess) || hasAuthorField(lineToProcess);
    if(!skip)
    {
      if(!hasWordpressCruft(lineToProcess))
      {
        if(lineToProcess.StartsWith("categories:"))
        {
          lineToProcess = "categories: links";
        }
        sw.WriteLine(lineToProcess);
      }
    }
  }

  sw.Close();

  File.Delete(fileName);
  File.Move(tempFile, fileName);
}

bool hasWordpressCruft(string line)
{
  var x = line.Trim();
  return 
    x.StartsWith("<p><!-- wp:paragraph --></p>") || 
    x.StartsWith("<p><!-- /wp:paragraph --></p>") ||
    x.StartsWith("<p><!-- wp:heading --></p>") || 
    x.StartsWith("<p><!-- /wp:heading --></p>") ||
    x.StartsWith("<p><!-- wp:list --></p>") ||
    x.StartsWith("<p><!-- /wp:list --></p>");
}

bool hasHeaderField(string line)
{
  var x = line.Trim();
  return
    x.StartsWith("parent_id:") || 
    x.StartsWith("password:") || 
    x.StartsWith("status:") ||
    x.StartsWith("meta:") ||
    x.StartsWith("author:") ||
    x.StartsWith("spay_email:") ||
    x.StartsWith("permalink:") ||
    x.StartsWith("- Business") ||
    x.StartsWith("- Communication") ||
    x.StartsWith("- General") ||
    x.StartsWith("- Programming") ||
    x.StartsWith("- Life") ||
    x.StartsWith("- Books") ||
    x.StartsWith("- random") ||
    x.StartsWith("- Bourbon");
}

bool hasMetaField(string line)
{
  var x = line.Trim();
  return
    x.StartsWith("_thumbnail_id") || 
    x.StartsWith("_last_editor_used_jetpack") || 
    x.StartsWith("_coblocks") ||
    x.StartsWith("jetpack") ||
    x.StartsWith("_wpas") ||
    x.StartsWith("publicize_") ||
    x.StartsWith("_publicize_") ||
    x.StartsWith("timeline_") ||
    x.StartsWith("_headstart") ||
    x.StartsWith("sharing_") ||
    x.StartsWith("switch_") ||
    x.StartsWith("geo_public") ||
    x.StartsWith("_edit_last:") ||
    x.StartsWith("_rest_api_");
}

bool hasAuthorField(string line)
{
  var x = line.Trim();
  return
    x.StartsWith("login") || 
    x.StartsWith("email") || 
    x.StartsWith("display_name") ||
    x.StartsWith("first_name") ||
    x.StartsWith("last_name");
}

void backupFile(string fileName, string backupFolder) 
{
  var backupFile = Path.Join(backupFolder, Path.GetFileName(fileName));
  File.Copy(fileName, backupFile, true);
}
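
The C# above matches line prefixes, so it will also drop any body line that happens to start with something like “email” or “login”. A different technique is to parse the front matter as YAML and keep only an allow-list of keys. The following is a rough Ruby sketch of that alternative, not what I actually ran: `KEEP_KEYS`, `split_front_matter`, and `clean_post` are hypothetical names, and the key list is an assumption based on the front matter shown earlier.

```ruby
require "yaml"
require "date"

# Hypothetical allow-list of front matter keys to keep.
KEEP_KEYS = %w[layout title date type published categories tags excerpt].freeze

# Splits a post into [front matter hash, body]. Returns [nil, text] when the
# text does not start with a front matter block that parses to a hash.
def split_front_matter(text)
  match = text.match(/\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*\r?\n(.*)\z/m)
  return [nil, text] unless match

  data = YAML.safe_load(match[1], permitted_classes: [Date, Time])
  return [nil, text] unless data.is_a?(Hash)

  [data, match[2]]
end

# Rebuilds the post with only the allow-listed keys; the body is untouched.
def clean_post(text)
  data, body = split_front_matter(text)
  return text if data.nil?

  kept = data.select { |key, _| KEEP_KEYS.include?(key) }
  YAML.dump(kept) + "---\n" + body
end
```

Because the front matter is re-serialized, quoting and line wrapping may differ from what the importer wrote. Unlike the C# version, this sketch leaves the categories as they are and does not strip the WordPress comment lines from the body.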

Additional cleanup

At some point, I realized that I didn’t want every post to use the ‘post’ layout; I wanted them to use the ‘single’ layout instead. I don’t recall WHY I used PowerShell. Maybe I felt like punishing myself, but this is what I came up with to make that particular change.

This grabs all HTML files recursively, replaces that particular string, and saves each file.

$posts = Get-ChildItem -Path . -Filter *.html -Recurse
foreach ($file in $posts)
{
    (Get-Content $file.PSPath) |
    Foreach-Object { $_ -replace "layout: post", "layout: single" } |
    Set-Content $file.PSPath
}
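
One thing to be aware of: the replace above touches every occurrence of “layout: post” in a file, including any in the body. If that matters for your content, the change can be limited to the front matter. Here is a minimal Ruby sketch of that idea; `switch_layout` is a hypothetical helper, not something I used for the migration.

```ruby
# Hypothetical helper: changes the layout only when it appears on its own
# line inside the front matter block at the top of the file.
def switch_layout(text, from: "post", to: "single")
  match = text.match(/\A(---[ \t]*\r?\n.*?\r?\n---[ \t]*\r?\n)(.*)\z/m)
  return text unless match

  front = match[1].sub(/^layout:[ \t]*#{Regexp.escape(from)}[ \t]*$/, "layout: #{to}")
  front + match[2]
end

# Possible usage, assuming the posts are .html files under the current folder:
#   Dir.glob("**/*.html").each do |path|
#     File.write(path, switch_layout(File.read(path)))
#   end
```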

Final Thoughts

That’s it! All the other changes I made related to the site layout, not the content.

You’ll notice that jekyll-import generated HTML files because that’s how the content was created in WordPress. Much of that content remains HTML, but all of my current posts are written in Markdown because Markdown is sooooooo much better than HTML for content creation. I have considered going back and converting all of them, but it’s a manual process, and there’s very little benefit in touching 3- or 4-year-old files.

During this whole process, I was watching the site come together in my browser, making tweaks to layouts, menus, and categories. Jekyll makes those changes simple, but I may try to write one more post just about that topic.

Thanks for reading! If you ever feel like moving from WordPress to Jekyll, I hope these posts will help.


A seal indicating this page was written by a human
