Beta Acid

Mastering the Art of WordPress to CMS Migration

Guilherme Mierzwa

November 15, 2023 • 27 min read

WordPress has long been a popular platform for creating and managing content-rich websites. However, as the digital landscape evolves, many users are seeking more modern and flexible alternatives. Other content management systems (“CMS”), such as Contentful, Hygraph, and Sanity, provide a more robust solution for managing content and delivering it across various channels.

While these new technologies offer many benefits, migrating from WordPress to another CMS can be a daunting task. This article is a step-by-step guide to migrating WordPress posts to another CMS, covering the extraction, translation, component creation, and upload phases. The code snippets use JavaScript (Node.js) and several libraries, with Contentful as the target CMS.

Step 1 - Extraction

The first step is to extract data from WordPress posts using the WordPress JSON API, which returns results in pages. The script below uses axios to fetch the data and extract the text-based content. Images and other attachments are handled in a later step, when we create the components associated with them. Make sure to handle both authentication and pagination so that all relevant data is captured.

const axios = require('axios');

// These values can be found by inspecting the cookies on your WordPress site. Keep in mind that the script will only be able to access information your account has permissions for.
const COOKIE_NAME = '';
const COOKIE_VALUE = '';
const RECORDS_PER_PAGE = 100;

const wordpressClient = axios.create({
  baseURL: 'https://www.wordpress-site.com/wp-json/wp/v2/',
  headers: {
    Cookie: `${COOKIE_NAME}=${COOKIE_VALUE};`,
  },
});

async function runPaginatedQuery(endpoint, currentPage = 1) {
  const { data, headers } = await wordpressClient.get(`${endpoint}?per_page=${RECORDS_PER_PAGE}&page=${currentPage}`);
  // Header values arrive as strings, so convert the page count to a number
  const totalPages = Number(headers['x-wp-totalpages']);

  return {
    data,
    totalPages,
  };
}

// You might want to migrate data from other endpoints too, in that case simply change this value or add a loop here
const ENDPOINT = 'posts';

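// Note: the top-level `await` below needs an ES module or an enclosing async function
let extractedData = [];
let currentPage = 1;
let totalPages = 1;
while (currentPage <= totalPages) {
  // runPaginatedQuery is the standalone function defined above, not a method of the axios client
  const result = await runPaginatedQuery(ENDPOINT, currentPage);
  totalPages = result.totalPages;

  extractedData = extractedData.concat(result.data);
  currentPage++;
}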

At the end of this script, all posts will be available in the extractedData variable. You might want to save this data to a file, so you can review and modify it, or simply use the data directly in the next step.
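
As a minimal sketch of the "save this data to a file" option, the extracted posts could be written to a JSON file and read back before the translation step. The helper names saveExtractedData and loadExtractedData, the file name, and the sample post are assumptions made for illustration; they are not part of the original script.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write the extracted posts to a pretty-printed JSON file.
function saveExtractedData(filePath, data) {
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8');
}

// Hypothetical helper: read the posts back for the translation step.
function loadExtractedData(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Usage with placeholder data shaped like a WordPress post.
const samplePosts = [{ id: 1, slug: 'hello-world', title: { rendered: 'Hello world' } }];
const outputFile = path.join(os.tmpdir(), 'wordpress-posts.json');
saveExtractedData(outputFile, samplePosts);
console.log(loadExtractedData(outputFile).length); // number of posts read back
```

Keeping the file pretty-printed makes it easier to review and hand-edit individual posts before they are translated.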

Step 2 - Translation

Once the data is extracted, the next step is translation. The script demonstrates how to convert WordPress-structured data into a format compatible with the target CMS. Depending on the CMS being used, we might also need to convert from HTML to a different format. In our case, since we were using Contentful's Rich Text, the simplest path was to convert to Markdown first and then to Rich Text, using a library provided by Contentful. This step may vary significantly depending on the CMS and the specific component you're using. If your CMS has an HTML component, you might only need to convert certain tags or URLs, or add new elements.

To handle the HTML to Markdown conversion we'll be using the node-html-markdown library. If this conversion isn't necessary, you can skip the code below and jump to the data mapping.

const { NodeHtmlMarkdown } = require('node-html-markdown');

const CUSTOM_COMPONENT_STRINGS = {
  iframe: 'CUSTOM_IFRAME',
};

const nhm = new NodeHtmlMarkdown(
  {
    keepDataImages: true,
  },
  {
    cite: {
      prefix: '- ',
    },
    li: ({ options: { bulletMarker }, indentLevel, listKind }) => {
      const indentationLevel = +(indentLevel || 0);
      return {
        prefix:
          '   '.repeat(indentationLevel) + `${bulletMarker} `,
        surroundingNewlines: 2,
        postprocess: ({ content }) => {
          return !/\S/.test(content)
            ? 1
            : content
              .trim()
              .replace(/([^\r\n])(?:\r?\n)+/g, `$1  \n${'   '.repeat(indentationLevel)}`)
              .replace(/(\S+?)[^\S\r\n]+$/gm, '$1  ');
        },
      };
    },
    div: ({ node }) => {
      // getAttribute returns undefined when the attribute is missing, so default to an empty string
      const style = node.getAttribute('style') || '';
      const cssClass = node.getAttribute('class') || '';
      let customComponent;
      // 'some-property' and 'some-class' are placeholders; replace them with your own markers
      if (style.includes('some-property') || cssClass.split(/\s+/).includes('some-class')) {
        // You might want to replace parts of the document with custom components
        // One way to identify them is using the style or class attributes
        customComponent = 'CUSTOM_COMPONENT';
      }
      if (customComponent) {
        // Setting the boundaries of the custom component as images allows us to easily identify them later
        return {
          prefix: `![${customComponent}](start)`,
          postfix: `\n![${customComponent}](finish)`,
        };
      }
      // Not a custom component: return an empty config rather than undefined
      return {};
    },
    iframe: ({ node }) => {
      const src = node.getAttribute('src') || '';
      if (!src) {
        return {
          ignore: true,
        };
      }
      return {
        content: `![${CUSTOM_COMPONENT_STRINGS.iframe}](${src})`,
      };
    },
  },
);
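
To illustrate why encoding custom components as Markdown images is convenient, here is a minimal sketch of how such markers could be located in the converted Markdown. The helper findCustomComponentMarkers, the regular expression, and the sample URLs are assumptions for illustration; in the article's flow the markers are actually picked up later, during the Rich Text conversion.

```javascript
// Matches Markdown images of the form ![alt](url)
const IMAGE_PATTERN = /!\[([^\]]*)\]\(([^)\s]*)\)/g;

// Hypothetical helper: list the image-style markers whose alt text is a known component name.
function findCustomComponentMarkers(markdown, componentNames) {
  const markers = [];
  for (const [, alt, url] of markdown.matchAll(IMAGE_PATTERN)) {
    if (componentNames.includes(alt)) {
      markers.push({ component: alt, url });
    }
  }
  return markers;
}

const sampleMarkdown = [
  'Intro paragraph.',
  '![CUSTOM_IFRAME](https://example.com/embed/123)',
  '![A regular image](https://example.com/photo.png)',
].join('\n');

// Only the iframe marker is reported; the regular image is left alone.
console.log(findCustomComponentMarkers(sampleMarkdown, ['CUSTOM_IFRAME']));
```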

Next, we map the data from WordPress to Contentful. This step will vary depending on the CMS you're using, but the general idea is to populate every field that the CMS API call requires.

const htmlEntities = require('html-entities');
const lodash = require('lodash');

function getNewAnchor(title) {
  const id = title
    .replace(/<[^>]*>?/gm, '')
    .replace(/[-\s]/gi, '-')
    .replace(/[^\w-]/gi, '')
    // Collapse any run of consecutive dashes into a single dash
    .replace(/-+/g, '-')
    .toLowerCase();
  return `#${id}`;
}

const translatedPosts = [];
await Promise.all(
  // extractedData is the array of posts collected in Step 1
  lodash.map(extractedData, async (post) => {
    if (post.content.rendered === '') {
      // Skip posts that have no content
      return;
    }

    const contentfulPost = {};
    contentfulPost.id = post.id;
    contentfulPost.articleName = htmlEntities.decode(post.title.rendered);
    post.content.rendered = htmlEntities.decode(post.content.rendered);
    post.excerpt.rendered = htmlEntities.decode(post.excerpt.rendered);
    // The nhm.translate calls here might get replaced by a different function, depending on the CMS you're using
    contentfulPost.body = nhm.translate(post.content.rendered);
    contentfulPost.headerSummary = nhm.translate(post.excerpt.rendered);
    contentfulPost.slug = `/articles/${post.slug}`;
    const postAnchors = {};
    // Anchor logic might also be different depending on the CMS
    const allMatches = [
      post.content.rendered.matchAll(/<a name="(.*?)"[\s\S]*?>(.*?)<\/h.>/gi),
      post.content.rendered.matchAll(/<h. id="(.*?)"[\s\S]*?>(.*?)<\/h.>/gi),
    ];
    lodash.forEach(allMatches, (matches) => {
      for (const match of matches) {
        const [_, id, title] = match;
        postAnchors[id] = getNewAnchor(title);
      }
    });
    contentfulPost.anchors = postAnchors;
    translatedPosts.push({
      originalPost: post,
      translatedPost: contentfulPost,
    });
  }),
);

Just like in the previous step, you might want to save the data to a file, so you can review and modify it, or simply use the data directly in the next step. To keep things simple we'll continue using the translatedPosts variable directly in the next step.
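
The anchors object built above maps each original anchor ID to its new anchor, but the article does not show where that map is consumed. One possible use, sketched here under that assumption, is rewriting in-page links in the Markdown body. The helper rewriteAnchorLinks and the sample data are hypothetical.

```javascript
// Hypothetical helper: replace in-page link targets such as (#old-id) with their new anchors.
function rewriteAnchorLinks(markdown, anchors) {
  return markdown.replace(/\]\(#([^)\s]+)\)/g, (match, oldId) => {
    const newAnchor = Object.prototype.hasOwnProperty.call(anchors, oldId) ? anchors[oldId] : undefined;
    // Leave the link untouched when no mapping exists for it.
    return newAnchor ? `](${newAnchor})` : match;
  });
}

// Sample map in the same shape as contentfulPost.anchors: { oldId: '#new-anchor' }
const sampleAnchors = { section_1: '#getting-started' };
const sampleBody = 'Jump to [the setup](#section_1) or [the end](#unknown).';

console.log(rewriteAnchorLinks(sampleBody, sampleAnchors));
```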

Step 3 - Component Creation

Modern CMSs often support components for modular content management. This step focuses on creating the components required for posts, such as authors, assets, and image assets. The code snippets show how to instantiate a Contentful client and map authors and assets.

const contentfulManagement = require('contentful-management');
const crypto = require('crypto');

const AUTHOR_CONTENT_TYPE = 'author';
const REVIEWER_CONTENT_TYPE = 'reviewer';
const IMAGE_ASSET_TYPE = 'imageAsset';
const MIGRATION_TAG = {
  sys: {
    type: 'Link',
    linkType: 'Tag',
    id: 'privateWordpressMigration',
  },
};

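// authToken, SPACE_ID and ENVIRONMENT_ID are not defined in this article; supply the values for your own Contentful space
const contentfulClient = contentfulManagement.createClient({
  accessToken: authToken,
});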
const contentfulSpace = await contentfulClient.getSpace(SPACE_ID);
const contentfulEnvironment = await contentfulSpace.getEnvironment(ENVIRONMENT_ID);
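
The snippet above relies on authToken, SPACE_ID and ENVIRONMENT_ID being defined elsewhere. One possible way to supply them, sketched here, is to read them from environment variables and fail early when one is missing. The variable names and the readContentfulConfig helper are assumptions for illustration, not something the original script prescribes.

```javascript
// Hypothetical environment variable names; rename them to match your setup.
const REQUIRED_VARIABLES = ['CONTENTFUL_ACCESS_TOKEN', 'CONTENTFUL_SPACE_ID', 'CONTENTFUL_ENVIRONMENT_ID'];

// Hypothetical helper: collect the Contentful settings, throwing if any are missing.
function readContentfulConfig(env = process.env) {
  const missing = REQUIRED_VARIABLES.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    authToken: env.CONTENTFUL_ACCESS_TOKEN,
    SPACE_ID: env.CONTENTFUL_SPACE_ID,
    ENVIRONMENT_ID: env.CONTENTFUL_ENVIRONMENT_ID,
  };
}

// Usage with an explicit object instead of process.env, to keep the example self-contained.
const config = readContentfulConfig({
  CONTENTFUL_ACCESS_TOKEN: 'token-placeholder',
  CONTENTFUL_SPACE_ID: 'space-placeholder',
  CONTENTFUL_ENVIRONMENT_ID: 'master',
});
console.log(config.ENVIRONMENT_ID);
```

Keeping the token out of the source file also makes it safer to commit the migration script to version control.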

With the Contentful environment in hand, as well as the wordpressClient we defined earlier, we can now start creating the components. The following code shows how we handled author creation:

async function listContentfulEntries(entryTypes) {
  let skip = 0;
  let total = 1;
  let entries = [];
  while (entries.length < total) {
    const result = await contentfulEnvironment.getEntries({
      'sys.contentType.sys.id[in]': entryTypes,
      limit: RECORDS_PER_PAGE,
      skip,
    });
    entries = entries.concat(result.items);
    total = result.total;
    skip += RECORDS_PER_PAGE;
  }
  return entries;
}

async function mapAuthors(translatedPosts) {
  const wordpressAuthors = [];
  const authorIdsObject = {};
  lodash.forEach(translatedPosts, ({ originalPost }) => {
    if (originalPost.author) {
      authorIdsObject[originalPost.author] = true;
    }
  });

  const authorIds = lodash.keys(authorIdsObject);
  while (authorIds.length > 0) {
    const currentAuthorIds = authorIds.splice(0, RECORDS_PER_PAGE);
    const { data: currentAuthors } = await wordpressClient.get(
      `users?include=${currentAuthorIds}&per_page=${currentAuthorIds.length}`,
    );
    currentAuthors.forEach((author) => {
      if (!wordpressAuthors.find((wordpressAuthor) => wordpressAuthor.id === author.id)) {
        wordpressAuthors.push(author);
      }
    });
  }

  const contentfulAuthors = await listContentfulEntries(`${AUTHOR_CONTENT_TYPE},${REVIEWER_CONTENT_TYPE}`);
  return Promise.all(
    lodash.map(wordpressAuthors, async (wordpressAuthor) => {
      const authorName = wordpressAuthor.name;
      let contentfulAuthor = contentfulAuthors.find((author) => author.fields?.displayName?.['en-US'] === authorName);
      // Split on the first space; names without a space are treated as a first name only
      const spaceIndex = authorName.indexOf(' ');
      const firstName = spaceIndex === -1 ? authorName : authorName.substring(0, spaceIndex);
      const lastName = spaceIndex === -1 ? '' : authorName.substring(spaceIndex + 1);
      if (!contentfulAuthor) {
        contentfulAuthor = await contentfulEnvironment.createEntryWithId(AUTHOR_CONTENT_TYPE, wordpressAuthor.slug, {
          metadata: {
            // It's a good idea to tag all migrated content, so it can be easily overwritten by the script or reviewed later
            tags: [MIGRATION_TAG],
          },
          fields: {
            displayName: {
              'en-US': authorName,
            },
            firstName: {
              'en-US': firstName,
            },
            lastName: {
              'en-US': lastName,
            },
          },
        });
      }

      return {
        contentfulAuthor,
        wordpressAuthor,
      };
    }),
  );
}
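
The batching inside mapAuthors can be looked at in isolation. The sketch below builds one users query per batch of author IDs, mirroring the include and per_page parameters used above. The helper buildUserQueries is hypothetical and, unlike the splice-based loop, it leaves the input array untouched.

```javascript
// Hypothetical helper: build one `users` query string per batch of author IDs.
function buildUserQueries(authorIds, batchSize) {
  const queries = [];
  for (let start = 0; start < authorIds.length; start += batchSize) {
    const batch = authorIds.slice(start, start + batchSize);
    queries.push(`users?include=${batch.join(',')}&per_page=${batch.length}`);
  }
  return queries;
}

// Three author IDs with a batch size of two produce two requests.
console.log(buildUserQueries([3, 7, 12], 2));
```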

The other components we'll be creating are assets, a native element that specializes in files and media, and image assets, a custom component used as a wrapper for assets, allowing for extra properties and better reuse. The following code shows how we handled this mapping:

async function listContentfulAssets() {
  let skip = 0;
  let total = 1;
  let assets = [];
  while (assets.length < total) {
    const result = await contentfulEnvironment.getAssets({
      limit: RECORDS_PER_PAGE,
      skip,
    });
    assets = assets.concat(result.items);
    total = result.total;
    skip += RECORDS_PER_PAGE;
  }
  return assets;
}

function getFileName(media) {
  let filename;
  if (media.media_details?.file) {
    filename = media.media_details.file;
  } else {
    filename = media.source_url;
  }
  return crypto.createHash('sha1').update(filename).digest('hex');
}

async function mapAsset(media, contentfulAssets, contentfulImageAssets, downloadedAssets) {
  const fileName = getFileName(media);
  const title = htmlEntities.decode(media.title?.rendered || fileName);
  let contentfulAsset = contentfulAssets.find((asset) => asset.fields?.file?.['en-US']?.fileName === fileName);
  let fileBuffer;
  // Unescape every "\/" sequence, not just the first one, before requesting the file
  const url = encodeURI(media.source_url.replace(/\\\//g, '/'));
  try {
    fileBuffer = await wordpressClient({ url, responseType: 'arraybuffer' }).then((response) =>
      Buffer.from(response.data, 'binary'),
    );
  } catch (error) {
    console.error(`Failed to download Wordpress asset ${url}.`, error.message);
    return;
  }
  if (!contentfulAsset) {
    contentfulAsset = downloadedAssets.find((assetData) => fileBuffer.equals(assetData.fileBuffer))?.contentfulAsset;
  }
  if (!contentfulAsset) {
    const asset = await contentfulEnvironment.createAssetFromFiles({
      metadata: {
        tags: [MIGRATION_TAG],
      },
      fields: {
        title: {
          'en-US': title,
        },
        file: {
          'en-US': {
            contentType: media.mime_type.replace('\\/', '/'),
            fileName,
            // createAssetFromFiles expects the file contents under the `file` key
            file: fileBuffer,
          },
        },
      },
    });
    contentfulAsset = await asset.processForAllLocales();
  }
  const contentfulImageAsset = await contentfulEnvironment.createEntry(IMAGE_ASSET_TYPE, {
    metadata: {
      tags: [MIGRATION_TAG],
    },
    fields: {
      asset: {
        'en-US': {
          sys: {
            type: 'Link',
            linkType: 'Asset',
            id: contentfulAsset.sys.id,
          },
        },
      },
    },
  });
  downloadedAssets.push({
    fileBuffer,
    contentfulAsset,
  });
  return {
    contentfulAsset,
    contentfulImageAsset,
    wordpressAsset: media,
  };
}

async function mapAssets(translatedPosts) {
  let wordpressMedia = [];
  await Promise.all(
    lodash.map(translatedPosts, async ({ originalPost }) => {
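      // WordPress returns 10 records per page by default, so request the maximum page size
      const { data: attachments } = await wordpressClient.get(
        `media?parent=${originalPost.id}&per_page=${RECORDS_PER_PAGE}`,
      );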
      wordpressMedia = wordpressMedia.concat(attachments);
    }),
  );

  const [contentfulAssets, contentfulImageAssets] = await Promise.all([
    listContentfulAssets(),
    listContentfulEntries(IMAGE_ASSET_TYPE),
  ]);
  const downloadedAssets = [];
  let mappedAssets = await Promise.all(
    wordpressMedia.map(async (media) => {
      return mapAsset(media, contentfulAssets, contentfulImageAssets, downloadedAssets);
    }),
  );
  mappedAssets = lodash.filter(mappedAssets, (mappedAsset) => !!mappedAsset);

  return { mappedAssets, contentfulAssets, contentfulImageAssets, downloadedAssets };
}
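
mapAsset detects duplicate downloads by comparing each new buffer against every buffer downloaded so far. As an alternative sketch (not the approach used in the code above), duplicates could be looked up by a content hash instead, which avoids the linear scan. The DownloadedAssetIndex class and the sample asset object are assumptions for illustration.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: fingerprint a file by its contents.
function hashBuffer(fileBuffer) {
  return createHash('sha1').update(fileBuffer).digest('hex');
}

// Hypothetical index that keeps the first asset seen for each distinct file content.
class DownloadedAssetIndex {
  constructor() {
    this.assetsByHash = new Map();
  }

  // Returns the stored asset for identical content, or undefined when none exists.
  find(fileBuffer) {
    return this.assetsByHash.get(hashBuffer(fileBuffer));
  }

  add(fileBuffer, contentfulAsset) {
    const hash = hashBuffer(fileBuffer);
    if (!this.assetsByHash.has(hash)) {
      this.assetsByHash.set(hash, contentfulAsset);
    }
  }
}

const index = new DownloadedAssetIndex();
index.add(Buffer.from('same bytes'), { sys: { id: 'asset-1' } });
console.log(index.find(Buffer.from('same bytes')).sys.id);
console.log(index.find(Buffer.from('other bytes')));
```

A hash index also avoids keeping every downloaded buffer in memory, at the cost of hashing each file once.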

Keep in mind that the WordPress media query doesn't return all images that are used in the posts, only the ones that are directly attached to them. So in order to migrate all images, we'll need to parse the HTML and extract the URLs from the src attribute of the img tags. In our example, this is handled during the Markdown conversion. This is also the moment we'll identify and convert any custom components we might have in the posts.

const { richTextFromMarkdown } = require('@contentful/rich-text-from-markdown');
const mime = require('mime');

const IFRAME_TYPE = 'componentIframe';

async function listWordpressMedia() {
  let wordpressMedia = [];
  let currentPage = 1;
  let totalPages = 1;
  while (currentPage <= totalPages) {
    const { data, headers } = await wordpressClient.get(`media?per_page=${RECORDS_PER_PAGE}&page=${currentPage}`);
    // Header values arrive as strings, so convert the page count to a number
    totalPages = Number(headers['x-wp-totalpages']);
    wordpressMedia = wordpressMedia.concat(data);
    currentPage++;
  }
  return wordpressMedia;
}

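function findMediaBySourceUrl(media, url) {
  return (
    media.source_url === url ||
    // media_details can be missing, and every "\/" sequence should be unescaped, not only the first
    lodash.find(media.media_details?.sizes, (mediaDetail) => mediaDetail.source_url.replace(/\\\//g, '/') === url)
  );
}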

async function getOrCreateAsset(
  name,
  url,
  mappedAssets,
  contentfulAssets,
  contentfulImageAssets,
  downloadedAssets,
  wordpressMedia,
) {
  let mappedAsset = mappedAssets.find(({ wordpressAsset }) => findMediaBySourceUrl(wordpressAsset, url));
  if (!mappedAsset) {
    let media = lodash.find(wordpressMedia, (media) => findMediaBySourceUrl(media, url));
    if (!media) {
      let fileExtension = url.substring(url.lastIndexOf('.') + 1);
      if (fileExtension.indexOf('?') > -1) {
        fileExtension = fileExtension.substring(0, fileExtension.indexOf('?'));
      }
      media = {
        title: {
          rendered: name || getFileName({ source_url: url }),
        },
        source_url: url,
        media_details: {
          file: url,
        },
        mime_type: mime.getType(fileExtension),
      };
    }
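    mappedAsset = await mapAsset(media, contentfulAssets, contentfulImageAssets, downloadedAssets);
    if (mappedAsset) {
      // mapAsset returns undefined when the download fails; caching that would break later lookups
      mappedAssets.push(mappedAsset);
    }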
  }
  return mappedAsset;
}

function getContentfulId(string) {
  return crypto.createHash('sha1').update(string).digest('hex');
}

async function getOrCreateContentfulComponent(type, id, fields, tags) {
  id = getContentfulId(id);
  try {
    const component = await contentfulEnvironment.getEntry(id);
    return component;
  } catch (error) {
    return contentfulEnvironment.createEntryWithId(type, id, {
      metadata: {
        tags: tags || [MIGRATION_TAG],
      },
      fields,
    });
  }
}

async function markdownToRichText(
  markdownString,
  translatedPosts,
  translatedPost,
  mappedAssets,
  contentfulAssets,
  contentfulImageAssets,
  downloadedAssets,
  wordpressMedia,
) {
  return richTextFromMarkdown(markdownString, async (node) => {
    if (node.type === 'linkReference') {
      return {
        nodeType: 'linkReference',
        content: [
          {
            nodeType: 'text',
            value: node.label,
            marks: [],
            data: {},
          },
        ],
      };
    } else if (node.type === 'image') {
      if (node.alt === CUSTOM_COMPONENT_STRINGS.iframe) {
        // Custom component: iframe
        const iframeComponent = await getOrCreateContentfulComponent(IFRAME_TYPE, node.url, {
          url: {
            'en-US': node.url,
          },
        });
        return {
          nodeType: 'embedded-entry-block',
          content: [],
          data: {
            target: {
              sys: {
                id: iframeComponent.sys.id,
                type: 'Link',
                linkType: 'Entry',
              },
            },
          },
        };
      } else if (lodash.values(CUSTOM_COMPONENT_STRINGS).indexOf(node.alt) > -1) {
        return {
          nodeType: node.alt,
          value: node.url,
        };
      }
      const mappedAsset = await getOrCreateAsset(
        node.alt,
        node.url,
        mappedAssets,
        contentfulAssets,
        contentfulImageAssets,
        downloadedAssets,
        wordpressMedia,
      );
      if (mappedAsset) {
        return {
          nodeType: 'embedded-entry-block',
          content: [],
          data: {
            target: {
              sys: {
                id: mappedAsset.contentfulImageAsset.sys.id,
                type: 'Link',
                linkType: 'Entry',
              },
            },
          },
        };
      }
    }
  });
}
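
The iframe branch and the image branch above return the same embedded-entry-block structure, differing only in the entry ID. As a small sketch of how that shape could be factored out, a helper can build the node from an ID. The name embeddedEntryBlock is an assumption for illustration; the node layout is copied from the code above.

```javascript
// Hypothetical helper: build a Rich Text node that embeds the entry with the given ID.
function embeddedEntryBlock(entryId) {
  return {
    nodeType: 'embedded-entry-block',
    content: [],
    data: {
      target: {
        sys: { id: entryId, type: 'Link', linkType: 'Entry' },
      },
    },
  };
}

// The ID here is a placeholder; in the migration it would be iframeComponent.sys.id
// or mappedAsset.contentfulImageAsset.sys.id.
console.log(JSON.stringify(embeddedEntryBlock('abc123'), null, 2));
```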

Finally, to tie together all the code we've shown in this step, we can call the functions defined previously and map the data as shown below:

async function addMappedComponentsToPost(
  originalPost,
  translatedPosts,
  translatedPost,
  mappedAuthors,
  mappedAssets,
  contentfulAssets,
  contentfulImageAssets,
  downloadedAssets,
  wordpressMedia,
) {
  translatedPost.body = await markdownToRichText(
    translatedPost.body,
    translatedPosts,
    translatedPost,
    mappedAssets,
    contentfulAssets,
    contentfulImageAssets,
    downloadedAssets,
    wordpressMedia,
  );
  const wpAuthors = `${originalPost.author}`.split(',');
  translatedPost.authors = mappedAuthors
    .filter(({ wordpressAuthor }) => wpAuthors.includes(`${wordpressAuthor.id}`))
    .map(({ contentfulAuthor }) => contentfulAuthor);

  return translatedPost;
}

// This process can be slow depending on the number of posts and assets, so we should parallelize as much as possible.
const [mappedAuthors, { mappedAssets, contentfulAssets, contentfulImageAssets, downloadedAssets }, wordpressMedia] =
  await Promise.all([mapAuthors(translatedPosts), mapAssets(translatedPosts), listWordpressMedia()]);

const translatedPostList = translatedPosts.map(({ translatedPost }) => translatedPost);
const postsWithComponents = await Promise.all(
  lodash.map(translatedPosts, async ({ originalPost, translatedPost }) => {
    return addMappedComponentsToPost(
      originalPost,
      translatedPostList,
      translatedPost,
      mappedAuthors,
      mappedAssets,
      contentfulAssets,
      contentfulImageAssets,
      downloadedAssets,
      wordpressMedia,
    );
  }),
);

At the end of this step we have the new posts properly structured and with all components needed, under the postsWithComponents variable, which we'll be using next.
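
Running every post through Promise.all at once maximizes parallelism, but a large migration may run into API rate limits. As a hedged sketch of one way to cap the number of concurrent calls, the helper below processes items with a fixed number of workers. mapWithConcurrency is a hypothetical helper, not part of the original script, and the right limit depends on your CMS plan.

```javascript
// Hypothetical helper: like Promise.all(items.map(fn)), but with at most `limit` calls in flight.
// `limit` is expected to be at least 1.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await fn(items[index], index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage with a stand-in task; in the migration, fn would wrap addMappedComponentsToPost.
mapWithConcurrency([1, 2, 3, 4], 2, async (value) => value * 10).then((results) => {
  console.log(results);
});
```

Results are stored by index, so the output order matches the input order even when tasks finish out of order.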

Step 4 - Upload

The final step involves uploading or updating posts in the target CMS. The script provided demonstrates how to upload articles, considering content types, slugs, and associated components. Tags are used for easy identification of migrated content.

const ARTICLE_CONTENT_TYPE = 'article';

function getArticleTags(post) {
  const tags = [MIGRATION_TAG];
  lodash.forEach(post.tags, (tag) => {
    tags.push({
      sys: {
        type: 'Link',
        linkType: 'Tag',
        id: tag,
      },
    });
  });
  return tags;
}

function getArticleFields(post) {
  return {
    slug: {
      'en-US': post.slug,
    },
    articleName: {
      'en-US': post.articleName,
    },
    body: {
      'en-US': post.body,
    },
    headerSummary: {
      'en-US': post.headerSummary,
    },
    authors: {
      'en-US': post.authors.map((author) => {
        return {
          sys: {
            type: 'Link',
            linkType: 'Entry',
            id: author.sys.id,
          },
        };
      }),
    },
  };
}

async function upload() {
  // Reuses the contentfulEnvironment created in Step 3, so no extra client setup is needed here
  const contentfulArticles = await listContentfulEntries(ARTICLE_CONTENT_TYPE);

  await Promise.all(
    lodash.map(postsWithComponents, async (post) => {
      let contentfulArticle = contentfulArticles.find((article) => article.fields.slug?.['en-US'] === post.slug);
      if (contentfulArticle) {
        contentfulArticle.metadata.tags = getArticleTags(post);
        contentfulArticle.fields = getArticleFields(post);
        await contentfulArticle.update();
      } else {
        // Contentful entry IDs are strings, while WordPress post IDs are numbers
        await contentfulEnvironment.createEntryWithId(ARTICLE_CONTENT_TYPE, `${post.id}`, {
          metadata: {
            tags: getArticleTags(post),
          },
          fields: getArticleFields(post),
        });
      }
    }),
  );
}
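
The update-or-create decision in upload can be previewed without touching the CMS. The sketch below splits posts into those whose slug already exists and those that are new, which can serve as a dry run before uploading. The helper planUpload and the sample data are assumptions for illustration.

```javascript
// Hypothetical helper: split posts into updates (slug already exists) and creates (new slug).
function planUpload(posts, existingArticles) {
  const articlesBySlug = new Map(
    existingArticles.map((article) => [article.fields.slug?.['en-US'], article]),
  );
  const plan = { toUpdate: [], toCreate: [] };
  for (const post of posts) {
    const article = articlesBySlug.get(post.slug);
    if (article) {
      plan.toUpdate.push({ post, article });
    } else {
      plan.toCreate.push(post);
    }
  }
  return plan;
}

// Sample data shaped like Contentful entries and translated posts.
const existingArticles = [{ sys: { id: '10' }, fields: { slug: { 'en-US': '/articles/old-post' } } }];
const postsToUpload = [
  { id: 10, slug: '/articles/old-post' },
  { id: 11, slug: '/articles/new-post' },
];
const uploadPlan = planUpload(postsToUpload, existingArticles);
console.log(uploadPlan.toUpdate.length, uploadPlan.toCreate.length);
```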

Depending on the structure of your posts, you might need to create other components after the initial upload, if your site includes links to other posts, for example. After doing that, make sure you update the posts again to include the new components.

Conclusion

Migrating content between CMS platforms requires careful planning and execution. By following the steps outlined in this article and utilizing the provided code snippets, you can successfully migrate your WordPress posts to Contentful. This custom migration script streamlines the process, ensuring that your content is seamlessly transferred while maintaining the necessary formatting and components within the Contentful ecosystem. As you make this transition, be sure to review and customize the migration files according to your specific requirements. Happy migrating!