Thursday, December 19, 2024

Markitdown: Convert Office Documents Into AI-Friendly Markdown

Microsoft recently released Markitdown, an open-source Python library that converts Word documents, PDFs, Excel files, and PowerPoint presentations into plain-text Markdown.

By bridging the gap between proprietary formats and text-based Markdown, Markitdown streamlines how we prepare documents for AI applications. So let’s dive into how this tool works, what makes it so interesting, and how you can start using it today.

Why Markitdown Matters

Beyond its pure utility, Markitdown represents a strategic shift by Microsoft to embrace open-source solutions and the growing influence of AI. Instead of locking users into proprietary formats, Markitdown lets businesses leverage their existing Office content in new applications and workflows.

By converting documents into Markdown, Markitdown enables seamless integration with Large Language Models (LLMs) for tasks like Retrieval-Augmented Generation (RAG). This is especially useful for businesses looking to enhance their AI capabilities and make better use of their content.
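To make the RAG point concrete, here is a minimal sketch of one common preprocessing step: splitting converted Markdown into heading-based chunks that could then be embedded and indexed. The function name `chunk_by_heading` is hypothetical and not part of Markitdown, and the sketch assumes ATX-style headings (`#`, `##`, and so on) with no headings hidden inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into (heading, body) chunks, one per heading.

    Text before the first heading is returned with an empty heading.
    """
    chunks = []
    title, body = "", []

    def flush():
        # Emit the current chunk only if it has a heading or real content.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading, so a retrieval step can show where in the original document a passage came from.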

Testing Markitdown

So how well does it work? Let’s take a look at converting a few traditional Office document formats into Markdown. Starting, of course, with the quintessential Microsoft document format: Microsoft Word.

Word Document (.docx) to Markdown

Below on the left is a resume template provided to Markitdown as the input, and on the right is the output in Markdown.

resume.docx

The library does an excellent job extracting data while preserving heading details. It’s easy to imagine this conversion being part of an AI-powered workflow to process incoming job applications.
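A conversion like this takes only a few lines of Python. The sketch below assumes the package has been installed with `pip install markitdown` (depending on the release, support for some formats may need to be installed as optional extras). The helper name `convert_to_markdown` and its `converter` parameter are illustrative and not part of the library; the only library calls used are `MarkItDown()`, `.convert()`, and the result's `.text_content` attribute.

```python
from pathlib import Path


def convert_to_markdown(path, converter=None):
    """Convert one document to Markdown text.

    `converter` is any object with a `.convert(path)` method whose result
    exposes `.text_content`. By default a MarkItDown instance is created.
    """
    if converter is None:
        # Imported lazily so the helper can be exercised without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(str(path))
    return result.text_content


if __name__ == "__main__":
    # "resume.docx" is the sample file name used in this post.
    source = Path("resume.docx")
    if source.exists():
        print(convert_to_markdown(source))
```

Passing the converter in as a parameter keeps the helper easy to test and lets a single `MarkItDown` instance be reused across many files.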

PowerPoint (.pptx) to Markdown

Next, let’s convert a PowerPoint presentation with an embedded chart to see how Markitdown handles it. On the left is slide four of a multi-slide PowerPoint file containing a bar chart, and on the right you can see the Markdown output, including the slide number as a comment.

chart.pptx

Once again, the heading information is preserved. More impressively, the bar chart is converted into a Markdown table, making it simple for an LLM to process the chart’s information.
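Because each slide is introduced by a comment carrying its number, the output can be split back into per-slide sections. The sketch below assumes the comment looks like `<!-- Slide number: 4 -->`; that exact format is an assumption here, so adjust the pattern if your output differs. The function name `split_slides` is hypothetical.

```python
import re

# Assumed comment format for slide markers; change this if yours differs.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown):
    """Return a dict mapping slide number to that slide's Markdown."""
    # re.split with one capture group yields:
    # [text_before_first_marker, num1, body1, num2, body2, ...]
    parts = SLIDE_MARKER.split(markdown)
    slides = {}
    for i in range(1, len(parts) - 1, 2):
        slides[int(parts[i])] = parts[i + 1].strip()
    return slides
```

With the slides separated, a pipeline could send only the slide containing the chart table to an LLM instead of the whole deck.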

Excel (.xlsx) to Markdown

Moving on to Excel, you can see the input of a simple Excel file on the left and the output on the right.

simple.xlsx

The library converts the Excel file to a Markdown table without any issue. But what about a more complex Excel file with custom formatting?

Below on the left is an Excel file using blank cells for spacing and padding. The conversion on the right contains multiple cells labeled “NaN” (not a number).

formatted.xlsx

Even so, the data is still preserved in a format that’s understandable by both humans and large language models. As shown in the conversation below, ChatGPT had no problem interpreting the data despite its odd structure.

In a full document processing workflow, you could use an LLM to restructure the data into a cleaner format.
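Some of that cleanup can also happen before an LLM is involved. As a rule-based alternative (not an LLM step), the sketch below drops rows and columns that contain only "NaN" or empty cells and blanks out the remaining "NaN" values. It assumes a well-formed pipe table with a header row, a separator row, and no escaped pipes inside cells, and it requires Python 3.9 or later. The function name `drop_nan_padding` is hypothetical.

```python
def _cells(line):
    # "| a | b |" -> ["a", "b"]; removes exactly one outer pipe per side.
    inner = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in inner.split("|")]


def _is_blank(cell):
    return cell in ("", "NaN")


def drop_nan_padding(table_md):
    """Remove rows and columns that hold only NaN or empty cells."""
    lines = [ln for ln in table_md.splitlines() if ln.strip()]
    header, separator, *body = [_cells(ln) for ln in lines]

    # Drop body rows that are pure padding.
    body = [row for row in body if not all(_is_blank(c) for c in row)]

    # Keep a column only if at least one body cell carries data.
    keep = [
        i
        for i in range(len(header))
        if any(not _is_blank(row[i]) for row in body)
    ]

    cleaned = []
    for row in [header, separator] + body:
        kept = ["" if row[i] == "NaN" else row[i] for i in keep]
        cleaned.append("| " + " | ".join(kept) + " |")
    return "\n".join(cleaned)
```

Note that this also removes a column whose header is meaningful but whose cells are all empty, which may or may not be what you want for a given spreadsheet.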

Portable Document Format (.pdf) to Markdown

Finally, even though PDF isn’t a Microsoft format, let’s see how Markitdown handles PDF files. On the left is the input, and on the right is the conversion, which accurately reflects the text contents.

The image content isn’t included, and there’s no indication in the output that an image was present. Ideally, in a complete document processing pipeline, you’d want to flag any PDF images for additional processing. It’s also worth noting that the PDF on the left isn’t a scanned document, so there’s no rasterized text that would require OCR.
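One lightweight way to flag such files is to look for image objects in the raw PDF before conversion. The sketch below is a heuristic, not a parser: it counts dictionaries declaring `/Subtype /Image`, which is how image XObjects are marked in a PDF. It will miss inline images embedded in compressed content streams, and it also counts masks stored as image objects, so treat the result as a "needs review" signal rather than an exact image count. The function names are hypothetical.

```python
import re
from pathlib import Path

# Image XObjects declare "/Subtype /Image" in their dictionary.
# Whitespace between the two names is optional in PDF syntax.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_objects(pdf_bytes):
    """Count image XObject declarations found in raw PDF bytes."""
    return len(IMAGE_XOBJECT.findall(pdf_bytes))


def needs_image_review(pdf_path):
    """Return True if the PDF appears to contain at least one image."""
    return count_image_objects(Path(pdf_path).read_bytes()) > 0
```

Files that trip the check could be routed to an OCR or image-captioning step, while the rest go straight through the Markdown conversion.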

Potential Use Cases

Markitdown’s ability to convert documents into Markdown opens up a range of possibilities for businesses and developers. By making proprietary formats more accessible, it lets Office content flow into AI pipelines for data retrieval, structured data extraction, and content analysis. This is especially useful for creating AI-driven applications that require clean, structured input.

Moreover, Markitdown boosts collaboration and content management by making documents more versatile and easier to work with. Whether you’re feeding large language models, generating structured datasets, or refining document workflows, Markitdown offers a flexible and powerful solution.