Markitdown

Transform any document into clean, LLM-ready Markdown in seconds

16036 Stars

Convert files and office documents to Markdown. Supports PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs and more.

document-conversion markdown pdf ocr transcription microsoft productivity file-processing

See It In Action

User Prompt

Convert this research paper PDF to Markdown so I can analyze its methodology with Claude

Agent Response

Clean Markdown with preserved formatting, tables, and structure ready for AI analysis

Quick Start (3 Steps)

Get up and running in minutes

1. Install

claude-code skill install markitdown

2. Config

3. First Trigger

@markitdown help

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @markitdown convert-research-papers-for-ai-analysis | Transform PDF research papers into token-efficient Markdown for LLM processing and analysis | None |
| @markitdown extract-data-from-excel-spreadsheets | Convert Excel files to Markdown tables for easy integration into documentation or reports | None |
| @markitdown process-powerpoint-presentations-with-ai-descriptions | Convert slides to Markdown with AI-generated descriptions of visual content | None |

Typical Use Cases

Convert Research Papers for AI Analysis

Transform PDF research papers into token-efficient Markdown for LLM processing and analysis

Extract Data from Excel Spreadsheets

Convert Excel files to Markdown tables for easy integration into documentation or reports

Process PowerPoint Presentations with AI Descriptions

Convert slides to Markdown with AI-generated descriptions of visual content

Overview

MarkItDown - File to Markdown Conversion

MarkItDown is a Python tool developed by Microsoft for converting various file formats to Markdown. It is particularly useful for turning documents into an LLM-friendly text format, because Markdown is token-efficient and well understood by modern language models.

Key Benefits:

  • Convert documents to clean, structured Markdown
  • Token-efficient format for LLM processing
  • Supports 15+ file formats
  • Optional AI-enhanced image descriptions
  • OCR for images and scanned documents
  • Speech transcription for audio files

Visual Enhancement with Scientific Schematics

When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.

If your document does not already contain schematics or diagrams:

  • Use the scientific-schematics skill to generate AI-powered publication-quality diagrams
  • Simply describe your desired diagram in natural language
  • Nano Banana Pro will automatically generate, review, and refine the schematic

For new documents: Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

How to generate schematics:

python scripts/generate_schematic.py "your diagram description" -o figures/output.png

The AI will automatically:

  • Create publication-quality images with proper formatting
  • Review and refine through multiple iterations
  • Ensure accessibility (colorblind-friendly, high contrast)
  • Save outputs in the figures/ directory

When to add schematics:

  • Document conversion workflow diagrams
  • File format architecture illustrations
  • OCR processing pipeline diagrams
  • Integration workflow visualizations
  • System architecture diagrams
  • Data flow diagrams
  • Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.


Supported Formats

| Format | Description | Notes |
| --- | --- | --- |
| PDF | Portable Document Format | Full text extraction |
| DOCX | Microsoft Word | Tables, formatting preserved |
| PPTX | PowerPoint | Slides with notes |
| XLSX | Excel spreadsheets | Tables and data |
| Images | JPEG, PNG, GIF, WebP | EXIF metadata + OCR |
| Audio | WAV, MP3 | Metadata + transcription |
| HTML | Web pages | Clean conversion |
| CSV | Comma-separated values | Table format |
| JSON | JSON data | Structured representation |
| XML | XML documents | Structured format |
| ZIP | Archive files | Iterates contents |
| EPUB | E-books | Full text extraction |
| YouTube | Video URLs | Fetch transcriptions |

Quick Start

Installation

# Install with all features
pip install 'markitdown[all]'

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'

Command-Line Usage

# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe content
cat document.pdf | markitdown > output.md

# Enable plugins
markitdown --list-plugins  # List available plugins
markitdown --use-plugins document.pdf -o output.md

Python API

from markitdown import MarkItDown

# Basic usage
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

# Convert from stream (the file must be opened in binary mode)
with open("document.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
    print(result.text_content)

Advanced Features

1. AI-Enhanced Image Descriptions

Use LLMs via OpenRouter to generate detailed image descriptions (for PPTX and image files):

from markitdown import MarkItDown
from openai import OpenAI

# Initialize OpenRouter client (OpenAI-compatible API)
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for scientific vision
    llm_prompt="Describe this image in detail for scientific documentation"
)

result = md.convert("presentation.pptx")
print(result.text_content)

2. Azure Document Intelligence

For enhanced PDF conversion with Microsoft Document Intelligence:

# Command line
markitdown document.pdf -o output.md -d -e "<document_intelligence_endpoint>"

# Python API
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("complex_document.pdf")
print(result.text_content)

3. Plugin System

MarkItDown supports third-party plugins that extend its functionality. Plugins are not used unless you enable them:

# List installed plugins
markitdown --list-plugins

# Enable plugins
markitdown --use-plugins file.pdf -o output.md

To find available plugins, search GitHub for the hashtag #markitdown-plugin.

Optional Dependencies

Install only the optional dependencies for the file formats you need:

# Install specific formats
pip install 'markitdown[pdf, docx, pptx]'

# All available options:
# [all]                   - All optional dependencies
# [pptx]                  - PowerPoint files
# [docx]                  - Word documents
# [xlsx]                  - Excel spreadsheets
# [xls]                   - Older Excel files
# [pdf]                   - PDF documents
# [outlook]               - Outlook messages
# [az-doc-intel]          - Azure Document Intelligence
# [audio-transcription]   - WAV and MP3 transcription
# [youtube-transcription] - YouTube video transcription
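
To work out which extras a given set of files needs, a small lookup can build the install command for you. This is a minimal sketch, not part of MarkItDown: the helper name `pip_command_for` and the extension-to-extra mapping (for example `.msg` for Outlook messages) are assumptions made for illustration, and only the extras listed above are used.

```python
from pathlib import Path

# Extension -> optional-dependency group, limited to the extras listed above.
# The extension choices (.msg for Outlook, .wav/.mp3 for audio) are assumptions.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def pip_command_for(paths):
    """Return a pip command covering the extras needed for the given files."""
    extras = sorted(
        {
            EXTRA_FOR_SUFFIX[Path(p).suffix.lower()]
            for p in paths
            if Path(p).suffix.lower() in EXTRA_FOR_SUFFIX
        }
    )
    if not extras:
        # Nothing in the list needs an optional dependency group.
        return "pip install markitdown"
    return f"pip install 'markitdown[{','.join(extras)}]'"


print(pip_command_for(["paper.pdf", "notes.docx", "data.csv"]))
```

Files whose extension is not in the mapping (such as `data.csv` here) add no extra, so the command stays as small as the input allows.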

Common Use Cases

1. Convert Scientific Papers to Markdown

from markitdown import MarkItDown

md = MarkItDown()

# Convert PDF paper
result = md.convert("research_paper.pdf")
# Write as UTF-8 so non-ASCII characters survive on every platform
with open("paper.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)

2. Extract Data from Excel for Analysis

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")

# Result will be in Markdown table format
print(result.text_content)

3. Process Multiple Documents

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Process all PDFs in a directory
pdf_dir = Path("papers/")
output_dir = Path("markdown_output/")
output_dir.mkdir(exist_ok=True)

for pdf_file in pdf_dir.glob("*.pdf"):
    result = md.convert(str(pdf_file))
    output_file = output_dir / f"{pdf_file.stem}.md"
    output_file.write_text(result.text_content, encoding="utf-8")
    print(f"Converted: {pdf_file.name}")

4. Convert PowerPoint with AI Descriptions

from markitdown import MarkItDown
from openai import OpenAI

# Use OpenRouter for access to multiple AI models
client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",  # recommended for presentations
    llm_prompt="Describe this slide image in detail, focusing on key visual elements and data"
)

result = md.convert("presentation.pptx")
with open("presentation.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)

5. Batch Convert with Different Formats

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Files to convert
files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx"
]

for file in files:
    try:
        result = md.convert(file)
        # Write next to the script, named after the input file
        output = Path(file).stem + ".md"
        with open(output, "w", encoding="utf-8") as f:
            f.write(result.text_content)
        print(f"✓ Converted {file}")
    except Exception as e:
        # Report the failure and continue with the remaining files
        print(f"✗ Error converting {file}: {e}")

6. Extract YouTube Video Transcription

from markitdown import MarkItDown

md = MarkItDown()

# Convert YouTube video to transcript
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)

Docker Usage

# Build image
docker build -t markitdown:latest .

# Run conversion
docker run --rm -i markitdown:latest < ~/document.pdf > output.md

Best Practices

1. Choose the Right Conversion Method

  • Simple documents: Use basic MarkItDown()
  • Complex PDFs: Use Azure Document Intelligence
  • Visual content: Enable AI image descriptions
  • Scanned documents: Ensure OCR dependencies are installed
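
The guidance above can be written down as a small decision helper. This is only a sketch of the rule of thumb, not something MarkItDown provides: the function name `choose_method`, its flags, and the returned labels are all hypothetical.

```python
from pathlib import Path

# Image extensions taken from the Supported Formats table.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def choose_method(path, scanned=False, complex_layout=False):
    """Map a file and two hints to one of the conversion approaches above."""
    suffix = Path(path).suffix.lower()
    if scanned:
        # Scanned material needs OCR dependencies installed first.
        return "ocr"
    if suffix == ".pdf" and complex_layout:
        # Complex PDFs: Azure Document Intelligence.
        return "document-intelligence"
    if suffix == ".pptx" or suffix in IMAGE_SUFFIXES:
        # Visual content: enable AI image descriptions.
        return "llm-descriptions"
    # Everything else: plain MarkItDown().
    return "basic"


print(choose_method("report.pdf", complex_layout=True))
print(choose_method("slides.pptx"))
```

The `scanned` and `complex_layout` hints are supplied by the caller; nothing here inspects the file contents.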

2. Handle Errors Gracefully

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("document.pdf")
    print(result.text_content)
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Conversion error: {e}")
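
When many files are converted in one run, it can help to collect failures instead of printing them as they happen. The sketch below is one way to do that under stated assumptions: `convert_many` is a hypothetical helper, and it takes any callable that maps a path to Markdown text, so a stand-in converter is used here in place of MarkItDown.

```python
def convert_many(convert, paths):
    """Apply `convert` to each path; return (results, failures) without aborting."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = convert(path)
        except Exception as exc:  # record any converter error and keep going
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


# Stand-in converter for demonstration only. With MarkItDown installed you
# might pass `lambda p: md.convert(p).text_content` instead.
def fake_convert(path):
    if not path.endswith(".pdf"):
        raise ValueError("unsupported file")
    return f"# {path}"


results, failures = convert_many(fake_convert, ["a.pdf", "b.xyz"])
print(results)
print(failures)
```

Returning the failures as data makes it easy to retry only the files that failed or to write a summary at the end of the batch.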

3. Process Large Files Efficiently

from markitdown import MarkItDown

md = MarkItDown()

# For large files, pass an open binary stream
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

    # Save the result directly instead of printing it
    with open("output.md", "w", encoding="utf-8") as out:
        out.write(result.text_content)

4. Optimize for Token Efficiency

Markdown output is already token-efficient, but you can:

  • Remove excessive whitespace
  • Consolidate similar sections
  • Strip metadata if not needed

from markitdown import MarkItDown
import re

md = MarkItDown()
result = md.convert("document.pdf")

# Clean up extra whitespace
clean_text = re.sub(r'\n{3,}', '\n\n', result.text_content)
clean_text = clean_text.strip()

print(clean_text)
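
The same clean-up can be packaged as a reusable function that works on any Markdown string. This is a minimal sketch; `tidy_markdown` is a hypothetical name, and it only normalises whitespace. Note one trade-off: it strips trailing spaces, which removes Markdown hard line breaks written as two trailing spaces.

```python
import re


def tidy_markdown(text):
    """Normalise whitespace in Markdown text without changing its wording."""
    # Use Unix line endings throughout.
    text = text.replace("\r\n", "\n")
    # Drop trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End with exactly one newline.
    return text.strip() + "\n"


print(tidy_markdown("# Title  \r\n\r\n\r\n\r\nBody\t\n\n"))
```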

Integration with Scientific Workflows

Convert Literature for Review

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

# Convert all papers in literature folder
papers_dir = Path("literature/pdfs")
output_dir = Path("literature/markdown")
output_dir.mkdir(parents=True, exist_ok=True)

for paper in papers_dir.glob("*.pdf"):
    result = md.convert(str(paper))

    # Save with metadata
    output_file = output_dir / f"{paper.stem}.md"
    content = f"# {paper.stem}\n\n"
    content += f"**Source**: {paper.name}\n\n"
    content += "---\n\n"
    content += result.text_content

    output_file.write_text(content, encoding="utf-8")

# For AI-enhanced conversion with figures
from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-api-key",
    base_url="https://openrouter.ai/api/v1"
)

# Use md_ai.convert(...) in place of md.convert(...) in the loop above
md_ai = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-sonnet-4.5",
    llm_prompt="Describe scientific figures with technical precision"
)

Extract Tables for Analysis

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data_tables.xlsx")

# Markdown tables can be parsed or used directly
print(result.text_content)
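
If you want the rows as data rather than text, a pipe table can be parsed with a few lines of standard-library Python. The sketch below assumes the converted output uses pipe tables with a dash separator row; `parse_markdown_table` is a hypothetical helper, it reads only the first table it finds, and it does not handle escaped pipes or cells that are empty at the start of a row.

```python
def parse_markdown_table(text):
    """Parse the first pipe table in `text` into a list of row dicts."""

    def cells(line):
        # Strip the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip().strip("|").split("|")]

    lines = text.splitlines()
    for i in range(len(lines) - 1):
        header, separator = lines[i], lines[i + 1]
        if "|" not in header:
            continue
        sep_cells = cells(separator)
        # A separator row holds only dashes and optional alignment colons.
        if not all(c and set(c) <= set("-:") and "-" in c for c in sep_cells):
            continue
        names = cells(header)
        rows = []
        for line in lines[i + 2:]:
            if "|" not in line:
                break  # the table ends at the first non-table line
            rows.append(dict(zip(names, cells(line))))
        return rows
    return []


sample = "## Sheet1\n| gene | count |\n| --- | --- |\n| BRCA1 | 12 |\n| TP53 | 7 |\n\nnotes"
print(parse_markdown_table(sample))
```

All values come back as strings; convert numeric columns explicitly before analysis.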

Troubleshooting

Common Issues

  1. Missing dependencies: Install feature-specific packages

    pip install 'markitdown[pdf]'  # For PDF support

  2. Binary file errors: Ensure files are opened in binary mode

    with open("file.pdf", "rb") as f:  # Note the "rb"
        result = md.convert_stream(f, file_extension=".pdf")

  3. OCR not working: Install tesseract

    # macOS
    brew install tesseract

    # Ubuntu
    sudo apt-get install tesseract-ocr
    

Performance Considerations

  • PDF files: Large PDFs may take time; consider page ranges if supported
  • Image OCR: OCR processing is CPU-intensive
  • Audio transcription: Requires additional compute resources
  • AI image descriptions: Requires API calls (costs may apply)
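
Because conversion can be slow or incur API costs, one option is to cache results keyed by the file's content hash so unchanged files are not converted twice. This is a sketch under stated assumptions: `cached_convert` is a hypothetical helper, it accepts any callable that maps a path to Markdown text, and a stand-in converter is used here so the example runs without MarkItDown.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_convert(convert, source, cache_dir):
    """Convert `source` once; reuse the stored Markdown while its bytes are unchanged."""
    source, cache_dir = Path(source), Path(cache_dir)
    # The cache key is the SHA-256 of the file contents, not its name.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.md"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = convert(str(source))
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text


# Stand-in converter that records how often it is called.
calls = []


def fake_convert(path):
    calls.append(path)
    return "# converted\n"


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "doc.pdf"
    doc.write_bytes(b"%PDF-1.4 placeholder")
    first = cached_convert(fake_convert, doc, Path(tmp) / "cache")
    second = cached_convert(fake_convert, doc, Path(tmp) / "cache")

print(len(calls))
```

The key covers only the file bytes, so changing conversion settings (for example a different image-description prompt) would need a separate cache directory or an extended key.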

Next Steps

  • See references/api_reference.md for complete API documentation
  • Check references/file_formats.md for format-specific details
  • Review scripts/batch_convert.py for automation examples
  • Explore scripts/convert_with_ai.py for AI-enhanced conversions

Environment Matrix

Dependencies

Python 3.10+
pip install 'markitdown[all]'
tesseract-ocr (for OCR functionality)
Optional: OpenRouter API key (for AI image descriptions)

Framework Support

OpenAI API ✓ (for AI descriptions)
OpenRouter ✓ (recommended for model variety)
Azure Document Intelligence ✓
Plugin system ✓

Context Window

Token Usage: Varies by document size, typically 2K-20K tokens for standard documents
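
To check a converted document against a token budget before sending it to a model, a character-based estimate is often enough for a first pass. The sketch below assumes roughly four characters per token, a common rule of thumb for English text; real counts depend on the tokenizer, and both function names are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_budget(text, budget_tokens):
    """Return True if the estimated token count is within the budget."""
    return estimate_tokens(text) <= budget_tokens


markdown = "# Results\n\nThe assay was repeated three times.\n"
print(estimate_tokens(markdown), fits_budget(markdown, 2000))
```

For exact counts, use the tokenizer that matches the model you are calling.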

Information

Author: davila7
Updated: 2026-01-30
Category: productivity-tools