A .NET Core library that converts Microsoft Office binary files to various formats. This fork focuses exclusively on plain text extraction from legacy Microsoft Word (.doc) documents (Word 97-2003, Word 95, and Word 6.0).
You can also use the Open XML SDK to manipulate OpenXML files.
Forked from a .NET 2 Mono implementation under the BSD license.
- DOC to Plain Text Conversion: Robust extraction from Word 97-2003, Word 95, and Word 6.0 formats
- Enhanced Compatibility: Handles tables, headers/footers, embedded objects, and complex document structures
- Clean Output: Produces readable text while preserving document flow
- Edge Case Handling: Robust processing of corrupted or non-standard .doc files
- PowerPoint (.ppt) to PPTX conversion
- Excel (.xls) to XLSX conversion
- Word (.doc) to DOCX conversion
Note: This fork maintains these legacy features but does not actively enhance them.
- Enhanced formatting support for lists (numbers, bullet points, indents) and tables
- Configurable text extraction options (--no-headers-footers, --no-textboxes, --no-comments, --no-bullets)
- Performance optimizations for large document processing
- Additional error handling and recovery mechanisms
This project is inspired and informed by several existing open-source implementations of the Word Binary Format:
| Name | Language | Description | Link |
|---|---|---|---|
| wvWare | C | Original GPL Word97 .doc text extractor | SourceForge |
| OnlyOffice | C++ | Proprietary editor with open-source core, includes DOC parsing | GitHub |
| Antiword | C | Lightweight Word .doc to text/postscript converter | GitHub Mirror |
| Apache POI | Java | Java API for Microsoft Documents, includes Word97 support via HWPF | Apache POI - HWPF |
| LibreOffice | C++ | Full office suite with robust support for legacy DOC files | GitHub |
| Catdoc | C | Lightweight Word .doc to text converter | GitHub Mirror |
| DocToText | C++ | Lightweight any document file to text converter | GitHub |
dotnet test UnitTests/UnitTests.csproj

dotnet test IntegrationTests/IntegrationTests.csproj

Tests that the library works correctly when compiled as NativeAOT. This publishes a minimal console app (Shell/doc2text.aot) as a native binary and runs it against all sample .doc files.

dotnet test NativeAotTests/NativeAotTests.csproj

The first run compiles the NativeAOT binary (may take a few minutes). The compiled binary is cached in artifacts/nativeaot/ and reused on subsequent runs. Delete that directory to force a rebuild.
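
Forcing a rebuild of the cached NativeAOT binary could be scripted roughly as follows. This is a minimal sketch, assuming it is run from the repository root with `dotnet` on the PATH. The function names `clear_nativeaot_cache` and `rebuild_nativeaot_tests` are hypothetical helpers for illustration and are not part of this repository.

```shell
#!/bin/sh
# Sketch: force a rebuild of the cached NativeAOT binary, then re-run the tests.
# Assumes the current directory is the repository root.

# Remove the cache directory (default: artifacts/nativeaot) if it exists.
# Hypothetical helper, not shipped with the project.
clear_nativeaot_cache() {
    cache_dir="${1:-artifacts/nativeaot}"
    rm -rf -- "$cache_dir"
}

# Clear the cache, then run the NativeAOT test project so the binary is recompiled.
# Hypothetical helper, not shipped with the project.
rebuild_nativeaot_tests() {
    clear_nativeaot_cache "$@" &&
        dotnet test NativeAotTests/NativeAotTests.csproj
}

# Usage (uncomment to run from the repository root):
# rebuild_nativeaot_tests
```

Keeping the cache removal in its own function makes it easy to clear the directory without immediately starting the slower NativeAOT compile.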
- Microsoft Office binary files documentation
- Open XML Standard
- Microsoft article on this implementation
- .NET 2 Mono implementation architecture
All code retained from that version ©2009 DIaLOGIKa http://www.dialogika.de/
.NET core port work and move to System.IO.Compression ©2017 Evolution https://www.evolutionjobs.com/