349 pages
February 19th, 2008
Joel Spolsky answers the question thousands of computer users have been asking themselves for years: Why are the Microsoft Office file formats so complicated?[1]
Last week, Microsoft published the binary file formats for Office. These formats appear to be almost completely insane. The Excel 97-2003 file format is a 349 page PDF file.
[...]
If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:
- are deliberately obfuscated
- are the product of a demented Borg mind
- were created by insanely bad programmers
- and are impossible to read or create correctly.
You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.
Sadly, that last part amounts to two options: buy a copy of Microsoft Office and use it to translate the files between formats, or use a non-Office format which Office can understand but which will fail to support at least 20% of the features you need. (The hell of it being that your 20% probably doesn't overlap with my 20%.)
It's a damn shame that Lotus and Borland and WordPerfect failed to keep up in the early-to-mid 1990s when Microsoft started pushing Office really hard; if there were still real competition in the general-purpose office suite market, Microsoft wouldn't be able to get away with this nonsense.
[1] More accurately, ordinary users don't wonder about the complexity of the file format as such; they just find themselves unable to rely on being able to use their Word and Excel documents in any product that isn't from Microsoft, not realising that this is largely because of how complicated the Microsoft file formats are. Same difference.