COMMAND LINE WARRIORS

Taking Control of your Own Technology

Scaffolds and Crutches

8 October 2007

Hello

In today's rant, I look at a few different text document formats and compare the file size of each. I then ask what we are trying to achieve with these formats anyway.

If the guts of file formats are not your cup of tea, then move along and come back tomorrow when we will talk about something else.

Method

The process was the same for each application. I opened the program, typed 'Hello World' and then saved the file. I did nothing else in the program so as not to generate any more metadata. The text sample was literally Hello World, i.e. eleven characters (ten letters and a space), no newlines, no selected fonts or styles.

Results

Format  Application              File Size (bytes)
.txt    Emacs 21.4.1             11
.abw    Abiword 2.4.6            2517
.odt    OpenOffice Writer 2.20   6674
.doc    Microsoft Word 2003 SP2  24064
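To put those numbers in proportion, here is a small sketch that works out how much of each file is overhead. The sizes are copied from the results above; the `overhead` helper is just a name I made up for the illustration.

```python
# The text itself: ten letters and a space, one byte each in UTF-8.
PAYLOAD = len("Hello World".encode("utf-8"))

# File sizes in bytes, copied from the results above.
SIZES = {".txt": 11, ".abw": 2517, ".odt": 6674, ".doc": 24064}


def overhead(size, payload=PAYLOAD):
    """Return (extra bytes beyond the text, size as a multiple of the text)."""
    return size - payload, size / payload


for ext, size in SIZES.items():
    extra, factor = overhead(size)
    print(f"{ext}: {extra} bytes of overhead, {factor:.1f}x the text itself")
```

On these figures the .doc file comes out at over two thousand times the size of the text it carries.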

I don't have Office 2007 and OOXML. It might be interesting if someone could tell us the file size of Hello World in that.

What is this space being used for?

Besides Emacs, all the other applications are storing a lot of style information and metadata about things that I did not specify. They are storing the default font type and size, paper type and size, as well as information about features that I have not used: footnotes, embedded images, macros and so on.

The Abiword format is slightly more modular and smarter than the other two. It does include all the default fonts and so on, but it does not store information about unused features. The size difference between the Abiword format on the one hand and ODF and Microsoft on the other is greater than shown here because I am not actually comparing like-to-like.

The Abiword file is in a human readable format already. The .doc format is binary, so it would take up a lot more space if expressed in a human readable format. Likewise, ODF files are archive files. I unzipped the .odt file into human readable files and there was a whopping 88K of data inside it - remember this is just for 'Hello World'.
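If you want to repeat the unzipping measurement without extracting anything by hand, something like the following sketch should do it using Python's standard zipfile module. The `archive_sizes` helper is my own name, and the archive built here is a tiny stand-in created in memory, not a real .odt file; to measure a real one you would pass its filename instead.

```python
import io
import zipfile


def archive_sizes(source):
    """Return (compressed_total, uncompressed_total) in bytes for a ZIP archive."""
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


# Stand-in archive with made-up contents, built in memory for the demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", "<p>Hello World</p>" * 100)
    archive.writestr("styles.xml", "<style/>" * 100)
buffer.seek(0)

compressed, uncompressed = archive_sizes(buffer)
print(f"compressed: {compressed} bytes, uncompressed: {uncompressed} bytes")
```

For a real document, `archive_sizes("hello.odt")` (a hypothetical filename) would report the totals stored in the archive, which is the like-to-like figure to compare against a plain XML format.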

Do I want default styles?

The first question that arises for this is when I do not explicitly select any styles, such as paper type or font size and so on, should the application choose styles for me and write it into the file? Or should it respect the default style of whatever computer and application the file might be opened with?

Let's look at this by taking an example, the font type. Now the question is whether the application should include font type information within the file, even if I have not explicitly selected a font?

Let's imagine that I create my file on Linux and then share my file with someone on Windows using Word that is set up to understand the format of the file (either by default or through a plugin).

The first option would be to not include any style information. So 'Hello World' would open on their screen in Word's default font, which I am assuming is still 'Times New Roman'.

The second option would be to include the style information. So it saves in the metadata that the font should be 'DejaVu Sans' (this is the default font in OpenOffice on my version of Linux). Word opens the file but Windows does not have DejaVu Sans, so 'Hello World' opens in Times New Roman anyway, looking exactly the same as in the previous example.

If I really cared that the other person could see the document in DejaVu Sans, i.e. that the document on their screen has to look the same as the document on my screen, then we would embed the font in the file, i.e. we would end up with something that looks rather like PDF. The filesize may end up as several megabytes (just to say 'Hello World').

There is a difference of course between the file format and the behaviour of the application that implements the format. For the fun of it, I stripped out the style defaults from the Abiword file, and got it just below 1000 bytes, and it still opened without error.
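For illustration, the stripping exercise might look something like the sketch below. The XML here is a made-up, heavily simplified stand-in rather than a real Abiword file (real files carry a namespace and more elements), and `strip_elements` is a hypothetical helper, not part of any Abiword tooling.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a word processor file: one block of default
# styles that nobody asked for, followed by the actual text.
SAMPLE = (
    "<abiword>"
    '<styles><s name="Normal" props="font-family:Times New Roman"/></styles>'
    "<section><p>Hello World</p></section>"
    "</abiword>"
)


def strip_elements(xml_text, tag):
    """Return the document with every element named `tag` removed."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


stripped = strip_elements(SAMPLE, "styles")
print(len(SAMPLE), "characters before,", len(stripped), "after")
print(stripped)
```

Whether the application still opens the result is a separate question from whether the format allows it, which is the distinction made above.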

I don't know exactly how much style information I could strip out of my .odt file before it became an invalid ODF document according to the standards, but it would be an interesting exercise to try to create the smallest valid ODF file.

Dumb and Dumber?

The second question that arises is about modularity, or to put it another way, about 'stretchiness'. I personally like the Abiword approach, i.e. it only creates image metadata when you embed an image, it only includes information about footnotes when you include a footnote, it only has metadata about macros when you have macros and so on. The file starts as simply as possible and only becomes complicated when you need it to.

The ODF format has lots of scaffolding just in case you use these extra features. My Hello World ODF file archive contained over half-a-dozen XML files and over a dozen folders.
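Counting the scaffolding can be scripted too. This is a rough sketch: `count_entries` is a name I invented, and the entry names in the demo archive are only illustrative stand-ins, not a faithful copy of what OpenOffice writes.

```python
import io
import posixpath
import zipfile


def count_entries(source):
    """Return (number of .xml files, number of distinct folders) in a ZIP archive."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    files = [name for name in names if not name.endswith("/")]
    folders = set()
    for name in names:
        # A name ending in "/" is itself a folder; otherwise take its parent.
        directory = name.rstrip("/") if name.endswith("/") else posixpath.dirname(name)
        while directory:
            folders.add(directory)
            directory = posixpath.dirname(directory)
    xml_files = [name for name in files if name.endswith(".xml")]
    return len(xml_files), len(folders)


# Stand-in archive with illustrative entry names, built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("content.xml", "<p>Hello World</p>")
    archive.writestr("styles.xml", "<styles/>")
    archive.writestr("META-INF/manifest.xml", "<manifest/>")
    archive.writestr("Configurations2/images/Bitmaps/", "")
buffer.seek(0)

xml_count, folder_count = count_entries(buffer)
print(f"{xml_count} XML files, {folder_count} folders")
```

Empty folders are counted as well, since those are exactly the 'just in case' scaffolding in question.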

I'm sure there are lots of reasons that could be given to justify all this bloat, but I am not sure that the file format should be optimised for a very small percentage of the files actually being created in the real world, i.e. a file that uses every single feature and so all the metadata and all the subdirectories. In other words, ODF text and .doc format are optimised for people writing large complex documents that contain lots of macros.

The problem is that for these tasks, you would be better off with a specialist tool. For example, for typesetting you are better off with a tool that generates LaTeX. For complex corporate tasks, it is almost always more efficient to have proper corporate applications with servers, database backends and the like than to build everything upon word processor macros.

So these file formats are optimised for people using office tools beyond their natural capabilities, i.e. people doing a sloppy job because they do not have adequate technical support or budgets for anything else.

Let me be a bit clearer. What people often think of as an 'advanced' or 'complex' document often is not that complicated from a technical perspective. A document with some tables and some embedded images is not a complex document. Meanwhile, most extremely complex documents ('monster documents') are extremely complex because people are trying too hard with the wrong approach and tools.

Back to the future?

So a few people are quietly discussing putting both OOXML and ODF to one side and instead looking at the format commonly called XHTML+, which is summarised in RFC 3236 (the MIME type is application/xhtml+xml), to see how far it can go.

If you have an XHTML+ document and a stylesheet in CSS, you can cover 90% or more of the documents that are actually created; PDF or more specialist formats can mop up the rest. Tables, embedded images, complex styles and so on are all fine. Being an XML-based format, it is machine readable; being an HTML-based document, it is already web-ready.
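As a rough idea of how small such a document could be, here is a sketch of 'Hello World' as an XHTML file plus a separate stylesheet. The markup, the style.css filename and the single CSS rule are my own assumptions for the illustration; the sketch only checks that the XML is well-formed, not that it would validate against any XHTML DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Content and structure only; the style lives in a separate file.
XHTML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<html xmlns="{XHTML_NS}">\n'
    "<head><title>Hello</title>"
    '<link rel="stylesheet" type="text/css" href="style.css"/></head>\n'
    "<body><p>Hello World</p></body>\n"
    "</html>\n"
)

# Hypothetical stylesheet; the reader could swap in their own.
CSS = "p { font-family: serif; }\n"

# Being XML, the document can be parsed by any XML tool.
root = ET.fromstring(XHTML.encode("utf-8"))
text = root.find(f".//{{{XHTML_NS}}}p").text

document_bytes = len(XHTML.encode("utf-8"))
stylesheet_bytes = len(CSS.encode("utf-8"))
print(f"text: {text!r}")
print(f"document: {document_bytes} bytes, stylesheet: {stylesheet_bytes} bytes")
```

Because the stylesheet is a separate file, the same content can be restyled, or left to the reader's defaults, without touching the document.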

This would be a return to some of the 'rich text' approaches of the past, or at least the next generation equivalent of them. These approaches were using classic (non XML) HTML for documents and emails and so on. The difference now is that such an XHTML+ based document format would be native XML and the style would be separated from the content.

This document format would not need many changes to existing office applications. Every word processor released in the last half-decade should be able to at least open an XHTML file with a stylesheet, Microsoft Word included; with minimal help, these applications should also be able to write the files in the particular way required by the format.

The idea of a light, fast and easy to use XHTML+XML based document format (that simply ignores the dead-end monster documents) is a great one, but I doubt it will get anywhere; the world is too exasperated for a new format right now. It is interesting none the less for how it reflects on ODF and OOXML.

Which part is the document anyway?

A lot of discussion about document formats, especially in the media and this post included, is somewhat dubious because the terminology is ambiguous.

Ambiguous terminology normally arises because the whole picture is not seen and/or because the actual specifics are not understood, i.e. there are breadth and/or depth issues hiding under the discussion.

So far we have been discussing what we should do when taking a document from my computer to your computer, i.e. what should be transferred? But to have the discussion properly, we need to agree on what is a document anyway? Why am I trying to share bits and bytes with you in the first place?

This, I think, is where we need to use a different word altogether: let's use the word 'information'. No one wants to share documents for the sake of it. People want to share information, and people want to do this easily, accurately and efficiently.

What do I mean by accurately? By this, I mean that the information that people receive is as faithful as possible to the information sent. Faithfulness is the older English word; the French import is fidelity (of course, both originally derive from the Latin word 'fides', also the name of the Roman goddess of trust).

So when people say that the results of some document converter are 'high-fidelity' or 'low-fidelity', the question is fidelity to what? Fidelity to the information itself, fidelity to the visual appearance, fidelity to the file structure, fidelity to the macros?

Information not Macros

You might have guessed already that my bias is towards fidelity to the information itself. Some companies, organisations and governments are asking themselves the following question:

*Can ODF include all the styles and macros inside the documents that have piled up on our hard-drives and shared drives over the last decade?*

This is the wrong question. The better question is 'can we get all our information out of these binary formats so we can move to a more efficient way of working?'

This data should not be put into new XML-based documents, just to pile up again on people's hard-drives and shares. It should be in server driven tools, databases, web pages and so on.

I am not a fan of office software macros because macros are crutches. By all means have plugins for your word processors and other office software, but software logic should be shared separately from information. Likewise, corporate styles should not be combined with data until the last possible moment, i.e. when the data is actually leaving the organisation, i.e. in web sites, PDFs and so on.

Over to you

That's how I feel about it today anyway! This might well be one of those posts in which you can find holes and edge cases that get me to change or refine my opinions. So leave a comment and let me know.

1 Rob Weir says...

What you point out is more a factor of the applications OpenOffice and MS Word than it is about the formats. There is nothing in the ODF standard, for example, that requires it to store unused style definitions.

Many modern word processors like OpenOffice base new documents on a default document template. This document template (which can itself be edited or replaced by the user) contains the set of styles which a new document will inherit. Without the default template, you can imagine how frustrating it would be to use a word processor.

So imagine that when you saved a document, instead of writing out all of the template's styles, you wrote out only the ones which were actually used. The resulting document would be smaller, right? That may be a benefit in some situations. However, when a user, perhaps a different user, attempts to edit the document, they would only be able to use that same subset of styles.

Also note that default styles are designed to be a harmonious whole. So presentation styles encode a particular color scheme, typography, background art, etc. Even text document template styles are designed to work together in terms of fonts, margins, spacing, header levels, etc.

So, what are the alternatives? We could store all templates in some centrally-located place, and have all documents refer to these styles by URL. But this creates additional management and versioning issues. And what if you want to edit the document on an airplane? Then you would need to have facilities for caching styles, etc. This is all certainly possible, but the far simpler solution is to make the document be self-contained and hold its own default styles.

Posted at 2:10 a.m. on October 9, 2007


2 Chris Walker says...

Microsoft Word 2007 (Trial): 9,920 bytes

I also tried the test in a copy of OO.o 2.0.4 that I still have lying around, which gave me a file size of 7,127 bytes.

So both office suites are bringing their file sizes down, but I suspect that the huge difference in the size of the Microsoft formats has a lot to do with OOXML's use of ZIP compression (which ODF also uses by way of Sun's JAR format).

With regard to the inclusion of fonts in a document that uses a default style, I think that consistency of presentation across different clients is much more important than a slightly larger file size. Otherwise, there is very little point in having a unified format like ODF, if you can't assume that your document will look the same everywhere.

Oh, and I thought I'd try the Hello World test in Apple Pages. 36,693 bytes. Ouch! I had a quick look at the XML, and it seems to have an empty XML tag for each feature in a style that isn't used. Deary me.

Posted at 9:30 a.m. on October 9, 2007


3 Troy Curtis says...

I whole-heartedly agree that office macros should be banished! Right along with monster documents (I love that term). At work we are being pushed into using a requirements tracking system for requirements, design documents, proposals, etc. While it certainly has its limitations, it is a far far far step above a simple check-in/check-out document management system using .doc files with all the embedded change tracking! And the templates produced by the company were monster documents before they even had any information in them! Many people (especially old-timers) are grumbling and resisting the change, but I love it. The best part is that there is a plain text import with a decent set of simple formatting rules so that I can do everything in VIM! Whoop!

Posted at 3:39 a.m. on October 10, 2007


4 Bug says...

That reminds me of something. When I was younger, I used to use Word / Excel in order to make my time tables.

Lately, I found myself doing that in xhtml, I wonder why...

Posted at 7:30 a.m. on October 11, 2007


5 Peter Lewis says...

Quote: "The first question that arises for this is when I do not explicitly select any styles, such as paper type or font size and so on, should the application choose styles for me and write it into the file? Or should it respect the default style of whatever computer and application the file might be opened with?"

A fantastic point and IMO definitely the latter. As I expressed in my post about dark colour schemes (http://www.petesodyssey.org/node/165), I think that when I'm reading some information, I should be able to do it in the way I prefer. For me, this is just a preference, but for some people if you specify size 10 grey text on a white background, it's not readable at all.

I completely agree with the idea of separating the 'information' from the 'document'.

Posted at 2:53 p.m. on October 11, 2007


6 Doug Mahugh says...

Zeth, FYI, the file size in Word 2007 for a DOCX is 9870 bytes.

Regarding macros, I think most organizations with large investments in them, and business processes built around them, will be reluctant to walk away from that investment.  Just my speculation, of course -- in a few years we'll know for sure.

Posted at 11:32 p.m. on October 11, 2007


7 Rolf says...

Word 2007: 9909 bytes
xml: 222 bytes (Open Office 2.3 saved as Docbook XML)
rtf: 2243 bytes (Open Office 2.3 saved as rtf)

I've done a number of long documents that include cross references, indexes, and of course a table of contents. This would be a real mess if every computer that printed them used its own font and paper sizes. While I agree that mega-size files are ridiculous for a simple "Hello World," that needs to be weighed against a number of other factors, such as how the document will be used, maintained, and distributed. (In my case, Frame Maker to create and maintain, PDF to distribute, used both on-screen and printed.) Unfortunately, most users don't know enough or care enough to consider the tools and formats appropriate for the document needs, and often it's simpler to use a one-size-fits-all approach.

Posted at 6:34 p.m. on October 22, 2007


9 Brian says...

I was just trying to add in the comment above for the rant on text editor file size. Love the site!

I don't know if anyone else realizes this, or maybe you do, but with the Office 2007 .docx files you can rename the extension to .zip and you will be able to read all of the XML files for the doc properties, etc. Either way, it's a crazy way to see the bloat involved in M$ Word.

Posted at 11:55 p.m. on October 30, 2007


10 David A says...

It seems to be a very persistent concept that a document has to provide content, structure AND style. Very few web site designers can abstain from setting e.g. font sizes and family, instead of respecting the viewer's preferences, that is, the defaults in the browser.

(Luckily I can override the sites' font settings in my browser.)

In the future, content providers will provide only content and structure. The viewer determines the style. Imagine this: I read a document issued by a company B, but I see it in my employer A's style, or in my personal style. And that will be the normal thing. No PDFs. No letter heads.

Sorry I was dreaming.

Posted at 10:05 p.m. on December 3, 2007


About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology. It looks at our experiences of computing, especially using GNU/Linux, the Python programming language and the command-line, and at issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
