My God, it's Full of XML

22 September 2008

In recent posts I looked at a native XML database called DBXML and at where XML came from.

You may find yourself in the situation that you are given a pile of XML documents, possibly broken, and it is up to you to make sense of them. This post explains some tools that can form your first-aid kit for dealing with problem XML documents.

Shine like a star(let)

xmlstarlet is available from your friendly neighbourhood package manager or from the xmlstarlet website.

xmlstarlet is a command line toolkit that provides various XML-related helpers. For details on all the xmlstarlet tools, type:

xmlstarlet --help

Brock wrote recently about using xmlstarlet's select tool, which allows you to use XPath expressions to query your XML.

Viewing the element structure

Another handy xmlstarlet tool is the element structure viewer, which provides a friendly, XPath-style view into the XML document.

xmlstarlet el filename.xml

I tend to use the -u option, which shows only the unique lines:

xmlstarlet el -u filename.xml

There is also -a for attributes and -v for the attribute values as well.
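
To give a feel for what the unique element view contains, here is a rough sketch in Python using only the standard library. It is not how xmlstarlet works internally, and the function name unique_element_paths and the sample document are made up for illustration. It lists each distinct element path in first-seen order, so the ordering may differ from xmlstarlet's output.

```python
import xml.etree.ElementTree as ET


def unique_element_paths(xml_text):
    """Return the distinct slash-separated element paths, in first-seen order."""
    root = ET.fromstring(xml_text)
    seen = []

    def walk(element, prefix):
        # Namespaced tags appear in ElementTree's {uri}name form.
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        if path not in seen:
            seen.append(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return seen


# Hypothetical sample document, not taken from the post.
sample = "<library><book><title>A</title></book><book><title>B</title></book></library>"
for path in unique_element_paths(sample):
    print(path)
```

For the sample above this prints library, library/book and library/book/title, one per line, even though the book element occurs twice.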

Checking for well-formed XML documents

The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:

xmlstarlet val xmlfile.xml

It also has a number of options; the main one I have used is to validate against a Document Type Definition:

xmlstarlet val -d dtdfile.dtd xmlfile.xml
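
If you want the same kind of well-formedness check from inside a script, a minimal sketch with Python's standard library might look like the following. The function name check_well_formed is my own invention, and this only tests whether the document parses; ElementTree does not validate against a DTD, so it is not a replacement for the -d option.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The message includes the line and column where parsing failed.
        return False, str(error)
    return True, None


print(check_well_formed("<a><b>ok</b></a>"))
print(check_well_formed("<a><b>broken</a>"))
```

The second call reports failure because the b element is never closed before the closing a tag.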

Tidying up your XML files

Sometimes programs output really ugly looking XML. So when you have made sure your document is well-formed with xmlstarlet, you might want to tidy it up a bit before letting anyone else see it.

Xmltidy is a handy little Java program that loads your XML document into memory and then outputs it in a nice looking form with linebreaks and indentation.

This is especially useful when you have a collection of XML files that are referencing each other. Xmltidy will combine them into a nice looking XML document.

Download the jar file from the xmltidy homepage, and then run:

java -jar xmltidy.jar --input oldfile.xml --output newfile.xml
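
As a rough illustration of the re-indenting idea (not of Xmltidy itself), here is a sketch using Python's standard library. It assumes Python 3.9 or later for ET.indent, the name tidy_xml is hypothetical, and unlike the description of Xmltidy above it does not follow references to other files. ElementTree also drops comments and the XML declaration when parsing this way.

```python
import xml.etree.ElementTree as ET


def tidy_xml(xml_text):
    """Re-indent an XML string with two spaces per nesting level."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space="  ")  # added in Python 3.9
    return ET.tostring(root, encoding="unicode")


print(tidy_xml("<a><b>one</b><c><d>two</d></c></a>"))
```

Each nested element ends up on its own line, indented by depth, while the text inside leaf elements is left alone.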

Dealing with Unicode problems

Some of the most annoying problems with XML files occur when a file's encoding is not valid UTF-8 and some program rejects the file.

I found a really nice package called uniutils, which is again available from your friendly neighbourhood package manager or from the uniutils website.

Like xmlstarlet, this gives you various utilities; however, the main one I use it for is to check whether my XML files are valid UTF-8. It gives useful error messages when a file is not valid UTF-8. You can then check the file in a text editor and/or hex viewer (e.g. Ghex) to see what the problem is. So to validate an XML file, we simply go:

uniname -V filename.xml

If it contains bytes that are not valid UTF-8, you will receive errors such as:

Invalid UTF-8 code encountered at line 215, character 115037, byte 115036. The first byte, value 0x82, with bit pattern 10000010, is not a valid first byte of a UTF-8 sequence because its high bits are 10.

So the byte with hex value 0x82 cannot start a character in the UTF-8 encoding. In Emacs you can look at the character by typing:

M-x goto-char 115037

Or you can open your hex editor. In Ghex, you can go to the Edit menu and use the "Goto byte" feature to jump to the problem character. For example, if the byte number was 119, it would look like this:

http://media.commandline.org.uk/images/posts/gnome/ghex.png

That works for one character. If we want to recursively check all XML files within a directory, we can use find:

find . -name '*.xml' -print -exec uniname -V {} \;
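
The same kind of recursive check can be sketched in Python with just the standard library, which may be handy if uniutils is not installed. The names first_invalid_utf8 and check_tree are made up for this example. The offsets it reports are zero-based byte positions and will not necessarily match the numbering in uniname's messages.

```python
from pathlib import Path


def first_invalid_utf8(data):
    """Return (byte offset, byte value) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        # error.start is the zero-based offset where decoding failed.
        return error.start, data[error.start]
    return None


def check_tree(top):
    """Yield (path, offset, byte) for each *.xml file under top that fails to decode."""
    for path in sorted(Path(top).rglob("*.xml")):
        problem = first_invalid_utf8(path.read_bytes())
        if problem is not None:
            yield path, problem[0], problem[1]


# In-memory demonstration with a stray 0x82 byte at offset 6.
print(first_invalid_utf8(b"<a>caf\x82</a>"))
```

To scan the current directory you could loop over check_tree(".") and print each path along with the offset and byte value it yields.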

So now let's imagine we find that the files contain the invalid byte with hex value 0x82, as above. We might then want to replace it with a character or entity. The following use of find and sed replaces all occurrences of the byte 0x82 with C. Setting LC_ALL=C makes sed treat the files as raw bytes rather than as text in your locale's encoding:

find . -iname '*.xml' -exec env LC_ALL=C sed -i 's/\x82/C/g' {} \;
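
One thing to watch with a blanket byte replacement is that 0x82 can also appear as the second byte of a perfectly valid UTF-8 character. Here is a more cautious variation sketched in Python; it is a different technique from the sed command, and the name replace_invalid_byte is hypothetical. It replaces the byte only where it fails to decode, using the standard surrogateescape error handler, and it assumes the bad byte is 0x80 or above.

```python
def replace_invalid_byte(data, bad=0x82, replacement="C"):
    """Replace a byte only where it is not part of a valid UTF-8 sequence."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
    # while valid multi-byte sequences decode normally.
    text = data.decode("utf-8", errors="surrogateescape")
    text = text.replace(chr(0xDC00 + bad), replacement)
    # Any other undecodable bytes are written back unchanged.
    return text.encode("utf-8", errors="surrogateescape")


print(replace_invalid_byte(b"<a>caf\x82</a>"))
print(replace_invalid_byte(b"<a>\xc4\x82</a>"))
```

The first call swaps the stray byte for C. The second leaves the data untouched, because there the 0x82 is the tail of a valid two-byte character.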

This can help a lot, as most XML programs will reject files with inconsistent encoding.

Conclusion

These are my tips for dealing with a pile of broken XML files. If you have any tips or suggestions of your own, please share them by leaving a comment below.

In some future posts, we will look at using XML with Python, and with the Django web framework.

Thanks to Andy and Nick for help with this post. The title was based on Tommi Virtanen's fantastic EuroPython talk.

1 Andrew West says...

O.k. time to be nit picky (but not on the subject you expect).

All due respect to Virtanen, I think you'll find the original was Arthur C. Clarke, 2001 A Space Odyssey, "Oh my God! It's full of stars"

carry on.

Posted at 8:31 p.m. on September 22, 2008


2 David Jones says...

See also xkcd: http://xkcd.com/224/ My God! It's full of 'car's.

Posted at 2:25 a.m. on September 23, 2008


3 sikanrong says...

I totally just read 2063: Odyssey Two -- f**king amazing blog title

Posted at 9:25 p.m. on October 30, 2008


4 Mike says...

> The most useful xmlstarlet tool for me has been the XML validator,
> which tests whether your documents are well formed or not. You
> use the tool as follows:
> xmlstarlet val xmlfile.xml

Also check the xmlwf linux command - it does the same thing. It's in the 'expat' package.

Posted at 12:44 p.m. on November 29, 2008


5 Edward Garson says...

Note that xmlstarlet can also clean up xml with the `fo' switch, no need for xmltidy for that purpose.

Posted at 5:29 p.m. on October 6, 2009


About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology. It looks at our experiences of computing, especially using GNU/Linux, the Python programming language and the command line, along with issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
