My God, it's Full of XML

22 September 2008

In recent posts I looked at a native XML database called DBXML and at where XML came from.

You may find yourself in the situation that you are given a pile of XML documents, possibly broken, and it is up to you to make sense of them. This post explains some tools that can form your first-aid kit for dealing with problem XML documents.

Shine like a star(let)

xmlstarlet is available from your friendly neighbourhood package manager or from the xmlstarlet website.

xmlstarlet is a command line toolkit that provides various XML-related helpers. For details on all the xmlstarlet tools, type:

xmlstarlet --help

Brock wrote recently about xmlstarlet's select tool, which lets you use XPath expressions to query your XML.

Viewing the element structure

Another handy xmlstarlet tool is the element structure viewer, which provides a friendly, XPath-style view into the XML document.

xmlstarlet el filename.xml

I tend to use the -u option, which only shows the unique lines:

xmlstarlet el -u filename.xml

There is also -a, which shows attributes as well, and -v, which shows the attributes and their values.
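To give a feel for what the viewer prints, here is a small sketch. The sample document, its element names and the helper name el_demo are all made up for illustration, and it assumes xmlstarlet is installed.

```shell
# el_demo: write a tiny sample document to a temporary file and print
# its unique element paths. The document and the function name are
# illustrative assumptions, not part of xmlstarlet.
el_demo() {
    tmp=$(mktemp) || return 1
    cat > "$tmp" <<'EOF'
<?xml version="1.0"?>
<catalog>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</catalog>
EOF
    xmlstarlet el -u "$tmp"
    status=$?
    rm -f "$tmp"
    return $status
}

# el_demo prints:
#   catalog
#   catalog/book
#   catalog/book/title
```

Without -u, the book and title paths would be listed once per occurrence, which gets noisy quickly on a large document.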

Checking for well-formed XML documents

The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:

xmlstarlet val xmlfile.xml

It also has a number of options; the main one I have used validates against a Document Type Definition:

xmlstarlet val -d dtdfile.dtd xmlfile.xml
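When the pile is large, it can help to list only the files that fail. The following is a minimal sketch, assuming xmlstarlet is installed and relying on the result code of val with -q (quiet) and -w (well-formedness only); list_broken_xml is a hypothetical helper name.

```shell
# list_broken_xml [DIR]: print the path of every *.xml file under DIR
# (default: current directory) that is not well-formed.
# The helper name is an assumption made for this example.
list_broken_xml() {
    command -v xmlstarlet >/dev/null 2>&1 || {
        echo "xmlstarlet not found" >&2
        return 127
    }
    find "${1:-.}" -type f -name '*.xml' -exec sh -c '
        for f in "$@"; do
            # -q suppresses the per-file listing, so only the exit status is used
            xmlstarlet val -w -q "$f" 2>/dev/null || printf "%s\n" "$f"
        done' sh {} +
}
```

Running list_broken_xml . from the top of the collection gives a to-do list of files to open and repair.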

Tidying up your XML files

Sometimes programs output really ugly looking XML. So when you have made sure your document is well-formed with xmlstarlet, you might want to tidy it up a bit before letting anyone else see it.

Xmltidy is a handy little Java program that loads your XML document into memory and then outputs it in a nice looking form with linebreaks and indentation.

This is especially useful when you have a collection of XML files that are referencing each other. Xmltidy will combine them into a nice looking XML document.

Download the jar file from the xmltidy homepage, and then run:

java -jar xmltidy.jar --input oldfile.xml --output newfile.xml

Dealing with Unicode problems

Some of the most annoying problems with XML files arise when a file's encoding is not valid UTF-8 and some program rejects it.

I found a really nice package called uniutils, which is again available from your friendly neighbourhood package manager or from the uniutils website.

Like xmlstarlet, this gives you various utilities; however, the main one I use is uniname, to check whether my XML files are valid UTF-8. It gives useful error messages when a file is not valid UTF-8. You can then check the file in a text editor and/or hex viewer (e.g. Ghex) to see what the problem is. So to validate an XML file, we simply go:

uniname -V filename.xml

If it contains bytes that are not valid UTF-8, you will receive errors such as:

Invalid UTF-8 code encountered at line 215, character 115037, byte 115036. The first byte, value 0x82, with bit pattern 10000010, is not a valid first byte of a UTF-8 sequence because its high bits are 10.
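The "high bits are 10" part of that message follows from how UTF-8 marks its bytes. As a rough sketch, the hypothetical helper below (utf8_byte_role is not part of uniutils) classifies a byte purely by its leading bits:

```shell
# utf8_byte_role HEX: report the role a byte can play in UTF-8, judged
# only by its high bits. HEX is given without a prefix, e.g. 82.
# Bytes C0, C1 and F5 upwards match a lead pattern here but never
# appear in valid UTF-8; this sketch does not check for that.
utf8_byte_role() {
    b=$((0x$1))
    if   [ "$b" -lt 128 ]; then echo "ascii"          # 0xxxxxxx
    elif [ "$b" -lt 192 ]; then echo "continuation"   # 10xxxxxx
    elif [ "$b" -lt 224 ]; then echo "lead-2"         # 110xxxxx
    elif [ "$b" -lt 240 ]; then echo "lead-3"         # 1110xxxx
    elif [ "$b" -lt 248 ]; then echo "lead-4"         # 11110xxx
    else echo "invalid"
    fi
}
```

Here utf8_byte_role 82 prints continuation: such a byte may only follow a lead byte, so finding it at the start of a sequence is an error.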

So the byte with hex value 0x82 cannot start a character in the UTF-8 encoding. In Emacs you can look at the character by typing

M-x goto-char 115037

Or you can open your hex editor. In Ghex, you can go to the Edit menu and use the "Goto byte" feature to jump to the problem character. For example, if the byte number was 119, then you can go:

http://media.commandline.org.uk/images/posts/gnome/ghex.png
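If you would rather stay on the command line, od can show the same bytes. This is only a sketch: dump_bytes_at is a hypothetical helper name, and since the post does not say whether uniname counts bytes from zero or from one, it is safest to start the dump a few bytes early.

```shell
# dump_bytes_at FILE OFFSET [COUNT]: hex-dump COUNT bytes (default 16)
# of FILE, starting at the zero-based byte OFFSET.
# -A d prints addresses in decimal so they line up with the reported offset.
dump_bytes_at() {
    od -A d -t x1 -j "$2" -N "${3:-16}" "$1"
}
```

For the error shown earlier, something like dump_bytes_at filename.xml 115030 16 would show the suspect byte together with its neighbours.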

That works for one file. If we want to recursively check all XML files within a directory, we can use find:

find . -name '*.xml' -print -exec uniname -V {} \;
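The find command above prints every file name along with any errors. If you only want the names of the offending files, one option is to swap in iconv, which is not what this post uses, and rely on it failing when it meets invalid input. A sketch, with list_non_utf8 as a hypothetical helper name:

```shell
# list_non_utf8 [DIR]: print the *.xml files under DIR (default: current
# directory) that iconv cannot read as UTF-8.
# Converting UTF-8 to UTF-8 changes nothing, but iconv exits non-zero
# as soon as it hits an invalid sequence.
list_non_utf8() {
    find "${1:-.}" -type f -name '*.xml' -exec sh -c '
        for f in "$@"; do
            iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done' sh {} +
}
```

This only tells you which files are bad; uniname -V is still the tool to reach for when you need the line and byte position.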

So now let's imagine we find that the files contain the invalid byte with hex value x82, as above. We might then want to replace it with a character or entity. The following use of find and sed replaces all occurrences of the byte x82 with C:

find . -iname '*.xml' -exec sed -i 's/\x82/C/g' {} \;

This can help a lot as most XML programs will reject files with inconsistent encoding.
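Since sed -i rewrites files in place, it is worth trying the substitution on a throwaway file first. The sketch below assumes GNU sed (the \x82 escape and -i are GNU features); replace_byte_demo is a hypothetical name, and LC_ALL=C is added so that sed treats the input as raw bytes rather than as text in your locale.

```shell
# replace_byte_demo: create a file containing the byte 0x82, replace that
# byte with C, and print the result. Illustrative only.
replace_byte_demo() {
    tmp=$(mktemp) || return 1
    printf 'caf\202\n' > "$tmp"               # \202 is octal for 0x82
    LC_ALL=C sed -i.bak 's/\x82/C/g' "$tmp"   # original kept as $tmp.bak
    cat "$tmp"
    rm -f "$tmp" "$tmp.bak"
}

# replace_byte_demo prints:
#   cafC
```

Giving -i a suffix such as .bak keeps a copy of each original file, which is a cheap safety net when running the replacement across a whole directory.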

Conclusion

These are my tips for dealing with a pile of broken XML files. If you have any tips or suggestions of your own, please share them by leaving a comment below.

In some future posts, we will look at using XML with Python, and with the Django web framework.

Thanks to Andy and Nick for help with this post, and the title was based on Tommi Virtanen's fantastic Europython talk.


1 Andrew West says...

O.k. time to be nit picky (but not on the subject you expect).

All due respect to Virtanen, I think you'll find the original was Arthur C. Clarke, 2001 A Space Odyssey, "Oh my God! It's full of stars"

carry on.

Posted at 8:31 p.m. on September 22, 2008


2 David Jones says...

See also xkcd: http://xkcd.com/224/ My God! It's full of 'car's.

Posted at 2:25 a.m. on September 23, 2008


3 sikanrong says...

I totally just read 2063: Odyssey Two -- f**king amazing blog title

Posted at 9:25 p.m. on October 30, 2008


4 Mike says...

> The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:
> xmlstarlet val xmlfile.xml

Also check the xmlwf linux command - it does the same thing. It's in the 'expat' package.

Posted at 12:44 p.m. on November 29, 2008


5 Edward Garson says...

Note that xmlstarlet can also clean up xml with the `fo' switch, no need for xmltidy for that purpose.

Posted at 5:29 p.m. on October 6, 2009



About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology, it looks at our experiences of computing; especially using GNU/Linux, the Python programming language, the command-line and issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
