My God, it's Full of XML

22 September 2008

In recent posts I looked at a native XML database called DBXML and we looked at where XML came from.

You may find yourself in the situation that you are given a pile of XML documents, possibly broken, and it is up to you to make sense of them. This post explains some tools that can form your first-aid kit for dealing with problem XML documents.

Shine like a star(let)

xmlstarlet is available from your friendly neighbourhood package manager or from the xmlstarlet website

xmlstarlet is a command line tookit that provides various different XML related helpers. For details on all the xmlstarlet tools, type:

xmlstarlet --help

Brock wrote recently about using xmlstarlet's select tool that allows you get use XPATH expressions to query your XML.

Viewing the element structure

Another handy xmlstarlet tool is the element structure viewer, this provides a friendly, xpath style view into the XML document.

xmlstarlet el filename.xml

This I tend to use the -u option which only shows the unique lines:

xmlstarlet el -u filename.xml

There is also -a for attributes and -v for the attribute values as well.

Checking for well-formed XML documents

The most useful xmlstarlet tool for me has been the XML validator, which tests whether your documents are well formed or not. You use the tool as follows:

xmlstarlet val xmlfile.xml

It also has a number of options, the main one I have used is to validate against a Document Type Definition:

xmlstarlet val -d dtdfile.dtd xmlfile.xml

Tidying up your XML files

Sometimes programs output really ugly looking XML. So when you have made sure your document is well-formed with xmlstarlet, you might want to tidy it up a bit before letting anyone else see it.

Xmltidy is a handy little Java program that loads your XML document into memory and then outputs it in a nice looking form with linebreaks and indentation.

This is especially useful when you have a collection of XML files that are referencing each other. Xmltidy will combine them into a nice looking XML document.

Download the jar file from the xmltidy homepage, and then run:

java -jar xmltidy.jar --input oldfile.xml --output newfile.xml

Dealing with Unicode problems

Some of the most annoying problems with XML files can be when the files encoding is not valid UTF-8 and some program is rejecting XML files.

I found a really nice package called uniutils, which is again available from your friendly neighbourhood package manager or from the uniutils website.

Like xmlstarlet, this gives you various utilities, however the main one I use it for is to check whether my XML files are valid UTF-8 unicode. It gives useful error messages when a file is not unicode. you can then check the file in a text editor and/or hex viewer (e.g. Ghex) to see what the problem is. So to validate an XML file, we simply go:

uniname -V filename.xml

If it has non-unicode characters, you will receive errors such as:

Invalid UTF-8 code encountered at line 215, character 115037, byte 115036. The first byte, value 0x82, with bit pattern 10001100, is not a valid first byte of a UTF-8 sequence because its high bits are 10.

So the character with hex value x82 is not a valid character in the UTF-8 encoding. In Emacs you can look at the character by typing

M-x goto-char 115037

Or you can open your hex editor. In Ghex, you can go to the edit menu and use the "Goto byte" feature to the problem character, for example, if the byte number was 119, then you can go:

http://media.commandline.org.uk/images/posts/gnome/ghex.png

That works for one character. If we want to recursively check all XML files within a directory, we can use find:

find . -name '*.xml' -print -exec uniname -V {} \;

So now lets imagine we find that the files have a non-unicode character with the hex value x82 as above, then we might want to replace it with a characters or entity, the following use of find and sed replaces all occurrences of the hex x82 with C:

find . -iname '*.xml' -exec sed -i 's/\x82/\C/g' {} \;

This can help a lot as most XML programs will reject files with inconstant encoding.

Conclusion

These are my tips for dealing with a pile of XML broken files. if you have any tips or suggestions of your own, please share them by leaving a comment below.

In some future posts, we will look at using XML with Python, and with the Django web framework.

Thanks to Andy and Nick for help with this post, and the title was based on Tommi Virtanen's fantastic Europython talk.

If you are a Digg fan, give it some lovin!

1 Andrew West says...

O.k. time to be nit picky (but not on the subject you expect).

All due respect to Virtanen, I think you'll find the original was Arthur C. Clarke, 2001 A Space Odyssey, "Oh my God! It's full of stars"

carry on.

Posted at 8:31 p.m. on September 22, 2008


2 David Jones says...

See also xkcd: http://xkcd.com/224/ My God! It's full of 'car's.

Posted at 2:25 a.m. on September 23, 2008


3 sikanrong says...

I totally just read 2063: Odyssey Two -- f**king amazing blog title

Posted at 9:25 p.m. on October 30, 2008


4 Mike says...

>The most useful xmlstarlet tool for me has been the XML validator, >which tests whether your documents are well formed or not. You >use the tool as follows: >xmlstarlet val xmlfile.xml

Also check the xmlwf linux command - it does the same thing. It's in the 'expat' package.

Posted at 12:44 p.m. on November 29, 2008


What do you have to say?

Show Editing Help

Europython

About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology, it looks at our experiences of computing; especially using GNU/Linux, the Python programming language, the command-line and issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!

Latest Discussions

gutes Qualitätscasino

July 3, 2009
The paragraph is the most basic block in a reST document. Paragraphs are simply chunks of text separated by one or more blank lines. As in Python, indentation is significant ...
An Introduction to ReStructuredText

sreejith

July 3, 2009
I want to download a file from remote server in binary format. Can anyone let me know the command to do so? Thanks in advance
PuTTY Series: Using PSFTP

jythlkedl;rg

July 2, 2009
????? ??? ????????? ?????? ?? ???????? ? ????????? ???, ? ??????? ??? ??????? ??????? ? ??? ?? ???? ?? ?? ????? ???????????????? ??????????????????. ??????????????????? ????? ?? ?? ????, ?, ???, ...
Burning an iso to CD on Windows

gbi-service-ru

July 1, 2009
???? ?????????, ?????????? ?? ?? ? ???"??? ??????, ?????????? ??? ?? ???? ? ????, ? ? ?????. ?? ??? ???? ???? ??? ???. ?? ?????????? ???? ?? ?. ???????? ?, ...
Burning an iso to CD on Windows

seo techniques

July 1, 2009
I would like to thank you for the inforamtion you have put on this article no matter.
Only the penitent man will pass - on captchas and cotton wool

Online Craps lernen

July 1, 2009
I would like to thank you for the making these clarifications in such a detailed manner to rebuilt the communication and enhancing the strategies of the organization which could be ...
Disclaimer: NO WARRANTY

ZK@Web Marketing Blog

July 1, 2009
Django is an amazing web framework; we built a lot of features in a very short period of time and Django [mostly] stayed out of our way. Last night as ...
Baby Steps with Django - Part 4 Django Applications and flow

overnight payday loans

July 1, 2009
I found commandline.org.uk very informative. The article is professionally written and I feel like the author knows the subject very well. commandline.org.uk keep it that way.
Only the penitent man will pass - on captchas and cotton wool

Drogo

June 30, 2009
Gotta agree with your sentiments about many modern games. The cost of a new game is prohibitive, especially for consoles (although I've noticed that PS2 games have crashed in price ...
Retro British Gaming - Part 3: Amstrad CPC Games

pppiohooddd

June 29, 2009
Free vadult video site! http://crech.us/ 1000 free video every day!
OpenSolaris, Gobuntu, and be careful who you kiss

Tesyimasystus

June 29, 2009
...Love this dude!!! http://www.esnips.com/doc/79c22395-7bd6-4299-92db-cf392e381698/kutiman---this-is-what-it-became Peace
5 Homebrew Python Games

Simon Tite

June 28, 2009
twitterfall is still there, I just tried it, and to me it beats Visible Tweets hands down. Problem with Visible Tweets: * Extremely **irritating** animations! (There are three available, but ...
Visualising your favourite keywords in Twitter

piffAltetle

June 28, 2009
??? ??? ???? ???????????? ??????,?????????? ???? ?????? ??????????? ???????,??????????? ????? mp3,??????? ??????????? ??????.
Encrypt your /home this Christmas: part three - moving your data to the encrypted partition

idhyougjdsyhfr

June 26, 2009
SMS Trap is something that never fails to help you get your partner off guard? Our software will make reading other people?s SMS as easy as ABC. Ready for some ...
Burning an iso to CD on Windows

Sozdanie-saitov-com

June 26, 2009
???? ???????? ? ?????????????????? ????????????? - ??? - ???? ?? ????? ?????. ???? ?? ??? ? ??????? ?????? ????????? ?????! ???, ?????23126 sozdanie-saitov.com@mail.ru
Burning an iso to CD on Windows

gameskillz

June 26, 2009
Killzone 2 - the best PS3 game yet?Still LittleBigPlanet for me, but Sony's new shooter is mightily impressive. What you think about my web? http://www.easyfaxlesspaydayloan.com/payday-loans-online.html
Email Syntax Check in Python

Anish

June 25, 2009
hey Moritz, Check this http://commandline.org.uk/python/my-merry-five-minutes-with-bazaar/
Setting up a bazaar server

gbi zavod 177

June 24, 2009
???? ?????????, ?????????? ?? ?? ? ???"??? ??????, ?????????? ??? ?? ???? ? ????, ? ? ?????. ?? ??? ???? ???? ??? ???. ?? ?????????? ???? ?? ?. ???????? ?, ...
Burning an iso to CD on Windows

vettone

June 24, 2009
??? ????? ????? ????,??????? ?? ???,????? ???? ???.????? ?? ??????????? ????,?????????? ?????? ???????? ?? ????.???? ????????: http://euro-football.ucoz.com ????? ???? ??????????.
Burning an iso to CD on Windows

tuegjhg78kjfhuey

June 23, 2009
? ???????????????? ???? ??? ???, ?? ?????? ?? ?????????, ???, ???????????????? ??? ??????????, ???? ????? ??? ??? http://remont.ucoz.ua/
Burning an iso to CD on Windows