• Zeth will be attending PyCon UK on the 12th to 14th September 2008.

Native XML storage with Berkeley DB XML - part one

14 August 2008

XML databases have not really broken through yet in a big way, primarily because SQL has proved more resilient than expected in storing a wide range of data.

SQL has limits

For example, object-relational mappers such as SQLAlchemy, the one inside Django and ActiveRecord in Ruby, have become very popular recently, as they all you to serialise in-memory objects and store them in a database. Very little SQL experience is required and you can create very elaborate relational storage.

However, when you have data that is not relational at all, converting it to rows and columns can be either lossy or a lot of work, and often you end up unable to round-trip, i.e. you can get the data in, but you cannot get it out again in the original format.

The most common approach is not to even try, i.e. often people will not store XML in a database, but run a series of XML files though some XSLT stylesheets and then store the compiled HTML in a database. This removes XML's advantage as a processable exchange format. In order to get the data in a different way, the human has to go back to the beginning of the pipeline and rerun the complication and database import. If you want to allow other computers to make flexible requests over the Internet, well then you are out of luck.

Meanwhile, a lot of the XML extensions to SQL databases are still relational underneath, for example, with MS SQL, you can give it XML, which it then splits up into rows and columns; you can then also ask for the data back in XML, which it then retrieves from the rows and columns and outputs in XML. You have not actually gained very much however, as your data is still stored in rows and columns, you have to output the data before you can process or query the XML. This can prove very long-winded with highly complex hierarchical data, it can sometimes prove faster to just dump the XML to disk yourself.

Even worse is storing a whole XML document within a single SQL field in a single row. Again, this is using a database instead of a hard disk.

Elliotte Harold's article, Managing XML data: Native XML databases is a few years old so the software examples he mentions are probably a little old, however the discussion of why one might want an XML database is quite good, this interview with Jonathan Robie is also worth browsing.

Berkeley DB

Berkeley DB, more commonly BDB or just DB, is the classic embedded database, Started in 1991, it is used more or less everywhere. Despite some recent completion from SQLite, it is still by far the leading embedded database. It bundled inside in everything: Linux, BSD, OS X, OpenOffice, Python, Apache, Sendmail, Postfix, subversion, GNOME and no doubt millions of other things.

DB, like much of software that powers the world, was started at the University of California, Berkeley. It was then spun off in Sleepycat Software, which was bought by Oracle a few years ago. Note DB has nothing to do with Oracle's eponymous relational database product. They are at the complete other end of the spectrum, Oracle database is for large corporate data storage, while DB is a fast and light embedded database.

DB is also non-relational, i.e. there is no SQL interface, there are key-value pairs stored in byte arrays and that is more or less it. If you want SQL or network interfaces, then you stick that on top, for example DB is one of the back-end options for MySQL.

Berkeley DB XML

The folks at Sleepycat/Oracle have been working on Berkeley DB XML for a couple of years, but it is not so well known as its older brother. DB XML is a layer on top of DB that forms a fully XML-native database with "XQuery-based access to documents stored in containers and indexed based on their content". (Source)

There are two good things about it. Firstly, it is an XML database written in C and C++, not in Java. Secondly, it is a proper XML database, not putting XML into SQL columns.

The other notable XML database is the Java-based eXist which seems to be more well known, primarily among Java XML developers. eXist is more of a service, rather than something light to embed into applications. It also rumoured to be significantly slower than DB XML, but I have never used eXist so I can't show any benchmarks for that.

There are also a load of proprietary and half-finished XML databases we do not care about. Including the abandoned Java XML database called dbXML which has no relation to the Berkeley on but confuses everyone because the sourceforge page is often the first search engine result.

Install Berkeley DB XML - Windows

Go over to the Oracle/Berkeley DB XML homepage and grab the installer. I have not used it so I have no idea what happens, hopefully it is straightforward.

Install Berkeley DB XML - Posix platforms

DB XML works on all major posix compatible systems, including Linux, OS X, BSD, Solaris and so on. Installing it varies according to distribution. If your distribution has packaged it, then lucky you, you just install it through that. Of the two distributions I usually use as examples on this site, Gentoo Linux and Ubuntu Linux, the former has a DB XML package, the latter currently does not.

Install Berkeley DB XML - Gentoo

sudo emerge dbxml

The package is new, so if you are running stable, then the emerge command will moan and tell you that the package and some of the dependencies are not yet in stable. You will need to add these to /etc/portage/package.unmask, see Using Masked Packages in the Gentoo Handbook for more details.

Install Berkeley DB XML - Ubuntu

Hopefully, in the near future, you will be able to go:

sudo apt-get install dbxml python-dbxml

But we are not there yet. When we are there, the rest of this post will be irrelevant. Therefore this following part will date badly, I will try to remember to update it as events unfold, but if you are reading this paragraph in 2009 or later, it means I have forgotten and it might be worth checking other sources.

What follows shows how important packagers are. Almost any complex package will need to be optimised for that distribution, and normally all the work is done for us behind the scenes. The pointy-heads in the Mozilla corporation that didn't want Firefox to be patched downstream in the distributions are not living in the real world. But that is another story. Let's get going.

Dependencies on Ubuntu

First, we need to get the dependencies. We need Berkeley DB, xqilla and libxerces, the latter two are not all yet in the default Ubuntu repositories, but my mate txwikinger has packaged them for us in his private archive. You need to add these lines to the bottom of your /etc/apt/sources.list with a comment so you remember why you put the lines there:

# Deps for DBXML
deb http://ppa.launchpad.net/txwikinger/ubuntu hardy main
deb-src http://ppa.launchpad.net/txwikinger/ubuntu hardy main

Then run: sudo apt-get update

sudo apt-get install libdb4.6++ libxerces28 xqilla

sudo apt-get install libxerces28-dev libxqilla-dev libdb4.6++-dev

Working directory

We need to have a working directory, in my examples we will assume ~/Sandbox, but use where you like.

mkdir ~/Sandbox/

cd ~/Sandbox/

Go over to the Oracle/Berkeley DB XML homepage and grab the tarball and extract it into your ~/Sandbox directory, or you can just use the commands:

cd ~/Sandbox

wget http://download.oracle.com/berkeley-db/dbxml-2.4.13.tar.gz

tar -xvvzf dbxml-2.4.13.tar.gz

Compile DB XML

The tarball from Oracle contains a lot of dependencies which we have already installed. So ignore the 'buildall' instructions, as they compile all the dependencies which takes all day. You then have half a dozen or so packages that are in your main system folders but are not managed by Apt, this is not really the done thing.

Now because we are doing it the proper Unix/Ubuntu way rather than the Oracle big-tarball-of-mud approach, we want to use the shared library of xqilla that we installed earlier. There is a slight problem in the configure file which stops it from working. Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/dist/configure

In line 4396, change .la to .so

So the line:

elif test `ls "$with_xqilla"/libxqilla*.la 2>/dev/null | wc -l` -gt 0 ; then

Becomes:

elif test `ls "$with_xqilla"/libxqilla*.so 2>/dev/null | wc -l` -gt 0 ; then

Now we can build DB XML

cd ~/Sandbox/dbxml-2.4.13/dbxml/build_unix

CFLAGS="-DSWIG_PYTHON_NO_USE_GIL" ../dist/configure --with-berkeleydb=/usr/lib/ --with-xqilla=/usr/lib/ --with-xerces=/usr/

If that works without error, you can type:

make

sudo make install

Now if that works without error, you can try to run dbxml:

cd /usr/local/BerkeleyDBXML.2.4/bin

./dbxml

Type help to see the list of commands. Press Ctrl+D to quit when you have had enough.

If you want, feel free to add the dbxml binaries to your path.

Python Bindings on Ubuntu

So far so good, now we need the Python bindings:

cd ~/Sandbox/dbxml-2.4.13/dbxml/src/python/

Now we have to do a little more patching to let the Python bindings know where dbxml is installed.

Open setup.py in your text editor We are going to add three lines, in each case make sure the indentation lines up with the line above it.

After line 18, add the following line:

db_xml_home = '/usr/local/BerkeleyDBXML.2.4'

After what is now line 65, add the following line:

INCLUDES.append(os.path.join(db_xml_home, "include"))

After what is now line 69, add the following line:

os.path.join(db_xml_home, "lib"),

Now in dbxml-2.4.13, there is another little bug we need to deal with first.

Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/dist/swig/dbxml_python.i

Go to line 533, and change 'Vaue' to 'Value'.

So change:

class XmlInvalidVaue(XmlException):

To:

class XmlInvalidValue(XmlException):

Now this file eas used to automatically generate the Python bindings, so the generated file needs to be fixed too. The dbxml developers need to generate them, but until they do, we can just fix it ourselves. Open the following file in your text editor:

~/Sandbox/dbxml-2.4.13/dbxml/src/python/dbxml.py

Go to line 121, as before change:

class XmlInvalidVaue(XmlException):

To

class XmlInvalidVaue(XmlException):

Now we can finally install the bindings:

CFLAGS="-DSWIG_PYTHON_NO_USE_GIL" python setup.py build

sudo python setup.py install

Now lets test it, which involves, yes you guessed it, more patches:

cd ~/Sandbox/dbxml-2.4.13/dbxml/examples/python/

Now edit examples.py using your text editor. Remove the 3 from line 9. So from:

from bsddb3.db import *

To:

from bsddb.db import *

Now you can go:

python examples.py 7

Which should give you:

Running example 7.
book1 = <book><title>Knowledge Discovery in Databases.</title></book>

If you have that then everything should be in working order. This has been a very long post so I will break here and come back to DB XML in Python another day.

1 lame says...

why don't you just provide a shell script instead of going on and on with silly details?

Posted at 10:04 a.m. on August 15, 2008


2 Andy says...

Possibly because the whole point is for Zeth to document and show his process so that others can learn?

They do say, ignorance is bliss, but I never understood that one myself.

Posted at 10:47 a.m. on August 15, 2008


3 John Snelson says...

Hi Zeth,

Thanks for the article, I'm glad you're finding DB XML useful. I'm one of the developers on DB XML - let me know if I can help you out with anything (first.last at oracle.com).

We've posted a link to your article here:

http://www.oracle.com/technology/products/berkeley-db/xml/index.html

John

Posted at 3:32 p.m. on August 15, 2008


4 Antonio Araujo says...

Dear friends, has anyone built debian packages of db xml 2.4.13?

Best regards

Antonio

Posted at 9:08 a.m. on September 2, 2008


5 Zeth says...

Now I tried it on a 64bit Ubuntu Linux machine.

When I got to this point:

cd /usr/local/BerkeleyDBXML.2.4/bin ./dbxml

I got an error:

./dbxml: error while loading shared libraries: libdb_cxx-4.6.so: wrong ELF class: ELFCLASS64

So I added the 32 bit shared object to /usr/lib32:

dpkg -x libdb4.6++_4.6.21-6ubuntu1_i386.deb ./ cd usr/lib sudo cp libdb_cxx-4.6.so /usr/lib32/

I did the same for

libxerces28_2.8.0-2ubuntu1~hardy~ppa3_i386.deb libxqilla4_2.1.2-0ubuntu1~hardy~ppa1_i386.deb libicu38_3.8-6_i386.deb

Then it worked. Not sure if there is a way to make it work without resorting to 32 bit.

Posted at 12:51 a.m. on September 26, 2008


6 Zeth says...

Going down the 32 bit approach on a 64 bit system is perhaps not the best plan because there is no end of things that have to be installed into the lib32 directory.

Therefore I will try to do a native 64 bit install instead on 64 bit systems.

Posted at 11:11 a.m. on September 26, 2008


7 Zeth says...

The method in the main post seems to work even on 64bit, no 32bit emulation required.

Posted at 12:23 p.m. on September 26, 2008


8 Rux says...

Thanks a lot mate for the tut. It has saved my back big time in school :D

Posted at 12:09 p.m. on October 6, 2008


9 Jez says...

First release was in 2003, so it's not the bright young thing you suggest. It is, in my admittedly not vast, experience a thoroughly solid bit of kit.

My admiration for John, btw, because of his work on XQuilla is little short of unbounded.

Posted at 9:59 a.m. on October 17, 2008


10 fmv says...

just a real db SAMPLE please

Posted at 10:36 a.m. on November 19, 2008


11 Giacomo says...

Er, "elif test ls "$with_xqilla"/libxqilla*.so 2>/dev/null | wc -l -gt 0 ; then" should now be "elif test ls "$with_xqilla"/libxqilla.so* 2>/dev/null | wc -l -gt 0 ; then", as the file is actually libxqilla.so.4 (at least in the recent packages i've just installed from that repository you mentioned).

Posted at 2:49 a.m. on November 29, 2008


What do you have to say?

Show Editing Help


PyCon UK

About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology, it looks at our experiences of computing; especially using GNU/Linux, the Python programming language, the command-line and issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!

Latest Discussions

Tringi

December 1, 2008
Hi, I am far from your league, but instead of [20. Nd5], why not just play Qd8? :-) Wouldn't it be only Qd8 Qd8 then, or am I missing something?
Ruy Lopez, Berlin defence, open variation part three

Tringi

November 30, 2008
...oh, I meant "Qe8 Qe8" in my previous post, sry ;-)
Ruy Lopez, Berlin defence, open variation part three

Cruze

November 29, 2008
Buy discount professional health products online.
Include ODF support in the Linux Standard Base?

Mike

November 29, 2008
>The most useful xmlstarlet tool for me has been the XML validator, >which tests whether your documents are well formed or not. You >use the tool as follows: >xmlstarlet val ...
My God, it's Full of XML

Giacomo

November 29, 2008
Er, "elif test `ls "$with_xqilla"/libxqilla*.so 2>/dev/null | wc -l` -gt 0 ; then" should now be "elif test `ls "$with_xqilla"/libxqilla.so* 2>/dev/null | wc -l` -gt 0 ; then", as the ...
Native XML storage with Berkeley DB XML - part one

Felipe Coury

November 23, 2008
What do I have to say? Only this: "THANK YOU"! Awesome!
SFTP in Python: Really Simple SSH

fmv

November 19, 2008
just a real db SAMPLE please
Native XML storage with Berkeley DB XML - part one

Very helpful

November 12, 2008
but i need more help. I'm have to execute the sudo command after I log in. What do I need to do to enter the password after the sudo command ...
SFTP in Python: Really Simple SSH

blz

November 12, 2008
I buy 99% of PEP8, except: I don't like the line spacing rules... I can't read the code when it's too close together - it looks congested and I can't ...
Twelve commandments for Beautiful Python code

Zeth

November 11, 2008
Hi Ioxs, I said above *"I will give an example of a standard directive, then an example of a third-party directive"*, so the image directive is the example of a ...
An Introduction to ReStructuredText

loxs

November 9, 2008
Hello, Are you sure about the sourcecode directive, because I didn't manage to make it work. And it doesn't work with the online renderers too.
An Introduction to ReStructuredText