2007-05-23

Parsing simple XML files in python using etree

So, I'm relatively new to the worlds of python and xml, and as such haven't quite figured everything out, as was demonstrated earlier today.

I needed to parse a really simple XML file, storing all the tags underneath the root tag as items in a dictionary - basically parsing an xml options file.

So, let's say our xml file looks like this:

<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>



Pretty simple, right? I figured, hey, this can't possibly be a pain. I'm sure python has some great XML parsing libraries available.

Some basic internet searching led me to xml.parsers.expat. In fact, this was what most of the "parse xml with python" examples I could find were using, so instead of looking through the rest of the available xml libraries, I started using expat.

It was frustrating.

Now, I've made xml parsers in php before, following the same basic model as expat (probably using expat) - you create a parser, and set some functions to get called whenever a tag is started, whenever a tag ends, or whenever character data is found.
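To make that callback model concrete, here's a minimal sketch using xml.parsers.expat directly. The trace_events function and the events list are names made up for this illustration; ParserCreate and the three handler attributes are expat's own.

```python
import xml.parsers.expat


def trace_events(xml_text):
    """Return the (event, value) pairs expat reports while parsing xml_text."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls these whenever a tag starts, a tag ends, or text is found
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return events


for event in trace_events("<options><option1>foo</option1></options>"):
    print(event)
```

Two things worth knowing about this model: whitespace between tags is reported as character data just like real text, and expat is allowed to hand you one run of text in several pieces.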

I made a simple xmlParser base class to derive specific parsers from, subclassed it, and set the appropriate handler functions, which included setting a flag to say we were inside the root element, recording the tag currently being parsed, and adding to the dictionary. Here's the subclass:


# xmlParser is the expat-based base class described above (not shown here)
from xmlParser import *
import time

class optionsParser(xmlParser):
    """
    This parser is used to generate a dictionary of overall options
    for our script.
    """
    def __init__(self, xmlFile):
        xmlParser.__init__(self, xmlFile)
        self.inOptions = False
        self.curTag = " "
        self.options = {}

    def handleCharacterData(self, data):
        # print "Handling: ", data
        if self.inOptions and self.curTag != "options":
            self.options[self.curTag] = data
        # time.sleep(1)

    def handleStartElement(self, name, attributes):
        # print "Starting: ", name
        if name == "options":
            self.inOptions = True
        if self.inOptions:
            self.curTag = name
        # time.sleep(1)

    def handleEndElement(self, name):
        # print "Ending: ", name
        if name == "options":
            self.inOptions = False
        self.curTag = ""
        # time.sleep(1)

    def getOptions(self):
        return self.options



The commented-out print and sleep statements were there so I could watch it parse. The first time I ran it, I noticed that it was outputting a bunch of lines like this:


Handling:
Handling:
Starting: option1
Handling: foo
Handling:
Ending: option1


And so on: lots of empty character data, which I'm assuming was the newlines and indentation between the tags. Then, if I printed out the dictionary, it looked like this:

:

option1 : foo
option2 : bar
option3 : zip


This annoyed the crap out of me. I figured out that it was because I was resetting curTag to "" at the end of each element, so the whitespace that came after it got stored under an empty key. But if I didn't reset it, that trailing whitespace overwrote the value I had just stored for the tag.

My options were to set another flag, "tagNameAlreadyProcessed" or similar, or change the

    if self.inOptions and self.curTag != "options":

line to read

    if self.inOptions and self.curTag != "options" and self.curTag != "":


That worked, but damn is it ugly. I knew there had to be a better way.
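For comparison, here's a sketch of a different way to tame the whitespace while staying on expat: buffer the character data and only store it when the element closes. This is not the code from my script, just an illustration; parse_options, root_tag, stack and chunks are all made-up names, and stripping the values is an assumption that's only fine if your options never need leading or trailing spaces.

```python
import xml.parsers.expat


def parse_options(xml_text, root_tag="options"):
    """Collect the text of each direct child of root_tag into a dict."""
    options = {}
    stack = []   # names of the elements that are currently open
    chunks = []  # character data seen since the last start or end tag

    def start(name, attrs):
        stack.append(name)
        chunks.clear()

    def chars(data):
        # expat may deliver one run of text in several pieces, so buffer them
        chunks.append(data)

    def end(name):
        stack.pop()
        # After the pop, only the root is left open for its direct children
        if stack == [root_tag]:
            options[name] = "".join(chunks).strip()
        chunks.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return options


SAMPLE = """<options>
  <option1>foo</option1>
  <option2>bar</option2>
  <option3>zip</option3>
</options>"""

print(parse_options(SAMPLE))
```

Because the value is only written at the end tag, the whitespace between elements never reaches the dictionary, and no empty-string key shows up.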

I posted to the python-list explaining what was happening (python-list rocks, by the way. If you use python and aren't subscribed, do it now) and an awesome fellow by the name of Steven Bethard pointed me to xml.etree.ElementTree.

etree is way easier to use, at least for simple xml parsing like I needed to do here. I haven't tried to use it for anything remotely complex, but for this problem it works like a charm. Compare the above code to the equivalent with etree:


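from xml.etree import ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

# iter() walks the root and every element beneath it
# (older versions called this getiterator(), removed in Python 3.9)
for child in optionsXML.iter():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text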

Yeah, that's it. 6 lines of code, and better suited to being dropped into a class as a method. Short and sweet.

Nerdgasm.
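If you want the etree version as something reusable, a minimal sketch might look like this. The load_options name and the SAMPLE string are made up for the example, it reads from a string with fromstring rather than from a file so it runs on its own, and it only looks at the direct children of the root, which is all an options file like the one above needs.

```python
from xml.etree import ElementTree as etree


def load_options(xml_text):
    """Return {tag: text} for each direct child of the root element."""
    root = etree.fromstring(xml_text)
    # Iterating over an element yields its direct children, in document order
    return {child.tag: child.text for child in root}


SAMPLE = """<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>"""

options = load_options(SAMPLE)
print(options["option2"])
```

To read from a file instead, swap fromstring for etree.parse("options.xml").getroot().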

Lessons learned:

  • Look over all available options before choosing one (Christ, I feel stupid)

  • Use etree with python, at least for simple parsing tasks

  • python-list is your friend (we already knew that, right?)