2007-05-23

Parsing simple XML files in python using etree

So, I'm relatively new to the worlds of python and xml, and as such haven't quite figured everything out, as was demonstrated earlier today.

I needed to parse a really simple XML file, storing all the tags underneath the root tag as items in a dictionary - basically parsing an xml options file.

So, let's say our xml file looks like this:

<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>



Pretty simple, right? I figured, hey, this can't possibly be a pain, I'm sure python has some great XML parsing libraries available.

Some basic internet searching led me to xml.parsers.expat. In fact, this was what most of the "parse xml with python" examples I could find were using, so instead of looking through the rest of the available xml libraries, I started using expat.

It was frustrating.

Now, I've made xml parsers in php before, following the same basic model as expat (probably using expat) - you create a parser, and set some functions to get called whenever a tag is started, whenever a tag ends, or whenever character data is found.
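
To make that model concrete, here is a minimal sketch of the callback style using xml.parsers.expat directly. The handler function names, the events list and the inline sample document are made up for illustration; only ParserCreate and the three handler attributes come from the library.

```python
import xml.parsers.expat

# Hypothetical sample document, shortened from the options file above.
XML = """<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
</options>
"""

events = []  # every callback appends a (kind, value) pair here


def start_element(name, attributes):
    events.append(("start", name))


def end_element(name):
    events.append(("end", name))


def character_data(data):
    # Called for ALL text, including the newlines between tags.
    events.append(("data", data))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = character_data
parser.Parse(XML, True)  # True marks this as the final chunk of input

for kind, value in events:
    print(kind, repr(value))
```

Printing the values with repr() makes the whitespace-only character data visible, which matters for what follows.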

I made a simple xmlParser class to inherit specified parsers from, inherited from it, set appropriate functions - which included setting a flag to say we were inside the root element, setting the value of the current tag being parsed, adding to the dictionary. . . here's the inherited class:


from xmlParser import *
import time

class optionsParser(xmlParser):
    """
    This parser is used to generate a dictionary of overall options
    for our script
    """
    def __init__(self, xmlFile):
        xmlParser.__init__(self, xmlFile)
        self.inOptions = False
        self.curTag = " "
        self.options = {}

    def handleCharacterData(self, data):
        #print "Handling: " , data
        if self.inOptions and self.curTag != "options":
            self.options[self.curTag] = data
        #time.sleep(1)

    def handleStartElement(self, name, attributes):
        # print "Starting: ", name
        if name == "options":
            self.inOptions = True
        if self.inOptions:
            self.curTag = name
        # time.sleep(1)

    def handleEndElement(self, name):
        # print "Ending: ", name
        if name == "options":
            self.inOptions = False
        self.curTag = ""
        # time.sleep(1)

    def getOptions(self):
        return self.options



The commented out print statements and sleep statements were so I could watch it parse. The first time I ran it, I noticed that it was outputting a bunch of


Handling:
Handling:
Starting: option1
Handling: foo
Handling:
Ending: option1


And so on - lots of empty stuff, I'm assuming they were new lines in the text. Then, if I was to print out the dictionary, it would look like this:

:

option1 : foo
option2 : bar
option3 : zip


This annoyed the crap out of me. I figured out that it was because I was setting the curTag value to "" after processing the first set of character data under a tag - but if I didn't do that, then the tags got overwritten with blank space.

My options were to set another flag, "tagNameAlreadyProcessed" or similar, or change the
if self.inOptions and self.curTag != "options":

line to read
if self.inOptions and self.curTag != "options" and self.curTag != "":


That worked, but damn is it ugly. I knew there had to be a better way.
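
For comparison, a third way to handle this while staying with expat is to collect the character data for each tag in a buffer and only store it when the tag closes. This is a minimal sketch and not the code from this post: BufferedOptionsParser and its method names are hypothetical, and it assumes the options are flat tags directly under the root. Buffering also copes with expat delivering one tag's text in several pieces, which it is allowed to do.

```python
import xml.parsers.expat


class BufferedOptionsParser:
    """Hypothetical helper: collects the text of each tag under the root."""

    def __init__(self, root_tag="options"):
        self.root_tag = root_tag
        self.options = {}
        self.chunks = None  # None means "not inside an option tag"

    def start(self, name, attributes):
        if name != self.root_tag:
            self.chunks = []  # start a fresh buffer for this tag

    def data(self, text):
        if self.chunks is not None:
            self.chunks.append(text)  # text may arrive in pieces

    def end(self, name):
        if self.chunks is not None:
            # Join the pieces and drop surrounding whitespace.
            self.options[name] = "".join(self.chunks).strip()
            self.chunks = None

    def parse(self, xml_text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.CharacterDataHandler = self.data
        parser.EndElementHandler = self.end
        parser.Parse(xml_text, True)
        return self.options


xml_text = "<options>\n<option1>foo</option1>\n<option2>bar</option2>\n</options>"
print(BufferedOptionsParser().parse(xml_text))
```

Because whitespace between tags arrives while no buffer is open, it is simply ignored, so no extra flag or empty-string check is needed.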

I posted to the python-list explaining what was happening (python-list rocks by the way - if you use python and aren't subscribed, do it now) and an awesome fellow by the name of Steven Bethard pointed me to xml.etree.ElementTree.

etree is way easier to use, at least for simple xml parsing like I needed to do here. I haven't tried to use it for anything remotely complex, but for this problem it works like a charm. Compare the above code to the equivalent with etree:


import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

# iter() replaces getiterator(), which was removed in Python 3.9
for child in optionsXML.iter():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

Yeah, that's it. 6 lines of code. It would be better off wrapped in a function or a class method, but it's short and sweet.
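
As a rough illustration of the function idea, here is one way it might look. The name load_options is made up, and unlike the loop above this sketch only looks at direct children of the root and strips whitespace from the text, so treat it as a variation under those assumptions.

```python
import io
import xml.etree.ElementTree as ET


def load_options(source):
    """Return {tag: text} for each direct child of the root element.

    `source` can be a filename or a file-like object, as accepted by ET.parse.
    """
    root = ET.parse(source).getroot()
    # child.text is None for empty tags, so fall back to an empty string.
    return {child.tag: (child.text or "").strip() for child in root}


# Usage with an in-memory file so the example runs without options.xml.
sample = io.BytesIO(
    b"<options><option1>foo</option1><option2>bar</option2></options>"
)
print(load_options(sample))
```

With a real file you would call load_options("options.xml") instead.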

Nerdgasm.

Lessons learned:

  • Look over all available options before choosing one (Christ, I feel stupid)

  • Use etree with python, at least for simple parsing tasks

  • python-list is your friend (we already knew that, right?)

30 comments:

  1. You should also check out amara. I have found it is just as simple to use as etree and provides a more pythonic API. One handy aspect that is very nice is being able to use XPath. for example:

    doc = amara.parse('<hello><world/></hello>')
    worlds = doc.xml_xpath(u'//world')

    My biggest reason for using amara is namespaces. ElementTree and friends are technically correct, but Amara (and 4Suite) do a better job getting the namespaces and prefixes you want.

    You can get amara with easy_install or from here.

    Good luck!

  2. I'll check it out - I prefer to use tools that are in the standard library if they're available though.

  3. Here's a one-liner (not counting the import):

    options = dict((e.tag, e.text) for e in etree.parse("options.xml").getroot())
