I needed to parse a really simple XML file, storing all the tags underneath the root tag as items in a dictionary - basically parsing an xml options file.
So, let's say our xml file looks like this:
<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>
Pretty simple right? I figured, hey this can't possibly be a pain, I'm sure python has some great XML parsing libraries available.
Some basic internet searching led me to xml.parsers.expat. In fact, this was what most of the "parse xml with python" examples I could find were using, so instead of looking through the rest of the available xml libraries, I started using expat.
It was frustrating.
Now, I've made xml parsers in php before, following the same basic model as expat (probably using expat) - you create a parser, and set some functions to get called whenever a tag is started, whenever a tag ends, or whenever character data is found.
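For anyone who hasn't seen that model, here is a minimal sketch of it using xml.parsers.expat directly. The function name collect_events and the SAMPLE string are made up for illustration; only ParserCreate, the three handler attributes, and Parse come from the library. It just records every callback in the order it fires.

```python
import xml.parsers.expat

# Hypothetical sample input, a shortened copy of the options file above.
SAMPLE = """<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
</options>"""


def collect_events(xml_text):
    """Parse xml_text with expat and return the callbacks fired, in order."""
    events = []

    def start(name, attributes):
        events.append(("start", name))

    def end(name):
        events.append(("end", name))

    def chars(data):
        # Called for ALL character data, including whitespace between tags.
        events.append(("chars", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return events


events = collect_events(SAMPLE)
for kind, value in events:
    print(kind, repr(value))
```

Printing repr(value) makes the whitespace-only character data visible, which matters for what happens next.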
I made a simple xmlParser base class to derive specific parsers from, inherited from it, and set the appropriate handler functions. These set a flag to say we were inside the root element, tracked the current tag being parsed, and added entries to the dictionary. Here's the inherited class:
from xmlParser import *
import time

class optionsParser(xmlParser):
    """
    This parser is used to generate a dictionary of overall options
    for our script
    """
    def __init__(self, xmlFile):
        xmlParser.__init__(self, xmlFile)
        self.inOptions = False
        self.curTag = " "
        self.options = {}

    def handleCharacterData(self, data):
        #print "Handling: " , data
        if self.inOptions and self.curTag != "options":
            self.options[self.curTag] = data
        #time.sleep(1)

    def handleStartElement(self, name, attributes):
        # print "Starting: ", name
        if name == "options":
            self.inOptions = True
        if self.inOptions:
            self.curTag = name
        # time.sleep(1)

    def handleEndElement(self, name):
        # print "Ending: ", name
        if name == "options":
            self.inOptions = False
        self.curTag = ""
        # time.sleep(1)

    def getOptions(self):
        return self.options
The commented out print statements and sleep statements were so I could watch it parse. The first time I ran it, I noticed that it was outputting a bunch of
Handling:
Handling:
Starting: option1
Handling: foo
Handling:
Ending: option1
And so on - lots of empty stuff, I'm assuming they were new lines in the text. Then, if I was to print out the dictionary, it would look like this:
:
option1 : foo
option2 : bar
option3 : zip
This annoyed the crap out of me. I figured out that it was because I was resetting curTag to "" whenever a tag ended, so the whitespace between tags got stored under an empty key - but if I didn't do that, then the real values got overwritten with blank space.
My options were to set another flag, "tagNameAlreadyProcessed" or similar, or change the
if self.inOptions and self.curTag != "options":
line to read
if self.inOptions and self.curTag != "options" and self.curTag != "":
That worked, but damn is it ugly. I knew there had to be a better way.
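For what it's worth, a third option with expat is to buffer the character data and only store it when the tag closes. This is not what I ended up doing; it's a sketch that assumes whitespace around a value is never meaningful, and the names parse_options and root_tag are invented for the example. Buffering also covers the case where expat delivers one element's text in several pieces.

```python
import xml.parsers.expat

SAMPLE = """<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>"""


def parse_options(xml_text, root_tag="options"):
    """Return {tag: text} for the direct children of root_tag."""
    options = {}
    stack = []   # names of the elements that are currently open
    buffer = []  # text pieces seen since the last start or end tag

    def start(name, attributes):
        stack.append(name)
        buffer.clear()

    def chars(data):
        buffer.append(data)

    def end(name):
        # A direct child of the root means exactly two open elements.
        if len(stack) == 2 and stack[0] == root_tag:
            options[name] = "".join(buffer).strip()
        stack.pop()
        buffer.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return options


result = parse_options(SAMPLE)
print(result)
```

It gets rid of the blank key without a second flag, but it's still a lot of plumbing for a flat options file.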
I posted to the python-list explaining what was happening (python-list rocks by the way - if you use python, and aren't subscribed, do it now) and an awesome fellow by the name of Steven Bethard pointed me to xml.etree.ElementTree.
etree is way easier to use, at least for simple xml parsing like I needed to do here. I haven't tried to use it for anything remotely complex, but for this problem it works like a charm. Compare the above code to the equivalent with etree:
import xml.etree.ElementTree as etree
optionsXML = etree.parse("options.xml")
options = {}
# iter() walks the root and all descendants (older Pythons called it getiterator())
for child in optionsXML.iter():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text
Yeah, that's it. 6 lines of code. Suited more for being embedded into a class as a function. Short and sweet.
Nerdgasm.
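As a function, it might look something like the sketch below. The name load_options is made up, and it uses io.StringIO only so the example runs without an options.xml on disk; etree.parse accepts a filename or a file object. Unlike the loop above, this version only looks at the root's direct children, which is all a flat options file needs.

```python
import io
import xml.etree.ElementTree as etree

SAMPLE = """<?xml version="1.0"?>
<options>
<option1>foo</option1>
<option2>bar</option2>
<option3>zip</option3>
</options>"""


def load_options(source):
    """Return {tag: text} for each direct child of the root element.

    source may be a filename or an open file object.
    """
    root = etree.parse(source).getroot()
    # Iterating an Element yields its direct children only.
    return {child.tag: child.text for child in root}


options = load_options(io.StringIO(SAMPLE))
print(options)
```

One thing to watch for: an empty tag such as <option4/> comes back with a text of None rather than an empty string.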
Lessons learned:
- Look over all available options before choosing one (Christ, I feel stupid)
- Use etree with python, at least for simple parsing tasks
- python-list is your friend (we already knew that, right?)
You should also check out amara. I have found it is just as simple to use as etree and provides a more pythonic API. One handy aspect is being able to use XPath. For example:
doc = amara.parse('<hello><world/></hello>')
worlds = doc.xml_xpath(u'//world')
My biggest reason for using amara is namespaces. ElementTree and friends are technically correct, but Amara (and 4Suite) do a better job getting the namespaces and prefixes you want.
You can get amara with easy_install or from here.
Good luck!
I'll check it out - I prefer to use tools that are in the standard library if they're available though.
Here's a one-liner (not counting the import):
options = dict((e.tag, e.text) for e in etree.parse("options.xml").getroot())
Thank you
nice tutorial
Good One