XML is one of the most powerful, and most often misused, data formats these days. It is an easy and elegant way to express huge object and data structures and to save and transport all kinds of information. There are XML parsers for nearly all systems, and plenty of modules for nearly all IDEs that make dealing with XML more or less easy.
Because I had some trouble choosing the right “XML toolkit” from CPAN, I’ll give you a short overview of the three most important XML Perl modules and my experiences with them.
But first I’ll have to make some statements (the usual way to begin a good program):
In the beginning I wrote about the “misuse” of XML. Here is what I mean by that: the position of sibling elements of equal rank in the document hierarchy should make no difference. In the following example, the position of the command siblings does matter!
    <?xml version="1.0" encoding="iso-8859-1"?>
    <myxml general="valid" xmlinfo="important">
      <command action="echo" option="Message" target="/tmp/test.txt"/>
      <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
    </myxml>

This is an example of a bad XML structure! If (for any reason) the two command siblings are switched, the email will be sent before its content has been generated.
A better solution for this would be:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <myxml general="valid" xmlinfo="important">
      <command step="2" action="mail" option="/tmp/test.txt" address="some@email.com"/>
      <command step="1" action="echo" option="Message" target="/tmp/test.txt"/>
    </myxml>

Here the attribute “step” is used to order the individual commands, so switching the position of the two siblings does not matter (the program can still handle the commands in the correct order). An acceptable alternative is the following structure:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <myxml general="valid" xmlinfo="important">
      <command action="echo" option="Message" target="/tmp/test.txt">
        <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
      </command>
    </myxml>
The main message is: be prepared for bad XML modules to reorder your XML! Another reason is that development is absolutely no fun when you have to count lines to get the value you want (use your computer for counting and do not do it yourself!).
And when the XML changes someday, you may have to change your programs too, since they then have to count from a different line. Unfortunately, this is one of the most frequent kinds of crap I have to deal with at the customers I support.
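The “step” idea from above can be sketched in a few lines of plain Perl. The list of hashes is written by hand to stand in for the parsed command elements (an assumption for illustration, not the output of any particular module), and commands_in_order is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written stand-in for the parsed <command> elements, deliberately
# listed in the "wrong" document order (assumption for illustration).
my @commands = (
    { step => 2, action => 'mail', option => '/tmp/test.txt', address => 'some@email.com' },
    { step => 1, action => 'echo', option => 'Message',       target  => '/tmp/test.txt' },
);

# Hypothetical helper: return the commands sorted numerically by "step",
# so their position inside the document no longer matters.
sub commands_in_order {
    my @cmds = @_;
    return sort { $a->{step} <=> $b->{step} } @cmds;
}

my @ordered = commands_in_order(@commands);

# Prints "1: echo" and then "2: mail"
print "$_->{step}: $_->{action}\n" for @ordered;
```

Whatever order a module hands the siblings back in, the program only looks at the step value.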
First let’s take a look at XML::Simple
Short information:
Rating on CPAN: 4 of 5 stars
Last modification date: 15 Aug 2007
Recent ratings show that the module is still widely used
XML::Simple really is simple to use. After a few minutes of reading the well-written documentation, I was able to read, modify and write XML documents with XML::Simple. A really nice feature is that you do not have to deal with the XML structures yourself. The XML is parsed into a nested hash/array structure that represents the XML structure and is easy to use and (for small changes) easy to modify.
First I created an XML file to read from:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module">
        <deepinfo>last text here</deepinfo>
      </info>
    </test>

Then I wrote a short test script:

    #!/usr/bin/perl
    # Simple XML module test unit
    ###
    # We are strict, cauz we are elitecoderz!
    use strict;
    use warnings;
    use XML::Simple;

    ################
    # Get / check parameter (here we get the XML file we want to deal with)
    if (@ARGV != 1) {
        print "Error: Wrong number of parameters.\n";
        exit(1);
    }
    my $input = $ARGV[0];
    chomp($input);

    ################
    # Read input file / check content / validate XML structure
    # Some modules are able to read directly from a file; for simplicity, we use the method below.
    # Direct, dirty, but simple reading of a file
    open(my $in, '<', $input) || die "Error: File not readable.\n";
    my @lines = <$in>;
    close($in);
    # Put the lines into one string for this parser
    my $XMLString = join(' ', @lines);

    ####################################################################
    # XML::Simple
    my $ref;
    eval { $ref = XMLin($XMLString, KeepRoot => 1); };
    if ($@) {
        print "Error: XML parsing error: $@\n";
        exit(1);
    }

    # Adding an attribute and a text to the first node
    $ref->{'test'}->{'info'}->[0]->{'addon'} = 'valid text';
    $ref->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";

    # Adding an attribute and an unescaped text to the second node
    $ref->{'test'}->{'info'}->[1]->{'addon'} = 'invalid unescaped text';
    $ref->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";

    # Adding an attribute and an unescaped text wrapped in CDATA to the third node
    $ref->{'test'}->{'info'}->[2]->{'addon'} = 'valid unescaped text in cdata';
    $ref->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";

    # Write the result next to the input file; NoEscape => 1 switches off automatic escaping
    open(my $out, '>', "${input}_XML-Simple") || die "Error: Cannot open output file.\n";
    print {$out} XMLout($ref, NoSort => 1, KeepRoot => 1, NoEscape => 1);
    close($out);
    exit(0);
The resulting XML looks like this:
    <test attr2="2" attr1="1" another="<>" debug="0">
      <info attr2="xml module" addon="valid text" attr1="perl">Here is an valid xml text
    using linebreaks</info>
      <info attr2="xml module" addon="invalid unescaped text" attr1="perl">Here is an valid xml text
    using linebreaks
    and unescaped characters like < and >
    do you see the and?</info>
      <info attr2="xml module" addon="valid unescaped text in cdata" attr1="perl"><![CDATA[Here is an valid xml text
    using linebreaks
    and unescaped characters like < and >
    do you see the and?]]></info>
      <info attr2="xml module" attr1="perl" deepinfo="last text here" />
    </test>
As you can see (and as requested by NoEscape => 1 in the XMLout call), “evil” characters like < and > are not escaped automatically. That is fine, since we set the option to “1”.
But with automatic escaping enabled, the CDATA part gets escaped too, and that is no fun at all, since it destroys the CDATA structure.
So, after reading from XML, you have to escape all contents manually before writing the structure again. You have to keep track of which content must be escaped and which must not. That is a lot of work and an easy source of errors.
Did you notice the most evil error in the XML?
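A minimal sketch of such manual escaping, assuming the values are written later with NoEscape => 1 as in the script above. The helper escape_unless_cdata is hypothetical (it is not part of XML::Simple) and only looks at text that is wrapped as a whole in one CDATA section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the XML special characters in a text value,
# but leave a value alone if it is wrapped completely in a CDATA section.
sub escape_unless_cdata {
    my ($text) = @_;
    return $text if $text =~ /\A<!\[CDATA\[.*\]\]>\z/s;
    $text =~ s/&/&amp;/g;    # must run first, or the entities below get escaped again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Plain text gets escaped, the CDATA value passes through unchanged
print escape_unless_cdata("characters like < and >"), "\n";
print escape_unless_cdata("<![CDATA[characters like < and >]]>"), "\n";
```

This sketch ignores attribute values (which would also need their quotes escaped) and would escape text a second time if it already contains entities, so it shows the bookkeeping involved rather than a complete solution.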
    <info attr1="perl" attr2="xml module">
      <deepinfo>last text here</deepinfo>
    </info>
became:
    <info attr2="xml module" attr1="perl" deepinfo="last text here" />

The child element “deepinfo” was removed completely and put into its parent as an attribute. This is a really bad problem: when you expect <deepinfo> as a child element, you won’t look for it as an attribute of the parent.
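One way to work around this, as far as I understand the XML::Simple documentation, is to read the file with ForceArray => 1, so that nested elements become array references and are written back as elements instead of attributes. A small sketch under that assumption, using a cut-down copy of the test XML:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Cut-down copy of the test file from above
my $xml = <<'XML';
<test>
  <info attr1="perl" attr2="xml module">
    <deepinfo>last text here</deepinfo>
  </info>
</test>
XML

# ForceArray => 1 turns every nested element into an array reference;
# XMLout writes array references as child elements and plain scalars
# as attributes. KeyAttr => [] switches off the folding of arrays
# into hashes keyed by name/key/id.
my $ref = XMLin($xml, KeepRoot => 1, ForceArray => 1, KeyAttr => []);
my $out = XMLout($ref, KeepRoot => 1, KeyAttr => []);

print $out;
```

The price is that every element access in the program then needs an extra array index, for example ->{'deepinfo'}->[0].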
Result:
If you do not work with complex XML structures that need to be escaped, and if XML::Simple is the only module that will read and write the XML files, then this is your module of choice!
The next post will be about XML::LibXML vs. XML::Mini::Document. After that, I’ll go on with a series about “getting your life comfortable with XML::LibXML”.
As you can see from this: I strongly recommend XML::LibXML. Why I came to this conclusion will be explained in the following posts.
Update:
Next chapter: XML::LibXML vs. XML::Mini::Document
my $parser = XML::LibXML->new(); <- this does not make sense if you use simpleXML
“<$input” <- uh?
my @lines = ; <- hmm
alex, you’re absolutely right. I forgot to remove this line in the post. Thanks for your hint.
That is an encoding error of the WordPress syntax-highlighting plugin in conjunction with the language plugin. It’s not written in Perl, so what do you expect? But again, thanks.
And again, thanks. This WordPress drives me mad sometimes: the editor tries to escape and unescape my source code, and sometimes removes parts of it because it thinks they are invalid HTML.
Thanks for the corrections.