Best XML module for perl? XML::LibXML vs. XML::Mini::Document
In my last post, I took a look at XML::Simple, a module for Perl that is, as its name says, simple and easy to understand when dealing with XML structures. Today we’ll take a look at the module XML::Mini::Document.
Short information:
Rating on CPAN: 3 of 5 stars
Last modification date: 05 Feb. 2008
XML::Mini::Document is a module that has more of the “look and feel” of an XML parser, but it can also convert XML to and from hash structures. The documentation and the examples are well written, so you will get along with this module quite easily and quickly.
We will use the same test XML that we used for XML::Simple:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module" />
      <info attr1="perl" attr2="xml module"><deepinfo>last text here</deepinfo></info>
    </test>

We’ll do the same operations as with XML::Simple before. The syntax is slightly different, so the source code of the new test case looks like this:
    #!/usr/bin/perl
    # Simple XML module test unit

    ###
    # We are strict, cauz we are Elitecoderz!
    use strict;
    use warnings;
    use XML::Mini::Document;
    use XML::Mini;

    ################
    # Get / check parameter (the XML file we want to deal with)
    if (@ARGV != 1) {
        print "Error: Wrong number of parameters.\n";
        exit(1);
    }
    my $input = $ARGV[0];
    chomp($input);

    ################
    # Read input file
    # Some modules are able to read directly from a file; to keep it easy, we read it ourselves.
    # Direct, dirty, but simple reading of a file
    open(my $in, '<', $input) or die "Error: File not readable.\n";
    my @lines = <$in>;
    close($in);

    # Put the lines into one string for this parser
    my $XMLString = join(' ', @lines);

    ####################################################################
    # XML::Mini
    my $xmlDoc = XML::Mini::Document->new();
    eval { $xmlDoc->parse($XMLString); };
    if ($@) {
        print "Error: XML parsing error: $@\n";
        exit(1);
    }
    my $xmlHash = $xmlDoc->toHash();

    # Adding an attribute and a text to the first node
    $xmlHash->{'test'}->{'info'}->[0]->{'addon'}   = 'valid text';
    $xmlHash->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";

    # Adding an attribute and an unescaped text to the second node
    $xmlHash->{'test'}->{'info'}->[1]->{'addon'}   = 'invalid unescaped text';
    $xmlHash->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";

    # Adding an attribute and an unescaped text wrapped in CDATA to the third node
    $xmlHash->{'test'}->{'info'}->[2]->{'addon'}   = 'valid unescaped text in cdata';
    $xmlHash->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";

    # my $newDoc = XML::Mini::Document->new();
    $xmlDoc->fromHash($xmlHash);

    # Write the result to the output file
    open(my $out, '>', 'output_XMLMini') or die "Error: Cannot open output file.\n";
    print {$out} $xmlDoc->toString();
    close($out);
After running the test, the resulting XML looks like this:
    <test>
     <info>
      <attr2> xml module </attr2>
      <addon> valid text </addon>
      <attr1> perl </attr1>
      <content>
       Here is an valid xml text
       using linebreaks
      </content>
     </info>
     <info>
      <attr2> xml module </attr2>
      <addon> invalid unescaped text </addon>
      <attr1> perl </attr1>
      <content>
       Here is an valid xml text
       using linebreaks
       and unescaped characters like &lt; and &gt;
       do you see the and?
      </content>
     </info>
     <info>
      <attr2> xml module </attr2>
      <addon> valid unescaped text in cdata </addon>
      <attr1> perl </attr1>
      <content>
       &lt;![CDATA[Here is an valid xml text
       using linebreaks
       and unescaped characters like &lt; and &gt;
       do you see the and?]]&gt;
      </content>
     </info>
     <info>
      <attr2> xml module </attr2>
      <attr1> perl </attr1>
      <deepinfo> last text here </deepinfo>
     </info>
     <attr2> 2 </attr2>
     <attr1> 1 </attr1>
     <another> &lt;&gt; </another>
     <debug> 0 </debug>
    </test>
    <xml>
     <version> 1.0 </version>
     <encoding> iso-8859-1 </encoding>
    </xml>
*Ouch*! What the heck happened to our XML?! After the first shock, we see that every attribute has been turned into a newly generated child node. This is valid syntax, and all the information is still in the file (the good point is that, unlike with XML::Simple, the “deepinfo” node keeps its correct position in XML::Mini::Document). But the problem that attributes and nodes are mixed up by the module remains the same.
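To make the mix-up a bit more tangible, here is a small sketch in plain Perl. The hash is built by hand and only imitates what “toHash()” seems to return for the fourth “info” element, judging from the access paths in the test script; it is an assumption, not captured output of the module, and “describe_keys” is a made-up helper.

```perl
use strict;
use warnings;

# Hand-built stand-in for what toHash() seems to return for the fourth
# <info> element. This shape is an assumption, not captured module output.
my $info = {
    attr1    => 'perl',              # was an attribute in the XML
    attr2    => 'xml module',        # was an attribute in the XML
    deepinfo => 'last text here',    # was a child element in the XML
};

# Hypothetical helper: report what kind of value each key holds.
sub describe_keys {
    my ($hash) = @_;
    return map {
        my $kind = ref($hash->{$_}) ? 'nested structure' : 'plain string';
        "$_: $kind";
    } sort keys %$hash;
}

# All three keys look identical, so nothing tells a serializer which of
# them should be written back as attributes.
print "$_\n" for describe_keys($info);
```

If the real hash looks like this, then writing every key back as a child node is about the only thing “fromHash()” can do.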
Another big problem is the handling of CDATA. XML::Mini::Document escapes the CDATA node and is therefore not able to write correct CDATA nodes.
And the most evil problem is the treatment of the XML declaration (the “xml information node”). It is moved to the bottom and handled as if it were a normal XML node.
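For comparison, this is roughly the difference between escaping a text and wrapping it in a CDATA section, sketched in plain Perl without any XML module. The helper names “escape_text” and “wrap_cdata” are my own and not part of XML::Mini; the last line only imitates what an escaped CDATA marker looks like.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub escape_text {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $text =~ s/([&<>])/$entity{$1}/g;
    return $text;
}

# Wrap a text in a CDATA section. A literal "]]>" inside the text would end
# the section too early, so it is split across two sections.
sub wrap_cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

my $text = 'unescaped characters like < and >';
print escape_text($text), "\n";              # escaped text node
print wrap_cdata($text), "\n";               # proper CDATA section
print escape_text(wrap_cdata($text)), "\n";  # escaped markers: no longer a CDATA section
```

Once the markers themselves are escaped, a parser sees ordinary text that merely starts with the characters of a CDATA marker.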
First result:
The “fromHash” and “toHash” methods in XML::Mini::Document are technical crap that can only be used on small XML documents without attributes and CDATA nodes.
But there is a second way to use XML::Mini::Document
Unlike in XML::Simple, converting the XML structure to and from a hash isn’t the only way to work in XML::Mini::Document. It also provides a toolbox for dealing with the structure directly. This is what we want to test next, so we rewrite our little test case:
    #!/usr/bin/perl
    # Simple XML module test unit

    ###
    # We are strict, cauz we are Elitecoderz!
    use strict;
    use warnings;
    use XML::Mini::Document;
    use XML::Mini;

    ################
    # Get / check parameter (the XML file we want to deal with)
    if (@ARGV != 1) {
        print "Error: Wrong number of parameters.\n";
        exit(1);
    }
    my $input = $ARGV[0];
    chomp($input);

    ################
    # Read input file
    # Some modules are able to read directly from a file; to keep it easy, we read it ourselves.
    # Direct, dirty, but simple reading of a file
    open(my $in, '<', $input) or die "Error: File not readable.\n";
    my @lines = <$in>;
    close($in);

    # Put the lines into one string for this parser
    my $XMLString = join(' ', @lines);

    ####################################################################
    # XML::Mini
    my $xmlDoc = XML::Mini::Document->new();
    eval { $xmlDoc->parse($XMLString); };
    if ($@) {
        print "Error: XML parsing error: $@\n";
        exit(1);
    }
    my $xmlRoot = $xmlDoc->getRoot();

    # First <info> node: add an attribute
    my $firstnode = $xmlDoc->getElementByPath('test/info');
    $firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");

    # Second <info> node: add an attribute and an unescaped text
    my $secondnode = $xmlDoc->getElementByPath('test/info', 1, 2);
    $secondnode->attribute('addon', 'invalid unescaped text');
    $secondnode->text("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");

    # New <info> node below <test>: add an attribute and a CDATA section
    my $testnode = $xmlDoc->getElementByPath('test');
    my $newchild = $testnode->createChild("info");
    $newchild->attribute('addon', "unescaped text for a CDATA");
    $newchild->cdata("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");

    # Write the result to the output file
    open(my $out, '>', 'output_XMLMini3') or die "Error: Cannot open output file.\n";
    print {$out} $xmlDoc->toString();
    close($out);
The resulting XML looks like this:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <test another="&lt;&gt;" attr1="1" attr2="2" debug="0">
     <info addon="Here is an valid xml text using linebreaks" attr1="perl" attr2="xml module" />
     <info addon="invalid unescaped text" attr1="perl" attr2="xml module">
      Here is an valid xml text
      using linebreaks
      and unescaped characters like &lt; and &gt;
      do you see the and?
     </info>
     <info attr1="perl" attr2="xml module" />
     <info attr1="perl" attr2="xml module">
      <deepinfo> last text here </deepinfo>
     </info>
     <info addon="unescaped text for a CDATA">
      <![CDATA[ Here is an valid xml text
      using linebreaks
      and unescaped characters like < and >
      do you see the and? ]]>
     </info>
    </test>
*Wow*! The first, very bad impression gets a lot better! The structure is complete, looks good, and is escaped where it should be and not escaped where it shouldn’t be. In short: the XML looks quite fine! After this, it is definitely worth taking a closer look at XML::Mini::Document, and we will do that in the next section.
What happens to more complex XML structures containing CDATA and encoded text when parsing?
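A short word on the numbers passed to “getElementByPath” in the script. As far as I can tell from the calls above, they pick the n-th match at each level of the path, counted from 1, so “'test/info', 1, 2” means the second “info” below the first “test”. The following plain-Perl sketch imitates that lookup on a hand-built tree. The function “element_by_path” and the tree layout are made up for illustration and are not the module’s implementation.

```perl
use strict;
use warnings;

# Hypothetical lookup: walk the path level by level and take the n-th
# matching child (1-based) at each level; without a position, take the first.
sub element_by_path {
    my ($doc, $path, @positions) = @_;
    my $current = $doc;
    for my $name (split m{/}, $path) {
        my $pos     = shift(@positions) // 1;
        my @matches = grep { $_->{name} eq $name } @{ $current->{children} || [] };
        return if $pos < 1 || $pos > @matches;    # no such element
        $current = $matches[$pos - 1];
    }
    return $current;
}

# Hand-built stand-in for a parsed document with two <info> nodes.
my $doc = {
    children => [
        {
            name     => 'test',
            children => [
                { name => 'info', id => 'first' },
                { name => 'info', id => 'second' },
            ],
        },
    ],
};

print element_by_path($doc, 'test/info')->{id}, "\n";          # prints "first"
print element_by_path($doc, 'test/info', 1, 2)->{id}, "\n";    # prints "second"
```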
To test this, we modify our little test-case again,
    #!/usr/bin/perl
    # Simple XML module test unit

    ###
    # We are strict, cauz we are Elitecoderz!
    use strict;
    use warnings;
    use XML::Mini::Document;
    use XML::Mini;

    ################
    # Get / check parameter (the XML file we want to deal with)
    if (@ARGV != 1) {
        print "Error: Wrong number of parameters.\n";
        exit(1);
    }
    my $input = $ARGV[0];
    chomp($input);

    ################
    # Read input file
    # Some modules are able to read directly from a file; to keep it easy, we read it ourselves.
    # Direct, dirty, but simple reading of a file
    open(my $in, '<', $input) or die "Error: File not readable.\n";
    my @lines = <$in>;
    close($in);

    # Put the lines into one string for this parser
    my $XMLString = join(' ', @lines);

    ####################################################################
    # XML::Mini
    my $xmlDoc = XML::Mini::Document->new();
    eval { $xmlDoc->parse($XMLString); };
    if ($@) {
        print "Error: XML parsing error: $@\n";
        exit(1);
    }
    my $xmlRoot = $xmlDoc->getRoot();

    # First <info> node: add an attribute
    my $firstnode = $xmlDoc->getElementByPath('test/info');
    $firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");

    # Fifth <info> node: the one holding the CDATA section
    my $cdatanode = $xmlDoc->getElementByPath('test/info', 1, 5);
    print "\n\n--\n" . $cdatanode->getValue . "\n--\n\n";

    # Second <info> node: the one holding the escaped text
    my $textnode = $xmlDoc->getElementByPath('test/info', 1, 2);
    print "\n\n--\n" . $textnode->getValue . "\n--\n\n";

    # Write the result to the output file
    open(my $out, '>', 'output_XMLMini4') or die "Error: Cannot open output file.\n";
    print {$out} $xmlDoc->toString();
    close($out);
and read the previously generated XML file again. This is the output:
    --
    Here is an valid xml text
    using linebreaks
    and unescaped characters like < and >
    do you see the and?
    --

    --
    Here is an valid xml text
    using linebreaks
    and unescaped characters like &lt; and &gt;
    do you see the and?
    --
The result for the CDATA looks quite nice, but the text from our second node unfortunately does not get unescaped. Well, it’s no big deal to do this on your own, but it really should be the job of XML::Mini::Document and not mine.
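Doing it on your own could look like the following minimal sketch. It is plain Perl, handles only the five predefined XML entities (no numeric character references), and “unescape_text” is a name I made up; it is not a function of XML::Mini.

```perl
use strict;
use warnings;

# The five predefined XML entities and the characters they stand for.
my %char_for = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

# Replace the predefined entities in a single pass, so that "&amp;lt;"
# becomes "&lt;" and is not decoded a second time.
sub unescape_text {
    my ($text) = @_;
    $text =~ s/&(lt|gt|amp|quot|apos);/$char_for{$1}/g;
    return $text;
}

print unescape_text('and unescaped characters like &lt; and &gt;'), "\n";
# prints: and unescaped characters like < and >
```

In the test script this would be applied to the string returned by “getValue” for the text node, but not to the value of the CDATA node, which already comes back unescaped.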
Result:
Do not try to use the “fromHash()” method unless you want to destroy your XML. The “toHash()” method can be used to quickly gather information from an XML file, but that is really all you can use it for. The toolbox for directly manipulating XML structures is really nice. In particular, the syntax for getting elements by path is something I found really useful! Nice idea, great implementation!
XML::Mini::Document deals with all kinds of XML structures and does the escaping automatically. Unescaping of text nodes must be done manually. That is a bug that prevents me from using this module.
XML::Mini::Document is the module of your choice if you have to deal with more complex structures. Since unescaping is a problem, I would not recommend using this module on really huge and sensitive XML structures. For everything else, XML::Mini::Document is easy to use and quick to implement.
The next post will be about “Getting your life comfortable with XML::LibXML”.
In all the tests above, XML::LibXML showed no weakness and I could not find any problems. But there are a few other “traps” in XML::LibXML you’ll have to deal with.