xml

Read OS X binary .plist with Java

With OS X 10.2 Apple introduced a more space-efficient binary plist format. With OS X 10.4 this binary format became the default. The earlier format was a simple XML format (introduced with OS X 10.0), which was easy to read and parse from Java.

The new format must be parsed in a special way: either by using Apple’s Property List Editor, or by using Apache Commons Configuration (whose API supports the property list format). In addition, Daniel Dreibrodt maintains a Java project that reads the binary plist format. It is licensed under GPL3 and can be found at:

http://code.google.com/p/plist/

Update: There is a new project on SourceForge that should be able to read and write the .plist format. I haven’t tested it yet, but I want to provide the link here too: Property List Library

Convert String into InputStream

Over the last few days I was asked several times: “How do I convert a string into an InputStream, so that I can parse my XML string with the SAX parser?” (Note: the same applies to any other XML parser or API, such as DOM, JAXP or JDOM.)

Since I am tired of explaining the same thing again and again, here is the answer:

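import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

String myString = "content of your very own string";
// Encode explicitly instead of relying on the platform default charset
ByteArrayInputStream in = new ByteArrayInputStream(myString.getBytes(StandardCharsets.UTF_8));
InputSource is = new InputSource();
is.setByteStream(in);
is.setEncoding("UTF-8");

// "Parser" stands for whichever XML parser class you use
Parser myParser = new Parser();
myParser.parse(is);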

Quick & Easy :)
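
For a version that compiles and runs as it stands, here is a minimal sketch using the SAX parser that ships with the JDK (JAXP). The class name StringToInputStream and the helper methods toStream and parseText are made up for this example, and UTF-8 is an assumption; use whatever encoding your string really needs.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
class StringToInputStream {

    // Wrap the UTF-8 bytes of the string in an in-memory stream.
    static InputStream toStream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Parse the XML string with the JDK's SAX parser and collect all character data.
    static String parseText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        InputSource source = new InputSource(toStream(xml));
        source.setEncoding("UTF-8"); // same charset as in toStream (an assumption)
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseText("<greeting>content of your very own string</greeting>"));
    }
}
```

The handler here only collects text; in a real program you would pass your own DefaultHandler subclass instead.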

GWT and XML – First steps with com.google.gwt.xml

Developing web applications and portals using Eclipse, GWT (Google Web Toolkit) and Apache Tomcat is one of the most effective ways to get your work done: the result is usually a fast AJAX application. To exchange data between the AJAX front end and the Tomcat server, I decided to use GWT RPC, exchanging XML data structures.

For this, GWT provides a nice package: com.google.gwt.xml.client.*

com.google.gwt.xml.client.* is a complete toolkit for working efficiently with XML data structures on the JavaScript client side.

After importing the needed packages as usual, I wrote some code to create a new XML DOM structure. But when running the new code, the following error occurred:

Line 52: No source code is available for type
com.google.gwt.xml.client.Document; did you forget to inherit a
required module?

Up to this point I was used to Eclipse doing everything for me, so I had absolutely no idea what to do, and Google returned no hints for this problem. I started to study the documentation and the project in detail and found the solution:

To “inherit” the XML module in GWT, you have to add the following line to the gwt.xml:

<inherits name="com.google.gwt.xml.XML" />

The gwt.xml file can be found under your project in the path: <project>/src/<your domain>/<project>.gwt.xml

After adding this line, dealing with XML is really simple. I’ll post an article here with examples as soon as I can find a little time for my blog again.

Best XML module for perl? XML::LibXML vs. XML::Mini::Document

In my last post, I took a look at XML::Simple, a (as the name says) simple and easy-to-understand Perl module for dealing with XML structures. Today we’ll take a look at the module XML::Mini::Document.

Short information:
3 of 5 stars rating on cpan
Last modification date: 05 Feb. 2008

XML::Mini::Document is a module that has more of the “look and feel” of an XML parser, but it also offers conversion of XML to and from hash structures. The documentation and the examples are well written, so you will get along with this module quite easily and quickly.

We will use the same test XML that we used for XML::Simple:

<?xml version="1.0" encoding="iso-8859-1"?>
<test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module"><deepinfo>last text here</deepinfo></info>
</test>

We’ll do the same operations as with XML::Simple before. The syntax is slightly different, so the source code of the new test case looks like this:

#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(my $in, '<', $input) || die "Error: File not readable.\n";
my @lines = <$in>;
close($in);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
 
my $xmlHash = $xmlDoc->toHash();
 
# Adding an attribute and a text to the first node
$xmlHash->{'test'}->{'info'}->[0]->{'addon'} = 'valid text';
$xmlHash->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";
 
# Adding an attribute and an unescaped text to the second node
$xmlHash->{'test'}->{'info'}->[1]->{'addon'} = 'invalid unescaped text';
$xmlHash->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";
 
# Adding an attribute and a text wrapped in CDATA to the third node
$xmlHash->{'test'}->{'info'}->[2]->{'addon'} = 'valid unescaped text in cdata';
$xmlHash->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";
 
# my $newDoc = XML::Mini::Document->new();
$xmlDoc->fromHash($xmlHash);
open(my $out, '>', 'output_XMLMini') || die "Error: Cannot write output file.\n";
print $out $xmlDoc->toString();
close($out);

After running the test, the resulting XML looks like this:

<test>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			valid text
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			Here is an valid xml text using linebreaks
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			invalid unescaped text
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			valid unescaped text in cdata
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			&lt;![CDATA[Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?]]&gt;
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<attr1>
			perl
		</attr1>
		<deepinfo>
			last text here
		</deepinfo>
	</info>
	<attr2>
		2
	</attr2>
	<attr1>
		1
	</attr1>
	<another>
		&amp;lt;&amp;gt;
	</another>
	<debug>
		0
	</debug>
</test>
<xml>
	<version>
		1.0
	</version>
	<encoding>
		iso-8859-1
	</encoding>
</xml>

*Ouch*! What the heck happened to our XML?! After the first shock, we see that every attribute has been turned into a newly generated child node. The syntax is valid, and all information is still in the file (the good point is that, unlike with XML::Simple, the “deepinfo” node keeps its correct position in XML::Mini::Document). But the problem that the module mixes up attributes and nodes remains the same.

Another big problem is the handling of CDATA. XML::Mini::Document escapes the CDATA markup, so it cannot write correct CDATA nodes this way.

And the most evil problem is the treatment of the XML declaration (<?xml ... ?>). It is moved to the bottom and handled as if it were a normal XML node.

First result:

The “fromHash” and “toHash” methods in XML::Mini::Document are technical crap that can only be used on small XML files without attributes and CDATA nodes.

But there is a second way to use XML::Mini::Document

Unlike in XML::Simple, converting the XML structure to and from a hash isn’t the only way to work in XML::Mini::Document. It also provides a toolbox for manipulating the structure directly. That is what we want to test next, so we rewrite our little test case:

#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(my $in, '<', $input) || die "Error: File not readable.\n";
my @lines = <$in>;
close($in);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
my $xmlRoot = $xmlDoc->getRoot();
 
my $firstnode = $xmlDoc->getElementByPath('test/info');
$firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");
 
my $secondnode = $xmlDoc->getElementByPath('test/info',1,2);
$secondnode->attribute('addon','invalid unescaped text');
$secondnode->text("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");
 
my $testnode = $xmlDoc->getElementByPath('test');
my $newchild = $testnode->createChild("info");
$newchild->attribute('addon', "unescaped text for a CDATA");
$newchild->cdata("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");
 
open(my $out, '>', 'output_XMLMini3') || die "Error: Cannot write output file.\n";
print $out $xmlDoc->toString();
close($out);

The resulting XML looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<test another="&lt;&gt;" attr1="1" attr2="2" debug="0">
	<info addon="Here is an valid xml text
using linebreaks" attr1="perl" attr2="xml module" />
	<info addon="invalid unescaped text" attr1="perl" attr2="xml module">
		Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?
	</info>
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml
                        module">
		<deepinfo>
			last text here
		</deepinfo>
	</info>
	<info addon="unescaped text for a CDATA">
<![CDATA[ Here is an valid xml text
using linebreaks
and unescaped characters like < and >
do you see the and? ]]> 
	</info>
</test>

*Wow*! The very bad first impression gets a lot better! The structure is complete, looks good, and is escaped where it should be and left alone where it shouldn’t be. In short: the XML looks quite fine! After this, it is definitely worth taking a closer look at XML::Mini::Document. We will do that in the next section.

What happens to more complex XML-structures, containing CDATAs and encoded text when parsing?

To test this, we modify our little test-case again,

#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(my $in, '<', $input) || die "Error: File not readable.\n";
my @lines = <$in>;
close($in);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
my $xmlRoot = $xmlDoc->getRoot();
 
my $firstnode = $xmlDoc->getElementByPath('test/info');
$firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");
 
my $cdatanode = $xmlDoc->getElementByPath('test/info',1,5);
print "\n\n--\n".$cdatanode->getValue."\n--\n\n";
 
my $textnode = $xmlDoc->getElementByPath('test/info',1,2);
print "\n\n--\n".$textnode->getValue."\n--\n\n";
 
open(my $out, '>', 'output_XMLMini4') || die "Error: Cannot write output file.\n";
print $out $xmlDoc->toString();
close($out);

and read the previously generated XML file again to see the output:

--
 Here is an valid xml text
 using linebreaks
 and unescaped characters like < and > do you see the and?
--

--
Here is an valid xml text
 using linebreaks
 and unescaped characters like &lt; and &gt;
 do you see the and?
--

The result for the CDATA node looks quite nice, but the text from our second node unfortunately does not get unescaped. Well, it’s no big deal to do this on your own, but it is not really my job to do this for XML::Mini::Document.

Result:
Do not try to use the “fromHash()” method unless you want to destroy your XML. The “toHash()” method can be used to quickly gather information from an XML file, but that is really all you can use it for.

The toolbox for directly manipulating XML structures is really nice. The syntax for getting elements by path in particular is something I found really useful! Nice idea, great implementation!

XML::Mini::Document deals with all kinds of XML structures and does the escaping automatically. Unescaping of text nodes must be done manually. That is a bug that prevents me from using this module.

XML::Mini::Document is the module of your choice if you have to deal with more complex structures. Since unescaping is a problem, I would not recommend using this module on really huge and sensitive XML structures. For everything else, XML::Mini::Document is easy to use and quick to implement.

The next post will be about “Getting your life comfortable with XML::LibXML”.
In all the tests above XML::LibXML showed no weakness and I could not find any problems. But there are a few other “traps” in XML::LibXML you’ll have to deal with.

Best XML module for perl? XML::LibXML vs. XML::Simple

XML is one of the most powerful and most often misused data structures these days. XML is an easy and elegant way to express huge object and data structures, and to save and transport all kinds of information. There are XML parsers for nearly all systems, and there are a lot of modules for nearly all IDEs to deal with XML in a more or less easy way.

After some trouble choosing the right “XML toolkit” from CPAN, I’ll give you a short overview of the three most important XML Perl modules and my experiences with them.

But first I’ll have to make some statements (the usual way to begin a good program ;) ):

Appeal for good XML structures

In the beginning I wrote about the “misuse” of XML, so first I’ll give you an example of what I mean by that. The position of XML siblings of equal rank in the document hierarchy should make no difference. Here is an example where the position of the command siblings does matter!

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command action="echo" option="Message" target="/tmp/test.txt"/>
  <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
</myxml>

This is an example of a bad XML structure! If (for any reason) the two command siblings are switched, the email will be sent before the content is generated.

A better solution for this would be:

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command step="2" action="mail" option="/tmp/test.txt" address="some@email.com"/>
  <command step="1" action="echo" option="Message" target="/tmp/test.txt"/>
</myxml>

Here the attribute “step” is used to order the single commands. So switching the position of the two siblings does not matter (as the program will be able to handle the commands still in the correct order). An alternative way that would be ok is the following structure:

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command action="echo" option="Message" target="/tmp/test.txt">
    <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
  </command>
</myxml>

The main message is: be prepared for bad XML modules to reorder your XML! Another reason is: it is absolutely no fun in development when you have to count lines to get the value you want (use your computer for counting and do not do it yourself!).
And when the XML gets changed someday, you may have to change your programs too, since they have to count from another line. Unfortunately, this is one of the most frequent kinds of crap I have to deal with at the customers I’m supporting. :(
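
To illustrate the “step” idea from above: a minimal sketch in Java (JDK classes only) that reads the commands and handles them in the order given by the step attribute, no matter where they appear in the file. The class name OrderedCommands and the method actionsInStepOrder are made up for this example, and it assumes every command carries a numeric step attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named for this example only.
class OrderedCommands {

    // Return the "action" of every <command>, ordered by its numeric "step" attribute.
    static List<String> actionsInStepOrder(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1)));
        NodeList nodes = doc.getElementsByTagName("command");
        List<Element> commands = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            commands.add((Element) nodes.item(i));
        }
        // Sort by the step attribute instead of trusting the document order.
        commands.sort(Comparator.comparingInt((Element e) -> Integer.parseInt(e.getAttribute("step"))));
        List<String> actions = new ArrayList<>();
        for (Element command : commands) {
            actions.add(command.getAttribute("action"));
        }
        return actions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
                + "<myxml general=\"valid\" xmlinfo=\"important\">"
                + "<command step=\"2\" action=\"mail\" option=\"/tmp/test.txt\" address=\"some@email.com\"/>"
                + "<command step=\"1\" action=\"echo\" option=\"Message\" target=\"/tmp/test.txt\"/>"
                + "</myxml>";
        System.out.println(actionsInStepOrder(xml)); // prints [echo, mail]
    }
}
```

Because the order comes from the data and not from the position in the file, a module that reshuffles the siblings cannot break the program.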

Back to the XML modules for perl

First let’s take a look at XML::Simple

Short information:
4 of 5 stars rating on cpan
Last modification date: 15 Aug 2007
Current ratings show that the module is still widely used

XML::Simple really is simple to use. After a few minutes of reading the well-written documentation, I was able to read, modify and write XML documents with XML::Simple. A really nice feature is that you do not have to deal with the XML structures yourself. The XML is parsed into a construct of hashes and arrays that represents the XML structure and is easy to use and (for small changes) easy to change.

First I created an XML file to read from:

<?xml version="1.0" encoding="iso-8859-1"?>
 <test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module">
    <deepinfo>last text here</deepinfo>
  </info>
</test>

Then I wrote a short test unit:

#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are elitecoderz!
use strict;
use XML::Simple;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
# Direct, dirty, but simple reading of a file
open(my $in, '<', $input) || die "Error: File not readable.\n";
my @lines = <$in>;
close($in);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Simple
my $ref;
eval {
	$ref = XMLin($XMLString, KeepRoot => 1);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
 
# Adding an attribute and a text to the first node
$ref->{'test'}->{'info'}->[0]->{'addon'} = 'valid text';
$ref->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";
 
# Adding an attribute and an unescaped text to the second node
$ref->{'test'}->{'info'}->[1]->{'addon'} = 'invalid unescaped text';
$ref->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";
 
# Adding an attribute and a text wrapped in CDATA to the third node
$ref->{'test'}->{'info'}->[2]->{'addon'} = 'valid unescaped text in cdata';
$ref->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";
 
open(my $out, '>', "${input}_XML-Simple") || die "Error: Cannot write output file.\n";
print $out XMLout($ref, NoSort => 1, KeepRoot => 1, NoEscape => 1);
close($out);
exit(0);

The resulting XML looks like this:

<test attr2="2" attr1="1" another="<>" debug="0">
  <info attr2="xml module" addon="valid text" attr1="perl">Here is an valid xml text using linebreaks</info>
  <info attr2="xml module" addon="invalid unescaped text" attr1="perl">Here is an valid xml text using linebreaks and unescaped characters like < and > do you see the and?</info>
  <info attr2="xml module" addon="valid unescaped text in cdata" attr1="perl"><![CDATA[Here is an valid xml text using linebreaks and unescaped characters like < and > do you see the and?]]></info>
  <info attr2="xml module" attr1="perl" deepinfo="last text here" />
</test>

Summary to XML::Simple

As you can see, “evil” characters like < and > are not escaped automatically. That is expected, since we set NoEscape => 1 in XMLout.
But with automatic escaping enabled, the CDATA part gets escaped too, and that is no fun, since it damages the CDATA structure.
So, after reading the XML, you have to escape all content manually before writing the structure again, and you have to keep track of which content must be escaped and which must not. That is a lot of work and a good source of errors.
Did you notice the most evil error in the XML?

<info attr1="perl" attr2="xml module">
    <deepinfo>last text here</deepinfo>
</info>

became:

<info attr2="xml module" attr1="perl" deepinfo="last text here" />

The child element “deepinfo” was completely removed and put into its parent as an attribute. This is a really bad problem: when you expect <deepinfo> as an element, you won’t look for it as an attribute of the parent.

Result:
If you do not work with complex XML structures that need to be escaped, and if XML::Simple is the only module that will read and write the XML files, then this is your module of choice!

The next post will be about XML::LibXML vs. XML::Mini::Document. After that, I’ll go on with a series about “getting your life comfortable with XML::LibXML”.
As you can tell from this:
I strongly recommend XML::LibXML. How I came to this conclusion, you’ll find out in the following posts ;)

Update:
Next chapter: XML::LibXML vs. XML::Mini::Document
