Archive

Posts Tagged ‘Perl’

Cluster Filesystem for FreeBSD – GFS, OCFS2, …?

09:45 AM 2 comments

To make it short: There is absolutely NO real Cluster-Filesystem (like GFS or OCFS2) for FreeBSD at present. Also other projects for distributed filesystems like GlusterFS, PVFS or DRBD are not ported to FreeBSD, or the ports are very old.
Since I was in the need to have four identical data-filesystems (which have to be in sync just seconds after the upload), I wrote a little work-around for rsync, using the FreeBSD audit-system. The idea to use the audit-system for triggering the rsync I got from Luke Marsden, who is monitoring filesystem activity with audit_control and some python-scripts.

Install & Configure Event Audit Support

First of all, the audit_system must be activated and configured. The event-auditing is part of FreeBSD and has to be compiled into the kernel.

Add the following line to your kernel configuration:

options AUDIT

Then rebuild and reinstall your kernel as described in the FreeBSD Handbook

After this, add the following line to your /etc/rc.conf

auditd_enable=”YES”

The next step is, to configure the audit-system: Open the file /etc/security/audit_control and change the config to:

dir:/var/audit
flags:fc,fd,fw
minfree:20
naflags:lo
policy:cnt
filesz:0

That’s all for now. You can now start the audit-system by either calling

/etc/rc.d/auditd start

or by rebooting your system.

Installing & Running rsync

If rsync isn’t already installed on your system, you may do this by using the ports:

cd /usr/ports/net/rsync
make
make install

Installation of rsync should be no issue.

Next step is, to set an alternative path to your data-directory, using a symbolic link (I’ll explain later why).

ln -s /path/to/your/data/ /alternative_data_path/

Now we have to configure rsync to run as daemon. Therfor we create (or change) the config for rsync: /etc/rsyncd.conf

max connections = 5
log file = /var/log/rsync.log
timeout = 30

[shareName]
comment = Name of this “Rsync mount”
path = /alternative_data_path/
read only = no
list = yes
uid = validUser
gid = validGroup
hosts allow = ,
hosts deny = *

To start the rsync daemon, you have to call:

/usr/local/bin/rsync –config=/etc/rsyncd.conf –daemon

It is perhaps a good idea to monitor the rsyncd with the daemontools to make sure, the rsync-service is always available (you have then to run it with the –no-detach option).

The rsync-audit-script

#!/usr/bin/perl
##
# This software is published under the Apchae 2.0 licenses.
# You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#   
#   Author: Erik Scholtz 
#   Web: http://blog.elitecoderz.net
###
# We are strict, cauz we are elitecoderz!
use strict;
use threads qw(yield);
use threads::shared;
use Thread::Semaphore;
 
# No caching
$|=1;
 
################
# Configuration
my $debug = 1;																			# 0/1 to enable logging to the console or disable
my $path = '/path/to/your/data/';				# Path to sync
my @cmds;																				# Syncer commands that should be executed
$cmds[0] = '/usr/local/bin/rsync -raz --progress --size-only /path/to/symboliclink/data/<!--target--> rsync:///shareName/<!--target-->';
$cmds[1] = '/usr/local/bin/rsync -raz --progress --size-only /path/to/symboliclink/data/<!--target--> rsync:///shareName/<!--target-->';
$cmds[2] = '/usr/local/bin/rsync -raz --progress --size-only /path/to/symboliclink/data/<!--target--> rsync:///shareName/<!--target-->';
 
###############################################################################
# DO NOT CHANGE ANYTHING BELOW THIS LINE, UNLESS YOU KNOW WHAT YOU ARE DOING! #
###############################################################################
 
###
# Set Threads yield
threads->yield();
 
# SetUp some thread-shared variables
my $commands :shared;
$commands = &share([]);
 
my $run :shared;
$run = &share({});
$run->{'status'} = 1;
 
my $sema :shared;
$sema = &share({});
 
# Local array where all syncer-threads are stored
my @threads;
 
# Create a thread for each syncer
my $maxid = -1;
for (my $i=0;$i<=$#cmds;$i++) {
	print "Starting syncer $i\n" if $debug;
	$sema->{$i} = Thread::Semaphore->new(0);
	my $syncer = threads->create('syncJob',$run,$sema,$i,$commands,$path,$cmds[$i],$debug);
	push(@threads,$syncer);
	$maxid = $i;
}
 
# Create the Checker thread, which cleanup the jobs and ensures the function of all syncers
$sema->{'checker'} = Thread::Semaphore->new(0);
my $syncer = threads->create('JobChecker',$run,$sema,$commands,$maxid,$debug);
push(@threads,$syncer);
 
# Create the audit thread
print "Starting audit\n" if $debug;
my $auditthread = threads->create('audit',$run,$sema,$commands,$path,$maxid,$debug);
print "Waiting for audi to terminatet\n" if $debug;
$auditthread->join();			# If the audit-thread gets joinable, we have to terminate everything
 
# Terminate all threads and cleanup 
$run->{'status'} = 0;
while ($#threads >=0 ) {
	my $worker = shift(@threads);
	print "Shutdown of syncer ...\n" if $debug;
	$worker->join();
}
print "Shutdown clean completed\n" if $debug;
exit(0);
 
########################################################################################################################################
########################################################################################################################################
 
####################################################################################################
# audit thread
sub audit {
	my $r = shift;
	my $sp = shift;
	my $c = shift;
	my $p = shift;
	my $m = shift;
	my $d = shift;
 
	print "       audit started ...\n" if $d;
	# open listener on the audit device
	open(STATUS, "/usr/sbin/praudit /dev/auditpipe |") || die "can't fork: $!";
	while (<STATUS>) {
		my $line = $_;
		last if ($line eq '' || $r->{'status'}<=0);		# Terminate if audit terminated
		if ($line =~ /path,$p(.+)/) {					# Check if the changed file is in the observed path
			my $file = $1;
			print "Change detected on file: $file\n" if $d;
			my $hash :shared;							# Create a command for the syncers
			$hash = &share({});
			$hash->{'file'} = $file;
			$hash->{'status'} = '';
			$hash->{'time'} = '';
			for (my $j=0;$j<=$m;$j++) {					# init job done charta
				$hash->{$j} = 'no';
			}
			if (1) {
				lock($c);
				push(@{$c},$hash);
				print "Added new job for $file\n" if $d;
			}
			for (my $j=0;$j<=$m;$j++) {					# wakeup syncers
				$sp->{$j}->up();
			}
			$sp->{'checker'}->up();
		}
	}
	close STATUS || die "audit not closed correctly: $! $?";	
	print "       audit terminated ...\n" if $d;
	return(0);
}
 
####################################################################################################
# syncer thread
sub syncJob {
	my $r = shift;
	my $sp = shift;
	my $id = shift;
	my $c = shift;
	my $p = shift;
	my $e = shift;
	my $d = shift;
 
	print "       syncer $id started ...\n" if $d;
 
	while ($r->{'status'}>0) {
		if ($#{$c}>=0) {													# if there are any jobs to be done
			for (my $j=0; $j<=$#{$c}; $j++) {
				next if ($c->[$j]->{$id} eq 'ok');							# if my job is already done skip this job and check next
				my $file = $c->[$j]->{'file'};
				if (-e $p.$file)  {											# check if the file is existing
					$c->[$j]->{$id} = 'working';							# mark this job as being worked on
					my $dif = 1;
					while ($dif>0) {										# check if the file is in upload and changes size within 1,5 secs
						print "Checking Filesize ...\n" if $d;
						my $ssize = -s $p.$file;
						sleep(1.5);
						my $eesize = -s $p.$file;
						$dif = $eesize - $ssize;
						print "Checking Filesize $ssize - $eesize = $dif\n" if $d;
					}
					my $cm = $e;
					$cm =~ s/<!--target-->/$file/g;
					system($cm);											# rsync to other server
				}
				lock($c);
				$c->[$j]->{$id} = 'ok';										# mark job as done for me
			}
		}
		$sp->{$id}->down();
	}
	print "       syncer $id terminated ...\n" if $d;
	return(0);
}
 
####################################################################################################
# checker thread that checks if all jobs are done
sub JobChecker {
	my $r = shift;
	my $sp = shift;
	my $c = shift;
	my $m = shift;
	my $d = shift;
 
	print "       checker started ...\n" if $d;
 
	while ($r->{'status'}>0) {
		while ($#{$c} >= 0 && $r->{'status'}>0) {
			print "    Checker loop ...\n" if $d;
			my $rem = 0;
			foreach my $job (@{$c}) {						# loop through all jobs
				my $mem = 'ok';
				for (my $j=0;$j<=$m;$j++) {					# check job done charta
					if ($job->{$j} eq 'no') {				# job not handled
						$mem = 'no' if ($mem ne 'working');	# job not handled (may never override a job in progress state)
					} elsif ($job->{$j} eq 'working') {		# job in progress (always overrides not handled)
						$mem = 'working';
					}
				}
				# Job not completed
				if ($mem eq 'no') {							
					if ($job->{'time'} eq '') {				# Set timestamp to know, how long this job is already waiting
						$job->{'time'} = time;
					} else {								# Job already got a timestamp
						my $watch = time - $job->{'time'};
						print "Job age: $watch\n" if $d;
						if (time - $job->{'time'} > 300) {	# Job has waited for more than 5 minutes. terminate program
							print "TIME FOR JOB EXCEEDED - shutting down syncer";
							$r->{'status'} = 0;
							for (my $j=0;$j<=$m;$j++) {		# wakeup syncers
								$sp->{$j}->up();
								$sp->{'checker'}->up();		# wakeup ourself
							}
						}
					}
				} elsif ($mem eq 'working') {				# job in progress - just actualize the timestamp
					$job->{'time'} = time;
				} else {
					$job->{'status'} = 'complete';			# job is completely done and is marked for being removed
					$rem = 1;
				}
			}
			# Job to remove available
			if ($rem > 0) {
				lock($c);									# lock the command-queue
				my @arr;
				for (my $j=0;$j<=$#{$c};$j++) {				# store all not handled jobs / drop completed jobs
					my $ex = shift(@{$c});
					if ($ex->{'status'} ne 'complete') {
						push(@arr,$ex);
					}
				}
				for (my $j=0;$j<=$#arr;$j++) {				# put all stored (not finished) jobs back into the command queue
					push(@{$c},$arr[$j]);
				}
			}
			print "    Checker reloop ...\n" if $d;
			sleep(1);
		}
		print "    Checker sleeping (".$#{$c}.")...\n" if $d;
		$sp->{'checker'}->down();
	}
 
	print "       checker terminated ...\n" if $d;
	return(0);
}

This script does the whole magic: It listens via the audit-system for files changed or added and then uses rsync to sync the file to the other systems. And here we come to the part why we need a symbolic link to the data directory: when the script uses rsync to copy a file to a second system, the audit-system of this second system will notify the script there about this change. So the script on the second system would start to copy the file back to the first system and so on. So if you do not use a symbolic link for the rsync, you will create an endless loop of copy and recopy-processes!

Installation and configuration of the script is quite easy

Copy this script to each system that should be kept in sync with the others. I recommend to observe this script via daemontools too. Then edit the script on each system as shown below:

$debug can be set to 0 (for no debug output) or 1 (for debugging output).
$path should be set to the physical path of your data.

For each system that should be kept in sync add the following line. Please keep in mind to increase the number in the square-brakets ($cmds["number"]) by 1 in each line:

$cmds[0] = ‘/usr/local/bin/rsync -raz –progress –size-only /path/to/symboliclink/data/ rsync:///shareName/‘;

Some last important informations:

Before changing anything on your system, make sure you have a complete backup of your system! The usage of this script and howto is at your very own risk. So if you suffer any data-losses by using this howto or the script you can not hold me responsible for this.

To get close to a “realtime sync”, the script starts an own thread for each volume to keep in sync. So you need to have a perl-installation that is thread-enabled.

Post to Twitter Tweet This Post

Best XML module for perl? XML::LibXML vs. XML::Mini::Document

09:54 AM 2 comments

In my last post, I took a look on XML::Simple, a (as called) simple and easy to understand module for perl to deal with XML structures. Today we’ll take a look on the module XML::Mini::Document.

Short information:
3 of 5 stars rating on cpan
Last modification date: 05 Feb. 2008

XML::Mini::Document is a module, that has more the “look and feel” of a XML-parser, but also offers to parse XML to and from Hash-structures. The documentation and the examples are well written, so you will get along with this module quite easy and fast.

We will use the same test-xml, we used for XML::Simple:

<?xml version="1.0" encoding="iso-8859-1"?>
<test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml module"><deepinfo>last text here</deepinfo></info>
</test>

We’ll do the same operations as with XML::Simple before. The syntax is slightly different. The sourcecode of the new testcase looks therefor like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(FILE, "<$input") || die "Error: File not readable.\n";
my @lines = <FILE>;
close(FILE);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
 
my $xmlHash = $xmlDoc->toHash();
 
# Adding an attribute and a text to the first node
$xmlHash->{'test'}->{'info'}->[0]->{'addon'} = 'valid text';
$xmlHash->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";
 
# Adding an attribute and an unescaped text to the second node
$xmlHash->{'test'}->{'info'}->[1]->{'addon'} = 'invalid unescaped text';
$xmlHash->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";
 
# Adding a third node unescaped text to the second node
$xmlHash->{'test'}->{'info'}->[2]->{'addon'} = 'valid unescaped text in cdata';
$xmlHash->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";
 
# my $newDoc = XML::Mini::Document->new();
$xmlDoc->fromHash($xmlHash);
open(DATEI, ">output_XMLMini") || die "Datei nicht gefunden";
print DATEI $xmlDoc->toString();
close(DATEI);

After running the test, the resulting XML looks like this:

<test>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			valid text
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			Here is an valid xml text using linebreaks
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			invalid unescaped text
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<addon>
			valid unescaped text in cdata
		</addon>
		<attr1>
			perl
		</attr1>
		<content>
			&lt;![CDATA[Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?]]&gt;
		</content>
	</info>
	<info>
		<attr2>
			xml module
		</attr2>
		<attr1>
			perl
		</attr1>
		<deepinfo>
			last text here
		</deepinfo>
	</info>
	<attr2>
		2
	</attr2>
	<attr1>
		1
	</attr1>
	<another>
		&amp;lt;&amp;gt;
	</another>
	<debug>
		0
	</debug>
</test>
<xml>
	<version>
		1.0
	</version>
	<encoding>
		iso-8859-1
	</encoding>
</xml>

*Outch* – what the heck did happen to our XML?! After the first shock, we see, that the attributes are all put into a new generated node. This is a valid syntax, and all information is still kept in the file (the good point is, that unlike XML::Simple the “deepinfo”-node holds its correct position in XML::Mini::Document). But the problem, that attributes and nodes are mixed up by the module remains the same.

Another big problem is dealing with CDATAs. XML::Mini::Document escapes the CDATA-node and so is not able to write correct CDATA nodes.

And the most evil problem is the treatment of the “xml information node”. It is moved to the bottom and handled like it was a normal XML-node.

First result:

The “fromHash” and “toHash” method in XML::Mini::Document is technical crap, that can only be used on small xmls without attributes and CDATA nodes.

But there is a second way to use XML::Mini::Document

Unlike XML::Simple, dealing with the XML-structure by parsing it from and to a hash isn’t the only way in XML::Mini::Document. It also gives a toolbox to deal with the structure directly. This we want to test next and therefor rewrite our little testcase:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(FILE, "<$input") || die "Error: File not readable.\n";
my @lines = <FILE>;
close(FILE);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
my $xmlRoot = $xmlDoc->getRoot();
 
my $firstnode = $xmlDoc->getElementByPath('test/info');
$firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");
 
my $secondnode = $xmlDoc->getElementByPath('test/info',1,2);
$secondnode->attribute('addon','invalid unescaped text');
$secondnode->text("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");
 
my $testnode = $xmlDoc->getElementByPath('test');
my $newchild = $testnode->createChild("info");
$newchild->attribute('addon', "unescaped text for a CDATA");
$newchild->cdata("Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?");
 
open(DATEI, ">output_XMLMini3") || die "Datei nicht gefunden";
print DATEI $xmlDoc->toString();
close(DATEI);

The resulting XML looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<test another="&lt;&gt;" attr1="1" attr2="2" debug="0">
	<info addon="Here is an valid xml text
using linebreaks" attr1="perl" attr2="xml module" />
	<info addon="invalid unescaped text" attr1="perl" attr2="xml module">
		Here is an valid xml text using linebreaks and unescaped characters like &lt; and &gt; do you see the and?
	</info>
	<info attr1="perl" attr2="xml module" />
	<info attr1="perl" attr2="xml
                        module">
		<deepinfo>
			last text here
		</deepinfo>
	</info>
	<info addon="unescaped text for a CDATA">
<![CDATA[ Here is an valid xml text
using linebreaks
and unescaped characters like < and >
do you see the and? ]]> 
	</info>
</test>

*Wow* – the first and very bad image turns a lot better! The structure is complete, looks good, is escaped where it should be and not, where it shouldn’t. In short words: The XML looks quite fine! After this, it is definitifly worth to take a closer look on XML::Mini::Document. And this we will do in the next chapert.

What happens to more complex XML-structures, containing CDATAs and encoded text when parsing?

To test this, we modify our little test-case again,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are Elitecoderz!
use strict;
use XML::Mini::Document;
use XML::Mini;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
 
# Direct, dirty, but simple reading of a file
open(FILE, "<$input") || die "Error: File not readable.\n";
my @lines = <FILE>;
close(FILE);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Mini
my $xmlDoc = XML::Mini::Document->new();
eval {
	$xmlDoc->parse($XMLString);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
my $xmlRoot = $xmlDoc->getRoot();
 
my $firstnode = $xmlDoc->getElementByPath('test/info');
$firstnode->attribute('addon', "Here is an valid xml text\nusing linebreaks");
 
my $cdatanode = $xmlDoc->getElementByPath('test/info',1,5);
print "\n\n--\n".$cdatanode->getValue."\n--\n\n";
 
my $textnode = $xmlDoc->getElementByPath('test/info',1,2);
print "\n\n--\n".$textnode->getValue."\n--\n\n";
 
open(DATEI, ">output_XMLMini4") || die "Datei nicht gefunden";
print DATEI $xmlDoc->toString();
close(DATEI);

and read again from the previously generated xml-file and see the output:

--
 Here is an valid xml text
 using linebreaks
 and unescaped characters like < and > do you see the and?
--

--
Here is an valid xml text
 using linebreaks
 and unescaped characters like &lt; and &gt;
 do you see the and?
--

The result for the CDATA looks quite nice, but the text from our second node unfortunatly does not get unescaped. Well – it’s no big deal to do this your own, but it is also not really my part to do this for XML::Mini::Document.

Result:
Do not try to use the “fromHash()” method, unless you want to destroy your XML. The “toHash()” method can be used to gather quickly information from a XML-file, but that is really all you can use this method for.

The toolbox for directly manipulating XML-structures is really nice. Especially the syntax for getting elements by path is something I really found usefull! Nice idea – great implementation!

XML::Mini::Document deals with all kind of XML-structures and does the escaping automatically. Unescaping of text-nodes must be done manually. That is a bug, that prevent me from using this module.

XML::Mini::Document is the module of your choice, if you have to deal with more complex structures. Since unescaping is a problem I would not recommend to use this module on really huge and sensitive XML-structures. For everything else XML::Mini::Document is easy to use and quick to implement.

The next post will be about “Getting your life compfortable with XML::LibXML”.
In all the tests above XML::LibXML shows no weakness and I could not find any problems. But there are a few other “traps” in XML::LibXML you’ll have to deal with.

Post to Twitter Tweet This Post

Best XML module for perl? XML::LibXML vs. XML::Simple

06:06 PM 2 comments

XML is one of the powerfullest, often missused data-structures these days. To express huge object- and data-structures XML is an easy and elegant way to save and transport all kind of informations. There are XML-parser for nearly all systems, and there are a lot of modules for nearly all IDEs to deal with XML in a more or less easy way.

Due to some trouble with choosing the right “XML toolkit” from cpan, I’ll give you a short overview over the three most important XML perl modules and my experiences with them.

But first I’ll have to make some statements (the usual way to begin a good program ;) ):

Appeal for good XML structures

In the beginning I wrote about the “missuse” of XML; first I’ll give you an example what I understand on “missuse of XML”: The position of XML siblings in equal rank in hirarchy within the document should make no difference; In the following, I’ll give you an example where the position of the command-siplings does matter!

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command action="echo" option="Message" target="/tmp/test.txt"/>
  <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
</myxml>

This is an example for a worse XML-structure! If (for any reason) the two command-siblings are switched, the email will be sent before generating the content.

A better solution for this would be:

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command step="2" action="mail" option="/tmp/test.txt" address="some@email.com"/>
  <command step="1" action="echo" option="Message" target="/tmp/test.txt"/>
</myxml>

Here the attribute “step” is used to order the single commands. So switching the position of the two siblings does not matter (as the program will be able to handle the commands still in the correct order). An alternative way that would be ok is the following structure:

<?xml version="1.0" encoding="iso-8859-1"?>
<myxml general="valid" xmlinfo="important">
  <command action="echo" option="Message" target="/tmp/test.txt">
    <command action="mail" option="/tmp/test.txt" address="some@email.com"/>
  </command>
</myxml>

The main message is: Prepare to get your xml sorted in a new way by bad xml modules! Another reason is: It is absolutly no fun in development when you have to count lines to get the correct value you want (use your computer for counting and do not do it yourself!).
And, when the xml gets changed someday it can happen, that you have to change your programs too, since they have to count from another line. Unfortunatly, this is one of the most frequently occuring crap I have to deal with at customers I’m supporting. :(

Back to the XML modules for perl

First let’s take a look on XML::Simple

Short information:
4 of 5 stars rating on cpan
Last modification date: 15 Aug 2007
Actual ratings show, that the module still is used widely

XML::Simple really is simple to use. After reading a few minutes in the well written documentation, I was able to read, work and write XML-documents with XML::Simple. A really nice feature is, that you do not have to deal with the XML-structures yourself. The XML is parsed into an hash-array construct that represents the xml-structure and is easy to use and (for small changes) easy to change.

First I created a xml-file to read from:

<?xml version="1.0" encoding="iso-8859-1"?>
 <test debug="0" attr1="1" attr2="2" another="&lt;&gt;">
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module" />
  <info attr1="perl" attr2="xml module">
    <deepinfo>last text here</deepinfo>
  </info>
</test>

The I wrote a short test-unit:

#!/usr/bin/perl
# Simple XML module test unit
###
# We are strict, cauz we are elitecoderz!
use strict;
use XML::Simple;
 
################
# Get / Check Parameter (here we get the xml file we wanna deal with)
if ($#ARGV+1 != 1) {
	print "Error: Wrong number of parameters.\n";
	exit(1);
}
my $input = $ARGV[0];
chomp($input);
 
################
# Read Inputfile / check content / validate XML structure
# Some modules are able to read directly from a file; for easy going, we use this method here.
# Direct, dirty, but simple reading of a file
open(FILE, "<$input") || die "Error: File not readable.\n";
my @lines = <FILE>;
close(FILE);
 
# Put the lines into one string for this parser
my $XMLString = join(' ',@lines);
 
####################################################################
# XML::Simple
my $ref;
eval {
	$ref = XMLin($XMLString, KeepRoot => 1);
};
if ($@) {
	print "Error: XML parsing error: $@\n";
	exit(1);
}
 
# Adding an attribute and a text to the first node
$ref->{'test'}->{'info'}->[0]->{'addon'} = 'valid text';
$ref->{'test'}->{'info'}->[0]->{'content'} = "Here is an valid xml text\nusing linebreaks";
 
# Adding an attribute and an unescaped text to the second node
$ref->{'test'}->{'info'}->[1]->{'addon'} = 'invalid unescaped text';
$ref->{'test'}->{'info'}->[1]->{'content'} = "Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?";
 
# Adding a third node unescaped text to the second node
$ref->{'test'}->{'info'}->[2]->{'addon'} = 'valid unescaped text in cdata';
$ref->{'test'}->{'info'}->[2]->{'content'} = "<![CDATA[Here is an valid xml text\nusing linebreaks\nand unescaped characters like < and >\ndo you see the and?]]>";
 
open(FILE, ">$input\_XML-Simple") || die "Datei nicht gefunden";
print FILE XMLout($ref, NoSort => 1, KeepRoot => 1, NoEscape => 1);
close(FILE);
exit(0);

The resulting XML looks like this:

<test attr2="2" attr1="1" another="<>" debug="0">
  <info attr2="xml module" addon="valid text" attr1="perl">Here is an valid xml text using linebreaks</info>
  <info attr2="xml module" addon="invalid unescaped text" attr1="perl">Here is an valid xml text using linebreaks and unescaped characters like < and > do you see the and?</info>
  <info attr2="xml module" addon="valid unescaped text in cdata" attr1="perl"><![CDATA[Here is an valid xml text using linebreaks and unescaped characters like < and > do you see the and?]]></info>
  <info attr2="xml module" attr1="perl" deepinfo="last text here" />
</test>

Summary to XML::Simple

As you can see (and set by NoEscape => 1 in the XMLout), “evil” characters like the <> are not escaped automatically. That is quite ok, since we set the Option to “1″.
But with autoescaping enabled, the CDATA-part get’s escaped too, and that isn’t really funny, since the CDATA structure get’s damaged by this.
So, after reading from XML, you have to escape all contents manually, before writing the structure again. You have to care about which content must be escaped and which not. A lot of work and a good point for making errors.
Did you become aware of the most evil error in the XML?

<info attr1="perl" attr2="xml module">
    <deepinfo>last text here</deepinfo>
</info>

became:

<info attr2="xml module" attr1="perl" deepinfo="last text here" />

The sibling “deepinfo” was completely removed and put into the parent as attribute. And this is really a bad bad problem, since when you expect the <deepinfo> as sibling, you won’t look for it as an attribut in the parent.

Result:
If you do not work with complex XML-structures that need to be escaped and if XML::Simple is the only module that will read and write the XML-Files, then this is your module of choice!

The next post will be about XML::LibXML vs. XML::Mini::Document. After this, I’ll go on with a series about “getting your life compfortable with XML::LibXML”.
As you can see by this:
I strongly recommend you XML::LibXML. Why I come to this conclusion, you’ll get in the following posts ;)

Update:
Next chapter: XML::LibXML vs. XML::Mini::Document

Post to Twitter Tweet This Post

Categories: Perl Tags: , , , , ,

Converting Windows characters to Mac and vice versa – Filter for BBEdit

06:27 PM No comments

When working with scripts written in perl or php, the encoding of special german characters like “ü” (ue), “ö” (oe) and “ä” (ae) can’t be set correctly, since the file-encoding needs to be set to “Mac OS Roman” with “Unix Linefeeds (LF)”. So these special characters, called “Umlaute” gets mapped to untypable characters in the ASCII-table.

Due to the simple and effective integration of perl into BBEdit, there is an easy solution for this problem: A trivial perl script with some Regular Expressions, that replace all characters within a selection by the correct character.

The script for converting Windows to Mac looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/perl -w
 
while(<>) {
	my $line = $_;
	$line =~ s/ƒ/Ä/g;
	$line =~ s/÷/Ö/g;
	$line =~ s/‹/Ü/g;
	$line =~ s/fl/ß/g;
	$line =~ s/‰/ä/g;
	$line =~ s/ˆ/ö/g;
	$line =~ s/¸/ü/g;
	print $line;
}

And verci versus: the script for Mac to Windows looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/perl -w
 
while(<>) {
	my $line = $_;
	$line =~ s/Ä/ƒ/g;
	$line =~ s/Ö/÷/g;
	$line =~ s/Ü/‹/g;
	$line =~ s/ß/fl/g;
	$line =~ s/ä/‰/g;
	$line =~ s/ö/ˆ/g;
	$line =~ s/ü/¸/g;
	print $line;
}

You can also download the two scripts here: Download the scripts for free

Installation:

Copy the two files into your BBEdit “Application Support”-folder, located in your userfolder at:

~/Library/Application Support/BBEdit/Unix Support/Unix Filters/

Unix Filters directory after installation

Unix Filters directory after installation

So your “Unix Filters”-directory will now look something like this, as showed in the picture right standing.

If you create the scripts yourself, please keep in mind that the linefeed format of the file must be set to “Unix (LF)” for the scripts to work properly.

Here you find your new Filter

Here you find your new Filter

After you installed the script, you have to restart BBEdit. To use the filter, simply select the text you want to change. Then select the Filter you want to apply from the “#!” menu to do the conversion.

Additionally characters can be added to this example. Please keep in mind, that you may not break the Regular Expression. A good reference for Regular Expressions can be found at http://de.selfhtml.org.

This is an easy way to deal with a correct ISO-Latin 1 (ISO 8859-1) under BBEdit, using the Mac Roman encoding without having any trouble.

Example:

Here you can see an example of the result of the Filter:

A text in Mac-Roman, selected for conversion

A text in Mac-Roman, selected for conversion

After selecting “Konv Mac>Win.pl”:

The characters within the selection got converted

The characters within the selection got converted

Post to Twitter Tweet This Post