Introduction

XML::Comma is an information management platform. Comma speeds the development of content-heavy, networked applications, and was designed to solve some of the problems that make managing extremely large web sites so expensive, difficult and tedious.

Comma is written mostly in Perl, and its target demographic is the Perl programmer who must build customized, complex systems that handle very large amounts of dynamic content. Like most software that is designed to be used by programmers to build other software, Comma is several things at once: a code library, a design framework, a development methodology and a runtime system all rolled into one. However, Comma's central philosophy is "play well with others," and the system depends heavily on a number of tools -- the Apache web server and its mod_perl extensions, the HTML::Mason web development environment, relational databases, the underlying filesystem and OS utilities -- to implement its functionality and to provide programmers with a complete, flexible, scalable, and familiar toolkit.

Comma shapes information into "documents," and -- as its (full) name implies -- uses XML to structure those documents. XML, like Perl, is a powerful and standard tool for organizing text. But XML, again like Perl, doesn't do much of anything by itself. Comma defines a number of discrete "processes" in the "life-cycle" of a document and provides a framework that abstracts basic activities common to those processes. These frameworks include structuring and validation; long-term storage; programmatic manipulation; and indexing for fast sorting, categorization and retrieval.

This document describes the most recent release of comma, comma-1.90. For documentation of comma 1.x, please view this page instead, or for more information about the different branches, please see the list of comma 2.0 enhancements.

Installation

Dependencies

XML::Comma requires that Perl, a number of CPAN modules, and a relational database be installed in order to function properly. The Perl version must be 5.005 or greater. The basic required CPAN modules (more may be used by additional parts of Comma) are Class::ClassDecorator, Compress::Zlib, Crypt::Blowfish, Crypt::CBC, DBI, Digest::HMAC_MD5, Inline, Lingua::Stem, Math::BaseCalc, PAR, Storable, and String::CRC. As of comma-1.90, the database must be MySQL via DBD::mysql, as the postgres and SQLite backends are non-functional due to lack of interest (if you need these or some other backend, please let us know).

Comma installs in the usual make, make test and make install fashion. The tests, however, won't run until the Comma/Configuration.pm file has been edited to configure a number of standard variables to values that are appropriate for your system. Comma/Configuration.pm controls the overall system configuration, and the version that is in the build directory will be copied to the appropriate location on your machine during the make install operation.

Comma/Configuration.pm contains a package declaration and then a __DATA__ section divider. Everything after the __DATA__ line is configuration information, in the form of a big list of eval'able key/value pairs. Each key specifies the name of a configuration variable, and each value is accessible as a top-level Comma method, for example:

# the top of my Configuration.pm looks like this:

package XML::Comma::Configuration;
use base 'XML::Comma::Pkg::ModuleConfiguration'; 1;
__DATA__

comma_root          =>     '/usr/local/comma',
log_file            =>     '/usr/local/comma/log.comma', 
document_root       =>     '/usr/local/comma/docs',
sys_directory       =>     '/usr/local/comma/sys',
tmp_directory       =>     '/tmp',

defs_directories    =>
  [
   '/allafrica/comma/defs',
   '/usr/local/comma/defs',
   '/usr/local/comma/defs/macros',
   '/usr/local/comma/defs/standard',
   '/usr/local/comma/defs/test'
  ],

###
###

# so, on my system, this assigns '/usr/local/comma' to $str 
my $str = XML::Comma->comma_root();

# and, similarly
my $first_defs_directory = XML::Comma->defs_directories()->[0];

The Configuration.pm file that comes with the distribution fully specifies all of the possible configuration variables, and includes reasonable defaults for all those for which reasonable defaults are likely. Just think of the configuration block as a big hash assignment -- so pretty much any Perl code is, at least theoretically, allowed.
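The "big hash assignment" idea can be sketched in plain Perl. This is only an illustration of the technique, not Comma's actual loading code: the variable $config_text is a hypothetical stand-in for the text that follows the __DATA__ line, and the way it is evaluated here is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text after __DATA__ in Configuration.pm.
my $config_text = <<'END';
comma_root       => '/usr/local/comma',
tmp_directory    => '/tmp',
defs_directories => [ '/usr/local/comma/defs', '/usr/local/comma/defs/macros' ],
log_file         => '/usr/local/comma' . '/log.comma',
END

# Evaluate the text as the body of a hash assignment. Because the text
# is evaluated as Perl, values may be expressions, not just literals.
my %config = eval "( $config_text )";
die "could not evaluate configuration: $@" if $@;

print "$config{comma_root}\n";             # /usr/local/comma
print "$config{defs_directories}[0]\n";    # /usr/local/comma/defs
print "$config{log_file}\n";               # /usr/local/comma/log.comma
```

The log_file entry shows why "pretty much any Perl code" is allowed: the value is computed by a concatenation when the block is evaluated.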

Configuration Variables

Using the SimpleC Parser

The SimpleC parser module requires that the Inline and Inline::C modules be installed on your system. After editing Comma.pm to specify SimpleC as the system parser, run make test as root. The test scripts should attempt to compile SimpleC and cache the results in Comma's tmp directory. If all goes well, the compiled module will be available to all users of the system. It must be admitted, however, that we have abused the Inline mechanisms a bit to achieve the dynamic loading that Comma's config methods require. If Inline::C passes all its tests, but SimpleC doesn't work for you, don't hesitate to let us know.

Documents and DocumentDefinitions

An XML::Comma system stores pieces of information as Documents. The structure and basic behaviors of the Documents in each system are described by DocumentDefinitions. This section introduces Documents and DocumentDefinitions. We will mostly refer to Documents as Docs and DocumentDefinitions as Defs; this saves typing and is consistent with the Perl API.

A Simple Doc and Def

Here is a simple sample Doc, showing the beginnings of a structure that could be used to keep track of information about a registered user of a web site. We'll use this example as we go along, adding features and providing example pieces of code.

<User>
  <username>kwindla</username>
  <email>kwindla@xymbollab.com</email>
  <full_name>Kwindla Hultman Kramer</full_name>
</User>

That's pretty self-explanatory. The whole thing is XML, with a very simple structure. Here is the corresponding Def:

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>
</DocumentDefinition>

Still pretty simple, so far. For your Comma installation to recognize Docs of the User type, it suffices to put the above Def in a file called User.def somewhere down the defs_directories path. If you're following along at the keyboard, you can do that, now, and you'll be able to try out the code examples that follow.

Basic Manipulation: new(), element(), set() and get()

The most basic parts of the Comma API are the methods that manipulate the elements of a Doc. Let's write a little Perl program to make an "empty" User Doc, set its three elements, and then print the result:

use XML::Comma;
my $doc = XML::Comma::Doc->new ( type=>'User' );
$doc->element('username')->set ( 'kwindla' );
$doc->element('email')->set ( 'kwindla@xymbollab.com' );
$doc->element('full_name')->set ( 'Kwindla Hultman Kramer' );
print $doc->to_string();

Running that program should print out something very similar to the sample Doc, above. (The only difference should be that the three elements are not indented. There's a way to do that, too, but we'll cover the subtleties of to_string() later.)

What did we do, there? Well, let's take the program line by line.

The first line tells Perl that we're going to be using the XML::Comma framework. All of the Comma modules that we'll need -- such as XML::Comma::Doc -- are pulled in by this statement.

The second line creates a new Doc object. The Doc->new() method takes a parameterized argument type, specifying which DocumentDefinition we want our Doc to adhere to.

The next three lines set the contents of the three elements in the Doc. The three statements are completely independent; we could have placed them in any order. We can break these lines up further, to clarify what's going on. Here is the username line in two separate statements:

my $username_element = $doc->element('username');
$username_element->set ( 'kwindla' );

First, the element() method selects for us the element that we're interested in, taking a single argument -- the name of the element, and returning a reference to an Element object. Then we call that object's set method. set() takes a single argument, too, a string which will become the content of the Element.

The final line of the little program prints out our Doc. The to_string() method generates a string of XML text that completely represents the contents of the Doc.

One more basic method call is worth mentioning here: get(). As you might expect, get() is the opposite of set(). It takes no arguments, and returns the contents of an Element as a string:

my $username = $username_element->get();

More Complex Structures: Nested Elements

The Doc so far is very simple: it contains three elements, each of which contains some string-ish content. But we can do better than that: we can introduce elements that, themselves, contain other elements. If we add an address element to the Doc, it might look like this:

<User>
  <username>kwindla</username>
  <email>kwindla@xymbollab.com</email>
  <full_name>Kwindla Hultman Kramer</full_name>

  <address>
    <street1>922 M Street SE</street1>
    <city>Washington</city>
    <state>DC</state>
    <zip>20003</zip>
  </address>
</User>

Corresponding changes in the Def are necessary, of course:

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
    <element><name>country</name></element>
  </nested_element>
</DocumentDefinition>

The new address element is declared as a nested_element. This means that it will serve as a container for other elements, and will not have content of its own. Comma enforces this distinction between simple and nested elements -- an element can have string content, or it can serve as a container for other elements, but it cannot do both.

You might infer from the above that a nested element will not have set() and get() methods, but rather, like a Doc, will provide an element() method. If so, you infer correctly. To get at the pieces of the address, we can simply "walk down the tree", using the methods we already know about.

my $address = $doc->element('address');
my $formatted_address = $address->element('street1')->get() . "\n";
if ( $address->element('street2')->get() ) {
  $formatted_address .= $address->element('street2')->get() . "\n";
}
$formatted_address .= $address->element('city')->get() . ',' .
                      $address->element('state')->get() . '  ' .
                      $address->element('zip')->get();

In fact, a Doc is itself a nested element -- all of the methods that are available for manipulating nested elements are available for Docs, as well. When we talk in more detail about nested elements, we'll often call the nested element a container, and the elements that it contains sub-elements. Just keep in mind that when we describe nested element operations it doesn't matter whether the container is a Doc or a nested element. In a similar vein, elements can be nested as deeply as you want; you just have to declare the nesting in the Def. (And there's even a way to specify arbitrarily deep recursive nesting, but that's best covered in another section entirely.)

Plural Elements

What if we want to store more than one address? We might, like Amazon, keep a number of shipping addresses on file for each user. To do so, we add a line to the Def, declaring that the address element is plural.

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
    <element><name>country</name></element>
  </nested_element>

  <plural>'address'</plural>
</DocumentDefinition>

Note the quotes around address, in the new line. The contents of the plural specifier are evaluated as a Perl expression when the Def is loaded into the system, and the return value of that expression must be a list of elements that the system will allow to be plural.

We gain a lot of flexibility here, by treating a piece of a Def as a bit of Perl code. The price for this flexibility is a little bit of added complexity: the contents of the plural tag must create a valid Perl list. In this case, that means putting quotes around the bareword address. Many other parts of Comma use this same strategy of embedding Perl code into DocumentDefinitions, and we'll see much more sophisticated examples shortly.
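To make the quoting requirement concrete, here is a small plain-Perl sketch of evaluating a plural specifier. It is not Comma's loader: the variable names are hypothetical, and the assumption that the text is evaluated under use strict is ours.

```perl
use strict;
use warnings;

# Hypothetical contents of a <plural> tag naming two plural elements.
my $plural_text = q{ 'address', 'nickname' };

# Evaluate the text in list context to get the element names.
my @plural_names = eval $plural_text;
die "plural specifier did not compile: $@" if $@;

my %is_plural = map { $_ => 1 } @plural_names;
print join( ' ', sort keys %is_plural ), "\n";   # address nickname

# An unquoted name is a bareword, which strict refuses to compile.
my @unquoted = eval q{ address };
my $bareword_error = $@ ? 1 : 0;
print "bareword rejected\n" if $bareword_error;
```

Because the specifier is just a Perl expression, a form such as qw( address nickname ) would produce the same list.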

The element() method continues to work as it always has. If you re-run the earlier code fragments with the new Def in place, the results will be exactly the same. But our understanding of what element() is doing should change a tiny bit: the method doesn't fetch the only matching element for us, it fetches the first one. And, because elements don't exist in a Doc until we manipulate them, element() must create a new element for us if need be.

For plural elements, we obviously need some more methods. We need a way to fetch elements other than the first one, a way to add a new element, and a way to delete elements that we don't need.

# add a new address
my $address2 = $doc->add_element ( 'address' );
$address2->element('street1')->set ( 'PO Box 0000' );
$address2->element('city')->set ( 'Anyplace' );
$address2->element('state')->set ( 'ZZ' );
# add another new address
my $address3 = $doc->add_element ( 'address' );
# change my mind, delete that element
$doc->delete_element ( $address3 );
# get a list of address elements
my @addresses = $doc->elements ( 'address' );

The add_element() method takes a single argument, the name of the element to add. It creates a new element of the requested kind, appends that element to the container, and returns the newly-created element. To ask a container to add an element that is not plural, if there is already an element of that kind present, is an error. Remember that element() auto-creates elements as required, so it is never necessary to call add_element() for a non-plural element.

The delete_element() method also takes a single argument, but is a bit more complicated. It will accept an element name as a string argument, in which case it deletes the last element of that kind. It will also accept an element object, in which case it will delete that specific element. The method returns true if it deletes anything, false if it does not.

The elements() method accepts a list of element names and returns a list of the elements of those types, in the order that they exist in the container. (In the above example, we only asked for address elements, but we could have asked for username and address elements, or username and full_name and address elements...)

Actually, the return value of elements() is a little trickier than the description above would suggest. In a list context, the method returns an array. But in a scalar context, it returns a reference to an array. This context-awareness makes it possible to write code like:

# quick walk down the tree
my $last_street = $doc->elements('address')->[-1]->element('street1')->get();

This is usually not a problem; most of the time, things just work out as you would expect them to. If you assign the return value to an array, you get an array. If you dereference with a subscript, you get an element of the list. But there is one very important case that does not work as you would expect. You can not do the following!!!

# WRONG way to do something if we've got address elements
if ( $doc->elements('address') ) {...}

The above if statement will always be true, because what if sees is the reference. Instead, you must use constructions like the following for conditional elements-ing:

# do something if we've got address elements
if ( @{$doc->elements('address')} ) {...}
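The context-sensitive return value can be reproduced with Perl's built-in wantarray. The sketch below uses a hypothetical function, elements_like(), as a stand-in for elements(); it demonstrates the pitfall and is not Comma's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a container's elements() method: returns a
# list in list context and an array reference in scalar context.
sub elements_like {
    my @found = @_;
    return wantarray ? @found : \@found;
}

my @list = elements_like( 'a', 'b' );          # list context
my $last = elements_like( 'a', 'b' )->[-1];    # scalar context, then subscript

# A condition is scalar context, so it sees a reference, and a
# reference is always true, even when the array it points to is empty.
my $empty_is_true = elements_like() ? 1 : 0;

# Dereferencing tests the contents of the array instead.
my $empty_has_items = @{ elements_like() } ? 1 : 0;

print scalar(@list), " $last $empty_is_true $empty_has_items\n";   # 2 b 1 0
```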

Methods

An element holds a piece of information. A method generates a piece of information each time it is called. A document definition may supplement its elements, which hold static data, with methods, which return dynamic data.

Suppose we want to provide a method that will display a user's email address modified in such a way as to make things more difficult for the address-collecting web crawlers often used to build spam databases. Here is a method definition that will fetch the contents of the email element, replace the at-sign and periods with text, and return the result:

<method>
  <name>email_anti_spammed</name>
  <code>
    <![CDATA[
      sub {
        my $self = shift();
        my $email = $self->element('email')->get();
        $email =~ s/\@/ (AT) /;
        $email =~ s/\./ (DOT) /g;
        return $email;
      }
    ]]>
  </code>
</method>

A method is expected to have name and code elements. The name is the name by which the method will be called. The code element should be text that, when eval'ed, returns a reference to an anonymous subroutine. It is this subroutine that will be called when the method is invoked.
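The "text that, when eval'ed, returns a reference to an anonymous subroutine" contract can be sketched as follows. The code text, the plain hash used in place of a Doc, and the error handling are all hypothetical; Comma's own handling may differ.

```perl
use strict;
use warnings;

# Text as it might appear inside a <code> element (hypothetical example).
my $code_text = q{
  sub {
    my ( $self, %args ) = @_;
    my $suffix = defined $args{suffix} ? $args{suffix} : '';
    return uc( $self->{name} ) . $suffix;
  }
};

# Evaluating the text yields a code reference that can be called later.
my $sub = eval $code_text;
die "code did not compile: $@" if $@;
die "code did not return a subroutine reference" unless ref($sub) eq 'CODE';

# A plain hash stands in for the Doc that would be passed as $self.
my $fake_doc = { name => 'kwindla' };
my $result   = $sub->( $fake_doc, suffix => '!' );
print "$result\n";   # KWINDLA!
```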

Not too surprisingly, the method() routine calls a method. The email_anti_spammed method could be used as follows:

# set email
$doc->element('email')->set ( 'kwindla@allafrica.com' );
# get munged email: kwindla (AT) allafrica (DOT) com
my $munged = $doc->method('email_anti_spammed');

Methods are often most useful at the top level of a document; they function both as bits of reusable code and as programmatic short-cuts. But methods can be defined as "part" of any element -- not just the top-level Doc. Here is a new definition for the address element that includes a method to generate a formatted block of text suitable for printing on an envelope (some of the code inside this method will be familiar from an earlier example):

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
    <element><name>country</name></element>

    <method>
      <name>formatted</name>
      <code>
        <![CDATA[
          # returns a block-formatted address.
          # takes one optional arg indicating whether the country field
          #   should be included: print_country => 1
          sub {
            my ( $self, %args ) = @_;
            my $formatted_address = $self->element('street1')->get() . "\n";
            if ( $self->element('street2')->get() ) {
              $formatted_address .= $self->element('street2')->get() . "\n";
            }
            $formatted_address .= $self->element('city')->get() . ',' .
                                  $self->element('state')->get() . '  ' .
                                  $self->element('zip')->get();
            if ( $args{print_country} ) {
              $formatted_address .= ' ' . $self->element('country')->get();
            }
            return "$formatted_address\n";
          }
        ]]>
      </code>
    </method>
  </nested_element>

The formatted example demonstrates that methods may make use of arguments. The first argument to method() is the name of the method to be invoked; any arguments after that are passed to the invokee. Here is example usage of the formatted method:

# use a hypothetical &envelope_print sub to generate text on a mailing envelope
envelope_print ( $doc->element('full_name')->get() . "\n" );
envelope_print ( $doc->element('address')->method('formatted', print_country=>1) );

Do What I Mean: Shortcut Syntax

The element() syntax is quite verbose. Comma provides a more concise syntax that reduces the length and unwieldiness of common method calls. This shortcut syntax has a "Do What I Mean" design, which, of course, means that it sometimes doesn't do what you meant.

Shortcuts work via Perl's method AUTOLOAD framework. Any Doc or nested element automatically recognizes Perl methods that have the same name as their defined methods and sub-elements. Because our User Def defines username, email, full_name, address and email_anti_spammed elements and methods, all of the following Perl method calls are allowed:

# top-level User 'shortcut' methods
$doc->username();
$doc->email();
$doc->full_name();
$doc->address();
$doc->email_anti_spammed();

What a shortcut call does depends on what the underlying object referenced is. In the simplest, most useful, and most common case -- here represented by username, email and full_name -- the shortcut fetches the content from the element with the same name as the shortcut.

# get username with less typing
my $username = $doc->username();
# which is the same thing as:
my $username = $doc->element('username')->get();

If the shortcut is called with an argument, then a set() is performed rather than a get().

# set the username
$doc->username ( 'kwindla' );

In the case of a nested element such as address, on the other hand, a get() would make no sense. In the case of a singular, nested element, the shortcut call returns the element. In the case of the address element, which is both nested and plural, the shortcut call returns a list or reference to a list of the address elements.

# 'address' shortcut
my $first_address = $doc->address()->[0];
# which is the same thing as:
my $first_address = $doc->elements('address')->[0];

For a Comma method, such as email_anti_spammed, the shortcut calls the method. So $doc->email_anti_spammed() becomes $doc->method('email_anti_spammed'). It is possible for a method and an element to have the same name; in this case, the shortcut calls the method rather than accessing the element. Comma methods shadow elements of the same name in the context of shortcut calls.

A table of shortcuts and their not-short equivalents is probably the easiest way to describe all of the seven possible ways a shortcut can be resolved. Here then, are the many faces of $x->foo ( [@args] ).

$x->method('foo', @args)                If there is a method named foo
$x->element('foo')                      For singular, nested foo
$x->elements('foo')                     For plural, nested foo
$x->element('foo')->get()               For singular, non-nested foo called with no arguments
$x->element('foo')->set ( $args[0] )    For singular, non-nested foo called with arguments
$x->elements_group_get('foo')           For plural, non-nested foo called with no arguments
$x->elements_group_add('foo', @args)    For plural, non-nested foo called with arguments

We've used examples from the top level of the User Doc, but shortcut methods are applicable to any nested element context. (Indeed, shortcuts are most useful in terms of keystrokes saved when used to shorten multi-level traversals.) Here is a line of code to grab the zip-code of the first stored address in a User Doc:

# a shortcut version of $doc->elements('address')->[0]->element('zip')->get()
$doc->address()->[0]->zip();
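For readers unfamiliar with AUTOLOAD, the following sketch shows the general mechanism behind name-based shortcuts for the simplest case (singular, non-nested content). ShortcutDemo is a hypothetical class invented for this illustration; it stores content in a plain hash and does not reproduce Comma's resolution rules.

```perl
use strict;
use warnings;

package ShortcutDemo;   # hypothetical class, not part of Comma

our $AUTOLOAD;

sub new {
    my ( $class, %content ) = @_;
    return bless {%content}, $class;
}

# Perl calls AUTOLOAD for any method this class does not define.
sub AUTOLOAD {
    my ( $self, @args ) = @_;
    ( my $name = $AUTOLOAD ) =~ s/.*:://;   # strip the package prefix
    return if $name eq 'DESTROY';           # never treat DESTROY as a shortcut
    die "no such element: $name" unless exists $self->{$name};
    $self->{$name} = $args[0] if @args;     # called with an argument: set
    return $self->{$name};                  # return the current content
}

package main;

my $doc = ShortcutDemo->new( username => 'kwindla', email => '' );
$doc->email('kwindla@xymbollab.com');       # set via shortcut
print $doc->username(), "\n";               # kwindla
print $doc->email(), "\n";                  # kwindla@xymbollab.com
```

Calling an undeclared name, such as $doc->phone(), dies in this sketch, which mirrors the idea that shortcuts exist only for names the Def declares.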

Nested Element Helper Methods: elements_group_get() and Friends

Shortcuts are one kind of convenience method; they're not strictly necessary but do save typing and make code easier to read. Another set of convenience methods is supported by nested elements: the group helpers. These methods make it possible to manipulate instances of a non-nested, plural element as a single group. To demonstrate, we first need to add a simple, plural element to our User Def. In an even more contrived attempt to come up with an example than normal, let's allow a user to be known by a number of nicknames.

<element><name>nickname</name></element>
<plural>'nickname'</plural>

A Doc that includes several nickname elements might look like this:

<User>
<username>kwindla</username>
<email>kwindla@xymbollab.com</email>
<full_name>Kwindla Hultman Kramer</full_name>
<nickname>Junior</nickname>
<nickname>khkramer</nickname>
<nickname>smooth_operator</nickname>
</User>

To add one or more new nickname elements to this Doc, we can use one of the group helper methods: elements_group_add(). The first argument to elements_group_add() is the name of the element(s) we'll be adding; the remaining arguments specify the content for each new element.

# add two nicknames
$doc->elements_group_add ( 'nickname', 'Sneezy', 'Forgetful' );

# note: the above statement is equivalent to the following two lines of code:
$doc->add_element('nickname')->set ( 'Sneezy' );
$doc->add_element('nickname')->set ( 'Forgetful' );

The opposite function, deleting particular elements from a group, is handled by the elements_group_delete() method. Again, the first argument supplies a name and the remainder of the arguments specify content strings. If the content of an element matches one of the supplied strings, that element will be deleted. (Any strings that are not matched will be ignored.) If elements_group_delete is only given the first, name, argument, then all elements in the group are deleted. This provides a convenient idiom for clearing and re-setting an elements group.

# remove the nicknames we just added (wrong movie)
$doc->elements_group_delete ( 'nickname', 'Sneezy', 'Forgetful' );

# remove all the nicknames and replace them with a list of nicknames we
# get back from a couple of subroutine calls
my @new_nicknames = Television::Stooges::nicknames();
push @new_nicknames, Usenet::Rec::Humor::Stooges::FanFiction::nicknames();
$doc->elements_group_delete ( 'nickname' );
$doc->elements_group_add_uniq ( 'nickname', @new_nicknames );

To query a group for the presence of a particular piece of content, use the elements_group_lists() method. This method expects two arguments: name and content.

# check that we really removed the Snow White stuff
print "no more dwarves"
  if  ! $doc->elements_group_lists('nickname', 'Sneezy')  and
      ! $doc->elements_group_lists('nickname', 'Forgetful');

To slurp the contents of the group's elements into a list, use elements_group_get(). As is the case with most of the nested element "plural" methods, elements_group_get() returns an array in a list context and an array reference in a scalar context.

# get all of the nicknames
my @nicknames = $doc->elements_group_get ( 'nickname' );
# get the last nickname
my $last_nickname = $doc->elements_group_get('nickname')->[-1];

Finally, elements_group_add_uniq() works like elements_group_add() except that it ignores duplicates. If we always use elements_group_add_uniq() to add to the nicknames list we will never list a nickname twice.

# add a new nickname
$doc->elements_group_add_uniq ( 'nickname', 'Bashful' );
# add several more nicknames, skipping 'Bashful' because it's already present
$doc->elements_group_add_uniq ( 'nickname', 'Dopey', 'Bashful', 'Doc' );
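The semantics of the group helpers can be modelled on a plain array of strings. The functions group_add_uniq() and group_delete() below are hypothetical stand-ins written for this illustration; they mimic the behavior described above but are not Comma's code.

```perl
use strict;
use warnings;

# Append each new string unless it is already present in the group.
sub group_add_uniq {
    my ( $group, @new ) = @_;
    my %seen = map { $_ => 1 } @{$group};
    for my $content (@new) {
        push @{$group}, $content unless $seen{$content}++;
    }
    return;
}

# Delete entries matching any target; with no targets, clear the group.
# Targets that match nothing are silently ignored.
sub group_delete {
    my ( $group, @targets ) = @_;
    if ( !@targets ) { @{$group} = (); return; }
    my %drop = map { $_ => 1 } @targets;
    @{$group} = grep { !$drop{$_} } @{$group};
    return;
}

my @nicknames = ( 'Junior', 'khkramer' );
group_add_uniq( \@nicknames, 'Dopey', 'Junior', 'Doc', 'Dopey' );
my $after_add = join ' ', @nicknames;       # Junior khkramer Dopey Doc

group_delete( \@nicknames, 'Junior', 'Sneezy' );
my $after_delete = join ' ', @nicknames;    # khkramer Dopey Doc

group_delete( \@nicknames );                # name only: clear everything
my $after_clear = scalar @nicknames;        # 0

print "$after_add\n$after_delete\n$after_clear\n";
```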

Whitespace: Ignored and Trimmed

XML-based systems must define how they treat whitespace. HTML, for example, treats all occurrences of whitespace as equivalent. With the exception of content inside a pre tag, which is preserved exactly as supplied, there is no difference between a single space and a boatload of carriage returns.

Comma treats whitespace surrounding its tags as non-meaningful, stripping it all out. The following Docs are exactly the same:

<!-- Two equivalent Docs -->

<User>
<username> kwindla </username>
<full_name> Kwindla Hultman Kramer </full_name>
</User>

<User><username>kwindla</username><full_name>Kwindla Hultman Kramer</full_name></User>

Comma's stripping of tag-adjacent whitespace has a very important corollary: whitespace is trimmed from the beginning and end of all element content. So the two set() statements below are equivalent, and the string comparison will always be false:

# set the username
$doc->element('username')->set ( 'kwindla' );
# set the username to the same thing -- whitespace is "trimmed"
$doc->element('username')->set ( ' kwindla ' );

# because the whitespace is gone, this can *never* be true
my $matched  =  $doc->element('username')->get  eq  ' kwindla ';

Of course, the auto-trimming only applies to tags defined in Comma document definitions. It is often convenient to embed XML-marked-up text in a Comma element as "flat" content -- an element that stores an HTML snippet, for example, will include XML tags that have no "meaning" to Comma. Element content is always preserved verbatim (after whitespace is trimmed from the very beginning and very end) by the system; any XML-like strings inside element content are treated exactly like all other text.
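The trimming rule described in this section amounts to removing leading and trailing whitespace while leaving everything in between alone. A minimal sketch, using a hypothetical trim_content() function rather than anything from Comma itself:

```perl
use strict;
use warnings;

# Strip whitespace from the very beginning and very end only.
sub trim_content {
    my ($content) = @_;
    $content =~ s/^\s+//;
    $content =~ s/\s+$//;
    return $content;
}

my $plain  = trim_content("  kwindla \n");
my $markup = trim_content(' <b>bold</b>  text ');

print "[$plain]\n";     # [kwindla]
print "[$markup]\n";    # [<b>bold</b>  text]
```

The second call shows that interior whitespace and embedded markup pass through untouched.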

XML Escape/Unescape

Every Comma Doc is a syntactically-legal XML document. All tags must be properly balanced and nested, and bare ampersands, left brackets and right brackets must be properly escaped. Elements that contain XML-like tags or markup characters as part of their content will need to take special action to ensure that proper formatting, escaping or CDATA wrapping happens.

Let's add a bio element to our User Def, and discuss some of the issues involved in storing HTML as element content.

<!-- new 'bio' element: holds a chunk of HTML text -->
<element><name>bio</name></element>

<User>
<username> kwindla </username>
<full_name> Kwindla Hultman Kramer </full_name>
<bio> Kwin is a programmer who likes <a href="http://use.perl.org">Perl</a>
and <a href="http://www.motorola.com/mcu">6812</a>
assembly language. </bio>
</User>

The above Doc is perfectly fine. Because the two a tags are balanced, the parser has no problem reading in the Doc. After parsing is finished the content of the bio element is treated just like any other "flat" piece of content.

We will run into problems, however, if we're not extremely careful about the HTML we try to store in the bio element. For example, HTML includes a number of "empty" tags that are usually used in a non-balanced fashion -- img and br, for example. Unless we force the use of XHTML syntax, which mandates XML-compatible tag usage, we'll need to either escape all mark-up characters or wrap content in a CDATA section.

The utility methods XML_basic_escape and XML_basic_unescape handle simple escaping and unescaping of markup characters.

use XML::Comma::Util qw ( XML_basic_escape XML_basic_unescape );
# escape a string, then reverse the operation
my $escaped = XML_basic_escape ( '<img src="picture.png">' );
my $unescaped = XML_basic_unescape ( $escaped );
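To show what escaping markup characters involves, here is a self-contained sketch. The functions basic_escape() and basic_unescape() are hypothetical and only approximate the idea; the exact set of characters handled by XML_basic_escape and XML_basic_unescape is not specified here, so treat the entity list as an assumption.

```perl
use strict;
use warnings;

# Escape the ampersand first so that the entities produced for the
# angle brackets are not escaped a second time.
sub basic_escape {
    my ($str) = @_;
    $str =~ s/&/&amp;/g;
    $str =~ s/</&lt;/g;
    $str =~ s/>/&gt;/g;
    return $str;
}

# Reverse the steps, restoring the ampersand last.
sub basic_unescape {
    my ($str) = @_;
    $str =~ s/&lt;/</g;
    $str =~ s/&gt;/>/g;
    $str =~ s/&amp;/&/g;
    return $str;
}

my $html      = '<img src="a.png"> & more';
my $escaped   = basic_escape($html);
my $roundtrip = basic_unescape($escaped);

print "$escaped\n";      # &lt;img src="a.png"&gt; &amp; more
print "$roundtrip\n";    # <img src="a.png"> & more
```

The ordering matters: escaping the ampersand last, or unescaping it first, would corrupt strings that already contain entity-like text.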

The set() and get() methods provide a means to escape and unescape strings during get and set operations. If set() is called with additional arguments following the content arg, they are interpreted as parameters that affect how the set is performed. The argument escape=>1 forces the content string to be escaped before other pieces of the set routine -- validation, etc. -- go to work. Similarly, calling get() with the parameterized arg unescape=>1 unescapes the content string before it is returned.

# safe set()
$doc->element('bio')->set ( $html_stuff, escape=>1 );

# get() bio content in a string that we can incorporate directly into
# a web page
$doc->element('bio')->get ( unescape=>1 );

Our other option, as mentioned above, is to "wrap" the bio element's content in an XML CDATA section. The CDATA envelope forces an XML parser to treat the characters inside it as plain text. Comma allows an element to be flagged as CDATA-fied, meaning that on output the entire contents will be wrapped in a CDATA section. Comma treats this CDATA facility as high-impact and coarse-grained. As a result the declaration is a one-way street: once a CDATA element, always a CDATA element. The cdata_wrap() method flips the switch, so to speak.

# configure the bio element so that it always CDATA-wraps its content
$doc->element('bio')->cdata_wrap();
# now we can set() with impunity
$doc->element('bio')->set ( $messy_html );

The to_string() method on the CDATA-set element will produce output that looks something like this:

<bio><![CDATA[Kwin is a programmer who likes <a href="http://use.perl.org">Perl</a>
and <a href="http://www.motorola.com/mcu">6812</a>
assembly language.]]></bio>

Flexible and Automatic Escape/Unescape

Escaping and unescaping element content is common enough that each Element in a Def can configure four things:

  1. The code that performs the escape operation
  2. The code that performs the unescape operation
  3. Whether to automatically escape element content on a set()
  4. Whether to automatically unescape element content on a get()

Here is a (silly) example of a custom escape/unescape pair as part of an Element's definition:

<element>
  <name>Xs_are_dangerous</name>
  <escapes>
    <escape_code> 
      sub { my $str=shift; $str =~ s:X:--x--:g; return $str; }
    </escape_code>
    <unescape_code>
      sub { my $str=shift; $str =~ s:--x--:X:g; return $str; }
    </unescape_code>
    <auto>1</auto>
  </escapes>
</element>

Within the escapes section, escape_code specifies some code that performs the escape, and unescape_code specifies some code that performs the unescape. They default, respectively, to:

   \&XML::Comma::Util::XML_basic_escape
   \&XML::Comma::Util::XML_basic_unescape

The auto element controls behaviors 3 and 4, from the list above. The content of auto is eval'ed at Def load time, and if auto contains a scalar value, that value sets the default for both escaping and unescaping. If auto contains a listref, the first value in the list controls escaping, and the second unescaping. auto defaults to "0".

In the example above, auto is "1", so content is silently escaped by the element's set() method and silently unescaped by its get() method. Of course, explicitly passing escape=>0 to set() or unescape=>0 to get() overrides this behavior:

# if $el is an Xs_are_dangerous element...

# set $el content to "TE--x-- ME--x--"
$el->set ( "TEX MEX" );

# get back our string TEX MEX
$str = $el->get();

# get back the literal "TE--x-- ME--x--" stored in $el
$str = $el->get ( unescape => 0 );

# set $el content to literal "TEX MEX" -- no escape
$el->set ( "TEX MEX", escape => 0 );

Three more element Def examples:

<element>
  <name>all_basic_escaped</name>
  <escapes><auto>1</auto></escapes>
</element>

<element>
  <name>esc_basic_escaped</name>
  <escapes><auto>[1,0]</auto></escapes>
</element>

<element>
  <name>unesc_basic_escaped</name>
  <escapes><auto>[0,1]</auto></escapes>
</element>
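
The scalar-or-listref rule for auto can be sketched in a few lines of Perl. This is only an illustration of the rule as described above; resolve_auto is a hypothetical helper, not part of Comma.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an already-evaluated <auto> value into a pair
# of flags, ( auto_escape, auto_unescape ).
sub resolve_auto {
  my $auto = shift;
  return ( 0, 0 ) unless defined $auto;     # auto defaults to "0"
  if ( ref($auto) eq 'ARRAY' ) {            # listref: [ escape, unescape ]
    return ( $auto->[0] ? 1 : 0, $auto->[1] ? 1 : 0 );
  }
  my $flag = $auto ? 1 : 0;                 # scalar: sets both behaviors
  return ( $flag, $flag );
}

foreach my $case ( [ 'all_basic_escaped',   1     ],
                   [ 'esc_basic_escaped',   [1,0] ],
                   [ 'unesc_basic_escaped', [0,1] ] ) {
  my ( $name, $auto ) = @$case;
  my ( $esc, $unesc ) = resolve_auto ( $auto );
  print "$name: escape on set()=$esc, unescape on get()=$unesc\n";
}
```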

Automatic Content: <default>

It is often useful to define default content for a class of elements, content that get() will return for any instance of an element that doesn't have content of its own. We can amend the definition of the bio element (defined in the previous section) to provide a standard "no information available" string if a User Doc doesn't include a bio.

<element>
  <name>bio</name>
  <default>No bio information available.</default>
</element>

# set() bio information
$doc->element('bio')->set ( 'Kwindla is a programmer' );

# get() will return our new bio -- this prints out 'Kwindla is a programmer';
print $doc->element('bio')->get();

# "clear" bio content by passing set() an undef argument
$doc->element('bio')->set();

# now get() will return our default string -- 'No bio information available'
print $doc->element('bio')->get();

As the above code demonstrates, calling set() with an undefined value as its content argument (which passing no arguments does implicitly) "clears" the content of an element, and any subsequent get() calls will again return the default string. Note that only an undef argument will clear an element's content; in particular, an empty string is perfectly valid as content and a get() on an element with an empty string as its content will happily return that empty string.

It is sometimes important to differentiate between an element that doesn't have any content and an element that has the same content as its Def's default string. The get_without_default() method returns an element's content exactly as is, without falling back to any default value that may be defined. Unlike get(), which returns an empty string if there is neither element content nor Def default, get_without_default() returns undef if an element has no content at all.
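
The three cases (no content, empty-string content, and cleared content) are easy to confuse, so here is a toy model of the rules just described. ToyElement is a hypothetical class written for this illustration; it mimics the documented behavior of get(), set() and get_without_default() and is not Comma code.

```perl
use strict;
use warnings;

# Toy model only; ToyElement is hypothetical, not a Comma class.
package ToyElement;

sub new {
  my ( $class, %arg ) = @_;
  return bless { default => $arg{default}, content => undef }, $class;
}

# set() with no argument (or undef) clears the content
sub set { my $self = shift; $self->{content} = shift; return $self; }

# content exactly as is: undef when there is none
sub get_without_default { return $_[0]->{content}; }

# content, else the default, else the empty string
sub get {
  my $self = shift;
  return $self->{content} if defined $self->{content};
  return defined $self->{default} ? $self->{default} : '';
}

package main;

my $bio = ToyElement->new ( default => 'No bio information available.' );
print "no content yet\n" unless defined ( $bio->get_without_default() );

$bio->set ( '' );          # an empty string is real content
print "get() returns '", $bio->get(), "'\n";

$bio->set();               # clear: back to the default
print $bio->get(), "\n";
```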

Storing Dynamic Information in Defs: pnotes

Document definitions are static constructs. However, it can be useful to tie some dynamic bits of information -- status or state flags, simple lookup tables and the like -- to a def. It can also be useful to have simple access to a perl-level hash that can store arbitrary references.

To enable a Def to "hold" some long-lived bits of dynamic information, each def exposes a unique pnotes hash, available to any piece of code in the system. (Comma borrowed the idea for, and the name of, the pnotes hash from Apache.)

# a bit of pnotes manipulation

my $def = XML::Comma::Def->read ( name=>'some_docdef' );
$def->def_pnotes()->{'foo'} = 'bar';

# prints out 'Foo from def: bar'
print "Foo from def: " . $def->def_pnotes()->{'foo'} . "\n";

my $doc = XML::Comma::Doc->new ( type => 'some_docdef' );

# prints out 'Foo from doc: bar'
print "Foo from doc: " . $doc->def_pnotes()->{'foo'} . "\n";

# prints out 'Foo from pathname: bar'
print "Foo from pathname: " . XML::Comma->pnotes('some_docdef')->{'foo'} . "\n";

# and every element down a def's tree has its own pnotes, too
XML::Comma->pnotes('some_docdef:nested_element:another_element')->{'test'} = 'Ok';
print "Ok down longer pathname: " . XML::Comma->pnotes('some_docdef:nested_element:another_element')->{'test'} . "\n";

There are three new methods here. Each element exposes a def_pnotes() method, which returns a reference to that element's def's pnotes hash. Each def also exposes a def_pnotes() method, which returns a reference to its own pnotes hash. The two methods are "different but the same" -- for convenience, you can call def_pnotes() on an element or on that element's def and get back the same hash reference.

The third new method is the system call XML::Comma->pnotes(), called in the last lines of the example above, which takes a pathname and returns that def path's pnotes hash.

Not just for Defs: pnotes for Elements

Sometimes you need to store bits of perl-level data that are specific to a particular Doc, rather than to a Def. You could always write a closure-ish method that creates persistent variables, but Comma provides a simple, Element-bound pnotes hash as an alternative.

Here's the workhorse method from the MailMessageReader input/output filter, which uses Mark Overmeer's Mail::Message module to parse an internet email message and create a Doc. The code sticks the Mail::Message object into the doc's pnotes hash, for possible later use.

sub input {
  my $msg = Mail::Message->read ( $_[1] );
  my $doc = XML::Comma::Doc->new ( type => $_[0]->{_doctype} );

  $doc->message_id ( get_message_id($msg) );
  $doc->subject    ( $msg->get ('Subject') );
  $doc->from       ( $msg->get ('From') );
  $doc->to         ( $msg->get ('To') );

  my $date = $msg->get( 'Date' );
  if ( $date ) {
    my $unix_time = Date::Parse::str2time ( $date );
    $doc->date ( $date );
    $doc->date_utime ( $unix_time );
  }

  foreach ( get_references($msg) ) {
    $doc->add_element('reference')->set ( $_ );
  }

  foreach ( get_parts_content_types($msg) ) {
    $doc->add_element('part_content_type')->set ( $_ );
  }

  my $plain_part = get_plain_part ( $msg );
  my $body = autoformat $plain_part->decoded if  $plain_part;
  $doc->body ( $body )  if  $body;
  $doc->pnotes()->{mail_message_object} = $msg;

  return $doc;
}

Storage and Retrieval

Manipulating Docs in memory is only a small part of the story. We need a way to store Docs in permanent collections, a way to retrieve these permanently stored Docs, and a way to manipulate the collections themselves.

The Store Definition

Let's introduce a new section to the User Def: store.

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
  </nested_element>

  <plural>'address'</plural>

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_file</location>
  </store>
</DocumentDefinition>

This is the simplest possible store specification: we supply a name, a base directory and a location.

The name element specifies how we'll refer to this particular store. As with elements, we can specify more than one store, so we need names to differentiate one from 'nother. We've called this particular store main.

The base element supplies a directory, underneath the document root, where we're going to put the Docs that we're storing. For this store, since the base is comma_guide, all of the storage will take place in <document_root>/comma_guide/.

The location element specifies how Docs will be stored within the base context. In this case we're storing Docs in a series of sequentially-numbered files.

Two Methods: store() and retrieve()

With this definition of our main store in place, we're ready to store and retrieve User documents.

# make a new Doc, so we have something to store.
my $doc = XML::Comma::Doc->new ( type=>'User' );
$doc->element('username')->set ( 'kwindla' );
$doc->element('email')->set ( 'kwindla@xymbollab.com' );
# write this Doc out to the "main" permanent store
my $key = $doc->store ( store => 'main' )->doc_key();
# now read the Doc back in, manipulate it, and store it back out to the same place
my $d2 = XML::Comma::Doc->retrieve ( $key );
$d2->element('full_name')->set ( 'Kwindla Hultman Kramer' );
$d2->store();

There are three new methods here -- store(), retrieve(), and doc_key().

The store() method writes a Doc out to permanent storage. A store => <name> argument must be supplied the first time the method is called on a new Doc, to specify which of the stores in the Def will be used. The store() method re-returns a reference to the Doc, so that you can chain method calls together easily. The doc_key method returns a unique, long-term identifier for the stored Doc.

The retrieve() method fetches a Doc out of storage, and expects to be supplied a document key as its argument.

Where Are the Files?

It's worth looking at the files that store() writes out. If you ran the above bit of code, you should be able to look in your document root and see a directory named comma_guide. In that directory, there should be a file named 0001. (And if you ran the code multiple times, also 0002, 0003, etc.) The contents of these files should look familiar: the text in them was produced by an internal call to to_string(). We can compare the output from a to_string() call with the contents of a store file, to confirm this:

my $store = XML::Comma::Def->read(name=>'User')->get_store('main');
my $doc = XML::Comma::Doc->retrieve ( type => 'User',
                                      store => 'main',
                                      id => $store->first_id() );
# print out the doc with a to_string()
print "doc retrieved...\n";
print "  key: " . $doc->doc_key() . "\n";
print "  from to_string()...\n";
print "----\n";
print $doc->to_string();
print "----\n";
# cat the file that we got the doc from
print "  from file: " . $doc->doc_location() . "\n";
open ( my $fh, '<', $doc->doc_location() )
  or die "can't open " . $doc->doc_location() . ": $!\n";
my @lines = <$fh>;
close ( $fh );
print "----\n";
print @lines;
print "----\n";

We've snuck several things into the above example.

In the first line we read() the User Def. This is the Def that we've been adding to as we go along in this chapter, but here we're going to be querying it programmatically, rather than editing it as a text file. Def->read() gives us a reference to the Def object, upon which we immediately call get_store() to get a reference to our main store. We use that to get the id of the first document we stored in main, whatever and whenever that was. A document id, as you might guess, is one of the parts that makes up a document key. (The other mandatory parts are a document type and a store name.) As you can see, retrieve() is flexible: it accepts a single argument and interprets that as a key (as in the previous example); it is also happy to accept separate, parameterized arguments supplying a type, store name and Doc id, which is what we've done here.

Again, we see the doc_key() method, which returns this Doc's key, and a new method, doc_location(), which returns the underlying file that this Doc was fetched from. It is worth noting that doc_location() is rarely used in the course of "normal" Doc manipulation, because Comma handles all of the underlying filesystem tasks that are part of ordinary storage, retrieval and the like.

There are other "doc_foo()" methods, including doc_store() which returns a reference to the store that was used to fetch or store the Doc, and doc_id(), which returns the Doc's id. It is an error to call any of the doc_foo() methods on a newly-created Doc that has not yet been stored.

Multiple Users and Processes: Permissions and Locking

Access permissions are an important part of any multi-user system. XML::Comma uses the underlying filesystem to provide basic permissions facilities. The store definition may include a file_permissions element, which sets the rwx permissions on any stored files. Here is our main store with a new line that makes these files world-readable but writable only by their owner:

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_file</location>
    <file_permissions>644</file_permissions>
  </store>

The 644 specification is suitable for a system in which all User editing is done by processes running as a single user, but in which many users might need to run processes that need read-access to User information. It is actually more common for a group of users to need write access to a Doc collection; for that reason the default value of the file_permissions element -- the value that is used by the system if no specifier is given -- is 664.

Because Comma depends on the filesystem to manage permissions, you will need to understand how the filesystem determines and applies permissions information to/for individual files in order to set up complicated scenarios. Remember that Comma code always runs as part of some particular process, under the ownership of a specific user.

Permissions restrictions address issues of information ownership and security. File permissions discriminate among multiple users of a system. An even more fundamental set of problems is posed by the multi-process nature of the systems on which Comma runs. We must be able to lock Docs so that concurrent processes do not simultaneously attempt to modify a file.

The retrieve() method automatically acquires a lock on the requested Doc. As long as this lock is held, the Doc cannot be retrieved again. The store() method automatically unlocks the stored Doc.

Because of the automatic locking, retrieve() is a relatively heavy-weight method. In addition, if retrieve() cannot immediately acquire its lock, it waits -- re-trying periodically -- until it finally can. The retrieve() method should therefore be used carefully, with the time that a Doc is held open kept as short as possible. (An optional argument to retrieve, timeout=><seconds> is also available. With a timeout specified, retrieve() will throw an error if it is unable to acquire its lock within the given number of seconds.)

The read() method is an alternative to retrieve(), for situations in which a Doc will be read but not modified. In fact, in most applications, read() is by far the most common access method. Because read() does not need to acquire a lock, it is somewhat faster than retrieve(). The two methods take the same arguments.

There is one other method in the retrieve family: retrieve_no_wait(). This method is exactly like retrieve(), except that if it fails to immediately acquire a lock it returns undef, rather than blocking. Programmers with extensive experience designing multi-threaded/concurrent systems will find uses for this method: other programmers will find abuses. In general, if you can't describe in exact and minute detail why you are using retrieve_no_wait(), you shouldn't be.
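
The difference between waiting, timing out and not waiting at all can be sketched with plain flock() calls on an ordinary file. This is a generic illustration of the pattern, not a description of how Comma implements its locks: lock_with_timeout is a hypothetical helper, the one-second polling interval is an arbitrary choice, and (unlike retrieve() with a timeout, which throws an error) the helper simply returns nothing when it gives up.

```perl
use strict;
use warnings;
use Fcntl qw( :flock );
use File::Temp qw( tempfile );

# Hypothetical helper: try to take an exclusive lock on $path, polling
# until $timeout seconds have passed. A timeout of 0 means "try once",
# which is the no-wait case. Returns the locked filehandle, or nothing
# if the lock could not be acquired in time.
sub lock_with_timeout {
  my ( $path, $timeout ) = @_;
  open ( my $fh, '+<', $path ) or die "can't open $path: $!\n";
  my $deadline = time() + $timeout;
  while ( 1 ) {
    return $fh if flock ( $fh, LOCK_EX | LOCK_NB );  # got it
    last if time() >= $deadline;                     # give up
    sleep ( 1 );                                     # wait, then re-try
  }
  close ( $fh );
  return;
}

# a scratch file to lock, removed automatically at exit
my ( $tmp_fh, $path ) = tempfile ( UNLINK => 1 );
close ( $tmp_fh );

my $locked = lock_with_timeout ( $path, 5 );
print "lock acquired\n" if $locked;

# hold the lock for as short a time as possible, then release it
flock ( $locked, LOCK_UN );
close ( $locked );
```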

As the necessary complement to retrieve(), store() must unlock objects as they are written out to permanent storage so that other users of the system will be able to fetch them. After storage, a Doc object becomes read-only, as if it had been opened with read().

It is possible to store() a Doc without unlocking it (useful, for example, to write out intermediate changes as part of a series of operations). The keep_open=><true> argument specifies that the lock be retained. (Conversely, a Doc that has been opened read-only can be locked with the get_lock() or get_lock_no_wait() methods.)

Finally, the methods erase(), copy() and move() perform the operations that their names suggest:

# retrieve and then erase a Doc
my $doc = XML::Comma::Doc->retrieve ( $key_a );
$doc->erase();
# retrieve and then move a Doc
$doc = XML::Comma::Doc->retrieve ( $key_b );
$doc->move ( store=>'other_store' );
# read and copy a Doc (we're not modifying the original, so it's
# okay to read() instead of retrieve())
$doc = XML::Comma::Doc->read ( $key_c );
$doc->copy ( store=>'other_store' );

As a side note, copy() and move() accept the same arguments as store(), including keep_open=><true>, and you should always supply a store=><name> when copying and moving -- the normal use of these methods is to transfer a Doc from one store to another. (Confusingly, in this normal case, copy() is really just a synonym for store(); calling store() with a new store=><name> specifier effectively performs a copy. The only case in which the actual copy() method is uniquely required is the copying of a Doc within the same store.)

Iterating Over Stored Docs

It is often necessary to process some or all of the Docs in a store. Methods exist to fetch the first and last ids in a store and, given an id, to fetch the ids before and after it. In one of the examples above we retrieved the first Doc in the main store. We'll begin with that same code, and then go on to iterate through all of the Docs in the store.

my $store = XML::Comma::Def->read(name=>'User')->get_store('main');
my $doc = XML::Comma::Doc->retrieve ( type => 'User',
                                      store => 'main',
                                      id => $store->first_id() );
print "first doc: " . $doc->doc_key() . "\n";
while ( my $id = $store->next_id($doc->doc_id()) ) {
  $doc = XML::Comma::Doc->retrieve ( type => 'User',
                                     store => 'main',
                                     id => $id );
  print "next  doc: " . $doc->doc_key() . "\n";
}

This code uses the store's first_id() and next_id() methods. To iterate in the other direction, we could substitute last_id() and previous_id().

The prev_ and next_ methods are fine for fetching a few docs, but for sizable loops they are a little clumsy and a lot slow. An iterator provides a means by which to apply repetitive operations to a set of stored documents quickly and easily.

# basic iterator -- start from the end and work backwards
my $iterator = $store->iterator();
while ( my $doc = $iterator->prev_read() ) {
  print "working on doc: " . $doc->doc_id() . "\n";
}

# with some additional parameters -- start from the beginning and
# limit the set to the first 500 docs
$iterator = $store->iterator ( size=>500, pos=>'-' );
while ( my $doc = $iterator->next_read() ) {
  print "working on doc: " . $doc->doc_id() . "\n";
}

An iterator is obtained by calling the store's iterator() method. By default, iterator() provides access to all of the store's documents, starting with the last doc. (This is the default because iterating backwards over recently-stored docs is a fairly common thing to do.) Two arguments to iterator() modify this default behavior: size=> limits the size of the iterator's result set, and pos=> specifies whether the iterator is initially set to point at the end or at the beginning of the set -- '+' specifies the end (and is equivalent to the default of not specifying a pos), and '-' specifies the beginning. The size=> argument can only be used to pick out the first or last n documents. There is no way to pull a subset of documents out of the "middle" of a store. When used with pos=>'-', the size specifier will select documents from the beginning of the store, and when a pos=> argument is not given (or when pos=>'+' is specified), the size specifier will select documents from the end of the store.

The basic iterator methods are next_id(), prev_id(), next_read(), prev_read(), next_retrieve(), and prev_retrieve(). The names are pretty self-explanatory. Each of these methods returns an id or doc, as the case may be, unless the iterator has passed the beginning or end of its collection, in which case the method returns undef. The six methods can be called in any combination and in any order. (Criticism-inclined readers may, at this point, be thinking that "iterator" is a poor name for this class, given that it is possible to move across the set in any order and backwards and forwards. Those readers are probably correct.)

Four more methods are defined for advanced mucking around with an iterator. These methods should be wielded with caution, as they are not usually needed and they don't do any error or sanity checking. The length() method returns the size of the iterator's document set; the index() method gives the position of the current pointer into that document set; the inc() method moves the pointer a relative amount -- with no argument inc() adds one to the pointer, given an argument it adds that value to the pointer (-1 is a common argument); and the set() method sets the pointer to an absolute index value -- so $iterator->set($iterator->length()) would reset an iterator such that the next call to prev_id() will fetch the last id in the set.

Location Chains

So far, our storage definition for main has used only a single location element. We saw above that specifying Sequential_file governed the "file" portion of the storage location. To understand how to create more complex storage patterns, we need to understand how multiple location specifiers can be "chained" together.

A filesystem is a hierarchical store: directories contain files and directories, which contain more files and directories, which contain more files and directories, ad infinitum. Each time a Doc is stored, Comma uses the document_root, the base specifier and the location elements in a storage definition to build a "location chain" that determines where in the filesystem to save the written-out Doc. For our main store, the chain looks like this:

document root                  base          location
XML::Comma->document_root()    comma_guide   <location>Sequential_file</location>

There are other location specifiers besides Sequential_file. Some of these are designed to be used in pairs or groups, so that several location specifiers can be combined as part of a chain. One of these "intermediate" specifiers is Sequential_dir, which is similar to Sequential_file except that it determines an intermediate directory in the location chain rather than a final file. Here is our store definition with a new addition:

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_dir</location>
    <location>Sequential_file</location>
  </store>

The first file stored by this store will be located at:

<document_root>/comma_guide/0001/0001

We've added a directory level to the chain; the first 0001 comes from the Sequential_dir, the second from the Sequential_file. One effect of this addition is to increase the capacity of the store. We're limited to 9999 files per directory, so before we could store a maximum of 9999 Docs and now we can store up to 9999 * 9999, or 99,980,001. And we can add as many Sequential_dirs to the chain as we like, increasing the number of directories in the resulting storage locations.

Location specifiers often accept arguments that further determine how they behave in the chain. Sequential_file recognizes two arguments, and Sequential_dir recognizes one. Here is another modified version of our storage definition:

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_dir:max,10</location>
    <location>Sequential_file:max,99,extension,'.xml'</location>
  </store>

Now each of the location specifiers has an arguments list attached. A colon separates the specifier name from the arguments, and the arguments themselves take the form of a Perl list, which will be turned into a hash of key/value pairs when the definition is loaded.

The first argument is common to both declarations: max specifies the maximum number of files that will be allowed in this part of the chain. (When we stated above that we were limited to 9999 files, we were referring to the default value of the max argument. If we had wanted to square the capacity of the storage without adding an intermediate directory, we could have simply specified max,99_980_001 as an argument to Sequential_file. Doing so has a serious drawback, however; finding, creating and deleting files gets progressively slower as the number of files in a directory climbs.)

Sequential_file's second argument, extension, provides an extension to be tacked onto the end of every Doc's storage file. This can be useful if, for example, other tools for managing or manipulating files will co-exist with XML::Comma in a given application. With our most recent definition, the first and last files in the store would have the following locations:

<document_root>/comma_guide/01/01.xml
<document_root>/comma_guide/10/99.xml
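
The arithmetic behind those two locations can be sketched as follows. The sketch rests on assumptions that go beyond what is stated above: that files fill one directory completely before the next directory is started, and that each level is zero-padded to the width of its max value. chain_path is a hypothetical helper written for this illustration, not a Comma function.

```perl
use strict;
use warnings;

# Hypothetical helper: map the n-th stored Doc (counting from 1) onto a
# two-level Sequential_dir / Sequential_file chain.
sub chain_path {
  my ( $n, $dir_max, $file_max, $extension ) = @_;
  die "store is full\n" if $n < 1 or $n > $dir_max * $file_max;
  my $dir  = int ( ($n - 1) / $file_max ) + 1;   # which directory
  my $file = ( ($n - 1) % $file_max ) + 1;       # which file inside it
  # pad each level to the number of digits in its max
  return sprintf ( "%0*d/%0*d%s",
                   length($dir_max),  $dir,
                   length($file_max), $file,
                   $extension );
}

print chain_path ( 1,   10, 99, '.xml' ), "\n";   # 01/01.xml
print chain_path ( 990, 10, 99, '.xml' ), "\n";   # 10/99.xml
print "capacity: ", 10 * 99, " Docs\n";           # capacity: 990 Docs
```

With the default max of 9999 at both levels and no extension, the same helper gives 0001/0001 for the first Doc, matching the earlier two-level example.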

The Storage in More Detail section provides additional information about storage definitions, including documentation for all of the standard location modules.

Validation, Macros and Hooks

Document Definitions describe and constrain the basic structure of the documents that we can produce. For example, an attempt to make use of an element that isn't specified in a document's Def generates an error. This section describes Comma's mechanisms for "validating" the structure of documents and the content of elements.

Document Structure: Required Elements and validate()

Section Three introduced the plural tag. This tag determines which elements may exist multiple times in the given container. Another container-level tag is required, which specifies that a container must include at least one of each of the specified elements. Here is our User Def with a new validity constraint:

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
  </nested_element>

  <plural>'address'</plural>
  <required>qw( username email full_name )</required>

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_file</location>
  </store>
</DocumentDefinition>

To be "valid," a User Doc must now have content in its username, email and full_name elements. A document that is not valid cannot be stored -- the storage routines all call the method validate(), which throws an error if any required element is missing. The validate() method can also be called directly. It takes no arguments and returns the empty string; its only function is to throw an error if the Doc doesn't pass all validity tests. Here are two simple code snippets; for more information, see the section on errors and error handling:

# check whether a Doc passes validity tests
eval {
  $doc->validate();
}; if ( $@ ) {
  print "doc didn't validate: $@\n";
}

# the same idea, but during a store()
my $key;
eval {
  $key = $doc->store( store=>'main' )->doc_key();
}; if ( $@ ) {
  print "doc couldn't be stored: $@\n";
} else {
  print "doc was stored successfully: $key\n";
}

Our example use of required is not very complicated. As with all things to do with nested elements, required and validate() are just as applicable deep inside a nested structure as at the very top level. Any nested element can specify a required list, and can be checked with a call to validate(). More interestingly, calls to validate() automatically check the validity of all elements underneath the caller, so a Doc-level validity check walks the entire document tree. This is convenient and it makes good theoretical sense: no element can be valid that itself contains invalid parts.
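
The tree walk can be pictured with a small stand-alone model. The data layout here (plain hashes with name, required and children keys) and the function toy_validate are hypothetical, chosen only to show how a top-level check reaches required elements in nested containers; Comma's own validate() works on real Doc and Element objects.

```perl
use strict;
use warnings;

# Toy container layout (hypothetical):
#   { name => ..., required => [ ... ], children => { name => node_or_string } }
sub toy_validate {
  my ( $node, $path ) = @_;
  $path = defined $path ? "$path:$node->{name}" : $node->{name};

  # every required child must exist and, if it is a simple element,
  # must have some content
  foreach my $req ( @{ $node->{required} || [] } ) {
    my $child = $node->{children}->{$req};
    die "required element '$req' not found in $path\n"
      unless defined $child and ( ref($child) or length($child) );
  }

  # recurse into nested containers, so a top-level call walks the tree
  foreach my $child ( values %{ $node->{children} || {} } ) {
    toy_validate ( $child, $path ) if ref($child);
  }
  return '';
}

my $user = {
  name     => 'User',
  required => [ qw( username email ) ],
  children => {
    username => 'kwindla',
    email    => 'kwindla@xymbollab.com',
    address  => { name => 'address', required => [ 'city' ], children => {} },
  },
};

eval { toy_validate ( $user ); };
print "doc didn't validate: $@" if $@;
```

The User level passes its own required list, but the call still fails because the nested address container is missing its required city.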

Element Content: Macros and validate_content()

A container's validity is a function of the sub-elements that it contains. A simple element's validity is a function of its contents. A macro defines and limits the type of content that an element may have. Here is our User Def with macros added to its username and email definitions.

<DocumentDefinition>
  <name>User</name>

  <element>
    <name>username</name>
    <macro>length:min,4,max,20</macro>
  </element>

  <element>
    <name>email</name>
    <macro>email</macro>
  </element>

  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
  </nested_element>

  <plural>'address'</plural>
  <required>qw( username email full_name )</required>

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_file</location>
  </store>
</DocumentDefinition>

We can use the validate_content() method to check whether a string can be accepted as an element's content. The method takes a single argument -- the prospective content -- and throws an error if the content fails to pass the validity checks. It is not usually necessary to call validate_content() directly, because set() calls the method at the very beginning of its operation, before doing anything else. Here is a typical bit of error-checked set() code:

# modifying a User doc
eval {
  $doc->username ( $username );
  $doc->email ( $email );
  $doc->full_name ( $full_name );
}; if ( $@ ) {
  handle_content_error ( $@ );
}

The validate() method is also defined for non-nested elements. It is possible to use the unsafe append() method to construct invalid element content (and also possible to read invalid Docs out of storage). The validate() method checks an element's existing content for validity. Just as with nested elements, this method is called by all of the storage methods, so that an invalid document will not be written out to permanent storage.

As with storage location specifiers, the macro tag should contain a name followed by an optional argument list (with a colon in between). Different macros expect different numbers of arguments and different argument formats. The enum macro, for example, takes a list of strings that will be the only acceptable contents for the element being defined. Let's add a new subscription element to the User Def, indicating what level of service a user has paid for. (This time, we won't reproduce the whole Def, just the new element.)

<element>
  <name>subscription</name>
  <macro>enum: 'basic', 'premium', 'lifetime'</macro>
</element>

There are only four possible values for the content of the subscription element: undef, basic, premium, and lifetime. "Hmm, undef," you say, "I don't see undef in that list." Well, enum always includes undef as an implicit member of the possible-contents list. The reason for this will be clear after a little reflection: because Comma treats a content-less element as indistinguishable from an element that is not there at all, undef must be legal content for all elements. Making an empty element illegal is actually the same operation as making it required. If we want every User Doc to include subscription information, we can define the subscription element to be required:

<element>
  <name>subscription</name>
  <macro>enum: 'basic', 'premium', 'lifetime'</macro>
</element>

<required>'subscription'</required>
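The rule that enum enforces can be sketched in a few lines of plain Perl. This is only an illustration of the logic described above, not Comma's actual enum macro: the names make_enum_check and $check are made up for the example, and treating an empty string the same as undef is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Comma): build a checker that accepts
# undef/empty content or any one of the listed strings, and dies otherwise.
sub make_enum_check {
  my %allowed = map { $_ => 1 } @_;
  return sub {
    my $content = shift;
    # an empty element is treated like a missing one, so it always passes
    # (assumption: the empty string counts as "no content")
    return 1 if ! defined $content or $content eq '';
    die "'$content' is not one of: " . join ( ', ', sort keys %allowed ) . "\n"
      unless $allowed{$content};
    return 1;
  };
}

my $check = make_enum_check ( 'basic', 'premium', 'lifetime' );

$check->( 'premium' );            # passes
$check->( undef );                # passes: undef is always legal
eval { $check->( 'platinum' ) };  # dies, because 'platinum' is not listed
print "rejected: $@" if $@;
```

Only the combination with required makes an empty subscription illegal; the enum check on its own never complains about missing content.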

More Flexibility: Perl Hooks

The required and macro facilities that we've just seen are actually implemented using a finer-grained, more flexible tool: the hook. A hook is a piece of Perl code that will, under specific circumstances, be automatically called by the Comma system. Declaring an element as required actually installs a validate_hook -- the required tag is just a shortcut, provided because the facility is so important and so commonly used. The following two pieces of a hypothetical Def are exactly equivalent:

<!-- 1) a required tag specifying two element names-->
<required>'foo', 'bar'</required>

<!-- 2) the two validate_hooks that are actually installed
when the Def is parsed, one for 'foo' and one for 'bar' -->

<validate_hook>
<![CDATA[
  sub {
    my $self = shift();
    my $req_el = $self->elements('foo')->[0];
    die "required element 'foo' not found in " . $self->tag_up_path() . "\n" if
                    (! $req_el) or 
                    ((! $req_el->def()->is_nested()) and (! $req_el->get()));
  } 
]]>
</validate_hook>

<validate_hook>
<![CDATA[
  sub {
    my $self = shift();
    my $req_el = $self->elements('bar')->[0];
    die "required element 'bar' not found in " . $self->tag_up_path() . "\n" if
                    (! $req_el) or 
                    ((! $req_el->def()->is_nested()) and (! $req_el->get()));
  } 
]]>
</validate_hook>

This example demonstrates the common convention for writing hooks: most hooks are subroutines that are compiled into code references when the Def is loaded by the system; they can expect to be passed certain arguments when they are invoked; they should make use of the Comma API to do whatever work they need to do; and they should return appropriate values or throw errors, depending on what is expected of them.

We can go over the first of these hooks line by line. (The second hook is exactly the same, except that 'bar' is substituted for 'foo' in two places.) The first line is an opening CDATA tag. Perl snippets usually include characters that are illegal in XML -- the arrow operator is particularly common in this kind of code -- so wrapping the content in a CDATA section is a near necessity. The second line begins an anonymous subroutine declaration. The third line establishes a named variable, $self, which comes from the first argument to the sub. The next line fetches the first 'foo' element, if any, into $req_el -- the $req_el variable now holds either an element object or an undefined value. The final statement throws an error if either $req_el is not defined, or $req_el is a non-nested, empty element. (NOTE: FIXING the obvious bug here, real real soon.)

The required example demonstrates the use of a validate_hook in the "structural" context -- checking the sub-elements of a nested element. We can use the same approach to validate the contents of a non-nested element, but in this case we must expect two arguments: the element and the proposed new content. Imagine, if you will, an element, <delicate_sensibilities>, which must contain text that will not shock or offend children, great aunts and members of the clergy. Imagine, also, a hypothetical CPAN module Lingua::FCC_Check, which can check for words that are proscribed by the Federal Communications Commission from over-the-air broadcast in the United States. Here, then, is a definition for the <delicate_sensibilities> element:

<element>
  <name>delicate_sensibilities</name>
  <validate_hook>
  <![CDATA[
    use Lingua::FCC_Check;
    sub {
      die "unacceptable language detected for " . $_[0]->tag_up_path() . "\n"  if
        Lingua::FCC_Check::check ( $_[1] );
    }
  ]]>
  </validate_hook>
</element>

The only new thing in this example is the use statement that precedes the subroutine definition. We need to pull in the Lingua::FCC_Check module, so we do that just as we would in a stand-alone program.

To summarize, validate_hooks may be defined for both simple and nested elements and should take the form of anonymous subroutines. In the case of a nested element, the hook expects the element itself to be the sole argument. In the case of a non-nested element, the hook expects to be passed the element and a string containing the content to be checked. The validate() method calls any hooks that have been defined for an element, as does validate_content(). More hooks (called as part of storage, indexing, etc.) will be introduced as we go along, and documentation for all available hooks can be found in the hooks reference section.

Writing New Macros

Macros were designed as a way to extend the syntax of document definitions without modifying any of the Comma system code. When an element definition is loaded, any macros that it contains are given a chance to execute. Writing and installing macros is relatively easy. In general, macros work by installing hooks, so you've already seen most of what you need to know to create a new macro.

For a macro to be available to the system, the definition loader must be able to find it. The loader will look in the same places that it looks for defs (the list of directories in the defs_directory Comma variable), and it will look in files named macro.extension, where macro is the name of the macro and extension is the string defined by the macro_extension variable.

To turn the "FCC_Check" example from the previous section into a macro, we need to save a few lines of Perl code in a file that meets the above criteria (on my system, I'm using /comma/defs/macros/fcc_approved.macro). Here are the contents of the file:

# fcc_approved: a macro to check element content for blue language

use Lingua::FCC_Check;
$self->add_hook ( 'validate_hook',

  sub {
    die "unacceptable language detected for " . $_[0]->tag_up_path() . "\n"  if
        Lingua::FCC_Check::check ( $_[1] );
  }

);

The first line is just a comment that helps us remember what this code snippet does, if we run across it in an unexpected place. The use statement and the subroutine definition are familiar from the validation hook version of this code. The only new thing here is the add_hook() method. The syntax is a little hard to see, but add_hook() is quite simple: it expects a hook name as its first argument, and a subroutine reference or string (which will be eval'ed and must become a subroutine reference) as its second argument. The subroutine is installed as a hook of the requested type.

Turning the FCC check into a macro simplifies the definition of the delicate_sensibilities element considerably. Even more importantly, we can re-use this macro in any other Def on this system, and changes -- bug fixes, new additions to the FCC list -- will only need to be made to the macro, not to each element that defines the hook.

<!-- the new, improved delicate_sensibilities def -->
<element>
  <name>delicate_sensibilities</name>
  <macro>fcc_approved</macro>
</element>

The range macro (part of the standard Comma installation) provides a slightly more complex example. This macro is used to limit content to a range of numbers, for example between one and ten. As such, range requires two arguments: the first specifies the low end of the range and the second the high end.

# range macro: takes two arguments, low-end and high-end

my $low = $macro_args[0];
my $high = $macro_args[1];

$self->add_hook ( 'validate_hook',

  sub {
    my ( $doc, $content ) = @_;
    if ( $content < $low  or
         $content > $high ) {
      die "'$content' out of range ($low:$high)\n";
    };
  }

);

The only thing here that we haven't seen before is the pre-defined variable @macro_args. Like $self, the @macro_args array is filled with the appropriate values by the macro loader. Macros can do whatever they want with the arguments that are supplied to them. This macro simply makes use of the first two elements in the list as part of the hook subroutine. (It should actually probably do a little bit of pro-active error checking.) Here is how we might use the range macro:

<element>
  <name>one_to_ten</name>
  <macro>range:1,10</macro>
</element>
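The pro-active error checking mentioned above might look something like the following. This is a stand-alone sketch rather than a drop-in replacement for the range macro: make_range_hook is a made-up name, and letting empty content pass is an assumption that mirrors the way enum treats undef. Inside a real macro file the two arguments would come from @macro_args and the returned subroutine would be handed to add_hook().

```perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );

# Hypothetical stand-alone version of the range check. In a macro file,
# $low and $high would be $macro_args[0] and $macro_args[1].
sub make_range_hook {
  my ( $low, $high ) = @_;
  # complain when the def is loaded, not when the first Doc is validated
  die "range macro needs two numeric arguments\n"
    unless looks_like_number ( $low ) and looks_like_number ( $high );
  die "range macro: low end ($low) is greater than high end ($high)\n"
    if $low > $high;

  return sub {
    my ( $el, $content ) = @_;
    # assumption: empty content is legal unless the element is required
    return unless defined $content and $content ne '';
    die "'$content' is not a number\n" unless looks_like_number ( $content );
    die "'$content' out of range ($low:$high)\n"
      if $content < $low or $content > $high;
    return;
  };
}

my $hook = make_range_hook ( 1, 10 );
$hook->( undef, 7 );                       # in range, no error
eval { $hook->( undef, 11 ) };             # out of range
print "error: $@" if $@;
eval { make_range_hook ( 10, 'ten' ) };    # bad macro arguments
print "error: $@" if $@;
```

Checking the arguments up front means that a typo in a Def shows up as soon as the Def is loaded, instead of as a puzzling comparison warning much later.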

#include: Defs From Components

Defs that are part of a single system or application usually share common element definitions, hooks, and methods. These common components can be abstracted out, placed into separate files, and #include'ed into as many defs as necessary.

<!-- file 'simple_el.include' -- anywhere in the defs_directories -->
<element><name>included_el</name></element>


<!-- and a 'simple_def.def' that uses the above include -->
<DocumentDefinition>
  <name>simple_def</name>
  <element><name>el_one</name></element>
  <element><name>el_two</name></element>

  <? #include simple_el ?>
</DocumentDefinition>

The #include syntax is quite different from most everything else that Comma defines for defs. XML aficionados will recognize it as modeled on a processing instruction, a special part of an XML document that is intended for the consumption of a particular parser or tool-chain and should be ignored by everyone else. Using the processing instruction syntax makes it possible to exempt #include directives from the normal rules governing what can go where in a def.

When a Comma parser encounters an #include statement, it looks at the word immediately following #include and tries to find an .include file of that name somewhere in the system's def directories. If it succeeds, the parser continues reading in the def from that file. When the parser reaches the end of the .include file, it returns to the original file and continues on. Except for adjusting the filename and line numbers that are reported if the parser encounters an error, this switch between files is completely transparent -- using an #include is the same as cutting and pasting the content of the #include into the def.

Of course, this wouldn't be Comma if you couldn't gussy up your #includes with perl code. For many purposes, the simple include behavior described above is perfectly sufficient. But sometimes the content of the #include needs to be customized for the def at hand. Here is an example, an include that takes two arguments and generates a customized method:

<!-- file 'first_word_method.include' -->

sub {
  my ( $method_name, $el_name ) = @_;
  return <<END;

<method>
  <name>$method_name</name>
  <code><![CDATA[
    sub {
      my \$self = shift;
      my \$content = \$self->get('$el_name');
      \$content =~ m|^(\w+)|;
      return \$1 || '';
    } ]]></code>
 </method>
END
}


<!-- and a def that uses 'first_word_method.include' -->
<DocumentDefinition>
  <name>another_include_example</name>

  <element><name>paragraph</name></element>
  <? #include {first_word_method} 'fw_paragraph', 'paragraph' ?>
</DocumentDefinition>

Wrapping the name of the include in curly brackets indicates that this is a dynamic, rather than a static include. Comma expects a dynamic .include file to return a code reference that, when executed, will return the content to be folded into the def. Any text that follows the curly-bracketed include name will be treated as a list to be eval'ed, then passed to the code reference as its arguments.
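The mechanics of a dynamic include can be imitated in a few lines of stand-alone Perl. The sketch below only mimics the steps just described (eval the argument text as a list, call the code reference, use the returned text); how Comma performs these steps internally is not shown in this guide, and the element names are just examples.

```perl
use strict;
use warnings;

# Stand-in for the content of a dynamic .include file: a code reference
# that returns a chunk of def text, here one simple element per argument.
my $include = sub {
  my @names = @_;
  return join '', map { "<element><name>$_</name></element>\n" } @names;
};

# The text that follows the curly-bracketed name is eval'ed as a list ...
my $arg_text = q{ 'street1', 'street2', 'city' };
my @args = eval $arg_text;
die $@ if $@;

# ... and the list is passed to the code reference, whose return value is
# the text that would be folded into the def.
print $include->( @args );
```

Because the arguments are eval'ed, anything that is legal in a Perl list (quoted strings, numbers, qw() and so on) can appear after the include name.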

Indexing

XML::Comma implements storage and indexing separately.

Comma storage generally involves writing complete documents out to disk. Each stored document is retrievable by a unique key, and collections of stored documents can be iterated across in key order. Most of the time, stored documents are saved as normal, XML-formatted text files. Modern filesystems are fast, robust and well understood. Relying on the standard filesystem functionality enables a systems administrator to use normal tools for backup, maintenance and monitoring, and allows programmers to use standard utilities for quick or simple manipulations. (It is very convenient, for example, to be able to do a quick grep on a directory full of Docs.)

Comma indexing involves saving pieces of documents in a relational database so that complex search, sort and retrieval operations can be performed flexibly and efficiently. These tasks are "above and beyond" what a filesystem is capable of, so Comma builds its indexing functionality as a relational database framework. The system was designed to work with any RDBMS: Comma provides a standard interface that sits atop the sophisticated storage and query capabilities of platforms such as MySQL, Postgres or Oracle. (As noted under Dependencies, only the MySQL back end is functional as of comma-1.90.)

A User Index Definition

An index allows a collection of Docs to be searched and sorted according to their elements' contents. We'll build an index for our User Docs to demonstrate the basic features of the indexing framework.

An index defines one or more fields, with each field normally corresponding to an element or method in the document definition. A simple index might only contain a single field:

<DocumentDefinition>
  <name>User</name>
  <element><name>username</name></element>
  <element><name>email</name></element>
  <element><name>full_name</name></element>

  <nested_element>
    <name>address</name>
    <element><name>street1</name></element>
    <element><name>street2</name></element>
    <element><name>city</name></element>
    <element><name>state</name></element>
    <element><name>zip</name></element>
  </nested_element>

  <plural>'address'</plural>

  <store>
    <name>main</name>
    <base>comma_guide</base>
    <location>Sequential_file</location>
  </store>

  <index>
    <name>main</name>
    <field><name>email</name></field>
  </index>

</DocumentDefinition>

With the main index now part of our document definition, we can use the index_update() method to add documents to it. Calling index_update() -- which takes the name of the index to update as its index=> argument -- adds a document to an index or, if the document is already present, updates the index to reflect any changes.

# add/update this Doc's record in the 'main' index
$doc->index_update ( index => 'main' );

On the other hand, index_remove() deletes a document from an index. Like index_update(), it expects the name of an index as an index=> argument.

# delete this Doc's record in the 'main' index
$doc->index_remove ( index=>'main' );

Indexing from different stores and defs

Sometimes it can be useful to index different document types or documents from multiple stores in the same document type into a single index. Comma allows one to do this by introducing the <store> and <doctype> directives within an index.

Querying the Index: Iterators

We need a way to get at the User Docs that are in our index. First we need a handle to the index itself. Then we can ask the index for an iterator that will step through all of the Docs:

# get 'main' index
my $index = XML::Comma::Def->read(name=>'User')->get_index ('main');
# get iterator
my $i = $index->iterator();
# iterate, printing out "$key: $email"
while ( $i ) {
  print $i->doc_key() . ': ' . $i->email() . "\n";
  $i++;
}

There are several new methods here. The get_index() method operates like get_storage(), taking a single argument and returning the index of that name. The index's iterator() method returns an iterator object, which provides a means to step through the documents in the index. An iterator can only deal with documents one at a time, and can only advance in one direction through its sequence. Here, we use the ++ operator to advance the iterator.

Every iterator exposes its fields as methods, so we call the email() method to get the value of this record's email field -- a value which came originally from the email element of the Doc that this record represents.

Every iterator implicitly includes the doc_key and the doc_id as fields, so doc_key() and doc_id() are always available as methods. Another special method, record_last_modified(), is also available. As its name suggests, record_last_modified() returns a timestamp (unix system time) indicating when the index record was last changed.

The ++ operator is actually a shortcut for a named method, iterator_next(). And if that isn't bad enough, there's an implicit check of the iterator_has_stuff() method triggered by the boolean context of the while statement. These implicit behaviors are an example of operator overloading, which is explained in detail in the Camel book. For our purposes, suffice it to say that 1) it is easy and correct to write an iterator loop as above, and 2) you've just seen the only two overloaded operators that the Iterator class defines -- ++ => iterator_next() and boolean-ization => iterator_has_stuff().
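For readers who have not met operator overloading before, here is a toy class that wires ++ and boolean context to named methods in the way just described. It is a self-contained sketch using Perl's standard overload pragma; ToyIterator and its current() method are invented for the example, and the real Iterator class (which talks to a database and positions itself lazily) is certainly more involved.

```perl
package ToyIterator;
use strict;
use warnings;

# Map the two operators onto named methods. The ++ handler returns the
# object itself so that the variable still holds the iterator afterwards.
use overload
  '++'   => sub { $_[0]->iterator_next(); return $_[0] },
  'bool' => sub { return $_[0]->iterator_has_stuff() };

sub new {
  my ( $class, @records ) = @_;
  return bless { records => [ @records ], pos => 0 }, $class;
}

# true while there is a current record to look at
sub iterator_has_stuff {
  my $self = shift;
  return $self->{pos} < scalar @{ $self->{records} };
}

# advance to the next record
sub iterator_next {
  my $self = shift;
  $self->{pos}++;
  return $self;
}

# the record the iterator currently points to
sub current {
  my $self = shift;
  return $self->{records}->[ $self->{pos} ];
}

package main;

my $i = ToyIterator->new ( 'ann', 'bob', 'cy' );
my @seen;
while ( $i ) {                 # boolean context calls iterator_has_stuff()
  push @seen, $i->current();
  $i++;                        # calls iterator_next()
}
print join ( ', ', @seen ), "\n";   # prints "ann, bob, cy"
```

Nothing about the loop itself reveals that two method calls are hiding behind the operators, which is exactly the convenience (and the surprise) that overloading provides.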

In general, I prefer compact idioms, and the simple ++ loop is both compact and (to me, anyway) highly readable. However, in the spirit of over-explanation, here are several exactly-equivalent versions of the lowly iterator loop. (And a note to careful observers: it doesn't matter that some of these loops "increment" the iterator on loop entry and some don't -- an iterator contrives to point to its first record in a "lazy" fashion so that programmers don't ever have to worry about whether an iterator is newly-created or not.)

my $i = $index->iterator();
while ( $i++ ) {
  $i->foo();
}

my $i = $index->iterator();
while ( $i->iterator_has_stuff() ) {
  $i->foo();
  $i->iterator_next();
}

my $i = $index->iterator();
while ( $i ) {
  $i->foo();
  $i->iterator_next();
}

my $i = $index->iterator();
while ( $i->iterator_next() ) {
  $i->foo();
}

my $i = $index->iterator();
while ( $i ) {
  $i->foo();
  $i++;
}

NOTE -- there is a bug in perl 5.6.x and 5.8.0 that makes the first (and most compact) idiom above leak memory. The iterator object won't get properly garbage collected when the while() loop is written like that. This can be a big problem in long-running contexts (such as inside a web server). For any long-running code, use one of the more verbose forms.

Of course, an iterator that steps through all of the records in an index is not usually what you want. The iterator() method accepts arguments that specify matching and sorting criteria, making it possible to construct iterators that return a subset of an index in a specified order.

The where_clause => <sql-where> argument matches a conditional phrase against the index's fields to narrow down the records that are returned. The order_by => <sql-order-by> argument controls the ordering of the records.

The iterator() method constructs a complex SQL statement that, when executed by the database, selects the records that the iterator will include. If a where_clause or order_by argument is supplied when an iterator is constructed, that piece of SQL logic is integrated into the iterator's complete SQL statement. Given a knowledge of generic SQL, it is easy to write where_clause and order_by arguments -- simply treat each field as you would a column in the database and the iterator parser will do the rest. An iterator with both a where_clause and an order_by might look like this:

# find all users with hotmail addresses and sort alphabetically
my $i = $index->iterator ( where_clause => 'email LIKE "%@hotmail.com"',
                           order_by => 'email' );

Let's add another couple of fields to this index, so we can build some more interesting iterators. The username element is easy to add; it's just another field. If we also want to add the zip of the first address, that's a little harder. The index fields that we've seen so far map to top-level pieces of a Doc. One way to get at the zip information we need is to add a method to the User Def that fetches the zip of the first address. Here is the new method, along with the expanded index:

<!-- a method to return the zip code of the first address -->
<method>
  <name>first_zip</name>
  <code>
    <![CDATA[ sub { return $_[0]->element('address')->element('zip')->get() } ]]>
  </code>
</method>

<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
</index>

Another way, if the method is unlikely to be used for anything except building the index table, is to add a code specifier to the field:

<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field>
  <name>first_zip</name>
  <code>
    <![CDATA[ sub { return $_[0]->element('address')->element('zip')->get() } ]]>
  </code>
</field>
</index>

As you can see, this is almost like adding another method to the Def -- in fact, we didn't change the embedded perl at all. The main difference is that we're not "cluttering up" the top level of our Def with a method that will only be used as part of the indexing operations. The code block is passed two arguments, the doc being indexed and the index element. It turns out that you almost always use the doc, and almost never use the index.

Adding a code block to a field disassociates the name of the field from the data that it stores. This is, obviously, useful. It can also be confusing. The default, non-code behavior is worth sticking to whenever possible, to keep defs and programs as maintainable as possible. (A code block can also be part of collection and sort elements, which are described below.)

And here are a few possible iterators:

# find all the users with .edu addresses -- in any order
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"' );

# find all the users with .edu addresses in Beverly Hills
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu" AND first_zip = "90210"' );

# sort the .edu email addresses by string length in descending order, then
# alphabetically by username (uses mysql's LENGTH function)
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"',
                           order_by => 'LENGTH(email) DESC, username' );

Plural Items in Indexes

The field elements of an index hold values derived from a Doc's elements and methods. As we've seen, fields can be used to select sets of records from an index, and to control the order in which those results are returned. One limitation of using fields in this way, however, is that each field can only hold a single value per record. Looked at another way, fields do an excellent job standing in for singular elements, but are not at all suited to dealing with plural elements. A field that corresponds to a plural element will always contain only the value of the first of those elements.

Another type of index element, the collection, is designed to accommodate plural values, and to allow the kinds of "sorting" operations that are common to many kinds of documents. Unlike a field, a collection cannot be used in a where clause; collections are a special-purpose tool. Let's add a collection to our User index that will allow us to select all of our records that include an address with a given zip-code -- any address this time, not just the first one.

<!-- a method to return the zip codes of each address, as an array -->
<method>
  <name>zips</name>
  <code>
    <![CDATA[ sub {
      my @zips;
      foreach my $addr ( $_[0]->elements('address') ) {
        push @zips, $addr->element('zip')->get();
      }
      return @zips;
    } ]]>
  </code>
</method>

<!-- the expanded index definition, now including the zips info -->
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
</index>

We tied the zips collection to a method, but we could just as easily have tied it to a plural, non-nested element. The collection isn't particular; it just wants to be handed an array when the index is updated.

To select all of the users with an address in a given zip code, we request an iterator qualified by a collection_spec. A collection_spec is a string of the form <collection_name>:<value>. The collection_spec argument can be combined with the where_clause and order_by specifiers that we've already seen:

# select all the users with an address in the 20003 zip code -- in any order
my $i = $index->iterator ( collection_spec => 'zips:20003' );

# select the users as above, and order them alphabetically by username
my $i = $index->iterator ( collection_spec => 'zips:20003',
                           order_by => 'username' );

# select all of the users with an address in 20003 who also have a .edu
#   email address, and order them alphabetically by username
my $i = $index->iterator ( collection_spec => 'zips:20003',
                           where_clause => 'email LIKE "%.edu"',
                           order_by => 'username' );

Complex Collection Selectors

It is possible to specify complex collection_spec arguments when creating an Iterator, combining individual selectors with AND, OR and NOT.

For example:

# select all of the users who have an address in a 2000x zip code AND an
# address in a 0213x zip code, who also have a .edu email address, and
# order them alphabetically by username
my $i = $index->iterator ( collection_spec => 'zips:2000% AND zips:0213%',
                           where_clause => 'email LIKE "%.edu"',
                           order_by => 'username' );

# select users with an address NOT in the 20003 zip code.
my $i = $index->iterator ( collection_spec => 'NOT zips:20003' );

# a hypothetical collection pair with spaces and single quotes. We use
# two backslashes here because the double quotes that surround the whole
# string treat a single backslash as part of an escape character!
my $i = $index->iterator ( collection_spec => "'test:it\\'s easy' OR
                                               'test:it\\'s hard'" );
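The backslash arithmetic in that last example is easy to get wrong, so it can be worth printing the string to see what Perl actually builds. The short check below has nothing to do with Comma itself; it just shows that the double-quoted form with two backslashes and a q{} form with one backslash produce the same string. Using q{} here is only a suggestion, not something the guide prescribes.

```perl
use strict;
use warnings;

# Both strings end up with exactly one backslash before each apostrophe.
my $double = "'test:it\\'s easy' OR 'test:it\\'s hard'";
my $single = q{'test:it\'s easy' OR 'test:it\'s hard'};

print "$double\n";
print "same string\n" if $double eq $single;
```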

It is fairly easy to create complex specs that slow down database queries quite a lot. In particular, OR'ing together selections on large collections is very slow.

Full Text Search

A special content-holder is available that enables full-text search on an index component. We could make all of a User's address information searchable by defining a method to generate a chunk of "address text", then defining an index textsearch container:

<!-- addresses_text: a method to glob all of a User's addresses together into
     a single string -->
<method>
<name>addresses_text</name>
<code>
  <![CDATA[
    sub {
      my $self = shift();
      my $addr_text = $self->full_name() . "\n";
      foreach my $addr ( $self->elements('address') ) {
        $addr_text .= $addr->street1() . "\n";
        $addr_text .= $addr->street2() . "\n";
        $addr_text .= $addr->city() . ' ' . $addr->state() . ' ' . $addr->zip() . "\n";
      }
      return $addr_text;
    }
  ]]>
</code>
</method>

<!-- the 'main' index, redefined to add full-text search on the addresses -->
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<textsearch><name>addresses_text</name></textsearch>
</index>

With the new 'addresses_text' textsearch in place, we can use the full-text search feature in constructing iterators. A textsearch_spec argument specifies keywords that must appear in a record for it to be returned as part of an iterator's result set:

# look up all users who have the word "elm" (or "elms", "elmy", "elmed",
# etc.) in any of their addresses
my $i = $index->iterator ( textsearch_spec=>'addresses_text:elm' );

# look up all users with an 'elm' and a 'springfield' in any
# of their addresses
my $i = $index->iterator ( textsearch_spec=>'addresses_text:elm springfield' );

As the comments in the above example imply, the textsearch subsystem includes a "preprocessor" interface that allows words to be stemmed and pruned before indexing. The preprocessor defaults to XML::Comma::Pkg::Textsearch::Preprocessor_En, which handles English text. It includes a stop list of roughly 500 words, and relies on the CPAN module Lingua::Stem to do its stemming.

There are currently two other preprocessors in the standard distribution, Preprocessor_Fr for French and Preprocessor_Sp for Spanish. If you are only handling English-language content, you can skip the next three code examples, which detail how to specify Preprocessors other than Preprocessor_En.

A textsearch's which_preprocessor element controls which preprocessor will be used: which_preprocessor should define a sub that will be passed some combination of four arguments -- an active $doc; $index and $textsearch objects; and a search "attribute." The sub must return the name of the Preprocessor package that should be used. Here is a typical example:

  <textsearch>
    <name>body_text</name>
    <which_preprocessor>
      sub { return 'XML::Comma::Pkg::Textsearch::Preprocessor_Fr'; }
    </which_preprocessor>
  </textsearch>

Not much to it.

Things get somewhat more complex when we have to choose between multiple pre-processors on the fly. A Preprocessor is used in two different contexts: 1) when a doc is indexed, and 2) when a search is performed. The routine below uses the $doc argument to determine what Preprocessor to use in the former case, and the $attribute argument in the latter.

  <textsearch>
    <name>paragraph</name>
    <which_preprocessor>
      <![CDATA[ 
        use XML::Comma::Pkg::Textsearch::Preprocessor_En;
        use XML::Comma::Pkg::Textsearch::Preprocessor_Sp;
        sub {
          my ( $doc, $index, $ts, $attribute ) = @_;
          if ( ( $doc and $doc->lang_code() eq 'sp' )  or
               ( defined $attribute and $attribute eq 'sp' ) ) {
            return 'XML::Comma::Pkg::Textsearch::Preprocessor_Sp';
          } else {
            return 'XML::Comma::Pkg::Textsearch::Preprocessor_En';
          }
        }
      ]]>
    </which_preprocessor>
  </textsearch>

The $attribute argument's value comes from the textsearch_spec, which has a special form for just this purpose:

# look up a word in the index, stemmed by the Spanish pre_processor
my $i = $index->iterator ( textsearch_spec=>'body_text{sp}:lobos' );

The extra bit of text after the textsearch name, enclosed in curly brackets, is stripped off and passed as the $attribute to the which_preprocessor sub.
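To make the shape of a textsearch_spec concrete, here is a small stand-alone parser for the name{attribute}:words format. It is a hypothetical illustration only: split_textsearch_spec is not a Comma function, and Comma's own parsing code may well differ in its details.

```perl
use strict;
use warnings;

# Hypothetical parser for the spec format described above. Returns the
# textsearch name, the optional attribute (undef if absent), and an
# array reference holding the individual search words.
sub split_textsearch_spec {
  my $spec = shift;
  my ( $name, $attribute, $words ) =
    $spec =~ m/^ (\w+) (?: \{ ([^}]*) \} )? : (.*) $/x
      or die "unrecognized textsearch_spec '$spec'\n";
  return ( $name, $attribute, [ split ' ', $words ] );
}

my ( $name, $attr, $words ) = split_textsearch_spec ( 'body_text{sp}:lobos' );
print "$name / $attr / @$words\n";

( $name, $attr, $words ) = split_textsearch_spec ( 'addresses_text:elm springfield' );
print "$name / ", ( defined $attr ? $attr : '(none)' ), " / @$words\n";
```

In the second call there is no curly-bracketed part, so the attribute comes back undefined, which is the situation a which_preprocessor sub has to be prepared for.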

The back end of the textsearch facility is currently implemented on top of, and as part of, the Comma database-specific modules. Its efficiency is only mediocre, and performing the indexing operation on each document write is somewhat resource-intensive. Because of this, a textsearch can specify that its operations should be deferred -- performed as a batch rather than on each and every update of the index. A cron job or application hook can be written to call an index's sync_deferred_textsearches() method at some convenient time (or at some regular interval).

<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<textsearch>
  <name>addresses_text</name>
  <defer_on_update>1</defer_on_update>
</textsearch>
</index>

A number of important features are missing from the current implementation of full-text search: support for more languages, the ability to search for phrases within text, boolean OR'ing, etc. The strengths of the current implementation are that the full-text search is fully integrated with the rest of the database system, so complex iterators that include several different kinds of qualifiers can easily be constructed; and that the storage overhead is relatively small (only the inverted index is stored in the database, and that in a compressed form).

Work to improve the textsearch framework is certainly an area of interest for the Comma developers. It is likely that integration with database-provided full-text search capabilities is the best long-term option for fast, robust operation. Oracle certainly provides such capabilities. For the moment, the open source databases lag behind in this area.

Using an Iterator Over and Over: iterator_refresh()

It is often convenient to "reuse" an iterator. The iterator_refresh() method re-fills and resets the iterator. In its no-argument form iterator_refresh() is equivalent to asking the index for a new iterator with exactly the same specifications. However, the method also takes two optional arguments to limit the total number and the starting position of the results that are returned. Here are some examples:

# usage: $iterator->iterator_refresh ( [ limit_number [, limit_offset ]] );

## simple refresh of a once-used iterator
#
my $i = $index->iterator ( collection_spec => 'zips:20003' );
while ( $i++ ) {
  # ... do some stuff
}
$i->iterator_refresh();
# now we can loop through again
while ( $i++ ) {
  # ... do some other stuff
}


## using iterator_refresh() to process only the first 10 results of a set
#
my $i = $index->iterator()->iterator_refresh ( 10 );
while ( $i++ ) {
  # ... do something with the first 10 (or fewer, if there weren't
  #     even that many)
}


## using iterator_refresh() to process the eleventh through fifteenth
## results (noting that the second argument, the offset, is zero-indexed)
#
my $i = $index->iterator()->iterator_refresh ( 5, 10 );
while ( $i++ ) {
  # ... do something with these five results (again, assuming that
  #     there are that many)
}
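
The limit and offset arguments lend themselves to paging through a result set. The following is a minimal sketch of the arithmetic only; page_window() is a hypothetical helper, not a Comma method, and it assumes pages are numbered from 1.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a 1-based page number and a page size into
# the ( limit_number, limit_offset ) pair that iterator_refresh() takes.
# The offset is zero-indexed, so page 1 starts at offset 0.
sub page_window {
  my ( $page, $per_page ) = @_;
  die "page must be 1 or greater\n" if $page < 1;
  return ( $per_page, ( $page - 1 ) * $per_page );
}

my ( $limit, $offset ) = page_window ( 3, 5 );
print "limit=$limit offset=$offset\n";   # limit=5 offset=10

# With a real iterator this would be used as:
#   $i->iterator_refresh ( page_window ( 3, 5 ) );
```

Page 3 with five results per page gives the same ( 5, 10 ) pair as the "eleventh through fifteenth" example above.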

Fetching the Record's Doc: read_doc() and retrieve_doc()

Generally, an index should be designed so that its fields hold the most commonly-used pieces of information in a Doc. Of course, any criterion that will be used to select from an index must be available as a field or collection. Additionally, any part of a Doc that is regularly used during an iteration should also be defined as a field.

But sometimes you actually need to get the Doc itself from an iterator -- perhaps to do some complex read operation, or to check the content of an element that is so infrequently used that it makes little sense to include it as a field, or even to change the Doc and re-store it. Two iterator methods make this possible: read_doc() and retrieve_doc().

As the names suggest, read_doc() is analogous to Doc->read() and fetches a read-only copy of the document, while retrieve_doc() is like Doc->retrieve(), returning a fully modifiable Doc.

# print a simple list indicating how many addresses each User has defined
my $i = $index->iterator();
while ( $i++ ) {
  print $i->doc_key() . ': ' . scalar @{$i->read_doc()->elements('address')} . "\n";
}

# permanently delete (from the store that this index is tied to)
#   all documents with a .edu email address
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"' );
while ( $i++ ) {
  print "deleting " . $i->doc_key() . "\n";
  $i->retrieve_doc()->erase();
}

Fetching One Record: single() and Company

An iterator retrieves a set of records, in a particular order. Sometimes you only want one record from an index. The single() method accepts the same arguments as iterator(), but it never returns more than one record, and if no records satisfy its specification it returns undef.

Usually, single() is used when you know there will only be one record in the index that matches your selection criteria. For example, we could write a pre_store_hook to make sure that no document is ever stored that has the same email address as a document that is already present. (See the section on advanced store techniques for more information about store hooks.)

<pre_store_hook>
<![CDATA[
  sub {
    my ( $self, $store ) = @_;
    my $email = $self->element('email')->get();
    my $index = $self->def()->get_index('main');
    if ( $index->single(where_clause => "email = '$email'") ) {
      die "the email address '$email' is already in use\n";
    }
  } 
]]>
</pre_store_hook>

The single() method isn't strictly necessary (you can always substitute some equivalent, if longer and more involved, iterator creation and refresh statement), but it does save some typing and makes code more readable. In the same spirit, two more methods exist that provide additional short-cuttage: single_read() and single_retrieve().

As their names suggest, single_read() is a single() call plus (if possible) a read_doc(), and single_retrieve() is the same except with retrieve_doc(). Both methods return <