Comma 2.0

Comma 2.0 is fully backwards compatible with Comma 1.x and adds various new features designed to improve programmer productivity, increase performance, and tighten integration with perl. For more information, read on.

Upgrade Considerations

  • Comma 2.0 changes the database format used for collection types. These will need to be dropped and re-indexed.
  • Comma 2.0 introduces a few new dependencies: Storable, Crypt::Blowfish (replaces Crypt::Twofish), PAR, and Test::More.
  • Postgres support in Comma 2.0 has not been as thoroughly tested as it was in comma 1.x. Please let us know about your experiences if you intend to use comma with a Postgres backend.
  • You will experience loss or corruption of log messages if you try to put your comma.log on an NFS directory. This is because the new and greatly improved locking mechanism uses flock(), which cannot be relied on to lock files over NFS.
  • With versions of DBD::mysql <= 3.0007, you will see lots of warning messages like this:
      prepare_cached(
              SELECT table_name 
                FROM index_tables 
                WHERE doctype  = ? 
                AND textsearch = ? 
                  ORDER BY table_type
            ) statement handle DBI::st=HASH(0x8b2b390) still Active at 
      .test/lib//XML/Comma/SQL/Base.pm line 755
    
    These are harmless but, if bothersome, can be avoided by upgrading DBD::mysql to a version newer than 3.0007.
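
To make the flock() point concrete, here is a minimal sketch of appending to a log file under an exclusive advisory lock. It is not Comma's Log.pm code: append_log_line and the temporary file are made up for illustration. On a local filesystem the lock serializes writers; over NFS that guarantee may not hold, which is why comma.log belongs on local disk.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);
use File::Temp qw(tempfile);

# Hypothetical helper, not part of Comma: append one line under an exclusive lock.
sub append_log_line {
  my ($path, $line) = @_;
  open(my $fh, '>>', $path) or die "cannot open $path: $!";
  flock($fh, LOCK_EX)       or die "cannot lock $path: $!";
  # Another writer may have appended while we waited for the lock.
  seek($fh, 0, SEEK_END)    or die "cannot seek in $path: $!";
  print {$fh} "$line\n"     or die "cannot write to $path: $!";
  close($fh)                or die "cannot close $path: $!";  # close flushes and releases the lock
  return;
}

# Stand-in for comma.log on a local (non-NFS) filesystem.
my ($tmp_fh, $log_path) = tempfile(UNLINK => 1);
close($tmp_fh);

append_log_line($log_path, 'first message');
append_log_line($log_path, 'second message');

# Read the file back to confirm both lines arrived intact.
open(my $in, '<', $log_path) or die "cannot read $log_path: $!";
my @lines = <$in>;
close($in);
print scalar(@lines), " lines written\n";
```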

New features

DefModule

TODO

Lazy evaluation and auto-instantiation of iterator fields

Let's say you have a bunch of documents and, whenever some arbitrarily complex condition that depends in part on indexed fields is true, you need to access other fields in the document. In comma 1.x, you can quickly write some slow code to do this, like so: *

while(++$it) {
  my $d = $it->read_doc;
  if( f1($d->indexed_field) ) {
    f2($d->unindexed_field);
  }
}

But this reads every document from disk. What if f1() returns true only 1 time in 400? Then you could do something like this, which is much more efficient:

while(++$it) {
  if( f1($it->indexed_field) ) {
    my $d = $it->read_doc;
    f2($d->unindexed_field);
  }
}

But when you're developing a complex application quickly, you shouldn't need to remember One More Thing (are these fields indexed or not?). Comma 2.0 Does What You Mean, efficiently, querying the database when possible and pulling in elements from disk when necessary:

while(++$it) {
  if( f1($it->indexed_field) ) {
    f2($it->unindexed_field);
  }
}
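
How such a lazy lookup might work under the hood can be sketched with a tiny mock. None of this is Comma's actual implementation: MockLazyRow, its loader callback, and the reads counter are invented for illustration. Indexed values are answered from data already in hand, and the (pretend) disk read happens only the first time an unindexed field is requested.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one iterator position; not an XML::Comma class.
package MockLazyRow;

our $AUTOLOAD;

sub new {
  my ($class, %args) = @_;
  return bless {
    indexed => $args{indexed},  # hashref of values already in the index
    loader  => $args{loader},   # coderef standing in for the disk read
    doc     => undef,           # full document, loaded on demand
    reads   => 0,               # number of simulated disk reads so far
  }, $class;
}

# Load the full document once and cache it.
sub read_doc {
  my $self = shift;
  unless (defined $self->{doc}) {
    $self->{doc} = $self->{loader}->();
    $self->{reads}++;
  }
  return $self->{doc};
}

sub reads { return $_[0]{reads} }

sub DESTROY { }  # keep object destruction out of AUTOLOAD

# Any other method name is treated as a field lookup.
sub AUTOLOAD {
  my $self = shift;
  (my $field = $AUTOLOAD) =~ s/.*:://;
  return $self->{indexed}{$field} if exists $self->{indexed}{$field};
  return $self->read_doc->{$field};
}

package main;

my $row = MockLazyRow->new(
  indexed => { indexed_field => 'from index' },
  loader  => sub { return +{ unindexed_field => 'from disk' } },
);

my $indexed_value       = $row->indexed_field;    # answered without a disk read
my $reads_after_indexed = $row->reads;
my $unindexed_value     = $row->unindexed_field;  # triggers the one disk read
my $reads_after_both    = $row->reads;

print "$indexed_value ($reads_after_indexed reads), ",
      "$unindexed_value ($reads_after_both reads)\n";
```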

Iterator->to_array()

What if, instead of the iterator API, you need or want to treat your results as an array of documents? In comma 1.x, you had to do this yourself:

my @res = ();
while(++$it) {
  push @res, $it->full_name;
}
@res = sort { $a cmp $b } @res;

In comma 2.0, you can replace this with a one-liner:

my @res = sort { $a cmp $b } map { $_->full_name } $it->to_array;

That's a bit more compact, but what if you wanted to do something more complex?

my $index; my $it = $index->iterator();
my @ret = ();
while(++$it) {
  push @ret, $it->full_name;
}
@ret = sort { $a cmp $b } @ret;
foreach my $name (@ret[0..9]) {
  my $d = $index->single_read(where_clause => "full_name = '$name'");
  f($d);
}

In 2.0, this all gets reduced to roughly half the code, which, as a bonus, runs much faster:

my $index; my $it = $index->iterator();
my @ret = sort { $a->full_name cmp $b->full_name } $it->to_array;
foreach my $vd (@ret[0..9]) {
  f($vd);
}

Of course, sorting these documents is not very complicated, and, in this simple example, you'd almost certainly have used an order_by clause instead:

my $index; my $it = $index->iterator(order_by => "full_name ASC");
my @ret = ();
while(++$it) {
  push @ret, $it->read_doc;
}
foreach my $d (@ret[0..9]) {
  f($d);
}

And now you can use even less code to express the same concept with the new syntax:

my $index; my $it = $index->iterator(order_by => "full_name ASC");
my @ret = $it->to_array;
foreach my $vd (@ret[0..9]) {
  f($vd);
}

What's more, depending on what f() does, this runs anywhere from slightly to much faster than the previous idiom, because the virtual docs that to_array returns call read_doc only when they must: anything in the index is always read from the index. This also lets us pull as much of our logic as we want out of the special-purpose language (SQL) and into the general-purpose one (perl).

Allow operator overloading and shortcut syntax on storage iterators

Comma 2.0 introduces operator overloading that gives storage iterators an API much more like that of indexing iterators. This is another case of letting the coder forget details that aren't important, so she can keep a smaller working set, so to speak. So, whereas in comma 1.x you'd have:

my $it = $store->iterator(pos => '-');
while(my $doc = $it->next_read) {
  ...
}

You can now use the following idiom, which is identical to how you would use an indexing iterator:

my $it = $store->iterator(pos => '-');
while(++$it) {
  my $doc = $it->read_doc();
  ...
}

Note that the ++ operator goes in the direction you "probably" want, according to the direction you chose with the pos => argument in the constructor. That is to say, if you are using pos => "-", ++ calls next_read; otherwise, it calls prev_read. This way, regardless of the pos argument, while(++$it) always goes from the "start" to the "end" of the iterator, just as it does with indexing iterators regardless of any order_by clause.
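
The direction rule can be illustrated with a small mock that overloads ++ and boolean context. This is a sketch under stated assumptions, not Comma's storage iterator: MockStorageIterator, its ids argument, and the walk helper are invented, and only the pos, next_read, prev_read, and doc_id names come from the text above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a storage iterator; not XML::Comma's real class.
package MockStorageIterator;

use overload
  '++'   => \&_advance,                      # ++$it steps in the direction chosen by pos
  'bool' => sub { defined $_[0]{current} },  # false once we run off either end
  '='    => sub { $_[0] };                   # copy constructor: keep mutating the same object

sub new {
  my ($class, %args) = @_;
  my $pos = defined $args{pos} ? $args{pos} : '-';
  my $ids = $args{ids} || [];
  return bless {
    ids     => [ @$ids ],
    forward => ($pos eq '-') ? 1 : 0,
    # start just before the first id, or just after the last one
    idx     => ($pos eq '-') ? -1 : scalar(@$ids),
    current => undef,
  }, $class;
}

sub next_read {
  my $self = shift;
  $self->{idx}++ if $self->{idx} < @{ $self->{ids} };
  return $self->_load;
}

sub prev_read {
  my $self = shift;
  $self->{idx}-- if $self->{idx} >= 0;
  return $self->_load;
}

# Set the current doc id, or undef when the position is out of range.
sub _load {
  my $self = shift;
  my $i = $self->{idx};
  my $n = scalar @{ $self->{ids} };
  $self->{current} = ($i >= 0 && $i < $n) ? $self->{ids}[$i] : undef;
  return $self->{current};
}

# Handler for ++: pos => '-' walks forward, anything else walks backward.
sub _advance {
  my $self = shift;
  $self->{forward} ? $self->next_read : $self->prev_read;
  return $self;
}

sub doc_id { return $_[0]{current} }

package main;

# Collect the doc ids in the order while(++$it) visits them.
sub walk {
  my ($pos) = @_;
  my $it = MockStorageIterator->new(pos => $pos, ids => [qw(0001 0002 0003)]);
  my @seen;
  while (++$it) {
    push @seen, $it->doc_id;
  }
  return join ',', @seen;
}

print walk('-'), "\n";  # oldest to newest
print walk('+'), "\n";  # newest to oldest
```

Either way the loop body is identical; only the constructor argument decides which end counts as the "start".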

Also, storage iterators in comma 1.x were heavily tied to the concept of calling $it->{next_read|prev_read} to read a doc reference before accessing any of its information:

my $it = $store->iterator(pos => '+');
while(my $doc = $it->prev_read) {
  print "doc_id: " . $doc->doc_id .
    " has field1: " . $doc->field1 . "\n";
}

Comma 2.0 introduces a read_doc() function and allows shortcut syntax on storage iterators. These behave "as you would expect" if you are familiar with these concepts in indexing iterators. So the above can be re-written as either of the below:

while(++$it) {
  print "doc_id: " . $it->doc_id .
    " has field1: " . $it->field1 . "\n";
  ...
}

OR:

while(++$it) {
  my $doc = $it->read_doc;
  print "doc_id: " . $doc->doc_id .
    " has field1: " . $doc->field1 . "\n";
  ...
}

Of course, the Comma 1.x syntax for these idioms still works and is not deprecated. However, it is not advisable to switch between the two idioms mid-loop (but if you do and encounter unexpected results, please let us know).

Multiple defs can put data from multiple stores in the same index

TODO

Changelog

  • introduction of XML::Comma::Util::DefModule, which allows for much easier maintenance of defs that consist predominantly of perl code (methods and hooks). It also provides a relatively easy way to plug existing perl modules into comma. See perldoc XML::Comma::Util::DefModule for details.
  • multiple defs can put data from multiple stores in the same index.
  • support for "virtual" docs and $it->to_array (see the Iterator->to_array() section above for examples)
  • installs as any other standard perl module, and prompts for configuration values instead of making you iteratively error out and change them. configuration remembers previous input values. only the make install step needs root privileges.
  • add support for lazy field evaluation in iterators, ie:
        my $it = XML::Comma::Def->d->get_index("i")->
          iterator(fields => [ "a", "b" ]);
        $it->a(); #fast
        $it->b(); #fast
        $it->c(); #slow, but works
    
  • new(file => ...), new(block => ...), read(), and retrieve() now validate by default, so you can't get an invalid doc. Override this per call with a validate => 0 or validate => 1 argument, or set validate_new in Configuration.pm for a system-wide default.
  • Allow operator overloading on storage iterators, ie:
      while(++$it) { my $doc = $it->read_doc(); ... }
    
  • dramatically improve performance of Log.pm (this also fixes a potential shell-injection bug when log() received tainted data). Because of the security implications, this has been backported to comma 1.x (svn revisions after 841).
  • blob elements can now be appended to with append()
  • added toggle method to boolean macro
  • add date_8 method to unix_time macro
  • fix race conditions in next_sequential_id pointed out by Bill Herrin
  • fix a copy/move bug with temporary blobs
  • allow slashes in derived_file <location>s
  • added better support for 'binary table' collection type, deprecated 'many tables' and 'stringified' collection types.
  • add Timestamped_random locations for pseudo-random document names keyed off system time.
  • fix index_only storage to allow derive_from
  • add -module flags to comma-load-and-store-doc.pl, comma-drop-index.pl (for use with DefModule)
  • add $iterator->select_count() for efficiently determining the number of elements in indexing iterators (mysql support only)

Footnotes

* There are cases where having a method in the def might allow you to avoid some of these problems, but this too can be done much more tersely and logically in Comma 2.0 with XML::Comma::Util::DefModule.