XML::Comma -- A Platform for Large-Scale Web Development

XML::Comma is an information management platform. Comma speeds the development of content-heavy applications, and was designed to solve some of the problems that make managing extremely large web sites so expensive, difficult and tedious.

Comma is written mostly in Perl, and its target demographic is the Perl programmer who must build customized, complex systems that handle very large amounts of dynamic content. Like most software that is designed to be used by programmers to build other software, Comma is several things at once: a code library, a design framework, a development methodology and a runtime system all rolled into one. However, Comma's central philosophy is "play well with others," and the system depends heavily on a number of tools -- among them the Apache web server and its mod_perl extensions, the HTML::Mason web development environment, relational databases, and underlying filesystem and OS utilities -- to implement its functionality and to provide programmers with a complete, flexible, scalable, and familiar toolkit.

Comma shapes information into "documents," and -- as its (full) name implies -- uses XML to structure those documents. XML, like Perl, is a powerful and standard tool for organizing text. But XML, again like Perl, doesn't do much of anything by itself. Comma defines a number of discrete "processes" in the "life-cycle" of a document and provides frameworks that abstract the basic activities common to each process. These frameworks cover:

  • structuring and validation;
  • long-term storage;
  • programmatic manipulation;
  • indexing for fast sorting and retrieval.

Embedding Perl in XML, and Vice Versa

XML::Comma was designed by Perl programmers for Perl programmers. Comma's core assumption is that the fastest, most efficient and most flexible way to specify something, or to transform something, or to keep track of something, is usually to write a little bit of Perl code. So Comma provides facilities for embedding contextually-appropriate Perl inside XML "document definitions", and specifies an API that allows chunks of information to be manipulated as Perl objects and then written out to permanent store as XML files.

As mentioned above, Comma also takes a lot of the drudgery out of writing code to validate documents, to store and retrieve them, and to break them up and shove the pieces into a relational database. These frameworks -- particularly the filesystem and database interfaces -- are very modular and relatively complex. But much document behavior can be specified in the XML-formatted document definitions, so you often don't need to write Perl code (though you always can if you want to). And experienced programmers can write Comma "macros" and "methods" that are callable from anywhere in the Comma system and can be easily used by less-experienced programmers or even by web designers who don't think that they're programming at all.

Given all this, it's not quite accurate to think of XML::Comma as 10,000 lines of glue code for your snippets of Perl -- but it's close. Comma's goal is to provide an application-neutral framework that abstracts many of the common problems inherent to managing several hundred thousand chunks of often-changing information.

What's That About Relational Databases?

Comma usually stores documents in flat files as XML-marked-up text. But the system indexes -- using a relational database -- as many (or as few) of the stored documents as the application requires. This strategy of separating "storage" and "indexing" is an attempt to have it both ways: the filesystem offers speed (of document-scale writing and retrieval), standard-ness and flexibility; a relational database offers speed (of element-level retrieval and set-of-documents-level extraction), standard-ness (through SQL-driven queries) and power.
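The split between "storage" and "indexing" can be illustrated with a small, language-neutral sketch. The code below is not Comma's API: it is a minimal Python example using only the standard library, with SQLite standing in for the relational database. The element names (story, title, body), the table name story_index and the helper functions are all assumptions made for illustration.

```python
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def store_document(store_dir, doc_id, title, body):
    """Storage: write the whole document to the filesystem as an XML file."""
    root = ET.Element("story")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "body").text = body
    path = Path(store_dir) / f"{doc_id}.xml"
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
    return path


def index_document(db, doc_id, path):
    """Indexing: copy only the fields worth querying into the database."""
    title = ET.parse(str(path)).getroot().findtext("title")
    db.execute(
        "INSERT OR REPLACE INTO story_index (doc_id, title, path) VALUES (?, ?, ?)",
        (doc_id, title, str(path)),
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(":memory:")
        db.execute(
            "CREATE TABLE story_index (doc_id TEXT PRIMARY KEY, title TEXT, path TEXT)"
        )
        for doc_id, title, body in [
            ("0001", "Harvest report", "Long text ..."),
            ("0002", "Election results", "More long text ..."),
        ]:
            index_document(db, doc_id, store_document(tmp, doc_id, title, body))

        # Set-of-documents question: answered by the index alone.
        titles = [row[0] for row in db.execute(
            "SELECT title FROM story_index ORDER BY title")]

        # Document-scale retrieval: the index supplies the path,
        # the filesystem supplies the content.
        (path,) = db.execute(
            "SELECT path FROM story_index WHERE doc_id = ?", ("0002",)).fetchone()
        body = ET.parse(path).getroot().findtext("body")
        db.close()
        return titles, body
```

Calling demo() returns the sorted titles from the index and the body of one document read back from its file. Note that the body text never enters the database; only the title and the file location do.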

The traditional approach to this kind of information management is to use the relational database for both storage and indexing, and some database administrators will likely be leery of a system that does not rely completely and solely on the database. But there are a number of potential advantages to asking the filesystem and the database each to do what it is best at:

  • Database load decreases. The application doesn't ask the database to do document-scale reads and writes, only to read from or update index tables. This amounts to a sort of rudimentary "load-balancing" between the filesystem and database; in fact, the filesystem and database can run on separate physical machines, with one or both of them accessed across the network. (And explicit load-balancing across multiple databases is also possible, as Comma allows different database connections to be specified for different parts of the system.)
  • Database table size decreases. Because the database stores only the pieces of information that need to be indexed or piece-wise accessed, database tables can be much smaller than they would be in a traditional architecture. More tables can fit in memory, and fewer reads from disk are required.
  • Information architecture is easier. Relational databases don't store hierarchical information directly, so some transformation must be designed. If only a subset of the hierarchical information in each document needs to go in the database, then the transformation only needs to account for that subset.
  • You don't need to trust your database as much. You always have to trust your filesystem -- everything the computer does depends on it. In traditional large-scale information systems, you have to trust your database that much, too. Comma treats the filesystem as the canonical store, and the database as secondary. This doesn't mean that the database isn't important -- it is, it's critical. But it does mean that if anything happens to the database, you can rebuild it from the filesystem rather than from tape backup.

There has been an explosion of interest in the last couple of years in a newer, lighter-weight, less-proven generation of database systems. MySQL, for example, is heavily used by web developers, largely because for certain types of usage it is very, very fast, and secondarily because it's free. Oracle is much, much more feature-rich, robust, flexible and configurable -- but it's also a good deal slower for some common usage patterns and it's not at all free. Comma lets you choose which database should sit behind your system: the API abstracts the difference between the two for all sorts of common operations, and allows you to drop down to SQL when you need database-specific expressiveness.
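Because the filesystem is the canonical store, an index can in principle be regenerated by walking the stored files and extracting only the subset of fields that needs to be queried. The sketch below shows that idea in Python with the standard library. It is not Comma's own rebuild mechanism, and the file layout, the element names (story, title, date, body), the table name story_index and the function rebuild_index are all hypothetical.

```python
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# The subset of each document that goes into the database (hypothetical).
INDEXED_FIELDS = ("title", "date")


def rebuild_index(store_dir, db):
    """Recreate the index table from scratch by scanning the stored XML files."""
    db.execute("DROP TABLE IF EXISTS story_index")
    db.execute(
        "CREATE TABLE story_index (doc_id TEXT PRIMARY KEY, title TEXT, date TEXT)"
    )
    count = 0
    for path in sorted(Path(store_dir).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        # The file name is the document id; nested content is ignored.
        row = [path.stem] + [root.findtext(field) for field in INDEXED_FIELDS]
        db.execute("INSERT INTO story_index VALUES (?, ?, ?)", row)
        count += 1
    db.commit()
    return count


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Two stored documents; each has nested content the index never sees.
        for doc_id, title, date in [
            ("0001", "Harvest report", "2001-03-02"),
            ("0002", "Election results", "2001-03-01"),
        ]:
            text = (
                f"<story><title>{title}</title><date>{date}</date>"
                "<body><p>Full text ...</p></body></story>"
            )
            (Path(tmp) / f"{doc_id}.xml").write_text(text, encoding="utf-8")

        db = sqlite3.connect(":memory:")  # stands in for a lost or empty database
        rebuilt = rebuild_index(tmp, db)
        newest_first = [row[0] for row in db.execute(
            "SELECT doc_id FROM story_index ORDER BY date DESC")]
        db.close()
        return rebuilt, newest_first
```

The transformation from hierarchical document to flat row only has to account for the two indexed fields. The nested body element stays in the file and is never mapped to a table.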

The Web Application Challenge: Many Users, Constant Evolution

Big web sites require back-end systems that can scale to handle vast amounts of content and large numbers of users. Comma was originally built as an in-house tool by programmers at allAfrica.com, the largest Africa-oriented site on the web, which posts around 700 news stories a day, maintains an archive of some 300,000 stories, and serves an audience of roughly 400,000 users.

Solid web applications are written in layers. At the "bottom" layer, Comma provides tools that allow developers to design simple, robust structures for the various kinds of information that a back-end must track: content files, user records, product and inventory data, shopping cart and session information, etc. One layer up, common application functionality (sometimes called "business logic") is encapsulated in Perl modules that manipulate and maintain the Comma-produced information objects. Finally, user-interface code is parceled into HTML::Mason components (or mod_perl handlers).

Using the Comma API and this layered approach, even a complex application feature-set can often be prototyped in just a few hours. Extensions to current features (or new management and reporting tools) often require only a few lines of code.

The centralized architecture of web applications (and server-dependent internet apps in general) is both a great strength and an inherent weakness. On the one hand, information repositories become available to anyone with a web browser, and new functionality can be deployed instantly and seamlessly. On the other, the bottlenecks of a client-server design and the difficulties of developing and maintaining application code that is robust, stateless and aggressively multi-process are significant. Comma handles some of those difficulties itself, cooperates with the Apache web server and the HTML::Mason delivery engine to handle some others, and lets the programmer focus on building the application.

Using XML::Comma

XML::Comma is Free Software released under the GNU General Public License. Of course, you should read the license if you are interested in using Comma, but in general terms: anyone who wants to is free to use Comma as a development platform, to examine the source code, and to modify it to suit their needs. The GPL does "enforce" the freedom of the Comma source code -- if you modify the core Comma code and re-distribute your modifications, then your changes must be made available on the same terms as the original distribution.

The copyright on XML::Comma is held by AllAfrica Global Media. AllAfrica funded the original development of Comma, and continues to support major work on the platform.

The most-recently-released version of XML::Comma is always available at the Download page.