mirror of
https://github.com/uw-imap/imap.git
synced 2024-11-16 10:28:23 +01:00
218 lines
8.9 KiB
Plaintext
218 lines
8.9 KiB
Plaintext
|
/* ========================================================================
|
||
|
* Copyright 1988-2006 University of Washington
|
||
|
*
|
||
|
* Licensed under the Apache License, Version 2.0 (the "License");
|
||
|
* you may not use this file except in compliance with the License.
|
||
|
* You may obtain a copy of the License at
|
||
|
*
|
||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||
|
*
|
||
|
*
|
||
|
* ========================================================================
|
||
|
*/
|
||
|
|
||
|
Mailbox Format Characteristics
|
||
|
Mark Crispin
|
||
|
11 December 2006
|
||
|
|
||
|
|
||
|
When a mailbox storage technology uses local files and
|
||
|
directories directly, the file(s) and directories are layed out in a
|
||
|
mailbox format.
|
||
|
|
||
|
I. Flat-File Formats
|
||
|
|
||
|
In these formats, a mailbox and all the messages inside are a
|
||
|
single file on the filesystem. The mailbox name is the name of the
|
||
|
file in the filesystem, relative to the user's "mail home directory."
|
||
|
|
||
|
A flat-file format mailbox is always a file, never a directory.
|
||
|
This means that it is impossible to have a flat-file format mailbox
|
||
|
that has inferior mailbox names under it (so-called "dual-usage"
|
||
|
mailboxes). For some inexplicable reason, some people want this.
|
||
|
|
||
|
The mail home directory is usually the same as the user login
|
||
|
home directory if that concept is meaningful; otherwise, it is some
|
||
|
other default directory (e.g. "C:\My Documents" on Windows 98). This
|
||
|
can be redefined by modifying the c-client source code or in an
|
||
|
application via the SET_HOMEDIR mail_parameters() call.
|
||
|
|
||
|
For example, a mailbox named "project" is likely to be found in
|
||
|
the file "project" in the user's home directory. Similarly, a mailbox
|
||
|
named "test/trial1" (assuming a UNIX system) is likely to be found in
|
||
|
the file "trial1" in the subdirectory "test" in the user's home
|
||
|
directory.
|
||
|
|
||
|
Note that the name "INBOX" has special semantics and rules, as
|
||
|
described in the file naming.txt.
|
||
|
|
||
|
The following flat-file formats are supported by c-client as of
|
||
|
the time of this writing:
|
||
|
|
||
|
. unix This is the traditional UNIX mailbox format, in use for nearly
|
||
|
30 years. It uses a line starting with "From " to indicate
|
||
|
start of message, and stores the message status inside the
|
||
|
RFC822 message header.
|
||
|
|
||
|
unix is not particularly efficient; the entire mailbox file
|
||
|
must be read when the mailbox is open, and when reading message
|
||
|
texts it is necessary to convert the newline convention to
|
||
|
Internet standard CR LF form. unix preserves UIDs, and allows
|
||
|
the creation of keywords.
|
||
|
|
||
|
Only one process may have a unix-format mailbox open
|
||
|
read/write at a time.
|
||
|
|
||
|
. mmdf This is the format used by the MMDF mailer. It uses a line
|
||
|
consisting of 4 <CTRL/A> (0x01) characters to indicate start
|
||
|
and end of message. Optionally, there may also be a unix
|
||
|
format "From " line. It otherwise has the same
|
||
|
characteristics as unix format.
|
||
|
|
||
|
. mbx This is the current preferred mailbox format. It can be
|
||
|
handled quite efficiently by c-client, without the problems
|
||
|
that exist with unix and mmdf formats. Messages are stored
|
||
|
in Internet standard CR LF format.
|
||
|
|
||
|
mbx permits shared access, including shared expunge. It
|
||
|
preserves UIDs, and allows the creation of keywords.
|
||
|
|
||
|
. mtx This is supported for compatibility with the past. This is
|
||
|
the old Tenex/TOPS-20 mail.txt format. It can be handled
|
||
|
quite efficiently by c-client, and has most of the
|
||
|
characteristics of mbx format.
|
||
|
|
||
|
mtx is deficient in that it does not support shared expunge;
|
||
|
it has no means to store UIDs; and it has no way to define
|
||
|
keywords except through an external configuration file.
|
||
|
|
||
|
. tenex This is supported for compatibility with the past. This is
|
||
|
the old Columbia MM format. This is similar to mtx format,
|
||
|
only it uses UNIX-style bare-LF newlines instead of CR LF
|
||
|
newlines, thus incurring a performance penalty for newline
|
||
|
conversion.
|
||
|
|
||
|
. phile This is not strictly a format. Any file which is not in a
|
||
|
recognized format is in phile format, which treats the entire
|
||
|
contents of the file as a single message.
|
||
|
|
||
|
|
||
|
II. File/Message Formats
|
||
|
|
||
|
In these formats, a mailbox is a directory, and each the messages
|
||
|
inside are separate files inside the directory. The file names of
|
||
|
these files are generally the text form of a number, which also
|
||
|
matches the UID of the message.
|
||
|
|
||
|
In the case of mx, the mailbox name is the name of the directory
|
||
|
in the filesystem, relative to the user's "mail home directory." In
|
||
|
the case of news and mh, the mailbox name is in a separate namespace
|
||
|
as described in the file naming.txt.
|
||
|
|
||
|
A file/message format mailbox is always a directory. This means
|
||
|
that it is possible to have a file/message format mailbox that has
|
||
|
inferior mailbox names under it (so-called "dual-usage" mailboxes).
|
||
|
For some inexplicable reason, some people want this.
|
||
|
|
||
|
Note that the name "INBOX" has special semantics and rules, as
|
||
|
described in the file naming.txt.
|
||
|
|
||
|
The following file/message formats are supported by c-client as of
|
||
|
the time of this writing:
|
||
|
|
||
|
. mx This is an experimental format, and may be removed in a future
|
||
|
release. An mx format mailbox has a .mxindex file which holds
|
||
|
the message status and unique identifiers. Messages are
|
||
|
stored in Internet standard CF LF form, so the file size of
|
||
|
the message file equals the size of the message.
|
||
|
|
||
|
mx is somewhat inefficient; the entire directory must be read
|
||
|
and each file stat()'d. We found it intolerable for a
|
||
|
moderate sized mailbox (2000 messages) and have more or less
|
||
|
abandoned it.
|
||
|
|
||
|
. mh This is supported for compatibility with the past. This is
|
||
|
the format used by the old mh program.
|
||
|
|
||
|
mh is very inefficient; the entire directory must be read
|
||
|
and each file stat()'d, and in order to determine the size
|
||
|
of a message, the entire file must be read and newline
|
||
|
conversion performed.
|
||
|
|
||
|
mh is deficient in that it does not support any permanent
|
||
|
flags or keywords; and has no means to store UIDs (because
|
||
|
the mh "compress" command renames all the files, that's
|
||
|
why).
|
||
|
|
||
|
. news This is an export of the local filesystem's news spool, e.g.
|
||
|
/var/spool/news. Access to mailboxes in news format is read
|
||
|
only; however, message "deleted" status is preserved in a
|
||
|
.newsrc file in the user's home directory. There is no other
|
||
|
status or keywords.
|
||
|
|
||
|
news is very inefficient; the entire directory must be
|
||
|
read and each file stat()'d, and in order to determine the
|
||
|
size of a message, the entire file must be read and newline
|
||
|
conversion performed.
|
||
|
|
||
|
news is deficient in that it does not support permanent flags
|
||
|
other than deleted; does not support keywords; and has no
|
||
|
expunge.
|
||
|
|
||
|
|
||
|
Soapbox on File/Message Formats
|
||
|
|
||
|
If it sounds from the above descriptions that we're not putting
|
||
|
too much effort into file/message formats, you are correct.
|
||
|
|
||
|
There's a general reason why file/message formats are a bad idea.
|
||
|
Just about every filesystem in existance serializes file creation and
|
||
|
deletions because these manipulate the free space map. This turns out
|
||
|
to be an enormous problem when you start creating/deleting more than a
|
||
|
few messages per second; you spend all your time thrashing in the
|
||
|
filesystem.
|
||
|
|
||
|
It is also extremely slow to do a text search through a
|
||
|
file/message format mailbox. All of those open()s and close()s really
|
||
|
add up to major filesystem thrashing.
|
||
|
|
||
|
|
||
|
What about Cyrus and Maildir?
|
||
|
|
||
|
Both formats are vulnerable to the filesystem thrashing outlined
|
||
|
above.
|
||
|
|
||
|
The Cyrus format used by CMU's Cyrus server (and Esys' server)
|
||
|
has a special associated flat file in each directory that contains
|
||
|
extensive data (including pre-parsed ENVELOPEs and BODYSTRUCTUREs)
|
||
|
about the messages. Put another way, it's a (considerably) more
|
||
|
featureful form of mx. It also uses certain operating system
|
||
|
facilities (e.g. file/memory mapping) which are not available on older
|
||
|
systems, at a cost of much more limited portability than c-client.
|
||
|
These considerably ameliorate the fundamental problems with
|
||
|
file/message formats; in fact, Cyrus is halfway to being a database.
|
||
|
Rather than support Cyrus format in c-client, you should run Cyrus or
|
||
|
Esys if you want that format.
|
||
|
|
||
|
The Maildir format used by qmail has all of the performance
|
||
|
disadvantages of mh noted above, with the additional problem that the
|
||
|
files are renamed in order to change their status so you end up having
|
||
|
to rescan the directory frequently to locate the current names
|
||
|
(particularly in a shared mailbox scenario). It doesn't scale, and it
|
||
|
represents a support nightmare; it is therefore not supported in the
|
||
|
official distribution. Maildir support code for c-client is available
|
||
|
from third parties; but, if you use it, it is entirely at your own
|
||
|
risk (read: don't complain about how poorly it performs or bugs).
|
||
|
|
||
|
|
||
|
So what does this all mean?
|
||
|
|
||
|
A database (such as used by Exchange) is really a much better
|
||
|
approach if you want to move away from flat files. mx and especially
|
||
|
Cyrus take a tenative step in that direction; mx failed mostly because
|
||
|
it didn't go anywhere near far enough. Cyrus goes much further, and
|
||
|
scores remarkable benefits from doing so.
|
||
|
|
||
|
However, a well-designed pure database without the overhead of
|
||
|
separate files would do even better.
|