Before sending to Roberto.

Diego Nehab 2007-05-31 22:27:40 +00:00
parent 7b195164b0
commit 3074a8f56b


@ -23,19 +23,17 @@ received in consecutive function calls, returning partial
results after each invocation. Examples of operations that can be
implemented as filters include the end-of-line normalization
for text, Base64 and Quoted-Printable transfer content
encodings, the breaking of text into lines, SMTP dot-stuffing,
and there are many others. Filters become even
more powerful when we allow them to be chained together to
create composite filters. In this context, filters can be seen
as the middle links in a chain of data transformations. Sources and sinks
are the corresponding end points of these chains. A source
is a function that produces data, chunk by chunk, and a sink
is a function that takes data, chunk by chunk. In this
article, we describe the design of an elegant interface for filters,
sources, sinks, and chaining, and illustrate each step
with concrete examples.
\end{abstract}
@ -52,7 +50,7 @@ transfer coding, and the list goes on.
Many complex tasks require a combination of two or more such
transformations, and therefore a general mechanism for
promoting reuse is desirable. In the process of designing
\texttt{LuaSocket~2.0}, David Burgess and I were forced to deal with
this problem. The solution we reached proved to be very
general and convenient. It is based on the concepts of
filters, sources, sinks, and pumps, which we introduce
@ -62,18 +60,18 @@ below.
with chunks of input, successively returning processed
chunks of output. More importantly, the result of
concatenating all the output chunks must be the same as the
result of applying the filter to the concatenation of all
input chunks. In fancier language, filters \emph{commute}
with the concatenation operator. As a result, chunk
boundaries are irrelevant: filters correctly handle input
data no matter how it is split.
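As an illustration, the invariant can be checked mechanically. The
sketch below assumes the calling convention detailed in the next
section, in which a factory returns a fresh filter and a final
\texttt{nil} call flushes it:
\begin{quote}
\begin{lua}
-- returns true when the filters produced by 'factory' give the same
-- result for the split input and for its concatenation
local function commutes(factory, chunks)
    local f1, f2 = factory(), factory()
    local whole = f1(table.concat(chunks)) .. f1(nil)
    local pieces = {}
    for _, chunk in ipairs(chunks) do
        pieces[#pieces + 1] = f2(chunk)
    end
    pieces[#pieces + 1] = f2(nil)
    return whole == table.concat(pieces)
end
\end{lua}
\end{quote}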
A \emph{chain} transparently combines the effect of one or
more filters. The interface of a chain is
indistinguishable from the interface of its components.
This allows a chained filter to be used wherever an atomic
filter is expected. In particular, chains can be
themselves chained to create arbitrarily complex operations.

Filters can be seen as internal nodes in a network through
which data will flow, potentially being transformed many
@ -93,15 +91,13 @@ anything to happen. \emph{Pumps} provide the driving force
that pushes data through the network, from a source to a
sink.

In the following sections, we start with a simplified
interface, which we later refine. The evolution we present
is not contrived: it recreates the steps we followed
ourselves as we consolidated our understanding of these
concepts within our application domain.

\subsection{A simple example}

Let us use the end-of-line normalization of text as an
example to motivate our initial filter interface.
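To give a feel for where we are headed, the complete program might
look something like the sketch below. It uses the factories developed
in the rest of the article (\texttt{source.file}, \texttt{source.chain},
\texttt{sink.file}, \texttt{pump.all}, and the \texttt{normalize}
filter factory), so the details only become clear later:
\begin{quote}
\begin{lua}
-- read from stdin, normalize end-of-line markers to CRLF,
-- and write the result to stdout
local input = source.chain(source.file(io.stdin), normalize("\r\n"))
local output = sink.file(io.stdout)
pump.all(input, output)
\end{lua}
\end{quote}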
@ -141,23 +137,23 @@ it with a \texttt{nil} chunk. The filter responds by returning
the final chunk of processed data.

Although the interface is extremely simple, the
implementation is not so obvious. A normalization filter
respecting this interface needs to keep some kind of context
between calls. This is because a chunk boundary may lie between
the CR and LF characters marking the end of a line. This
need for contextual storage motivates the use of
factories: each time the factory is invoked, it returns a
filter with its own context so that we can have several
independent filters being used at the same time. For
efficiency reasons, we must avoid the obvious solution of
concatenating all the input into the context before
producing any output.
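The role of the factory is easy to see with a toy example. The
factory below is purely illustrative: each filter it returns keeps
its own byte count in a closure, so several of them can be used side
by side without interfering with one another:
\begin{quote}
\begin{lua}
local function counter()
    local total = 0            -- this filter's private context
    return function(chunk)
        if not chunk then      -- final nil chunk: flush the context
            return string.format("[%d bytes]", total)
        end
        total = total + #chunk
        return chunk
    end
end

local f1, f2 = counter(), counter()
print(f1("hello"), f2("hi"))   --> hello   hi
print(f1(" world"), f1(nil))   -->  world  [11 bytes]
\end{lua}
\end{quote}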

To that end, we break the implementation into two parts:
a low-level filter, and a factory of high-level filters. The
low-level filter is implemented in C and does not maintain
any context between function calls. The high-level filter
factory, implemented in Lua, creates and returns a
high-level filter that maintains whatever context the low-level
filter needs, but isolates the user from its internal
details. That way, we take advantage of C's efficiency to
@ -191,22 +187,21 @@ end
The \texttt{normalize} factory simply calls a more generic
factory, the \texttt{cycle} factory. This factory receives a
low-level filter, an initial context, and an extra
parameter, and returns a new high-level filter. Each time
the high-level filter is passed a new chunk, it invokes the
low-level filter with the previous context, the new chunk,
and the extra argument. It is the low-level filter that
does all the work, producing the chunk of processed data and
a new context. The high-level filter then updates its
internal context, and returns the processed chunk of data to
the user. Notice that we take advantage of Lua's lexical
scoping to store the context in a closure between function
calls.
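For reference, a minimal version of such a cycle factory can be
written as the sketch below. The name \texttt{eol} for the low-level
end-of-line filter is assumed here, and the code in the distribution
may differ in detail:
\begin{quote}
\begin{lua}
function filter.cycle(low, ctx, extra)
    return function(chunk)
        local ret
        -- the low-level filter does the work; the context survives
        -- between calls inside this closure
        ret, ctx = low(ctx, chunk, extra)
        return ret
    end
end

function filter.normalize(marker)
    -- eol stands for the low-level C filter described next
    return filter.cycle(eol, 0, marker)
end
\end{lua}
\end{quote}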

Concerning the low-level filter code, we must first accept
that there is no perfect solution to the end-of-line marker
normalization problem. The difficulty comes from an
inherent ambiguity in the definition of empty lines within
mixed input. However, the following solution works well for
any consistent input, as well as for non-empty lines in
mixed input. It also does a reasonable job with empty lines
@ -218,17 +213,18 @@ The idea is to consider both CR and~LF as end-of-line
is seen alone, or followed by a different candidate. In
other words, CR~CR~and LF~LF each issue two end-of-line
markers, whereas CR~LF~and LF~CR issue only one marker each.
This method correctly handles the Unix, DOS/MIME, VMS, and Mac
OS conventions.
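The rule is compact enough to be sketched in pure Lua before we turn
to the C code. The function below is only an illustration of the idea
(the real filter is written in C, as shown next); it processes one
character at a time and returns the candidate to remember, if any:
\begin{quote}
\begin{lua}
-- 'last' is the previous end-of-line candidate, or "" if none;
-- processed characters are appended to the 'out' table
local function process(c, last, marker, out)
    if c == "\r" or c == "\n" then
        if last == "\r" or last == "\n" then
            -- second candidate in a row: CR CR and LF LF issue a
            -- second marker, CR LF and LF CR complete a single one
            if last == c then out[#out + 1] = marker end
            return ""
        else
            out[#out + 1] = marker
            return c
        end
    else
        out[#out + 1] = c
        return ""
    end
end
\end{lua}
\end{quote}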
\subsection{The C part of the filter}
Our low-level filter is divided into two simple functions.
The inner function performs the normalization itself. It takes
each input character in turn, deciding what to output and
how to modify the context. The context tells if the last
processed character was an end-of-line candidate, and if so,
which candidate it was. For efficiency, it uses
Lua's auxiliary library's buffer interface:
\begin{quote}
\begin{C}
@stick#
@ -252,12 +248,10 @@ static int process(int c, int last, const char *marker,
\end{C}
\end{quote}
The outer function simply interfaces with Lua. It receives the
context and input chunk (as well as an optional
custom end-of-line marker), and returns the transformed
output chunk and the new context:
\begin{quote}
\begin{C}
@stick#
@ -291,33 +285,29 @@ initial state. This allows the filter to be reused many
times.

When designing your own filters, the challenging part is to
decide what will be in the context. For line breaking, for
instance, it could be the number of bytes left in the
current line. For Base64 encoding, it could be a string
with the bytes that remain after the division of the input
into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
distribution has many other examples.
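As a sketch of the first case, a simple line-breaking filter can be
built on top of the same cycle factory, with the context holding
nothing but the number of bytes left in the current line. Names and
details here are illustrative, not the distribution's code:
\begin{quote}
\begin{lua}
-- low-level part, in pure Lua for illustration; 'left' is the context
local function wrapline(left, chunk, length)
    if not chunk then return "", length end
    local out = {}
    for i = 1, #chunk do
        local c = chunk:sub(i, i)
        if c == "\n" then
            out[#out + 1] = c
            left = length
        else
            if left == 0 then
                out[#out + 1] = "\r\n"
                left = length
            end
            out[#out + 1] = c
            left = left - 1
        end
    end
    return table.concat(out), left
end

-- high-level factory, assuming filter.cycle from the previous section
function filter.wrap(length)
    length = length or 76
    return filter.cycle(wrapline, length, length)
end
\end{lua}
\end{quote}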
\section{Filter chains}
Chains add a lot to the power of filters. For example,
according to the standard for Quoted-Printable encoding,
text must be normalized to a canonic end-of-line marker
prior to encoding. To help specify complex
transformations like this, we define a chain factory that
creates a composite filter from one or more filters. A
chained filter passes data through all its components, and
can be used wherever a primitive filter is accepted.

The chaining factory is very simple. The auxiliary
function~\texttt{chainpair} chains two filters together,
taking special care if the chunk is the last. This is
because the final \texttt{nil} chunk notification has to be
pushed through both filters in turn:
\begin{quote}
\begin{lua}
@stick#
@ -333,7 +323,7 @@ end
@stick#
function filter.chain(...)
    local f = arg[1]
    for i = 2, @#arg do
        f = chainpair(f, arg[i])
    end
    return f
@ -343,7 +333,7 @@ end
\end{quote}
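For completeness, a minimal \texttt{chainpair} along the lines
described above could be written as follows. This is a sketch of the
idea; the version in the distribution handles more details:
\begin{quote}
\begin{lua}
local function chainpair(f1, f2)
    return function(chunk)
        local ret = f2(f1(chunk))
        if chunk then return ret
        -- on the last chunk, the nil notification must be pushed
        -- through both filters in turn
        else return ret .. f2(nil) end
    end
end
\end{lua}
\end{quote}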
Thanks to the chain factory, we can
define the Quoted-Printable conversion as follows:
\begin{quote}
\begin{lua}
@stick#
@ -361,7 +351,7 @@ pump.all(in, out)
The filters we introduced so far act as the internal nodes
in a network of transformations. Information flows from node
to node (or rather from one filter to the next) and is
transformed along the way. Chaining filters together is our
way to connect nodes in this network. As the starting point
for the network, we need a source node that produces the
data. At the end of the network, we need a sink node that
@ -376,8 +366,8 @@ caller by returning \texttt{nil} followed by an error message.
Below are two simple source factories. The \texttt{empty} source
returns no data, possibly returning an associated error
message. The \texttt{file} source works harder, and
yields the contents of a file in a chunk by chunk fashion:
\begin{quote}
\begin{lua}
@stick#
@ -404,9 +394,13 @@ end
\subsection{Filtered sources}
A filtered source passes its data through the
associated filter before returning it to the caller.
Filtered sources are useful when working with
functions that get their input data from a source (such as
the pump in our first example). By chaining a source with one or
more filters, the function can be transparently provided
with filtered data, with no need to change its interface.

Here is a factory that does the job:
\begin{quote}
\begin{lua}
@ -425,23 +419,16 @@ end
\end{lua}
\end{quote}
\subsection{Sinks}
Just as we defined an interface for a data source,
we can also define an interface for a data destination.
We call any function respecting this
interface a \emph{sink}. In our first example, we used a
file sink connected to the standard output.
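A file sink of that kind can be written as a small closure over a
file handle. The sketch below ignores error handling on open, which
the version in the distribution takes care of:
\begin{quote}
\begin{lua}
function sink.file(handle)
    return function(chunk, err)
        if not chunk then
            handle:close()     -- end of data: release the handle
            return 1
        else
            return handle:write(chunk)
        end
    end
end
\end{lua}
\end{quote}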

Sinks receive consecutive chunks of data, until the end of
data is signaled by a \texttt{nil} chunk. A sink can be
notified of an error with an optional extra argument that
contains the error message, following a \texttt{nil} chunk.
If a sink detects an error itself, and
@ -529,8 +516,8 @@ common that it deserves its own function:
function pump.step(src, snk)
    local chunk, src_err = src()
    local ret, snk_err = snk(chunk, src_err)
    if chunk and ret then return 1
    else return nil, src_err or snk_err end
end
%
@ -539,7 +526,10 @@ function pump.all(src, snk, step)
    step = step or pump.step
    while true do
        local ret, err = step(src, snk)
        if not ret then
            if err then return nil, err
            else return 1 end
        end
    end
end
%
@ -571,21 +561,23 @@ The way we split the filters here is not intuitive, on
purpose. Alternatively, we could have chained the Base64
encode filter and the line-wrap filter together, and then
chain the resulting filter with either the file source or
the file sink. It doesn't really matter. The Base64 and the
line wrapping filters are part of the \texttt{LuaSocket}
distribution.
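In the alternative arrangement, the two filters are chained with each
other first, and the composite is then chained with the source. With
illustrative names standing in for the distribution's Base64 and
line-wrapping factories, that version reads:
\begin{quote}
\begin{lua}
-- encode and wrap stand for the distribution's Base64 and
-- line-wrapping filter factories
local input = source.chain(
    source.file(io.open("input.bin", "rb")),
    filter.chain(encode("base64"), wrap(76)))
local output = sink.file(io.open("input.b64", "w"))
pump.all(input, output)
\end{lua}
\end{quote}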
\section{Exploding filters}
Our current filter interface has one flagrant shortcoming.
When David Burgess was writing his \texttt{gzip} filter, he
noticed that a decompression filter can explode a small
input chunk into a huge amount of data. To address this
problem, we decided to change the filter interface and allow
exploding filters to return large quantities of output data
in a chunk by chunk manner.

More specifically, after passing each chunk of input to
a filter, and collecting the first chunk of output, the
user must now loop to receive other chunks from the filter until no
filtered data is left. Within these secondary calls, the
caller passes an empty string to the filter. The filter
responds with an empty string when it is ready for the next
@ -593,7 +585,7 @@ input chunk. In the end, after the user passes a
\texttt{nil} chunk notifying the filter that there is no
more input data, the filter might still have to produce too
much output data to return in a single chunk. The user has
to loop again, now passing \texttt{nil} to the filter each time,
until the filter itself returns \texttt{nil} to notify the
user it is finally done.
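Written out by hand, the caller's side of this protocol looks like
the sketch below. In practice the chaining and pumping functions hide
this loop, but it makes the convention concrete:
\begin{quote}
\begin{lua}
-- collect everything an exploding filter produces for one call,
-- where 'chunk' is either an input chunk or the final nil
local function exhaust(f, chunk)
    local parts = {}
    local out = f(chunk)
    if chunk then
        -- secondary calls pass the empty string; an empty string
        -- reply means the filter is ready for the next input chunk
        while out ~= "" do
            parts[#parts + 1] = out
            out = f("")
        end
    else
        -- after the nil notification, keep passing nil until the
        -- filter itself returns nil to signal it is done
        while out ~= nil do
            parts[#parts + 1] = out
            out = f(nil)
        end
    end
    return table.concat(parts)
end
\end{lua}
\end{quote}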
@ -602,9 +594,9 @@ the new interface. In fact, the end-of-line translation
filter we presented earlier already conforms to it. The
complexity is encapsulated within the chaining functions,
which must now include a loop. Since these functions only
have to be written once, the user is rarely affected.
Interestingly, the modifications do not have a measurable
negative impact on the performance of filters that do
not need the added flexibility. On the other hand, for a
small price in complexity, the changes make exploding
filters practical.
@ -617,7 +609,7 @@ and SMTP modules are especially integrated with LTN12,
and can be used to showcase the expressive power of filters,
sources, sinks, and pumps. Below is an example
of how a user would proceed to define and send a
multipart message, with attachments, using \texttt{LuaSocket}:
\begin{quote}
\begin{mime}
local smtp = require"socket.smtp"
@ -656,8 +648,8 @@ assert(smtp.send{
The \texttt{smtp.message} function receives a table
describing the message, and returns a source. The
\texttt{smtp.send} function takes this source, chains it with the
SMTP dot-stuffing filter, connects a socket sink
with the server, and simply pumps the data. The message is never
assembled in memory. Everything is produced on demand,
transformed in small pieces, and sent to the server in chunks,
including the file attachment that is loaded from disk and
@ -665,14 +657,14 @@ encoded on the fly. It just works.
\section{Conclusions}
In this article, we introduced the concepts of filters,
sources, sinks, and pumps to the Lua language. These are
useful tools for stream processing in general. Sources provide
a simple abstraction for data acquisition. Sinks provide an
abstraction for final data destinations. Filters define an
interface for data transformations. The chaining of
filters, sources and sinks provides an elegant way to create
arbitrarily complex data transformations from simpler
components. Pumps simply move the data through.

\end{document}