Pipelines


Pipelines (or pipes and filters) are a commonly used architectural pattern for processing large amounts of data, meaning that you have many more data items to process than operations to apply. In a regular procedural approach, it is very easy to end up with a monolithic, inflexible application built from non-reusable components. Implementing a change by adding or removing operations may be difficult, and boilerplate code tends to get mixed up with the business logic. If you have ever dealt with batch processing (inevitable in every business system), you know that concerns like performance and transaction management can become crucial, and that it is very hard to implement a processing component that separates these concerns from the logic.

The idea is to break the processing business logic down into small components and connect them with pipes to form a pipeline. Instead of water or oil, this pipeline will transport our data:

[Figure: pipeline diagram]

Basically, the operations (sometimes called filters) are chunks of our logic, while the pipes connect the operations with each other and allow the data to flow. This pattern can be seen in different forms, e.g. in command-line pipes:

C:\>dir | more

A very interesting example of using pipelines for processing is Yahoo Pipes.

Linq queries can also be viewed as data pipelines:

    string[] numbers = { "zero", "one", "two", "three",
                         "four", "five", "six", "seven",
                         "eight", "nine" };

    var shortNums =
        from n in numbers
        where n.Length < 5
        select n;

    foreach (var num in shortNums)
    {
        Console.WriteLine(num);
    }

In this example the shortNums query is a pipeline with one Where operation. Linq pipelines are built by wrapping iterators: here the array iterator is wrapped by the Where iterator, which is in turn wrapped by the Select iterator.
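To illustrate the wrapping, here is a simplified sketch of a Where-like operator (MyWhere is a made-up name; the real Enumerable.Where is more elaborate, but the wrapping principle is the same):

    // A simplified, hand-rolled version of the Where operator.
    static IEnumerable<T> MyWhere<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)     // pull items from the wrapped iterator
        {
            if (predicate(item))
                yield return item;       // pass matching items downstream
        }
    }

    // Wrapping the array iterator, just like the query above:
    var shortNums = MyWhere(numbers, n => n.Length < 5);

Thanks to yield return, MyWhere does no work until somebody iterates over its result; it simply wraps the source iterator.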

Operations are generally stateless components that process, transform, and filter data. They are easily reusable, easy to modify, and free of boilerplate code. The only way operations should communicate with each other is through the data stream. Pipes connect operations, but they can also perform additional tasks, like buffering data or parallelizing the execution.
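As a minimal sketch of this separation (the Operation delegate and the Pipeline class below are illustrative names, not an established API), an operation can be modeled as a transformation of the data stream, while the pipe is the code that feeds one operation's output into the next one's input:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // An operation transforms the incoming data stream.
    delegate IEnumerable<T> Operation<T>(IEnumerable<T> input);

    static class Pipeline
    {
        // The "pipe": connects operations by feeding each one's
        // output into the next one's input.
        public static IEnumerable<T> Run<T>(IEnumerable<T> data,
                                            params Operation<T>[] operations)
        {
            foreach (var op in operations)
                data = op(data);
            return data;
        }
    }

    // Reusable, stateless operations composed into a pipeline:
    var result = Pipeline.Run<string>(numbers,
        ns => ns.Where(n => n.Length < 5),
        ns => ns.Select(n => n.ToUpper()));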

Data flow

Now that we know how a basic pipeline looks, let’s see how data can flow through the pipe. Each pipe will have at least one input and one output:

[Figure: pipe with input and output]

Input and output here are operations connected to the pipe. Operations can be passive, meaning that each data item just goes through them when provided by the pipes (think of chained method calls – each method is such a simple passive operation), or they can be active, i.e. they initiate the data flow themselves.
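For example, a chain of string methods forms a tiny pipeline of passive operations, where each method only transforms the data it is handed:

    string[] words = "  Zero One Two  "
        .Trim()                  // passive: runs only when called
        .ToLower()
        .Split(' ');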

A push operation occurs when data becomes available at the active input, which then initiates the flow:

[Figure: push through the pipeline]

The picture describes the push operation on the input side.
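Here is a minimal push-style sketch (the stage names are illustrative): the pipeline is built back to front, and each stage actively calls the next one as soon as it has data:

    // Each stage pushes its result into the next stage.
    Action<string> sink   = Console.WriteLine;
    Action<string> select = s => sink(s.ToUpper());
    Action<string> where  = s => { if (s.Length < 5) select(s); };

    // The active input pushes each item as soon as it becomes available:
    foreach (var n in new[] { "zero", "one", "two", "three" })
        where(n);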

A pull operation occurs when an active output component requires new data and requests a new data item from the pipe:

[Figure: pull through the pipeline]

Linq queries are usually pull pipelines, due to their lazy execution. In the code presented earlier, no data flows through the pipe until the foreach loop requests the next element from the outermost iterator (the one created by the select clause). This makes Linq operations active on their input side: each iterator pulls from the iterator it wraps.
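You can observe the pull behavior by adding a side effect to the predicate; nothing is printed until the loop starts pulling:

    var traced = numbers.Where(n =>
    {
        Console.WriteLine("checking " + n);   // runs only when an item is pulled
        return n.Length < 5;
    });

    Console.WriteLine("query built, nothing checked yet");

    foreach (var num in traced)               // each iteration pulls one item
        Console.WriteLine(num);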

Push and pull operations can be mixed in a single pipe: it can buffer data pushed by an active input and then serve it to an active output on request.
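Here is a sketch of such a buffering pipe, assuming .NET 4's BlockingCollection and the Task API:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    var pipe = new BlockingCollection<string>(boundedCapacity: 100);

    // Active input: pushes items into the buffer as they become available.
    var producer = Task.Run(() =>
    {
        foreach (var n in new[] { "zero", "one", "two" })
            pipe.Add(n);
        pipe.CompleteAdding();               // signal the end of the stream
    });

    // Active output: pulls items from the buffer when it is ready for them.
    foreach (var item in pipe.GetConsumingEnumerable())
        Console.WriteLine(item);

    producer.Wait();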

The pipelines presented so far have shown a simple sequential data flow:

[Figure: sequential flow]

Sometimes the logic in the flow does not require sequential execution, and the flow can be split and joined:

[Figure: flow split into parallel branches]

This approach allows parallel execution of some steps.
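As a sketch (LookupX, LookupY, and Combine are hypothetical, independent steps of the flow), two branches can run concurrently and be joined afterwards using the Task API:

    // Split: run two independent steps of the flow concurrently.
    var xTask = Task.Run(() => LookupX(item));
    var yTask = Task.Run(() => LookupY(item));

    // Join: wait for both branches, then combine their results.
    Task.WaitAll(xTask, yTask);
    var joined = Combine(xTask.Result, yTask.Result);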

Execution

The main advantage of pipelines is the possibility of executing the same logic in different ways, e.g. in parallel. Instead of sequentially executing the steps for each element:

[Figure: sequential processing of incoming items]

we can parallelize the execution and spawn a new instance of the pipeline for each incoming item:

[Figure: parallel processing of incoming items]

If you know something about CPUs, you will recognize a similar technique in processors: instruction pipelining, where several instructions are in flight at different stages at the same time.
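With Linq, this kind of parallel execution can be sketched using PLINQ (available since .NET 4): AsParallel() lets the runtime process many items concurrently while the query itself stays unchanged:

    var shortNums = numbers
        .AsParallel()
        .AsOrdered()                 // keep the original order if it matters
        .Where(n => n.Length < 5);

    foreach (var num in shortNums)
        Console.WriteLine(num);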

Summary

Pipelines can be very handy, and despite all the advanced theory we can use them in a simple way (with Linq, for example). The main benefits of this approach are the reusability and composability of operations and the possibility of managing the execution from outside. There are also some disadvantages; for example, error handling becomes harder when the logic is spread across many small stages. In the next posts I will try to show some possible implementations, especially for batch processing.

Written by bigballofmud

2009/04/18 at 11:56 pm

Posted in C#, Patterns, Pipelines
