
In-Memory Analytics & Pipelining Vector-Based Statistical Functions

Watch a two-minute video on pipelining statistical functions for in-memory analytics, or download a 10-page white paper on the topic.

In-memory analytics seeks to accelerate the task of finding meaningful patterns in market data. In-memory database systems (IMDSs) are an indispensable tool for this job.

But IMDSs are not new – in fact, they’re almost the “new normal” for analytics in capital markets. What data management solution moves analytics beyond the status quo, and gives trading organizations a time-based advantage that translates into higher profitability?

This need is met by two key features in McObject’s eXtremeDB Financial Edition database system: columnar data layout to overcome traditional DBMSs’ (and IMDSs’) shortcomings in dealing with time series data (including market data), and the programming technique of pipelining vector-based statistical functions to process sequences (columns).

These features build on eXtremeDB’s already-fast IMDS architecture and move the focus for optimizing in-memory analytics into the CPU cache, where columnar layout and pipelining ensure priority handling of the most time-critical data.

Columnar vs. Row-Based Data Handling

Database management systems (DBMSs) conventionally store data in tables consisting of rows and columns, with each row containing one instance of each column value. A DBMS accumulates rows into database pages for processing.

However, market data typically takes the form of a time series – that is, some value measured repeatedly over time. A time series will occupy a column in a database table. The problem with the row-based approach is that for time series analysis (and hence market data analysis), only a subset of the columns in a given table is likely to be of interest for a given operation.

For example, many financial calculations would use only the closing price (the Close column) in the table shown in Figure 1, below. Row-oriented processing results in entire pages being fetched into CPU cache, including irrelevant Open, Volume and Date elements. Significant bandwidth is wasted because only a quarter of the data fetched is actually used. When an application needs to process a single column of data (whose elements would be in multiple rows), a different layout is more efficient.

[Image: row-based data layout]

Figure 1. Both relational and object databases typically store data in tables consisting of rows and columns, and transfer data row-by-row between storage, main memory and the CPU cache.

Columnar layout accelerates time series analysis – and hence in-memory analytics – by storing and subsequently fetching data in a column-by-column fashion on a database page, as shown in Figure 2.

[Image: column-based data layout]

Figure 2. With a columnar approach, tables’ columns are the basic unit for constructing the database pages used for DBMS input/output. When time series are needed for a calculation, this approach avoids flooding the CPU cache with irrelevant data.

The eXtremeDB Financial Edition database system delivers columnar data handling through declaration of a column of type ‘sequence’. It builds on this with pipelining technology that enables multiple vector-based statistical functions to work on time series data within CPU cache, without outputting interim results back to main memory.
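As a rough illustration, a table like the one in the figures above might declare its market-data columns as sequences. The sketch below is not taken from the product documentation: the ‘sequence(...)’ declaration syntax and the type names are assumptions, and the actual eXtremeDB Financial Edition DDL may differ.

-- Hypothetical schema sketch: each time series column is declared as a
-- 'sequence', so its values are stored column-wise on database pages.
-- Syntax and type names are illustrative, not verbatim eXtremeDB DDL.
CREATE TABLE Security
(
    symbol CHAR(10) PRIMARY KEY,
    Date   sequence(date asc),  -- trading day, ordered ascending
    Open   sequence(float),     -- opening price time series
    Close  sequence(float),     -- closing price time series
    Volume sequence(float)      -- traded volume time series
);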

Pipelining for High Performance

Array- or vector-based programming techniques simplify working with market data by enabling an operation (such as a statistical function) to apply at once to an entire set of values, such as a time series. Pipelining is the coding technique in which eXtremeDB Financial Edition’s vector-based statistical functions are arranged end-to-end in order to process one or more time series, such that the output of one function in the pipeline becomes input for the next function.
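As a minimal sketch of this idea – using only functions that appear elsewhere on this page, against the hypothetical Security table above – the output of ‘seq_sub’ below flows directly into ‘seq_window_agg_avg’:

-- Two-stage pipeline sketch: seq_sub computes the daily Close - Open
-- spread, and seq_window_agg_avg smooths that spread with a 5-day
-- moving average; the interim spread sequence is never materialized.
SELECT seq_window_agg_avg(seq_sub(Close, Open), 5)
FROM Security
WHERE symbol = 'IBM';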

While feeding the results of one function to another may sound like standard programming practice, pipelining in eXtremeDB Financial Edition uses an approach called tile-based processing of vector elements to greatly accelerate processing. The DBMS brings data into the CPU cache in very small units called tiles (instead of the much larger database pages), then all functions in a pipeline act on this data before the end result is materialized as output in main memory. In contrast, other database systems (even columnar ones) would transfer the output of each function into temporary tables in main memory, before acting on this transformed data with the next function.

Pipelining Example – Moving Average Crossover Points

Here’s how pipelining works. Let’s say the application needs to calculate 5-day and 21-day moving averages for a stock, and detect the points where the faster moving average (5-day) crosses over or under the slower one (21-day).

This is accomplished below, using eXtremeDB Financial Edition’s vector-based statistical functions as arguments to a SELECT statement:

SELECT seq_map(Close,
       seq_cross(seq_sub(seq_window_agg_avg(Close, 5),
                         seq_window_agg_avg(Close, 21)), 1))
FROM Security
WHERE symbol = 'IBM';

Here is what’s happening in the code, step by step:

  1. Two invocations of ‘seq_window_agg_avg’ execute over the closing price sequence to obtain the 5-day and 21-day moving averages
  2. The function ‘seq_sub’ subtracts the 21-day moving average from the 5-day moving average
  3. The result feeds a fourth function, ‘seq_cross’, which identifies the points where the 5- and 21-day moving averages cross
  4. Finally, the function ‘seq_map’ maps those crossover points back to the original ‘Close’ sequence, returning the closing prices at which the moving averages crossed (see the variation sketched below)
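A small variation on the same pipeline – a sketch, not taken from the product documentation – shows the flexibility of ‘seq_map’: projecting the crossover points onto the hypothetical ‘Date’ sequence from the schema above, instead of ‘Close’, returns the dates on which the crossovers occurred:

-- Sketch: same pipeline, but seq_map projects the crossover points onto
-- the Date sequence (an assumed column), returning crossover dates
-- rather than closing prices.
SELECT seq_map(Date,
       seq_cross(seq_sub(seq_window_agg_avg(Close, 5),
                         seq_window_agg_avg(Close, 21)), 1))
FROM Security
WHERE symbol = 'IBM';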

In the example above, columnar handling results in faster performance because only the column(s) of interest (closing prices, in our example) are brought into the CPU cache at the start of the operation. In contrast, conventional row-wise handling would bring database pages consisting of entire rows with all their columns into CPU cache.

Even more significantly, this approach eliminates the need to create, populate and query temporary tables outside the CPU cache (i.e. in main memory), as would be required by other database systems and vector-based programming languages to manage interim results of processing (for example, no table is needed to contain 5-day moving averages, 21-day moving averages, etc.).
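For contrast, the sketch below shows the kind of interim materialization a conventional system might need for the same calculation. It is generic, hypothetical SQL (standard window-function syntax over an assumed row-based ‘Quotes’ table), shown only to make the extra round trips through main memory visible:

-- Hypothetical generic SQL, not eXtremeDB syntax: the first stage writes
-- an interim result to a temporary table in main memory before the next
-- stage can read it back.
CREATE TEMPORARY TABLE ma AS
    SELECT Date,
           AVG(Close) OVER (ORDER BY Date
               ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)  AS avg5,
           AVG(Close) OVER (ORDER BY Date
               ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) AS avg21
    FROM Quotes
    WHERE symbol = 'IBM';

-- A second pass re-reads the temporary table to find sign changes in
-- avg5 - avg21, i.e. the crossover points.
SELECT Date
FROM (SELECT Date,
             SIGN(avg5 - avg21) AS s,
             LAG(SIGN(avg5 - avg21)) OVER (ORDER BY Date) AS prev_s
      FROM ma) t
WHERE s <> prev_s;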

More on pipelining with eXtremeDB Financial Edition:

Short (two minute) video
Data sheet
10-page technical white paper

© All Content Copyright 2017 McObject, LLC