
We've launched: open source alpha available now at infinidb.org

   See www.infinidb.org for details.

The intention is to post information specific to InfiniDB performance, scalability, or features on that site. Topics here may be more general in nature, e.g. comparisons of column versus row storage for different use cases that apply to any column-architecture dbms.

Comparison of block touches in support of projecting rows and columns:

 These examples abstract out the cost to access a given set of rows to highlight the relative cost to return (project) rows and columns.  The assumption here is that the returned rows are generally contiguous. 

 

There are a number of operations whose cost will be different between a traditional row based dbms and a column based dbms.  For this discussion, an abstract table with 20 columns (4 1-byte columns, 8 4-byte columns, 8 8-byte columns) will be used with 10 million rows.  The row length is 100 bytes of data based on the data types listed. 

Additional assumptions/approximations: data is stored in blocks/pages of 8K bytes; any overhead for row identifiers is negligible; the primary measure of cost is block touches; rows are distributed contiguously within blocks; and data is written once.


Read Operations

Read 20 columns for 1 row in the table.
    Row-based cost: 1 block must be read. Read 100 bytes from the 8,192 bytes in one block.
    Column-based cost: 20 blocks must be read. Read between 1 and 8 bytes from the first block of each of the 20 columns.

Read 20 columns for the first 2,000 rows in the table.
    Row-based cost: 24.41 blocks must be read. Read 200,000 bytes / 8,192 bytes per block.
    Column-based cost: 28 blocks must be read: (4 + 8 + 16) = 28 blocks.
        1 block for each of the 1-byte columns (reading the first 2,000 bytes of each)
        1 block for each of the 4-byte columns
        2 blocks for each of the 8-byte columns

Read 20 columns for the first 8K (8,192) rows in the table.
    Row-based cost: 100 blocks must be read. Read 819,200 bytes / 8,192 bytes per block.
    Column-based cost: 100 blocks must be read: (4 + 32 + 64) = 100 blocks.
        1 block for each of the 1-byte columns
        4 blocks for each of the 4-byte columns
        8 blocks for each of the 8-byte columns

Read 5 columns (1 1-byte, 2 4-byte, 2 8-byte) for 1 row in the table.
    Row-based cost: 1 block must be read. Read 25 bytes from a 100-byte row record within the first block.
    Column-based cost: 5 blocks must be read. Read between 1 and 8 bytes from the first block of each of the 5 columns.

Read 5 columns for the first 2,000 rows in the table.
    Row-based cost: 24.41 blocks must be read. Read 200,000 bytes / 8,192 bytes per block to return all of the rows, processing 25 bytes from each 100-byte record. Block reads are unchanged vs. the 20-column case.
    Column-based cost: 7 blocks must be read: (1 + 2 + 4) = 7 blocks.
        1 block for the 1-byte column (reading the first 2,000 bytes)
        1 block for each of the 4-byte columns
        2 blocks for each of the 8-byte columns

Read 5 columns for the first 8K rows in the table.
    Row-based cost: 100 blocks must be read. Read 819,200 bytes / 8,192 bytes per block, processing 25 bytes from each 100-byte record. Block reads are unchanged vs. the 20-column case.
    Column-based cost: 25 blocks must be read: (1 + 8 + 16) = 25 blocks.
        1 block for the 1-byte column
        4 blocks for each of the 4-byte columns
        8 blocks for each of the 8-byte columns
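These figures are straightforward arithmetic over the stated assumptions. As a quick sanity check, the 2,000-row case can be reproduced with a couple of throwaway MySQL expressions (a sketch only; 8,192 is the assumed block size, and ceiling() rounds each column up to whole blocks):

-- Row store: 2,000 contiguous 100-byte rows.
select 2000 * 100 / 8192 as row_blocks;                -- 24.41 blocks

-- Column store: per-column ceiling, summed across the 4 + 8 + 8 columns.
select 4 * ceiling(2000 * 1 / 8192)                    -- 1-byte columns:  4 blocks
     + 8 * ceiling(2000 * 4 / 8192)                    -- 4-byte columns:  8 blocks
     + 8 * ceiling(2000 * 8 / 8192) as column_blocks;  -- 8-byte columns: 16 blocks, total 28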

 

Projection Conclusion 1:

·         Row dbms are optimized for “select * of a few rows” (OLTP).

·         Column dbms are optimized for “select a subset of columns for many rows” (DW).

 

 Projection Conclusion 2:

·         Row dbms projection operations become more expensive if the number of columns in the table increases (row size increases).

·         Column dbms projection operations are not affected by the number of columns in the table, only the number of columns selected.

           

 Comparison of access methods - Index Lookup Operations:


These examples abstract out the cost to display/project the rows and columns to highlight the relative cost to access rows. These examples also assume an 8-byte row identifier.

 

 

Find one record on an indexed column.
    Row-based cost: walk the index tree structure, typically 3-5 blocks.
    Column-based cost: walk the index tree structure, typically 3-5 blocks. (Indexes are not implemented in all column-based dbms.)

Find 2,000 discrete records on an indexed column.
    Row-based cost: walk the index tree structure for each record, typically 6,000 to 10,000 block touches. The number of distinct blocks touched would typically be lower because some blocks are re-touched multiple times.
    Column-based cost: N/A; not typically used, because a Full Column Scan is a lower-cost operation.

Find 8,000 discrete records on an indexed column.
    Row-based cost: N/A; not typically used, because a Full Index Scan is a lower-cost operation. Walking the index tree structure for each record would typically cost 24,000 to 40,000 block touches (again, with fewer distinct blocks touched).
    Column-based cost: N/A; not typically used, because a Full Column Scan is a lower-cost operation.
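Since every record pays the full tree walk, the ranges above are just record count times tree depth (an illustrative check using the 3-5 blocks-per-lookup figure from the table):

select 2000 * 3 as low_2000, 2000 * 5 as high_2000;    --  6,000 to 10,000 block touches
select 8000 * 3 as low_8000, 8000 * 5 as high_8000;    -- 24,000 to 40,000 block touches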

 

 

Index Conclusion 1:

·         Cost of index access increases in direct proportion to the number of individual records being accessed.

 

Rule of thumb on when an index is faster than scanning the table (row dbms) or scanning the column (column dbms):

 

  • For a row-based dbms, the traditional rule of thumb is that an index is most useful when the number of rows found is less than 5.0% of the rows in the table.

 

  • For a column based dbms, the rule of thumb is that an index is most useful when the number of rows found is less than 0.25% of the rows in the table (assuming 20 similar sized columns in the table). 
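Applied to the 10-million-row example table, the two rules of thumb imply very different break-even row counts (illustrative arithmetic only):

select 0.05   * 10000000 as row_dbms_breakeven_rows;   -- 500,000 rows (5.0%)
select 0.0025 * 10000000 as col_dbms_breakeven_rows;   --  25,000 rows (0.25%)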

 

Comparison of access methods – Full Index/Column Scan Operations:

 

A number of dbms systems have implemented a Full Index Scan operation, in which the index tree is avoided and the index leaf structures are scanned in a one-pass method. This reduces the total block touches when a large number of individual records must be found.

           

Assuming an 8-byte column with a large number of distinct values, and an 8-byte row identifier, a full scan of an index leaf structure runs around 19,531 blocks. Calculation: (8 bytes + 8 bytes) × 10,000,000 rows / 8,192 bytes per block.

 

Worst case for index tree lookups, assuming a tree depth of 4 (i.e. 4 blocks touched in support of each individual lookup) and 10 million lookups (100% of the rows): 40,000,000 block touches. A robust optimizer would never select this plan.
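Both figures fall out of the same assumptions (a sketch; the 16 bytes per leaf entry is the 8-byte key plus the 8-byte row identifier):

select (8 + 8) * 10000000 / 8192 as leaf_scan_blocks;  -- ~19,531 blocks in one pass
select 4 * 10000000 as worst_case_tree_touches;        -- 40,000,000 block touches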

 

Full Index Scan – find 2,000 records by walking the index leaf structure in a one-pass operation.
    Row-based cost (assuming an 8-byte row identifier; e.g. for a unique 4-byte column: (12 bytes × 10 million rows) / 8,192):
        1-byte: ~10,986 block touches
        4-byte: ~14,648 block touches
        8-byte: ~19,531 block touches
    Column-based cost: N/A (not typically used because a full column scan is faster).

Full Column Scan – find 2,000 records by scanning the column in a one-pass operation.
    Row-based cost: N/A.
    Column-based cost:
        1-byte: ~1,220 block touches
        4-byte: ~4,882 block touches
        8-byte: ~9,765 block touches

Full Index Scan – find 8,000 records by walking the index leaf structure in a one-pass operation.
    Row-based cost: identical to the 2,000-record case, because the one-pass scan cost does not depend on how many records qualify:
        1-byte: ~10,986 block touches
        4-byte: ~14,648 block touches
        8-byte: ~19,531 block touches
    Column-based cost: N/A (not typically used because a full column scan is faster).

Full Column Scan – find 8,000 records by scanning the column in a one-pass operation.
    Row-based cost: N/A.
    Column-based cost: again identical to the 2,000-record case:
        1-byte: ~1,220 block touches
        4-byte: ~4,882 block touches
        8-byte: ~9,765 block touches
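Every number in this table reduces to bytes-per-entry × 10 million rows / 8,192 (a sketch; the index scan carries an extra 8 bytes per entry for the row identifier):

-- Full column scan: just the column bytes.
select 1 * 10000000 / 8192 as colscan_1byte,           -- ~ 1,220 blocks
       4 * 10000000 / 8192 as colscan_4byte,           -- ~ 4,882 blocks
       8 * 10000000 / 8192 as colscan_8byte;           -- ~ 9,765 blocks

-- Full index scan: column bytes plus the 8-byte row identifier.
select (1 + 8) * 10000000 / 8192 as ixscan_1byte,      -- ~10,986 blocks
       (4 + 8) * 10000000 / 8192 as ixscan_4byte,      -- ~14,648 blocks
       (8 + 8) * 10000000 / 8192 as ixscan_8byte;      -- ~19,531 blocks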

 

 

 

Index/Column Scan Conclusion:

 

  • Both the index scan and the column scan are fixed-cost operations: the cost in block touches does not increase with the number of records found. These scan operations are common for large DW workloads.

           

Additional Index/Column Scan Conclusions:

 

  • The column scan operation is a lower cost operation than an index scan because the row identifier need not be stored in the column structure for a system with fixed addressing. 

 

  • The column scan operation avoids the overhead of creating, managing, and storing the index, potentially accelerating load rates.

 

 Comparison of access methods – Table Scan/Column Scan Operations:

 

Of course, not every column in a row based dbms can be indexed, and the alternative is a full table scan.  For a column based dbms, each column is accessed separately.

  

Find 1 row on an un-indexed column.
    Row-based cost (table scan): (10 million rows × 100-byte records) / 8,192 = 122,070 block touches.
    Column-based cost (column scan):
        1-byte: ~1,220 block touches
        4-byte: ~4,882 block touches
        8-byte: ~9,765 block touches

Find 8K rows on an un-indexed column.
    Row-based cost (table scan): (10 million rows × 100-byte records) / 8,192 = 122,070 block touches.
    Column-based cost (column scan):
        1-byte: ~1,220 block touches
        4-byte: ~4,882 block touches
        8-byte: ~9,765 block touches

 

 

Scan Conclusion:

·         A row-based dbms executes a full table scan when the filtered column is un-indexed.

·         A column dbms never executes a full table scan. Column scan costs run between 1% and 8% of the table scan cost for our 20-column table.
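That 1% to 8% range is simply the ratio of column width to row width; the block counts confirm it (illustrative arithmetic):

select (1 * 10000000 / 8192) / 122070 as ratio_1byte;  -- ~0.01: 1-byte column scan vs. table scan
select (8 * 10000000 / 8192) / 122070 as ratio_8byte;  -- ~0.08: 8-byte column scan vs. table scan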


 

Comparison of block touches in support of inserts and deletions:

 

Insert or delete 1 row.
    Row-based cost: 1 block when no indexes exist. Additional blocks are required for each index.
    Column-based cost: 20 blocks must be modified (one per column).

Insert or delete 2,000 rows.
    Row-based cost: 24.41 blocks must be modified when no indexes are used (insert 200,000 bytes / 8,192 bytes per block). Additional blocks are required for each index.
    Column-based cost: 28 blocks must be modified: (4 + 8 + 16) = 28 blocks.
        1 block for each of the 1-byte columns
        1 block for each of the 4-byte columns
        2 blocks for each of the 8-byte columns

Insert or delete 8K rows.
    Row-based cost: 100 blocks must be modified when no indexes are used (insert 819,200 bytes / 8,192 bytes per block). Additional blocks are required for each index.
    Column-based cost: 100 blocks must be modified: (4 + 32 + 64) = 100 blocks.
        1 block for each of the 1-byte columns
        4 blocks for each of the 4-byte columns
        8 blocks for each of the 8-byte columns

Insert/Delete Conclusion:

·         Row dbms are optimized for individual inserts (OLTP), but benefits are reduced by additional I/O for any indices in place.

·         Column dbms insert efficiency approaches row dbms efficiency with bulk inserts, as measured by block touches.

 

Comparison of block touches in support of updates:

 

Update Operation                     Row-Based Cost     Column-Based Cost
Update 1 column for 1 row            1 block            1 block
Update 4 columns for 1 row           1 block            4 blocks
Update 1 column for 2,000 rows       24.41 blocks       1 block
Update 4 columns for 2,000 rows      24.41 blocks       4 blocks
Update 1 column for 8,192 rows       100 blocks         1 block
Update 4 columns for 8,192 rows      100 blocks         4 blocks
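The asymmetry comes from what must be rewritten: the row store rewrites every block the affected rows occupy regardless of how many columns change, while the column store rewrites only the blocks of the updated columns. A sketch for the 2,000-row, 4-column case (assuming, as the table above does, that the updated columns are narrow enough for 2,000 values to fit in one block each):

select 2000 * 100 / 8192 as row_store_blocks;            -- 24.41 blocks, whether 1 or 4 columns change
select 4 * ceiling(2000 * 1 / 8192) as col_store_blocks; -- 4 blocks: one per updated 1-byte column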

 

Update Conclusion:

·         Row dbms are optimized for “update multiple columns in one row” (OLTP).

·         Column dbms are optimized for “updates of many rows at a time” (DW).

Quick Analysis of SSB scaling

The SSB scaling numbers trended reasonably well as the system scaled. 

2 -> 4 node scaling and 4 -> 8 node scaling delivered a scalability factor averaging .552, with a range from .512 to .608. Perfect linear scaling would be .500, i.e. doubling system resources cuts elapsed time in half.

1 -> 2 node scaling averaged .284 (better than linear) due to the elimination of physical I/O with a larger cache. Individual queries ranged from a .128 to .499 scalability factor.
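For reference, the scalability factor is the ratio of elapsed times as the node count doubles, so .500 is perfectly linear (an illustrative calculation with made-up timings, not measured results):

-- scaling factor = elapsed(2N nodes) / elapsed(N nodes)
select 55.2 / 100.0 as scaling_factor;                 -- 0.552, the observed average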

Software is available for trial via our Early Adopter Program at www.calpont.com.

We ran a quick scalability test of Calpont join behavior using a Star Schema Benchmark data set at a scale factor of 1000. The Star Schema Benchmark transforms the TPC-H / DBT-3 data into a more standardized data warehouse star schema data model; at the 1000 scale factor, the primary fact table holds 6 billion rows. Information on the Star Schema Benchmark (SSB) can be found at http://www.cs.umb.edu/~xuedchen/research/publications/DataWarehousePerformanceDissertationProposal.pdf .

-----------------------------------------------------------------------------------------------------------------------------------------
-- Note that these queries are run without any tuning or indices created for these joins or filters.
-- Basically, this is just 1) Create tables (without index or partition declarations).
-- 2) Load tables, fact table was loaded one month at a time.
-- 3) Run queries.
-- In addition, there is no expectation that rows in different tables are co-located.
-- Therefore, result should be similar across a wide variety of join and predicate cases.
-----------------------------------------------------------------------------------------------------------------------------------------
 





The x axis shows additional servers being included; the y axis is elapsed time in seconds (log scale).

 



Profile Near-Linear Scale-Out for Aggregation

Calpont offers near-linear scale-out for aggregation. To demonstrate, we have a set of queries that aggregate across 1, 3, 5, or 7 columns against an f_log table that has just over 5.14 billion rows, representing 140 days' worth of web log data. The f_log table has 25 columns as created, but the queries use:

An 8-byte date range filter (hour_id)
Between 1 and 4 dimension keys (4-byte) (*_dim_id)
Between 0 and 3 flag/type attributes (1-byte).

There are 3 dimensions to this profile:
1) Scaling the group by operation across 1,3,5,7 columns. The queries touch 12, 17, 22, and 27 bytes per row.
2) Scaling the number of rows based on the hour_id filter (327 Million, 666 Million, and 1.09 Billion rows matching filters against 10, 20, and 30 days of data).
3) Scaling the Calpont Performance Module (PM) tier to add additional processing capabilities. Each PM is a Dell server with 8 cores and 16GB memory.

The cache is flushed at the start of each run. The limit clause was used to provide cleaner output; average elapsed times appear to be within 1% when using limit 5 and when running without a limit clause.
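All of the queries share one shape; below is a sketch of the 3-column variant (f_log and hour_id come from the description above, while the dimension-key and flag column names and the hour_id range are illustrative):

select a_dim_id, b_dim_id, flag_1, count(*) as cnt
from f_log
where hour_id between 1 and 240                        -- illustrative 10-day filter
group by a_dim_id, b_dim_id, flag_1
order by cnt desc
limit 5;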



Calpont's ability to execute this work in parallel across all available Performance Modules allows for horizontal (or vertical) scaling to analyze billions of rows, tens of billions, or more.

Look for future performance profiles to understand the full scalability and ease of administration of the Calpont system. Visit www.Calpont.com to apply for the Early Adopter Program and see if Calpont is right for you.

 


Comparison with BKA results

Just to clarify, this is a comparison of Calpont performance against the BKA performance. Calpont uses a multi-threaded, distributed hash join operation rather than an index operation, and so does not use BKA access directly. Note that the BKA enhancement is clearly the right direction for improving index operations; this comparison simply shows the power of hash joins for big data.

Results are graphed below. Note that the graphic contains multiple comparisons with differing conditions.

The first two metrics come from the 2008 UC presentation; the third is a Calpont run at scale factor 1 (6 million rows in the lineitem table). All are disk runs, with Calpont configured with one performance module.

The next two metrics scale the number of rows by 10x and 100x with Calpont timings, again disk timings, and again one performance module.

The next metric uses sf100 (600 million rows in the lineitem table) and adds an additional performance module (our distributed tier), again with disk timings.

The last three metrics use the sf100 scale, but are cached runs with 1, 2, and 4 performance modules.



 



 


Calpont comparison BKA query

To begin to profile the Calpont multi-threaded, distributed hash join capabilities, we used one of the queries from Igor Babaev's BKA presentation at UC '08. Details on the class of server used in the BKA presentation were not provided, so this isn't a direct comparison, only an approximation. The bench started with 1 Calpont performance module (our distributed tier), a Dell 8-core server with 16 GB memory. We then extended the demo to understand how scaling our distributed tier impacted the join rate.

The query was a join between two DBT3 tables, lineitem and part. The number of rows varies by scale factor (sf); for example, sf1 is ~1GB, sf10 is ~10GB, and sf100 is ~100GB.

Rows (millions)
Table        sf1    sf10    sf100
Lineitem       6      60      600
Part         0.8       8       80

The query applies predicates to both sides, reducing the number of rows going into the join. For example, at sf100 the query joins 327,294,042 lineitem rows with 122,451 part rows. Below is a run showing the table counts going into the join.

Note this is an example to show the operations and cardinality involved. The BKA test measured disk access; those measures will follow in another post. Executing the counts obviously populates the cache.



mysql> desc lineitem;
+-----------------+---------------+------+-----+---------+-------+
| Field           | Type          | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+-------+
| l_orderkey      | int(11)       | YES  |     | NULL    |       |
| l_partkey       | int(11)       | YES  |     | NULL    |       |
| l_suppkey       | int(11)       | YES  |     | NULL    |       |
| l_linenumber    | bigint(20)    | YES  |     | NULL    |       |
| l_quantity      | decimal(12,2) | YES  |     | NULL    |       |
| l_extendedprice | decimal(12,2) | YES  |     | NULL    |       |
| l_discount      | decimal(12,2) | YES  |     | NULL    |       |
| l_tax           | decimal(12,2) | YES  |     | NULL    |       |
| l_returnflag    | char(1)       | YES  |     | NULL    |       |
| l_linestatus    | char(1)       | YES  |     | NULL    |       |
| l_shipdate      | date          | YES  |     | NULL    |       |
| l_commitdate    | date          | YES  |     | NULL    |       |
| l_receiptdate   | date          | YES  |     | NULL    |       |
| l_shipinstruct  | char(25)      | YES  |     | NULL    |       |
| l_shipmode      | char(10)      | YES  |     | NULL    |       |
| l_comment       | varchar(44)   | YES  |     | NULL    |       |
+-----------------+---------------+------+-----+---------+-------+
16 rows in set (0.00 sec)

mysql> desc part;
+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| p_partkey     | int(11)       | YES  |     | NULL    |       |
| p_name        | varchar(55)   | YES  |     | NULL    |       |
| p_mfgr        | char(25)      | YES  |     | NULL    |       |
| p_brand       | char(10)      | YES  |     | NULL    |       |
| p_type        | varchar(25)   | YES  |     | NULL    |       |
| p_size        | int(11)       | YES  |     | NULL    |       |
| p_container   | char(10)      | YES  |     | NULL    |       |
| p_retailprice | decimal(12,2) | YES  |     | NULL    |       |
| p_comment     | varchar(23)   | YES  |     | NULL    |       |
+---------------+---------------+------+-----+---------+-------+
9 rows in set (0.00 sec)

mysql> \. flush
--------------
select calflushcache()
--------------

+-----------------+
| calflushcache() |
+-----------------+
|               0 |
+-----------------+
1 row in set (0.24 sec)

mysql> select count(p_partkey) from part where p_retailprice > 2050;
--------------
select count(p_partkey) from part where p_retailprice > 2050
--------------

+------------------+
| count(p_partkey) |
+------------------+
|           122451 |
+------------------+
1 row in set (0.54 sec)

mysql> select count(p_partkey) from part where p_retailprice > 2050;
--------------
select count(p_partkey) from part where p_retailprice > 2050
--------------

+------------------+
| count(p_partkey) |
+------------------+
|           122451 |
+------------------+
1 row in set (0.14 sec)

mysql> select count(l_partkey) from lineitem where l_discount > 0.04;
--------------
select count(l_partkey) from lineitem where l_discount > 0.04
--------------

+------------------+
| count(l_partkey) |
+------------------+
|        327294042 |
+------------------+
1 row in set (15.11 sec)

mysql> select count(l_partkey) from lineitem where l_discount > 0.04;
--------------
select count(l_partkey) from lineitem where l_discount > 0.04
--------------

+------------------+
| count(l_partkey) |
+------------------+
|        327294042 |
+------------------+
1 row in set (2.53 sec)

mysql> SELECT COUNT(*) FROM part, lineitem WHERE l_partkey=p_partkey AND p_retailprice>2050 and l_discount > 0.04;
--------------
SELECT COUNT(*) FROM part, lineitem WHERE l_partkey=p_partkey AND p_retailprice>2050 and l_discount > 0.04
--------------

+----------+
| COUNT(*) |
+----------+
|  2004039 |
+----------+
1 row in set (3.60 sec)

mysql> SELECT COUNT(*) FROM part, lineitem WHERE l_partkey=p_partkey AND p_retailprice>2050 and l_discount > 0.04;
--------------
SELECT COUNT(*) FROM part, lineitem WHERE l_partkey=p_partkey AND p_retailprice>2050 and l_discount > 0.04
--------------

+----------+
| COUNT(*) |
+----------+
|  2004039 |
+----------+
1 row in set (3.29 sec)

 


Calpont at MySQL UC 2009

While Calpont has been engaged with a number of members of the MySQL community over the past years, this will be our first wide presentation of our storage engine capabilities.

What if loading your big data for analysis was easy? How about simple scale-out for data warehousing that allows analysis of all the data using all the resources of a distributed system, while retaining the simplicity of a single MySQL instance? How about blindingly fast analysis of the raw data without requiring aggregation that can lose the detail? Come find out what Calpont can offer: 

Calpont's foundation, or Pillars:

Capable:
   - Purpose built for analytics
   - Built for big data
   - Built for speed

Scalable:
  - Scalable scan, filter, aggregation, and hash join
  - Intra-server parallelism (multithreaded within a server)
  - Inter-server parallelism (scale-out)

Extendable:
 - Extend the data - load rate consistent over time
 - Extend the data model - simple column add
 - Extend the database functionality - UDF capability

Simple:
 - Load & Go - simple bulk load methodology
 - Automatic partitioning
 - Automatic parallelism

Come join us, Wed. at 2:00 pm.  Ballroom H.

http://www.mysqlconf.com/mysql2009/public/schedule/detail/8997

Data Warehousing, Data Factories, and MySQL

Re: the title.

Explosive data growth has become problematic for traditional dbms architectures, and the creation of automated data factories (web logs, web transactions, micro transactions, RFID) that generate new data every click/second/fractional second has applied further pressure on traditional approaches.

Traditional approaches that rely on tables and indices run into interesting problems when the size of a table or index exceeds available memory. Table scans that rely on physical I/O are expensive in terms of both time to execute and the cost of scaling I/O performance. However, the fundamental inefficiency of executing a 'select *' from storage to satisfy a 'select 4 columns from some range of rows' becomes truly problematic when the wasted I/O is measured in terabytes. Traditional index operations can have a more subtle, but potentially more dramatic, impact on performance when the index can't be retained in cache. Because the I/O in support of an index lookup can be executed once per row in the worst case, some traditional index operations end up more expensive than a table scan, with individual blocks read many times. In addition, indices generally increase the time to load the table, the memory requirements, and the total database size, and they are a conditional solution that works only when a specific predicate is present and works best for a limited range of selectivity.
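To put rough numbers on that worst case, reuse the 10-million-row, 100-byte-row example table from the posts above (illustrative arithmetic only):

select 10000000 * 100 / 8192 as table_scan_blocks;     -- ~122,070 block touches, one pass
select 10000000 as worst_case_index_touches;           -- up to one block touch per row looked up, ~80x the scan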

MySQL's storage engine capability is certainly a testament to the idea that there is no single best architecture for all business cases.  Calpont is offering a purpose-built MySQL storage engine with significant architectural features to solve data warehouse problems.  This column will be used to explore the capabilities of the system. 
