Jim Tommaney (jtommaney) wrote,
Jim Tommaney

Scalable Star Schema Benchmark (SSB) Join Metrics

We ran a quick scalability test of Calpont join behavior across using a Star Schema Benchmark data set at a scale factor of 1000. The Star Schema Benchmark transforms a TPC-H / DBT-3 data to a more standardized data warehouse star schema data model, and the 1000 scale factor includes 6 billion rows in the primary fact table. Information on the star schema bench (SSB) can be found at http://www.cs.umb.edu/~xuedchen/research/publications/DataWarehousePerformanceDissertationProposal.pdf .

-- Note that these queries are run without any tuning or indices created for these joins or filters.
-- Basically, this is just 1) Create tables (without index or partition declarations).
-- 2) Load tables, fact table was loaded one month at a time.
-- 3) Run queries.
-- In addition, there is no expectation that rows in different tables are co-located.
-- Therefore, result should be similar across a wide variety of join and predicate cases.

The x axis show additional servers being included, the y axis is elapsed time in seconds (log).


Base table cardinality is 30 million customers, 10 million suppliers, 20 million parts, 2556 days, and 6 billion lineorders (actually 5,999,989,709).

The system under test includes between 1 and 8 performance modules (scalable component of Calpont architecture), each with two quad core CPUs at 2.0 GHz and 16 GB memory. This is run with a shared disk infrastructure and single fiber channel to each performance module. The system could also be configured with direct attached/on-board disk, or with dual fiber channel.

Filters used for each table and cardinality returned are included here:

-- 30 million rows, find 360k (about 1.2%)
select count(*) from customer where c_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
| count(*) |
| 360736 |

-- 10 million rows, find 120k (about 1.2%)
select count(*) from supplier where s_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
| count(*) |
| 119731 |

-- 20 million rows, find 240k (about 12%)
select count(*) from part where p_mfgr in ('MFGR#1','MFGR#4')
and p_category not in ('MFGR#12','MFGR#11','MFGR#13','MFGR#14','MFGR#41','MFGR#42','MFGR#43');
| count(*) |
| 239495 |

-- 2556 rows, find 365 (about 14%)
select count(*) from dateinfo where d_year = 1997;
| count(*) |
| 365 |

-- 6 billions rows, find 910 million (about 15%)
select count(*) from lineorder where lo_orderdate between 19970101 and 19971231;
| count(*) |
| 910240598 |

This test measures a 5 table join operation, as well as variations that remove one table from this query, yielding 4 distinct flavors. The first query is run with an empty cache, and then re-run. The remaining queries are run immediately afterwards.

-- --------------------------------------------------------
-- 5 tables in join
-- 360k, 120k, 240k, dates joined with 900 million
-- --------------------------------------------------------
select d_yearmonthnum, s_city, c_city, p_mfgr, sum(lo_extendedprice), count(*)
from customer, supplier, part, dateinfo, lineorder
where c_custkey = lo_custkey and
s_suppkey = lo_suppkey and
p_partkey = lo_partkey and
d_datekey = lo_orderdate and
lo_orderdate between 19970101 and 19971231 and
c_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
s_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
p_mfgr in ('MFGR#1','MFGR#4') and
p_category in ('MFGR#12','MFGR#11','MFGR#43') and
d_year = 1997
group by 1,2,3,4
order by 1,2,3,4;

Tags: data warehousing, dw, mpp, mysql, open source, scalability, ssb, star schema benchmark

  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your IP address will be recorded