Hive Partitioning:
In Hive, partitioning is used to avoid scanning the entire table for queries with filters (fine-grained queries), which improves job response times. Hive reads only the files in the partitions selected by the WHERE clause, so the partition columns must appear in the WHERE clause; otherwise the entire table is scanned.
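Conceptually, each partition is stored as a subdirectory of the table's directory, named `<column>=<value>`, and a filter on the partition column means Hive only has to read the matching directory. A minimal Python sketch of that layout (the paths are illustrative, not the exact warehouse layout of any particular installation):

```python
import os
import tempfile

# Illustrative sketch: Hive stores each partition of a table as a
# directory named <column>=<value> under the table directory.
warehouse = tempfile.mkdtemp()
for state in ("AP", "KA"):
    # e.g. .../population/state=AP/
    os.makedirs(os.path.join(warehouse, "population", f"state={state}"))

# A query filtering on state='AP' only needs to read that one directory.
print(sorted(os.listdir(os.path.join(warehouse, "population"))))
# -> ['state=AP', 'state=KA']
```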
In Hive, we have two types of partitioning:
- Static Partitioning
- Dynamic Partitioning
Static Partitioning:
*************creating table using single partition column*************
create table india (name string, age int, place string) partitioned by (state string) row format delimited fields terminated by '\t';
load data local inpath '/path of the file in local system(Linux)/filename' into table india partition (state='AP');
[OR]
load data inpath '/path of the file in HDFS/filename' into table india partition (state='AP');
******** creating table using multiple partition columns***********
create table population(name string, age int, place string) partitioned by (state string, dist string) row format delimited fields terminated by '\t';
load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='AP', dist='PKM');
load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='AP', dist='GNT');
load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='KA', dist='Mandya');
load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='KA', dist='Sivmoga');
[OR]
load data inpath '/path of the file in HDFS/filename' into table population partition (state='AP', dist='PKM');
load data inpath '/path of the file in HDFS/filename' into table population partition (state='AP', dist='GNT');
load data inpath '/path of the file in HDFS/filename' into table population partition (state='KA', dist='Mandya');
load data inpath '/path of the file in HDFS/filename' into table population partition (state='KA', dist='Sivmoga');
select * from population; => Direct HDFS call
select * from population where state='AP'; => Direct HDFS call
select * from population where state='AP' and dist='GNT'; => Direct HDFS call
select * from population where state='AP' and age > 30; => MapReduce Job
select avg(age) from population where state='AP' and dist='GNT' group by age; => MapReduce Job
select avg(age) from population group by place; => MapReduce Job
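The pruning behavior behind the queries above can be sketched in Python: a filter on the partition columns selects a subset of partition directories, while a filter on a non-partition column (such as age) still has to read every row inside the selected partitions. This is only an illustration of the idea, not Hive's actual planner:

```python
# Partition directories for the population table (values from the loads above).
partitions = [
    {"state": "AP", "dist": "PKM"},
    {"state": "AP", "dist": "GNT"},
    {"state": "KA", "dist": "Mandya"},
    {"state": "KA", "dist": "Sivmoga"},
]

def prune(partitions, **filters):
    """Keep only partitions whose column values match every filter."""
    return [p for p in partitions
            if all(p.get(k) == v for k, v in filters.items())]

print(len(prune(partitions, state="AP")))              # 2 partitions scanned
print(len(prune(partitions, state="AP", dist="GNT")))  # 1 partition scanned
print(len(prune(partitions)))                          # no filter: all 4 scanned
```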
Dynamic Partitioning:
In dynamic partitioning, we have two modes:
- Strict Mode - at least one static partition column is required
- Non Strict Mode - a static partition column is not required
By default, Hive dynamic partitioning runs in strict mode.
There are limits on the number of dynamic partitions per node and per cluster:
By default, the maximum number of dynamic partitions per node is 100 (hive.exec.max.dynamic.partitions.pernode).
The maximum number of dynamic partitions per cluster is 1000 (hive.exec.max.dynamic.partitions).
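Before running a dynamic-partition insert, it can help to estimate how many distinct partition values the source data will produce and compare that against these limits. A rough Python sketch (the sample rows are made up for illustration):

```python
# Made-up sample source rows: only the partition column (stock) matters here.
rows = [("AAPL",), ("IBM",), ("GE",), ("AAPL",), ("IBM",)]

# Each distinct partition-column value becomes one dynamic partition.
distinct_partitions = {stock for (stock,) in rows}
print(len(distinct_partitions))  # 3

# Compare against the Hive defaults before running the INSERT.
MAX_PER_NODE, MAX_PER_CLUSTER = 100, 1000
assert len(distinct_partitions) <= MAX_PER_CLUSTER
```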
Strict Mode:
create table part_stocks(sdate string, open double, high double, low double, close double, volume bigint, adj_close double) partitioned by (market string, stock string) row format delimited fields terminated by '\t';
In the above table, we have two partition columns (static partition columns must come before dynamic ones in the partition spec, so market is declared first):
market => static partition column
stock => dynamic partition column
Increase the number of dynamic partitions per node and per cluster if necessary:
set hive.exec.max.dynamic.partitions.pernode=300;
insert into table part_stocks partition(market='NYSE', stock) select sdate, open, high, low, close, volume, adj_close, stock from stocks;
Non Strict Mode:
Hive dynamic partitioning in non-strict mode requires setting "hive.exec.dynamic.partition.mode" to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;
create table part_stocks(market string, sdate string, open double, high double, low double, close double, volume bigint, adj_close double) partitioned by (stock string) row format delimited fields terminated by '\t';
(Drop the earlier part_stocks table first, since this example reuses the same table name.)
In the above table, we have one partition column:
stock => dynamic partition column
Increase the number of dynamic partitions per node and per cluster if necessary:
set hive.exec.max.dynamic.partitions.pernode=300;
insert into table part_stocks partition(stock) select market, sdate, open, high, low, close, volume, adj_close, stock from stocks;
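Dynamic partitioning routes each row to a partition based on the value of its partition column, which Hive takes from the last column(s) of the SELECT. Roughly, it behaves like this Python sketch (the rows are made-up sample data):

```python
from collections import defaultdict

# Made-up sample rows: (market, sdate, close, stock)
rows = [
    ("NYSE", "2010-01-04", 30.1, "AAPL"),
    ("NYSE", "2010-01-04", 130.9, "IBM"),
    ("NYSE", "2010-01-05", 30.6, "AAPL"),
]

# Each distinct value of the dynamic partition column (the last select
# column, stock) gets its own partition directory.
partitions = defaultdict(list)
for *data, stock in rows:
    partitions[f"stock={stock}"].append(data)

print(sorted(partitions))  # ['stock=AAPL', 'stock=IBM']
```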
Note: Partitioning is good for improving query response times, but over-partitioning creates problems for both HDFS (too many small files) and MapReduce (too many map tasks).
*******************************************************************
Hive Bucketing:
To avoid over-partitioning in Hive, it is better to use bucketing, or a combination of partitioning and bucketing:
**** enable the bucketing **********
By default, bucketing is disabled in Hive; enable it using the following parameter:
set hive.enforce.bucketing = true;
****creating a table with a dynamic partition and using the bucketing concept******
CREATE TABLE stocks_bucketed(market string, stock string, open double, high double, low double, close double, volume bigint, adj_close double) PARTITIONED BY(sdate STRING) CLUSTERED BY(stock) INTO 5 BUCKETS row format delimited fields terminated by '\t';
describe stocks_bucketed;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table stocks_bucketed partition(sdate) select market, stock, open, high, low, close, volume, adj_close, sdate from stocks;
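Bucketing assigns each row to one of the N bucket files by hashing the CLUSTERED BY column, roughly hash(stock) mod 5 here. A Python sketch of the idea (CRC32 is just a stand-in; Hive uses its own hash function, so the exact bucket numbers will not match Hive's):

```python
import zlib

NUM_BUCKETS = 5

def bucket_for(stock: str) -> int:
    # Stand-in hash to illustrate hash(value) % number_of_buckets;
    # Hive's actual hash function differs.
    return zlib.crc32(stock.encode()) % NUM_BUCKETS

# All rows for the same stock always land in the same bucket file,
# which is what makes sampling and bucketed joins efficient.
for stock in ("AAPL", "IBM", "GE", "AAPL"):
    print(stock, "-> bucket", bucket_for(stock))
```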