People coming from RDBMS background might know the benefit of Indexing.Indexes are useful for faster access to rows in a table. If we want to apply indexing using Hive then the first expectation might be that with indexing it should take less time to fetch records and it should not launch a map reduce job. Whereas in practice a map reduce job would still be launched on a Hive query even though an index is created on ahive table.Map/reduce job runs on the table that holds the index data to get all the relevant offsets into the main table and then using those offsets it figures out which blocks to read from the main table. So you will not see map/reduce go away even when you are running queries on tables with indexes on them. The biggest advantage of having index is that it does not require a full table scan and it would query only the HDFS blocks required.
The difference b/w compact and bitmap indexes(Hive 0.8) is how they store the mapping from values to the rows in which the value occurs (Compact Index seems to store (value, block-id) pairs while Bitmap Index stores (value , list of rows as a bitmap)).
For example on a hadoop cluster if we run the following query without index
select * from test_table where test_column=02196666033;
It takes 71 seconds to run as it would perform a complete table scan. Note this table has storage in RCFile fromat and ~ 1 billion records.
The cluster is a 8 node hadoop cluster.
Now we can create the index with following commands
hive> create INDEX test_column_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
OK
Time taken: 3.134 seconds
hive> ALTER INDEX test_column_index ON test_table REBUILD;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 26
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201208160845_0121, Tracking URL = http://test-cluster-1:50030/jobdetails.jsp?jobid=job_201208160845_0121
Kill Command = /home/hadoop/hadoop-0.20.2-hdh3u3/bin/../bin/hadoop job -Dmapred.job.tracker=test-cluster-1:50300 -kill job_201208160845_0121
2012-08-20 12:43:32,481 Stage-1 map = 0%, reduce = 0%
2012-08-20 12:43:36,503 Stage-1 map = 1%, reduce = 0%
…………………………………………………………………………………………….
2012-08-20 12:44:34,797 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201208160845_0121
Loading data to table default.default__test_table_test_column_index__
Deleted hdfs://test-cluster-1:54310/user/hive/warehouse/default__test_table_test_column_index__
Table default.default__test_table_test_column_index__ stats: [num_partitions: 0, num_files: 26, num_rows: 0, total_size: 179836098]
OK
Time taken: 68.077 seconds
Note this has now created an index table with following name under default Database “default__test_table_test_column_index__”
hive> INSERT OVERWRITE DIRECTORY "/tmp/index_test_result" SELECT `_bucketname` , `_offsets` FROM default__test_table_test_column_index__ where test_column=02196666033 ;
……………..
Moving data to: /tmp/index_test_result
1 Rows loaded to /tmp/index_test_result
OK
Time taken: 2.683 seconds
hive> SET hive.index.compact.file=/tmp/index_test_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
hive> SELECT * FROM test_table WHERE test_column=02196666033;
15836400000 21 02196666033 2813 13 11 3 0 61075 1 0 0 2078 01
15836400000 21 02196666033 2813 13 11 3 0 61075 1 0 0 2078 01
15836580000 21 02196666033 2813 13 11 3 0 61075 1 0 0 2078 01
15836760000 21 02196666033 2813 13 11 3 0 61075 1 0 0 2078 01
………………………………
15836760000 21 02196666033 2813 13 11 3 0 61075 1 0 0 2078 01
Time taken: 2.479 seconds
Now If you see the difference, the same query would now take 2.4 seconds instead of 71 seconds as it is using indexing.
hive> SELECT count(*) FROM test_table WHERE test_column=02196666033;
602
Time taken: 3.417 seconds
If you want to see what is happened is that now the index table has stored the location of HDFS blocks where the data resides for this query and when query is fired it just reads form those blocks the actual data
hive> describe default__test_table_test_column_index__;
OK
test_column string from deserializer
_bucketname string from deserializer
_offsets array<bigint> from deserializer
hive> select * from default__test_table_test_column_index__ where test_column=02196666033;
02196666033 hdfs://test-cluster-1:54310/user/hive/warehouse/test_table/000033_0 [1653776,9918864,6611866,8265443,28615496,18337446,30330621,21765257,25190947,14908809,26903328,16623458,20052481,194,4958659,3305997,23478272,11571563,13224916]