Thursday, September 20, 2012

Installing Snappy codec


Following are the steps to install Snappy with HBase:
1.       Download the Snappy source tarball (snappy-1.0.5 here) from the Snappy project site.
2.       Unzip and untar Snappy.
3.       Run the following commands from the snappy-1.0.5 folder (make install may need root privileges): ./configure; make; make install
4.       Snappy installs its lib*snappy*.so* files under the /usr/local/lib directory by default. Copy all libsnappy* files to the $HADOOP_HOME/lib/native/Linux-amd64-64 directory.
5.       You should see the following after running ls:
[hadoop@test-cluster-1 Linux-amd64-64]$ pwd
/home/hadoop/hadoop-0.20.2-hdh3u3/lib/native/Linux-amd64-64
[hadoop@test-cluster-1 Linux-amd64-64]$ ls -lrt
total 1940
-rw-rw-r-- 1 hadoop hadoop 224086 Mar 25 23:15 libhadoop.so.1.0.0
-rw-rw-r-- 1 hadoop hadoop 224086 Mar 25 23:15 libhadoop.so.1
-rw-rw-r-- 1 hadoop hadoop 224086 Mar 25 23:15 libhadoop.so
-rw-rw-r-- 1 hadoop hadoop   1171 Mar 25 23:15 libhadoop.la
-rw-rw-r-- 1 hadoop hadoop 411654 Mar 25 23:15 libhadoop.a
-rwxrwxr-x 1 hadoop hadoop    957 Jun 29 11:55 libsnappy.la
-rwxrwxr-x 1 hadoop hadoop 369308 Jun 29 11:55 libsnappy.a
-rwxrwxr-x 1 hadoop hadoop 171844 Jun 29 11:55 libsnappy.so
-rwxrwxr-x 1 hadoop hadoop 171844 Jun 29 11:55 libsnappy.so.1
-rwxrwxr-x 1 hadoop hadoop 171844 Jun 29 11:55 libsnappy.so.1.1.3

6.       Add the following to $HBASE_HOME/conf/hbase-env.sh (change the directories to match your installation):
export HADOOP_CONF_DIR=/home/hadoop/hadoop-0.20.2-hdh3u3/conf
export HBASE_CLASSPATH=$HADOOP_CONF_DIR:/home/hadoop/hbase-hdh3u3/conf/hbase-site.xml
export HBASE_LIBRARY_PATH=$HBASE_LIBRARY_PATH:/home/hadoop/hadoop-0.20.2-hdh3u3/lib/native/Linux-amd64-64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/hadoop-0.20.2-hdh3u3/lib/native/Linux-amd64-64
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/hadoop/hadoop-0.20.2-hdh3u3/lib/native/Linux-amd64-64
export LIBRARY_PATH=$LIBRARY_PATH:/home/hadoop/hadoop-0.20.2-hdh3u3/lib/native/Linux-amd64-64
Scp the libsnappy* files to the same location on all the nodes in the cluster.
7.       Restart HBase and Hadoop.
8.       In the HBase shell, run the following command:     create 'test', { NAME => 'column', COMPRESSION => 'SNAPPY' }
9.       If this completes without errors, Snappy is installed properly. A quicker sanity check is sketched below.
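You can also exercise the codec directly with HBase's bundled CompressionTest utility instead of creating a table. A minimal sketch, assuming your HBase build includes Snappy support in CompressionTest and that /tmp is writable (the test file path is arbitrary):

[hadoop@test-cluster-1 ~]$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/snappy-check snappy

If the native libraries are wired up correctly, the tool completes without exceptions (newer versions print SUCCESS); an UnsatisfiedLinkError usually means the libsnappy* files are missing from the native library path on that node.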

Hive Indexing


People coming from an RDBMS background will know the benefits of indexing: indexes allow faster access to the rows of a table. When applying indexing in Hive, the first expectation might be that an indexed query fetches records faster and does not launch a map/reduce job. In practice, a map/reduce job is still launched for a Hive query even when an index exists on the table. The job runs against the table that holds the index data to get all the relevant offsets into the main table, and then uses those offsets to figure out which blocks of the main table to read. So map/reduce does not go away even when you query indexed tables. The biggest advantage of an index is that it avoids a full table scan: the query reads only the HDFS blocks it needs.
The difference between compact and bitmap indexes (bitmap indexes were added in Hive 0.8) is how they store the mapping from values to the rows in which each value occurs: a compact index appears to store (value, block-id) pairs, while a bitmap index stores (value, list of rows as a bitmap). A sketch of the bitmap variant follows.
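For comparison, here is how a bitmap index on the same column could be created; a minimal sketch using the bitmap handler class that ships with Hive 0.8 (the index name is made up for illustration):

hive> create INDEX test_column_bitmap_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.bitmap.BitmapIndexHandler' WITH DEFERRED REBUILD;
hive> ALTER INDEX test_column_bitmap_index ON test_table REBUILD;

Bitmap indexes are intended for columns with few distinct values; for a high-cardinality column like the one below, the compact index is the more natural fit.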

For example, if we run the following query on a Hadoop cluster without an index:
select * from test_table where test_column=02196666033;

It takes 71 seconds to run because it performs a complete table scan. Note that this table is stored in RCFile format and holds ~1 billion records.
The cluster is an 8-node Hadoop cluster.
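To confirm that the unindexed query is a full scan, you can look at the query plan; EXPLAIN shows a plain TableScan over test_table with no index involvement:

hive> EXPLAIN SELECT * FROM test_table WHERE test_column=02196666033;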
Now we can create the index with the following commands:

hive> create INDEX test_column_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  WITH DEFERRED REBUILD;
OK
Time taken: 3.134 seconds
hive> ALTER INDEX  test_column_index  ON test_table REBUILD; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 26
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208160845_0121, Tracking URL = http://test-cluster-1:50030/jobdetails.jsp?jobid=job_201208160845_0121
Kill Command = /home/hadoop/hadoop-0.20.2-hdh3u3/bin/../bin/hadoop job  -Dmapred.job.tracker=test-cluster-1:50300 -kill job_201208160845_0121
2012-08-20 12:43:32,481 Stage-1 map = 0%,  reduce = 0%
2012-08-20 12:43:36,503 Stage-1 map = 1%,  reduce = 0%
…………………………………………………………………………………………….
2012-08-20 12:44:34,797 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160845_0121
Loading data to table default.default__test_table_test_column_index__
Deleted hdfs://test-cluster-1:54310/user/hive/warehouse/default__test_table_test_column_index__
Table default.default__test_table_test_column_index__ stats: [num_partitions: 0, num_files: 26, num_rows: 0, total_size: 179836098]
OK
Time taken: 68.077 seconds

Note that this has created an index table named “default__test_table_test_column_index__” under the default database.

hive> INSERT OVERWRITE DIRECTORY "/tmp/index_test_result" SELECT `_bucketname` , `_offsets` FROM default__test_table_test_column_index__ where  test_column=02196666033 ;
……………..
Moving data to: /tmp/index_test_result
1 Rows loaded to /tmp/index_test_result
OK
Time taken: 2.683 seconds

hive> SET hive.index.compact.file=/tmp/index_test_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

hive> SELECT * FROM  test_table WHERE test_column=02196666033;
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836580000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
………………………………
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
Time taken: 2.479 seconds

If you look at the difference now, the same query takes about 2.5 seconds instead of 71 seconds because it uses the index.

hive> SELECT count(*) FROM  test_table WHERE test_column=02196666033;
602
Time taken: 3.417 seconds
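As an aside, the three manual steps above (materializing the index lookup, setting hive.index.compact.file, and switching the input format) can be automated from Hive 0.8 onward: with hive.optimize.index.filter enabled, the optimizer rewrites qualifying queries to use a compact index by itself. A minimal sketch; hive.optimize.index.filter.compact.minsize is the minimum input size in bytes before the rewrite kicks in, lowered here so it applies to any query:

hive> SET hive.optimize.index.filter=true;
hive> SET hive.optimize.index.filter.compact.minsize=0;
hive> SELECT * FROM test_table WHERE test_column=02196666033;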


What has happened is that the index table stores the locations of the HDFS blocks where the data for this query resides, so when the query is fired Hive reads the actual data only from those blocks:
hive> describe  default__test_table_test_column_index__;
OK
test_column  string  from deserializer
_bucketname     string  from deserializer
_offsets        array<bigint>   from deserializer

hive> select * from default__test_table_test_column_index__ where test_column=02196666033;
02196666033     hdfs://test-cluster-1:54310/user/hive/warehouse/test_table/000033_0       [1653776,9918864,6611866,8265443,28615496,18337446,30330621,21765257,25190947,14908809,26903328,16623458,20052481,194,4958659,3305997,23478272,11571563,13224916]
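One caveat to remember: Hive indexes are not maintained automatically. Since the index was created WITH DEFERRED REBUILD, it only reflects the data present at the last rebuild, so after loading new data into test_table you need another rebuild; an index you no longer want can be dropped. A minimal sketch:

hive> ALTER INDEX test_column_index ON test_table REBUILD;
hive> DROP INDEX test_column_index ON test_table;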