Sunday 20 December 2015

HIVE - HADOOP

What is Hive?

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analysing easy. Although Pig can be quite a powerful and simple language to use, the downside is that it’s something new to learn and master. So Hive is the best alternative.
We will use the same data used in previous blogs. masters.csvbatting.csv
Data can be downloaded from following link.

Accessing Hue

You can access HUE from the entering the address 127.0.0.1:8000
Login Id : Hue
Password : 1111

Starting the HIVE:

You click on the top left corner of your page and click on the 2nd icon which says Beeswax(Hive UI).

Writing Queries

You click on the query editor to go to the query page.

Queries

The 1st step we do is create a temp_batting table with the query
create table temp_batting (col_value STRING);
1

Next, using the following code we load the scores into the table 
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
2
Once the data is loaded we check if the samples are recorded in temp_batting table with the below command.
SELECT * FROM temp_batting LIMIT 100;

3

Once the query is executed the results will follow
4
Now we create a new table called BATTING  which will contain 3 columns
create table batting (player_id STRING, year INT, runs INT);
5
Now we extract the data we want from temp_batting and copy it into batting.  We do it with a regexp pattern and build a multi line query.
insert overwrite table batting SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1)player_id, regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year, regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run from temp_batting;
Now we group the data by year with the below code
SELECT year, max(runs) FROM batting GROUP BY year;
6
OUTPUT
7

Now we need to find out which player has scored which runs for the given year. We can do this with the below command.

SELECT a.year, a.player_id, a.runs from batting a JOIN(SELECT year,max(runs) runs FROM batting GROUP BY year )b ON (a.year = b.year AND a.runs = b.runs) ;
8

OUTPUT:

9
This is the final output and this completes the tutorial on HIVE.

1 comment: