Sunday 20 December 2015

HIVE - HADOOP

What is Hive?

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analysing easy. Although Pig is a powerful and fairly simple language, the downside is that it is something new to learn and master; Hive, with its SQL-like syntax, is a good alternative.
We will use the same data used in the previous blogs: masters.csv and batting.csv.
The data can be downloaded from the following link.

Accessing Hue

You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login ID: hue
Password: 1111

Starting HIVE:

Click the menu in the top left corner of the page and select the second icon, which says Beeswax (Hive UI).

Writing Queries

Click on the Query Editor to go to the query page.

Queries

The first step is to create a temp_batting table with the query
create table temp_batting (col_value STRING);

Next, we load the batting data into the table with the following statement:
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data is loaded, we check that the records are in the temp_batting table with the command below.
SELECT * FROM temp_batting LIMIT 100;


Once the query is executed, the results are displayed.
Now we create a new table called batting, which will contain three columns:
create table batting (player_id STRING, year INT, runs INT);
Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query:
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) runs
from temp_batting;
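As a side note, the pattern `'^(?:([^,]*)\,?){n}'` works because a capture group inside a repeated group keeps only its last match, so repeating the field pattern n times leaves the n-th comma-separated field in group 1. A small Python sketch shows the same behaviour (illustrative only; the sample line is made up):

```python
import re

# A made-up line in the same shape as Batting.csv (player id, year,
# then further stats, with runs in the 9th column).
line = "aardsda01,2004,1,SFN,NL,11,11,0,0"

def extract_field(value, n):
    """Mimic Hive's regexp_extract(value, '^(?:([^,]*),?){n}', 1):
    the repeated group captures the n-th comma-separated field."""
    return re.match(r"^(?:([^,]*),?){%d}" % n, value).group(1)

player_id = extract_field(line, 1)  # "aardsda01"
year = extract_field(line, 2)       # "2004"
runs = extract_field(line, 9)       # "0"
```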
Now we group the data by year and find the maximum runs for each year with the code below.
SELECT year, max(runs) FROM batting GROUP BY year;
OUTPUT

Now we need to find out which player scored the highest runs in each year. We can do this with the command below.

SELECT a.year, a.player_id, a.runs FROM batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b ON (a.year = b.year AND a.runs = b.runs);
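The logic of this query — compute the per-year maximum in a subquery, then join back on (year, runs) to recover the player — can be sketched in plain Python (illustrative only; the sample rows are made up):

```python
# Each tuple mirrors a row of the batting table: (player_id, year, runs).
batting = [
    ("ruthba01", 1927, 158),
    ("gehrilo01", 1927, 149),
    ("bondsba01", 2001, 129),
    ("henderi01", 2001, 70),
]

# Subquery b: SELECT year, max(runs) FROM batting GROUP BY year
max_per_year = {}
for _, year, runs in batting:
    max_per_year[year] = max(runs, max_per_year.get(year, 0))

# Join a with b ON (a.year = b.year AND a.runs = b.runs)
result = [(year, player_id, runs)
          for player_id, year, runs in batting
          if max_per_year[year] == runs]
# result: [(1927, 'ruthba01', 158), (2001, 'bondsba01', 129)]
```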

OUTPUT:

This is the final output, and it completes the tutorial on HIVE.

Sunday 6 December 2015

First Program in PIG - HADOOP

What is Pig?
Pig is a high-level scripting language used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

How Does PIG Work?
Install the following:
  • Oracle VM VirtualBox:
  • Hortonworks Sandbox on a VM:


Objective:
We have an input file of baseball statistics from 1871–2011, containing over 90,000 rows. Our objective is to compute the highest runs by a player for each year. Once we have the highest runs, we will extend the script to translate the player id field into the first and last names of the players.
Log in to Hue with the following credentials:


URL: http://127.0.0.1:8000
Username: hue
Password: 1111


On the HUE home page, click on the Pig icon at the top (highlighted in the image below).


Create a folder and upload the .csv files.


Give the script a title and write the code in the box as shown in the figure below, then click on the “Execute” button to run the script.



The first thing to notice is that we never really address single rows of data on the left of the equals sign; on the right, we just describe what we want to do to each row. We simply assume things are applied to all the rows. We also have powerful operators like GROUP and JOIN to sort rows by a key and to build new data objects.


RESULT:
The following steps are to be followed:


  • Step 1:

We will load the data, using the “PigStorage” function.
batting = load 'Batting.csv' using PigStorage(',');
  • Step  2:

The next step is to name the fields. We will use the “GENERATE” statement to assign names to the fields.
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
  • Step 3:

Group the rows by the “year” field.
grp_data = GROUP runs by (year);
  • Step 4:

Find the maximum runs for the grouped data above.
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
  •  Step 5:

We will join this with the runs data object in order to pick up the player id.
Then we “DUMP” the data to the output.
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;
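For readers more comfortable outside Pig, the five steps above can be sketched end-to-end in Python (illustrative only; the rows and column layout are made up, with runs placed in field $8):

```python
import csv
import io

# Stand-in for Batting.csv; made-up rows with runs in the 9th column ($8).
raw = io.StringIO(
    "ruthba01,1927,1,NYA,AL,151,540,192,158\n"
    "gehrilo01,1927,1,NYA,AL,155,584,218,149\n"
    "bondsba01,2001,1,SFN,NL,153,476,156,129\n"
)

# Step 1: load, splitting each line on commas (like PigStorage(',')).
batting = list(csv.reader(raw))

# Step 2: GENERATE $0 as playerID, $1 as year, $8 as runs.
runs = [(row[0], row[1], int(row[8])) for row in batting]

# Step 3: GROUP runs by (year).
grp_data = {}
for player_id, year, r in runs:
    grp_data.setdefault(year, []).append(r)

# Step 4: MAX(runs.runs) per group.
max_runs = {year: max(rs) for year, rs in grp_data.items()}

# Step 5: JOIN back on (year, runs) to pick up the player id, then DUMP.
join_data = [(year, player_id, r)
             for player_id, year, r in runs if max_runs[year] == r]
print(join_data)
```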