Summarizing Data: First Program in PIG

What is Pig?

Pig is a high-level scripting language used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility, Pig can invoke code written in languages such as JRuby, Jython and Java. Conversely, you can execute Pig scripts from other languages. As a result, you can use Pig as a component to build larger and more complex applications that tackle real business problems.

How Does PIG Work?

Install the following:

Oracle VM VirtualBox:

https://www.virtualbox.org/

Hortonworks Sandbox on a VM:

http://hortonworks.com/products/hortonworks-sandbox/#install

Objective:

We have an input file of baseball statistics from 1871–2011 that contains over 90,000 rows. Our objective is to compute the highest number of runs scored by a player in each year. Once we have the highest runs, we will extend the script to translate the player ID field into the first and last names of the players.

Select the Advanced options to access the Secure Shell (SSH) client, then log in with the following credentials:

URL: http://127.0.0.1:8000
Username: hue
Password: 1111

On the Hue home page, click the Pig icon at the top (highlighted in the image below).

Create a folder and upload the .csv files.

Give the script a title and write the code in the box as shown in the figure below, then click the “EXECUTE” button to run the script.

The first thing to notice is that we never address single rows of data. To the left of the equals sign we name a result, and on the right we describe what to do for each row; the operation is assumed to apply to all the rows. We also have powerful operators like GROUP and JOIN to sort rows by a key and to build new data objects.

RESULT:

The steps are as follows:

Step 1:

We will load the data using the PigStorage function.

batting = load 'Batting.csv' using PigStorage(',');
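Before going further, it can help to check what the load step produced. The following is a minimal sketch, assuming Batting.csv is in the current HDFS working directory; the alias first_rows is only an illustrative name.

```pig
-- Load the comma-separated file. No schema is declared, so fields are
-- addressed by position ($0, $1, ...) in later statements.
batting = LOAD 'Batting.csv' USING PigStorage(',');

-- DESCRIBE reports the schema; for a load without an AS clause it is unknown.
DESCRIBE batting;

-- Print a handful of rows to confirm the delimiter and the column positions.
first_rows = LIMIT batting 5;
DUMP first_rows;
```

Looking at the first rows is also a quick way to see whether the file starts with a header line.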

Step 2:

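The next step is to name the fields. We will use a FOREACH ... GENERATE statement to assign names to the fields we need. The statement reads from raw_runs, which is batting with non-data rows filtered out; this assumes Batting.csv begins with a header line whose year field is not a positive number.

raw_runs = FILTER batting BY $1 > 0;

runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;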
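Fields loaded without a schema are untyped, so one possible variation is to cast them while naming them. This sketch assumes the file has a header line; casting the header's year text to int yields null, which the filter then drops. The casts are an addition for illustration, not part of the original script.

```pig
batting = LOAD 'Batting.csv' USING PigStorage(',');

-- Assumption: the first line is a header. Its year field cannot be cast
-- to int, so the comparison is null and the row is filtered out.
raw_runs = FILTER batting BY (int)$1 > 0;

-- Name the three fields of interest and give them explicit types.
runs = FOREACH raw_runs GENERATE
           (chararray)$0 AS playerID,
           (int)$1       AS year,
           (int)$8       AS runs;

-- Show the resulting field names and types.
DESCRIBE runs;
```

With integer fields, the later MAX and JOIN steps compare values of the same type on both sides.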

Step 3:

Group the rows by the “year” field.

grp_data = GROUP runs by (year);
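To see what GROUP returns, the sketch below describes the grouped relation and counts the rows in each group. It repeats the earlier statements so that it runs on its own; the casts, the header filter, and the alias rows_per_year are assumptions made for illustration.

```pig
batting  = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY (int)$1 > 0;   -- assumption: drops a header line
runs     = FOREACH raw_runs GENERATE $0 AS playerID, (int)$1 AS year, (int)$8 AS runs;

grp_data = GROUP runs BY year;

-- Each tuple of grp_data has two fields: 'group' (the year) and a bag
-- named 'runs' that holds every row belonging to that year.
DESCRIBE grp_data;

-- Count the rows in each year's bag as a quick sanity check.
rows_per_year = FOREACH grp_data GENERATE group AS year, COUNT(runs) AS num_rows;
DUMP rows_per_year;
```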

Step 4:

Find the maximum runs for each group in the grouped data.

max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
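The per-year maximum can be inspected on its own before joining. This is a sketch under the same assumptions as before (header line filtered out, fields cast to int); sorted_max and latest_years are illustrative aliases.

```pig
batting  = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY (int)$1 > 0;   -- assumption: drops a header line
runs     = FOREACH raw_runs GENERATE $0 AS playerID, (int)$1 AS year, (int)$8 AS runs;
grp_data = GROUP runs BY year;

-- runs.runs projects the 'runs' field out of each year's bag;
-- MAX then reduces that bag to a single value.
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

-- Show the ten most recent years.
sorted_max   = ORDER max_runs BY grp DESC;
latest_years = LIMIT sorted_max 10;
DUMP latest_years;
```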

Step 5:

We will join this with the runs data object in order to pick up the player ID.

Then we DUMP the data to the output.

join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);

join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;

DUMP join_data;
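Putting the steps together, the whole flow might look like the sketch below. It uses field names with the :: disambiguation operator instead of positions after the join, and writes the result with STORE instead of DUMP. The header filter, the casts, the alias sorted_data, and the output directory max_runs_by_year are assumptions chosen for illustration.

```pig
batting  = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY (int)$1 > 0;   -- assumption: drops a header line
runs     = FOREACH raw_runs GENERATE $0 AS playerID, (int)$1 AS year, (int)$8 AS runs;

-- Highest run total per year.
grp_data = GROUP runs BY year;
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

-- Join back on (year, runs) to recover which player scored the maximum.
join_max_run = JOIN max_runs BY (grp, max_runs), runs BY (year, runs);

-- After a join, fields are prefixed with the alias they came from.
join_data = FOREACH join_max_run GENERATE
                max_runs::grp      AS year,
                runs::playerID     AS playerID,
                max_runs::max_runs AS runs;

-- Sort by year and write the result to a hypothetical output directory.
sorted_data = ORDER join_data BY year;
STORE sorted_data INTO 'max_runs_by_year' USING PigStorage(',');
```

Because the join matches on both year and run total, a year in which several players tie for the maximum produces one output row per tied player.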


Sunday, 6 December 2015
