Sunday, 20 December 2015

HIVE - HADOOP

What is Hive?

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analysing easy. Although Pig is a powerful and relatively simple language, it is still something new to learn and master; for anyone already comfortable with SQL, Hive is a natural alternative.
We will use the same data as in the previous posts: masters.csv and Batting.csv.
The data can be downloaded from the following link.

Accessing Hue

You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login ID: hue
Password: 1111

Starting Hive:

Click the menu at the top left corner of the page, then click the second icon, labelled Beeswax (Hive UI).

Writing Queries

Click on the Query Editor to go to the query page.

Queries

The first step is to create a temp_batting table with the following query:
create table temp_batting (col_value STRING);

Next, we load the batting data into the table with the following statement:
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
Once the data is loaded, we check that the records appear in the temp_batting table with the command below.
SELECT * FROM temp_batting LIMIT 100;


Once the query is executed, the results are displayed.
Now we create a new table called batting, which will contain three columns: player_id, year, and runs.
create table batting (player_id STRING, year INT, runs INT);
Now we extract the fields we want from temp_batting and copy them into batting. We do this with a regexp pattern, building a multi-line query.
INSERT OVERWRITE TABLE batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
FROM temp_batting;
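
To see what that pattern is doing, here is a small standalone Java sketch (not part of the Hive workflow, and the sample row is only an illustrative fragment): the repetition count {n} selects the nth comma-separated field, which mirrors what regexp_extract returns above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpExtractDemo {
    // Returns the nth comma-separated field (1-based) of a CSV line,
    // using the same kind of pattern as the Hive query above.
    static String field(String line, int n) {
        Matcher m = Pattern.compile("^(?:([^,]*),?){" + n + "}").matcher(line);
        // Group 1 holds the text captured by the last repetition, i.e. field n.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "aaronha01,1954,1,ML1,NL";   // illustrative row fragment
        System.out.println(field(sample, 1));        // aaronha01
        System.out.println(field(sample, 2));        // 1954
    }
}
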
Now we group the data by year and find the maximum runs for each year, using the query below.
SELECT year, max(runs) FROM batting GROUP BY year;
OUTPUT

Now we need to find out which player scored the maximum runs in each year. We can do this with the query below.

SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

OUTPUT:

This is the final output, and it completes the tutorial on Hive.

Sunday, 6 December 2015

First Program in PIG - HADOOP

What is Pig?
Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages, such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
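
As a rough illustration of that UDF facility, the sketch below shows a minimal Java UDF; the class name UpperCase and the field it operates on are hypothetical and not part of this tutorial.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: returns the first field of the input tuple in upper case.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

A Pig script would REGISTER the jar containing this class and then invoke it by its fully qualified class name inside a FOREACH ... GENERATE statement.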

How Does PIG Work?
Install the following:
  • Oracle VM VirtualBox
  • Hortonworks Sandbox on a VM


Objective:
We have an input file of baseball statistics from 1871 to 2011, containing over 90,000 rows. Our objective is to compute the highest runs scored by a player for each year. Once we have the highest runs, we can extend the script to translate the player id field into the first and last names of the players.
Access the Hue web interface with the following credentials:


URL: http://127.0.0.1:8000
Username: hue
Password: 1111


At the top of the Hue home page, click on the Pig icon (highlighted in the image below).


Create a folder and upload the .csv files.


Give the script a title and write the code in the box as shown in the figure below, then click the “EXECUTE” button to run the script.



The first thing to notice is that we never really address single rows of data to the left of the equals sign; on the right, we just describe what we want to do to each row and assume it is applied to all the rows. We also have powerful operators like GROUP and JOIN to sort rows by a key and to build new data objects.


RESULT:
The script is built in the following steps:


  • Step 1:

We load the data using the “PigStorage” function.
batting = load 'Batting.csv' using PigStorage(',');
  • Step 2:

The next step is to name the fields. We use a GENERATE statement to assign names to the fields we need.
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
  • Step 3:

Group the data by the “year” field.
grp_data = GROUP runs by (year);
  • Step 4:

Find the maximum runs for the grouped data.
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
  • Step 5:

We join this result with the runs data object in order to pick up the player id.
Then we DUMP the data to the output.
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
DUMP join_data;


Saturday, 21 November 2015

Hadoop MapReduce

Hadoop is a software framework that can be installed on a commodity Linux cluster to permit large-scale distributed data analysis. No hardware modification is needed other than possible changes to meet minimum recommended RAM, disk space, etc. requirements per node (e.g., see Cloudera's guidelines). The initial version of Hadoop was created in 2004 by Doug Cutting (and named after his son’s stuffed elephant). Hadoop became a top-level Apache Software Foundation project in January 2008. There have been many contributors, both academic and commercial (Yahoo being the largest such contributor), and Hadoop has a broad and rapidly growing user community.

Components - Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS), inspired by Google's file system, as well as a Java-based API that allows parallel processing across the nodes of the cluster using the MapReduce paradigm. Use of code written in other languages, such as Python and C, is possible through Hadoop Streaming, a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. Also, Hadoop comes with Job and Task Trackers that keep track of the programs’ execution across the nodes of the cluster.

Data locality – Hadoop tries to automatically colocate the data with the computing node. That is, Hadoop schedules Map tasks close to the data on which they will work, with "close" meaning the same node or, at least, the same rack. This is a principal factor in Hadoop’s performance. In April 2008 a Hadoop program, running on a 910-node cluster, broke a world record, sorting a terabyte of data in less than 3.5 minutes. Speed improvements have continued as Hadoop has matured.

Fault-tolerant, shared-nothing architecture - tasks must have no dependence on each other, with the exception of mappers feeding into reducers under Hadoop control. Hadoop can detect task failure and restart programs on other healthy nodes. That is, node failures are handled automatically, with tasks restarted as needed. A single point of failure currently remains at the one name node for the HDFS file system.

Reliability – data is replicated across multiple nodes; RAID storage is not needed.

Programming support - unlike, for example, parallel programming using MPI, data flow is implicit and handled automatically; it does not need coding. For tasks fitting the MapReduce paradigm, Hadoop simplifies the development of large-scale, fault-tolerant, distributed applications on a cluster of (possibly heterogeneous) commodity machines.

    MapReduce

    MapReduce is a functional programming paradigm that is well suited to handling parallel processing of huge data sets distributed across a large number of computers, or in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps:
    1. Map: The map step essentially solves a small problem: Hadoop’s partitioner divides the problem into small workable subsets and assigns those to map processes to solve.
    2. Reduce: The reducer combines the results of the mapping processes and forms the output of the MapReduce operation.
    My Map definition purposely used the word “essentially” because one of the things that gives the Map step its name is its implementation. While it does solve small workable problems, the way it does so is by mapping specific keys to specific values. For example, if we were to count the number of times each word appears in a book, our MapReduce application would output each word as a key and the value as the number of times it is seen. Or more specifically, the book would probably be broken up into sentences or paragraphs, and the Map step would return each word mapped to the number of times it appears in the sentence (or to “1” for each occurrence of every word), and then the reducer would combine the keys by adding their values together.
    Listing 1 shows a Java/pseudocode example of how the map and reduce functions might work to solve this problem.

    Listing 1 - Java/Pseudocode for MapReduce

    public void map( String name, String sentence, OutputCollector output ) {
      for( String word : sentence ) {
        output.collect( word, 1 );
      }
    }
    
    public void reduce( String word, Iterator values, OutputCollector output ) {
      int sum = 0;
      while( values.hasNext() ) {
        sum += values.next().get();
      }
      output.collect( word, sum );
    }
    Listing 1 does not contain code that actually works, but it does illustrate from a high level how such a task would be implemented in a handful of lines of code. Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute your data, in blocks, to the various slave nodes in its cluster. Then when you did submit your job to Hadoop, it would distribute your code to the slave nodes and have each map and reduce task process data on that slave node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this example), and output the word as the key and the value as “1”. The reduce task would then receive all instances of values mapped to a particular key; for example, it may have 1,000 values of “1” mapped to the word “apple”, which would mean that there are 1,000 occurrences of “apple” in the text. The reduce task sums up all of the values and outputs that as its result. Then your Hadoop job would be set up to handle all of the output from the various reduce tasks.
    This way of thinking is quite a bit different from how you might have approached the problem without using MapReduce, but it will become clearer in the next article on writing MapReduce applications, in which we build several working examples.
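
Listing 1 is only pseudocode, so for readers who want something that compiles, the sketch below shows roughly how the same word count looks against the standard Hadoop Java MapReduce API. The class names and the input/output paths are illustrative, not code used elsewhere in this post.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}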

HDFS file system – There are some drawbacks to HDFS use. HDFS handles continuous updates (write many) less well than a traditional relational database management system. Also, HDFS cannot be directly mounted onto the existing operating system. Hence getting data into and out of the HDFS file system can be awkward.
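
That said, data can be moved in and out of HDFS programmatically. A minimal sketch using the Java FileSystem API is shown below; the paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and the other cluster settings
        // from the core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS.
        fs.copyFromLocalFile(new Path("/tmp/Batting.csv"), new Path("/user/hue/Batting.csv"));

        // Copy a result file back out of HDFS.
        fs.copyToLocalFile(new Path("/user/hue/output/part-r-00000"), new Path("/tmp/part-r-00000"));

        fs.close();
    }
}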

In addition to Hadoop itself, there are multiple open source projects built on top of Hadoop. The major projects are described below.
Hive

Hive is a data warehouse framework built on top of Hadoop, developed at Facebook, used for ad hoc querying with an SQL-type query language and also for more complex analysis. Users define tables and columns. Data is loaded into and retrieved through these tables. Hive QL, an SQL-like query language, is used to create summaries, reports, and analyses. Hive queries launch MapReduce jobs. Hive is designed for batch processing, not online transaction processing – unlike HBase (see below), Hive does not offer real-time queries.
Pig

Pig is a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop. Pig is designed for batch processing of data. Pig’s infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs. Pig is a Java client-side application, and users install it locally – nothing is altered on the Hadoop cluster itself. Grunt is the Pig interactive shell.
Mahout and other expansions to Hadoop programming capabilities

Hadoop is not just for large-scale data processing. Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop. Current algorithm focus areas of Mahout: clustering, classification, data mining (frequent itemset), and evolutionary programming. Obviously, the Mahout clustering and classifier algorithms have direct relevance in bioinformatics - for example, for clustering of large gene expression data sets, and as classifiers for biomarker identification. In regard to clustering, we may note that Hadoop MapReduce-based clustering work has also been explored by, among others, M. Ngazimbi and by K. Heafield at Google (Hadoop design and k-Means clustering). The many bioinformaticians that use R may be interested in the “R and Hadoop Integrated Processing Environment” (RHIPE), S. Guha’s Java package that integrates the R environment with Hadoop so that it is possible to code MapReduce algorithms in R. (Also note the IBM R-based Ricardo project.) For the growing community of Python users in bioinformatics, Pydoop, a Python MapReduce and HDFS API for Hadoop that allows complete MapReduce applications to be written in Python, is available. These are samplings from the large number of developers working on additional libraries for Hadoop. One last example in this limited space: the new programming language Clojure, which is predominantly a functional language, i.e., a dialect of Lisp that targets the Java Virtual Machine, has been given a library (author S. Sierra) to aid in writing Hadoop jobs.
Cascading

Cascading is a project providing a programming API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. Cascading is a thin, open source Java library that sits on top of the Hadoop MapReduce layer. Cascading provides a query processing API that allows programmers to operate at a higher level than MapReduce, and to more quickly assemble complex distributed processes, and schedule them based on dependencies.
HBase

Lastly, an important Apache Hadoop-based project is HBase, which is modeled on Google's BigTable database. HBase adds a distributed, fault-tolerant, scalable database, built on top of the HDFS file system, with random real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a time stamp. A cell value at a given row and column is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) → Value. HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat. There is no HBase single point of failure. HBase uses Zookeeper, another Hadoop subproject, for management of partial failures.
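
To give a feel for that Java client API, here is a minimal sketch using the HBase 1.x client; the table name, column family, and values are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("genes"))) {

            // Write one cell: row key "gene-0001", column family "expr", column "sample1".
            Put put = new Put(Bytes.toBytes("gene-0001"));
            put.addColumn(Bytes.toBytes("expr"), Bytes.toBytes("sample1"), Bytes.toBytes("7.2"));
            table.put(put);

            // Read the same cell back.
            Get get = new Get(Bytes.toBytes("gene-0001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("expr"), Bytes.toBytes("sample1"));
            System.out.println(Bytes.toString(value));
        }
    }
}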

All table accesses are by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate. There is no SQL query language in base HBase. However, there is also a Hive/HBase integration project that allows Hive QL statements access to HBase tables for both reading and inserting. Also, there is the independent HBql project to add a dialect of SQL and JDBC bindings for HBase.

A table is made up of regions. Each region is defined by a startKey and EndKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on-the-fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. In addition to being able to scale to petabyte size data sets, we may note the ease of integration of disparate data sources into a small number of HBase tables for building a data workspace, with different columns possibly defined (on-the-fly) for different rows in the same table. Such facility is also important. (See the biological integration discussion below.)

In addition to HBase, other scalable random access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive for the translation layer. There are also non-Hadoop scalable alternatives based on the Google BigTable concept, such as Hypertable and Cassandra. And there are other so-called NoSQL scalable databases of possible interest: Project Voldemort, Dynamo (used for Amazon’s Simple Storage Service (S3)), and Tokyo Tyrant, among others. However, these non-Hadoop and non-BigTable database systems lie outside of our discussion here.

References:
http://www.informit.com/articles/article.aspx?p=2008905
http://www.biomedcentral.com/1471-2105/11/S12/S1

Friday, 2 October 2015

SAS Basics By Ron Cody

Using the SAS data set BloodPressure, produce a report showing Gender, Age, SBP, and DBP. Order the report by Gender and Age.

Code

Output



Using the SAS data set College, report the mean GPA for the following categories of ClassRank: 0–50 = bottom half, 51–74 = 3rd quartile, and 75–100 = top quarter. Do this by creating an appropriate format. Do not use a DATA step.


Code 




Output


Using the data set Blood, produce frequencies for the variable Chol (cholesterol). Use a format to group the frequencies into three groups: low to 200 (normal), 201 and higher (high), and missing. Run PROC FREQ twice, once using the MISSING option, and once without. Compare the percentages in both listings. 


Code


Output




Produce the following table. Note that the keyword ALL has been renamed Total, Gender is formatted, and ClassRank (a continuous numeric variable) has been formatted into two groups (0–70 and 71 and higher). 

Code


Output 




Produce the following table. Note that the ALL column has been renamed Total.


 Code


Output