Saturday 21 November 2015

Hadoop MapReduce

Hadoop is a software framework that can be installed on a commodity Linux cluster to permit large-scale distributed data analysis. No hardware modification is needed other than possible changes to meet minimum recommended RAM, disk space, etc. requirements per node (e.g., see Cloudera's guidelines). The initial version of Hadoop was created in 2004 by Doug Cutting (and named after his son's stuffed elephant). Hadoop became a top-level Apache Software Foundation project in January 2008. There have been many contributors, both academic and commercial (Yahoo being the largest such contributor), and Hadoop has a broad and rapidly growing user community.

Components - Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS), inspired by Google's file system, as well as a Java-based API that allows parallel processing across the nodes of the cluster using the MapReduce paradigm. Use of code written in other languages, such as Python and C, is possible through Hadoop Streaming, a utility that allows users to create and run jobs with any executables as the mapper and/or the reducer. Also, Hadoop comes with a JobTracker and per-node TaskTrackers that keep track of job execution across the nodes of the cluster.

Data locality – Hadoop tries to automatically colocate the data with the computing node. That is, Hadoop schedules Map tasks close to the data on which they will work, with "close" meaning the same node or, at least, the same rack. This is a principal factor in Hadoop's performance. In April 2008 a Hadoop program, running on a 910-node cluster, broke a world record, sorting a terabyte of data in less than 3.5 minutes. Speed improvements have continued as Hadoop has matured.

Fault-tolerant, shared-nothing architecture - tasks must have no dependence on each other, with the exception of mappers feeding into reducers under Hadoop's control. Hadoop can detect task failure and restart programs on other healthy nodes. That is, node failures are handled automatically, with tasks restarted as needed. A single point of failure currently remains at the single NameNode for the HDFS file system.

Reliability – data is replicated across multiple nodes; RAID storage is not needed.

Programming support - unlike, for example, parallel programming using MPI, data flow is implicit and handled automatically; it does not need coding. For tasks fitting the MapReduce paradigm, Hadoop simplifies the development of large-scale, fault-tolerant, distributed applications on a cluster of (possibly heterogeneous) commodity machines.

    MapReduce

    MapReduce is a functional programming paradigm that is well suited to parallel processing of huge data sets distributed across a large number of computers; in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps:
    1. Map: The map step essentially solves a small problem: Hadoop splits the input into small, workable subsets and assigns each one to a map process to solve.
    2. Reduce: The reducer combines the results of the mapping processes and forms the output of the MapReduce operation.
    My Map definition purposely used the word "essentially" because part of what gives the Map step its name is its implementation. While it does solve small, workable problems, it does so by mapping specific keys to specific values. For example, if we were to count the number of times each word appears in a book, our MapReduce application would output each word as a key and the value as the number of times it is seen. More specifically, the book would probably be broken up into sentences or paragraphs, the Map step would return each word mapped either to the number of times it appears in the sentence (or to "1" for each occurrence of every word), and then the reducer would combine the keys by adding their values together.
    Listing 1 shows Java-flavored pseudocode for how the map and reduce functions might work to solve this problem.

    Listing 1 - Java/Pseudocode for MapReduce

    // Pseudocode modeled on Hadoop's older OutputCollector-style API
    public void map( String name, String sentence, OutputCollector output ) {
      // emit each word in the sentence with a count of 1
      for( String word : sentence.split( "\\s+" ) ) {
        output.collect( word, 1 );
      }
    }

    public void reduce( String word, Iterator<Integer> values, OutputCollector output ) {
      int sum = 0;
      // add up every count emitted for this word
      while( values.hasNext() ) {
        sum += values.next();
      }
      output.collect( word, sum );
    }
    Listing 1 does not contain code that actually works, but it does illustrate at a high level how such a task would be implemented in a handful of lines of code. Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute your data, in blocks, to the various slave nodes in its cluster. When you then submitted your job, Hadoop would distribute your code to the slave nodes and have each map and reduce task process data on that slave node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this example) and output each word as the key with a value of "1". The reduce task would then receive all instances of values mapped to a particular key; for example, it might receive 1,000 values of "1" mapped to the word "apple", which would mean that the word "apple" appears 1,000 times in the text. The reduce task sums up all of the values and outputs that as its result. Finally, your Hadoop job would be set up to handle all of the output from the various reduce tasks.
    This way of thinking is quite a bit different from how you might have approached the problem without using MapReduce, but it will become clearer in the next article on writing MapReduce applications, in which we build several working examples.
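
For comparison with the pseudocode above, here is a sketch (mine, not code from the sources referenced at the end of this post) of what a complete word count can look like against Hadoop's newer Java API (the org.apache.hadoop.mapreduce classes). The input and output paths come from the command line and are placeholders; otherwise this follows the standard WordCount pattern.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper receives one line of text at a time and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable( 1 );
    private final Text word = new Text();

    @Override
    protected void map( LongWritable offset, Text line, Context context )
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer( line.toString() );
      while ( tokens.hasMoreTokens() ) {
        word.set( tokens.nextToken() );
        context.write( word, ONE );
      }
    }
  }

  // The reducer receives every count emitted for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce( Text word, Iterable<IntWritable> counts, Context context )
        throws IOException, InterruptedException {
      int sum = 0;
      for ( IntWritable count : counts ) {
        sum += count.get();
      }
      context.write( word, new IntWritable( sum ) );
    }
  }

  public static void main( String[] args ) throws Exception {
    Job job = Job.getInstance( new Configuration(), "word count" );
    job.setJarByClass( WordCount.class );
    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( SumReducer.class );   // optional local pre-aggregation
    job.setReducerClass( SumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    FileInputFormat.addInputPath( job, new Path( args[0] ) );
    FileOutputFormat.setOutputPath( job, new Path( args[1] ) );
    System.exit( job.waitForCompletion( true ) ? 0 : 1 );
  }
}

Packaged into a jar and launched with hadoop jar wordcount.jar WordCount <input> <output>, a job like this leaves input splitting, data-local scheduling of map tasks, the shuffle of (word, count) pairs to the reducers, and retries of failed tasks to Hadoop itself.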

HDFS file system – There are some drawbacks to HDFS use. HDFS handles continuous updates (write many) less well than a traditional relational database management system. Also, HDFS cannot be directly mounted onto the existing operating system. Hence getting data into and out of the HDFS file system can be awkward.
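
Data movement can nonetheless be scripted. Below is a minimal sketch (an illustration of mine, not code from the referenced articles) using Hadoop's Java FileSystem API; the file paths are hypothetical, and the same operations are available from the command line via hadoop fs -put and hadoop fs -get.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
  public static void main( String[] args ) throws Exception {
    // Picks up the cluster's file system settings from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get( conf );

    // Copy a local file into HDFS, then copy a result file back out (paths are placeholders).
    fs.copyFromLocalFile( new Path( "/tmp/reads.fastq" ),
                          new Path( "/data/reads.fastq" ) );
    fs.copyToLocalFile( new Path( "/data/results/part-r-00000" ),
                        new Path( "/tmp/results.txt" ) );
    fs.close();
  }
}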

In addition to Hadoop itself, there are multiple open source projects built on top of Hadoop. Major projects are described below.
Hive

Hive is a data warehouse framework built on top of Hadoop, developed at Facebook, and used for ad hoc querying with an SQL-like query language as well as for more complex analysis. Users define tables and columns; data is loaded into and retrieved through these tables. Hive QL, the SQL-like query language, is used to create summaries, reports, and analyses. Hive queries launch MapReduce jobs. Hive is designed for batch processing, not online transaction processing – unlike HBase (see below), Hive does not offer real-time queries.
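
As an illustration of the kind of access Hive provides (a sketch of mine, not code from the referenced articles), a Hive QL query can be submitted from Java through the HiveServer2 JDBC driver. The host name "hive-host" and the "words" table below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main( String[] args ) throws Exception {
    Class.forName( "org.apache.hive.jdbc.HiveDriver" );   // HiveServer2 JDBC driver
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hive-host:10000/default", "user", "" );
    Statement stmt = con.createStatement();

    // A Hive QL aggregation; Hive compiles this into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT word, COUNT(*) AS freq FROM words GROUP BY word" );
    while ( rs.next() ) {
      System.out.println( rs.getString( 1 ) + "\t" + rs.getLong( 2 ) );
    }
    con.close();
  }
}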
Pig

Pig is a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop. Pig is designed for batch processing of data. Pig's infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs. Pig is a Java client-side application that users install locally – nothing is altered on the Hadoop cluster itself. Grunt is Pig's interactive shell.
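
Beyond Grunt, Pig Latin can also be embedded in a Java client through the PigServer API. The word-count sketch below is illustrative only (mine, with hypothetical paths and aliases), but it shows how a few Pig Latin statements replace a hand-written MapReduce program.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main( String[] args ) throws Exception {
    PigServer pig = new PigServer( ExecType.MAPREDUCE );  // run against the Hadoop cluster

    // Each registerQuery() call adds one Pig Latin statement to the logical plan.
    pig.registerQuery( "lines = LOAD 'input/books' AS (line:chararray);" );
    pig.registerQuery( "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;" );
    pig.registerQuery( "grouped = GROUP words BY word;" );
    pig.registerQuery( "counts = FOREACH grouped GENERATE group, COUNT(words);" );

    // store() triggers compilation into a sequence of MapReduce jobs.
    pig.store( "counts", "output/wordcounts" );
  }
}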
Mahout and other expansions to Hadoop programming capabilities

Hadoop is not just for large-scale data processing. Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop. Current algorithm focus areas of Mahout are clustering, classification, data mining (frequent itemset mining), and evolutionary programming. Obviously, the Mahout clustering and classifier algorithms have direct relevance in bioinformatics - for example, for clustering of large gene expression data sets, and as classifiers for biomarker identification. In regard to clustering, we may note that Hadoop MapReduce-based clustering work has also been explored by, among others, M. Ngazimbi and by K. Heafield at Google (Hadoop design and k-Means clustering). The many bioinformaticians who use R may be interested in the "R and Hadoop Integrated Processing Environment" (RHIPE), S. Guha's Java package that integrates the R environment with Hadoop so that it is possible to code MapReduce algorithms in R. (Also note the IBM R-based Ricardo project.) For the growing community of Python users in bioinformatics, Pydoop, a Python MapReduce and HDFS API for Hadoop that allows complete MapReduce applications to be written in Python, is available. These are samplings from the large number of developers working on additional libraries for Hadoop. One last example in this limited space: the new programming language Clojure, a predominantly functional dialect of Lisp that targets the Java Virtual Machine, has been given a library (by S. Sierra) to aid in writing Hadoop jobs.
Cascading

Cascading is a project providing a programming API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. Cascading is a thin, open source Java library that sits on top of the Hadoop MapReduce layer. It provides a query processing API that allows programmers to operate at a higher level than MapReduce, to more quickly assemble complex distributed processes, and to schedule them based on dependencies.
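
To give a flavor of the API, the sketch below is loosely modeled on the word-count example in the Cascading 1.x user guide; exact class names and packages vary between Cascading releases, and the input/output paths are placeholders.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CascadingWordCount {
  public static void main( String[] args ) {
    // Source and sink "taps" bind the pipe assembly to HDFS paths.
    Tap source = new Hfs( new TextLine( new Fields( "line" ) ), args[0] );
    Tap sink = new Hfs( new TextLine(), args[1], SinkMode.REPLACE );

    // Build the pipe assembly: split lines into words, group by word, and count.
    Pipe assembly = new Pipe( "wordcount" );
    assembly = new Each( assembly, new Fields( "line" ),
        new RegexGenerator( new Fields( "word" ), "\\S+" ) );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    assembly = new Every( assembly, new Count( new Fields( "count" ) ) );

    // The FlowConnector plans the assembly into a sequence of MapReduce jobs.
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass( properties, CascadingWordCount.class );
    Flow flow = new FlowConnector( properties ).connect( "word-count", source, sink, assembly );
    flow.complete();
  }
}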
HBase

Lastly, an important Apache Hadoop-based project is HBase, which is modeled on Google's BigTable database. HBase adds a distributed, fault-tolerant, scalable database, built on top of the HDFS file system, with random real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a time stamp. A cell value at a given row and column is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) → Value. HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat. There is no HBase single point of failure. HBase uses Zookeeper, another Hadoop subproject, for management of partial failures.
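
As a flavor of that client API (a sketch of mine based on the classic HTable-style client, not code from the referenced articles; the "gene_expression" table, its "samples" column family, and the cell values are hypothetical), a random write followed by a random read might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main( String[] args ) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    HTable table = new HTable( conf, "gene_expression" );

    // Write one cell: (row "gene42", column family "samples", column "s001").
    Put put = new Put( Bytes.toBytes( "gene42" ) );
    put.add( Bytes.toBytes( "samples" ), Bytes.toBytes( "s001" ), Bytes.toBytes( "7.25" ) );
    table.put( put );

    // Random read of the same cell; HBase also keeps a timestamp per cell version.
    Get get = new Get( Bytes.toBytes( "gene42" ) );
    Result result = table.get( get );
    byte[] value = result.getValue( Bytes.toBytes( "samples" ), Bytes.toBytes( "s001" ) );
    System.out.println( Bytes.toString( value ) );

    table.close();
  }
}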

All table accesses are by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate. There is no SQL query language in base HBase. However, there is also a Hive/HBase integration project that allows Hive QL statements access to HBase tables for both reading and inserting. Also, there is the independent HBql project to add a dialect of SQL and JDBC bindings for HBase.

A table is made up of regions. Each region is defined by a startKey and an endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on the fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. In addition to being able to scale to petabyte-size data sets, we may note the ease of integrating disparate data sources into a small number of HBase tables for building a data workspace, with different columns possibly defined (on the fly) for different rows in the same table. Such flexibility is also important.

In addition to HBase, other scalable random access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive for the translation layer. There are also non-Hadoop scalable alternatives based on the Google BigTable concept, such as Hypertable and Cassandra. And there are other so-called NoSQL scalable databases of possible interest: Project Voldemort, Dynamo (used for Amazon's Simple Storage Service (S3)), and Tokyo Tyrant, among others. However, these non-Hadoop and non-BigTable database systems lie outside of our discussion here.

References:
http://www.informit.com/articles/article.aspx?p=2008905
http://www.biomedcentral.com/1471-2105/11/S12/S1