What is Hadoop Streaming and Why Should I Use It?

Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured data and unstructured data. 

Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop consists of two core components: HDFS (the Hadoop Distributed File System) for storage and MapReduce for processing.

HDFS uses a master/slave architecture in which one device (the master, called the NameNode) controls one or more other devices (the slaves, called DataNodes).

MapReduce takes care of scheduling tasks, monitoring them, and re-executing any tasks that fail.
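To make the division of labor concrete, here is a single-process sketch of the three phases MapReduce schedules across a cluster, using a toy "maximum temperature per year" task (the data and function names are our illustration, not part of Hadoop):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input records: "year,temperature" lines.
records = ["1949,78", "1950,22", "1949,111", "1950,0"]

# Map phase: parse each record into a (year, temperature) pair.
mapped = [(r.split(",")[0], int(r.split(",")[1])) for r in records]

# Shuffle phase: the framework sorts and groups the pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: fold each group down to one value, here the maximum.
result = {year: max(t for _, t in group)
          for year, group in groupby(mapped, key=itemgetter(0))}
print(result)  # {'1949': 111, '1950': 22}
```

On a real cluster each phase runs in parallel on many machines; the framework, not your code, handles the sorting, grouping, and failure recovery in between.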


Even though the Hadoop framework is written in Java, programs for Hadoop need not be written in Java; they can also be developed in other languages such as Python or C++. You could translate your Python code into a Java jar file using Jython, but this is not convenient.


Or you can write the code in Python itself. The "trick" is the Hadoop Streaming API: Hadoop passes input records to your mapper and reducer scripts over standard input and collects their results from standard output, so any executable that reads stdin and writes stdout can serve as a mapper or reducer.
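A minimal Streaming word count might look like the sketch below. The word-count task and the script names mapper.py/reducer.py are our illustration; in practice each function would live in its own script ending with `mapper(sys.stdin, sys.stdout)` or `reducer(sys.stdin, sys.stdout)` respectively:

```python
import io

def mapper(stdin, stdout):
    # Emit a tab-separated "word<TAB>1" pair for every word on stdin.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Hadoop sorts mapper output by key before the reduce phase, so
    # equal words arrive on consecutive lines and a running total works.
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current:
            total += int(count)
        else:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Local demo: wire the two together with an explicit sort, mimicking
# the shuffle phase Hadoop inserts between map and reduce.
mapped = io.StringIO()
mapper(io.StringIO("to be or not to be\n"), mapped)
shuffled = io.StringIO(
    "".join(sorted(mapped.getvalue().splitlines(keepends=True))))
result = io.StringIO()
reducer(shuffled, result)
print(result.getvalue())
```

The tab-separated key/value convention is what Streaming uses by default to split each line into a key and a value between the map and reduce phases.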

Test your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Then, before running the actual job, copy the input files from the local file system into Hadoop's HDFS.
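The commands below sketch that workflow: a local smoke test via a shell pipe, then the HDFS copy and the Streaming job submission. The HDFS paths, input file name, and the exact streaming jar location are illustrative and depend on your installation:

```shell
# Local smoke test: emulate Hadoop's map -> sort -> reduce pipeline
# (assumes mapper.py and reducer.py sit in the current directory).
echo "to be or not to be" | python mapper.py | sort | python reducer.py

# Copy the input from the local file system into HDFS.
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input

# Launch the Streaming job; -files ships the scripts to the cluster.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/hduser/input \
    -output /user/hduser/output
```

Note that the `-output` directory must not already exist, or the job will refuse to start.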



Besides Streaming, you can also use Pydoop, a Python package that provides a higher-level Python API for writing Hadoop MapReduce applications.
