What is Hadoop Streaming and Why Should I Use It?

Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project and open source software platform for scalable, distributed computing. Hadoop can provide fast and reliable analysis of both structured data and unstructured data. 

Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop consists of two core components: HDFS (the Hadoop Distributed File System) for storage and MapReduce for processing.

HDFS uses a master/slave architecture in which one device (the master, called the NameNode) controls one or more other devices (the slaves, called DataNodes).

MapReduce takes care of scheduling tasks, monitoring them, and re-executing any tasks that fail.
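To make the division of labor concrete, here is a single-process sketch of the three phases MapReduce schedules across a cluster, using a toy "maximum temperature per year" task (the data and function names are our illustration, not part of Hadoop):

```python
from itertools import groupby
from operator import itemgetter

# Illustrative input records: "year,temperature" lines.
records = ["1949,78", "1950,22", "1949,111", "1950,0"]

# Map phase: parse each record into a (year, temperature) pair.
mapped = [(r.split(",")[0], int(r.split(",")[1])) for r in records]

# Shuffle phase: the framework sorts and groups the pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: fold each group down to one value, here the maximum.
result = {year: max(t for _, t in group)
          for year, group in groupby(mapped, key=itemgetter(0))}
print(result)  # {'1949': 111, '1950': 22}
```

On a real cluster each phase runs in parallel on many machines; the framework, not your code, handles the sorting, grouping, and failure recovery in between.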


Even though the Hadoop framework is written in Java, programs for Hadoop need not be written in Java; they can also be developed in other languages such as Python or C++. You could translate your Python code into a Java jar file using Jython, but this is not convenient.


Or you can write the code in Python itself. The "trick" is the Hadoop Streaming API: Hadoop passes input records to your mapper and reducer scripts over standard input and collects their results from standard output, so any executable that reads stdin and writes stdout can serve as a mapper or reducer.
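A minimal Streaming word count might look like the sketch below. The word-count task and the script names mapper.py/reducer.py are our illustration; in practice each function would live in its own script ending with `mapper(sys.stdin, sys.stdout)` or `reducer(sys.stdin, sys.stdout)` respectively:

```python
import io

def mapper(stdin, stdout):
    # Emit a tab-separated "word<TAB>1" pair for every word on stdin.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Hadoop sorts mapper output by key before the reduce phase, so
    # equal words arrive on consecutive lines and a running total works.
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current:
            total += int(count)
        else:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Local demo: wire the two together with an explicit sort, mimicking
# the shuffle phase Hadoop inserts between map and reduce.
mapped = io.StringIO()
mapper(io.StringIO("to be or not to be\n"), mapped)
shuffled = io.StringIO(
    "".join(sorted(mapped.getvalue().splitlines(keepends=True))))
result = io.StringIO()
reducer(shuffled, result)
print(result.getvalue())
```

The tab-separated key/value convention is what Streaming uses by default to split each line into a key and a value between the map and reduce phases.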

Test your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Then, before running the actual job, copy the input files from the local file system into Hadoop's HDFS.
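The commands below sketch that workflow: a local smoke test via a shell pipe, then the HDFS copy and the Streaming job submission. The HDFS paths, input file name, and the exact streaming jar location are illustrative and depend on your installation:

```shell
# Local smoke test: emulate Hadoop's map -> sort -> reduce pipeline
# (assumes mapper.py and reducer.py sit in the current directory).
echo "to be or not to be" | python mapper.py | sort | python reducer.py

# Copy the input from the local file system into HDFS.
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input

# Launch the Streaming job; -files ships the scripts to the cluster.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/hduser/input \
    -output /user/hduser/output
```

Note that the `-output` directory must not already exist, or the job will refuse to start.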



Besides Streaming, you can also use Pydoop, a Python package that provides a higher-level Python API for writing Hadoop MapReduce applications.
