Map-Reduce Data Processing with Hadoop and Spark
What is MapReduce?
MapReduce is a programming framework that allows us to perform parallel processing on large data sets across a distributed cluster of machines.
- MapReduce consists of two distinct tasks — Map and Reduce.
- As the name MapReduce suggests, the reduce phase takes place only after the map phase has completed.
- The first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
- The output of a Mapper or map job (key-value pairs) is input to the Reducer.
- The reducer receives key-value pairs from multiple map jobs.
- The reducer then aggregates those intermediate key-value pairs into a smaller set of tuples, which forms the final output, as illustrated below.
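As a small illustration (using made-up input lines, not the demo.txt file used later), the word-count flow looks like this:

Input lines: "deer bear river" and "river deer deer"
Map output: (deer, 1), (bear, 1), (river, 1) and (river, 1), (deer, 1), (deer, 1)
Shuffle and sort: bear -> [1], deer -> [1, 1, 1], river -> [1, 1]
Reduce output: (bear, 1), (deer, 3), (river, 2)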
A Word Count Example of MapReduce
- We have generated a text file (demo.txt) with a script that fills it with random string data.
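The generator script itself is not shown here; as a minimal sketch, any shell loop that writes repeated words will do, for example:

# write roughly 100,000 lines of repeating words into demo.txt
for i in $(seq 1 100000); do
  echo "hadoop spark mapreduce demo word$((RANDOM % 10))"
done > /home/auriga/demo.txt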
- Now we start Hadoop on the master node with the following command from Hadoop's sbin directory.
start-all.sh
- Now we put our file on the HDFS file system with the following command.
hadoop fs -put '/home/auriga/demo.txt' /auriga/inputdir/
- To run a Hadoop program we need a JAR file to execute, so we create a JAR file containing our Java word-count program.
- Open Eclipse -> create a Java project -> add external JAR files (go to the Hadoop installation directory -> share/hadoop, and add the JARs from the hdfs and common folders).
- Then write the code for the Mapper and Reducer classes.
package hadoop_map_reduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable count = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            count.set(sum);
            context.write(key, count);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion must be called only once; its result decides the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
- Now compile it and check whether any errors appear. If not, export it as a JAR file.
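If you prefer the command line to Eclipse, a rough equivalent (assuming WordCount.java is in the current directory and the hadoop command is on the PATH) looks like this:

# compile against the Hadoop client libraries and package the classes into a JAR
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java
jar cf hadoopWordCount.jar -C classes .

A JAR built this way has no main class recorded in its manifest, so you would also pass the fully qualified class name (hadoop_map_reduce.WordCount) to the hadoop jar command.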
- Now open a terminal and run the MapReduce job.
hadoop jar '/home/auriga/hadoopWordCount.jar' /auriga/inputdir/demo.txt /auriga/inputdir/outputdir14/
- The pattern here is "hadoop jar <jar_path> <hdfs_input_file_path> <hdfs_output_path>", where the HDFS output path is the directory in which the result is saved (it must not already exist).
- We can then see the result in the HDFS output directory.
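For example, to print the word counts directly (assuming the default single reducer, whose output file is named part-r-00000):

hadoop fs -cat /auriga/inputdir/outputdir14/part-r-00000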
Now let's perform the same word count with Spark.
- Install Spark, open a terminal, and type "spark-shell".
- Now run the following commands in the Spark shell.
val text = sc.textFile("hdfs://192.168.43.155:9000/auriga/inputdir/demo.txt")
- We create a variable text that reads the text file from HDFS.
val count = text.flatMap(line => line.split(" "))
- Now we split each line of the text file on spaces.
val map = count.map(word => (word, 1))
- We perform the map operation, pairing each word with a count of 1.
val reduce = map.reduceByKey(_ + _)
- We perform the reduce operation, summing the counts for each word.
reduce.collect
Remember that Spark evaluates RDD transformations lazily: flatMap, map, and reduceByKey only build up a lineage of operations. The computation actually runs when we call an action such as collect.
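To keep the result instead of just printing it in the shell, we can call another action that writes the pairs back to HDFS; the output directory name below is only an example:

// save the (word, count) pairs as text files in a new HDFS directory
reduce.saveAsTextFile("hdfs://192.168.43.155:9000/auriga/inputdir/spark_output")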
Observation
In this experiment Spark is much faster than Hadoop. The Hadoop MapReduce job takes a noticeable amount of time to process, whereas Spark returns the result within seconds.
Conclusion
- To conclude, we see that Spark is faster than Hadoop, but the question is why. Here is the answer.
- Spark is faster because it keeps intermediate data in random access memory (RAM) instead of reading and writing it to disk. Hadoop stores data on disk across multiple nodes and processes it in batches via MapReduce.
For a deeper understanding, you can visit these links.