Map-Reduce Data Processing with Hadoop and Spark
What is MapReduce?
MapReduce is a programming framework that allows us to perform parallel processing on large data sets across a distributed cluster of machines.
- MapReduce consists of two distinct tasks — Map and Reduce.
- As the name MapReduce suggests, the reduce phase takes place only after the map phase has completed.
- The first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
- The output of a Mapper or map job (key-value pairs) is input to the Reducer.
- The reducer receives key-value pairs from multiple map jobs.
- The reducer then aggregates those intermediate key-value pairs into a smaller set of tuples, which forms the final output, as illustrated below.
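As a small illustration (using made-up input lines, not the demo.txt file used later), the word-count flow looks like this:

Input lines: "deer bear river" and "river deer deer"
Map output: (deer, 1), (bear, 1), (river, 1) and (river, 1), (deer, 1), (deer, 1)
Shuffle and sort: bear -> [1], deer -> [1, 1, 1], river -> [1, 1]
Reduce output: (bear, 1), (deer, 3), (river, 2)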
A Word Count Example of MapReduce
- We have generated a text file (demo.txt) with a script that fills it with random string data.
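The generator script itself is not shown here; as a minimal sketch, any shell loop that writes repeated words will do, for example:

# write roughly 100,000 lines of repeating words into demo.txt
for i in $(seq 1 100000); do
  echo "hadoop spark mapreduce demo word$((RANDOM % 10))"
done > /home/auriga/demo.txt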
- Now we start Hadoop on the master node with the following command from Hadoop's sbin directory.
start-all.sh
- Now we put our file on the HDFS file system with the following command.
hadoop fs -put '/home/auriga/demo.txt' /auriga/inputdir/
- To run a Hadoop program we need a JAR file to execute, so we create a JAR file containing our Java word-count program.
- Open Eclipse -> create a Java project -> add external JAR files (go to the Hadoop installation directory -> share/hadoop, and add the JARs from the hdfs and common folders).
- Then write the code for the Mapper and Reducer classes.
package hadoop_map_reduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts collected for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable count = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            count.set(sum);
            context.write(key, count);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion must be called only once; its result decides the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
- Now compile it and check whether any errors appear. If not, export it as a JAR file.
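If you prefer the command line to Eclipse, a rough equivalent (assuming WordCount.java is in the current directory and the hadoop command is on the PATH) looks like this:

# compile against the Hadoop client libraries and package the classes into a JAR
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java
jar cf hadoopWordCount.jar -C classes .

A JAR built this way has no main class recorded in its manifest, so you would also pass the fully qualified class name (hadoop_map_reduce.WordCount) to the hadoop jar command.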
- Now open a terminal and run the MapReduce job.
hadoop jar '/home/auriga/hadoopWordCount.jar' /auriga/inputdir/demo.txt /auriga/inputdir/outputdir14/
- The pattern here is "hadoop jar <jar_path> <hdfs_input_file_path> <hdfs_output_path>", where the HDFS output path is the directory in which the result is saved (it must not already exist).
- We can then see the result in the HDFS output directory.
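For example, to print the word counts directly (assuming the default single reducer, whose output file is named part-r-00000):

hadoop fs -cat /auriga/inputdir/outputdir14/part-r-00000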
Now let's perform the same word count with Spark.
- Install Spark, open a terminal, and type "spark-shell".
- Now run the following commands in the Spark shell.
val text = sc.textFile("hdfs://192.168.43.155:9000/auriga/inputdir/demo.txt")
- We create a variable text that reads the text file from HDFS.
val count = text.flatMap(line => line.split(" "))
- Now we split each line of the text file on spaces.
val map = count.map(word => (word, 1))
- We perform the map operation, pairing each word with a count of 1.
val reduce = map.reduceByKey(_ + _)
- We perform the reduce operation, summing the counts for each word.
reduce.collect
Remember that Spark evaluates RDD transformations lazily: flatMap, map, and reduceByKey only build up a lineage of operations. The computation actually runs when we call an action such as collect.
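To keep the result instead of just printing it in the shell, we can call another action that writes the pairs back to HDFS; the output directory name below is only an example:

// save the (word, count) pairs as text files in a new HDFS directory
reduce.saveAsTextFile("hdfs://192.168.43.155:9000/auriga/inputdir/spark_output")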
Observation
In this experiment Spark is much faster than Hadoop. The Hadoop MapReduce job takes a noticeable amount of time to process, whereas Spark returns the result within seconds.
Conclusion
- To conclude, we see that Spark is faster than Hadoop, but the question is why. Here is the answer.
- Spark is faster because it keeps intermediate data in random access memory (RAM) instead of reading and writing it to disk. Hadoop stores data on disk across multiple nodes and processes it in batches via MapReduce.
For a deeper understanding, you can visit these links.