Hadoop in a snap

4 November 2018

When technology is progressing well, it becomes easier to use.

So you want to try Hadoop locally - not for testing or production, but for playing and development. But all of the installation examples look rather complicated, and you know that everyone produces their own distribution, so why are you downloading and extracting a tarball?

Anyway, here’s an easy installation on Ubuntu 18.04.

snap install --beta hadoop

Now check that you actually have something installed, and run an example. Parts of this follow the Hadoop docs, but with commands specific to the Ubuntu snap, which you might find easier to follow.

hadoop jar /snap/hadoop/current/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.jar
#... pick an example that doesn't depend on hdfs for now
sudo hadoop jar /snap/hadoop/current/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 8 500

Even if you already know π to a thousand places, it’s fun to have a Monte Carlo method tell you approximately what it is:

Number of Maps  = 8
Samples per Map = 500
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Starting Job
18/11/04 22:36:13 INFO client.RMProxy: Connecting to ResourceManager at /
18/11/04 22:36:13 INFO input.FileInputFormat: Total input paths to process : 8
18/11/04 22:36:14 INFO mapreduce.JobSubmitter: number of splits:8
18/11/04 22:36:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541542844479_0004
18/11/04 22:36:14 INFO impl.YarnClientImpl: Submitted application application_1541542844479_0004
18/11/04 22:36:14 INFO mapreduce.Job: The url to track the job: http://local-ubuntu:8088/proxy/application_1541542844479_0004/
18/11/04 22:36:14 INFO mapreduce.Job: Running job: job_1541542844479_0004
18/11/04 22:36:18 INFO mapreduce.Job: Job job_1541542844479_0004 running in uber mode : false
18/11/04 22:36:18 INFO mapreduce.Job:  map 0% reduce 0%
18/11/04 22:36:25 INFO mapreduce.Job:  map 63% reduce 0%
18/11/04 22:36:26 INFO mapreduce.Job:  map 75% reduce 0%
18/11/04 22:36:27 INFO mapreduce.Job:  map 88% reduce 0%
18/11/04 22:36:28 INFO mapreduce.Job:  map 100% reduce 0%
18/11/04 22:36:30 INFO mapreduce.Job:  map 100% reduce 100%
18/11/04 22:36:30 INFO mapreduce.Job: Job job_1541542844479_0004 completed successfully
18/11/04 22:36:30 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=182
		FILE: Number of bytes written=1085904
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2112
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=35
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=8
		Launched reduce tasks=1
		Data-local map tasks=8
		Total time spent by all maps in occupied slots (ms)=32949
		Total time spent by all reduces in occupied slots (ms)=1508
		Total time spent by all map tasks (ms)=32949
		Total time spent by all reduce tasks (ms)=1508
		Total vcore-milliseconds taken by all map tasks=32949
		Total vcore-milliseconds taken by all reduce tasks=1508
		Total megabyte-milliseconds taken by all map tasks=33739776
		Total megabyte-milliseconds taken by all reduce tasks=1544192
	Map-Reduce Framework
		Map input records=8
		Map output records=16
		Map output bytes=144
		Map output materialized bytes=224
		Input split bytes=1168
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=224
		Reduce input records=16
		Reduce output records=0
		Spilled Records=32
		Shuffled Maps =8
		Failed Shuffles=0
		Merged Map outputs=8
		GC time elapsed (ms)=1231
		CPU time spent (ms)=2770
		Physical memory (bytes) snapshot=2354618368
		Virtual memory (bytes) snapshot=17203613696
		Total committed heap usage (bytes)=1699741696
	Shuffle Errors
	File Input Format Counters 
		Bytes Read=944
	File Output Format Counters 
		Bytes Written=97
Job Finished in 17.162 seconds
Estimated value of Pi is 3.14000000000000000000
[ Image: π to a thousand places, although 3.141592 should be enough for anyone. ]
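To see where a figure like 3.14 comes from, here is a minimal sketch of the same idea in plain shell and awk. It is not Hadoop's code: as far as I can tell the bundled example is a quasi-Monte Carlo estimator, whereas this sketch swaps in ordinary pseudo-random points. The function name `estimate_pi` and its seed argument are made up for illustration. The principle is the one the job relies on: scatter points over a unit square, count how many land inside the inscribed circle, and multiply the fraction by 4. Assuming the job computes its estimate the same way, 8 maps of 500 samples give 4000 points, so the result can only move in steps of 4/4000 = 0.001, which would explain such a round answer.

```shell
#!/bin/sh
# Sketch only: plain pseudo-random Monte Carlo, not Hadoop's implementation.
# estimate_pi SAMPLES [SEED] prints 4 * (points inside the circle) / SAMPLES.
# Both the function name and the SEED argument are hypothetical.
estimate_pi() {
    awk -v n="$1" -v seed="${2:-1}" 'BEGIN {
        srand(seed)
        inside = 0
        for (i = 0; i < n; i++) {
            # Random point in the unit square, shifted so the circle
            # of radius 0.5 is centred on the origin.
            x = rand() - 0.5
            y = rand() - 0.5
            if (x * x + y * y <= 0.25) inside++
        }
        printf "%.3f\n", 4 * inside / n
    }'
}

# Same sample budget as the run above: 8 maps x 500 samples per map.
estimate_pi $((8 * 500))
```

With only 4000 points, expect the second decimal place to wobble between seeds; raising the sample count narrows the spread, but slowly.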

Introducing HDFS

Make a directory and copy an input file, file.txt, to the (pseudo-)distributed filesystem. Then, for old times’ sake, run the word-count example on that single file.

$ sudo hadoop.hdfs dfs -mkdir /user/iain
$ sudo hadoop.hdfs dfs -chown -R iain:iain /user/iain
$ hadoop.hdfs dfs -put ~/file.txt /user/iain/
$ hadoop jar /snap/hadoop/current/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/iain/file.txt /etc/hadoop/wc_output
[ output snipped ]
$ hadoop.hdfs dfs -ls /etc/hadoop/wc_output
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-11-13 23:46 /etc/hadoop/wc_output/_SUCCESS
-rw-r--r--   1 root supergroup         32 2018-11-13 23:46 /etc/hadoop/wc_output/part-r-00000
$ hadoop.hdfs dfs -cat /etc/hadoop/wc_output/part-r-00000
This    1
a       1
file    1
is      1
simple  1
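
For a file this small, it is easy to cross-check the result without a cluster. The sketch below is a local stand-in for the word count, built from standard Unix tools. It assumes that words are split on whitespace and that the output is sorted in byte order, which is what the listing above looks like. The function name `wordcount_local` and the contents of the demo file are assumptions too, since the original file.txt is not shown.

```shell
#!/bin/sh
# Sketch only: a local cross-check, not the Hadoop wordcount job.
# wordcount_local FILE prints "word<TAB>count", one word per line.
wordcount_local() {
    tr -s '[:space:]' '\n' < "$1" |   # one word per line
        sed '/^$/d' |                 # drop any empty line left over
        LC_ALL=C sort |               # byte order, so "This" sorts before "a"
        uniq -c |                     # count adjacent duplicates
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Demo on a temporary file with hypothetical contents.
tmp=$(mktemp)
printf 'This is a simple file\n' > "$tmp"
wordcount_local "$tmp"
rm -f "$tmp"
```

If file.txt really does contain that sentence, the function prints the same five words and counts as part-r-00000.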