[Ubuntu] [Long-term] A Two-Node Hadoop Setup

Environment:

Node1&2 : Ubuntu 12.04.3 64bit
Controller : Ubuntu 12.04.3 32bit
Java ver. : Java 7
Hadoop ver. : hadoop-1.2.1.tar.gz

After I left the graduate institute, the three machines... went back with it XD
Putting this project on hold for now


2013-12-26

[@@@Installing Ubuntu 12.04.3 64bit]

Both nodes now have Ubuntu 12.04.3 64bit installed, with SSH (port 22) open and no address restrictions yet; still deciding which machine to use as the controller.

[@@@Enabling SSH login on Ubuntu]
To install SSH:

$ sudo apt-get install ssh

Open the SSH port (22) in the server configuration (note that the server reads sshd_config; /etc/ssh/ssh_config only configures the outgoing client):

$ sudo gedit /etc/ssh/sshd_config

Find

#     Port 22

remove the leading #, then restart the service so the change takes effect:

$ sudo service ssh restart

In this state, any IP address can connect to this machine over SSH.


2013-12-28

[@@@Restricting which IPs may log in over SSH]
First, configure the machine so that SSH connections are accepted only from specific addresses.

$ sudo gedit /etc/hosts.allow

Append on the last line:

ALL:

Under /etc there are two related files, hosts.allow and hosts.deny: the former specifies which IPs may log in to this machine, the latter which IPs may not.
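As a concrete illustration (the subnet below is hypothetical; substitute your own trusted network), a matching pair of entries that admits sshd connections from one subnet only might look like:

```
# /etc/hosts.allow: allow SSH only from one trusted subnet (example address)
sshd: 192.168.1.0/255.255.255.0

# /etc/hosts.deny: reject every sshd connection not matched in hosts.allow
sshd: ALL
```

hosts.allow is consulted first, so the deny-all rule only catches addresses outside the listed subnet.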

Next, the groundwork for Hadoop.

[@@@Installing Java 7]
According to the official Hadoop wiki, Ubuntu 12.04 with Java 7 is an accepted combination.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

[@@@Adding a dedicated Hadoop user account]
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

[@@@Setting up an SSH public key]
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snip...]
hduser@ubuntu:~$

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
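One pitfall worth noting: sshd quietly ignores authorized_keys when the permissions are too permissive. A minimal sketch of the expected modes, run against a scratch directory so it is safe to execute anywhere; on the node itself the target is $HOME/.ssh:

```shell
# Apply the modes sshd expects: 700 on the .ssh directory,
# 600 on authorized_keys (demonstrated on a throwaway copy).
scratch=$(mktemp -d)
mkdir -p "$scratch/.ssh"
touch "$scratch/.ssh/authorized_keys"
chmod 700 "$scratch/.ssh"
chmod 600 "$scratch/.ssh/authorized_keys"
dir_mode=$(stat -c '%a' "$scratch/.ssh")
key_mode=$(stat -c '%a' "$scratch/.ssh/authorized_keys")
echo "dir=$dir_mode key=$key_mode"
rm -r "$scratch"
```

On the real account the equivalent is chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys.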

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snip...]
hduser@ubuntu:~$

You can log out of the test session at this point.

[@@@Disabling IPv6]
IPv6 is not used here and might interfere, so disable it machine-wide first.

$ sudo vi /etc/sysctl.conf

Append at the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Save and quit with :wq, then reboot:

$ sudo shutdown -r now


2013-12-30

The first thing I have to say is this:

I really cannot believe it. Why on earth does the school block HDFS...

[@@@Downloading Hadoop]
I had originally grabbed hadoop-2.2.0.tar.gz and was ready to start, but 2.2.0 turns out to contain no conf directory at all, which left me dumbfounded, so I fell back to hadoop-1.2.1.tar.gz. sudo issues come up constantly here; pay attention to which user each step belongs to.

Following the example, we put hadoop under /usr/local:

$ cd /usr/local
$ sudo wget http://apache.cdpa.nsysu.edu.tw/hadoop/core/stable1/hadoop-1.2.1.tar.gz
$ sudo tar xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

[@@@Updating .bashrc]
Note that this updates hduser's .bashrc:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
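As a quick sanity check of what the snippet above does to the environment (the paths are the ones assumed throughout this post), the PATH logic can be exercised on its own; on the real machine, log out and back in or source ~/.bashrc first:

```shell
# Reproduce the two relevant lines from .bashrc and confirm that
# Hadoop's bin directory ends up on PATH.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
case ":$PATH:" in
  *:/usr/local/hadoop/bin:*) on_path=yes ;;
  *) on_path=no ;;
esac
echo "hadoop bin on PATH: $on_path"
```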

[@@@Tuning]
Apart from creating the tmp directory, which needs sudo, everything else is adjusted as hduser; just edit the files with vi.

1. /usr/local/hadoop/conf/hadoop-env.sh

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

改成

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

2. Create the tmp directory

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp

Next, tune the XML configuration files. Each of the properties below must go between the <configuration> and </configuration> tags of the file in question.

3. conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

4. conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

5. conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

[@@@Hadoop step one: Format]
From here on, everything runs as hduser.

$ /usr/local/hadoop/bin/hadoop namenode -format

The output looks roughly like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$

Seeing the line "10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted." means the format succeeded.

[@@@Starting and stopping Hadoop]
To start:
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

The output looks roughly like this:

hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$

Run jps to check whether everything came up:

hduser@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
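
Rather than eyeballing the list, the check can be scripted. The sketch below runs against a captured copy of the jps output above; on a live node you would replace the sample variable with $(jps):

```shell
# Check that all five Hadoop 1.x daemons appear in a jps listing.
sample='2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode'
missing=""
for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
  # -w keeps "NameNode" from matching "SecondaryNameNode"
  echo "$sample" | grep -qw "$d" || missing="$missing $d"
done
if [ -z "$missing" ]; then
  echo "all daemons running"
else
  echo "missing:$missing"
fi
```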

Test stopping Hadoop:

hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$

[@@@Testing MapReduce]
Download the test e-books from the three pages below, choosing the Plain Text UTF-8 format:
http://www.gutenberg.org/ebooks/20417
http://www.gutenberg.org/ebooks/5000
http://www.gutenberg.org/ebooks/4300

Save them under /tmp/gutenberg, then start Hadoop again:

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

Before running MapReduce, the test data has to be copied into HDFS:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r--   3 hduser supergroup     674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r--   3 hduser supergroup    1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r--   3 hduser supergroup    1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

The output may look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters
10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286

Check that the output was written:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:43 /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
-rw-r--r--   1 hduser supergroup     880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
hduser@ubuntu:/usr/local/hadoop$

Retrieve the results back to the local filesystem:

hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra"       1
"1490   1
"1498," 1
"35"    1
"40,"   1
"A      2
"AS-IS".        1
"A_     1
"Absoluti       1
"Alack! 1
hduser@ubuntu:/usr/local/hadoop$
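
The merged file is sorted alphabetically by word, not by frequency. A small pipeline reorders it by count; shown here on a few made-up sample lines, but on the real output you would feed it /tmp/gutenberg-output/gutenberg-output instead (the real file separates columns with a tab, which sort treats as whitespace too):

```shell
# Sort "word count" pairs by the count column, descending,
# and keep the top three (sample data, not real job output).
sample='the 3332
project 88
ulysses 4
of 2045'
top=$(echo "$sample" | sort -k2,2 -nr | head -3)
echo "$top"
```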

[@@@Hadoop Web UI]
http://localhost:50070/ -> web UI of the NameNode daemon
http://localhost:50030/ -> web UI of the JobTracker daemon
http://localhost:50060/ -> web UI of the TaskTracker daemon

// Currently working on hduserr (hduserl has not yet had conf/core-site.xml set up)

————————————–References————————————–
[1] Running Hadoop on Ubuntu Linux (Multi-Node Cluster): http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[2] Running Hadoop on Ubuntu Linux (Single-Node Cluster): http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[3] Ubuntu tutorial: how to set up SSH remote login: http://ipzoner.pixnet.net/blog/post/23520297-ubuntu%E4%BD%9C%E6%A5%AD%E7%B3%BB%E7%B5%B1%E6%95%99%E5%AD%B8%E2%94%80%E5%A6%82%E4%BD%95%E8%A8%AD%E5%AE%9Assh%E9%81%A0%E7%AB%AF%E9%80%A3%E7%B7%9A%E5%8A%9F%E8%83%BD
[4] How To Install Oracle (Sun) Java JDK & JRE in Ubuntu via PPA: http://community.linuxmint.com/tutorial/view/1414
[5] Hadoop Java Versions: http://wiki.apache.org/hadoop/HadoopJavaVersions
[6] Is shutdown any different from reboot in practice?: http://www.ubuntu-tw.org/modules/newbb/viewtopic.php?viewmode=flat&order=ASC&topic_id=8335&forum=2&move=prev
[7] The vim editor: http://linux.vbird.org/linux_basic/0310vi.php
[8] How to download files with wget into a specified directory: http://www.inote.tw/2009/06/wget.html
[9] "Warning: $HADOOP_HOME is deprecated" fix for hadoop 1.0.4: http://chenzhou123520.iteye.com/blog/1826002
[10] WARN snappy.LoadSnappy: Snappy native library not loaded: http://stackoverflow.com/questions/10878038/warn-snappy-loadsnappy-snappy-native-library-not-loaded
