Date
Dec 23, 2021 04:56 PM
Description
Hadoop is a framework for storing and processing large volumes of data in parallel.
Field
Data engineer
Status
Done

Hadoop
Big Data
5V
- Volume
- Velocity
- Variety
- Veracity
- Value
What's Hadoop
Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is among the most widely used tools for handling big data, and its market continues to grow. Hadoop has three core components:
- Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
- Hadoop MapReduce - Hadoop MapReduce is the processing unit.
- Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management unit.
Comparison with RDBMS

Components
HDFS


MapReduce
- Map phase - In the Map phase, a user-defined map function processes the input data; this is where the user puts the business logic. The Map phase emits intermediate outputs, which are stored on local disk.
- Reduce phase - This phase combines the shuffle step and the reduce step. The output of the Map phase is passed to the Reducers, where it is aggregated: a user-defined reduce function processes the mappers' output and generates the final result.
input
This is the raw data to be processed; in the word-count example illustrated here, it is simply a list of text lines.
split
The input data is partitioned for distributed processing. In Hadoop, when a MapReduce job is submitted, the input is split across the cluster nodes, where it waits to be processed.
map
This is the Map phase of MapReduce: each node turns its split of the data into key-value pairs, where the key is the word itself and the value 1 means one occurrence was found.
combine
This step also runs on the map-side machines: pairs with the same key are pre-summed locally, avoiding many separate records being sent over the network.
shuffle & sort
Before entering the Reduce phase, the intermediate pairs are sorted, so records with the same key end up together.
reduce
This phase performs the actual aggregation: the values for each key are summed.
output
This is the final result.
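The stages above can be sketched as a toy word-count pipeline in plain Python. This is a single-process simulation of the data flow only (the `run_wordcount` helper and its `num_splits` parameter are illustrative, not Hadoop's actual distributed implementation):

```python
from collections import defaultdict
from itertools import groupby

def run_wordcount(lines, num_splits=2):
    # split: partition the input lines across "nodes"
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # map + combine: each node emits (word, 1) pairs,
    # then pre-sums pairs with the same key locally
    combined = []
    for split in splits:
        local = defaultdict(int)
        for line in split:
            for word in line.split():
                local[word] += 1          # map emits (word, 1); combine sums
        combined.extend(local.items())

    # shuffle & sort: bring identical keys together
    combined.sort(key=lambda kv: kv[0])

    # reduce: sum the values for each key
    return {key: sum(v for _, v in group)
            for key, group in groupby(combined, key=lambda kv: kv[0])}

print(run_wordcount(["deer bear river", "car car river", "deer car bear"]))
# → {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In real Hadoop the splits live on different machines and the shuffle moves data over the network; here every stage runs in one process, but the data flow matches the diagram stage by stage.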

MapReduce's JobTracker

YARN
In classic MapReduce, the JobTracker is responsible for resource management (by managing the TaskTracker nodes), tracking resource consumption and release, and managing the job lifecycle (scheduling each of a job's tasks, tracking task progress, providing fault tolerance for tasks, and so on). The TaskTracker's duties are simple: start and stop the tasks assigned to it by the JobTracker, and periodically report task progress and status back to the JobTracker.
The basic idea of YARN is to split the JobTracker's two main responsibilities, resource management and job scheduling, between two separate roles: a global ResourceManager and a per-application ApplicationMaster. The ResourceManager, together with one NodeManager per node, forms a new general-purpose system that manages applications in a distributed fashion.

The ResourceManager is the ultimate authority that arbitrates resource allocation among applications. The per-application ApplicationMaster negotiates resources with the ResourceManager and works with the NodeManagers to execute and manage tasks. The ResourceManager has a pluggable scheduler that allocates resources to applications subject to constraints such as capacities and queues. It is a pure scheduler: it does not manage or track application state, nor does it restart tasks that fail due to hardware errors or application bugs. Scheduling is driven solely by an application's resource requirements, expressed through an abstraction called the Resource Container, which bundles resource elements such as memory, CPU, network, and disk.
The NodeManager is a per-node agent responsible for launching the application's containers, managing their resource usage (memory, CPU, network, disk), and reporting overall resource usage to the ResourceManager.
The per-application ApplicationMaster negotiates appropriate Resource Containers from the ResourceManager's scheduler, tracks their status, and monitors progress. From the system's point of view, the ApplicationMaster itself runs as an ordinary container.

- The client submits an Application to YARN; assume here it is a MapReduce job.
- The ResourceManager contacts a NodeManager to allocate the first container for the Application, and the Application's ApplicationMaster runs inside that container.
- Once started, the ApplicationMaster splits the job (the Application) into tasks that can run in one or more containers, requests containers for them from the ResourceManager, and sends periodic heartbeats to the ResourceManager.
- After containers are granted, the ApplicationMaster communicates with the corresponding NodeManagers and dispatches the work to the containers on those nodes; the split MapReduce work is distributed here, so a given container may run a Map task or a Reduce task.
- Tasks running in containers heartbeat to the ApplicationMaster to report their status. When the program finishes, the ApplicationMaster deregisters from the ResourceManager and releases the container resources.
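The flow above can be modeled as a heavily simplified, single-process sketch. All class and method names here are illustrative inventions for this toy model, not YARN's real API, and details like heartbeats and failure handling are omitted:

```python
# Toy single-process model of the YARN submission flow above.
# ResourceManager, NodeManager, ApplicationMaster mirror the roles
# in the text; the names and methods are illustrative only.

class NodeManager:
    def __init__(self, name):
        self.name = name

    def launch(self, task):
        # "Run" the task in a container on this node.
        return f"{task} ran on {self.name}"

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_node = 0

    def allocate_container(self):
        # Pure scheduling decision: round-robin a container onto a node.
        nm = self.node_managers[self.next_node % len(self.node_managers)]
        self.next_node += 1
        return nm

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, tasks):
        results = []
        for task in tasks:                      # job split into tasks
            nm = self.rm.allocate_container()   # negotiate a container
            results.append(nm.launch(task))     # dispatch to the NodeManager
        return results                          # then deregister and release

rm = ResourceManager([NodeManager("node-1"), NodeManager("node-2")])
am = ApplicationMaster(rm)  # in real YARN, the RM launches the AM in the first container
print(am.run_job(["map-0", "map-1", "reduce-0"]))
# → ['map-0 ran on node-1', 'map-1 ran on node-2', 'reduce-0 ran on node-1']
```

The point of the sketch is the division of labor: the ResourceManager only decides where containers go, the NodeManagers only run them, and the per-application ApplicationMaster drives the job from start to finish.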
Advantages / Disadvantages
Advantages
- More storage and computing power can be gained simply by adding nodes to the Hadoop cluster, which eliminates the need to buy specialized external hardware and makes it a cheaper solution.
- It can handle unstructured and semi-structured data.
- Hadoop clusters provide storage and distributed computing all in one.
- The HDFS layer in Hadoop is self-healing, replicating, and fault-tolerant: it automatically re-replicates data if a server or disk crashes.
- Hadoop offers scalability, reliability, and plenty of libraries for various applications at lower cost.
- It distributes data across different servers and helps prevent network overload.
Disadvantages
- It is not suitable for small-data or real-time applications.
- Joining multiple datasets is complex.
- It does not have storage-level or network-level encryption.
Ecosystem
