This post documents the deployment process. It is largely based on the article "開發 Ansible Playbooks 部署 Kubernetes v1.11.x HA 叢集" and extends the kube-ansible project; thanks to KaiRen.

This revision adds NVIDIA Docker: during the Ansible deployment it installs NVIDIA Docker and the k8s-device-plugin so that the Node machines can expose their GPU resources (this article records a bare-metal deployment with GPUs).
Node Information

The operating system for this installation is Ubuntu 16.04 Desktop, and the test environment consists of physical machines. Versions installed this time:
Kubernetes v1.11.2
Etcd v3.2.9
containerd v1.1.2
| IP Address   | Hostname | CPU | Memory | Extra Device |
|--------------|----------|-----|--------|--------------|
| 192.168.0.98 | VIP      | -   | -      | -            |
| 192.168.0.81 | k8s-m1   | 4   | 16G    | None         |
| 192.168.0.82 | k8s-m2   | 4   | 16G    | None         |
| 192.168.0.83 | k8s-m3   | 4   | 16G    | None         |
| 192.168.0.84 | k8s-g1   | 4   | 16G    | GTX 1060 6G  |
| 192.168.0.85 | k8s-g2   | 4   | 16G    | GTX 1060 6G  |
| 192.168.0.86 | k8s-g3   | 4   | 16G    | GTX 1060 6G  |
| 192.168.0.87 | k8s-g4   | 4   | 16G    | GTX 1060 6G  |
Preparation on All Nodes

Confirm the following before installing:

- All nodes can reach one another over the network.
- The deploy node can SSH into the other nodes without a password (a key-distribution sketch follows below).
- All nodes have sudoer privileges and do not prompt for a password.
- Python is installed on all nodes.
- /etc/hosts on all nodes resolves every host.
- Ansible is installed on the deploy node.
Allow all nodes to use sudo without a password:

$ echo "ubuntu ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ubuntu && sudo chmod 440 /etc/sudoers.d/ubuntu
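For passwordless SSH from the deploy node, a minimal sketch is to generate a key and copy it to every host (this assumes the ubuntu account and the IP range from the table above):

$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ for host in 192.168.0.{81..87}; do ssh-copy-id ubuntu@${host}; done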
Confirm the environment's DNS settings:

$ echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolvconf/resolv.conf.d/tail
$ sudo resolvconf -u
$ cat /etc/resolv.conf
nameserver 127.0.1.1
nameserver 8.8.8.8
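The /etc/hosts requirement can be satisfied with entries like the following on every node (a sketch mirroring the node table above):

$ sudo tee -a /etc/hosts <<EOF
192.168.0.81 k8s-m1
192.168.0.82 k8s-m2
192.168.0.83 k8s-m3
192.168.0.84 k8s-g1
192.168.0.85 k8s-g2
192.168.0.86 k8s-g3
192.168.0.87 k8s-g4
EOF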
GPU Node Preparation (Node)

Because using the GPUs requires CUDA and the NVIDIA driver to be present in the environment, install the NVIDIA driver (v410.79) and CUDA 10 through APT:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
$ sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
$ sudo apt-get update && sudo apt-get install -y cuda
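If nvidia-smi or nvcc cannot be found afterwards, the CUDA paths may need to be added to the shell environment, followed by a reboot so the new driver loads (a sketch assuming the default CUDA 10.0 install location):

$ echo 'export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
$ sudo reboot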
Deploy Node (Master)

Install Ansible on Ubuntu 16.04:

$ sudo apt-get install -y software-properties-common git cowsay
$ sudo apt-add-repository -y ppa:ansible/ansible
$ sudo apt-get update && sudo apt-get install -y ansible
On the GPU nodes, verify that the NVIDIA driver and CUDA were installed correctly:

$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130

$ sudo nvidia-smi
Fri Dec  9 10:25:24 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
| 38%   28C    P8     5W / 120W |      0MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Deploying Kubernetes with Ansible

This step runs the kube-ansible project written by kairen to deploy a Kubernetes HA cluster with Ansible. Fetch the project with Git:

$ git clone https://github.com/kairen/kube-ansible.git
$ cd kube-ansible
Kubernetes Cluster + GPU

Edit inventory/hosts.ini to describe the nodes being deployed and their group relationships. This is the same host list that /etc/hosts on the nodes should resolve, with the SSH login information appended directly after each host IP:

$ vim inventory/hosts.ini
[etcds]
192.168.0.[81:83] ansible_user=ubuntu ansible_password=password

[masters]
192.168.0.[81:83] ansible_user=ubuntu ansible_password=password

[nodes]
192.168.0.84 ansible_user=ubuntu ansible_password=password
192.168.0.85 ansible_user=ubuntu ansible_password=password
192.168.0.86 ansible_user=ubuntu ansible_password=password
192.168.0.87 ansible_user=ubuntu ansible_password=password

[kube-cluster:children]
masters
nodes
ansible_user is the SSH user name on each node, and ansible_password is that user's SSH password.
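Because the inventory relies on SSH password authentication, Ansible also needs the sshpass utility on the deploy node:

$ sudo apt-get install -y sshpass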
Next, edit group_vars/all.yml to configure features as needed, for example:

$ vim group_vars/all.yml
---
kube_version: 1.11.2
container_runtime: nvidia-docker
cni_enable: true
container_network: calico
cni_iface: "enp0s25"

vip_interface: "enp0s25"
vip_address: 192.168.0.98

etcd_iface: "enp0s25"

enable_ingress: true
enable_dashboard: true
enable_logging: false
enable_monitoring: true
enable_metric_server: true

grafana_user: "admin"
grafana_password: "p@ssw0rd"
If no interface is given for the bindings above, the node's default interface (usually the first NIC) is used. Testing showed that you must confirm in advance that the interface name is identical on every node; in this environment all Ubuntu 16.04 nodes use enp0s25. One way to check this across nodes is shown below.
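Ansible's setup module can report each node's default interface using the same inventory (a quick sketch; the interface field in the output holds the NIC name):

$ ansible -i inventory/hosts.ini all -m setup -a "filter=ansible_default_ipv4"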
Once group_vars/all.yml is complete, first check the cluster state with Ansible:
$ ansible -i inventory/hosts.ini all -m ping
192.168.0.81 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.82 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.83 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.84 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.85 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.86 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.168.0.87 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
Next, check that the GPU driver is running properly on each Node (nvidia-smi only exists on the GPU nodes, so target the nodes group):

$ ansible -i inventory/hosts.ini nodes -a "nvidia-smi" -b
192.168.0.84 | SUCCESS | rc=0 >>
Thu Dec  9 12:00:54 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
| 38%   29C    P8     4W / 120W |      0MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

192.168.0.87 | SUCCESS | rc=0 >>
Thu Dec  9 12:00:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:03:00.0  On |                  N/A |
| 40%   33C    P8     7W / 120W |    323MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1202      G   /usr/lib/xorg/Xorg                           171MiB |
|    0      3191      G   compiz                                       149MiB |
+-----------------------------------------------------------------------------+
Once the cluster checks out, run cluster.yml to deploy the Kubernetes cluster:

$ ansible-playbook -i inventory/hosts.ini cluster.yml
Checking Component Status

$ kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-1               Healthy   {"health": "true"}
etcd-2               Healthy   {"health": "true"}
etcd-0               Healthy   {"health": "true"}
$ kubectl get no
NAME      STATUS    ROLES     AGE       VERSION
k8s-m1    Ready     master    2m        v1.11.2
k8s-m2    Ready     master    2m        v1.11.2
k8s-m3    Ready     master    2m        v1.11.2
k8s-n1    Ready     <none>    2m        v1.11.2
k8s-n2    Ready     <none>    2m        v1.11.2
k8s-n3    Ready     <none>    2m        v1.11.2
k8s-n4    Ready     <none>    2m        v1.11.2
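To confirm that the device plugin has registered the GPUs with the scheduler, the allocatable GPU count of each node can also be listed (this command form comes from the NVIDIA k8s-device-plugin README):

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"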
Testing the GPU Nodes

Deploy a simple gpu-pod to test that the Kubernetes device plugin works on the nodes:

$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - image: nvidia/cuda
    name: cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod "gpu-pod" created

$ kubectl get po -a -o wide
Flag --show-all has been deprecated, will be removed in an upcoming release
NAME      READY     STATUS      RESTARTS   AGE       IP           NODE      NOMINATED NODE
gpu-pod   0/1       Completed   0          1h        10.244.1.5   k8s-n1    <none>

$ kubectl logs gpu-pod
Sun Dec  9 10:26:43 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:02:00.0  On |                  N/A |
| 40%   24C    P8    N/A /  75W |     62MiB /  4032MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
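Once the log confirms that the container saw the GPU, the completed test pod can be cleaned up:

$ kubectl delete pod gpu-pod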
Deploying Addons

$ ansible-playbook -i inventory/hosts.ini addons.yml
After it finishes, services can be checked with kubectl, for example kubernetes-dashboard:

$ kubectl get po,svc -n kube-system -l k8s-app=kubernetes-dashboard
NAME                                       READY     STATUS    RESTARTS   AGE
pod/kubernetes-dashboard-6948bdb78-7424h   1/1       Running   0          2m

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/kubernetes-dashboard   ClusterIP   10.108.226.213   <none>        443/TCP   1h
The dashboard can then be reached through the API Server proxy at https://192.168.0.98:8443/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login

Look up a kubernetes-dashboard Token to log in:
$ kubectl -n kube-system get secret
NAME                                TYPE                                  DATA      AGE
deployment-controller-token-kmcmz   kubernetes.io/service-account-token   3         1h

$ kubectl -n kube-system describe secret deployment-controller-token-kmcmz
Name:         deployment-controller-token-kmcmz
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name=deployment-controller
              kubernetes.io/service-account.uid=e4e91ed4-fb9b-11e8-baef-d05099d079fb

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1428 bytes
namespace:  11 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJkZXBsb3ltZW50LWNvbnRyb2xsZXItdG9rZW4ta21jbXoiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiZGVwbG95bWVudC1jb250cm9sbGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiZTRlOTFlZDQtZmI5Yi0xMWU4LWJhZWYtZDA1MDk5ZDA3OWZiIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOmRlcGxveW1lbnQtY29udHJvbGxlciJ9.IRQUhsVU4AJ36-qNClW7htzFJis1Mf_YSySIBKYuZ7uuaCzGcXRZtJ-nPo0SFBq7XufBMydjKwKP6tmsG1NsjttC3ETX-OnCV7u9BW0DK4HX6YloS-6Ik2rN9nHOa5iRpSNwCB2l6axGofoLkIosRCYMhdUyI5E9ZIrNKV-AvKehZkFtxXQCE3DbWGiklj1QPVq2oypfkwBEZG4GSlFkxPoIkzQQTbmZDfH036hi9DpBcUJIU41IJb9npdx65NA39Oskjdwiym1z_JlAhlhnE-uCPc-IjHirw_bEcn7mhDBf-1O2kr0IVmAbczFi82aoCagTDtUjBLP7BJ3k0v0gxQ
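The token can also be extracted in a single line (a sketch assuming the same secret name as above):

$ kubectl -n kube-system get secret deployment-controller-token-kmcmz -o jsonpath='{.data.token}' | base64 -d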
The dashboard login screen is then displayed.
Resetting the Cluster

Finally, to redeploy the cluster from scratch, reset-cluster.yml can be used to wipe it:

$ ansible-playbook -i inventory/hosts.ini reset-cluster.yml
[Supplement] Non-HA deployment (single Master, single Node, for testing) with NIC names fixed to be consistent
| IP Address   | Hostname | CPU | Memory |
|--------------|----------|-----|--------|
| 192.168.0.13 | VIP      | -   | -      |
| 192.168.0.10 | k8s-m1   | 4   | 16G    |
| 192.168.0.11 | k8s-n1   | 4   | 16G    |
Example inventory/hosts.ini configuration:

$ vim inventory/hosts.ini
[etcds]
192.168.0.10 ansible_user=ubuntu ansible_password=password

[masters]
192.168.0.10 ansible_user=ubuntu ansible_password=password

[nodes]
192.168.0.11 ansible_user=ubuntu ansible_password=password

[kube-cluster:children]
masters
nodes
Adjusted inventory/group_vars/all.yml example:

$ vim inventory/group_vars/all.yml
---
kube_version: 1.11.2
container_runtime: nvidia-docker
cni_enable: true
container_network: calico
cni_iface: "eth0"

vip_interface: "eth0"
vip_address: 192.168.0.13

etcd_iface: "eth0"

enable_ingress: true
enable_dashboard: true
enable_logging: false
enable_monitoring: true
enable_metric_server: true

grafana_user: "admin"
grafana_password: "p@ssw0rd"
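A common way to get the legacy eth0 name used consistently on Ubuntu 16.04 is to disable predictable interface naming through GRUB (a sketch; it assumes the stock GRUB_CMDLINE_LINUX="" line in /etc/default/grub, and must be run on each node followed by a reboot):

$ sudo sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"/' /etc/default/grub
$ sudo update-grub
$ sudo reboot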
In this supplementary example, one Master and one Node were deployed after confirming that the NIC names had been made consistent and setting a single VIP value for the master; re-running the Ansible HA playbook then completed successfully.