Advanced CI/CD in Practice: Notes and Pitfalls

After getting the basic CI/CD pipeline running in the previous post, I spent a few more days tinkering with some advanced features. This post covers code scanning, end-to-end monitoring, and canary releases, plus the pitfalls I hit along the way.

Starting with a Go project#

So far I had only used the stock Nginx image. This time I wanted a real application, so I wrote a simple Go web service.

Creating the project#

My local Go version is 1.25, so let's get started:

Terminal window
mkdir go-demo-app
cd go-demo-app
go mod init go-demo-app
go get github.com/prometheus/client_golang/prometheus/promhttp
go mod tidy

Writing the code#

main.go is simple, just three endpoints:

package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var version = "v1.0.0"

func main() {
	// Home page
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		hostname, _ := os.Hostname()
		fmt.Printf("Received request from %s\n", r.RemoteAddr)
		fmt.Fprintf(w, "<h1>Go Demo App</h1>")
		fmt.Fprintf(w, "<div>Version: <strong>%s</strong></div>", version)
		fmt.Fprintf(w, "<div>Hostname: %s</div>", hostname)
	})
	// Health check
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(200)
		w.Write([]byte("ok"))
	})
	// Prometheus metrics
	http.Handle("/metrics", promhttp.Handler())

	fmt.Printf("Starting server on port 8080, version: %s\n", version)
	if err := http.ListenAndServe(":8080", nil); err != nil {
		fmt.Printf("Error starting server: %s\n", err)
	}
}

Dockerfile#

It uses a multi-stage build, so the final image is only a dozen or so MB:

# Build stage
FROM reg.westos.org/library/golang:1.25-alpine AS builder
WORKDIR /app
# Go module proxy for faster downloads
ENV GOPROXY=https://goproxy.cn,direct
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o go-demo-app .
# Runtime stage
FROM reg.westos.org/library/alpine:latest
WORKDIR /root/
COPY --from=builder /app/go-demo-app .
EXPOSE 8080
CMD ["./go-demo-app"]

Since Kaniko cannot reach Docker Hub from my environment, I pushed the base images to Harbor ahead of time:

Terminal window
docker pull golang:1.25-alpine
docker pull alpine:latest
docker login reg.westos.org -u admin -p 12345
docker tag golang:1.25-alpine reg.westos.org/library/golang:1.25-alpine
docker tag alpine:latest reg.westos.org/library/alpine:latest
docker push reg.westos.org/library/golang:1.25-alpine
docker push reg.westos.org/library/alpine:latest

Pushing to Gitea#

Create a go-demo-app repository in Gitea, then push the code:

Terminal window
git init
git add .
git commit -m "Initial commit: Go app with metrics"
git branch -M main
git remote add origin http://192.168.100.10:30030/admin/go-demo-app.git
git push -u origin main

Configuring the Jenkins Pipeline#

Create a Jenkinsfile in the repository root to build the image automatically:

pipeline {
    agent {
        kubernetes {
            yaml '''
apiVersion: v1
kind: Pod
spec:
  hostAliases:
  - ip: "192.168.100.14"
    hostnames:
    - "reg.westos.org"
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command:
    - sleep
    - infinity
    volumeMounts:
    - name: registry-creds
      mountPath: /kaniko/.docker/
  volumes:
  - name: registry-creds
    secret:
      secretName: harbor-auth
      items:
      - key: .dockerconfigjson
        path: config.json
'''
        }
    }
    environment {
        IMAGE_REPO = "reg.westos.org/library/go-demo-app"
        IMAGE_TAG  = "v1.0.${BUILD_NUMBER}"
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Build & Push') {
            steps {
                container('kaniko') {
                    sh """
                        /kaniko/executor \
                          --context `pwd` \
                          --dockerfile `pwd`/Dockerfile \
                          --destination ${IMAGE_REPO}:${IMAGE_TAG} \
                          --destination ${IMAGE_REPO}:latest \
                          --skip-tls-verify \
                          --insecure
                    """
                }
            }
        }
        stage('Update Manifest') {
            steps {
                echo "TODO: Update ArgoCD manifest with new tag ${IMAGE_TAG}"
            }
        }
    }
}

Create a new Pipeline job in Jenkins, choose “Pipeline script from SCM”, point it at the go-demo-app repository in Gitea, and select the main branch.

Run a build and check Harbor to see whether the image was produced.
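If you prefer the command line over the Harbor UI, the pushed tags can also be listed through Harbor's REST API. A quick sanity check I use (a sketch assuming Harbor 2.x and the admin credentials from earlier; adjust https/-k to however your registry is exposed):

Terminal window
# List artifacts Kaniko pushed for this repository
curl -sk -u admin:12345 \
  "https://reg.westos.org/api/v2.0/projects/library/repositories/go-demo-app/artifacts?page_size=10" \
  | grep -o '"name":"[^"]*"'   # rough filter that surfaces the tag names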

Preparing the K8s deployment manifests#

In the go-demo-app repository, create deploy/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-demo-app
  namespace: default
  labels:
    app: go-demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-demo-app
  template:
    metadata:
      labels:
        app: go-demo-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      hostAliases:
      - ip: "192.168.100.14"
        hostnames:
        - "reg.westos.org"
      containers:
      - name: go-demo-app
        image: reg.westos.org/library/go-demo-app:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "500m"
            memory: "128Mi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      imagePullSecrets:
      - name: harbor-auth
---
apiVersion: v1
kind: Service
metadata:
  name: go-demo-app-svc
  namespace: default
  labels:
    app: go-demo-app
spec:
  type: NodePort
  selector:
    app: go-demo-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 30095

Push it to Gitea, then create a new application called go-demo-app in ArgoCD:

  • Repository URL: http://192.168.100.10:30030/admin/go-demo-app.git
  • Path: deploy
  • Sync Policy: Automatic, with Prune and Self Heal checked

After it syncs, visit http://192.168.100.10:30095; if you see the page, it works.
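Clicking through the UI works, but the same application can also be declared as an ArgoCD Application resource and applied with kubectl, which keeps the app definition in Git as well. A sketch of what the equivalent manifest might look like (assuming ArgoCD is installed in the argocd namespace):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: go-demo-app
  namespace: argocd            # the namespace ArgoCD runs in
spec:
  project: default
  source:
    repoURL: http://192.168.100.10:30030/admin/go-demo-app.git
    targetRevision: main
    path: deploy               # same path as configured in the UI
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true              # same as checking Prune in the UI
      selfHeal: true           # same as checking Self Heal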

Closing the CI loop#

Right now, every time Jenkins builds a new image I still have to edit the tag in deployment.yaml by hand, which is silly. Let's have Jenkins do it itself.

Updating the Jenkinsfile#

The main change is adding a git-tools container that commits the manifest update:

pipeline {
    agent {
        kubernetes {
            yaml '''
apiVersion: v1
kind: Pod
spec:
  hostAliases:
  - ip: "192.168.100.14"
    hostnames:
    - "reg.westos.org"
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command:
    - sleep
    - infinity
    volumeMounts:
    - name: registry-creds
      mountPath: /kaniko/.docker/
  - name: git-tools
    image: bitnami/git:latest
    command:
    - sleep
    - infinity
  volumes:
  - name: registry-creds
    secret:
      secretName: harbor-auth
      items:
      - key: .dockerconfigjson
        path: config.json
'''
        }
    }
    environment {
        IMAGE_REPO = "reg.westos.org/library/go-demo-app"
        IMAGE_TAG  = "v1.0.${BUILD_NUMBER}"
        GITEA_REPO = "192.168.100.10:30030/admin/go-demo-app.git"
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Build & Push') {
            steps {
                container('kaniko') {
                    sh """
                        /kaniko/executor \
                          --context `pwd` \
                          --dockerfile `pwd`/Dockerfile \
                          --destination ${IMAGE_REPO}:${IMAGE_TAG} \
                          --destination ${IMAGE_REPO}:latest \
                          --skip-tls-verify \
                          --insecure
                    """
                }
            }
        }
        stage('Update Manifest') {
            steps {
                container('git-tools') {
                    withCredentials([usernamePassword(credentialsId: 'gitea-auth', usernameVariable: 'GIT_USER', passwordVariable: 'GIT_PASS')]) {
                        sh """
                            git config --global user.email "jenkins@westos.org"
                            git config --global user.name "Jenkins CI"
                            git config --global --add safe.directory '*'
                            git checkout main
                            git pull origin main
                            sed -i "s|image: ${IMAGE_REPO}:.*|image: ${IMAGE_REPO}:${IMAGE_TAG}|" deploy/deployment.yaml
                            echo "Updated deployment.yaml:"
                            grep "image:" deploy/deployment.yaml
                            if git status --porcelain | grep deploy/deployment.yaml; then
                                git add deploy/deployment.yaml
                                git commit -m "Deploy: update image tag to ${IMAGE_TAG} [skip ci]"
                                git push http://${GIT_USER}:${GIT_PASS}@${GITEA_REPO} main
                            else
                                echo "No changes to commit"
                            fi
                        """
                    }
                }
            }
        }
    }
}

Push the change and run another build, then check Gitea to see whether deployment.yaml picked up the new tag automatically. If it did, check ArgoCD to confirm the Pod was recreated. At that point the CI loop is closed.

Adding code scanning (SonarQube)#

To make things look a bit more professional, let's add code scanning.

Installing SonarQube#

Terminal window
helm repo add sonarqube https://SonarSource.github.io/helm-chart-sonarqube
helm repo update
kubectl create namespace sonarqube
cat <<EOF > sonar-values.yaml
community:
  enabled: true
service:
  type: NodePort
  nodePort: 30099
persistence:
  enabled: true
  storageClass: "nfs-client"
  size: 5Gi
elasticsearch:
  configureNode: false
monitoringPasscode: "westos_monitor_123"
EOF
helm install sonarqube sonarqube/sonarqube -n sonarqube -f sonar-values.yaml
kubectl get pods -n sonarqube -w

SonarQube is slow to start; give it two or three minutes.
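Instead of eyeballing kubectl get pods, you can block until the StatefulSet is ready. A sketch, assuming the StatefulSet is named sonarqube-sonarqube to match the Pod name seen later (adjust if your release name differs):

Terminal window
kubectl -n sonarqube rollout status statefulset/sonarqube-sonarqube --timeout=10m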

Configuring SonarQube#

Visit http://192.168.100.10:30099. The default credentials are admin/admin; you are forced to change the password on first login. I set it to Admin123456789!

Create a project:

  1. Create a local project
  2. Project display name: go-demo-app
  3. Project Key: go-demo-app
  4. Main branch name: main
  5. Choose “Follows the instance’s default” and click Create

Then generate a token: choose “Locally”, click Generate, and copy the token (it looks like sqp_xxxx...).

Configuring Jenkins#

  1. Install the SonarQube Scanner plugin
  2. Add a credential:
    • Kind: Secret text
    • Secret: paste the token generated above
    • ID: sonar-token
  3. Configure the system:
    • Manage Jenkins → System → SonarQube servers
    • Check “Environment variables”
    • Name: sonar-server
    • Server URL: http://sonarqube-sonarqube.sonarqube.svc.cluster.local:9000
    • Token: select sonar-token

Updating the Jenkinsfile#

Add a sonar-cli container to the Pod template, then add a Code Analysis stage before the build stage:

# Add this container to the Pod template
  - name: sonar-cli
    image: sonarsource/sonar-scanner-cli:latest
    command:
    - sleep
    - infinity

// Add this stage under stages
stage('Code Analysis') {
    steps {
        container('sonar-cli') {
            withSonarQubeEnv('sonar-server') {
                sh """
                    sonar-scanner \
                      -Dsonar.projectKey=go-demo-app \
                      -Dsonar.sources=. \
                      -Dsonar.host.url=http://sonarqube-sonarqube.sonarqube.svc.cluster.local:9000 \
                      -Dsonar.login=$SONAR_AUTH_TOKEN
                """
            }
        }
    }
}

Run a build; once it succeeds, open the SonarQube web UI and check for the scan report.
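Right now the pipeline only pushes the analysis to SonarQube; it never fails on bad code. If you want the build to break when the Quality Gate fails, the SonarQube Scanner plugin provides a waitForQualityGate step. A sketch of an extra stage (it relies on a webhook configured in SonarQube that points back at Jenkins, e.g. http://192.168.100.10:30080/sonarqube-webhook/ in this setup):

// Add after the Code Analysis stage
stage('Quality Gate') {
    steps {
        timeout(time: 5, unit: 'MINUTES') {
            // Waits for SonarQube's webhook callback and aborts the build
            // if the Quality Gate status is not OK
            waitForQualityGate abortPipeline: true
        }
    }
}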


Resource optimization#

At this point the cluster was starting to struggle, so I installed metrics-server and checked node resource usage:

Terminal window
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
kubectl top nodes

The result:

NAME         CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
k8s-master   134m         6%       1227Mi          74%
k8s-node1    133m         6%       5524Mi          73%
k8s-node2    59m          2%       856Mi           11%

node1 is nearly maxed out while node2 is slacking off. Now the Pods:

Terminal window
kubectl top pods -A --sort-by=memory
NAMESPACE     NAME                        CPU(cores)   MEMORY(bytes)
sonarqube     sonarqube-sonarqube-0       19m          2186Mi
jenkins       jenkins-0                   2m           1495Mi
kube-system   kube-apiserver-k8s-master   31m          494Mi

The problem is obvious: the load is unevenly distributed, and SonarQube and Jenkins eat far too much memory.

Rebalancing the workload#

First cordon node1, then delete the SonarQube and Jenkins Pods to force them onto node2:

Terminal window
kubectl cordon k8s-node1
kubectl delete pod sonarqube-sonarqube-0 -n sonarqube
kubectl delete pod -n jenkins -l app.kubernetes.io/name=jenkins
# confirm they landed on node2
kubectl -n sonarqube get pod -o wide
kubectl -n jenkins get pod -o wide
# uncordon the node
kubectl uncordon k8s-node1

The node metrics look much more balanced now:

NAME         CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
k8s-master   130m         6%       1277Mi          77%
k8s-node1    68m          3%       1440Mi          19%
k8s-node2    145m         7%       883Mi           11%

Capping memory usage#

Edit sonar-values.yaml and add resource limits:

resources:
  requests:
    cpu: "200m"
    memory: "1Gi"
  limits:
    cpu: "1000m"
    memory: "1800Mi"
jvmCeOpts: "-Xmx1024m -Xms512m"
jvmOpts: "-Xmx1024m -Xms512m"

Apply the change:

Terminal window
helm upgrade sonarqube sonarqube/sonarqube -n sonarqube -f sonar-values.yaml

Then edit jenkins-manual.yaml:

controller:
  javaOpts: "-Xms512m -Xmx800m -Djenkins.install.runSetupWizard=false"
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "1000m"
      memory: "1280Mi"

Apply it:

Terminal window
helm upgrade jenkins jenkins/jenkins -n jenkins -f jenkins-manual.yaml

Memory usage after the change:

NAMESPACE     NAME                        CPU(cores)   MEMORY(bytes)
sonarqube     sonarqube-sonarqube-0       913m         767Mi
kube-system   kube-apiserver-k8s-master   53m          530Mi
jenkins       jenkins-0                   945m         328Mi

The improvement is obvious. Things are still a little sluggish, but at least a whole node won't fall over anymore.

End-to-end monitoring#

Next up: monitoring, with Prometheus + Grafana + Loki.

Deploying kube-prometheus-stack#

Terminal window
cat <<EOF > monitor-values.yaml
grafana:
  service:
    type: NodePort
    nodePort: 30000
prometheus:
  prometheusSpec:
    retention: 5d
    resources:
      requests:
        memory: 512Mi
        cpu: 200m
      limits:
        memory: 1Gi
        cpu: 1000m
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
alertmanager:
  alertmanagerSpec:
    replicas: 1
kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
EOF
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create ns monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring \
-f monitor-values.yaml
kubectl get pods -n monitoring -w

Monitoring Jenkins#

Install the Prometheus metrics plugin in Jenkins. After a restart, visiting http://192.168.100.10:30080/prometheus/ should show the metrics.

Create jenkins-monitor.yaml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jenkins
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: jenkins
  namespaceSelector:
    matchNames:
    - jenkins
  endpoints:
  - port: http
    path: /prometheus/
    interval: 15s

Apply it:

Terminal window
kubectl apply -f jenkins-monitor.yaml

Accessing Grafana#

Visit http://192.168.100.10:30000. The username is admin; fetch the password with:

Terminal window
kubectl -n monitoring get secret prometheus-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d ; echo

In my case the password was: sgbrApYRHjBKbJzdv7YbS1LgGZGqPAToqZ4x1NDj

In Grafana, go to Dashboards → New → Import, enter Dashboard ID 24357, select Prometheus as the data source, and click Import.

You should now see the Jenkins monitoring dashboard.


You can also import 1609813105 to see host and K8s cluster information.


Monitoring the Go app#

Create app-monitor.yaml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: go-demo-app
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: go-demo-app
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: http
    path: /metrics
    interval: 5s

Apply it:

Terminal window
kubectl apply -f app-monitor.yaml

In Grafana's Explore view, query go_goroutines or process_cpu_seconds_total; if a graph shows up, the scrape is working.
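A couple of queries I tried in Explore (the namespace/pod labels are added by the Prometheus Operator's relabeling, so adjust them if your setup names things differently):

# Goroutine count per Pod of the demo app
go_goroutines{namespace="default", pod=~"go-demo-app.*"}

# CPU usage rate over the last 5 minutes
rate(process_cpu_seconds_total{namespace="default", pod=~"go-demo-app.*"}[5m])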

Log aggregation (Loki)#

Grafana is already installed, so all that's left is Loki and Promtail.

Terminal window
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
kubectl create ns logging
cat <<EOF > loki-values.yaml
loki:
  enabled: true
  image:
    repository: grafana/loki
    tag: "2.9.3"   # Grafana cannot connect to the default 2.6.x image
    pullPolicy: IfNotPresent
  config:
    auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  singleBinary:
    replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi
promtail:
  enabled: true
  config:
    clients:
    - url: http://loki:3100/loki/api/v1/push
EOF
helm install loki grafana/loki-stack -n logging -f loki-values.yaml

Adding the Grafana data source#

In Grafana, go to Connections → Data Sources → Add data source → choose Loki.

Set the URL to: http://loki.logging.svc.cluster.local:3100

Click Save & Test.

Verifying the logs#

Go to Grafana → Explore → select Loki as the data source.

Under Label filters, set namespace = jenkins or default, then click Run query.

Now the same page shows not only the CPU graphs but also the Pod logs; that's end-to-end visibility.
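For reference, a LogQL query along these lines pulls just the demo app's request logs. The label names depend on the default Promtail scrape config shipped with loki-stack, so treat this as a sketch:

{namespace="default", pod=~"go-demo-app.*"} |= "Received request"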

Canary releases with Argo Rollouts#

Finally, canary releases, for unattended, fully automated rollouts.

Installing Argo Rollouts#

Terminal window
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# install the kubectl plugin
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x ./kubectl-argo-rollouts-linux-amd64
mv ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl get pods -n argo-rollouts
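To confirm the plugin works, and optionally browse rollouts in a local web UI, the plugin ships a couple of handy subcommands:

Terminal window
kubectl argo rollouts version
# Serves a local dashboard for inspecting rollout status
kubectl argo rollouts dashboard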

Creating an AnalysisTemplate#

go-demo-app 仓库的 deploy 目录下,新建 analysis.yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
  namespace: default
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    successCondition: result[0] == 1
    provider:
      prometheus:
        address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
        query: |
          # Demo query that always returns 1.
          # In production this should be a real success-rate query.
          vector(1)
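For reference, a real success-rate check might look something like the fragment below. It assumes the app exports an http_requests_total counter with a status-code label and that Prometheus attaches a service label, neither of which the demo app does yet, so treat it purely as a sketch:

metrics:
- name: success-rate
  successCondition: result[0] >= 0.95          # require at least 95% non-5xx responses
  provider:
    prometheus:
      address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{service="{{args.service-name}}", code!~"5.."}[2m]))
        /
        sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))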

Upgrading the Deployment to a Rollout#

Edit deploy/deployment.yaml; the key changes are:

  • apiVersion changes to argoproj.io/v1alpha1
  • kind changes to Rollout
  • replicas goes up to 5 (to make the canary steps easier to see)
  • a strategy.canary section is added
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: go-demo-app
  namespace: default
  labels:
    app: go-demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: go-demo-app
  strategy:
    canary:
      steps:
      - setWeight: 20              # shift 20% of traffic first
      - analysis:
          templates:
          - templateName: success-rate-check
          args:
          - name: service-name
            value: go-demo-app
      - pause: {duration: 30s}     # pause 30 seconds for observation
      - setWeight: 50              # then shift to 50%
      - pause: {duration: 10s}
      - setWeight: 100             # finally roll out to 100%
  template:
    metadata:
      labels:
        app: go-demo-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      hostAliases:
      - ip: "192.168.100.14"
        hostnames:
        - "reg.westos.org"
      containers:
      - name: go-demo-app
        image: reg.westos.org/library/go-demo-app:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "200m"
            memory: "128Mi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
      imagePullSecrets:
      - name: harbor-auth
---
apiVersion: v1
kind: Service
metadata:
  name: go-demo-app-svc
  namespace: default
  labels:
    app: go-demo-app
spec:
  type: NodePort
  selector:
    app: go-demo-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 30095

Delete the old Deployment:

Terminal window
kubectl delete deployment go-demo-app

Push to Gitea and hit Sync in ArgoCD.

Running a canary release#

Open a terminal and watch the Rollout status in real time:

Terminal window
kubectl argo rollouts get rollout go-demo-app -n default -w

Then trigger a build from Jenkins.

What to watch:

  1. Step 1: the Rollout shows Paused, with 1 new-version Pod (20%) and 4 old-version Pods
  2. Analysis: an AnalysisRun starts in the background and queries Prometheus with vector(1)
  3. Pass: if the Prometheus connection is healthy, the analysis reports Successful
  4. Step 2: the weight automatically increases to 50%
  5. Complete: the weight finally reaches 100% and all old-version Pods disappear
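If the analysis fails, Argo Rollouts aborts the update and scales the canary back down on its own; you can also intervene manually with the kubectl plugin. A few commands I found handy, run against the same namespace:

Terminal window
# Skip the current pause/analysis step and continue the rollout
kubectl argo rollouts promote go-demo-app -n default
# Abort the update and go back to the stable version
kubectl argo rollouts abort go-demo-app -n default
# Roll back to a previous revision
kubectl argo rollouts undo go-demo-app -n default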

(Screenshot: all replicas switched over to the new version)

The whole process is fully automated with no human intervention. That's the appeal of GitOps plus canary releases.

Wrap-up#

After all these days of tinkering, I've now been through the whole cloud-native toolchain: basic CI/CD, code scanning, end-to-end monitoring, and canary releases. I ran into plenty of resource shortages and misconfigurations along the way, but every problem solved taught me something new.

My biggest takeaway: cloud native is no silver bullet, but it really does solve problems that traditional deployment approaches can't. Canary releases, for example, used to require a pile of scripts; now a single YAML file does the job.
