How to analyze profile data for high CPU load in OpenShift v3

Solution Unverified - Updated

Environment

  • OpenShift Container Platform 3.x

Issue

Resolution

  • Before collecting pprof data on an OpenShift process, please perform some initial troubleshooting:

    1. Confirm with the top command that the Go process is under high load
    2. Collect an sosreport and look through the log files
    3. Analyze the sysstat data
    4. Run perf top (perf top shows the names of the hot functions)
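As a concrete sketch, the triage steps above map to commands like the following. The process name openshift is an example, and steps 2-4 are shown commented because sosreport, sar, and perf may not be installed on every host:

```shell
# 1. Confirm the Go process is under sustained high load (top -b -n 1, or ps as here):
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 10

# 2. Collect an sosreport and review the archived log files:
#      sosreport --batch
# 3. Check the CPU history recorded by sysstat:
#      sar -u
# 4. perf top prints the names of the hottest functions for a PID:
#      perf top -p "$(pgrep -x openshift | head -n 1)"
```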

Preparation

  • Before analyzing the profile data, you need to collect it. Please refer to How to get profile data from openshift process? to obtain the profile data.

  • We recommend OPENSHIFT_PROFILE=web for quick access to the data. For full profiles, however, you need to use the cpu, memory, or block profiles, which write data files (*.pprof profiles) under /var/lib/origin/.

Basic steps

  1. Set OPENSHIFT_PROFILE=web
  2. Once the CPU load is high, collect the CPU trace and the goroutine stack dump.
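With OPENSHIFT_PROFILE=web, the process serves Go's net/http/pprof handlers over HTTP. The sketch below assumes the conventional net/http/pprof listener at 127.0.0.1:6060; verify the actual address on your host (e.g. with ss -ltnp) before collecting:

```shell
# Assumed endpoint; confirm the real port before collecting.
HOST=127.0.0.1:6060

# 30-second CPU trace:
curl -s -o cpu.pprof "http://$HOST/debug/pprof/profile?seconds=30" || echo "endpoint not reachable"

# Full goroutine stack dump (plain text):
curl -s "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.txt || echo "endpoint not reachable"
```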

Advanced Profile Collection / Review

  1. Set OPENSHIFT_PROFILE to any of the following:

    • cpu
    • memory
    • block
  2. Restart the "process" you are profiling so that profiling starts.

    • The *.pprof profile files reside under /var/lib/origin/ after a restart - however, in some cases they are not seen until the second restart.
    systemctl restart atomic-openshift-{master*|node}
    
  3. Restart the "process" you are profiling again and collect the profile file.

    • The *.pprof profile files reside under /var/lib/origin/ after this restart (note that the file's time stamp should be from before the second restart).
    systemctl restart atomic-openshift-{master*|node}
    
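One way to set the variable persistently for the restarts above is a systemd drop-in. This is a sketch: the unit name and path are examples and vary by role (master vs node) and OCP version.

```
# /etc/systemd/system/atomic-openshift-node.service.d/profile.conf (example path)
[Service]
Environment=OPENSHIFT_PROFILE=cpu
```

After creating the drop-in, run systemctl daemon-reload, then perform the two restarts described above. Remove the drop-in and restart once more when collection is finished, so profiling does not remain enabled.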

Analysis and Diagnostics of the Profile

After collecting the data from an OpenShift process (called creating a profile), you then need to review the data. This often means reviewing the output of commands run during the collection phase, or reviewing / analyzing a profile file. The latter is what the rest of this solution focuses on.

Creating a Review Host / Container

This section applies when using the cpu, memory, or block profiles, which produce a data file (*.pprof profile) under /var/lib/origin/.

  1. To review the data you need an environment with the matching OCP binaries (oftentimes this means installing RHEL and the OCP packages); however, this can also be done with docker using the following:

    • Please review the notes / comments in the docker file.
        FROM registry.access.redhat.com/rhel7:latest
    
        ### Remove these 3 lines if your "analyze host" (where docker is installed) is a RHEL system. 
        ENV RHN_USERNAME=USERNAME
        ENV RHN_PASSWORD=PASSWORD
        ENV RHN_POOL=POOL_ID
    
        ### Note: rhel-7-server-ose-3.#-rpms needs to be set to the version of OCP you are running - that matches where the profile was collected.
        ENV OCP_REPOS="rhel-7-server-optional-rpms rhel-7-server-rpms rhel-7-server-extras-rpms rhel-7-server-ose-3.#-rpms"
    
        ### Note: atomic-openshift-master and atomic-openshift-node may need to be adjusted to match exact versions of the RPM's you have installed, 
        ###           on the host where, the profile was collected. 
        ENV PKGS="golang atomic-openshift-master atomic-openshift-node"
    
        ENV GOPATH="/opt"
        ENV DATA_FILE="profile.pprof"
        ENV PROFILE_FILE="/tmp/profile.pprof"
    
        COPY $DATA_FILE $PROFILE_FILE
    
        ### Remove these next 2 lines if your "analyze host" (where docker is installed) is a RHEL system. 
        RUN subscription-manager register --username $RHN_USERNAME --password $RHN_PASSWORD && \
            subscription-manager attach --pool $RHN_POOL && \
            subscription-manager repos --disable='*'
        RUN subscription-manager repos $(for REPO in $OCP_REPOS; do echo -n " --enable=$REPO"; done)
    
        ### Use this next line if your "analyze host" (where docker is installed) is a RHEL system. 
        #RUN for REPO in $OCP_REPOS; do yum-config-manager --enable $REPO ; done
    
        RUN INSTALL_PKGS="$PKGS" && \
            yum install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \
            rpm -V $INSTALL_PKGS && \
            yum clean all -y
    
        ### pprof is installed into ${GOPATH}/bin; note an `export PATH` inside a RUN step would not persist to later layers.
        RUN go get github.com/google/pprof
    
        RUN ${GOPATH}/bin/pprof -text /usr/bin/openshift $PROFILE_FILE | tee /opt/text_profile
        RUN ${GOPATH}/bin/pprof -tree /usr/bin/openshift $PROFILE_FILE | tee /opt/tree_profile
    
  2. Provided your profile.pprof (renamed to this) and the above Dockerfile are in the same directory, you can run:

    sudo docker build -t pprof_container -f Dockerfile .
    
    • This produces the /opt/text_profile and /opt/tree_profile files inside the pprof_container image, which you can review to analyze the profile that was taken.
      • This output is also displayed as part of the build steps.

Basic diagnosis

  • If using the container built above, you can run the same pprof commands against the profile by entering the container:
# sudo docker run -it --rm pprof_container /opt/bin/pprof /usr/bin/openshift /tmp/profile.pprof

CPU trace

  • This section shows the top CPU consumers for the process (over the profile collection window).

topN

  • The most useful command is topN. It shows the top N samples in the profile.
(pprof) top10
2010ms of 3320ms total (60.54%)
Dropped 109 nodes (cum <= 16.60ms)
Showing top 10 nodes out of 174 (cum >= 50ms)
      flat  flat%   sum%        cum   cum%
     600ms 18.07% 18.07%      640ms 19.28%  syscall.Syscall
     540ms 16.27% 34.34%      640ms 19.28%  regexp.(*machine).add
     210ms  6.33% 40.66%      210ms  6.33%  scanblock
     190ms  5.72% 46.39%      800ms 24.10%  regexp.(*machine).step
     130ms  3.92% 50.30%      150ms  4.52%  runtime.mallocgc
     110ms  3.31% 53.61%      110ms  3.31%  runtime.futex
      70ms  2.11% 55.72%       70ms  2.11%  runtime.MSpan_Sweep
      60ms  1.81% 57.53%       70ms  2.11%  regexp.(*machine).alloc
      50ms  1.51% 59.04%       50ms  1.51%  path/filepath.(*lazybuf).append
      50ms  1.51% 60.54%       50ms  1.51%  runtime.memmove
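In this output, flat is time spent in the function itself, cum includes its callees, and sum% is a running total of flat%. As a hypothetical post-processing sketch, a saved topN listing can be filtered with standard tools - here, printing the functions whose flat share exceeds 5%:

```shell
# Save a captured top10 listing (as in the output above):
cat > /tmp/top10.txt <<'EOF'
      flat  flat%   sum%        cum   cum%
     600ms 18.07% 18.07%      640ms 19.28%  syscall.Syscall
     540ms 16.27% 34.34%      640ms 19.28%  regexp.(*machine).add
     210ms  6.33% 40.66%      210ms  6.33%  scanblock
     190ms  5.72% 46.39%      800ms 24.10%  regexp.(*machine).step
     130ms  3.92% 50.30%      150ms  4.52%  runtime.mallocgc
EOF

# Keep rows where flat% > 5 and print the function name (field 6):
awk 'NR > 1 { sub(/%/, "", $2); if ($2 + 0 > 5) print $6 }' /tmp/top10.txt
# prints: syscall.Syscall, regexp.(*machine).add, scanblock, regexp.(*machine).step
```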

topN -cum

  • To sort by the fourth and fifth columns, use the -cum (for cumulative) flag:
(pprof) top5 -cum
0 of 3320000000ns total (    0%)
Dropped 109 nodes (cum <= 16600000ns)
Showing top 5 nodes out of 174 (cum >= 2460000000ns)
      flat  flat%   sum%        cum   cum%
         0     0%     0% 2820000000ns 84.94%  runtime.goexit
         0     0%     0% 2650000000ns 79.82%  github.com/google/cadvisor/manager.(*containerData).housekeeping
         0     0%     0% 2520000000ns 75.90%  github.com/google/cadvisor/manager.(*containerData).housekeepingTick
         0     0%     0% 2520000000ns 75.90%  github.com/google/cadvisor/manager.(*containerData).updateStats
         0     0%     0% 2460000000ns 74.10%  github.com/google/cadvisor/container/libcontainer.GetStats
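Once the cumulative view points at a hot call chain (here, the cadvisor housekeeping path), you can drill further within the same pprof session. A sketch using standard pprof interactive commands (the function names are taken from the output above):

```
(pprof) peek housekeeping    # callers and callees of the matching functions
(pprof) list GetStats        # annotated source lines, if sources are available
(pprof) web                  # render a call graph (requires graphviz)
```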

Reference

Golang project doc: Profiling Go Programs (blog.golang.org)


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.