Aug 19, 2016

Analyzing performance issues in a large scale system with ELK

Application overview

I’m working on a large project here at QAware. Besides a lot of source code, the project runs on a large number of servers.
The following figure gives you a brief overview. The rectangles are servers, and the strange A* items are the best magnifying glasses I can draw myself (they mark the analysis points).





All servers exist at least in pairs (up to larger clusters). Ignoring the relational database servers, that leaves us with about 54 servers in production.
The customer has a separate operations team running the application, but in the following case we were asked for help.


Problem report

Users of a third-party application that consumes our services reported performance issues. It was not clear when the problems occurred or which services were affected.

Unfortunately the operations team does not run performance monitoring.


Analysing the problem

Fortunately, we were prepared for this kind of problem:
  • We have log files on every server and ship them to a central Elasticsearch.
  • We have Kibana running to search and visualize the data.
  • We log performance information on all servers.
  • We can correlate the log statements from every server with one specific request (see the sketch of such a log entry below).
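
To make the following queries easier to follow, here is a minimal sketch of how one of these log entries could look as an Elasticsearch document. All field names (service, request_id, duration_ms, ...) are assumptions for this post; the real mapping depends on your Logstash configuration.

# Hypothetical performance log entry as an Elasticsearch document.
# All field names are assumptions; adjust them to your own mapping.
log_document = {
    "@timestamp": "2016-08-19T09:15:32.123Z",
    "host": "appserver-03",                 # server that wrote the entry
    "service": "customerSearch",            # service that was called
    "request_id": "f3a9c2e1-7b4d-4c1e-9a2f-0d5e6b7c8d9e",  # correlation ID shared by all servers for one request
    "duration_ms": 7412,                    # measured runtime of the call
    "message": "service call finished"
}
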
Visualizing this information with charts in Kibana was a huge help in tracking down the problem. Here are some of the key charts (I left out a few steps).

A1 - Check if we have a problem

I searched for incoming service calls at point A1 (see application overview) and created a pie chart. Each slice represents a range of request durations.
A1 is our access point for service calls and therefore the best spot to determine whether services are slow. I chose the pie chart to get a quick overview of all requests and the distribution of their runtime durations.






Only the large green slice represents service calls below 5s duration.

Steps in Kibana:

  • Choose Visualize
  • Choose 'Pie chart'
  • Choose 'From a new search'
  • Enter <Query for all services on A1>
  • Under 'buckets' click on 'Split slices'
  • Set up slices as follows
    • Aggregation: Histogram
    • Field: <duration in ms field>
    • Interval: 5000
  • Press Play
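
If you want to reproduce the same numbers outside of Kibana, the chart boils down to a single histogram aggregation. Here is a sketch in Python using the requests library; the index name, the query and the duration field are placeholders, and the aggregation follows the Elasticsearch 2.x-style query DSL we were running at the time.

# Sketch of the aggregation behind the A1 pie chart (assumed index and field names).
import requests

ES_URL = "http://elasticsearch:9200/logstash-*/_search"   # hypothetical endpoint

body = {
    "size": 0,                                                    # we only need the aggregation, not the hits
    "query": {"match": {"log_type": "incoming_service_call"}},    # stands in for <Query for all services on A1>
    "aggs": {
        "duration_ranges": {
            "histogram": {
                "field": "duration_ms",                           # <duration in ms field>
                "interval": 5000                                  # 5s buckets, same as the slice setup
            }
        }
    }
}

response = requests.post(ES_URL, json=body).json()
for bucket in response["aggregations"]["duration_ranges"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

Each bucket key is the lower bound of a 5s range; the first bucket corresponds to the large green slice.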

Result:
We clearly had a problem. There are complex business services involved, but a response time above 5s is unacceptable. In the analysed time slot, 20% of the service calls took longer than that!

A2 - Show the search performance

I chose the most basic search (more or less an ID lookup) that is performed inside the application (see point A2 in the application overview) and created a line chart of its request time.
By picking a point between the application server and the database, I basically split the application in half and checked on which side the time was lost.

This time I used a line chart with a date histogram to show whether there is any relation between slow service calls and the time of day.




Steps in Kibana:
  • Choose Visualize
  • Choose 'Line chart'
  • Choose 'From a new search'
  • Enter <Query for the basic search on A2>
  • Set up 'metrics' -> 'Y-Axis' as follows
    • Aggregation: Average
    • Field: <duration field>
  • Under 'buckets' click on 'X-Axis'
  • Set up X-Axis as follows
    • Aggregation: Date Histogram
    • Field: @timestamp
    • Interval: Auto
  • Press Play
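
The same chart expressed as a raw query is a date histogram with an average sub-aggregation. Again a sketch with assumed index and field names, not our exact production query.

# Sketch of the A2 line chart: average search duration per time bucket.
import requests

ES_URL = "http://elasticsearch:9200/logstash-*/_search"   # hypothetical endpoint

body = {
    "size": 0,
    "query": {"match": {"operation": "id_lookup"}},        # stands in for <Query for the basic search on A2>
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"},   # X-Axis buckets
            "aggs": {
                "avg_duration": {"avg": {"field": "duration_ms"}}          # Y-Axis metric
            }
        }
    }
}

result = requests.post(ES_URL, json=body).json()
for bucket in result["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_duration"]["value"])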


Result:
As you can see, the duration skyrockets during certain hours, and the same pattern showed up on every working day. Conclusion: there is a load problem.

A3 - Check the search performance of the different SOLRs

I created another visualization for the different SOLR instances we run, one per language. I basically took the line chart from A2 and added a sub-bucket. This way you can split the graph by an additional dimension (in our case the language) and see whether it is related to the problem.



Steps in Kibana:

  • Choose Visualize
  • Choose 'Line chart'
  • Choose 'From a new search'
  • Enter <Query for all searches on A3> 
  • Set up 'metrics' -> 'Y-Axis' as follows
    • Aggregation: Average
    • Field: <duration field>
  • Under 'buckets' click on 'X-Axis'
  • Set up X-Axis as follows
    • Aggregation: Date Histogram
    • Field: @timestamp
    • Interval: Auto
  • Click on 'Add sub-buckets'
  • Click on 'Split Lines'
  • Set up 'Split Lines' as follows
    • Aggregation: Terms
    • Field: <language field>
    • Top: 20 (in our case)
  • Press Play
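
The split into languages is simply a terms sub-aggregation nested inside the date histogram. Here is the corresponding sketch, again with assumed index and field names.

# Sketch of the A3 chart: average duration per hour, split by language.
import requests

ES_URL = "http://elasticsearch:9200/logstash-*/_search"   # hypothetical endpoint

body = {
    "size": 0,
    "query": {"match": {"operation": "solr_search"}},      # stands in for <Query for all searches on A3>
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"},
            "aggs": {
                "by_language": {
                    "terms": {"field": "language", "size": 20},            # one line per language, top 20
                    "aggs": {
                        "avg_duration": {"avg": {"field": "duration_ms"}}
                    }
                }
            }
        }
    }
}

result = requests.post(ES_URL, json=body).json()
for hour in result["aggregations"]["per_hour"]["buckets"]:
    for lang in hour["by_language"]["buckets"]:
        print(hour["key_as_string"], lang["key"], lang["avg_duration"]["value"])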


Result:
We could see the load problem equally distributed across all languages, which makes no sense, because we have minor languages that never get much load. A quick look at some query times in the SOLR instances confirmed this: the queries themselves were fast.

Result

We now knew that it was a load problem and that it was not caused by the SOLR instances or the application itself. The remaining possible bottlenecks were the Apache reverse proxy and the network itself. Neither of them would have been my initial guess.

Shortly afterwards, we helped the operations team track down a misconfigured SOLR reverse proxy: it was using file caching on a network device!


Conclusion 


  • Visualizing the data was crucial for us in locating the problem. If you only look at a list of independent log entries in text form, it is much harder to draw the right conclusions.
  • Use different charts depending on the question you want to answer.
  • Use visual log analysis tools like Kibana (ELK stack). They are free to use and can definitely help a lot.
