In testing web applications with JMeter, I will mainly write about running the test plans, recording the results and interpreting them.
When do I stop?
One of the main questions you have to ask yourself when you start stress testing a web application is: when do I stop? This question is not as easy as it seems. The answer depends on your initial objectives and on “scientific” criteria allowing you to decide when you have met those objectives. Eventually, it comes down to measuring and interpreting the “results” of your stress tests.
Before going any further, we should spend some time on the measurable outcomes of a stress test. There are 2 main measures that you can record when you run a stress test on a web application:
- The throughput: the number of requests per unit of time (seconds, minutes, hours) that are sent to your server during the test.
- The response time: the elapsed time from the moment a given request is sent to the server until the moment the last bit of information has returned to the client.
The throughput is the real load processed by your server during a run but it does not tell you anything about the performance of your server during this same run. This is the reason why you need both measures in order to get a real idea about your server’s performance during a run. The response time tells you how fast your server is handling a given load.
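To make the two measures concrete, here is a minimal Python sketch. The sample values and the function names are invented for illustration, and the throughput is taken over the window from the first request sent to the last response received, which is an assumption about how you want to count time rather than a statement about any particular tool.

```python
# Each sample is (start_ms, elapsed_ms): when the request was sent and how
# long it took until the last byte of the response came back.
# These values are made up for the example.
samples = [(0, 120), (250, 95), (500, 130), (750, 110), (1000, 105)]


def throughput_per_second(samples):
    """Requests per second between the first send and the last response."""
    first_start = min(start for start, _ in samples)
    last_end = max(start + elapsed for start, elapsed in samples)
    window_seconds = (last_end - first_start) / 1000.0
    return len(samples) / window_seconds


def mean_response_time_ms(samples):
    """Average elapsed time of all samples, in milliseconds."""
    return sum(elapsed for _, elapsed in samples) / len(samples)


print(f"throughput: {throughput_per_second(samples):.2f} req/s")
print(f"mean response time: {mean_response_time_ms(samples):.1f} ms")
```

The two numbers answer different questions: the first says how much load was applied, the second how well the server coped with it.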
We are now much closer to finding an answer to our initial question: you can stop stress testing your application when, for a measured throughput, the measured response time is “too high”. This is the right answer in an ideal world where information systems behave in a deterministic manner. Another way to answer our question could be: you can stop stress testing your application when your system crashes, collapses or starts to behave unexpectedly. However, I will stick to our first answer for a while as it raises another interesting question: what is a “high” response time for a web application (or any application or information system used by real people)?
To make it short, usability studies make it possible to define response time limits at which the user’s interaction with an information system changes radically. These limits are tightly related to human nature (psychology as well as brain performance):
- 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
- 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
- 10 seconds is about the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
Using these limits allows us to give a precise end point to the stress tests of a system: they help us define, in collaboration with our client (or users), what an acceptable response time is. For example, the last time I ran stress tests for a client, we agreed that the acceptable upper limit of the response times for his system was 7 seconds, and he wanted to know how many concurrent users his system would handle within that limit.
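As a rough sketch, the three limits and an agreed upper bound can be written down as a small Python helper. The function names and the wording of the labels are hypothetical; the thresholds are the ones quoted above, and the 7 second default is only the value agreed with that particular client.

```python
def perceived_delay(seconds):
    """Map a response time to the usability band it falls into."""
    if seconds <= 0.1:
        return "instantaneous"
    if seconds <= 1.0:
        return "noticeable, flow of thought uninterrupted"
    if seconds <= 10.0:
        return "attention still on the dialogue"
    return "attention lost, progress feedback needed"


def is_acceptable(seconds, agreed_limit=7.0):
    """True if the response time is within the limit agreed with the client."""
    return seconds <= agreed_limit


for t in (0.05, 0.8, 6.5, 12.0):
    print(t, perceived_delay(t), is_acceptable(t))
```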
The remaining problem now is how to measure or estimate the throughput and response times of our system using JMeter: some simple statistics and mathematics are needed here.
Run your test plan and record the meaningful measures …
First of all, JMeter provides us with several different “listeners” that record these 2 variables in various ways (graphs, tables, trees, files). I would say that most of these “listeners” are useless or, to put it differently, that one of them is a must-have because it puts all the necessary information in your hands: the Summary Report. In order to understand this report and to implement scenarios efficiently, we must keep the following things in mind:
- JMeter records response times and throughput for each “sampler” of each “thread group” defined in your test plan.
- In the Summary Report, one line is displayed for each different “sampler” based on the sampler’s name: you can group or differentiate samplers in the report just by playing with their names.
- Each “sampler” is executed many times: the Summary Report provides us with the mean value and standard deviation of the response times, as well as the throughput, of each named “sampler”.
- Global values for the response times (mean and standard deviation) and for the throughput are also calculated in the Summary Report.
- The Summary Report allows you to store the measures of each run in a “csv” file: you can thus analyse and interpret the results in a spreadsheet program.
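If you prefer a script to a spreadsheet, the same per-sampler figures can be recomputed from the saved file. The sketch below assumes the CSV has a header row with `elapsed` (response time in milliseconds) and `label` (sampler name) columns; check the header of your own file, as the saved columns depend on your JMeter configuration. The inline sample data and the `summarize` function are made up for the example.

```python
import csv
import io
import statistics
from collections import defaultdict

# Stand-in for a results file; replace with open("results.csv", newline="").
SAMPLE_CSV = """timeStamp,elapsed,label,success
1000,100,Home,true
1200,140,Home,true
1400,120,Home,true
1500,300,Login,true
1900,340,Login,false
"""


def summarize(csv_file):
    """Group response times by sampler name and compute basic statistics."""
    by_label = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_label[row["label"]].append(int(row["elapsed"]))
    return {
        label: {
            "samples": len(times),
            "mean": statistics.mean(times),
            # Population standard deviation of the recorded samples.
            "stdev": statistics.pstdev(times),
        }
        for label, times in by_label.items()
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
for label, stats in summary.items():
    print(label, stats)
```

Because the grouping key is the sampler name, renaming samplers in the test plan changes how the lines are grouped here, exactly as in the report.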
Other reports are also useful particularly at the beginning when building and testing your scenarios:
- The View Results Tree is very handy when “debugging” a scenario as it lets you inspect all the HTTP requests and responses exchanged with the server. The drawback is that it consumes too much memory to be used in a large stress test.
- The View Results in Table listener is also useful in the early stages of the stress test implementation as it gives a good and fast overview of the execution of a test plan. However, this listener also consumes too much memory to be used in a large stress test.
- I have also found some very interesting JMeter plugins on a Google Code project. One of them, the “Active Threads Over Time” helped me a lot when trying to set the ramp up throughput by playing with the “ramp up” and “number of threads” parameters of the thread group.
One more element that you should have in mind when performing stress tests is the performance bottleneck of the computer running the tests themselves:
- It is very common when running stress tests on large production systems to reach the limits of the computer running the tests before reaching the limits of the tested server.
- When the computer running the tests is reaching its limits (memory, number of threads, CPU …), all the measures recorded by the stress test tool are wrong or at least biased.
- There are two ways to face this problem: (1) one is to optimize your scenarios and the way you run them, and (2) the second is to set up a distributed infrastructure.
(1) In the JMeter manual, you will find the following advice in section 16.6 of the Best Practices page: Some suggestions on reducing resource usage.
- Use non-GUI mode: jmeter -n -t test.jmx -l test.jtl
- Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
- Rather than using lots of similar samplers, use the same sampler in a loop, and use variables (CSV Data Set) to vary the sample.
- Don’t use functional mode
- Use CSV output rather than XML
- Only save the data that you need
- Use as few Assertions as possible
If your test needs large amounts of data, particularly if it needs to be randomised, create the test data in a file that can be read with CSV Data Set. This avoids wasting resources at run-time.
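As an illustration of preparing randomised data ahead of the run, here is a small Python sketch that writes a CSV file which a CSV Data Set element could then read. The column layout (user name, password, search term), the file name and the `write_test_data` function are all hypothetical; adapt them to the variables your scenario actually uses.

```python
import csv
import random

# Hypothetical values to pick from; replace with data meaningful to your application.
SEARCH_TERMS = ["laptop", "phone", "camera", "headphones"]


def write_test_data(path, rows, seed=42):
    """Write `rows` lines of randomised test data to a CSV file."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(rows):
            writer.writerow([
                f"user{i:04d}",                 # user name
                f"pw{rng.randrange(10**6):06d}",  # random 6-digit password
                rng.choice(SEARCH_TERMS),       # random search term
            ])


# Example: write_test_data("users.csv", 10000)
```

Generating the file once, before the test, keeps the random number generation out of the measured run.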
(2) In the JMeter manual, you will find the Remote Testing page, which gives precise instructions for setting up a distributed testing environment, and a PDF describing how it all works architecture-wise. My experience is that it is all very easy to set up and that it gives excellent results: in the end, it comes down to running the “jmeter-server” script on the slaves and listing the slave hosts in the master’s configuration file (the remote_hosts property in jmeter.properties). The only 2 or 3 little problems I came across with distributed testing are:
- Do not forget to give memory to your JMeter slaves and master: the default values are very low. Note that the heap size (-Xms and -Xmx) is a JVM option, so it is set in the JMeter startup script (or through the JVM_ARGS environment variable) rather than in jmeter.properties.
- If you use external resources such as a CSV Data Set, you should have them on all your slave installations under the same location (a full path is needed in your scenario).
- Beware of multiple thread groups and schedulers: in my experience they leak huge amounts of memory on the slaves.
Last but not least, you should never perform your stress tests against a server or infrastructure that was just started. Servers usually need a warm-up before they reach their full speed: this is particularly true for the Java platform where you surely don’t want to measure class loading time, JSP compilation time or native compilation time.
Interpret the results …
In order to interpret the results of a stress test, it is important to understand some basic elements of statistics:
(1) The mean value
The following equation shows how the mean value (μ) is calculated:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The mean value of a given measure is what is commonly referred to as the average value of this measure. An important thing to understand is that the mean value can be very misleading as it does not show you how close (or far) your values are from the average. An example is always better than a long explanation.
Let’s assume that we are measuring response times in milliseconds in 2 different stress tests:
Stress Test 1:
- x1=100
- x2=110
- x3=90
- x4=900
- x5=890
- x6=910
gives you μ = 1/6 * (100 + 110 + 90 + 900 + 890 + 910) = 500 ms
Stress Test 2:
- x1=490
- x2=510
- x3=535
- x4=465
- x5=590
- x6=410
gives you μ = 1/6 * (490 + 510 + 535 + 465 + 590 + 410) = 500 ms
In both cases the mean value (μ) is the same. However, if you look closely at the values taken by the response times, you will see that in the first case the values are “far” from the mean value, whereas in the second case the values are “close” to the mean value. It is quite obvious with this example that a measure of this distance to the mean value is needed in order to draw any kind of conclusion based on the mean value.
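The same observation can be reproduced in a few lines of Python. This is only a sketch using the two series of the example; the variable names are arbitrary, and the range (maximum minus minimum) is used here as the crudest possible indication of spread.

```python
import statistics

# Response times in milliseconds from the two example stress tests.
test_1 = [100, 110, 90, 900, 890, 910]
test_2 = [490, 510, 535, 465, 590, 410]

mean_1 = statistics.mean(test_1)
mean_2 = statistics.mean(test_2)

# Range of each series: a first, very rough look at how spread out it is.
range_1 = max(test_1) - min(test_1)
range_2 = max(test_2) - min(test_2)

print(f"test 1: mean={mean_1} ms, range={range_1} ms")
print(f"test 2: mean={mean_2} ms, range={range_2} ms")
```

Both means are identical while the ranges differ widely, which is exactly why the mean alone cannot be trusted.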
(2) The standard deviation
The following equation shows how the standard deviation (σ) is calculated:
$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
The standard deviation (σ) measures the typical distance of the values from their average (μ). In other words, it gives us a good idea of the dispersion or variability of the measures around their mean value. Let’s go back to our example and calculate the standard deviation of each of our theoretical stress tests:
Stress Test 1:
σ = sqrt( ( (100-500)^2 + (110-500)^2 + (90-500)^2 + (900-500)^2 + (890-500)^2 + (910-500)^2 ) / 6 ) = sqrt(960400 / 6) ≈ 400 ms
Stress Test 2:
σ = sqrt( ( (490-500)^2 + (510-500)^2 + (535-500)^2 + (465-500)^2 + (590-500)^2 + (410-500)^2 ) / 6 ) = sqrt(18850 / 6) ≈ 56 ms
The 2 values of the standard deviation calculated above are very different:
- in the first case, the standard deviation is high compared to the mean value (400 ms for a mean of 500 ms), which shows us that our measures are very variable (or mostly far from the mean value) and that the mean value is not very significant.
- in the second case, the standard deviation is low compared to the mean value (56 ms for a mean of 500 ms), which shows us that our measures are not dispersed (or mostly close to the mean value) and that the mean value is significant.
(3) The sample size and the quality of the measure
Another interesting question is whether our calculated mean value is a good estimation of the “real” mean value. In other words, when calculating the mean value of the response time during a test case, do we have a good estimation of the “real” mean response time of the same scenario repeated indefinitely? In probability theory, the Central Limit Theorem states conditions under which the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed. The measures of response times and throughput obtained during stress tests roughly meet these conditions, as we usually have a large number of random measures with a finite mean value and standard deviation (both calculated by JMeter), even if measures taken under load are not perfectly independent. We can thus assume that the mean values of the response time and the throughput are approximately normally distributed.
This allows us to calculate a Confidence Interval for these mean values. The Confidence Interval gives us a measure of the quality of our mean values as it allows us to calculate the variability of our mean value (the interval) for a predefined probability. You can for example decide to calculate your Confidence Interval at 95%: intervals calculated this way contain the “real” mean value about 95% of the time. Conversely, you can decide to calculate the probability of having your mean value within a given interval (see the examples below). The following equation shows how the Confidence Interval (CI) is calculated:
CI = [μ - Z*σ/√n, μ + Z*σ/√n]
where:
- μ is the calculated mean value of our sample,
- σ is the calculated standard deviation of our sample,
- n is the number of measures in our sample,
- and Z is the value for which the area under the “bell shaped curve” of the standard normal distribution between -Z and +Z equals the chosen Confidence C (so the area between 0 and Z is half of C). For example, C = 0.95 gives Z ≈ 1.96.
The following table gives values of Z for various given values of Confidence C:
C | Z |
0.80 | 1.281551565545 |
0.90 | 1.644853626951 |
0.95 | 1.959963984540 |
0.98 | 2.326347874041 |
0.99 | 2.575829303549 |
0.995 | 2.807033768344 |
0.998 | 3.090232306168 |
0.999 | 3.290526731492 |
0.9999 | 3.890591886413 |
0.99999 | 4.417173413469 |
Source: http://en.wikipedia.org/wiki/Normal_distribution
If we go back to our previous examples, we can calculate the confidence intervals of our mean values at 95%:
CI1 = [500 - 1.96*400/sqrt(6); 500 + 1.96*400/sqrt(6)] ≈ [180; 820]
CI2 = [500 - 1.96*56/sqrt(6); 500 + 1.96*56/sqrt(6)] ≈ [455; 545]
This means that we can be 95% confident that the “real” mean response time lies in the calculated confidence interval.
We can also calculate the probability of having the mean value in the interval [490, 510]:
10 = Z1 * 400 / sqrt(6) => Z1 = 10 * sqrt(6) / 400 => Z1 ≈ 0.06 => C1 ≈ 5%
10 = Z2 * 56 / sqrt(6) => Z2 = 10 * sqrt(6) / 56 => Z2 ≈ 0.44 => C2 ≈ 34%
Notes:
These are just given as examples of how to calculate the confidence interval … the conditions are not met for the Central Limit Theorem with such a small sample.
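For readers who want to redo these calculations on their own data, here is a minimal Python sketch. It assumes Python 3.8 or later (for `statistics.NormalDist`) and uses the population standard deviation, that is, the square root of the mean squared distance to the mean; the two function names are made up for the example.

```python
import math
from statistics import NormalDist, mean, pstdev

test_1 = [100, 110, 90, 900, 890, 910]
test_2 = [490, 510, 535, 465, 590, 410]


def confidence_interval(values, confidence=0.95):
    """Return (low, high) bounds of the confidence interval of the mean."""
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values)  # sqrt of the mean squared deviation
    # Z such that the area between -Z and +Z under the standard normal
    # curve equals the chosen confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width


def confidence_for_half_width(values, half_width):
    """Return the confidence associated with the interval mean +/- half_width."""
    z = half_width * math.sqrt(len(values)) / pstdev(values)
    return 2 * NormalDist().cdf(z) - 1


for name, values in (("test 1", test_1), ("test 2", test_2)):
    low, high = confidence_interval(values)
    c = confidence_for_half_width(values, 10)
    print(f"{name}: sigma={pstdev(values):.0f} ms, "
          f"95% CI=[{low:.0f}; {high:.0f}], C for +/-10 ms={c:.0%}")
```

With only 6 measures per series this remains an illustration of the arithmetic, for the reason given in the note above.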
Conclusion
In conclusion, we can say that the best way to interpret our stress test results is to use the Summary Report provided by JMeter and to store it in a “csv” file for every run. In this report we can find the mean response time, the standard deviation of the response time and the throughput, for every named sampler and globally for the run.
Based on the explanations above, I recommend the following methodology:
- If we have a high number of samples (which is usually the case in stress tests) and a low standard deviation, then we can conclude with little risk that we have a good estimation of the mean value of both the response time and the throughput of our system, and that the “real” values will be close to the calculated mean values.
- If we have a high number of samples (which is usually the case in stress tests) and a high standard deviation, we probably have a good estimation of the mean value but should nevertheless consider estimating a confidence interval. In any case, if the variability of the measure is high, a technical investigation is needed, as variability of response times and throughput usually points to instability of the system tested.
- If we have a low number of samples and a high standard deviation, then we almost certainly have a very bad estimation of the mean value, which means that we are measuring the wrong thing, the wrong way.
Monitor your systems while you run the tests …
It is often useful to monitor the system (and its various components) while you are stressing it. Various tools may be used, and they vary from one platform to another. On the Java platform you may use the excellent “jvisualvm”, which is provided with the latest versions of the JDK and interacts with the various monitoring hooks integrated in the JVM. Monitoring Java web applications is a subject in itself … I can try to share my thoughts on it some time … in another post.