Stephen commented about Tuning Manager and doing something useful with the data. I thought I would use this as an opportunity to highlight some of the things I look at on a regular basis (almost daily, including today in fact). Part I – Write Pending.
First a little bit of background; Write Pending is a measure of unwritten write I/Os which are in cache and haven’t been destaged to disk. A USP/9980V (and most other arrays) store write I/O s in cache and acknowledge successful completion to the host immediately. The writes are then destaged to disk asynchronously some time later.
Caching writes (which I could be wrong, could have been invented by IBM and called Cache Fast Write or DASD Fast Write, I can’t remember which was which) provides the host with a consistent response and helps to remove the mechanical delay of writing to disk. On average, most I/O will be a mixture of reads and writes and therefore an array can even out the write I/O to provide a more consistent I/O response. I see consistent I/O response as one of the major benefits or features of enterprise arrays.
Tuning Manager collects Write Pending (expressed as a percentage of cache being used for write I/O) in two ways; you can use Performance Reporter and collect real-time data from the agent/command device talking to the array. This can be plotted on a nice graph down to 10 second intervals. You can also collect historical WP data, which is retrieved from the agent on an hourly basis, providing an average and maximum figure. Hourly stats are collected for a maximum of 99 periods (just over 4 days) and can be aggregated into daily, weekly and monthly figures.
A rising and falling WP figure is not a problem and this is to be expected in the normal operation of an array, however WP can reach critical levels at which a USP will perform more aggressive destaging to disk; these figures are 30% and 40%. Once WP reaches 70%, inflow control takes over and starts to elongate response times to host write requests to limit the amount of data coming into the array. This obviously has a major impact on performance. I would look to put monitoring in place within Tuning Manager to alert when WP reaches these figures. Why would an array see high WP levels? Clearly, all the issues are write I/O related but they could include high levels of backup, database loads, file transfers and so on. It could also be an issue with data layout.
If an array can’t cope with WP reaching 30%+ then some tuning may be required; consider distributing workload over a wider time period (all backups may have been set up to start at the same time, for instance) and investigate for write activity how that data is laid out; distributing the writes over more LUNs and array groups could be a remedy. It also may be possible that you have to choose faster array group disks for certain workloads and/or purchase additional cache.
I think Tuning Manager could do with a few improvements in terms of how Write Pending data is collected. Firstly, unless I’m running Performance Reporter, I have no idea within an hourly period when the max WP was reached and for how long it stayed at this level. Some kind of indicator would be useful – an hour is a long time to trawl through host data.
I’d also like to know, when the Max WP occurs, which LUNs had the most write I/O activity. LDEV activity is also summarised on the hour so it’s not easy to see which LUNs within an hourly interval were those causing the WP issue; as an example, an LDEV could have been extremely active for 10 minute and caused the problem but, on average could have done much less I/O than another LDEV which was consistently busy (with a low I/O rate) for the whole 1 hour period. The averaging over 1 hour has a tendency to mask out peaks.
Finally, I’d like to be able to have triggers which can start more granular collection. So when a critical WP issue occurs, I want to collect 1 minute I/O stats for the next hour to an external file. I think it is possible to trigger a script from an alert, but I’m not clear on whether that trigger occurs at the point the max WP value is reached or whether it is triggered when the interval collection occurs on the hour (by which time I could be 59 minutes too late).
Next time, I’ll cover Side-file usage for those who run TrueCopy async.