This chapter covers the issues and considerations involved in designing storage subsystems. Guidance on actually implementing the ideas set out here can be found in General AIX Storage Management, and Practical Examples.
Designing storage subsystems involves evaluating the data access and availability requirements of the business processes that will run on the machine. Systems are generally used for more than one purpose (database applications may compete for resource with word processing and image-based applications, for example), and it is important to configure the environment so that each process can perform within its required tolerances. These tolerances are usually performance and availability related: user response time, and recovery in the event of error or failure, for example. As each process will have differing requirements, this task will of necessity involve some compromise. The design of the AIX storage management components, as covered in Operating System Software Components, allows great flexibility in organization. The logical volume manager allows the physical disks to be partitioned and the space thus created to be organized in different ways, enabling performance requirements to be met in one logical volume and availability requirements in another, for example.
The first task is therefore to evaluate the storage requirements of the application set that will be executed on the system in terms of:
Each of these areas will now be examined in more detail.
The design of the volume group and logical volume organization has a major impact upon performance, availability, and recovery. The first consideration in the process is volume group allocation.
The most common hardware failure in a storage subsystem is disk failure, followed by failure of adapters and power supplies. When failures of this type occur, recovery will be easier if a sensible volume group design has been implemented. Multiple volume groups should generally be implemented for the following reasons:
All physical volumes within a volume group must have the same physical partition size. In some cases greater granularity may be required in allocating physical partitions to logical volumes, and the only way to achieve this is to place the logical volumes with differing physical partition requirements into separate volume groups. An example of when this might be necessary is an environment where many small logical volumes need to be created for specialized file systems of a size that would entail much wasted space with 4MB partitions (2MB file systems, say). The greater flexibility afforded by the smaller partition is offset by increased performance overhead in the LVM.
If there is a requirement for implementing a file system in a non-quorum volume group, then a separate volume group that does not utilize quorum checking needs to be created.
In order to allow important confidential data to be removed and stored in a secure place when required, a volume group including physical volumes on removable disks should be created. At night, for example, the volume group can be exported and the disks with the sensitive data removed and kept in a secure place.
In order to reduce bottlenecks in a volume group with many journaled file systems, multiple JFS logs may be required.
In some cases there may be a requirement to share physical volumes between systems, for availability or shared access, for example. If the physical volumes so utilized are maintained in a separate volume group, then this volume group can be varied offline and exported for reuse on a second system (imported and varied on), without interrupting the normal operation of either system.
The number of volume groups created should therefore be decided based upon consideration of these points.
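As a brief illustration of several of these points, the commands below sketch how a separate volume group with 2MB physical partitions might be created, how quorum checking can be disabled for it, and how it can later be released for use on a second system. The disk and volume group names are purely illustrative.

    # Create a volume group with 2MB physical partitions on its own disks
    mkvg -y smallvg -s 2 hdisk3 hdisk4

    # Disable quorum checking for this volume group (takes effect at the
    # next vary on)
    chvg -Q n smallvg

    # Release the volume group for use on another system...
    varyoffvg smallvg
    exportvg smallvg

    # ...and pick it up again on the second system
    importvg -y smallvg hdisk3
    varyonvg smallvg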
The next consideration should be the number of physical volumes per volume group. This affects quorum checking and mirroring. A volume group with two disks and quorum checking enabled will fail to vary on if the disk holding two VGDAs fails (see Quorum Checking for a description of this process). With more than two disks, 51% or more of the VGDAs must become unavailable before the vary on fails and data becomes inaccessible. This is particularly important in a two-disk mirrored system: failure of the disk holding the two VGDAs will result in no access, even though a good copy of the data is still available.
Enough physical disks must also be included to support the mirroring strategy required, both in terms of space for the mirrored copies and the number of disks needed by the allocation policies. If mirroring is to be used for availability purposes, then there should be at least enough physical volumes, with enough space, to ensure that each copy can be stored on a separate physical volume. A disk failure in this scenario will not impact access to the data.
The deciding factors for the number of logical volumes to create are basically performance and availability: as many logical volumes should be created as there are distinct performance and availability requirements. The design of the logical volumes themselves to satisfy these requirements is covered in Planning for Performance, and Planning for Availability. Within this, however, there is the consideration of disk space utilization. Depending upon the intended purpose of the file systems that will be created within logical volumes, different fragment sizes may be required to make optimal use of the available disk space. As described in Fragmentation, choosing an appropriate fragment size can significantly improve disk space use. If there is a need for file systems containing many small files, then a separate logical volume should be created for each file system with different requirements.
The primary considerations when creating file systems are as follows:
Fragment size should be considered only if there will be many files in the file system smaller than 32KB, or if compression will be used. In the former case, the fragment size should be selected based upon the average size of the files, in order to minimize wasted space. For example, files smaller than 512 bytes, or files that grow in chunks of less than 512 bytes, are more economically stored in a file system with a fragment size of 512. Compression is discussed later in this section. This is an AIX Version 4 facility.
The number of bytes per i-node (NBPI) is described in Variable I-nodes. This parameter controls the number of i-nodes created in the file system. The main consideration is the number of expected files; if only a few large files will be stored, then increase the NBPI to reduce the number of i-nodes created, and hence free up the disk space that would have been used by the extra i-nodes. The NBPI and fragment size together directly affect the maximum possible size of the file system, and this is therefore a further consideration. File system size is discussed later in this section. This is an AIX Version 4 facility.
Compression is discussed in Compression. If disk space is at a premium and performance is not the major issue, then file system compression should be considered. Compression can reduce the amount of storage space required by files enormously, at the cost of the processing overhead of compressing and decompressing data. The algorithm operates on a fragment basis, and its effectiveness depends upon the type of information contained in the file. Larger fragment sizes will help to offset the performance overhead by reducing the number of allocation requests and physical I/Os. This is an AIX Version 4 facility.
The size of the file system should be chosen to be large enough to accommodate the required files. It is better to err on the small side, as file systems can easily be expanded as the limit approaches, while reducing them (which can be done) is considerably more work. Free space within file systems lost due to fragmentation can be recovered using the defragfs command, which is discussed in Storage Management Files and Commands Summary.
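As a sketch of how these considerations translate into file system creation options under AIX Version 4 (the names and sizes are purely illustrative), the following command creates a 16MB journaled file system intended for many small files, with 512-byte fragments, one i-node for every 2048 bytes, and compression enabled:

    # The size is given in 512-byte blocks: 32768 blocks = 16MB
    crfs -v jfs -g datavg -m /smallfiles -a size=32768 \
         -a frag=512 -a nbpi=2048 -a compress=LZ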
Having created volume groups and added the required number of physical volumes, the logical volumes and file systems need to be created. There are two basic considerations: performance and availability. Generally, designing for high performance will impact availability, and vice versa. The next two sections look at design from these perspectives.
The performance of a disk subsystem is a combination of factors that includes:
This includes the physical performance capabilities of the adapter, as well as the organization of devices using the adapter. In order to maximize performance for high-speed disk devices on an adapter, the characteristics of the adapter should be considered. For example, SCSI adapters can support multiple devices operating in either synchronous or asynchronous mode (see Small Computer System Interface Adapter for information on SCSI technology). To achieve maximum throughput for a synchronous disk device, only other synchronous devices should be attached to the adapter, and the total bandwidth (or throughput) of these devices should not exceed the capabilities of the adapter itself.
The same considerations apply to adapters of other types. If multiple devices are supported on the adapter, the total bandwidth available should not be exceeded.
Finally, the fastest adapter that meets the environmental requirements of the site (in terms of cable lengths and devices supported) should be selected to maximize performance. Other features, such as command tag queuing (on some SCSI adapters) and differential operation (to reduce errors), will also improve performance.
Physical disk drives themselves support different levels of function in their hardware. Some drives support bad block relocation and elevator seek functions internally for example (see Disk Storage for a discussion of disk technology). Off-loading these functions from the LVM will increase performance.
Some subsystems, such as the 7135, support striping within the subsystem (RAID 0) which will also increase performance.
Again, selecting disks with the fastest overall read/write performance figures should be the policy for maximizing performance.
Selecting the highest performance hardware goes a long way toward maximizing the performance of a disk subsystem, but the software implementation, in terms of data placement on the disks and access methods (random or sequential), is also vital to the overall result.
Under AIX Version 4, the LVM supports striping, which means that the logical partitions of a logical volume can be spread across multiple disks and therefore accessed concurrently (see Disk Striping for a discussion of striping). Striping will maximize performance for sequential reads and writes, where the LVM can schedule consecutive reads and writes simultaneously to blocks on different disks. Performance will be further enhanced when the disks are on different adapters, thereby allowing full concurrency. Setting up striping is discussed in Striped Logical Volumes.
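A striped logical volume might be created along the following lines (the names, size, and stripe unit are illustrative); the stripe width is determined by the physical volumes listed on the command line, which should ideally be attached to separate adapters:

    # 100 logical partitions striped across two disks with a 64KB stripe
    # unit size
    mklv -y stripedlv -S 64K datavg 100 hdisk2 hdisk5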
Whether striping is to be used or not, the placement of the data on the disk surface itself affects the performance of the subsystem. The LVM provides a number of parameters at logical volume setup that govern the policies it will enforce in terms of data placement and access. These policies are explained in Logical Volume Manager Policies. In order to maximize performance, the following policies should be adopted:
For maximum performance, logical partitions should be allocated in the center of the disk.
For maximum performance, the maximum number of physical volumes available should be used for the logical volume's logical partitions. This allows the LVM to schedule long sequential reads or writes across physical disks in parallel.
Mirroring should generally be disabled for maximum performance. If it is required, however, then the scheduling policy should be set to parallel and the allocation policy to strict. This causes the LVM to place copies on separate physical volumes and to perform writes in parallel, thereby maximizing performance. In addition, reads will be scheduled to whichever copy of the required data is closest to a disk head, improving read performance. Write verification and mirror write consistency should also be set to no. This prevents the LVM from spending an extra disk revolution on every write to read back the data for validity, and also stops the LVM from waiting for all writes to the copies to succeed before returning successful completion of the write.
Adopting these policy settings in the LVM will maximize performance, but at the expense of availability. If availability is an equally great issue, then compromises will be necessary, such as using mirroring with reduced efficacy (as described in this section).
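As a sketch of how the performance policies above map onto logical volume creation options (the names are illustrative; confirm the exact flags against the mklv documentation for the installed level of AIX):

    # Center intra-disk allocation (-a c), maximum inter-disk range (-e x),
    # single copy (no mirroring)
    mklv -y fastlv -a c -e x datavg 50

    # If mirroring is required anyway: two copies (-c 2), parallel
    # scheduling (-d p), strict allocation (-s y), write verify off (-v n),
    # mirror write consistency off (-w n)
    mklv -y fastmirlv -a c -e x -c 2 -d p -s y -v n -w n datavg 50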
With regard to maximizing performance from the file system point of view, there are several configuration options that can be taken at file system creation time:
Using the largest fragment size of 4096 bytes will minimize allocation operations and maximize throughput from the file system. This could be at the expense of space utilization within the file system, depending upon the sizes of the files within it (see Planning Disk Utilization for a discussion of maximizing disk space utilization).
Using compression increases the overheads of reads and writes to the file system, and therefore to maximize performance, compression should not be used.
If many file systems are using the same log device, then this can introduce a bottleneck. Reducing the number of file systems concurrently accessing a log device will avoid this problem; to do this, multiple log devices should be created within the volume group (a command sketch follows this list). To maximize performance, the log device should be on a different physical volume, and preferably on a different adapter, from the file systems sharing the log.
The JFS uses 4KB buffers for reading and writing. For writes, success is returned to the requesting application as soon as the data has been received into the buffer; the actual physical write is not done until the buffer is full. This means that simply using the JFS can improve disk I/O performance.
Fragmentation of the file system will have an adverse effect on I/O performance, as it increases the number of seeks required to access information. The smaller the fragment size selected, the worse this problem can become. Regular defragmentation of the file system will alleviate this problem and can be accomplished using the defragfs command, discussed in Storage Management Files and Commands Summary.
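The following sketch (with hypothetical names) shows how an additional JFS log might be created on its own disk, and how a fragmented file system can be reorganized. Note that a file system must be directed to use the new log (the log attribute in its /etc/filesystems stanza) before it will benefit.

    # Create and format a new JFS log logical volume on a separate disk
    mklv -t jfslog -y datalog datavg 1 hdisk6
    logform /dev/datalog

    # Reorganize a fragmented file system in place
    defragfs /home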
There are a number of operating system parameters that affect the performance of I/O subsystems. These parameters should be adjusted with caution as they have system-wide scope; this means that though they may radically improve performance for one application using one logical volume, they may have a detrimental effect on others:
This is a Virtual Memory Manager feature that allows the VMM to read in pages of information from disk before they are required. If the VMM suspects that a large sequential read is about to take place, it will use the values set in minpgahead and maxpgahead to decide on how many extra pages ahead of the current one to read in. This means that when requests for subsequent pages arrive, the required pages are already in memory and time is saved. More detail on the setting of these parameters is available in the InfoExplorer section Tuning Sequential Read Ahead.
This feature is intended to prevent programs that generate very large amounts of I/O from saturating the I/O queues with requests, thereby causing the response time of less demanding applications to deteriorate. Disk I/O pacing enforces high and low water mark values on the number of I/O requests that can be outstanding for any memory segment (effectively, for any file). When the number of outstanding I/O requests for a segment reaches the high water mark, the process making the requests is put to sleep until the number of outstanding requests has fallen to the low water mark.
This feature is set to off by default. See the InfoExplorer article on Use of Disk-I/O Pacing for further details on tuning these parameters.
When there are multiple requests in a SCSI device driver's queue, the driver attempts to coalesce them into a smaller number of larger requests. The largest request size (in terms of data actually transmitted) that the device driver will build is limited by the max_coalesce parameter. See the InfoExplorer section on Modifying the SCSI Device Driver max_coalesce Parameter for details on adjusting this setting.
It is possible to enforce a limit on the maximum number of outstanding requests on a queue for a given SCSI bus or disk drive. Setting these parameters can improve performance for those devices that do not provide sophisticated queue handling algorithms. See the InfoExplorer section on Setting the SCSI-Adapter and Disk-Device Queue Limits for further information on adjusting these parameters.
The LVM uses a construct called a pbuf to control pending disk I/O. In AIX Version 3, a pbuf is required for each page being read or written, which for applications with heavy I/O can result in pbuf pool depletion. In AIX Version 4, a pbuf is used for every sequential I/O request, regardless of the number of pages involved, thereby reducing the load on the pbuf pool. It is possible to tune the number of pbufs available, and in some cases this can improve performance. See the section in InfoExplorer on Controlling the Number of System pbufs, for further information.
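The commands below sketch how some of these parameters might be adjusted; the values shown are purely illustrative and should be derived from the InfoExplorer guidance referenced above. The vmtune command ships as a sample program in the bos.adt.samples fileset, and not every disk type allows its queue depth to be changed.

    # Sequential read ahead: raise minpgahead and maxpgahead
    /usr/samples/kernel/vmtune -r 2 -R 16

    # Disk I/O pacing: set the high and low water marks on sys0
    chdev -l sys0 -a maxpout=33 -a minpout=24

    # Queue depth for an individual disk (the device must not be in use,
    # or use -P to apply the change at the next reboot)
    chdev -l hdisk2 -a queue_depth=8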
The final performance considerations are at the application level. The design of the application can have a major effect upon performance. It is not always possible to affect the way in which applications operate, but on those occasions where it is, the following considerations should be taken into account:
Applications can use asynchronous disk I/O, which means that control returns to the application from the read/write as soon as the request has been queued. The application can then continue working while the physical disk operation takes place. Obviously not all applications will be able to take advantage of this feature, but for those that can, the performance benefits are significant. More information on this feature can be found in InfoExplorer in the section on Performance Implications of Asynchronous Disk I/O.
In a similar fashion to asynchronous disk I/O, the sync() system call schedules a write of all modified memory data pages to disk but returns immediately, whereas the fsync() call does not return until the writing is complete. Applications that must know whether the write was successful will not be able to take advantage of sync(), but for those that can, the performance benefits can again be significant. For more information, see the InfoExplorer section on Performance Implications of sync/fsync.
Examples of using LVM and file system configuration commands to maximize performance are detailed in A Design Example for Improved Performance.
Designing a disk subsystem for availability also involves a number of considerations, including:
From an availability standpoint, it is better to design a storage subsystem using more rather than fewer adapters. This is of real benefit where mirroring is to be used; having mirrored copies on separate adapters means that failure of one adapter still leaves the information accessible from the copy on the other adapter.
Redundancy is one of the most important mechanisms for ensuring availability. This entails having backups for all vital system components in much the same way as multiple adapters and mirroring above; within storage subsystem components, this really means having backup power supplies, cooling fans, adapters, data paths, and spare disk drives that can be automatically switched in when required, with no service or information loss.
At a pure operating system level, redundancy is limited to mirroring and multiple adapters. Much greater availability guarantees can be achieved using the features of external devices providing many of the backup features discussed. Devices available that support these kinds of features include the IBM 7135 and the 9570. Details of these and other similar devices can be found in Disk Storage Products.
This set of performance and availability features is discussed in Selecting the Correct Disk Storage Devices. RAID levels 1, 3, and 5 provide increasing levels of high availability and performance external to the operating system. The attached subsystem performs all of the RAID functionality under the covers, and presents an ordinary disk drive interface to the operating system. Subsystems which support RAID to varying degrees include the IBM 7135, IBM 3514, and IBM 9570. A brief discussion of these and other devices can be found in Disk Storage Products.
As in the case of performance, there are several options that can be taken at logical volume creation time to maximize availability of data. These options include policies governing the placement of data on the physical disks and mirroring. Data can be divided into two categories, operating system data, and user data, and the mirroring setup is slightly different for each case:
This procedure will maximize availability of the operating system. At least three physical disks should be in the rootvg to ensure that a quorum will always be available in the event of a single disk failure.
Mirroring the root volume group involves setting up one or two copies of each logical volume in the volume group; the procedure for actually implementing this is explained in rootvg Mirroring - Implementation and Recovery.
The Non-Volatile RAM must be updated to reflect the new disks available as boot devices, so that in the event of failure of the main copy of the rootvg, a reboot can be effected from another copy.
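In outline, and assuming a second disk hdisk1 has been added for the purpose, the procedure involves steps along the following lines; the full procedure, including the ordering of the logical volumes and the recovery considerations, is given in rootvg Mirroring - Implementation and Recovery.

    extendvg rootvg hdisk1            # add the new disk to rootvg
    mklvcopy hd5 2 hdisk1             # mirror the boot logical volume
    mklvcopy hd6 2 hdisk1             # mirror the paging space
    mklvcopy hd4 2 hdisk1             # mirror / (repeat for the other LVs)
    syncvg -v rootvg                  # synchronize the new copies
    bosboot -a -d /dev/hdisk1         # create a boot image on the new disk
    bootlist -m normal hdisk0 hdisk1  # update the NVRAM boot device list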
Mirroring user data also involves creating copies of all of those logical volumes requiring high availability. For maximum availability, the following policies should be selected:
The number of copies of a logical volume maintained by the LVM can be one, two, or three. Maximum protection against failure is provided using three copies, though at increased overhead. Again, availability is mainly achieved at the cost of performance, though this will depend on the intended usage of the mirrored logical volume. If the volume will mainly be used for reading, then performance can be enhanced, as the LVM will schedule reads to the disk head closest to the required data.
Confining the logical partitions comprising each copy of the mirrored logical volume to the minimum number of disks will optimize availability. Ideally, each copy should be on a separate physical volume, which is itself attached to a separate adapter. To enable this, the range parameter should be set to minimum and the strict parameter to yes; this forces the LVM to restrict each copy of the logical volume to as few disks as possible, and to keep the copies separate (no two copies of a logical partition may share a disk). Prior planning to ensure that each physical disk has space to hold the entire logical volume will allow each copy to be kept on a single physical volume.
The actual location of the data on each physical disk has no direct impact on availability. If, however, a center policy is selected, for example, then although the LVM will try to fulfill the request for all of the mirrored copies, the inter-disk allocation policy takes preference. This means, essentially, that if there are not enough center-located logical partitions on a disk, then rather than spreading the logical volume to another disk, edge or middle located partitions will be used instead.
For maximum availability, a sequential policy should be adopted. This means that writes will be scheduled one after the other to all copies of the logical volume, each write having to complete before the next occurs. This maximizes the chances of at least one copy surviving in the event of a system crash during the process.
This feature should be switched on for maximum availability. Write verification means that after every write to a disk, the data written is read back to ensure its validity. This does have performance implications as every write will involve one extra disk revolution for the read verify.
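These policies might be expressed at logical volume creation time roughly as follows (the names are illustrative; mirror write consistency is left at its default of on):

    # Three copies (-c 3), minimum inter-disk range (-e m), strict
    # allocation (-s y), sequential scheduling (-d s), write verify on (-v y)
    mklv -y safelv -c 3 -e m -s y -d s -v y datavg 50 hdisk1 hdisk2 hdisk3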
Using the JFS provides some availability features over writing to raw logical volumes, or using NFS for example. The JFS records all changes to the meta-data of a file system into a log (file systems are explained in File Systems). If there should be a system crash, on reboot, the log is replayed and the file system returned to its last consistent state. This prevents corruption of the file system and thereby assists in maintaining higher availability.
Applications themselves can be designed to be availability aware. In the reverse of the requirements for high performance, applications should avoid asynchronous I/O to ensure that any data written is committed to disk before continuing and risking inconsistencies. In addition, the fsync() system call should be used rather than sync(), so that the application can be sure that all modified pages in memory have been written before continuing.
The section on application oriented performance considerations earlier in this section gives more details on these operating system calls.
The rest of this chapter looks at backup strategies, and the elements involved in planning them.
As soon as the system has been set up and the operating environment configured as required, a backup strategy should be implemented immediately. From this point on, valuable data will be created and stored within the storage subsystem that represents time and effort, and in most cases supports the business. The organization of the system (operating system data and applications) and the user information created (files and directories) are subject to misadventure, however carefully managed; files can be accidentally erased, and hardware or software faults can destroy some information or even the entire system. For these reasons, it is important to be able to recover the system to a point at which work can continue. Backing up the system involves making copies of all the information contained in it on some medium that can be stored separately. The copies can then be used to recreate the system after a failure has been repaired, or to recover information that has been accidentally lost. The information in the system is usually highly dynamic, and therefore frequent copies, or updates to the copies (also known as incremental backups), should be taken. The frequency and content of the full backups or updates is unique to each business, and depends upon the rate of change of information and the relative importance of that information. Evaluating this is the process of developing a backup strategy. The following points should be considered:
Consider every potential catastrophe, however unlikely, and determine whether recovery would be possible. If the backup media was lost in some natural disaster, would recovery be possible? Obviously, it is necessary to factor in the likelihood of a particular disaster, but this must be done in conjunction with consideration of the value of the data. It's not much comfort to reflect how unlikely the ball of lightning that destroyed the backup media inside a safe was, when the business is ruined as a result.
There are many different types of backup media, each with varying degrees of reliability and longevity (see Tape Storage for a comparison of backup media). Check the condition of backups on a regular basis to ensure that they are still usable.
Although it is a good policy to develop a regular cycle for reusing backup media, complete copies should be maintained for some time as it can often be a while before it is noticed that a particular file is damaged or unusable, by which time the backup copies may contain copies of the damage. It is therefore a good plan to implement a recycling policy such as the following:
Keep the quarterly backups indefinitely. This will always ensure the ability to access information up to three months old at various levels of currency.
Making a backup of a damaged file system only preserves the damage: restoring it after a failure will reproduce the damaged file system. It is therefore a good idea to check the integrity of file systems before backing them up (see the sketch following this list).
Files that are in use during a backup may differ from the backed-up copy. Backups should therefore be taken while the system, or the files being backed up, are not in use.
Major changes introduce the possibility of errors and hence loss of data. It is always sensible to take a backup prior to any such activity.
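As a brief sketch of the integrity and quiesce points above (the file system and device names are illustrative), a file system can be taken offline, checked, and backed up before being returned to service:

    umount /home      # take the file system offline so no files are in use
    fsck /dev/hd1     # verify its integrity before the backup is taken
                      # ...take the backup while the file system is quiet...
    mount /home       # return it to service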
There are two main types of backup:
In a complete system backup, a copy is made of everything on the system. This can then be used to completely restore the system in the event of a failure. The complete backup can contain both operating system and user data, although it is more sensible to maintain these two separately: operating system data and user data typically change at different rates, and separate commands exist for backing up the root volume group and user volume groups.
In an incremental backup, only the data that has changed since the last backup is backed up.
A complete system backup policy should be used when data does not change too often. The backups should be scheduled at a frequency that allows complete recovery of business critical information. For example, if database update runs are done weekly, then a backup after the run each week is sensible.
An incremental policy should be used when information is extremely dynamic. Full system backups are taken at a fixed interval, and within that interval backups of the changed information are taken at shorter intervals. The frequency of the incremental backups depends upon the criticality of the information, as in the complete system backup policy, and upon the volume of data that changes. As recovery with incremental backups requires reloading the last full backup and then applying the incremental backups taken up to the point of failure, the incremental frequency should strike a balance between the criticality of the information and the number of incremental backups that will need to be applied.
There are several ways of backing up information:
This method is also called file name archive, and should be used to back up individual files or directories as required. This mechanism is most commonly used by users to make backups of their own files.
This method is also known as backup by i-node, and is used to back up entire file systems. This mechanism should be used by the system administrator to back up large groups of files. It is also the mechanism used for incremental backups.
This method allows complete volume groups to be backed up. There are separate commands for the root volume group and for user volume groups.
The following commands can be used to implement the backup policy created:
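Representative AIX commands for the three methods just described are sketched below; the device and path names are illustrative.

    # Backup by file name (user-level archives of selected files)
    tar -cvf /dev/rmt0 /home/fred
    find /home/fred -print | backup -i -f /dev/rmt0

    # Backup by i-node (full or incremental file system backups; -0 is a
    # full, level-zero backup, higher levels are incremental)
    backup -0 -u -f /dev/rmt0 /home

    # Complete volume groups
    mksysb /dev/rmt0               # root volume group (bootable image)
    savevg -f /dev/rmt0 datavg     # a user volume group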
These commands are described in detail in Storage Management Files and Commands Summary. Examples of backing up a system are detailed in Managing Backup and Restore.
So far, backup purposes, policies, and commands have been discussed. This leaves the important topic of the actual media that the backup will be stored on. Tape devices are the most common backup or long term archive medium, though there are still several considerations:
This should be governed by the requirements below, but is more often just a question of cost per megabyte. The most important factors should be the reliability, longevity, performance and capacity.
Volume of information should be considered too, as this might suggest a tape library. Short term archive may suggest an optical device. These decisions are covered in Selecting the Hardware Components.
It is important to consider the length of time that a backup will require; this is a function of the volume of data and the speed with which the device can write it (a worked example follows this list). If backups need to be taken every evening, then a device capable of completing the process in the time available should be chosen.
This is a question of cost, storage space, and ease of use. The more information that can be packed onto the media, the better generally, as this means that less media will be required for backups. Storage space is therefore less, and if a single cartridge is sufficient, no operator intervention may be required. If multiple cartridges will be required, and operator intervention is not possible, then a tape library should be considered.
As was mentioned earlier in this section, the length of time that the media can be safely stored is important. This governs not only the backup cycles, but for how long the media can be safely reused.
This is a very important issue. Although checks on the success of a backup can and should be performed, unreliable devices that have a high percentage of errors, and produce occasional unreadable backups, are time consuming and dangerous. Some devices encounter read back problems too, even though the media copy may be good.
Device technology improves and changes with time. It is a good idea to ensure that the next generation of devices will support existing archive and backup media.
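As a simple illustration of the throughput consideration above, using purely hypothetical figures: backing up 4GB of data to a drive that sustains 500KB per second takes roughly 4,194,304KB divided by 500KB/s, or about 8,400 seconds (around two and a third hours), before allowing for tape changes or any verification pass. If the nightly backup window is only two hours, a faster device, or less data per night, is required.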
These considerations are discussed in relation to tape technology in Tape Storage.
This chapter has looked in detail at the planning and design requirements for storage subsystems. Design of the subsystem involves considering the following points:
How the physical subsystem will be organized from the perspective of:
For those applications that require high performance, designing the storage subsystem for maximum performance. This involves optimizing:
For those applications that require high availability, designing the storage subsystem for maximum availability. This involves optimizing:
Designing a strategy that allows for as full a recovery as necessary in the event of failure, for the business to continue. This entails the following: