AIX Version 4 Storage Management Enhancements

AIX Version 4 provides enhanced functional capabilities in the area of storage management. This chapter examines these enhancements in some detail.

Fragmentation

Fragmentation is a concept introduced in BSD UNIX which enables system administrators to manage file systems in such a way that they make more efficient use of the disk storage space available to them.

Research conducted in the area of disk space utilization has revealed that up to 45% of disk space is wasted by file systems that use a 4KB block as the allocation unit. In AIX releases prior to AIX Version 4, the disk space allocation unit is fixed at 4KB and is not tunable, potentially giving rise to much wasted disk space. In AIX Version 4, it is now possible to create journaled file systems with an allocation unit, or fragment size, of 512, 1024, 2048 or 4096 bytes.

Although this enhancement offers a distinct advantage in ensuring optimal disk space utilization, it can sometimes come at the expense of performance.

In AIX Version 4, as many whole fragments as necessary are used to store a file or directory's data. Consider that we have chosen to use a JFS fragment size of 4KB and we are attempting to store file data which only partially fills a JFS fragment. Potentially, the amount of unused or wasted space in the partially filled fragment can be quite high. For example, if only 500 bytes are stored in this fragment then 3596 bytes will be wasted. However, if a smaller JFS fragment size, say 512 bytes, was used, the amount of wasted disk space would be greatly reduced to only 12 bytes. It is, therefore, better to use small fragment sizes if efficient use of available disk space is required.
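The arithmetic above can be reproduced with a short sketch. The function name `wasted_bytes` is purely illustrative and is not part of AIX; it only assumes, as the text states, that data is stored in whole fragments.

```python
def wasted_bytes(data_bytes, fragment_size):
    """Unused bytes in the last fragment when only whole fragments are allocated."""
    fragments = -(-data_bytes // fragment_size)  # ceiling division
    return fragments * fragment_size - data_bytes


# Compare the four fragment sizes supported by AIX Version 4 JFS
# for the 500-byte example used in the text.
for size in (512, 1024, 2048, 4096):
    print(f"fragment size {size:4d}: {wasted_bytes(500, size):4d} bytes wasted")
```

For 500 bytes of data this gives 3596 wasted bytes with a 4KB fragment and 12 wasted bytes with a 512 byte fragment, matching the figures quoted above.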

Although small fragment sizes can be beneficial in reducing disk space wastage, this can have an adverse effect on disk I/O activity. For a file with a size of 4KB stored in a single fragment of 4KB, only one disk I/O operation would be required to either read or write the file. If the choice of the fragment size was 512 bytes, eight fragments would be allocated to this file and for a read or write to complete, several additional disk I/O operations (disk seeks, data transfers and allocation activity) would be required. Therefore, for file systems which use a fragment size of 4KB, the number of disk I/O operations will be far less than for file systems which employ a smaller fragment size.

For files greater than 32KB in size, whatever the fragment size, allocation is performed in logical blocks of 4KB. The i-node pointers therefore point to 4KB logical blocks as before (indirection is also the same). For files of up to 32KB in size, fragments come into play.

Consider a file of 17KB in size. The first 16KB of this file is allocated logical blocks as before, and the disk addresses of these blocks are held in the first four pointers in the i-node. The last 1KB of the file is allocated sufficient contiguous fragments to contain the remaining data, if available; assuming a fragment size of 512 bytes, this is two fragments. The fifth pointer in the i-node points to the disk address of the first fragment.

In AIX Version 3, the first four bits of the disk block addresses were unused and therefore zeros. In AIX Version 4, when fragmentation is used, the last three of these bits indicate the number of fragments, starting from the disk address, that are required. To ensure compatibility with previous releases, all zeros in these bits implies a full block of fragments (which means that the disk address references a 4KB logical block), and therefore in this example eight fragments (8 - 0 = 8 fragments). For the final data in the example file, two fragments are required, so the number in the four bits should be 0110 (binary for 6: 8 - 6 = 2 fragments). The JFS always subtracts the value of the first four bits from eight to see whether there are fragments at the disk address in the remainder of the pointer and, if so, how many.

The next file that is to be written is exactly 4KB in size. This is one complete logical block, and will be written immediately after the two fragments from the previous example, thus wasting no space. The first four bits will be zeros, indicating eight fragments. If there had not been eight contiguous fragments free anywhere in the file system, this write would have failed. See Figure - Fragmentation Example for a diagram of the allocation for this first file (file X).

Now a third file is to be written (file Y). File Y is four fragments (or 2048 bytes) in length. There are four fragments free after file X, so file Y is written immediately after the fragments for file X. If there had not been four contiguous fragments available, this write would have failed.

File X is now extended by three fragments (or 1536 bytes). There are three contiguous fragments available after file Y, so the extension is written there (as shown in Figure - Fragmentation Example). Note that both the pointer for file Y's fragments and the pointer for the second block of file X indicate partial blocks of fragments in the first four bits of the address (file X: 0101 = 5; 8 - 5 = 3 fragments. File Y: 0100 = 4; 8 - 4 = 4 fragments).
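The "subtract from eight" rule can be illustrated with a small sketch. The 32-bit pointer layout used here (four leading bits for the count field, 28 bits for the address) and the names `encode_pointer` and `decode_pointer` are assumptions made for illustration only; the text states just that the first four bits carry the field and that the fragment size in the example is 512 bytes.

```python
FRAGS_PER_BLOCK = 8   # one 4KB logical block holds eight 512-byte fragments
ADDR_BITS = 28        # assumption: 32-bit pointer, top four bits hold the count field


def encode_pointer(address, fragments):
    """Pack a disk address and a fragment count (1..8) into one pointer word."""
    if not 1 <= fragments <= FRAGS_PER_BLOCK:
        raise ValueError("fragments must be between 1 and 8")
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in the assumed address field")
    field = FRAGS_PER_BLOCK - fragments   # all zeros means a full 4KB block
    return (field << ADDR_BITS) | address


def decode_pointer(pointer):
    """Return (address, fragments) by subtracting the leading field from eight."""
    field = pointer >> ADDR_BITS
    address = pointer & ((1 << ADDR_BITS) - 1)
    return address, FRAGS_PER_BLOCK - field


# Two trailing fragments: the leading four bits become 0110.
pointer = encode_pointer(0x100, 2)
print(format(pointer >> ADDR_BITS, "04b"), decode_pointer(pointer))
```

A pointer whose leading bits are all zero decodes to eight fragments, which is how an address written by an earlier release would still be read as a full 4KB block under this scheme.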

It is very important to note the following:

  1. Fragments are allocated contiguously or not at all

    If the JFS cannot find sufficient contiguous fragments (up to 4KB worth), allocation, and therefore the write, will fail. To elaborate, suppose that a new file Z of 11 fragments needed writing (following on from the preceding example). If there were eight free contiguous fragments after file Y, and three free contiguous fragments before file X, then file Z could be written. Having nine free fragments after, and two before, would not be sufficient.

  2. Fragments will lead to free space fragmentation

    As files are extended, reduced, and deleted, small groups of free fragments will begin to become available. For example, if file X in Figure - Fragmentation Example is reduced in size to six fragments, two free fragments will appear between files X and Y. These will be used if another file of one or two fragments is created, or if a file is extended by two fragments beyond a 4KB boundary; otherwise they will remain unused. In order to reclaim this fragmented free space, a tool is provided that will reorganize the file system to coalesce this space as far as possible. The defragfs command is described in Storage Management Files and Commands Summary.




Figure: Fragmentation Example

To expand on fragment allocation one step further, only the last 4KB block of a file can be partially allocated (that is to say, allocated as fragments rather than as a full logical block), and that block must be referenced directly by the i-node, not through indirection. When blocks (4KB) or partial blocks (a multiple of the fragment size) are allocated, contiguous space must be available for them. Hence the statement in the last example about extending a file by two fragments beyond a 4KB boundary. If a file of 9 fragments is extended to 18 fragments, then this is the case: two full blocks of 8 contiguous fragments each, and 2 further contiguous fragments, must be found for the write to succeed.
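As a rough sketch of the rule that only the last block may be partial, the following splits a file's length in fragments into the contiguous runs that must each be found. It assumes 512 byte fragments (eight per logical block) and a file small enough to be referenced directly by the i-node; `contiguous_runs` is a hypothetical helper name.

```python
FRAGS_PER_BLOCK = 8   # assumes a 512-byte fragment size


def contiguous_runs(total_fragments):
    """Sizes, in fragments, of the contiguous runs needed to hold a file."""
    full_blocks, tail = divmod(total_fragments, FRAGS_PER_BLOCK)
    runs = [FRAGS_PER_BLOCK] * full_blocks
    if tail:
        runs.append(tail)   # only the last block may be a partial block
    return runs


print(contiguous_runs(18))   # two full blocks plus two fragments
print(contiguous_runs(11))   # one full block plus three fragments
```

The second call corresponds to the 11-fragment case described earlier: a run of eight and a run of three are needed, which is why nine free fragments in one place and two in another would not do.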

Disk Space Allocation

For file systems with a fragment size smaller than 4KB, there is likely to be an increase in allocation activity when existing files or directories are extended.

As an example, assume that a file is extended by 500 bytes and that the file system fragment size is 512 bytes. This will result in one allocation to this file of a 512 byte fragment. If the file is extended by another 500 bytes, another allocation of a 512 byte fragment will be made to this file. So far, two allocation operations have been performed. However, with a file system fragment size of 4KB, the first file extension operation would have involved one allocation to this file of a 4KB fragment, and the second file extension operation would not have resulted in an allocation, as there would have been sufficient space from the first allocation. The number of allocations made in the file system using a 512 byte fragment could have been minimized if the two separate file extension operations had been performed as one extension of 1024 bytes. Although two 512 byte fragments would still be allocated, this would involve only one file system operation to complete.
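The example can be modelled with a simplified counter. This is a sketch under the assumption that each extension which needs more space costs exactly one allocation operation, as in the text; `count_allocations` is an illustrative name, not an AIX interface.

```python
def count_allocations(extensions, fragment_size):
    """Count allocation operations for a sequence of file extensions (in bytes)."""
    size = 0         # bytes of data currently in the file
    allocated = 0    # bytes of disk space currently allocated to the file
    operations = 0
    for extra in extensions:
        size += extra
        if size > allocated:
            shortfall = size - allocated
            fragments = -(-shortfall // fragment_size)   # ceiling division
            allocated += fragments * fragment_size
            operations += 1   # one operation, however many fragments it covers
    return operations


print(count_allocations([500, 500], 512))    # two separate extensions
print(count_allocations([500, 500], 4096))   # same extensions, 4KB fragments
print(count_allocations([1024], 512))        # one combined extension
```

Under this model the 512 byte file system performs two allocations for the two 500 byte extensions, while the 4KB file system and the single 1024 byte extension each need only one.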

Free Space Fragmentation

As has been mentioned, free space fragmentation is much more likely within a file system that uses smaller fragment sizes. To clarify, assume that there is a portion of the disk consisting of 8 contiguous 512 byte fragments and that four files, each 500 bytes in size, have been written to these fragments in a non-contiguous manner. The free disk space within this area of the disk consists of four unallocated 512 byte fragments, which are also non-contiguous. A file extension operation requiring 2048 bytes could not be allocated these free fragments, as they would have to be contiguous for a single allocation to succeed.
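The situation can be sketched with a toy allocation map in which `True` marks an allocated fragment. The list representation and the helper names are assumptions for illustration and do not reflect the on-disk JFS map format.

```python
def longest_free_run(allocation_map):
    """Length of the longest run of free (False) fragments in the map."""
    longest = current = 0
    for allocated in allocation_map:
        current = 0 if allocated else current + 1
        longest = max(longest, current)
    return longest


def can_allocate(allocation_map, fragments_needed):
    """A single allocation succeeds only if enough contiguous fragments are free."""
    return longest_free_run(allocation_map) >= fragments_needed


# Eight 512-byte fragments; four 500-byte files occupy every other fragment.
area = [True, False, True, False, True, False, True, False]

print(area.count(False))        # four fragments (2048 bytes) are free in total
print(can_allocate(area, 4))    # but a 2048-byte extension cannot be satisfied
print(can_allocate(area, 1))    # a single fragment can still be allocated
```

The total free space is sufficient, yet the longest free run is only one fragment, so the four-fragment request fails.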

It is quite possible for a file system using a fragment size smaller than 4KB, particularly 512 bytes, to reach high levels of free space fragmentation.

Fragment Allocation Map

The fragment allocation map, used to hold information about the state of each fragment for each file system, is held on the disk and in virtual memory. The use of smaller fragment sizes in file systems results in an increase in the length of these maps and therefore requires more resources to hold.
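To give a feel for the growth, the sketch below counts map entries for a hypothetical 1GB file system. It assumes one map entry per fragment, which follows from the description above but says nothing about how large each entry actually is; `fragment_map_entries` is an illustrative name.

```python
def fragment_map_entries(fs_bytes, fragment_size):
    """Number of fragments, and so of map entries, assuming one entry per fragment."""
    return fs_bytes // fragment_size


ONE_GB = 1024 ** 3   # hypothetical file system size, chosen only for illustration

for size in (4096, 2048, 1024, 512):
    print(f"fragment size {size:4d}: {fragment_map_entries(ONE_GB, size)} entries")
```

Moving from 4KB to 512 byte fragments multiplies the number of entries by eight.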

Compression

Compression, like fragmentation, is provided for journaled file systems in AIX Version 4 for the better utilization of available disk space. On average, compression saves disk space by about a factor of two. Unlike the JFS fragment support, which treats logical blocks of files and directories less than 32K bytes in size differently from those that are larger, the JFS data compression support uses the same data compression technique for all logical blocks of files, irrespective of file size and fragment size. It does, however, require that the fragments used for a file's logical blocks are contiguous.

The obvious advantage of using data compression to supplement fragmentation is that there is no restriction on file size, so compression is effective for both small and large files.

AIX Version 4 JFS data compression is supplemental to JFS fragment support, and requires the installation of the data compression software package.

Only regular files and long symbolic links can be compressed in file systems supporting compression. The fragment sizes supported in a compressed file system are 512, 1024 and 2048 bytes only. A compressed file system cannot have a fragment size of 4KB.

The choice of the fragment size for compressed file systems must be made after evaluating the size of files to be stored and the amount of compression that is actually achieved. Where high amounts of compression are possible, higher disk space savings can be achieved by using a small fragment size like 512 bytes. However, at the same time performance will degrade quite substantially.

Implementation of Data Compression

In AIX Version 4, data is compressed at the level of the individual logical blocks of a file. Compressing data in larger units (all the logical blocks of a file together, for example) would result in the loss of more available disk space. By individually compressing a file's logical blocks, random seeks and updates are carried out much more rapidly.

After a logical block of a file is compressed, the compressed logical block is written to disk using only the number of fragments required for it. After compression, it is likely that the data in the file's logical block will occupy less than 4K bytes of disk space. However, if the data does not compress, it is written to disk in uncompressed format and allocated the full 4KB of contiguous fragments.

When a file or directory's logical block is first modified, 4KB of disk space are allocated to it to guarantee that the write to disk of that logical block will be successful. If this allocation fails, then an appropriate system error message is returned.

In addition to increased disk I/O activity and free space fragmentation problems, file systems using data compression have the following performance considerations:

  1. Degradation in file system usability arising as a direct result of the data compression/decompression activity. If the time to compress and decompress data is quite lengthy, it may not always be possible to use a compressed file system, particularly in a busy commercial environment where data needs to be available immediately.
  2. All logical blocks in a compressed file system, when modified for the first time, will be allocated 4096 bytes of disk space, and this space will subsequently be reallocated when the logical block is written to disk. Performance costs are therefore associated with this allocation, which does not occur in non-compressed file systems.
  3. In order to perform data compression, approximately 50 CPU cycles per byte are required, and about 10 CPU cycles per byte for decompression. Data compression therefore places a load on the processor by increasing the number of processor cycles.
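The cycle counts quoted above translate into processor time as sketched below. The 100 MHz clock rate is a hypothetical figure chosen only to make the arithmetic concrete, and the estimate ignores everything except the per-byte cycle cost.

```python
COMPRESS_CYCLES_PER_BYTE = 50     # approximate figure quoted in the text
DECOMPRESS_CYCLES_PER_BYTE = 10   # approximate figure quoted in the text


def cpu_seconds(n_bytes, cycles_per_byte, clock_hz):
    """Estimated processor time to handle n_bytes at the given clock rate."""
    return n_bytes * cycles_per_byte / clock_hz


CLOCK_HZ = 100_000_000   # hypothetical 100 MHz processor
ONE_MB = 1024 * 1024

print(cpu_seconds(ONE_MB, COMPRESS_CYCLES_PER_BYTE, CLOCK_HZ))
print(cpu_seconds(ONE_MB, DECOMPRESS_CYCLES_PER_BYTE, CLOCK_HZ))
```

On these assumptions, compressing 1MB costs roughly half a second of processor time, about five times the cost of decompressing it.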

Compression Algorithm

An IBM version of the Lempel-Ziv (LZ) algorithm is used to perform data compression. The LZ algorithm compresses data by representing the second and subsequent occurrences of a given string with a pointer that identifies the position of the first occurrence of the string, together with its length. At the start of the compression process, the first byte of data is represented as a raw character using a pointer-byte pair (0, byte). The algorithm then processes a fixed amount of data, say N bytes, for compression. Normally, the value of N is one of 512, 1024 or 2048. Every time a string in the N bytes is repeated, it is replaced by a pointer-length pair as described above. After compressing N bytes of data, the algorithm searches the N bytes previously compressed for the string starting at the next unprocessed byte. If the longest match found has a length of zero or one, the first byte of the unprocessed string is represented as a raw character, as mentioned previously. If, on the other hand, the length of the matching string is greater than one, the algorithm represents the string using a pointer-length pair and continues to process a further N bytes of data starting from that string.
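The general idea can be shown with a minimal LZ77-style sketch. This is not IBM's implementation: the token layout, the brute-force search, and the use of a backward distance as the "pointer" are all simplifying assumptions, and `window` plays the role of N.

```python
def lz_compress(data, window=512):
    """Encode bytes as (0, byte) raw tokens and (distance, length) match tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the previously processed bytes (at most `window` of them).
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and j + length < i   # match only against processed bytes
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len > 1:
            tokens.append((best_dist, best_len))   # pointer-length pair
            i += best_len
        else:
            tokens.append((0, data[i]))            # raw character
            i += 1
    return tokens


def lz_decompress(tokens):
    """Rebuild the original bytes from the token stream."""
    out = bytearray()
    for pointer, value in tokens:
        if pointer == 0:
            out.append(value)
        else:
            start = len(out) - pointer
            out.extend(out[start:start + value])
    return bytes(out)


sample = b"abcabcabcabc"
encoded = lz_compress(sample)
print(encoded)
print(lz_decompress(encoded) == sample)
```

A pointer of zero can never be a valid match distance, so it safely marks a raw character, mirroring the (0, byte) convention described above.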

Disk Striping

AIX allows the placement of logical volumes on a specific area of one or more physical volumes. For example, the center of the disk may be chosen for the placement of logical volumes when rapid access to data is required. Even though this placement strategy can provide fast access to data, it is still restricted by the fact that a disk I/O operation is performed to retrieve each data block.

In Part 1 of Figure - Striping Example, the numbered disk blocks for the file represent the sequence of the data in the file. To read the entire file sequentially will involve reading each disk block in turn.


Figure: Striping Example

However, if we place the data in a logical volume over all available disks in a specific manner to enable parallel access to that data then this would further improve sequential access to that data (see Figure - Striping Example).

In user environments where sequential access to large data files is very frequent, this technique will prove extremely efficient. In fact, AIX Version 4 provides for this technique with a mechanism known as striping.

In non-striped logical volumes, data is accessed using addresses to data blocks within physical partitions. In a striped logical volume, data is accessed using addresses to stripe units. Consecutive stripe units are created on different physical volumes. A single stripe consists of a stripe unit on each physical volume. The size of a stripe unit must be specified at creation time and can be any power of 2 in the range 4K to 128K bytes. As data in a striped logical volume is no longer accessed using data block addresses, the LVM will track which blocks on which physical drives actually hold the data being accessed. If the data being accessed resides on more than one physical volume, the appropriate number of simultaneous disk I/O operations will be scheduled for all drives concerned.
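The mapping from a position in the logical volume to a stripe unit can be sketched as below. It assumes that stripe units are laid out round-robin across the physical volumes, which is one reading of "consecutive stripe units are created on different physical volumes"; the function names are illustrative and are not LVM interfaces.

```python
def valid_stripe_unit(size):
    """True if size is a power of 2 in the range 4K to 128K bytes."""
    return 4096 <= size <= 131072 and size & (size - 1) == 0


def locate(offset, stripe_unit_size, num_disks):
    """Map a byte offset to (disk index, stripe number, offset within the unit)."""
    if not valid_stripe_unit(stripe_unit_size):
        raise ValueError("stripe unit must be a power of 2 from 4K to 128K")
    unit_index, within = divmod(offset, stripe_unit_size)
    stripe, disk = divmod(unit_index, num_disks)   # round-robin assumption
    return disk, stripe, within


UNIT = 64 * 1024   # example stripe unit size
for off in (0, UNIT, 2 * UNIT, 3 * UNIT + 10):
    print(off, locate(off, UNIT, 3))
```

With three disks, the first three stripe units fall on three different disks, so a sequential read of that much data can be scheduled as three simultaneous I/O operations.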

Usage Implications

Disk striping can provide very high-performance access to large sequential files. However, to get optimal performance for sequential I/O, there should be little or no other I/O activity on the physical volumes.

To make the most efficient use of striped logical volumes, some operating system parameters must be tuned and application requirements for memory must also be minimized. Results of a benchmark comparing the relative performance of striped logical volumes against non-striped logical volumes are provided in section Benchmark Results for an I/O Bound Test Using Striping.

The constraints imposed by striping of logical volumes are:

Using Page Space for System Dumps

A collection of one or more logical volumes used solely to hold data that is temporarily not required to be in real memory is known as paging space. A description of how paging actually works can be found in Page Space. Unlike non-paging logical volumes, which store data permanently, paging space offers no guarantee that data which previously resided in it will still be there after a computer is powered on or rebooted.

In AIX Version 4, paging space is additionally used as the primary dump device for system dumps. During installation of AIX Version 4, /dev/hd6 (the paging logical volume) is automatically configured as the primary dump device. However, for AIX systems being migrated to AIX Version 4, /dev/hd7 is retained as the primary dump device.

After a system dump to the paging space (primary dump device) has taken place, the system boot process has to move the dump data from this area to an appropriate area on the disk. This has to be carried out since the paging space will become re-activated and all data previously residing there is likely to become over-written. By default, the dump is copied to the directory /var/adm/ras. The sysdumpdev command now has an optional flag which can be used to specify a different directory for the dump to be copied to.

The advantages of using paging space for the primary dump device are:

  1. It makes better utilization of storage space by using an existing logical volume as opposed to reserving one specifically for this purpose. A dedicated dump device like /dev/hd7, as used in AIX Version 3, can be quite wasteful, particularly within a stable system.
  2. Since paging space is normally configured to be the same size as, or larger than, RAM, it should have sufficient space for a dump.
  3. The I/O operations, particularly when writing dump data to disk, are improved if the paging logical volumes are strategically placed for fast access to data, for example at the center of the disk and over as many physical volumes within the volume group as possible.

Variable I-nodes

In all UNIX implementations, when a file system is created, several data structures known as i-nodes are written to the disk. For each file or directory, one such data structure is used to describe information pertaining to it. The information stored in the i-node includes the file type, permissions, size, and user and group owner IDs. Other critical pieces of information held in the i-node are the disk addresses at which the file's data is stored.

AIX, like other UNIX implementations, reserves a number of i-nodes for files and directories in each file system that is created. In releases prior to AIX Version 4, an i-node is generated for every 4KB of disk space that is allocated to the file system being created.

In a 4MB file system this would result in 1024 i-nodes being generated. For earlier releases this figure would probably suffice, since a file or directory is allocated a minimum of 4KB of disk space anyway. In AIX Version 4, since disk space is allocated in fragments, allowing better utilization of disk space, 1024 i-nodes in a 4MB file system can quickly become exhausted if a large number of small files are written (assuming a fragment size of 512 bytes). In a file system created using a 512 byte fragment size, at most 8192 files can be written if the largest file size is 512 bytes.

AIX Version 4 JFS provides a parameter to tune the number of i-nodes generated at file system creation time. This parameter, better known as number-of-bytes-per-i-node (NBPI), can be any power of 2 in the range 512 through 16384 (examples include 512, 1024, 2048, 4096).
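The relationship between file system size, NBPI and the number of i-nodes can be sketched as follows. The helper `inode_count` is illustrative only and simply divides the file system size by NBPI, as the description of one i-node per NBPI bytes implies.

```python
VALID_NBPI = [512 << k for k in range(6)]   # 512, 1024, ..., 16384


def inode_count(fs_bytes, nbpi):
    """Number of i-nodes generated when one is created per nbpi bytes."""
    if nbpi not in VALID_NBPI:
        raise ValueError("NBPI must be a power of 2 from 512 through 16384")
    return fs_bytes // nbpi


FOUR_MB = 4 * 1024 * 1024

print(inode_count(FOUR_MB, 4096))   # the fixed ratio of earlier releases
print(inode_count(FOUR_MB, 512))    # enough for one i-node per 512-byte file
```

For the 4MB example, an NBPI of 4096 yields the 1024 i-nodes of earlier releases, while an NBPI of 512 yields 8192, matching the maximum number of 512 byte files.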

When NBPI is used in conjunction with the fragment size it can allow better storage management, particularly when it is known beforehand the number and size of files to be stored in the file system. See File Systems for more information on i-nodes.

File System Maximum Size Increase

In releases of AIX prior to Version 4, the maximum size a journaled file system or logical volume can grow to is 2GB. The limitation is due to the usage of a 32 bit signed integer value giving maximum file system addressability of 2GB: 2 raised to the power of 31, the most significant bit giving the sign. With the growing needs of commercial and scientific environments, this limit can be reached quite quickly. In fact it is now becoming more commonplace for database application environments to need to access larger volumes of data.

In AIX Version 4, the maximum size a journaled file system or logical volume can grow to is 256GB. This is now possible since a 64 bit variable is used for the pointer.

However, note that the maximum file size is still limited to 2GB. This is because no change has been implemented to the i-node structures used to reference the data blocks. See File Systems for more information on file systems.

JFS Log Considerations

For journaled file systems, a transaction log is maintained which provides file system recovery in case the system abnormally terminates. One JFS log, with a default size of one logical partition, maintains log data for all the file systems within a volume group. For file systems that are no larger than 2GB, the default log size is sufficient. However, for file systems that are larger than 2GB, it may be necessary for the log size to be increased proportionately.

Summary

This chapter covers the latest enhancements to storage management made available in AIX Version 4. These enhancements include: