June 9, 2011, 3:51 a.m.
posted by vdv
Overview of Windows Server 2003 File Systems
A file system is a little like a commercial real estate agent. It acts as a broker between a lessor who has space available and a lessee who wants that space. In the case of a file system, the storage system determines what space is available, and applications are the lessees that want a piece of that space.
As administrators, we need to know enough about the elements of a file system transaction so we can spec out our storage needs and anticipate where problems might occur. This means we need to know details about certain disk structures that support file system operations:
To see how a file system turns raw storage into a data repository, we need to know a little about the structures that hold critical information. They are as follows:
From an operational perspective, we need to know what each file system can do, what it can't do, and what to use as criteria when choosing between them. Let's start with storage details. (See the following sidebar, "More Information About File Systems.")
Sectors and Clusters
A hard drive stores data in concentric tracks that are divided into addressable units called sectors (see Figure). In most Western drives, a sector contains 512 bytes.
When a file system asks for data from a storage driver, it must specify the location of that data in relation to the start of the volume. The storage driver then works with the device controller to move the drive heads to the designated location, pick up the required information (plus a little extra for the cache), buffer the information, and deliver it to the file system driver.
A sector is the smallest addressable unit on a drive. Ideally, a file system would assign an address to every sector in a partition. This yields the best utilization, because any space left over between the end of a file and the end of the last sector holding the file is wasted.
At some point, though, a volume may contain so many sectors that the cost of maintaining addresses for all of them starts to become a burden and performance goes down. To improve performance, the file system clumps individual sectors into allocation units, or clusters.
A cluster contains a whole number of sectors, and the number of bytes it holds is called the cluster size. Cluster sizes come in increasing powers of 2: 512 bytes, 1K, 2K, 4K, 8K, 16K, 32K, and 64K. The maximum cluster size supported by any file system that ships with Windows Server 2003 is 64K.
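To put rough numbers on the addressing burden, here is a minimal Python sketch that counts how many allocation units a file system would have to track at each cluster size. The 64GB volume is a hypothetical example, not a figure from the text, and the helper name is made up for illustration.

```python
KB = 1024
GB = 1024 ** 3
SECTOR_BYTES = 512

# Cluster sizes from one sector (512 bytes) up to 64K, doubling at each step.
CLUSTER_SIZES = [SECTOR_BYTES << n for n in range(8)]

def allocation_units(volume_bytes, cluster_bytes):
    """Number of whole clusters the file system must address on a volume."""
    return volume_bytes // cluster_bytes

volume = 64 * GB  # hypothetical volume size, chosen only for illustration
for size in CLUSTER_SIZES:
    sectors = size // SECTOR_BYTES
    print(f"{size:>6} bytes/cluster ({sectors:>3} sectors): "
          f"{allocation_units(volume, size):>11,} clusters")
```

Each doubling of the cluster size halves the number of addresses the file system has to maintain.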
If the end of a file does not completely fill its assigned cluster, the excess space is wasted. Windows does not provide sub-allocation of sectors within a cluster. This means cluster size has a direct impact on disk utilization. For example, I've seen instances where nearly 25 percent of the available space on a volume was reclaimed by converting a large, heavily loaded FAT volume formatted with 32K clusters into an NTFS volume with 512-byte clusters.
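The wasted space at the end of a file is easy to estimate. The following sketch assumes the simple model described above, where every file gets a whole number of clusters; the 700-byte file is a hypothetical example and the function names are illustrative only.

```python
def allocated_bytes(file_size, cluster_size):
    """Bytes a file occupies on disk when space is handed out in whole clusters."""
    clusters = (file_size + cluster_size - 1) // cluster_size  # round up
    return clusters * cluster_size

def slack_bytes(file_size, cluster_size):
    """Space between the end of the file and the end of its last cluster."""
    return allocated_bytes(file_size, cluster_size) - file_size

# Compare a hypothetical 700-byte file on small and large clusters.
for cluster_size in (512, 32 * 1024):
    print(f"{cluster_size:>5}-byte clusters: "
          f"{slack_bytes(700, cluster_size)} bytes wasted")
```

Multiply the per-file waste by a few hundred thousand small files and the effect of cluster size on utilization becomes obvious.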
It is beneficial in some instances to match cluster size to average file size. A volume that holds hundreds of thousands of small files should have a small cluster size. That seems obvious. But a volume that holds a few, very large files (database files, for example) can benefit from the improved efficiencies of large cluster sizes. For the most part, though, letting Windows decide on a cluster size when formatting a volume usually yields optimal performance.
Changing cluster sizes requires reformatting. If you decide to increase the cluster size on a big array for a database server, you'll need to back up your data, reformat the array with a different cluster size, and then restore the data from tape.
Each of the three file systems in Windows Server 2003 uses 512-byte clusters up to a certain volume size. Beyond that, behavior differs. For FAT, cluster size doubles each time volume size doubles. FAT32 and NTFS keep cluster sizes at 4K for as long as possible. Figure lists the default cluster sizes for each file system based on volume size.
The 4K plateau on NTFS cluster sizes is there because the compression API does not work with cluster sizes above 4K. FAT32 is seen as an intermediate stage prior to converting to NTFS, so FAT32 cluster sizes are also constrained to 4K as long as possible.
Historically, the defragmentation API was another reason for limiting the maximum NTFS cluster size to 4K. As we'll see in the "Defragmentation" section, Windows Server 2003 and XP now permit defragmenting volumes with cluster sizes above 4K.
FAT File System Structure
Figure shows the layout of the first few sectors on a disk that is formatted with FAT. The partition boot sector has an entry that identifies the format type and the location of the FAT and the mirrored FAT. Ordinarily, the FAT is located near the front of the disk to benefit from the fast read times there. (Tracks at the outside of a disk have a higher linear velocity.)
The Fastfat.sys driver in Windows Server 2003 supports three cluster numbering schemes: FAT12, FAT16, and FAT32.
There is also a special version of FAT32 called FAT32x used by Windows 9x and ME when formatting drives larger than 8GB. FAT32x overcomes a limitation of traditional Cylinder/Head/Sector translation by forcing the operating system to use Logical Block Addressing (LBA), which assigns a number to each available sector reported by the drive. LBA ups the ante for partition sizes to whatever the file system can handle. For FAT32, that limit is 2TB (terabytes).
FAT32x volumes store their FAT tables at the end of the volume rather than the beginning. This signals the operating system to use LBA. Windows Server 2003 can read a FAT32x volume but does not use FAT32x formatting because the Fastfat driver always uses LBA unless specifically told not to. If you upgrade a Windows 9x/ME system to XP, there is no indication in any of the command-line utilities or in the Disk Management console that a volume is formatted as FAT32x rather than FAT32. For more information, visit www.win98private.net/fat32x.htm.
Location and Use of FAT Disk Structures
FAT12 and FAT16 file systems require the FAT to start at the first available sector following the boot sector. MBR partitions generally have a few hidden sectors between the partition boot sector and the start of the file system. These hidden sectors are often used for inscrutable purposes. GPT disks used on IA64 systems have no hidden sectors.
The mirrored copy of the FAT must follow immediately after the primary FAT. This fixed location of the FAT tables in FAT12 and FAT16 is a weakness. A failed sector can make the file system inaccessible. This weakness is overcome by FAT32, which can locate the FAT anywhere, although it is generally still located at the front of the disk right after the boot sector.
The first entry in the FAT represents the root directory of the partition. The root directory is special because it is exactly 32 sectors long, enough room for 512 entries. For this reason, you can only put about 500 files and directories at the root of a FAT partition. As you are no doubt aware, FAT and FAT32 support long filenames by robbing directory entries. This can quickly absorb many directory entries if you use long filenames in the root directory, greatly limiting the total number of files and folders you can store at root.
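The arithmetic behind the root directory limit can be sketched as follows. The 32-byte directory entry size follows from the figures above (32 sectors of 512 bytes holding 512 entries). The 13-characters-per-entry figure is an assumption taken from the VFAT long filename layout rather than from this text, and the sketch assumes every name needs long filename entries.

```python
SECTOR_BYTES = 512
DIR_ENTRY_BYTES = 32        # 32 sectors * 512 bytes / 512 entries
LFN_CHARS_PER_ENTRY = 13    # assumption: VFAT long-name entries hold 13 characters

def root_dir_capacity(sectors=32):
    """Directory entries that fit in a fixed-size FAT root directory."""
    return sectors * SECTOR_BYTES // DIR_ENTRY_BYTES

def entries_for_long_name(name):
    """One short (8.3) entry plus enough long-name entries to hold the name."""
    long_entries = (len(name) + LFN_CHARS_PER_ENTRY - 1) // LFN_CHARS_PER_ENTRY
    return 1 + long_entries

capacity = root_dir_capacity()
name = "x" * 32  # hypothetical 32-character filename
per_file = entries_for_long_name(name)
print(f"{capacity} entries, {per_file} per file, "
      f"room for {capacity // per_file} such files at root")
```

Under these assumptions, 32-character names cut the root directory's capacity to a quarter of its nominal 512 entries.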
The size of the FAT table itself is determined by the number of clusters in the partition. The larger the partition, the more entries are needed in the FAT. For example, the FAT on a 2GB partition with the default 32K cluster size would take up 128K of disk space. The FAT mirror would also use 128K. This is a pretty efficient use of disk space when compared with FAT32 and NTFS, but the price is reduced reliability, slower performance for some drive operations, performance degradation in the face of even minimal fragmentation, and the lack of security and journaling features.
The FAT is actually just a big cluster map where each cluster in the partition is represented by a 16-bit (2-byte) entry. If there are 65,535 clusters in the volume, there would be 65,535 2-byte entries in the FAT. (FAT12 packs the bits for storage economy.) Figure shows the layout of a few cluster entries in a FAT16 table.
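The 128K figure for a 2GB partition follows directly from the 2-byte entry size. Here is a minimal sketch of the calculation; it ignores the reserved entries and any rounding up to whole sectors, and the function name is illustrative.

```python
KB = 1024
GB = 1024 ** 3

def fat16_table_bytes(volume_bytes, cluster_bytes, entry_bytes=2):
    """Approximate size of one copy of a FAT16 table: one entry per cluster."""
    clusters = volume_bytes // cluster_bytes
    return clusters * entry_bytes

size = fat16_table_bytes(2 * GB, 32 * KB)
print(f"One FAT copy: {size // KB}K; with the mirror: {2 * size // KB}K")
```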
The first two FAT entries are reserved and represent the root of the partition. The next entry represents the first file or folder created in the volume. In the diagram, this cluster is empty because the original file or folder has been deleted.
An empty cluster has a value of 0000. When you create a file or folder, the file system selects an empty cluster and writes data to the disk in that location. If the file or folder spills over into another cluster, the value of the FAT entry for the first cluster contains the number of the next cluster. This is called a cluster chain. The final cluster assigned to a file is identified with an end-of-file marker, FFFF.
Ideally, each cluster used by a file comes directly after the preceding cluster on the disk. Such a file is said to be contiguous. When you delete a file or folder, the FAT entry is set to 0000. This indicates that the cluster is available. As you add and delete files, the file system reuses empty clusters. This results in fragmentation.
Figure shows a portion of the FAT with a fragmented file. The cluster map shows a file that starts at cluster location 08. The file is too big to fit in one cluster and the next contiguous cluster already has a file in it. The file system driver selected the next available empty cluster and continued the file from there. Remember that the number in a FAT entry points at the next cluster in the chain, not the current cluster.
When the Fastfat driver delivers a file to the I/O Manager in the Windows Server 2003 Executive, it must "walk the chain" of FAT entries to locate all the cluster numbers associated with the file. The drive head must then travel out across the disk and buffer up the data, put it in order, then spool off the results to the guy upstairs. If the files and folders in the partition are heavily fragmented, it takes much more effort from the disk subsystem to collect the clusters. This impacts performance.
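Walking the chain can be sketched in a few lines of Python. This is a toy model, not the Fastfat implementation: the FAT is represented as a dictionary, the cluster values are hypothetical, and a single FFFF end-of-file marker is used as in the description above (real FAT16 volumes accept a small range of end-of-chain values).

```python
FREE = 0x0000
END_OF_FILE = 0xFFFF

def walk_chain(fat, first_cluster):
    """Follow FAT entries from a file's first cluster to its end-of-file marker."""
    chain = []
    seen = set()
    cluster = first_cluster
    while True:
        if cluster in seen:
            raise ValueError("cluster chain loops back on itself")
        seen.add(cluster)
        chain.append(cluster)
        next_cluster = fat[cluster]  # each entry points at the NEXT cluster
        if next_cluster == END_OF_FILE:
            return chain
        if next_cluster == FREE:
            raise ValueError("cluster chain runs into a free cluster")
        cluster = next_cluster

# Hypothetical table: a file starts at 0x08, but 0x09 and 0x0A belong to
# other files, so the file continues at 0x0B and 0x0C.
fat = {0x08: 0x0B, 0x09: END_OF_FILE, 0x0A: END_OF_FILE,
       0x0B: 0x0C, 0x0C: END_OF_FILE}

chain = walk_chain(fat, 0x08)
print([f"{c:02X}" for c in chain])
```

The gap between 08 and 0B in the result is the fragmentation: the drive heads have to skip over clusters that belong to other files.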
As we'll see, Windows Server 2003 and Windows 2000 have a built-in defragmentation utility that can put the FAT and the associated disk clusters back in apple-pie order. The same defragger is supplied in Windows Server 2003 and XP. You can schedule defragmentation in Windows Server 2003, something that required a third-party utility in Windows 2000.
The file system cannot locate a file by its name simply by looking at the cluster map in the FAT. Finding a particular file by its name requires an index that shows the filename and the number of the first cluster in the file as listed in the FAT. That index is called a directory or a folder.
In addition to filenames, a FAT directory entry contains a single byte that defines the file's attributes (Read-Only, Hidden, System, and Archive) and a timestamp to show when the file was created. Directory entries are placed into a disk cluster just as if they were files. Figure shows a diagram of disk clusters that contains a set of directory entries.
A directory entry can become fragmented like the example in the figure when the number of name entries exceeds the size of the cluster. This is another reason large FAT volumes need large cluster sizes.
If you add several files to a directory and the directory entry cannot grow into a contiguous cluster, the directory becomes fragmented. This significantly degrades performance. You can imagine the kind of work it takes for the file system to assemble fragmented directories and their linked files to display the results in Explorer.
FAT Partition and File Sizes
Due to real-mode memory limits, the cluster size for a FAT partition under DOS and Windows 9x is limited to a power of 2 less than the address limit. Therefore, the maximum cluster size is 2^15 bytes, or 32KB. The maximum size of a FAT partition under DOS and Windows 95, then, is 65,535 * 32KB, or about 2GB.
FAT under Windows Server 2003 (which boots into protected mode before the Fastfat file system driver loads) has a cluster size limit of 2^16 bytes, or 64KB per cluster.
FAT supports 2^16 clusters on a volume but it reserves 2 clusters for special use, leaving 65,534 clusters for storing files and folders. In practice, it is nearly impossible to select a partition size that would yield exactly the maximum size for a given cylinder alignment. Gaining a few clusters of theoretical capacity is not worth the effort.
The maximum partition size of a FAT partition on Windows Server 2003 (and any member of the NT family) is about 4GB (65,534 clusters at 64K per cluster). DOS and Windows 9x cannot access a partition with a cluster size larger than 32K, so avoid 64K clusters when running a dual-boot machine.
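Both partition limits come from multiplying a cluster count by a cluster size. The sketch below simply reproduces the arithmetic using the cluster counts quoted above; the variable names are illustrative.

```python
KB = 1024
GB = 1024 ** 3

# Cluster counts and maximum cluster sizes as quoted in the text.
dos_limit = 65535 * 32 * KB   # DOS and Windows 9x, 32K clusters
nt_limit = 65534 * 64 * KB    # NT family, 64K clusters

print(f"DOS/9x FAT limit:    {dos_limit:,} bytes ({dos_limit / GB:.3f} GB)")
print(f"NT family FAT limit: {nt_limit:,} bytes ({nt_limit / GB:.3f} GB)")
```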
FAT file sizes are specified by a value in the directory entry. This value uses a 32-bit word, so file sizes are limited to 2^32 bytes, or 4GB. The actual size is one byte shy of a full 4GB, or 4294967295 bytes, because a 32-bit word filled with 1s would be 0xFFFFFFFF.
You can verify this experimentally using Fsutil from the Support Tools. This utility permits you to create files of any length down to the nearest byte. In the experiment, create a FAT32 partition comfortably larger than 4GB. Issue the following command at the root of the partition (the filename, testfile.bin, is arbitrary):
fsutil file createnew testfile.bin 4294967296
You will get an error saying that insufficient disk space exists. Subtract one from the size and the file will be created with no errors.
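The 32-bit size field accounts for both outcomes of the experiment. As a rough illustration, the following sketch uses Python's struct module to pack a length into an unsigned 32-bit little-endian field; the helper name is made up, and this models only the size field, not how Fsutil or Fastfat report the error.

```python
import struct

MAX_FAT_FILE_BYTES = 0xFFFFFFFF  # a 32-bit word filled with 1s

def fits_in_size_field(length):
    """True if a file length can be stored in an unsigned 32-bit field."""
    try:
        struct.pack("<I", length)  # "<I" = little-endian unsigned 32-bit integer
        return True
    except struct.error:
        return False

for length in (4294967295, 4294967296):
    print(length, "fits" if fits_in_size_field(length) else "does not fit")
```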
FAT32 was introduced to overcome some of the more glaring deficiencies in FAT. The most significant difference is the FAT32 cluster map, which uses 32-bit words to identify clusters rather than 16-bit words. This significantly increases the number of clusters that can be addressed. The first 4 bits of each 32-bit cluster address are reserved, so the maximum number of clusters is 2^28. Coupled with the maximum FAT32 cluster size under Windows Server 2003 of 64K, this yields a theoretical volume size of 16TB (2^28 clusters at 2^16 bytes each).
Now for practicalities. The size of any MBR-based disk partition is defined by a Volume Size value in the partition table. This value specifies the number of sectors assigned to the partition without regard to the file system that formats the partition. This Volume Size value is a 32-bit word, so the maximum size of an MBR-based partition is 2^32 sectors, or 2TB (terabytes).
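The MBR limit is a one-line calculation once you assume the 512-byte sector size mentioned earlier. A minimal sketch, with an illustrative function name:

```python
TB = 1024 ** 4

def mbr_partition_limit(sector_bytes=512, size_field_bits=32):
    """Largest partition an MBR partition table entry can describe, in bytes."""
    max_sectors = 2 ** size_field_bits
    return max_sectors * sector_bytes

limit = mbr_partition_limit()
print(f"{limit:,} bytes = {limit // TB}TB")
```

The file system never enters into it, which is why the same ceiling applies to FAT32 and NTFS partitions on MBR disks.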
If you use Dynamic disks in an IA32 system or GPT disks in an IA64 system, you avoid this partition table limit and a volume can grow to its theoretical limit. However, Windows Server 2003 will refuse to format a FAT32 volume larger than 32GB.
You can format a FAT32 volume under Windows 98 or ME to a size larger than 32GB and put it in a Windows Server 2003 machine, and the Fastfat driver can read and write to it. The maximum practical FAT32 volume size under ME is limited to 2TB because of the Volume Size entry in the boot sector.
FAT and FAT32 Weaknesses
FAT and its cousin, FAT32, are kind of like a vaudeville act. They're notable for their longevity more than any remaining entertainment value they might have in them. Here are their primary weaknesses, most of which are corrected by NTFS:
NTFS Cluster Addressing and Sizes
NTFS sets aside a 64-bit word for cluster numbering, but all implementations of NTFS limit the address to the first 32 bits. At the maximum cluster size of 64K, this yields a maximum partition size of 256TB (2^32 clusters at 2^16 bytes each). The partition size limit for MBR disks remains at 2TB, even under NTFS, thanks to the maximum size specified in the partition table. This limit can be overcome with dynamic disks or by using GPT disks on IA64 machines.
The maximum default cluster size for NTFS is 4K, but if you have no need for compression, you can select a larger cluster size when formatting a partition.
The maximum NTFS file size is artificially constrained from the theoretical maximum of 2^64 bytes to an actual maximum of 2^44 bytes, or about 16TB. This was considered an outrageously large file when NTFS was first developed, but if you extrapolate out the growth of current storage solutions, it won't be long before some high-end Intel-based servers start to nibble at that 16TB limit. Microsoft has not stated what its strategy is for these types of files, but rumor has it that a future successor to Windows Server 2003 might sport a new file system capable of handling humongous files.
The MFT consists of a set of fixed-length records, 1KB apiece. Each record holds a set of attributes that, taken together, uniquely identify the location and contents of a corresponding file, folder, or file system component. (There are a few minor exceptions to this "one record, one file" rule, but they are encountered only when a file gets very large and very fragmented and are not typically a concern.)
Figure shows the hierarchy of the metadata records, if you think of them as representing files and folders. The record names start with a dollar sign, $.
Metadata records are not exposed to the UI or to the command line. You can see the space set aside for them by running CHKDSK. Following is a sample listing. The entries for the system and the log file show the space taken up by the metadata files:
635008 kilobytes total disk space.
535691 kilobytes in 8206 user files.
1894 kilobytes in 592 indexes.
14176 kilobytes in use by the system.
7647 kilobytes in use by the log file.
83247 kilobytes available on disk.
As you can see, NTFS takes a significant chunk out of a small volume. On volumes in excess of 10GB or so, though, the percentage of total space consumed by NTFS is smaller than FAT32 when cluster sizes are equal.
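To see how significant the chunk is, here is a small sketch that works out the percentages from the sample listing. The figures are copied from the CHKDSK output above; the dictionary keys and function name are illustrative. The four categories other than the log file add up exactly to the total, which suggests the log file figure is already counted inside the system figure.

```python
# Figures, in kilobytes, copied from the sample CHKDSK listing.
chkdsk_kb = {
    "total": 635008,
    "user files": 535691,
    "indexes": 1894,
    "system": 14176,
    "log file": 7647,
    "available": 83247,
}

def share(part_kb, total_kb):
    """Percentage of the volume taken up by one category."""
    return 100.0 * part_kb / total_kb

accounted = sum(chkdsk_kb[key]
                for key in ("user files", "indexes", "system", "available"))
print("Categories sum to total:", accounted == chkdsk_kb["total"])

for key in ("indexes", "system", "log file"):
    print(f"{key:>10}: {share(chkdsk_kb[key], chkdsk_kb['total']):.2f}% of volume")
```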
The order and structure of the metadata records are rigorously controlled because of the impact their location has on performance and functionality. For example, the MFT contains a record representing the partition boot sector. The secondary bootstrap loader, Ntldr, loads the first few metadata records in the MFT and uses that information to locate and read the boot sector so it can mount the rest of the file system.
The MFT is a file, just like any other file. The first record in the MFT, then, is a file record representing the MFT itself. This is the $MFT metadata record. If the sector holding the $MFT record gets damaged or is otherwise unreadable, the operating system would fail to start. To prevent this from happening, the first four MFT records ($MFT, $MFTMirr, $LogFile, and $Volume) are copied to the middle of the volume. If Ntldr cannot open these records in their normal location, it loads the mirrored copies and uses information in the $MFTMirr record to learn the contents of the damaged sector so it can mount the file system.
NTFS 1.2, the version used in NT4, used the first 16 records of the MFT to hold metadata. Only the first 11 records actually contained any information, with the rest reserved for future use. NTFS 3.0 and later set aside the first 26 records for metadata and use 15 of them.
You may encounter the names of these metadata records in a variety of scenarios. They appear most often in error messages, especially when the file system gets gravely ill. For the most part, though, knowing their names and functions is like knowing the names of the bones in your body. It helps you pinpoint problems where otherwise you would only be able to give vague references.
Here is a quick list of the metadata records and their functions, with more detail offered later in the chapter:
The "Dirty" Flag
NTFS keeps a lot of data in cache waiting for convenient times to commit changes to disk. If you interrupt power (or if the system locks up), there is a possibility that some critical data might not have been saved. This missing data could conceivably compromise the file system.
When there are uncommitted pages in memory, NTFS toggles a flag in the $Volume record. This is commonly called the "dirty" flag. If the flag is set, the disk and the cache are not in a consistent state. When all uncommitted pages have been flushed to disk, NTFS toggles the dirty flag off.
If you start a system following a catastrophic event and the dirty flag is set, the system pauses before initializing the operating system and runs a special boot-time instance of CHKDSK called AUTOCHK. You can watch the results of this file system check in the console window. If the system finds errors, you'll be warned in the console and in the Event log, assuming that the system gets to the point where you can log on.
Bad Cluster Mapping
The cluster containing the bad sector is marked as bad by entering its Logical Cluster Number in the $BadClus metadata record. Each entry in $BadClus takes the form of a named data stream with a pointer to the bad cluster. In essence, the system blocks access to the cluster by assigning a file to it.
Bad cluster mapping is not required for SCSI drives, which are intrinsically self-repairing. If a sector goes bad (cannot be written) on a SCSI drive, the drive controller maps the sector address to one of a set of spare sectors set aside for this purpose. This sector sparing feature helps to keep the drive functioning normally as it ages. If NTFS is unable to write to a cluster on a SCSI drive, it first waits for the SCSI drive to handle the situation via sector sparing. If the drive runs out of spare sectors, NTFS uses bad cluster mapping. For IDE, USB, and FireWire drives, NTFS uses bad cluster mapping exclusively.
NTFS and Removable Drives
You can format a removable drive such as a ZIP, Jaz, or Orb drive using NTFS, but only if the drive itself has been configured to require safe removal instead of permitting it to be jerked from the system. This option is a property of the device driver, and you can set it using the Properties page for the device. Select the Hardware tab, then highlight the removable storage device and click Properties to open the Properties window for the device. Select the Policies tab. Figure shows an example.
Figure. Properties window for a removable storage device showing the Policies tab where the write caching options are selected.
The Write Caching and Safe Removal field shows the caching setting for the device. The default setting is Optimize for Quick Removal. This disables write caching and permits you to snatch the drive from the system at any time. The Optimize for Performance option requires that you use the Safe Removal icon in the Notification area of the status bar to stop the device before removing it.
You're purely on the honor system, of course. If you jerk the USB or FireWire cable from the interface, how can the machine stop you? Still, this is a means of telling you to live by the rules. Write caching is disabled by default to prevent unnecessary data loss.
One of the advantages to merging the consumer and corporate code bases into Windows XP was that it forced Microsoft to confront performance issues head on, especially when it comes to disk I/O. Traditionally, NTFS has taken a back seat to FAT32 in raw performance. After all, with all the security and reliability overhead in NTFS, it's tough to compete against what is essentially a big table lookup engine.
To meet the performance expectations of FAT32 users while retaining the reliability of NTFS, Microsoft had to work on lots of little details. At a micro level, Microsoft reworked the MFT record headers to make them fit into even byte boundaries and reordered the information a little. This reduced the work necessary to read a record header. A small thing, to be sure, but the file system reads header information a bazillion times a day, so if you can shave a few clock ticks here and there, it adds up.
Also, the MFT was moved from its traditional spot near the start of the volume to a location about one-third of the way in from the start of the volume. This gets it out of the way of application files, which NTFS 3.1 jockeys to the start of the volume to take advantage of the higher throughput. The MFT mirror remains at the middle of the drive.
If you format a partition as NTFS during Setup, or convert a FAT/FAT32 partition formatted by Windows Server 2003 or XP, these two critical metadata records are placed in their proper location. If you upgrade from a previous version of Windows and then convert, this performance enhancement is not applied.
Application file placement plays a role in perceived and actual performance. Windows Server 2003 monitors the applications run by the system and by the users. Every three days, the system defrags critical files and places them at the prime real estate near the start of the volume. This is done during idle times, so don't be surprised if you see a lot of commotion on a hard drive in the evening.
NTFS File Journaling
NTFS caches a considerable amount of data in memory while waiting for the opportunity to commit it to disk. If you've ever had the unfortunate experience of powering down a machine unexpectedly, you know what it's like to face a long and tedious wait at restart while the system runs AUTOCHK.
There is a possibility that AUTOCHK might be unable to reconstruct the contents of critical metadata records. If this were to happen, the file system would be unmountable and you would be spending a very long day restoring the volume from tape.
To prevent this from happening, the metadata files are journaled. That is, any changes made to MFT records are first written to a log file at the center of the disk. Then, at some later time, the entries are transferred from the log file to the main MFT.
This transfer from the log file to the MFT is handled by the Log File Service, or LFS. Each transfer is done as an atomic transaction so that if it is interrupted during the transfer, records are not left in an inconsistent state.
Journaling the file system makes it possible to recover the MFT quickly following an unexpected loss of power or a system lockup. All that AUTOCHK needs to do is verify the integrity of the MFT then replay the uncommitted journal entries.
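The log-first, transfer-later idea can be illustrated with a toy model. This sketch is not the NTFS log file format or the LFS API; the class and method names are hypothetical, a dictionary stands in for the MFT, and a list stands in for the log file, which is assumed to survive a crash because it lives on disk.

```python
class JournaledTable:
    """Toy model: changes go to a log first, then to the main table."""

    def __init__(self):
        self.table = {}  # stands in for the MFT
        self.log = []    # stands in for the on-disk log file

    def write(self, record_id, value):
        # Step 1: record the intended change in the log only.
        self.log.append((record_id, value))

    def flush(self, limit=None):
        # Step 2: transfer logged changes to the table, oldest first.
        count = len(self.log) if limit is None else min(limit, len(self.log))
        for _ in range(count):
            record_id, value = self.log[0]
            self.table[record_id] = value
            self.log.pop(0)  # drop the entry only after it has been applied

    def recover(self):
        # After a crash, replay whatever is still sitting in the log.
        self.flush()

volume = JournaledTable()
volume.write(5, "a")
volume.write(6, "b")
volume.write(5, "c")
volume.flush(limit=1)   # simulate a crash after only one entry is transferred
print("after crash:   ", volume.table)
volume.recover()
print("after recovery:", volume.table)
```

Because each entry is removed from the log only after it has been applied, an interrupted transfer can be replayed without losing or half-applying a change.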