http://www.enterprisestorageforum.com/sans/features/article.php/3745996

> Can Linux file systems, which I will define as ext-4, XFS and xxx, match the
> performance of file systems on other UNIX-based large SMP servers such as IBM
> and Sun? Some might also inquire about SGI, but SGI has something called
> ProPack, which has a number of optimizations to Linux for high-speed I/O, and
> SGI also has their own proprietary Linux file system called CxFS, which is not
> part of standard Linux distributions. Because SGI ProPack and CxFS are not part
> of standard Linux distributions, we won't consider them here. We'll stick to
> standard Linux because that is what most people use.

Compare to [5]. ProPack is just a set of additional rpms for SLES. There are
no changes to the I/O stack; the kernel is the stock SLES kernel.

CXFS is a clustered extension to XFS, hooking in between the lower levels of
XFS and the VFS interface, and providing clustering but nothing else. The CXFS
metadata server runs a full copy of XFS, and the clients run a slightly
modified XFS I/O path while delegating metadata operations to the metadata
server. Any I/O performance and scalability achieved by CXFS must first be
over-achieved by plain XFS to offset the cluster overhead.

> There are a number of areas that limit performance in Linux, such as page size compared
> with other operating systems,

The Linux page size is limited by the underlying hardware. The largest page
size supported is 64KB, on 64-bit powerpc and IA64, which is equal to other
operating systems on these platforms.

> the restrictions Linux places on direct I/O and page alignment,

Direct I/O is sector aligned (typically 512 bytes for commodity storage).

> and the fact that Linux does not allow direct I/O automatically by request size

I'm not sure what this is supposed to mean, but requiring direct I/O to be
requested explicitly seems like a sane design decision, and if you disagree it
could be changed with a trivial LD_PRELOAD library (a sketch follows below).

> — I have seen Linux kernels break large (greater than 512 KB) I/O requests into 128 KB
> requests. Since the Linux I/O performance and file system were designed for a desktop
> replacement for Windows, none of this comes as much of a surprise.

I have too, on very old kernels or with consumer ATA drives that require this.
On enterprise-grade storage Linux does multi-megabyte I/O requests, depending
on the hardware capabilities and page size.

> Linux has other issues, as I see it; for starters, the lack of someone to take charge or
> responsibility. With Linux, if you find a problem, groups of people are going to have to
> agree to fix it, and the people writing Linux might not necessarily be responsive to the
> problems you're facing.

And that is different from a large software company how? I've seen more
problems with this in large companies with split groups than in the open
source world. And of course you don't need any agreement to deploy a fix on
your own, thanks to the open source licenses.

> The goals for Linux file systems and the Linux kernel design seem to be trying to address
> a completely different set of problems than AIX or Solaris, and IBM and Sun are far more
> directly responsible than the Linux community if you have a problem. If you run AIX or
> Solaris and complain to IBM or Sun, they can't say we have no control.

But that doesn't mean they'll fix your problem, quite the contrary. With Linux
you or a consulting company can fix it. (Note: the latter is what I partially
do for a living.)
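To make the direct I/O point concrete, here is a minimal sketch of what
explicitly requesting direct I/O with a sector-aligned buffer looks like on
Linux. The 512-byte alignment and the 1 MB request size are assumptions for
typical commodity storage, not values taken from the article.

/* Minimal sketch: explicit direct I/O on Linux with a sector-aligned buffer.
 * Assumes a 512-byte logical sector size; query the device if unsure. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGNMENT  512              /* assumed logical sector size */
#define REQUEST_SZ (1024 * 1024)    /* 1 MB read, a multiple of the sector size */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        /* O_DIRECT bypasses the page cache; the application must provide
         * suitably aligned buffers, offsets and lengths itself. */
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *buf;
        if (posix_memalign(&buf, ALIGNMENT, REQUEST_SZ)) {
                perror("posix_memalign");
                return 1;
        }

        ssize_t ret = read(fd, buf, REQUEST_SZ);
        if (ret < 0)
                perror("read");
        else
                printf("read %zd bytes with O_DIRECT\n", ret);

        free(buf);
        close(fd);
        return 0;
}

A trivial LD_PRELOAD shim that ORs O_DIRECT into the flags of selected open()
calls would give the "automatic" behaviour the article asks for; the point is
that the kernel leaves this policy decision to userspace.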
> The Linux file systems that are commonly used today (ext-3 today and likely soon ext-4
> and xfs) have not had huge structural changes in a long time. Ext-4 improves upon ext-3
> and ext-2 for some improved allocation, but simple things like alignment of the
> superblock to the RAID stripe and the first metadata allocation are not considered.
> Additionally, things like alignment of additional file system metadata regions to RAID
> stripe value are not considered,

The XFS superblock is in filesystem block zero. Where that sits relative to
the disk depends on the disk label, which is why large Linux storage
installations do not use DOS disk labels but put the volume manager directly
on the raw device or use EFI disk labels. (Or SGI labels in the case of SGI,
but that has historical reasons.) XFS does stripe-align metadata allocations
by default, and even detects the stripe geometry automatically for the various
software RAID and volume manager implementations.

> nor are simple things like indirect allocations (see File Systems and Volume Managers:
> History and Usage), which are fixed values so with the small allocations supported (4 KB
> maximum), large numbers of allocations are required. Take a 200 TB file system, which
> will require 53.7 billion allocations to represent the 200 TB using the largest
> allocation size of 4 KB supported by ext-3. Using 8 MB, which is feasible on enterprise
> file systems, it becomes a manageable 26.2 million allocations. The bitmap or allocation
> map could even fit in memory for this number of allocations! The XFS file system has
> very similar characteristics to ext-3. Yes, allocations can be larger, up to 64 KB if
> the Linux page size is 64 KB, but the alignment issues for the superblock, metadata
> regions and other issues still exist.

XFS does not use indirect blocks but is extent based, and typical extent sizes
are in the two- to three-digit gigabyte range for large streaming I/O. Compare
the paper and presentation at [1] and [2]. Allocations aren't related to the
filesystem block size at all in an extent-based filesystem. XFS also supports
per-inode extent-size hints to control the default size of the allocations
(see the sketch below).

http://www.enterprisestorageforum.com/sans/features/article.php/3749926:

> There is a big difference in my world between the computation environments and the large
> storage environments. In the HPC computational environments I work with, I often see
> large clusters (yes, Linux clusters). Of the many hundreds of thousands of nodes that I
> am aware of, however, no one is using a large — by large, I mean 100 TB or greater —
> single instantiation of a Linux file system. I have not even seen a 50 TB Linux file
> system. That does not mean that they don't exist, but I have not seen them, nor have I
> heard of any.

There are lots of XFS installations in the hundreds-of-terabytes range.

> * The file system is not aligned to the RAID stripe unless you pad out the first
>   stripe to align the superblock. Almost all high-performance file systems that work
>   at large scale do this automatically, and metadata regions are always aligned to
>   RAID stripes, as metadata is often separated from the data on different devices so
>   the RAID stripes can be aligned with the allocation of the data and the metadata.

We've been there before; this simply is not true.
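As a sketch of the per-inode extent-size hints mentioned above: the hint is an
ordinary inode attribute set through the XFS_IOC_FSSETXATTR ioctl from the
xfsprogs development headers. The 1 GB hint value and the file name handling
here are just illustrative assumptions, not recommendations.

/* Sketch: set a per-inode extent size hint on a freshly created XFS file.
 * The hint must be set before the file gets its first extent, and the value
 * should be a multiple of the filesystem block size. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>            /* struct fsxattr, XFS_IOC_FS[GS]ETXATTR */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <new-xfs-file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fsxattr fsx;
        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("XFS_IOC_FSGETXATTR");
                return 1;
        }

        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;    /* honour the extent size hint */
        fsx.fsx_extsize = 1024 * 1024 * 1024;   /* 1 GB hint (example value) */

        if (ioctl(fd, XFS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("XFS_IOC_FSSETXATTR");
                return 1;
        }

        close(fd);
        return 0;
}

The xfs_io utility exposes the same attribute from the shell via its extsize
command, so no code is needed for one-off tuning; the ioctl form just shows
that the hint is a plain per-inode attribute rather than a mount-wide knob.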
> Fscking the log is not good enough when you have a hardware issue ranging from a RAID
> hiccup to a hard crash of multiple things caused by something like a power incident. If
> this happens, you must fsck the whole file system, not just the log (a number of
> responders pointed this out). Since the metadata is spread through the file system and
> all the metadata must be checked, given a variety of issues from power to RAID hiccups
> to undetected or mis-corrected errors, there is going to be a great deal of head-seeking
> involved and read-modify write on the RAID devices, since the metadata areas are not
> aligned with RAID stripe applications

"Fscking the log" probably refers to log replay, and indeed that is not enough
in the case of a disaster. That's why fsck performance does matter. After the
work described in [3], which was committed to the XFS repository a long time
ago and has been improved upon since, XFS is probably the industry leader in
full fsck performance.

> One other thing I tried to make clear was that small SMP systems with two or even four
> sockets are not being used for the type of environment I've been talking about. If you
> have a 500 TB file system, you often need more bandwidth to the file system than can be
> provided in a four-socket system with, say, two PCIe 2.0 buses (10 GB/sec of theoretical
> bandwidth). Many times these types of systems have eight or even 16 PCIe buses and 10
> GB/sec to 20 GB/sec (or more) of bandwidth. These types of environments are not using
> blades, nor can they, given that breaking up the large file systems is expensive in
> terms of management overhead and scaling performance.

See [1] and [2] for systems that were about mid-range among XFS deployments
two years ago. Systems have grown massively since.

In addition to these points, it might help to read up a little on XFS, for
example at [4].

[1] http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
[2] http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf
[3] http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf
[4] http://oss.sgi.com/projects/xfs/training/index.html
[5] http://www.sgi.com/products/software/linux/propack.html