Appears in the proceedings of the 13th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIII)
Vasily Tarasov, Saumitra Bhanage, and Erez Zadok
Stony Brook University

Margo Seltzer
Harvard University

Abstract

The quality of file system benchmarking has not improved in over a decade of intense research spanning hundreds of publications. Researchers repeatedly use a wide range of poorly designed benchmarks, and in most cases develop their own ad-hoc benchmarks. Our community lacks a definition of what we want to benchmark in a file system. We propose several dimensions of file system benchmarking and review the wide range of tools and techniques in widespread use. We experimentally show that even the simplest of benchmarks can be fragile, producing performance results spanning orders of magnitude. It is our hope that this paper will spur serious debate in our community, leading to action that can improve how we evaluate our file and storage systems.
1 Introduction

Each year, the research community publishes dozens of papers proposing new or improved file and storage system solutions. Practically every such paper includes an evaluation demonstrating how good the proposed approach is on some set of benchmarks. In many cases, the benchmarks are fairly well known and widely accepted; researchers present means, standard deviations, and other metrics to suggest some element of statistical rigor. It would seem, then, that the world of file system benchmarking is in good order, and we should all pat ourselves on the back and continue along with our current methodology.

We think not. We claim that file system benchmarking is actually a disaster area: full of incomplete and misleading results that make it virtually impossible to understand what system or approach to use in any particular scenario. In Section 3, we demonstrate the fragility that results when using a common file system benchmark (Filebench) to answer a simple question: "How good is the random read performance of Linux file systems?" This seemingly trivial example highlights how hard it is to answer even simple questions, and also how, as a community, we have come to rely on a set of common benchmarks without really asking ourselves what we need to evaluate.

The fundamental problems are twofold. First, the accuracy of published results is questionable in other scientific areas, but may be even worse in ours [11, 12]. Second, we are asking an ill-defined question when we ask, "Which file system is better?" We limit our discussion here to the second point.
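To make the question concrete, the measurement at issue can be sketched as a trivial microbenchmark. This is a minimal Python illustration, not Filebench itself; the function name and parameters are our own. Even in this stripped-down form, the hidden variables noted in the comments are enough to swing the result by orders of magnitude:

```python
import os
import random
import time

def random_read_ops_per_sec(path, io_size=4096, n_ops=10_000):
    """Issue n_ops reads at random offsets within a file; return ops/sec.

    Fragility lurks even here: the result depends on the file size
    relative to the page cache, whether the file was recently written
    (and is therefore cached), the I/O size, and whether the reads
    bypass the cache (e.g., via O_DIRECT).
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        for _ in range(n_ops):
            # Pick a random aligned-enough offset that leaves room
            # for a full io_size read.
            offset = random.randrange(0, max(size - io_size, 1))
            os.pread(fd, io_size, offset)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return n_ops / elapsed
```

Run against a file that fits in memory, this measures the page cache; run against a file much larger than RAM, it measures the disk. Both are "random read performance," which is precisely the ambiguity our Section 3 experiments expose.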
What does it mean for one file system to be better than another? Many might immediately focus on performance: "I want the file system that is faster!" But faster under what conditions? One system might be faster for accessing many small files, while another is faster for accessing a single large file. One system might perform better than another when the data starts on disk (e.g., its on-disk layout is superior). One system might perform better on meta-data operations, while another handles data better. Given the multi-dimensional aspect of the question, we argue that the answer can never be a single number or the result of a single benchmark. Of course, we all know that, and that is why every paper worth the time to read presents multiple benchmark results. But how many of those give the reader any help in interpreting the results to apply them to any question other than the narrow question being asked in that paper?

The benchmarks we choose should measure the aspect of the system on which the research in a paper focuses. That means that we need to understand precisely what information any given benchmark reveals. For example, many file system papers use a Linux kernel build as an evaluation metric. However, on practically all modern systems, a kernel build is a CPU-bound process, so what does it mean to use it as a file system benchmark? The...