diff options
Diffstat (limited to 'docs/listdir.md')
-rw-r--r-- | docs/listdir.md | 330 |
1 files changed, 330 insertions, 0 deletions
diff --git a/docs/listdir.md b/docs/listdir.md new file mode 100644 index 00000000..0939cc18 --- /dev/null +++ b/docs/listdir.md @@ -0,0 +1,330 @@ +# Fast directory listing + +In order to find untracked files in a git repository, [gitstatusd](../README.md) needs to list the +contents of every directory. gitstatusd does it 27% faster than a reasonable implementation that a +seasoned C/C++ practitioner might write. This document explains the optimizations that went into it. +As directory listing is a common operation, many other projects can benefit from applying these +optimizations. + +## v1 + +Given a path to a directory, `ListDir()` must produce the list of files in that directory. Moreover, +the list must be sorted lexicographically to enable fast comparison with Git index. + +The following C++ implementation gets the job done. For simplicity, it returns an empty list on +error. + +```c++ +vector<string> ListDir(const char* dirname) { + vector<string> entries; + if (DIR* dir = opendir(dirname)) { + while (struct dirent* ent = (errno = 0, readdir(dir))) { + if (!Dots(ent->d_name)) entries.push_back(ent->d_name); + } + if (errno) entries.clear(); + sort(entries.begin(), entries.end()); + closedir(dir); + } + return entries; +} +``` + +Every directory has entries `"."` and `".."`, which we aren't interested in. We filter them out with +a helper function `Dots()`. + +```c++ +bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); } +``` + +To check how fast `ListDir()` performs, we can run it many times on a typical directory. One million +runs on a directory with 32 files with 16-character names takes 12.7 seconds. + +## v2 + +Experienced C++ practitioners will scoff at our implementation of `ListDir()`. If it's meant to be +efficient, returning `vector<string>` is an unaffordable convenience. To avoid heap allocations we +can use a simple arena that will allow us to reuse memory between different `ListDir()` calls. + +(Changed and added lines are marked with comments.) + +```c++ +void ListDir(const char* dirname, string& arena, vector<char*>& entries) { // + + entries.clear(); // + + if (DIR* dir = opendir(dirname)) { + arena.clear(); // + + while (struct dirent* ent = (errno = 0, readdir(dir))) { + if (!Dots(ent->d_name)) { + entries.push_back(reinterpret_cast<char*>(arena.size())); // + + arena.append(ent->d_name, strlen(ent->d_name) + 1); // + + } + } + if (errno) entries.clear(); + for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)]; // + + sort(entries.begin(), entries.end(), // + + [](const char* a, const char* b) { return strcmp(a, b) < 0; }); // + + closedir(dir); + } +} +``` + +To make performance comparison easier, we can normalize them relative to the baseline. v1 will get +performance score of 100. A twice-as-fast alternative will be 200. + +| version | optimization | score | +|---------|----------------------------|----------:| +| v1 | baseline | 100.0 | +| **v2** | **avoid heap allocations** | **112.7** | + +Avoiding heap allocations makes `ListDir()` 12.7% faster. Not bad. As an added bonus, those casts +will fend off the occasional frontend developer who accidentally wanders into the codebase. + +## v3 + +`opendir()` is an expensive call whose performance is linear in the number of subdirectories in the +path because it needs to perform a lookup for every one of them. We can replace it with `openat()`, +which takes a file descriptor to the parent directory and a name of the subdirectory. Just a single +lookup, less CPU time. This optimization assumes that callers already have a descriptor to the +parent directory, which is indeed the case for gitstatusd, and is often the case in other +applications that traverse filesystem. + +```c++ +void ListDir(int parent_fd, const char* dirname, string& arena, vector<char*>& entries) { // + + entries.clear(); + int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC); // + + if (dir_fd < 0) return; // + + if (DIR* dir = fdopendir(dir_fd)) { + arena.clear(); + while (struct dirent* ent = (errno = 0, readdir(dir))) { + if (!Dots(ent->d_name)) { + entries.push_back(reinterpret_cast<char*>(arena.size())); + arena.append(ent->d_name, strlen(ent->d_name) + 1); + } + } + if (errno) entries.clear(); + for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)]; + sort(entries.begin(), entries.end(), + [](const char* a, const char* b) { return strcmp(a, b) < 0; }); + closedir(dir); + } else { // + + close(dir_fd); // + + } // + +} +``` + +This is worth about 3.5% in speed. + +| version | optimization | score | +|---------|--------------------------------------|----------:| +| v1 | baseline | 100.0 | +| v2 | avoid heap allocations | 112.7 | +| **v3** | **open directories with `openat()`** | **116.2** | + +## v4 + +Copying file names to the arena isn't free but it doesn't seem like we can avoid it. Poking around +we can see that the POSIX API we are using is implemented on Linux on top of `getdents64` system +call. Its documentation isn't very encouraging: + +```text +These are not the interfaces you are interested in. Look at +readdir(3) for the POSIX-conforming C library interface. This page +documents the bare kernel system call interfaces. + +Note: There are no glibc wrappers for these system calls. +``` + +Hmm... The API looks like something we can take advantage of, so let's try it anyway. + +First, we'll need a simple `Arena` class that can allocate 8KB blocks of memory. + +```c++ +class Arena { + public: + enum { kBlockSize = 8 << 10 }; + + char* Alloc() { + if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0); + return blocks_[cur_++].data(); + } + + void Clear() { cur_ = 0; } + + private: + size_t cur_ = 0; + vector<string> blocks_; +}; +``` + +Next, we need to define `struct dirent64_t` ourselves because there is no wrapper for the system +call we are about to use. + +```c++ +struct dirent64_t { + ino64_t d_ino; + off64_t d_off; + unsigned short d_reclen; + unsigned char d_type; + char d_name[]; +}; +``` + +Finally we can get to the implementation of `ListDir()`. + +```c++ +void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) { // + + entries.clear(); + int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC); + if (dir_fd < 0) return; + arena.Clear(); // + + while (true) { // + + char* buf = arena.Alloc(); // + + int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize); // + + if (n <= 0) { // + + if (n) entries.clear(); // + + break; // + + } // + + for (int pos = 0; pos < n;) { // + + auto* ent = reinterpret_cast<dirent64_t*>(buf + pos); // + + if (!Dots(ent->d_name)) entries.push_back(ent->d_name); // + + pos += ent->d_reclen; // + + } // + + } // + + sort(entries.begin(), entries.end(), + [](const char* a, const char* b) { return strcmp(a, b) < 0; }); + close(dir_fd); +} +``` + +How are we doing with this one? + +| version | optimization | score | +|---------|----------------------------------|----------:| +| v1 | baseline | 100.0 | +| v2 | avoid heap allocations | 112.7 | +| v3 | open directories with `openat()` | 116.2 | +| **v4** | **call `getdents64()` directly** | **137.8** | + +Solid 20% speedup. Worth the trouble. Unfortunately, we now have just one `reinterpret_cast` instead +of two, and it's not nearly as scary-looking. Hopefully with the next iteration we can get back some +of that evil vibe of low-level code. + +As a bonus, every element in `entries` has `d_type` at offset -1. This can be useful to the callers +that need to distinguish between regular files and directories (gitstatusd, in fact, needs this). +Note how `ListDir()` implements this feature at zero cost, as a lucky accident of `dirent64_t` +memory layout. + +## v5 + +The CPU profile of `ListDir()` reveals that almost all userspace CPU time is spent in `strcmp()`. +Digging into the source code of `std::sort()` we can see that it uses Insertion Sort for short +collections. Our 32-element vector falls under the threshold. Insertion Sort makes `O(N^2)` +comparisons, hence a lot of CPU time in `strcmp()`. Switching to `qsort()` or +[Timsort](https://en.wikipedia.org/wiki/Timsort) is of no use as all good sorting algorithms fall +back to Insertion Sort. + +If we cannot make fewer comparisons, perhaps we can make each of them faster? `strcmp()` compares +characters one at a time. It cannot read ahead as it can be illegal to touch memory past the first +null byte. But _we_ know that it's safe to read a few extra bytes past the end of `d_name` for every +entry except the last in the buffer. And since we own the buffer, we can overallocate it so that +reading past the end of the last entry is also safe. + +Combining these ideas with the fact that file names on Linux are at most 255 bytes long, we can +invoke `getdents64()` like this: + +```c++ +int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256); +``` + +And then compare entries like this: + +```c++ +[](const char* a, const char* b) { return memcmp(a, b, 255) < 0; } +``` + +This version doesn't give any speedup compared to the previous but it opens an avenue for another +optimization. The pointers we pass to `memcmp()` aren't aligned. To be more specific, their +numerical values are `N * 8 + 3` for some `N`. When given such a pointer, `memcmp()` will check the +first 5 bytes one by one, and only then switch to comparing 8 bytes at a time. If we can handle the +first 5 bytes ourselves, we can pass aligned memory to `memcmp()` and take full advantage of its +vectorized loop. + +Here's the implementation: + +```c++ +uint64_t Read64(const void* p) { // + + uint64_t x; // + + memcpy(&x, p, sizeof(x)); // + + return x; // + +} // + + +void ByteSwap64(void* p) { // + + uint64_t x = __builtin_bswap64(Read64(p)); // + + memcpy(p, &x, sizeof(x)); // + +} // + + +void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) { + entries.clear(); + int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC); + if (dir_fd < 0) return; + arena.Clear(); + while (true) { + char* buf = arena.Alloc(); + int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256); // + + if (n <= 0) { + if (n) entries.clear(); + break; + } + for (int pos = 0; pos < n;) { + auto* ent = reinterpret_cast<dirent64_t*>(buf + pos); + if (!Dots(ent->d_name)) { + ByteSwap64(ent->d_name); // + + entries.push_back(ent->d_name); + } + pos += ent->d_reclen; + } + } + sort(entries.begin(), entries.end(), [](const char* a, const char* b) { + uint64_t x = Read64(a); // + + uint64_t y = Read64(b); // + + return x < y || (x == y && a != b && memcmp(a + 5, b + 5, 256) < 0); // + + }); + for (char* p : entries) ByteSwap64(p); // + + close(dir_fd); +} +``` + +This is for Little Endian architecture. Big Endian doesn't need `ByteSwap64()`, so it'll be a bit +faster. + +| version | optimization | score | +|---------|----------------------------------|----------:| +| v1 | baseline | 100.0 | +| v2 | avoid heap allocations | 112.7 | +| v3 | open directories with `openat()` | 116.2 | +| v4 | call `getdents64()` directly | 137.8 | +| **v5** | **hand-optimize `strcmp()`** | **143.3** | + +Fast and respectably arcane. + +## Conclusion + +Through a series of incremental improvements we've sped up directory listing by 43.3% compared to a +naive implementation (v1) and 27.2% compared to a reasonable implementation that a seasoned C/C++ +practitioner might write (v2). + +However, these numbers are based on an artificial benchmark while the real judge is always the real +code. Our goal was to speed up gitstatusd. Benchmark was just a tool. Thankfully, the different +versions of `ListDir()` have the same comparative performance within gitstatusd as in the benchmark. +In truth, the directory chosen for the benchmark wasn't arbitrary. It was picked by sampling +gitstatusd when it runs on [chromium](https://github.com/chromium/chromium) git repository. + +The final version of `ListDir()` spends 97% of its CPU time in the kernel. If we assume that it +makes the minimum possible number of system calls and these calls are optimal (true to the best +of my knowledge), it puts the upper bound on possible future performance improvements at just 3%. +There is almost nothing left in `ListDir()` to optimize. + +![ListDir() CPU profile]( + https://raw.githubusercontent.com/romkatv/gitstatus/1ac366952366d89980b3f3484f270b4fa5ae4293/cpu-profile-listdir.png) + +(The CPU profile was created with [gperftools](https://github.com/gperftools/gperftools) and +rendered with [pprof](https://github.com/google/pprof)). |