aboutsummaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
Diffstat (limited to 'docs')
-rw-r--r--docs/listdir.md330
1 files changed, 0 insertions, 330 deletions
diff --git a/docs/listdir.md b/docs/listdir.md
deleted file mode 100644
index 0939cc18..00000000
--- a/docs/listdir.md
+++ /dev/null
@@ -1,330 +0,0 @@
-# Fast directory listing
-
-In order to find untracked files in a git repository, [gitstatusd](../README.md) needs to list the
-contents of every directory. gitstatusd does it 27% faster than a reasonable implementation that a
-seasoned C/C++ practitioner might write. This document explains the optimizations that went into it.
-As directory listing is a common operation, many other projects can benefit from applying these
-optimizations.
-
-## v1
-
-Given a path to a directory, `ListDir()` must produce the list of files in that directory. Moreover,
-the list must be sorted lexicographically to enable fast comparison with Git index.
-
-The following C++ implementation gets the job done. For simplicity, it returns an empty list on
-error.
-
-```c++
-vector<string> ListDir(const char* dirname) {
- vector<string> entries;
- if (DIR* dir = opendir(dirname)) {
- while (struct dirent* ent = (errno = 0, readdir(dir))) {
- if (!Dots(ent->d_name)) entries.push_back(ent->d_name);
- }
- if (errno) entries.clear();
- sort(entries.begin(), entries.end());
- closedir(dir);
- }
- return entries;
-}
-```
-
-Every directory has entries `"."` and `".."`, which we aren't interested in. We filter them out with
-a helper function `Dots()`.
-
-```c++
-bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }
-```
-
-To check how fast `ListDir()` performs, we can run it many times on a typical directory. One million
-runs on a directory with 32 files with 16-character names takes 12.7 seconds.
-
-## v2
-
-Experienced C++ practitioners will scoff at our implementation of `ListDir()`. If it's meant to be
-efficient, returning `vector<string>` is an unaffordable convenience. To avoid heap allocations we
-can use a simple arena that will allow us to reuse memory between different `ListDir()` calls.
-
-(Changed and added lines are marked with comments.)
-
-```c++
-void ListDir(const char* dirname, string& arena, vector<char*>& entries) { // +
- entries.clear(); // +
- if (DIR* dir = opendir(dirname)) {
- arena.clear(); // +
- while (struct dirent* ent = (errno = 0, readdir(dir))) {
- if (!Dots(ent->d_name)) {
- entries.push_back(reinterpret_cast<char*>(arena.size())); // +
- arena.append(ent->d_name, strlen(ent->d_name) + 1); // +
- }
- }
- if (errno) entries.clear();
- for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)]; // +
- sort(entries.begin(), entries.end(), // +
- [](const char* a, const char* b) { return strcmp(a, b) < 0; }); // +
- closedir(dir);
- }
-}
-```
-
-To make performance comparison easier, we can normalize them relative to the baseline. v1 will get
-performance score of 100. A twice-as-fast alternative will be 200.
-
-| version | optimization | score |
-|---------|----------------------------|----------:|
-| v1 | baseline | 100.0 |
-| **v2** | **avoid heap allocations** | **112.7** |
-
-Avoiding heap allocations makes `ListDir()` 12.7% faster. Not bad. As an added bonus, those casts
-will fend off the occasional frontend developer who accidentally wanders into the codebase.
-
-## v3
-
-`opendir()` is an expensive call whose performance is linear in the number of subdirectories in the
-path because it needs to perform a lookup for every one of them. We can replace it with `openat()`,
-which takes a file descriptor to the parent directory and a name of the subdirectory. Just a single
-lookup, less CPU time. This optimization assumes that callers already have a descriptor to the
-parent directory, which is indeed the case for gitstatusd, and is often the case in other
-applications that traverse filesystem.
-
-```c++
-void ListDir(int parent_fd, const char* dirname, string& arena, vector<char*>& entries) { // +
- entries.clear();
- int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC); // +
- if (dir_fd < 0) return; // +
- if (DIR* dir = fdopendir(dir_fd)) {
- arena.clear();
- while (struct dirent* ent = (errno = 0, readdir(dir))) {
- if (!Dots(ent->d_name)) {
- entries.push_back(reinterpret_cast<char*>(arena.size()));
- arena.append(ent->d_name, strlen(ent->d_name) + 1);
- }
- }
- if (errno) entries.clear();
- for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];
- sort(entries.begin(), entries.end(),
- [](const char* a, const char* b) { return strcmp(a, b) < 0; });
- closedir(dir);
- } else { // +
- close(dir_fd); // +
- } // +
-}
-```
-
-This is worth about 3.5% in speed.
-
-| version | optimization | score |
-|---------|--------------------------------------|----------:|
-| v1 | baseline | 100.0 |
-| v2 | avoid heap allocations | 112.7 |
-| **v3** | **open directories with `openat()`** | **116.2** |
-
-## v4
-
-Copying file names to the arena isn't free but it doesn't seem like we can avoid it. Poking around
-we can see that the POSIX API we are using is implemented on Linux on top of `getdents64` system
-call. Its documentation isn't very encouraging:
-
-```text
-These are not the interfaces you are interested in. Look at
-readdir(3) for the POSIX-conforming C library interface. This page
-documents the bare kernel system call interfaces.
-
-Note: There are no glibc wrappers for these system calls.
-```
-
-Hmm... The API looks like something we can take advantage of, so let's try it anyway.
-
-First, we'll need a simple `Arena` class that can allocate 8KB blocks of memory.
-
-```c++
-class Arena {
- public:
- enum { kBlockSize = 8 << 10 };
-
- char* Alloc() {
- if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0);
- return blocks_[cur_++].data();
- }
-
- void Clear() { cur_ = 0; }
-
- private:
- size_t cur_ = 0;
- vector<string> blocks_;
-};
-```
-
-Next, we need to define `struct dirent64_t` ourselves because there is no wrapper for the system
-call we are about to use.
-
-```c++
-struct dirent64_t {
- ino64_t d_ino;
- off64_t d_off;
- unsigned short d_reclen;
- unsigned char d_type;
- char d_name[];
-};
-```
-
-Finally we can get to the implementation of `ListDir()`.
-
-```c++
-void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) { // +
- entries.clear();
- int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
- if (dir_fd < 0) return;
- arena.Clear(); // +
- while (true) { // +
- char* buf = arena.Alloc(); // +
- int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize); // +
- if (n <= 0) { // +
- if (n) entries.clear(); // +
- break; // +
- } // +
- for (int pos = 0; pos < n;) { // +
- auto* ent = reinterpret_cast<dirent64_t*>(buf + pos); // +
- if (!Dots(ent->d_name)) entries.push_back(ent->d_name); // +
- pos += ent->d_reclen; // +
- } // +
- } // +
- sort(entries.begin(), entries.end(),
- [](const char* a, const char* b) { return strcmp(a, b) < 0; });
- close(dir_fd);
-}
-```
-
-How are we doing with this one?
-
-| version | optimization | score |
-|---------|----------------------------------|----------:|
-| v1 | baseline | 100.0 |
-| v2 | avoid heap allocations | 112.7 |
-| v3 | open directories with `openat()` | 116.2 |
-| **v4** | **call `getdents64()` directly** | **137.8** |
-
-Solid 20% speedup. Worth the trouble. Unfortunately, we now have just one `reinterpret_cast` instead
-of two, and it's not nearly as scary-looking. Hopefully with the next iteration we can get back some
-of that evil vibe of low-level code.
-
-As a bonus, every element in `entries` has `d_type` at offset -1. This can be useful to the callers
-that need to distinguish between regular files and directories (gitstatusd, in fact, needs this).
-Note how `ListDir()` implements this feature at zero cost, as a lucky accident of `dirent64_t`
-memory layout.
-
-## v5
-
-The CPU profile of `ListDir()` reveals that almost all userspace CPU time is spent in `strcmp()`.
-Digging into the source code of `std::sort()` we can see that it uses Insertion Sort for short
-collections. Our 32-element vector falls under the threshold. Insertion Sort makes `O(N^2)`
-comparisons, hence a lot of CPU time in `strcmp()`. Switching to `qsort()` or
-[Timsort](https://en.wikipedia.org/wiki/Timsort) is of no use as all good sorting algorithms fall
-back to Insertion Sort.
-
-If we cannot make fewer comparisons, perhaps we can make each of them faster? `strcmp()` compares
-characters one at a time. It cannot read ahead as it can be illegal to touch memory past the first
-null byte. But _we_ know that it's safe to read a few extra bytes past the end of `d_name` for every
-entry except the last in the buffer. And since we own the buffer, we can overallocate it so that
-reading past the end of the last entry is also safe.
-
-Combining these ideas with the fact that file names on Linux are at most 255 bytes long, we can
-invoke `getdents64()` like this:
-
-```c++
-int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);
-```
-
-And then compare entries like this:
-
-```c++
-[](const char* a, const char* b) { return memcmp(a, b, 255) < 0; }
-```
-
-This version doesn't give any speedup compared to the previous but it opens an avenue for another
-optimization. The pointers we pass to `memcmp()` aren't aligned. To be more specific, their
-numerical values are `N * 8 + 3` for some `N`. When given such a pointer, `memcmp()` will check the
-first 5 bytes one by one, and only then switch to comparing 8 bytes at a time. If we can handle the
-first 5 bytes ourselves, we can pass aligned memory to `memcmp()` and take full advantage of its
-vectorized loop.
-
-Here's the implementation:
-
-```c++
-uint64_t Read64(const void* p) { // +
- uint64_t x; // +
- memcpy(&x, p, sizeof(x)); // +
- return x; // +
-} // +
-
-void ByteSwap64(void* p) { // +
- uint64_t x = __builtin_bswap64(Read64(p)); // +
- memcpy(p, &x, sizeof(x)); // +
-} // +
-
-void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) {
- entries.clear();
- int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
- if (dir_fd < 0) return;
- arena.Clear();
- while (true) {
- char* buf = arena.Alloc();
- int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256); // +
- if (n <= 0) {
- if (n) entries.clear();
- break;
- }
- for (int pos = 0; pos < n;) {
- auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);
- if (!Dots(ent->d_name)) {
- ByteSwap64(ent->d_name); // +
- entries.push_back(ent->d_name);
- }
- pos += ent->d_reclen;
- }
- }
- sort(entries.begin(), entries.end(), [](const char* a, const char* b) {
- uint64_t x = Read64(a); // +
- uint64_t y = Read64(b); // +
- return x < y || (x == y && a != b && memcmp(a + 5, b + 5, 256) < 0); // +
- });
- for (char* p : entries) ByteSwap64(p); // +
- close(dir_fd);
-}
-```
-
-This is for Little Endian architecture. Big Endian doesn't need `ByteSwap64()`, so it'll be a bit
-faster.
-
-| version | optimization | score |
-|---------|----------------------------------|----------:|
-| v1 | baseline | 100.0 |
-| v2 | avoid heap allocations | 112.7 |
-| v3 | open directories with `openat()` | 116.2 |
-| v4 | call `getdents64()` directly | 137.8 |
-| **v5** | **hand-optimize `strcmp()`** | **143.3** |
-
-Fast and respectably arcane.
-
-## Conclusion
-
-Through a series of incremental improvements we've sped up directory listing by 43.3% compared to a
-naive implementation (v1) and 27.2% compared to a reasonable implementation that a seasoned C/C++
-practitioner might write (v2).
-
-However, these numbers are based on an artificial benchmark while the real judge is always the real
-code. Our goal was to speed up gitstatusd. Benchmark was just a tool. Thankfully, the different
-versions of `ListDir()` have the same comparative performance within gitstatusd as in the benchmark.
-In truth, the directory chosen for the benchmark wasn't arbitrary. It was picked by sampling
-gitstatusd when it runs on [chromium](https://github.com/chromium/chromium) git repository.
-
-The final version of `ListDir()` spends 97% of its CPU time in the kernel. If we assume that it
-makes the minimum possible number of system calls and these calls are optimal (true to the best
-of my knowledge), it puts the upper bound on possible future performance improvements at just 3%.
-There is almost nothing left in `ListDir()` to optimize.
-
-![ListDir() CPU profile](
- https://raw.githubusercontent.com/romkatv/gitstatus/1ac366952366d89980b3f3484f270b4fa5ae4293/cpu-profile-listdir.png)
-
-(The CPU profile was created with [gperftools](https://github.com/gperftools/gperftools) and
-rendered with [pprof](https://github.com/google/pprof)).