/**
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

Folly API Documentation

folly/AtomicHashMap.h

folly/AtomicHashMap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.

Limitations


Although it can provide extreme performance, AtomicHashmap has some unique limitations as well.

For a complete list of limitations and departures from the UnorderedAssociativeContainer concept, see folly/AtomicHashMap.h.

Unique Features


Usage


Usage is similar to most maps, although note the conspicuous lack of operator[], which encourages non-thread-safe access patterns.

Below is a synchronized key counter implementation that allows the counter values to be incremented in parallel with serializing all the values to a string.

   class Counters {
    private:
     AtomicHashMap<int64_t,int64_t> ahm;

    public:
     explicit Counters(size_t numCounters) : ahm(numCounters) {}

     void increment(int64_t obj_id) {
       auto ret = ahm.insert(make_pair(obj_id, 1));
       if (!ret.second) {
         // obj_id already exists, increment
         NoBarrier_AtomicIncrement(&ret.first->second, 1);
       }
     }

     int64_t getValue(int64_t obj_id) {
       auto ret = ahm.find(obj_id);
       return ret != ahm.end() ? ret->second : 0;
     }

     // Serialize the counters without blocking increments
     string toString() {
       string ret = "{\n";
       ret.reserve(ahm.size() * 32);
       for (const auto& e : ahm) {
         ret += folly::to<string>(
           "  [", e.first, ":", NoBarrier_Load(&e.second), "]\n");
       }
       ret += "}\n";
       return ret;
     }
   };

Implementation


AtomicHashMap is a composition of AtomicHashArray submaps, which implement the meat of the functionality. Only one AHA is created on initialization, and additional submaps are appended if the first one gets full. If the AHM grows, there will be multiple submaps that must be probed in series to find a given key. The more growth, the more submaps will be chained, and the slower it will get. If the initial size estimate is good, only one submap will ever be created and performance will be optimal.

AtomicHashArray is a fixed-size probing hash map (also referred to as an open addressed hash map) where hash collisions are resolved by checking subsequent elements. This means that they can be allocated in slabs as arrays of value_type elements, have excellent cache performance, and have no memory overhead from storing pointers.

The algorithm is simple - when inserting, the key is hash-mod’ed to an offset, and that element-key is atomically compare-and-swap’ed with the locked key value. If successful, the value is written and the element-key is unlocked by setting it to the input key value. If the compare fails, the next element is tried until success or the map is full.

Finds are even simpler. The key is hash-mod’ed to an offset, and the element-key is examined. If it is the same as the input key, the reference is returned, if it’s the empty key, failure is returned, otherwise the next key is tried. This can be done wait-free without any atomic instructions because the elements are always in a valid state.

Erase is done by finding the key, then compare-and-swap’ing the element-key with the reserved erased key value. If the swap succeeds, return success, otherwise return failure (the element was erased by a competing thread). If the key does not exist, return failure.
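
To make the scheme concrete, here is a minimal sketch of the probing insert described above. This is not folly’s actual implementation: kEmptyKey and kLockedKey are stand-ins for the reserved key values that AtomicHashArray lets you configure, and the race where another thread is mid-insert of the same key is glossed over.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <functional>

    constexpr int64_t kEmptyKey = 0;   // hypothetical reserved key values;
    constexpr int64_t kLockedKey = -1; // the real ones are configurable

    struct Cell {
      std::atomic<int64_t> key{kEmptyKey};
      int64_t value{0};
    };

    // Returns true if this call inserted the key; false if the key was
    // already present or the array is full.
    bool insert(Cell* cells, std::size_t capacity, int64_t key, int64_t value) {
      std::size_t idx = std::hash<int64_t>()(key) % capacity;
      for (std::size_t probes = 0; probes < capacity; ++probes) {
        int64_t expected = kEmptyKey;
        // Claim the slot by CAS'ing the element-key from empty to locked.
        if (cells[idx].key.compare_exchange_strong(expected, kLockedKey)) {
          cells[idx].value = value;  // write the value while the slot is locked
          cells[idx].key.store(key); // unlock by publishing the input key
          return true;
        }
        if (expected == key) {
          return false; // key already present
        }
        idx = (idx + 1) % capacity; // collision: probe the next element
      }
      return false; // map is full
    }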

folly/Benchmark.h

folly/Benchmark.h provides a simple framework for writing and executing benchmarks. Currently the framework targets only single-threaded testing (though you can internally use fork-join parallelism and measure total run time).

Overview


Using folly/Benchmark.h is very simple. Here’s an example:

    #include <folly/Benchmark.h>
    #include <vector>
    using namespace std;
    using namespace folly;
    BENCHMARK(insertFrontVector) {
      // Let's insert 100 elements at the front of a vector
      vector<int> v;
      for (unsigned int i = 0; i < 100; ++i) {
        v.insert(v.begin(), i);
      }
    }
    BENCHMARK(insertBackVector) {
      // Let's insert 100 elements at the back of a vector
      vector<int> v;
      for (unsigned int i = 0; i < 100; ++i) {
        v.insert(v.end(), i);
      }
    }
    int main() {
      runBenchmarks();
    }

Compiling and running this code produces the following on standard output:

    ===============================================================================
    test.cpp                                              relative ns/iter  iters/s
    ===============================================================================
    insertFrontVector                                                3.84K  260.38K
    insertBackVector                                                 1.61K  622.75K
    ===============================================================================

Let’s worry about the empty column “relative” later. The table contains, for each benchmark, the time spent per call and its inverse, the number of calls per second. Numbers are represented in metric notation (K for thousands, M for millions, etc.). As expected, in this example the second function is much faster (fewer ns/iter and more iters/s).

The macro BENCHMARK introduces a function and also adds it to an internal array containing all benchmarks in the system. The defined function takes no arguments and returns void.

The framework calls the function many times to collect statistics about it. Sometimes the function itself wants to do that iteration; for example, how about inserting n elements instead of 100? To do the iteration internally, use BENCHMARK with two parameters. The second parameter is the number of iterations and is passed by the framework down to the function. The type of the count is implicitly unsigned. Consider a slightly reworked example:

    #include <folly/Benchmark.h>
    #include <folly/container/Foreach.h>
    #include <vector>
    using namespace std;
    using namespace folly;
    BENCHMARK(insertFrontVector, n) {
      vector<int> v;
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.begin(), i);
      }
    }
    BENCHMARK(insertBackVector, n) {
      vector<int> v;
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.end(), i);
      }
    }
    int main() {
      runBenchmarks();
    }

The produced numbers are substantially different:

    ===============================================================================
    Benchmark                                             relative ns/iter  iters/s
    ===============================================================================
    insertFrontVector                                               39.92    25.05M
    insertBackVector                                                 3.46   288.89M
    ===============================================================================

Now the numbers indicate the speed of one single insertion because the framework assumed the user-defined function used internal iteration (which it does). So inserting at the back of a vector is more than 10 times faster than inserting at the front! Speaking of comparisons…

Baselines


Choosing one or more good baselines is a crucial activity in any measurement. Without a baseline there is little information to derive from the sheer numbers. If, for example, you do experimentation with algorithms, a good baseline is often an established approach (e.g. the built-in std::sort for sorting). Essentially all experimental numbers should be compared against some baseline.

To support baseline-driven measurements, folly/Benchmark.h defines BENCHMARK_RELATIVE, which works much like BENCHMARK, except it considers the most recent lexically-occurring BENCHMARK a baseline, and fills the “relative” column. Say, for example, we want to use front insertion for a vector as a baseline and see how back insertion compares with it:

    #include <folly/Benchmark.h>
    #include <folly/container/Foreach.h>
    #include <vector>
    using namespace std;
    using namespace folly;
    BENCHMARK(insertFrontVector, n) {
      vector<int> v;
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.begin(), i);
      }
    }
    BENCHMARK_RELATIVE(insertBackVector, n) {
      vector<int> v;
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.end(), i);
      }
    }
    int main() {
      runBenchmarks();
    }

This program prints something like:

    ===============================================================================
    Benchmark                                             relative ns/iter  iters/s
    ===============================================================================
    insertFrontVector                                               42.65    23.45M
    insertBackVector                                     1208.24%    3.53   283.30M
    ===============================================================================

showing the 1208.24% relative speed advantage of inserting at the back compared to the front. The scale is chosen so that 100% means identical speed, numbers smaller than 100% indicate the benchmark is slower than the baseline, and numbers greater than 100% indicate the benchmark is faster. For example, if you see 42%, the benchmark runs at 0.42 of the baseline speed. If you see 123%, the benchmark is 23% faster, i.e. it runs at 1.23 times the baseline speed.

To close the current benchmark group and start another, simply use BENCHMARK again.
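
For instance (the benchmark names below are made up for illustration), copyBaseline closes the first group and becomes the baseline for copyExperimental:

    BENCHMARK(sortBaseline, n) {
      // ... baseline of the first group
    }
    BENCHMARK_RELATIVE(sortExperimental, n) {
      // ... compared against sortBaseline
    }

    BENCHMARK(copyBaseline, n) {
      // ... closes the first group and starts a second one
    }
    BENCHMARK_RELATIVE(copyExperimental, n) {
      // ... compared against copyBaseline
    }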

Ars Gratia Artis


If you want to draw a horizontal line of dashes (e.g. at the end of a group or for whatever reason), use BENCHMARK_DRAW_LINE(). The line fulfills a purely aesthetic role; it doesn’t interact with measurements in any way.

    BENCHMARK(foo) {
      Foo foo;
      foo.doSomething();
    }

    BENCHMARK_DRAW_LINE();

    BENCHMARK(bar) {
      Bar bar;
      bar.doSomething();
    }

Suspending a benchmark


Sometimes benchmarking code must do some preparation work that is physically inside the benchmark function, but should not count toward its time budget. To temporarily suspend the benchmark, use the pseudo-statement BENCHMARK_SUSPEND as follows:

    BENCHMARK(insertBackVector, n) {
      vector<int> v;
      BENCHMARK_SUSPEND {
        v.reserve(n);
      }
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.end(), i);
      }
    }

The preallocation effected with v.reserve(n) will not count toward the total run time of the benchmark.

Only the main thread should call BENCHMARK_SUSPEND (and of course it should not call it while other threads are doing actual work). This is because the timer is application-global.

If the scope introduced by BENCHMARK_SUSPEND is not desired, you may want to “manually” use the BenchmarkSuspender type. Constructing such an object suspends time measurement, and destroying it resumes the measurement. If you want to resume time measurement before the destructor, call dismiss against the BenchmarkSuspender object. The previous example could have been written like this:

    BENCHMARK(insertBackVector, n) {
      BenchmarkSuspender braces;
      vector<int> v;
      v.reserve(n);
      braces.dismiss();
      for (unsigned int i = 0; i < n; ++i) {
        v.insert(v.end(), i);
      }
    }

doNotOptimizeAway


Finally, the small utility function doNotOptimizeAway prevents compiler optimizations that may interfere with benchmarking. Call doNotOptimizeAway(var) against variables that you use for benchmarking but that are otherwise useless. The compiler tends to do a good job at eliminating unused variables, and this function fools it into thinking a variable is in fact needed. Example:

    BENCHMARK(fpOps, n) {
      double d = 1;
      FOR_EACH_RANGE (i, 1, n) {
        d += i;
        d -= i;
        d *= i;
        d /= i;
      }
      doNotOptimizeAway(d);
    }

A look under the hood


folly/Benchmark.h has a simple, systematic approach to collecting timings.

First, it organizes measurements into several large epochs, and takes the minimum over all epochs. Taking the minimum gives the closest result to the real runtime. Benchmark timings are not a regular random variable that fluctuates around an average. Instead, the real time we’re looking for is one to which a variety of additive noise is applied (i.e. there is no noise that could actually shorten the benchmark time below its real value). In theory, taking an infinite number of samples and keeping the minimum yields the actual time being measured. That’s why the accuracy of benchmarking increases with the number of epochs.

Clearly, in real-world use there will also be noise and a variety of effects caused by the running context. But the noise during the benchmark (straight setup, simple looping) is a poor model for the noise in the real application. So taking the minimum across several epochs is the most informative result.

Inside each epoch, the function measured is iterated an increasing number of times until the total runtime is large enough to make noise negligible. At that point the time is collected, and the time per iteration is computed. As mentioned, the minimum time per iteration over all epochs is the final result.
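
In outline, the strategy looks roughly like the sketch below. This is a simplification, not folly’s actual code; now() is defined here only for the sketch.

    #include <algorithm>
    #include <chrono>
    #include <limits>

    // Seconds on a monotonic clock, for the sketch only.
    double now() {
      using namespace std::chrono;
      return duration<double>(steady_clock::now().time_since_epoch()).count();
    }

    // Runs fn(iters) with growing iteration counts; returns the minimum
    // observed time per iteration across all epochs.
    double measure(void (*fn)(unsigned), unsigned epochs, double minEpochTime) {
      double best = std::numeric_limits<double>::max();
      for (unsigned e = 0; e < epochs; ++e) {
        for (unsigned iters = 1;; iters *= 2) { // grow until noise is negligible
          double start = now();
          fn(iters);
          double elapsed = now() - start;
          if (elapsed >= minEpochTime) { // long enough to trust the sample
            best = std::min(best, elapsed / iters);
            break;
          }
        }
      }
      return best; // minimum time per iteration over all epochs
    }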

The timer function used is clock_gettime with the CLOCK_REALTIME clock id. Note that you must use a recent Linux kernel (2.6.38 or newer), otherwise the resolution of CLOCK_REALTIME is inadequate.

folly/Conv.h

folly/Conv.h is a one-stop-shop for converting values across types. Its main features are simplicity of the API (only the names to and toAppend must be memorized), speed (folly is significantly faster, sometimes by an order of magnitude, than comparable APIs), and correctness.

Synopsis


All examples below assume that folly/Conv.h has been included and using namespace folly; has been issued:

    // To format as text and append to a string, use toAppend.
    fbstring str;
    toAppend(2.5, &str);
    CHECK_EQ(str, "2.5");

    // Multiple arguments are okay, too. Just put the pointer to string at the end.
    toAppend(" is ", 2, " point ", 5, &str);
    CHECK_EQ(str, "2.5 is 2 point 5");

    // You don't need to use fbstring (although it's much faster for conversions and in general).
    std::string stdStr;
    toAppend("Pi is about ", 22.0 / 7, &stdStr);
    // In general, just use to<TargetType>(sourceValue). It returns its result by value.
    stdStr = to<std::string>("Variadic ", "arguments also accepted.");

    // to<fbstring> is 2.5x faster than to<std::string> for typical workloads.
    str = to<fbstring>("Variadic ", "arguments also accepted.");

Integral-to-integral conversion


Using to<Target>(value) to convert one integral type to another will behave as follows. If the target type can represent all values of the source type, the conversion has zero overhead:

    short x;
    unsigned short y;
    ...
    auto a = to<int>(x); // zero overhead conversion
    auto b = to<int>(y); // zero overhead conversion

Otherwise, to checks the value at runtime and throws if it cannot be represented in the target type:

    short x;
    unsigned short y;
    long z;
    ...
    x = 123;
    auto a = to<unsigned short>(x); // fine
    x = -1;
    a = to<unsigned short>(x); // THROWS
    z = 2000000000;
    auto b = to<int>(z); // fine
    z += 1000000000;
    b = to<int>(z); // THROWS
    auto c = to<unsigned int>(z); // fine

Anything-to-string conversion


As mentioned, there are two primitives for converting anything to string: to and toAppend. They support the same set of source types, literally by definition (to is implemented in terms of toAppend for all types). The call toAppend(value, &str) formats and appends value to str, whereas to<StringType>(value) formats value as a StringType and returns the result by value. Currently, the supported StringTypes are std::string and fbstring.

Both toAppend and to with a string type as a target support variadic arguments. Each argument is converted in turn. For toAppend the last argument in a variadic list must be the address of a supported string type (no need to specify the string type as a template argument).

Integral-to-string conversion

Nothing special here - integrals are converted to strings in decimal format, with a ‘-’ prefix for negative values. Example:

    auto a = to<fbstring>(123);
    assert(a == "123");
    a = to<fbstring>(-456);
    assert(a == "-456");

The conversion implementation is aggressively optimized. It converts two digits at a time assisted by fixed-size tables. Converting a long to an fbstring is 3.6x faster than using boost::lexical_cast and 2.5x faster than using sprintf even though the latter is used in conjunction with a stack-allocated constant-size buffer.

Note that converting integral types to fbstring has a particular advantage compared to converting to std::string: no integral type (<= 64 bits) has more than 20 decimal digits including the sign. Since fbstring employs the small string optimization for up to 23 characters, converting an integral to fbstring is guaranteed to not allocate memory, resulting in significant speed and memory locality gains. Benchmarks reveal a 2x gain on a typical workload.

char to string conversion

Although char is technically an integral type, most of the time you want the string representation of 'a' to be "a", not "97". That’s why folly/Conv.h handles char as a special case that does the expected thing. Note that signed char and unsigned char are still considered integral types.
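
For example:

    auto c = to<fbstring>('a');
    assert(c == "a"); // char formats as text

    auto sc = to<fbstring>(static_cast<signed char>('a'));
    assert(sc == "97"); // signed char is still formatted as an integer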

Floating point to string conversion

folly/Conv.h uses V8’s double conversion routines. They are accurate and fast; on typical workloads, to<fbstring>(doubleValue) is 1.9x faster than sprintf and 5.5x faster than boost::lexical_cast. (It is also 1.3x faster than to<std::string>(doubleValue).)

const char* to string conversion

For completeness, folly/Conv.h supports const char*, including string literals. The “conversion” consists, of course, of the string itself. Example:

    auto s = to<fbstring>("Hello, world");
    assert(s == "Hello, world");

Anything from string conversion (i.e. parsing)


folly/Conv.h includes three kinds of parsing routines:

  1. to<Type>(const char* begin, const char* end) rigidly converts the range [begin, end) to Type; no leading or trailing whitespace is allowed.
  2. to<Type>(stringy) converts any string-like value (const char*, StringPiece, std::string, or fbstring) to Type.
  3. to<Type>(&stringPiece) parses as much of stringPiece as possible as a Type and advances stringPiece past the consumed characters. For example:

    fbstring s = " 1234 angels on a pin";
    StringPiece pc(s);
    auto x = to<int>(&pc);
    assert(x == 1234);
    assert(pc == " angels on a pin");

Note how the routine ate the leading space but not the trailing one.

Parsing integral types

Parsing integral types is unremarkable: decimal format is expected, with an optional '+' or '-' sign for signed types (no optional '+' is allowed for unsigned types). The one remarkable element is speed: parsing typical long values is 6x faster than sscanf. folly/Conv.h uses aggressive loop unrolling and table-assisted SIMD-style code arrangement that avoids integral division (slow) and data dependencies across operations (ILP-unfriendly). Example:

    fbstring str = "  12345  ";
    assert(to<int>(str) == 12345);
    str = "  12345six seven eight";
    StringPiece pc(str);
    assert(to<int>(&pc) == 12345);
    assert(pc == "six seven eight");

Parsing floating-point types

folly/Conv.h uses, again, V8’s double-conversion routines as the back-end. The speed is 3x faster than sscanf and 1.7x faster than in-house routines such as parse<double>. But the more important detail is accuracy: even if you do code a routine that works faster than to<double>, chances are it is incorrect and will fail in a variety of corner cases. Using to<double> is strongly recommended.

Note that if the string “NaN” (with any capitalization) is passed to to<double> then NaN is returned, which can be tested for as follows:

    fbstring str = "nan"; // "NaN", "NAN", etc.
    double d = to<double>(str);
    if (std::isnan(d)) {
      // string was a valid representation of the double value NaN
    }

Note that passing “-NaN” (with any capitalization) to to<double> also returns NaN.

Note that if the strings “inf” or “infinity” (with any capitalization) are passed to to<double> then infinity is returned, which can be tested for as follows:

    fbstring str = "inf"; // "Inf", "INF", "infinity", "Infinity", etc.
    double d = to<double>(str);
    if (std::isinf(d)) {
      // string was a valid representation of one of the double values +Infinity
      // or -Infinity
    }

Note that passing “-inf” or “-infinity” (with any capitalization) to to<double> returns -infinity rather than +infinity. The sign of the infinity can be tested for as follows:

    fbstring str = "-inf"; // or "inf", "-Infinity", "+Infinity", etc.
    double d = to<double>(str);
    if (d == std::numeric_limits<double>::infinity()) {
      // string was a valid representation of the double value +Infinity
    } else if (d == -std::numeric_limits<double>::infinity()) {
      // string was a valid representation of the double value -Infinity
    }

Note that if an unparseable string is passed to to<double> then an exception is thrown, rather than NaN being returned. This can be tested for as follows:

    fbstring str = "not-a-double"; // Or "1.1.1", "", "$500.00", etc.
    double d;
    try {
      d = to<double>(str);
    } catch (const std::range_error &) {
      // string could not be parsed
    }

Note that the empty string ("") is an unparseable value, and will cause to<double> to throw an exception.

Non-throwing interfaces

tryTo<T> is the non-throwing variant of to<T>. It returns an Expected<T, ConversionCode>. You can think of Expected as like an Optional<T>, but if the conversion failed, Expected stores an error code instead of a T.

tryTo<T> has performance similar to to<T> when the conversion is successful. On the error path, you can expect tryTo<T> to be roughly three orders of magnitude faster than the throwing to<T>, and to completely avoid any lock contention arising from stack unwinding.

Here is how to use non-throwing conversions:

    auto t1 = tryTo<int>(str);
    if (t1.hasValue()) {
      use(t1.value());
    }

Expected has a composability feature to make the above pattern simpler.

    tryTo<int>(str).then([](int i) { use(i); });

folly/dynamic.h

folly/dynamic.h provides a runtime dynamically typed value for C++, similar to the way languages with runtime type systems work (e.g. Python). It can hold values from a predetermined set of types (ints, bools, arrays of other dynamics, etc.), similar to something like std::variant, but the syntax is intended to be a little more like using the native type directly.

Overview


Here are some code samples to get started (assumes a using folly::dynamic; was used):

    dynamic twelve = 12; // creates a dynamic that holds an integer
    dynamic str = "string"; // yep, this one is an fbstring

    // A few other types.
    dynamic nul = nullptr;
    dynamic boolean = false;

    // Arrays can be initialized with dynamic::array.
    dynamic array = dynamic::array("array ", "of ", 4, " elements");
    assert(array.size() == 4);
    dynamic emptyArray = dynamic::array;
    assert(emptyArray.empty());

    // Maps from dynamics to dynamics are called objects.  The
    // dynamic::object constant is how you make an empty map from dynamics
    // to dynamics.
    dynamic map = dynamic::object;
    map["something"] = 12;
    map["another_something"] = map["something"] * 2;

    // Dynamic objects may be initialized this way
    dynamic map2 = dynamic::object("something", 12)("another_something", 24);

Runtime Type Checking and Conversions


Any operation on a dynamic requires checking at runtime that the type is compatible with the operation. If it isn’t, you’ll get a folly::TypeError. Other exceptions can also be thrown if you try to do something impossible (e.g. if you put a very large 64-bit integer in and try to read it out as a double).

More examples should hopefully clarify this:

    dynamic dint = 42;

    dynamic str = "foo";
    dynamic anotherStr = str + "something"; // fine
    dynamic thisThrows = str + dint; // TypeError is raised

Explicit type conversions can be requested for some of the basic types:

    dynamic dint = 12345678;
    dynamic doub = dint.asDouble(); // doub will hold 12345678.0
    dynamic str = dint.asString(); // str == "12345678"

    dynamic hugeInt = std::numeric_limits<int64_t>::max();
    dynamic hugeDoub = hugeInt.asDouble();  // throws a folly/Conv.h error,
                                            // since it can't fit in a double

For more complicated conversions, see DynamicConverter.

Comparison Operators and Hashing

Equality Operators

Equality operators (==, !=) are supported for all types.

For dynamics of the same type, the underlying equality operator shall apply.

For comparisons between different numeric types (double and int64), numeric equality will apply - thus 2.0 == 2.

Values of any other different types will be deemed to not be equal.
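
To illustrate:

    dynamic a = 2;
    dynamic b = 2.0;
    assert(a == b); // numeric equality across int64 and double

    dynamic s = "2";
    assert(s != a); // values of different non-numeric types are never equal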

Ordering operators

Ordering operators (<, <=, >, >=) are supported for all types, except dynamic::object which will throw if it is involved in an ordering operator.

For dynamics of the same type, the underlying ordering operator shall apply.

For comparisons between different numeric types (double and int64), numeric ordering will apply - thus 1.5 < 2.

Ordering of values between other different types will maintain total ordering properties and be consistent within a given binary run, and thus safe for use in e.g. std::set. The actual ordering is undefined and could change across versions, thus a dependency should not be taken outside of the total ordering property within a given binary.

Hashing

Hashing is supported by all types, and the hashes of two values will match if they are equal per dynamic::operator==.

As a result, numerical types have the same numerical hashing regardless of int64 vs double - so e.g. std::hash<dynamic>()(2) will give the same value as std::hash<dynamic>()(2.0).

Iteration and Lookup


You can iterate over dynamic arrays as you would over any C++ sequence container.

    dynamic array = dynamic::array(2, 3, "foo");

    for (auto& val : array) {
      doSomethingWith(val);
    }

You can iterate over dynamic maps by calling items(), keys(), values(), which behave similarly to the homonymous methods of Python dictionaries.

    dynamic obj = dynamic::object(2, 3)("hello", "world")("x", 4);

    for (auto& pair : obj.items()) {
      // Key is pair.first, value is pair.second
      processKey(pair.first);
      processValue(pair.second);
    }

    for (auto& key : obj.keys()) {
      processKey(key);
    }

    for (auto& value : obj.values()) {
      processValue(value);
    }

You can find an element by key in a dynamic map using the find() method, which returns an iterator compatible with items():

    dynamic obj = dynamic::object(2, 3)("hello", "world")("x", 4);

    auto pos = obj.find("hello");
    // pos->first is "hello"
    // pos->second is "world"

    auto pos2 = obj.find("no_such_key");
    // pos2 == obj.items().end()

Use for JSON


The original motivation for implementing this type was to try to make dealing with json documents in C++ almost as easy as it is in languages with dynamic type systems (php or javascript, etc). The reader can judge whether we’re anywhere near that goal, but here’s what it looks like:

    // Parsing JSON strings and using them.
    std::string jsonDocument = R"({"key":12,"key2":[false, null, true, "yay"]})";
    dynamic parsed = folly::parseJson(jsonDocument);
    assert(parsed["key"] == 12);
    assert(parsed["key2"][0] == false);
    assert(parsed["key2"][1] == nullptr);

    // Building the same document programmatically.
    dynamic sonOfAJ = dynamic::object
      ("key", 12)
      ("key2", dynamic::array(false, nullptr, true, "yay"));

    // Printing.  (See also folly::toPrettyJson)
    auto str = folly::toJson(sonOfAJ);
    assert(jsonDocument.compare(str) == 0);

Performance


Dynamic typing is more expensive than static typing, even when you do it in C++. ;)

However, some effort has been made to keep folly::dynamic and the json (de)serialization at least reasonably performant for common cases. The heap is only used for arrays and objects, and move construction is fully supported. String formatting internally also uses the highly performant folly::to<> (see folly/Conv.h).

A trade-off to keep in mind, though, is that sizeof(folly::dynamic) is 64 bytes. You probably don’t want to use it if you need to allocate large numbers of them (prefer static types, etc.).

Some Design Rationale


Q. Why doesn’t a dynamic string support begin(), end(), and operator[]?

The value_type of a dynamic iterator is dynamic, and operator[] (or the at() function) has to return a reference to a dynamic. If we wanted this to work for strings, this would mean we’d have to support dynamics with a character type, and moreover that the internal representation of strings would be such that we can hand out references to dynamic as accessors on individual characters. There are a lot of potential efficiency drawbacks with this, and it seems like a feature that is not needed too often in practice.

Q. Isn’t this just a poor imitation of the C# language feature?

Pretty much.

folly/DynamicConverter.h

When dynamic objects contain data of a known type, it is sometimes useful to have a well-typed representation of them. A broad set of type conversions is contained in DynamicConverter.h, facilitating the transformation of dynamic objects into their well-typed counterparts.

Usage


Simply pass a dynamic into a templated convertTo:

    dynamic d = dynamic::array(dynamic::array(1, 2, 3), dynamic::array(4, 5)); // a vector of vectors of int
    auto vvi = convertTo<fbvector<fbvector<int>>>(d);

Supported Types


convertTo naturally supports conversions to

  1. arithmetic types (such as int64_t, unsigned short, bool, and double)
  2. fbstring, std::string
  3. containers and map-containers

NOTE:

convertTo will assume that Type is a container if

  1. it has a Type::value_type, and
  2. it has a Type::iterator, and
  3. it has a constructor that accepts two InputIterators

Additionally, convertTo will assume that Type is a map if

  1. it has a Type::key_type, and
  2. it has a Type::mapped_type, and
  3. its value_type is a pair of const key_type and mapped_type

If Type meets the container criteria, then it will be constructed by calling its InputIterator constructor.
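
For example, std::vector and std::map both satisfy the criteria above, so the following sketch should work as expected:

    dynamic d = dynamic::object("x", 1)("y", 2);
    auto m = convertTo<std::map<std::string, int>>(d);
    // m["x"] == 1 and m["y"] == 2

    dynamic a = dynamic::array(1, 2, 3);
    auto v = convertTo<std::vector<double>>(a);
    // v holds {1.0, 2.0, 3.0}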

Customization


If you want to use convertTo to convert dynamics into your own custom class, then all you have to do is provide a template specialization of DynamicConverter with the static method convert. Make sure you put it in namespace folly.

Example:

    struct Token {
      int kind_;
      fbstring lexeme_;
      
      explicit Token(int kind, const fbstring& lexeme)
        : kind_(kind), lexeme_(lexeme) {}
    };
    namespace folly {
    template <> struct DynamicConverter<Token> {
      static Token convert(const dynamic& d) {
        int k = convertTo<int>(d["KIND"]);
        fbstring lex = convertTo<fbstring>(d["LEXEME"]);
        return Token(k, lex);
      }
    };
    }
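
With that specialization in place, conversion works like any other, assuming a suitably shaped dynamic:

    dynamic d = dynamic::object("KIND", 2)("LEXEME", "foo");
    Token token = convertTo<Token>(d);
    // token.kind_ == 2, token.lexeme_ == "foo"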

Thread pools & Executors

Run your concurrent code in a performant way

All about thread pools

How do I use the thread pools?

Wangle provides two concrete thread pools (IOThreadPoolExecutor, CPUThreadPoolExecutor) and builds them in as part of a complete async framework. Generally you might want to grab the global executor, and use it with a future, like this:

auto f = someFutureFunction().via(getCPUExecutor()).then(...);

Or maybe you need to construct a thrift/memcache client, and need an event base:

auto f = getClient(getIOExecutor()->getEventBase())->callSomeFunction(args...)
         .via(getCPUExecutor())
         .then([](Result r){ /* ... do something with result */ });

vs. C++11's std::launch

The current C++11 std::launch has only two modes: async or deferred. In a production system, neither is what you want: async will launch a new thread for every launch without limit, while deferred will lazily defer the work until it is needed, and then run it synchronously in the current thread.

Wangle's thread pools always launch work as soon as possible and have limits on the maximum number of tasks / threads allowed, so we will never use more threads than absolutely needed. See the implementation details below about each type of executor.
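
For example, here is a minimal sketch of using a bounded pool directly (doWork is a placeholder for your own function):

#include <folly/executors/CPUThreadPoolExecutor.h>

void doWork(int i); // placeholder for your own function

void example() {
  folly::CPUThreadPoolExecutor pool(4); // at most 4 threads
  for (int i = 0; i < 100; ++i) {
    pool.add([i] { doWork(i); }); // queued; runs as soon as a thread is free
  }
  pool.join(); // wait for all queued work to finish
}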

Why do we need yet another set of thread pools?

Unfortunately none of the existing thread pools had every feature needed - things based on pipes are too slow. Several older ones didn't support std::function.

Why do we need several different types of thread pools?

If you want epoll support, you need an fd: eventfd is the latest notification hotness. Unfortunately, an active fd triggers all the epoll loops it is in, leading to a thundering herd, so if you want a fair queue (one queue total vs. one queue per worker thread), you need to use some kind of semaphore. Unfortunately semaphores can't be put in epoll loops, so they are incompatible with IO. Fortunately, you usually want to separate the IO and CPU bound work anyway to give stronger tail latency guarantees on IO.

IOThreadPoolExecutor

CPUThreadPoolExecutor

ThreadPoolExecutor

Base class that contains the thread startup/shutdown/stats logic, since this is pretty disjoint from how tasks are actually run.

Observers

An observer interface is provided to listen for thread start/stop events. This is useful to create objects that should be one-per-thread, but also have them work correctly if threads are added/removed from the thread pool.

Stats

PoolStats are provided to get task count, running time, waiting time, etc.

folly/FBString.h

fbstring is a drop-in replacement for std::string. The main benefit of fbstring is significantly increased performance on virtually all important primitives. This is achieved by using a three-tiered storage strategy and by cooperating with the memory allocator. In particular, fbstring is designed to detect use of jemalloc and cooperate with it to achieve significant improvements in speed and memory usage.

fbstring supports 32- and 64-bit and little- and big-endian architectures.

Storage strategies


Implementation highlights


folly/FBVector.h

Simply replacing std::vector with folly::fbvector (after having included the folly/FBVector.h header file) will improve the performance of your C++ code using vectors with common coding patterns. The improvements are always non-negative, almost always measurable, frequently significant, sometimes dramatic, and occasionally spectacular.

Sample


    folly::fbvector<int> numbers({0, 1, 2, 3});
    numbers.reserve(10);
    for (int i = 4; i < 10; i++) {
      numbers.push_back(i * 2);
    }
    assert(numbers[6] == 12);

Motivation


std::vector is the stalwart abstraction many use for dynamically-allocated arrays in C++. It is also the best known and most used of all containers. It may therefore seem a surprise that std::vector leaves important - and sometimes vital - efficiency opportunities on the table. This document explains how our own drop-in abstraction fbvector improves key performance aspects of std::vector. Refer to folly/test/FBVectorTest.cpp for a few benchmarks.

Memory Handling


It is well known that std::vector grows exponentially (at a constant factor) in order to avoid quadratic growth performance. The trick is choosing a good factor. Any factor greater than 1 ensures O(1) amortized append complexity towards infinity. But a factor that’s too small (say, 1.1) causes frequent vector reallocation, and one that’s too large (say, 3 or 4) forces the vector to consume much more memory than needed.

The initial HP implementation by Stepanov used a growth factor of 2; i.e., whenever you’d push_back into a vector without there being room, it would double the current capacity. This was not a good choice: it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any of its previously-allocated memory. Despite other compilers reducing the growth factor to 1.5, gcc has staunchly maintained its factor of 2. This makes std::vector cache-unfriendly and memory manager unfriendly.

To see why that’s the case, consider a large vector of capacity C. When there’s a request to grow the vector, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk of memory next to its current chunk, copy its existing data to the new chunk, and then deallocate the old chunk. So now we have a chunk of size C followed by a chunk of size k * C. Continuing this process we’ll then have a chunk of size k * k * C to the right and so on. That leads to a series of the form (using ^^ for power):

C, C*k,  C*k^^2, C*k^^3, ...

If we choose k = 2 we know that every element in the series will be strictly larger than the sum of all previous ones because of the remarkable equality:

1 + 2^^1 + 2^^2 + 2^^3... + 2^^n = 2^^(n+1) - 1

This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can’t move back to its deallocated chunks. But any number smaller than 2 guarantees that you’ll at some point be able to reuse the previous chunks. For instance, choosing 1.5 as the factor allows memory reuse after 4 reallocations; 1.45 allows memory reuse after 3 reallocations; and 1.3 allows reuse after only 2 reallocations.
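
To make the arithmetic concrete, take k = 1.5 and C = 100, and remember that the old chunk is still live while the new one is being allocated:

    chunk sizes:  100, 150, 225, 337.5, 506.25, 759.375, ...

    after 4 reallocations, freed: 100 + 150 + 225 + 337.5 = 812.5
    next request:                 506.25 * 1.5 = 759.375 <= 812.5  (reuse possible)

    with k = 2:   100, 200, 400, 800, ...
    freed so far: 100 + 200 + 400 = 700 < 800  (never fits)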

Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don’t want to choose the theoretically absolute worst growth factor. fbvector uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector cooperates with jemalloc (below).

The jemalloc Connection


Virtually all modern allocators allocate memory in fixed-size quanta that are chosen to minimize management overhead while at the same time offering good coverage at low slack. For example, an allocator may choose blocks of doubling size (32, 64, 128, …) up to 4096, and then blocks of size multiples of a page up until 1MB, and then 512KB increments and so on.

As discussed above, std::vector also needs to (re)allocate in quanta. The next quantum is usually defined in terms of the current size times the infamous growth constant. Because of this setup, std::vector has some slack memory at the end much like an allocated block has some slack memory at the end.

It doesn’t take a rocket surgeon to figure out that an allocator- aware std::vector would be a marriage made in heaven: the vector could directly request blocks of “perfect” size from the allocator so there would be virtually no slack in the allocator. Also, the entire growth strategy could be adjusted to work perfectly with allocator’s own block growth strategy. That’s exactly what fbvector does - it automatically detects the use of jemalloc and adjusts its reallocation strategy accordingly.

But wait, there’s more. Many memory allocators do not support in-place reallocation, although most of them could. This comes from the now notorious design of realloc() to opaquely perform either in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack of control subsequently forced all clib-based allocator designs to avoid in-place reallocation, and that includes C++’s new and std::allocator. This is a major loss of efficiency because an in-place reallocation, being very cheap, may mean a much less aggressive growth strategy. In turn that means less slack memory and faster reallocations.

Object Relocation


One particularly sensitive topic about handling C++ values is that they are all conservatively considered non-relocatable. In contrast, a relocatable value would preserve its invariant even if its bits were moved arbitrarily in memory. For example, an int32_t is relocatable because moving its 4 bytes would preserve its actual value, so the address of that value does not “matter” to its integrity.

C++’s assumption of non-relocatable values hurts everybody for the benefit of a few questionable designs. The issue is that moving a C++ object “by the book” entails (a) creating a new copy from the existing value; (b) destroying the old value. This is quite vexing and violates common sense; consider this hypothetical conversation between Captain Picard and an incredulous alien:

Incredulous Alien: “So, this teleporter, how does it work?”
Picard: “It beams people and arbitrary matter from one place to another.”
Incredulous Alien: “Hmmm… is it safe?”
Picard: “Yes, but earlier models were a hassle. They’d clone the person to another location. Then the teleporting chief would have to shoot the original. Ask O’Brien, he was an intern during those times. A bloody mess, that’s what it was.”

Only a tiny minority of objects are genuinely non-relocatable. The first class is objects that use internal pointers, e.g.:

    class Ew {
      char buffer[1024];
      char * pointerInsideBuffer;
    public:
      Ew() : pointerInsideBuffer(buffer) {}
      ...
    };

The second class is objects that need to update observers that store pointers to them. The first class of designs can always be redone at small or no cost in efficiency. The second class of objects should not be values in the first place: they should be allocated with new and manipulated using (smart) pointers. It is highly unusual for a value to have observers that alias pointers to it.

Relocatable objects are of high interest to std::vector because such knowledge makes insertion into the vector and vector reallocation considerably faster: instead of going to Picard’s copy-destroy cycle, relocatable objects can be moved around simply by using memcpy or memmove. This optimization can yield arbitrarily high wins in efficiency; for example, it transforms vector< vector<double> > or vector< hash_map<int, string> > from risky liabilities into highly workable compositions.

In order to allow fast relocation without risk, fbvector uses a trait folly::IsRelocatable defined in "folly/Traits.h". By default, folly::IsRelocatable::value conservatively yields false. If you know that your type Widget is in fact relocatable, go right after Widget’s definition and write this:

    // at global namespace level
    namespace folly {
      struct IsRelocatable<Widget> : boost::true_type {};
    }

If you don’t do this, fbvector<Widget> will fail to compile with a static_assert.

Miscellaneous


fbvector uses a careful implementation all around to make sure it doesn’t lose efficiency through the cracks. Some future directions may be in improving raw memory copying (memcpy is not an intrinsic in gcc and does not work terribly well for large chunks) and in furthering the collaboration with jemalloc. Have fun!

folly/Format.h

folly/Format.h provides a fast, powerful, type-safe, flexible facility for formatting text, using a specification language similar to Python’s str.format. By default, it can format strings, numbers (integral and floating point), and dynamically-typed folly::dynamic objects, and can extract values from random-access containers and string-keyed maps. In many cases, format is faster than sprintf as well as being fully type-safe.

Overview


Here are some code samples to get started:

using folly::format;
using folly::sformat;

// Objects produced by format() can be streamed without creating
// an intermediary string; {} yields the next argument using default
// formatting.
std::cout << format("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"

// If you just want the string, though, you're covered.
std::string result = sformat("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"

// To insert a literal '{' or '}', just double it.
std::cout << format("{} {{}} {{{}}}", 23, 42);
// => "23 {} {42}"

// Arguments can be referenced out of order, even multiple times
std::cout << format("The answers are {1}, {0}, and {1} again", 23, 42);
// => "The answers are 42, 23, and 42 again"

// It's perfectly fine to not reference all arguments
std::cout << format("The only answer is {1}", 23, 42);
// => "The only answer is 42"

// Values can be extracted from indexable containers
// (random-access sequences and integral-keyed maps), and also from
// string-keyed maps
std::vector<int> v {23, 42};
std::map<std::string, std::string> m { {"what", "answer"} };
std::cout << format("The only {1[what]} is {0[1]}", v, m);
// => "The only answer is 42"

// format works with pairs and tuples
std::tuple<int, std::string, int> t {42, "hello", 23};
std::cout << format("{0} {2} {1}", t);
// => "42 23 hello"

// Format supports width, alignment, arbitrary fill, and various
// format specifiers, with meanings similar to printf
// "X<10": fill with 'X', left-align ('<'), width 10
std::cout << format("{:X<10} {}", "hello", "world");
// => "helloXXXXX world"

// Field width may be a runtime value rather than part of the format string
int x = 6;
std::cout << format("{:-^*}", x, "hi");
// => "--hi--"

// Explicit arguments work with dynamic field width, as long as indexes are
// given for both the value and the field width.
std::cout << format("{2:+^*0}",
9, "unused", 456); // => "+++456+++"

// Format supports printf-style format specifiers
std::cout << format("{0:05d} decimal = {0:04x} hex", 42);
// => "00042 decimal = 002a hex"

// Formatter objects may be written to a string using folly::to or
// folly::toAppend (see folly/Conv.h), or by calling their appendTo(),
// and str() methods
std::string s = format("The only answer is {}", 42).str();
std::cout << s;
// => "The only answer is 42"

// Decimal precision usage
std::cout << format("Only 2 decimals is {:.2f}", 23.34134534535);
// => "Only 2 decimals is 23.34"

Format string syntax


Format string (format): "{" [arg_index] ["[" key "]"] [":" format_spec] "}"

Format specification: [[fill] align] [sign] ["#"] ["0"] [width] [","] ["." precision] [type]

Presentation formats: the type characters have meanings similar to printf’s (the examples above use d, x, and f); see folly/Format.h for the complete list.

Extension


You can extend format for your own class by providing a specialization for folly::FormatValue. See folly/Format.h and folly/FormatArg.h for details, and the existing specialization for folly::dynamic in folly/dynamic-inl.h for an implementation example.
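
For a sense of the shape such a specialization takes, here is a sketch for a hypothetical Widget type. It delegates to the existing FormatValue<int>, so the usual format specs keep working:

#include <folly/Format.h>

struct Widget { int id; }; // hypothetical user type

namespace folly {
template <>
class FormatValue<Widget> {
 public:
  explicit FormatValue(const Widget& w) : w_(w) {}

  template <class FormatCallback>
  void format(FormatArg& arg, FormatCallback& cb) const {
    // Reuse the integer formatter for the widget's id.
    FormatValue<int>(w_.id).format(arg, cb);
  }

 private:
  const Widget& w_;
};
} // namespace folly

// Usage: folly::sformat("{:04d}", Widget{42}) should yield "0042".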

folly/Function.h

folly::Function is a polymorphic function wrapper that is not copyable and does not require the wrapped function to be copy constructible. It is similar to std::function, but differs with respect to some interesting features.

There are some limitations in std::function that folly::Function tries to avoid. std::function is copy-constructible and requires that the callable that it wraps is copy-constructible as well, which is a constraint that is often inconvenient. In most cases when using a std::function you don’t make use of its copy-constructibility, so you might sometimes feel like you get back very little in return for a noticeable restriction.

This restriction becomes apparent when trying to use a lambda capturing a unique_ptr (or any non-copyable type) as a callback for a folly::Future.

std::unique_ptr<Foo> foo_ptr(new Foo);

some_future.then(
    [foo_ptr = std::move(foo_ptr)](int x) mutable {
      foo_ptr->setX(x);
    });

This piece of code did not compile before folly::Future started using folly::Function instead of std::function to store the callback. Because the lambda captures something non-copyable (the unique_ptr), it is not copyable itself. And std::function can only store copyable callables.

The implementation of folly::Future did not make use of the copy-constructibility of std::function at any point. There was no benefit from the fact that the std::function is copy-constructible, but the fact that it can only wrap copy-constructible callables posed a restriction.

A workaround was available: folly::MoveWrapper, which wraps an object that may be non-copyable and implements copy operations by moving the embedded object. Using a folly::MoveWrapper, you can capture non-copyable objects in a lambda, and the lambda itself is still copyable and may be wrapped in a std::function. It is a pragmatic solution for the above problem, but you have to be a little careful. The problem is that you can’t use a MoveWrapper anywhere where copy operations are assumed to behave like actual copy operations. Also, a folly::MoveWrapper<std::unique_ptr<T>> essentially behaves like auto_ptr<T>. Ask yourself whether you’d want to use lots of auto_ptrs in your codebase. And the original question still persists: we very often don’t benefit from copy-constructibility of std::function, so why do we have to live with this restriction? I.e. why do we have to use MoveWrapper?
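
For illustration, a sketch of the MoveWrapper workaround (use is a placeholder):

#include <folly/MoveWrapper.h>
#include <functional>
#include <memory>

void use(int); // placeholder for your own function

void example() {
  auto ptr = std::make_unique<int>(42);
  auto wrapped = folly::makeMoveWrapper(std::move(ptr));
  // The lambda is copyable, so std::function accepts it. But each "copy"
  // actually moves the wrapped unique_ptr, which is the auto_ptr-like pitfall.
  std::function<void()> f = [wrapped] { use(**wrapped); };
  f();
}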

folly::Function is an actual solution to this problem, as it can wrap non-copyable callables, at the cost of not being copy-constructible, which more often than not is not a relevant restriction. folly::Future now uses folly::Function to store callbacks, so the good news is: the code example from the top of this note is becoming a perfectly valid way to use future callbacks. The code compiles and behaves as you would expect.

Here are more details about folly::Function: much like std::function, it wraps an arbitrary object that can be invoked like a given function type. E.g. a folly::Function<int(std::string, double)> can wrap any callable object that returns an int (or something that is convertible to an int) when invoked with a std::string and a double argument. The function type is a template parameter of folly::Function, but the specific type of the callable is not.

Other than copyability, there is one more significant difference between std::function and folly::Function, and it concerns const-correctness. std::function does not enforce const-correctness: it allows you to store mutable callables (i.e. callables that may change their inner state when executed, such as a mutable lambda) and call them in a const context (i.e. when you only have access to a const reference to the std::function object). For example:

class FooBar {
 public:
  void call_func() const {
    func_();
  }
 private:
  std::function<void()> func_;
};

The call_func member function is declared const. However, when it calls func_(), it may change the inner state of func_, and thereby the inner state of the FooBar object. Inside the FooBar class, func_ is like a non-const method that is callable from const methods.

Some people consider std::function in the standard broken with respect to this. (Paper N4348 explains this problem in more detail.) It also lists possible ways to fix the problem. folly::Function, however, goes a different way: you have to declare whether you want to store a const function, in which case you can invoke any reference (const or non-const) of the folly::Function, or a non-const (mutable) function, in which case you need a non-const reference to the folly::Function to be able to invoke it. In the above example, let’s say that func_ stores a const function, which makes it okay that it gets invoked from call_func (a const method). Instead of std::function, you could use folly::Function<void() const> for the func_ member.

Const-ness is part of a function type. To illustrate:

class Foo {
 public:
  int operator()() { return 1; }
  int operator()(char const*) { return 2; }
  int operator()(int) { return 3; }
  int operator()(int) const { return 4; }
  int operator()(int, int) const { return 5; }
};

You can overload methods multiple times using different argument signatures. Const-ness is part of that signature, so even for the same set of argument types you can overload a const and a non-const version. It’s not even particularly unusual to do that. Take for instance the begin() method of STL container types: begin() returns an iterator, begin() const returns a const_iterator. folly::Function allows you to select a specific overload easily:

folly::Function<int()> uf1 = Foo();
// uf1() returns 1
folly::Function<int(char const*)> uf2 = Foo();
// uf2() returns 2
folly::Function<int(int)> uf3 = Foo();
// uf3() returns 3
folly::Function<int(int) const> uf4 = Foo();
// uf4() returns 4
folly::Function<int(int, int) const> uf5 = Foo();
// uf5() returns 5

If cfoo is a const-reference to a Foo object, cfoo(int) returns 4. If foo is a non-const reference to a Foo object, foo(int) returns 3. Normal const-to-non-const conversion behavior applies: if you call foo(int, int) it will return 5: a non-const reference will invoke the const method if no non-const method is defined. Which leads to the following behavior:

folly::Function<int(int, int)> uf5nc = Foo();
// uf5nc() returns 5

If you are wondering if the introduction of const function types means that you have to change lots of normal function types to const function types if you want to use folly::Function: not really, or at least not as much as you might think. There are only two reasons to use a folly::Function with a const function type:

* a callable object defines both const and non-const operator() and you explicitly want to select the const one
* you need to invoke a folly::Function from a const context (i.e. you only have a const reference to the folly::Function)

In practice, you will not need the const variant very often. Adding const to a function type adds a restriction for the callable: it must not change its inner state when invoked. If you don’t care whether it does or not, don’t worry about const!

A folly::Function<R(Args...) const> can be converted into a folly::Function<R(Args...)>: either way the stored callable will not change its inner state when invoked. The former type expresses and guarantees that, the latter does not. When you get rid of the const, the selected function stays the same:

folly::Function<int(int)> uf4nc = std::move(uf4);
// uf4nc() returns 4, not 3!

If you want to go the other way, you are talking about a (potentially dangerous) const cast: a callable that may or may not change its inner state is declared as one that guarantees not to do that. Proceed at your own risk! This conversion does not happen implicitly, though:

folly::Function<int() const> uf1c = folly::constCastFunction(std::move(uf1));
// uf1c() returns 1

Admittedly, seeing const function types as template parameters is unfamiliar. As far as I am aware it happens nowhere in the standard library. But it is the most consistent way to deal with the issue of const-correctness here. Const qualifiers are part of a function type, as a matter of fact. If you require a const-qualified function to be wrapped in a folly::Function, just declare it as that! More often than not you will find that you do not need the const qualifier. When the folly::Function implementation was written, a good set of unit tests existed before the const function types got introduced. Not a single one of those unit tests had to be changed: they all compiled and passed after the introduction of const function types. Obviously new ones were added to test the const-correctness. But in your day-to-day use of folly::Function you won’t have to worry about const very often.

Memory usage

Like most implementations of std::function, folly::Function stores small callable objects in-line and larger callable objects will be stored on the heap. folly::Function returns the size of the allocation it has made from its heapAllocatedMemory() member function; naturally it will return 0 when the callable is stored in-line.
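
For example (illustrative; the exact in-line storage threshold is an implementation detail):

#include <folly/Function.h>
#include <array>
#include <cassert>

void example() {
  folly::Function<int()> small = [] { return 7; };
  assert(small.heapAllocatedMemory() == 0); // fits in the in-line storage

  std::array<char, 1024> big{};
  folly::Function<int()> large = [big] { return big[0]; };
  assert(large.heapAllocatedMemory() > 0); // too large; stored on the heap
}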

Futures

Futures is a framework for expressing asynchronous code in C++ using the Promise/Future pattern.

Overview

Folly Futures is an async C++ framework inspired by Twitter’s Futures implementation in Scala (see also Future.scala, Promise.scala, and friends), and loosely builds upon the existing but anemic Futures code found in the C++11 standard (std::future) and boost::future (especially >= 1.53.0). Although inspired by the C++11 std::future interface, it is not a drop-in replacement because some ideas don’t translate well enough to maintain API compatibility.

The primary difference from std::future is that you can attach callbacks to Futures (with thenValue or thenTry), under the control of an executor to manage where work runs, which enables sequential and parallel composition of Futures for cleaner asynchronous code.

Brief Synopsis

#include <folly/futures/Future.h>
#include <folly/executors/ThreadedExecutor.h>
using namespace folly;
using namespace std;

void foo(int x) {
  // do something with x
  cout << "foo(" << x << ")" << endl;
}

// ...
  folly::ThreadedExecutor executor;
  cout << "making Promise" << endl;
  Promise<int> p;
  Future<int> f = p.getSemiFuture().via(&executor);
  auto f2 = move(f).thenValue(foo);
  cout << "Future chain made" << endl;

// ... now perhaps in another event callback

  cout << "fulfilling Promise" << endl;
  p.setValue(42);
  move(f2).get();
  cout << "Promise fulfilled" << endl;

This would print:

making Promise
Future chain made
fulfilling Promise
foo(42)
Promise fulfilled

Blog Post

In addition to this document, there is a blog post on code.facebook.com (June 2015).

Brief guide

This brief guide covers the basics. For a more in-depth coverage skip to the appropriate section.

Let’s begin with an example using an imaginary simplified Memcache client interface:

using std::string;
class MemcacheClient {
 public:
  struct GetReply {
    enum class Result {
      FOUND,
      NOT_FOUND,
      SERVER_ERROR,
    };

    Result result;
    // The value when result is FOUND,
    // the error message when result is SERVER_ERROR,
    // undefined otherwise
    string value;
  };

  GetReply get(string key);
};

This API is synchronous, i.e. when you call get() you have to wait for the result. This is very simple, but unfortunately it is also very easy to write very slow code using synchronous APIs.

Now, consider this traditional asynchronous signature for the same operation:

int async_get(string key, std::function<void(GetReply)> callback);

When you call async_get(), your asynchronous operation begins and when it finishes your callback will be called with the result. Very performant code can be written with an API like this, but for nontrivial applications the code devolves into a special kind of spaghetti code affectionately referred to as “callback hell”.

The Future-based API looks like this:

SemiFuture<GetReply> future_get(string key);

A SemiFuture<GetReply> or Future<GetReply> is a placeholder for the GetReply that we will eventually get. For most of the descriptive text below, Future can refer to either folly::SemiFuture or folly::Future as the former is a safe subset of the latter. A Future usually starts life out “unfulfilled”, or incomplete, i.e.:

fut.isReady() == false
fut.value()  // will throw an exception because the Future is not ready

At some point in the future, the Future will have been fulfilled, and we can access its value.

fut.isReady() == true
GetReply& reply = fut.value();

Futures support exceptions. If the asynchronous producer fails with an exception, your Future may represent an exception instead of a value. In that case:

fut.isReady() == true
fut.value() // will rethrow the exception

Just what is exceptional depends on the API. In our example we have chosen not to raise exceptions for SERVER_ERROR, but represent this explicitly in the GetReply object. On the other hand, an astute Memcache veteran would notice that we left CLIENT_ERROR out of GetReply::Result, and perhaps a CLIENT_ERROR would have been raised as an exception, because CLIENT_ERROR means there’s a bug in the library and this would be truly exceptional. These decisions are judgement calls by the API designer. The important thing is that exceptional conditions (including and especially spurious exceptions that nobody expects) get captured and can be handled higher up the “stack”.

So far we have described a way to initiate an asynchronous operation via an API that returns a Future, and then sometime later after it is fulfilled, we get its value. This is slightly more useful than a synchronous API, but it’s not yet ideal. There are two more very important pieces to the puzzle.

First, we can aggregate Futures, to define a new Future that completes after some or all of the aggregated Futures complete. Consider two examples: fetching a batch of requests and waiting for all of them, and fetching a group of requests and waiting for only one of them.

MemcacheClient mc;

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto all = collectAll(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto any = collectAny(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto anyv = collectAnyWithoutException(futs.begin(), futs.end());

all and any are Futures (for the exact types and usage see the header files). They will be complete when all of futs or one of futs are complete, respectively. (There is also collectN() for when you need some number of them, and collectAnyWithoutException() for when you need the first result that completed without an exception.)
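
As a sketch of consuming the aggregated result: collectAll completes with a std::vector<folly::Try<GetReply>>, one Try per input future, so each result’s value or exception can be inspected independently. (The executor is the same kind used in the synopsis above.)

folly::ThreadedExecutor executor;
std::move(all).via(&executor).thenValue(
    [](std::vector<folly::Try<MemcacheClient::GetReply>> replies) {
      for (auto& t : replies) {
        if (t.hasValue()) {
          // use t.value()
        } else {
          // t.exception() holds the error
        }
      }
    });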

Second, we can associate a Future with an executor. An executor specifies where work will run, and we detail this more later. In summary, given an executor we can convert a SemiFuture to a Future with an executor, or a Future on one executor to a Future on another executor.

For example:

folly::ThreadedExecutor executor;
SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);

Once an executor is attached, a Future allows continuations to be attached and chained together monadically. An example will clarify:

SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);

Future<string> fut2 = std::move(fut1).thenValue(
  [](GetReply reply) {
    if (reply.result == MemcacheClient::GetReply::Result::FOUND)
      return reply.value;
    throw SomeException("No value");
  });

Future<Unit> fut3 = std::move(fut2)
  .thenValue([](string str) {
    cout << str << endl;
    return str; // pass the string through so the next thenTry sees a Try<string>
  })
  .thenTry([](folly::Try<string> strTry) {
    cout << strTry.value() << endl;
  })
  .thenError(folly::tag_t<std::exception>{}, [](std::exception const& e) {
    cerr << e.what() << endl;
  });

That example is a little contrived but the idea is that you can transform a result from one type to another, potentially in a chain, and unhandled errors propagate. Of course, the intermediate variables are optional.

Using .thenValue or .thenTry to add callbacks is idiomatic. It brings all the code into one place, which avoids callback hell. .thenValue appends a continuation that takes a T&& for some Future<T>; an error bypasses the callback and is passed on to the next one in the chain. .thenTry takes a callback taking a folly::Try<T>, which encapsulates both value and exception. .thenError(folly::tag_t<ExceptionType>{}, ...) bypasses a value and only runs if there is an exception; the ExceptionType template parameter of the tag type allows filtering by exception type. The tag is optional: if no tag is passed, the callback will be parameterized with a folly::exception_wrapper. With C++17 there is a global inline variable, so folly::tag<ExceptionType> may be passed directly without explicit construction.

Up to this point we have skirted around the matter of waiting for Futures. You may never need to wait for a Future, because your code is event-driven and all follow-up action happens in a then-block. But if you want to have a batch workflow, where you initiate a batch of asynchronous operations and then wait for them all to finish at a synchronization point, then you will want to wait for a Future. Futures have a blocking method called wait() that does exactly that and optionally takes a timeout.
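
For example, a sketch of such a synchronization point (the 10-second timeout is arbitrary):

auto batch = collectAll(futs.begin(), futs.end());
batch.wait(std::chrono::seconds(10)); // block until complete (or timed out)
if (batch.isReady()) {
  auto replies = std::move(batch).get(); // extract the results
}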

Futures are partially threadsafe. A Promise or Future can migrate between threads as long as there’s a full memory barrier of some sort. Future::thenValue and Promise::setValue (and all variants that boil down to those two calls) can be called from different threads. But, be warned that you might be surprised about which thread your callback executes on. Let’s consider an example, where we take a future straight from a promise, without going via the safer SemiFuture, and where we therefore have a Future that does not carry an executor. This is in general something to avoid.

// Thread A
Promise<Unit> p;
auto f = p.getFuture();

// Thread B
std::move(f).thenValue(x).thenValue(y).thenTry(z);

// Thread A
p.setValue();

This is legal and technically threadsafe. However, it is important to realize that you do not know in which thread x, y, and/or z will execute. Maybe they will execute in Thread A when p.setValue() is called. Or, maybe they will execute in Thread B when f.thenValue is called. Or, maybe x will execute in Thread A, but y and/or z will execute in Thread B. There’s a race between setValue and then—whichever runs last will execute the callback. The only guarantee is that one of them will run the callback.

For safety, .via should be preferred. We can chain .via operations to give very strong control over where callbacks run:

std::move(aFuture)
  .thenValue(x)
  .via(e1).thenValue(y1).thenValue(y2)
  .via(e2).thenValue(z);

x will execute in the context of the executor associated with aFuture. y1 and y2 will execute in the context of e1, and z will execute in the context of e2. If after z you want to get back to the original context, you need to get there with a call to via passing the original executor.

You make me Promises, Promises

If you are wrapping an asynchronous operation, or providing an asynchronous API to users, then you will want to make Promises. Every Future has a corresponding Promise (except Futures that spring into existence already completed, with makeFuture()). Promises are simple: you make one, you extract the Future, and you fulfill it with a value or an exception. Example:

Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setValue(42);

f.isReady() == true
f.value() == 42

and an exception example:

Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setException(std::runtime_error("Fail"));

f.isReady() == true
f.value() // throws the exception

It’s good practice to use setWith which takes a function and automatically captures exceptions, e.g.

Promise<int> p;
p.setWith([]{
  try {
    // do stuff that may throw
    return 42;
  } catch (MySpecialException const& e) {
    // handle it
    return 7;
  }
  // Any exceptions that we didn't catch, will be caught for us
});

folly/GroupVarint.h

folly/GroupVarint.h is an implementation of variable-length encoding for 32- and 64-bit integers using the Group Varint encoding scheme as described in Jeff Dean’s WSDM 2009 talk and in Information Retrieval: Implementing and Evaluating Search Engines.

Briefly, a group of four 32-bit integers is encoded as a sequence of variable length, between 5 and 17 bytes; the first byte encodes the length (in bytes) of each integer in the group. A group of five 64-bit integers is encoded as a sequence of variable length, between 7 and 42 bytes; the first two bytes encode the length (in bytes) of each integer in the group.
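
As an illustrative sketch of that arithmetic (encodeGroup32 is a hypothetical helper, not folly’s API, and the exact key-byte bit order folly uses is an implementation detail):

    #include <cstdint>
    #include <vector>

    // Encode one group of four values: 1 key byte + 4..16 payload bytes.
    std::vector<uint8_t> encodeGroup32(const uint32_t (&vals)[4]) {
      auto byteLen = [](uint32_t v) -> uint8_t {
        return v < (1u << 8) ? 1 : v < (1u << 16) ? 2 : v < (1u << 24) ? 3 : 4;
      };
      std::vector<uint8_t> out(1); // placeholder for the key byte
      uint8_t key = 0;
      for (int i = 0; i < 4; ++i) {
        uint8_t n = byteLen(vals[i]);
        key |= static_cast<uint8_t>(n - 1) << (2 * i); // 2 bits per length
        for (uint8_t j = 0; j < n; ++j) {
          out.push_back(static_cast<uint8_t>(vals[i] >> (8 * j))); // little-endian
        }
      }
      out[0] = key;
      return out; // 5..17 bytes total, matching the ranges above
    }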

GroupVarint.h defines classes for the 32-bit and 64-bit variants of the format, along with encoder and decoder helpers; see the header for the full list.

The 32-bit implementation is significantly faster than the 64-bit implementation; on platforms supporting the SSSE3 instruction set, we use the PSHUFB instruction to speed up lookup, as described in SIMD-Based Decoding of Posting Lists (CIKM 2011).

For more details, see the header file folly/GroupVarint.h and the associated test file folly/test/GroupVarintTest.cpp.

folly/stats/Histogram.h

Classes


Histogram

Histogram.h defines a simple histogram class, templated on the type of data you want to store. This class is useful for tracking a large stream of data points, where you want to remember the overall distribution of the data, but do not need to remember each data point individually.

Each histogram bucket stores the number of data points that fell in the bucket, as well as the overall sum of the data points in the bucket. Note that no overflow checking is performed, so if you have a bucket with a large number of very large values, it may overflow and cause inaccurate data for this bucket. As such, the histogram class is not well suited to storing data points with very large values. However, it works very well for smaller data points such as request latencies, request or response sizes, etc.

In addition to providing access to the raw bucket data, the Histogram class also provides methods for estimating percentile values. This allows you to estimate the median value (the 50th percentile) and other values such as the 95th or 99th percentiles.

All of the buckets have the same width. The number of buckets and bucket width is fixed for the lifetime of the histogram. As such, you do need to know your expected data range ahead of time in order to have accurate statistics. The histogram does keep one bucket to store all data points that fall below the histogram minimum, and one bucket for the data points above the maximum. However, because these buckets don’t have a good lower/upper bound, percentile estimates in these buckets may be inaccurate.

HistogramBuckets

The Histogram class is built on top of HistogramBuckets. HistogramBuckets provides an API very similar to Histogram, but allows a user-defined bucket class. This allows users to implement more complex histogram types that store more than just the count and sum in each bucket.

When computing percentile estimates HistogramBuckets allows user-defined functions for computing the average value and data count in each bucket. This allows you to define more complex buckets which may have multiple different ways of computing the average value and the count.

For example, one use case could be tracking timeseries data in each bucket. Each set of timeseries data can have independent data in the bucket, which can show how the data distribution is changing over time.

Example Usage


Say we have code that sends many requests to remote services, and we want to generate a histogram showing how long the requests take. The following code will initialize a histogram with 50 buckets, tracking values between 0 and 5000. (There are 50 buckets because the range is 5000 and the bucket width is specified as 100. If the range is not an even multiple of the bucket width, the last bucket will simply be narrower than the others.)

    folly::Histogram<int64_t> latencies(100, 0, 5000);

The addValue() method is used to add values to the histogram. Each time a request finishes we can add its latency to the histogram:

    latencies.addValue(now - startTime);

You can access each of the histogram buckets to display the overall distribution. Note that bucket 0 tracks all data points that were below the specified histogram minimum, and the last bucket tracks the data points that were above the maximum.

    auto numBuckets = latencies.getNumBuckets();
    cout << "Below min: " << latencies.getBucketByIndex(0).count << "\n";
    for (unsigned int n = 1; n < numBuckets - 1; ++n) {
      cout << latencies.getBucketMin(n) << "-" << latencies.getBucketMax(n)
           << ": " << latencies.getBucketByIndex(n).count << "\n";
    }
    cout << "Above max: "
         << latencies.getBucketByIndex(numBuckets - 1).count << "\n";

You can also use the getPercentileEstimate() method to estimate the value at the Nth percentile in the distribution. For example, to estimate the median, as well as the 95th and 99th percentile values:

    int64_t median = latencies.getPercentileEstimate(0.5);
    int64_t p95 = latencies.getPercentileEstimate(0.95);
    int64_t p99 = latencies.getPercentileEstimate(0.99);

Thread Safety


Note that Histogram and HistogramBuckets objects are not thread-safe. If you wish to access a single Histogram from multiple threads, you must perform your own locking to ensure that multiple threads do not access it at the same time.
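
For example, a minimal sketch of external locking (the mutex and recordLatency helper are hypothetical; any mutex type would do, including the Synchronized wrapper described later in this document):

    #include <folly/stats/Histogram.h>
    #include <mutex>

    std::mutex histogramMutex;
    folly::Histogram<int64_t> sharedLatencies(100, 0, 5000); // guarded by histogramMutex

    void recordLatency(int64_t value) {
      std::lock_guard<std::mutex> guard(histogramMutex);
      sharedLatencies.addValue(value);
    }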

folly/

Components

Below is a list of (some) Folly components in alphabetical order, along with a brief description of each.

Arena.h, ThreadCachedArena.h

Simple arena for memory allocation: multiple allocations get freed all at once. With threaded version.

AtomicHashMap.h, AtomicHashArray.h, AtomicLinkedList.h, …

High-performance atomic data-structures. Many of these are built with very specific tradeoffs and constraints in mind that make them faster than their more general counterparts. Each header should contain information about what these tradeoffs are.

Baton.h

A Baton allows a thread to block once and be awoken: it captures a single handoff. It is essentially a (very small, very fast) semaphore that supports only a single call to sem_post and sem_wait.
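
A minimal sketch of the single handoff (assuming the folly/synchronization/Baton.h header location):

    #include <folly/synchronization/Baton.h>
    #include <thread>

    folly::Baton<> baton;
    std::thread waiter([&] {
      baton.wait(); // blocks until post() is called
    });
    baton.post(); // the single handoff; wakes the waiter
    waiter.join();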

Benchmark.h

A small framework for benchmarking code. Client code registers benchmarks, optionally with an argument that dictates the scale of the benchmark (iterations, working set size etc). The framework runs benchmarks (subject to a command-line flag) and produces formatted output with timing information.

Bits.h

Various bit manipulation utilities optimized for speed; includes functions that wrap the ffsl(l) primitives in a uniform interface.

ConcurrentSkipList.h

An implementation of the structure described in A Provably Correct Scalable Concurrent Skip List by Herlihy et al.

Conv.h

A variety of data conversion routines (notably to and from string), optimized for speed and safety.

Demangle.h

Pretty-printing C++ types.

DiscriminatedPtr.h

Similar to std::variant, but restricted to pointers only. Uses the highest-order unused 16 bits in a pointer as discriminator. So sizeof(DiscriminatedPtr<int, string, Widget>) == sizeof(void*).

dynamic.h

Dynamically-typed object, created with JSON objects in mind. DynamicConverter.h is a utility for efficiently converting from a dynamic to a more concrete structure when the schema is known (e.g. json -> map<int,int>).

EvictingCacheMap.h

A simple LRU hash map.

FBString.h

A drop-in implementation of std::string with a variety of optimizations.

FBVector.h

A mostly drop-in implementation of std::vector with a variety of optimizations.

File.h

A C++ abstraction around files.

Fingerprint.h

Rabin fingerprinting.

Function.h

A polymorphic wrapper for callables similar to std::function but not copyable and therefore able to wrap non-copyable callables, such as lambdas that capture move-only types like std::unique_ptr or folly::Promise.

futures/

Futures is a framework for expressing asynchronous code in C++ using the Promise/Future pattern.

Format.h

Python-style formatting utilities.

gen/

This library makes it possible to write declarative comprehensions for processing sequences of values efficiently in C++ akin to C#’s LINQ.

GroupVarint.h

Group Varint encoding for 32-bit values.

IPAddress.h

A collection of utilities to deal with IP addresses, including IPv4 and IPv6.

io/

A collection of useful abstractions for high-performance I/O. This is heavily relied upon in Facebook’s internal networking code.

Hash.h

Various popular hash function implementations.

Histogram.h

A simple class for collecting histogram data.

IntrusiveList.h

Convenience type definitions for using boost::intrusive::list.

json.h

JSON serializer and deserializer. Uses dynamic.h.

Likely.h

Wrappers around __builtin_expect.

Malloc.h, Memory.h

Memory allocation helpers, particularly when using jemalloc.

MicroSpinLock.h

A really, really small spinlock for fine-grained locking of lots of teeny-tiny data.

MPMCQueue.h

MPMCQueue is a high-performance bounded concurrent queue that supports multiple producers, multiple consumers, and optional blocking. The queue has a fixed capacity, for which all memory will be allocated up front.

The additional utility MPMCPipeline.h is an extension that lets you chain several queues together with processing steps in between.
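
A brief sketch of typical use (the produce and consume helpers are hypothetical):

    #include <folly/MPMCQueue.h>

    folly::MPMCQueue<int> queue(64); // capacity is fixed; memory allocated up front

    // Called from any number of producer threads:
    void produce(int item) {
      if (!queue.write(item)) {    // non-blocking attempt; false if full
        queue.blockingWrite(item); // or block until space is available
      }
    }

    // Called from any number of consumer threads:
    int consume() {
      int item;
      queue.blockingRead(item); // blocks until an item arrives
      return item;
    }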

PackedSyncPtr.h

A highly specialized data structure consisting of a pointer, a 1-bit spin lock, and a 15-bit integral, all inside one 64-bit word.

Poly.h

A class template that makes it relatively easy to define a type-erasing polymorphic object wrapper.

Preprocessor.h

Necessarily evil stuff.

ProducerConsumerQueue.h

Lock free single-reader, single-writer queue.

Random.h

Defines only one function—randomNumberSeed().

Range.h

Boost-style range facility and the StringPiece specialization.

RWSpinLock.h

Fast and compact reader-writer spin lock.

ScopeGuard.h

C++11 incarnation of the old ScopeGuard idiom.

Singleton.h

A singleton to rule the singletons. This is an attempt to insert a layer between C++ statics and the fiasco that ensues, so that things can be created, and destroyed, correctly upon program creation, program end and sometimes dlopen and fork.

Singletons are bad for you, but this may help.

SmallLocks.h

Very small spin locks (1 byte and 1 bit).

small_vector.h

Vector with the small buffer optimization and an optional embedded PicoSpinLock.

sorted_vector_types.h

Collections similar to std::map but implemented as sorted vectors.

stats/

A collection of efficient utilities for collecting statistics:

* time series counters, gauges, histograms, and quantiles;
* single-pass mean and variance.

StlAllocator.h

STL allocator wrapping a simple allocate/deallocate interface.

String.h

String utilities that connect folly::fbstring with std::string.

Subprocess.h

Subprocess library, modeled after Python’s subprocess module.

Synchronized.h

High-level synchronization library.

System.h

Demangling and errno utilities.

ThreadCachedInt.h

High-performance atomic increment using thread caching.

ThreadLocal.h

Improved thread local storage for non-trivial types.

TimeoutQueue.h

Queue with per-item timeout.

Traits.h

Type traits that complement those defined in the standard C++11 header <type_traits>.

Unicode.h

Defines the codePointToUtf8 function.

Uri.h

A collection of utilities to deal with URIs.

folly/PackedSyncPtr.h

A highly specialized data structure consisting of a pointer, a 1-bit spin lock, and a 15-bit integral packed into sizeof(void*).

Typical application is for microsharding of many elements within containers. Because there is no memory overhead, an arbitrarily large number of locks can be used to minimize lock contention with no memory penalty. Additionally, excellent cache performance is obtained by storing the lock inline with the pointer (no additional cache miss or false sharing). Finally, because it uses a simple spinlock mechanism, the cost of acquiring an uncontended lock is minimal.

Usage


This is not a “smart” pointer: nothing automagical is going on here. Locking is up to the user. Resource deallocation is up to the user. Locks are never acquired or released outside explicit calls to lock() and unlock().

Change the value of the raw pointer with set(), but you must hold the lock when calling this function if multiple threads could be using it.

Here is an example of using a PackedSyncPtr to build a synchronized vector with no memory overhead - the spinlock and size are stored in the 16 unused bits of the pointer, the rest of which points to the actual data. See folly/small_vector.h for a complete implementation of this concept.

    template<typename T>
    class SyncVec {
      PackedSyncPtr<T> base;

     public:
      SyncVec() { base.init(); }

      void push_back(const T& t) {
        base.set(
          static_cast<T*>(realloc(base.get(), (base.extra() + 1) * sizeof(T))));
        base[base.extra()] = t;
        base.setExtra(base.extra() + 1);
      }

      size_t size() const {
        return base.extra();
      }

      void lock() {
        base.lock();
      }

      void unlock() {
        base.unlock();
      }

      T* begin() const {
        return base.get();
      }

      T* end() const {
        return base.get() + base.extra();
      }
    };

Implementation


This is using an x64-specific detail about the effective virtual address space. Long story short: the upper two bytes of all our pointers will be zero in reality—and if you have a couple billion such pointers in core, it makes pretty good sense to try to make use of that memory. The exact details can be perused here:

http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses

folly/Poly.h

Poly is a class template that makes it relatively easy to define a type-erasing polymorphic object wrapper.

Type-erasure


std::function is one example of a type-erasing polymorphic object wrapper; folly::exception_wrapper is another. Type-erasure is often used as an alternative to dynamic polymorphism via inheritance-based virtual dispatch. The distinguishing characteristics of type-erasing wrappers are:

Examples: Defining a type-erasing function wrapper with folly::Poly


Defining a polymorphic wrapper with Poly is a matter of defining two things:

* an interface: a nested class template named Interface, parameterized on a Base class from which it inherits, whose public member functions become the public interface of the wrapper; and
* a binding: a nested alias template named Members that maps the members of a concrete type T onto that interface.

Below is a simple example program that defines a drawable wrapper for any type that provides a draw member function. (The details will be explained later.)

    // This example is an adaptation of one found in Louis Dionne's dyno library.
    #include <folly/Poly.h>
    #include <iostream>

    struct IDrawable {
      // Define the interface of something that can be drawn:
      template <class Base> struct Interface : Base {
        void draw(std::ostream& out) const { folly::poly_call<0>(*this, out);}
      };
      // Define how concrete types can fulfill that interface (in C++17):
      template <class T> using Members = folly::PolyMembers<&T::draw>;
    };

    // Define an object that can hold anything that can be drawn:
    using drawable = folly::Poly<IDrawable>;

    struct Square {
      void draw(std::ostream& out) const { out << "Square\n"; }
    };

    struct Circle {
      void draw(std::ostream& out) const { out << "Circle\n"; }
    };

    void f(drawable const& d) {
      d.draw(std::cout);
    }

    int main() {
      f(Square{}); // prints Square
      f(Circle{}); // prints Circle
    }

The above program prints:

    Square
    Circle

Here is another (heavily commented) example of a simple implementation of a std::function-like polymorphic wrapper. Its interface has only a single member function: operator().

    // An interface for a callable object of a particular signature, Fun
    // (most interfaces don't need to be templates, FWIW).
    template <class Fun>
    struct IFunction;

    template <class R, class... As>
    struct IFunction<R(As...)> {
      // An interface is defined as a nested class template called
      // Interface that takes a single template parameter, Base, from
      // which it inherits.
      template <class Base>
      struct Interface : Base {
        // The Interface has public member functions. These become the
        // public interface of the resulting Poly instantiation.
        // (Implementation note: Poly<IFunction<Sig>> will publicly
        // inherit from this struct, which is what gives it the right
        // member functions.)
        R operator()(As... as) const {
          // The definition of each member function in your interface will
          // always consist of a single line dispatching to folly::poly_call<N>.
          // The "N" corresponds to the N-th member function in the
          // list of member function bindings, Members, defined below.
          // The first argument will always be *this, and the rest of the
          // arguments should simply forward (if necessary) the member
          // function's arguments.
          return static_cast<R>(
              folly::poly_call<0>(*this, std::forward<As>(as)...));
        }
      };
      // The "Members" alias template is a comma-separated list of bound
      // member functions for a given concrete type "T". The
      // "FOLLY_POLY_MEMBERS" macro accepts a comma-separated list, and the
      // (optional) "FOLLY_POLY_MEMBER" macro lets you disambiguate overloads
      // by explicitly specifying the function signature the target member
      // function should have. In this case, we require "T" to have a
      // function call operator with the signature `R(As...) const`.
      //
      // If you are using a C++17-compatible compiler, you can do away with
      // the macros and write this as:
      //
      //   template <class T>
      //   using Members =
      //       folly::PolyMembers<folly::sig<R(As...) const>(&T::operator())>;
      //
      // And since `folly::sig` is only needed for disambiguation in case of
      // overloads, if you are not concerned about objects with overloaded
      // function call operators, it could be further simplified to:
      //
      //   template <class T>
      //   using Members = folly::PolyMembers<&T::operator()>;
      //
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(
          FOLLY_POLY_MEMBER(R(As...) const, &T::operator()));
    };

    // Now that we have defined the interface, we can pass it to Poly to
    // create our type-erasing wrapper:
    template <class Fun>
    using Function = Poly<IFunction<Fun>>;

Given the above definition of Function, users can now initialize instances of (say) Function<int(int, int)> with function objects like std::plus<int> and std::multiplies<int>, as below:

    Function<int(int, int)> fun = std::plus<int>{};
    assert(5 == fun(2, 3));
    fun = std::multiplies<int>{};
    assert(6 == fun(2, 3));

Defining an interface with C++17


With C++17, defining an interface to be used with Poly is fairly straightforward. As in the Function example above, there is a struct with a nested Interface class template and a nested Members alias template. No macros are needed with C++17.

Imagine we were defining something like a Java-style iterator. If we are using a C++17 compiler, our interface would look something like this:

    template <class Value>
    struct IJavaIterator {
      template <class Base>
      struct Interface : Base {
        bool Done() const { return folly::poly_call<0>(*this); }
        Value Current() const { return folly::poly_call<1>(*this); }
        void Next() { folly::poly_call<2>(*this); }
      };
      // NOTE: This works in C++17 only:
      template <class T>
      using Members = folly::PolyMembers<&T::Done, &T::Current, &T::Next>;
    };

    template <class Value>
    using JavaIterator = Poly<IJavaIterator<Value>>;

Given the above definition, JavaIterator<int> can be used to hold instances of any type that has Done, Current, and Next member functions with the correct (or compatible) signatures.
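
For instance, a hypothetical Counter type with compatible members can be held and driven through the wrapper:

    struct Counter {
      int i = 0;
      bool Done() const { return i == 3; }
      int Current() const { return i; }
      void Next() { ++i; }
    };

    JavaIterator<int> it = Counter{};
    while (!it.Done()) {
      std::cout << it.Current() << '\n'; // prints 0, 1, 2
      it.Next();
    }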

The presence of overloaded member functions complicates this picture. Often, property members are faked in C++ with const and non-const member function overloads, like in the interface specified below:

    struct IIntProperty {
      template <class Base>
      struct Interface : Base {
        int Value() const { return folly::poly_call<0>(*this); }
        void Value(int i) { folly::poly_call<1>(*this, i); }
      };
      // NOTE: This works in C++17 only:
      template <class T>
      using Members = folly::PolyMembers<
        folly::sig<int() const>(&T::Value),
        folly::sig<void(int)>(&T::Value)>;
    };

    using IntProperty = Poly<IIntProperty>;

Now, any object that has Value members of compatible signatures can be assigned to instances of IntProperty. Note how folly::sig is used to disambiguate the overloads of &T::Value.

Defining an interface with C++14


In C++14, the nice syntax above doesn’t work, so we have to resort to macros. The two examples above would look like this:

    template <class Value>
    struct IJavaIterator {
      template <class Base>
      struct Interface : Base {
        bool Done() const { return folly::poly_call<0>(*this); }
        Value Current() const { return folly::poly_call<1>(*this); }
        void Next() { folly::poly_call<2>(*this); }
      };
      // NOTE: This works in C++14 and C++17:
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(&T::Done, &T::Current, &T::Next);
    };

    template <class Value>
    using JavaIterator = Poly<IJavaIterator<Value>>;

and

    struct IIntProperty {
      template <class Base>
      struct Interface : Base {
        int Value() const { return folly::poly_call<0>(*this); }
        void Value(int i) { return folly::poly_call<1>(*this, i); }
      };
      // NOTE: This works in C++14 and C++17:
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(
        FOLLY_POLY_MEMBER(int() const, &T::Value),
        FOLLY_POLY_MEMBER(void(int), &T::Value));
    };

    using IntProperty = Poly<IIntProperty>;

Extending interfaces


One typical advantage of inheritance-based solutions to runtime polymorphism is that one polymorphic interface could extend another through inheritance. The same can be accomplished with type-erasing polymorphic wrappers. In the Poly library, you can use folly::PolyExtends to say that one interface extends another.

    struct IFoo {
      template <class Base>
      struct Interface : Base {
        void Foo() const { return folly::poly_call<0>(*this); }
      };
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(&T::Foo);
    };

    // The IFooBar interface extends the IFoo interface
    struct IFooBar : PolyExtends<IFoo> {
      template <class Base>
      struct Interface : Base {
        void Bar() const { return folly::poly_call<0>(*this); }
      };
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(&T::Bar);
    };

    using FooBar = Poly<IFooBar>;

Given the above definition, instances of type FooBar have both Foo() and Bar() member functions.
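
For example, given a hypothetical SomeFooBar type that provides both members:

    struct SomeFooBar {
      void Foo() const { std::printf("SomeFooBar::Foo\n"); }
      void Bar() const { std::printf("SomeFooBar::Bar\n"); }
    };

    FooBar obj = SomeFooBar{};
    obj.Foo(); // dispatches through the binding inherited from IFoo
    obj.Bar(); // dispatches through IFooBar's own binding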

The sensible conversions exist between a wrapped derived type and a wrapped base type. For instance, assuming IDerived extends IBase with PolyExtends:

    Poly<IDerived> derived = ...;
    Poly<IBase> base = derived; // This conversion is OK.

As you would expect, there is no conversion in the other direction, and at present there is no Poly equivalent to dynamic_cast.

Type-erasing polymorphic reference wrappers


Sometimes you don’t need to own a copy of an object; a reference will do. For that you can use Poly to capture a reference to an object satisfying an interface rather than the whole object itself. The syntax is intuitive.

    int i = 42;

    // Capture a mutable reference to an object of any IRegular type:
    Poly<IRegular &> intRef = i;

    assert(42 == folly::poly_cast<int>(intRef));
    // Assert that we captured the address of "i":
    assert(&i == &folly::poly_cast<int>(intRef));

A reference-like Poly has a different interface than a value-like Poly. Rather than calling member functions with the obj.fun() syntax, you would use the obj->fun() syntax. This is for the sake of const-correctness. For example, consider the code below:

    struct IFoo {
      template <class Base>
      struct Interface : Base {
        void Foo() { folly::poly_call<0>(*this); }
      };
      template <class T>
      using Members = folly::PolyMembers<&T::Foo>;
    };

    struct SomeFoo {
      void Foo() { std::printf("SomeFoo::Foo\n"); }
    };

    SomeFoo foo;
    Poly<IFoo &> const anyFoo = foo;
    anyFoo->Foo(); // prints "SomeFoo::Foo"

Notice in the above code that the Foo member function is non-const. Notice also that the anyFoo object is const. However, since it has captured a non-const reference to the foo object, it should still be possible to dispatch to the non-const Foo member function. When instantiated with a reference type, Poly has an overloaded operator-> member that returns a pointer to the IFoo interface with the correct const-ness, which makes this work.

The same mechanism also prevents users from calling non-const member functions on Poly objects that have captured const references, which would violate const-correctness.

Sensible conversions exist between non-reference and reference Polys. For instance:

    Poly<IRegular> value = 42;
    Poly<IRegular &> mutable_ref = value;
    Poly<IRegular const &> const_ref = mutable_ref;

    assert(&poly_cast<int>(value) == &poly_cast<int>(mutable_ref));
    assert(&poly_cast<int>(value) == &poly_cast<int>(const_ref));

Non-member functions (C++17)


If you wanted to write the interface ILogicallyNegatable, which captures all types that can be negated with unary operator!, you could do it as we’ve shown above, by binding &T::operator! in the nested Members alias template. But that has a problem: it won’t work for types that define unary operator! as a free function. To handle this case, the Poly library lets you use a free function instead of a member function when creating a binding.

With C++17 you may use a lambda to create a binding, as shown in the example below:

    struct ILogicallyNegatable {
      template <class Base>
      struct Interface : Base {
        bool operator!() const { return folly::poly_call<0>(*this); }
      };
      template <class T>
      using Members = folly::PolyMembers<
        +[](T const& t) -> decltype(bool(!t)) { return bool(!t); }>;
    };

This requires some explanation. The unary operator+ in front of the lambda is necessary! It causes the lambda to decay to a C-style function pointer, which is one of the types that folly::PolyMembers accepts. The decltype in the lambda return type is also necessary. Through the magic of SFINAE, it will cause Poly<ILogicallyNegatable> to reject any types that don’t support unary operator!.

If you are using a free function to create a binding, the first parameter is implicitly the this parameter. It will receive the type-erased object.
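
For example (int defines no member operator!, but !i is a valid expression, so the lambda binding accepts it):

    Poly<ILogicallyNegatable> p = 0;
    assert(!p);    // dispatches through the lambda binding; !0 is true
    p = 42;
    assert(!(!p)); // !42 is false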

Non-member functions (C++14)


If you are using a C++14 compiler, the definition of ILogicallyNegatable above will fail because lambdas are not constexpr. We can get the same effect by writing the lambda as a named free function, as shown below:

    struct ILogicallyNegatable {
      template <class Base>
      struct Interface : Base {
        bool operator!() const { return folly::poly_call<0>(*this); }
      };
      template <class T>
      static auto negate(T const& t)
        -> decltype(bool(!t)) { return bool(!t); }
      template <class T>
      using Members = FOLLY_POLY_MEMBERS(&negate<T>);
    };

As with the example that uses the lambda in the preceding section, the first parameter is implicitly the this parameter. It will receive the type-erased object.

Multi-dispatch


What if you want to create an IAddable interface for things that can be added? Adding requires two objects, both of which are type-erased. This interface requires dispatching on both objects, doing the addition only if the types are the same. For this we make use of the PolySelf template alias to define an interface that takes more than one object of the erased type.

    struct IAddable {
      template <class Base>
      struct Interface : Base {
        friend PolySelf<Base>
        operator+(PolySelf<Base> const& a, PolySelf<Base> const& b) {
          return folly::poly_call<0>(a, b);
        }
      };
      template <class T>
      using Members = folly::PolyMembers<
        +[](T const& a, T const& b) -> decltype(a + b) { return a + b; }>;
    };

Given the above definition of IAddable we would be able to do the following:

    Poly<IAddable> a = 2, b = 3;
    Poly<IAddable> c = a + b;
    assert(poly_cast<int>(c) == 5);

If a and b stored objects of different types, a BadPolyCast exception would be thrown.

Move-only types


If you want to store move-only types, then your interface should extend the poly::IMoveOnly interface.

Implementation notes


Poly will store “small” objects in an internal buffer, avoiding the cost of dynamic allocations. At present, this size is not configurable; it is pegged at the size of two doubles.

Poly objects are always nothrow movable. If you store an object in one that has a potentially throwing move constructor, the object will be stored on the heap, even if it could fit in the internal storage of the Poly object. (So be sure to give your objects nothrow move constructors!)

Poly implements type-erasure in a manner very similar to how the compiler accomplishes virtual dispatch. Every Poly object contains a pointer to a table of function pointers. Member function calls involve a double indirection: once through the v-pointer, and another through the function pointer it selects.

folly/ProducerConsumerQueue.h

The folly::ProducerConsumerQueue class is a one-producer one-consumer queue with very low synchronization overhead.

The queue must be created with a fixed maximum size (and allocates that many cells of sizeof(T)), and it provides just a few simple operations:

* write: enqueue an element if the queue is not full;
* read: dequeue an element if the queue is not empty;
* frontPtr and popFront: access the front element in place and then remove it;
* isFull, isEmpty, and sizeGuess: query the state of the queue.

All of these operations are wait-free. The read operations (including frontPtr and popFront) and write operations must only be called by the reader and writer thread, respectively. isFull, isEmpty, and sizeGuess may be called by either thread, but the return values from read, write, or frontPtr are sufficient for most cases.

write may fail if the queue is full, and read may fail if the queue is empty, so in many situations it is important to choose the queue size such that the queue filling or staying empty for long is unlikely.

Example


A toy example that doesn’t really do anything useful:

    folly::ProducerConsumerQueue<folly::fbstring> queue{size};

    std::thread reader([&queue] {
      for (;;) {
        folly::fbstring str;
        while (!queue.read(str)) {
          //spin until we get a value
          continue;
        }

        sink(str);
      }
    });

    // producer thread:
    for (;;) {
      folly::fbstring str = source();
      while (!queue.write(str)) {
        //spin until the queue has room
        continue;
      }
    }

Alternatively, the consumer may be written as follows to use the ‘front’ value in place, thus avoiding moves or copies:

    std::thread reader([&queue] {
      for (;;) {
        folly::fbstring* pval;
        do {
          pval = queue.frontPtr();
        } while (!pval); // spin until we get a value;

        sink(*pval);
        queue.popFront();
      }
    });

folly/synchronization/Rcu.h

C++ read-copy-update (RCU) functionality in folly.

Overview


Read-copy-update (RCU) is a low-overhead synchronization mechanism that provides guaranteed ordering between operations on shared data. In the simplest usage pattern, readers enter a critical section, view some state, and leave the critical section, while writers modify shared state and then defer some cleanup operations.

Proper use of the APIs provided by folly RCU will guarantee that a cleanup operation that is deferred during a reader critical section will not be executed until after that critical section is over.

Usage


folly::rcu_domain

The main synchronization primitive in folly RCU is a folly::rcu_domain. A folly::rcu_domain is a “universe” of deferred execution. Each domain has an executor on which deferred functions may execute. folly::rcu_domain satisfies the BasicLockable requirements, so readers may enter a read region in a folly::rcu_domain by treating the domain as a mutex type that can be wrapped by C++ locking primitives.

For example, to enter and exit a read region using non-movable RAII semantics, you could use an std::scoped_lock:

{
  std::scoped_lock<folly::rcu_domain> lock(folly::rcu_default_domain());
  protectedData.read();
}

Alternatively, if you need to be able to move the lock, you could use std::unique_lock:

class ProtectedData {
  private:
    std::unique_lock<folly::rcu_domain> lock_;
    void* data_;
};

Default vs. custom domains

There is a global, default domain that can be accessed using folly::rcu_default_domain() as in the example above. If required, you can also create your own domain:

{
  Executor* my_executor = getExecutor();
  folly::rcu_domain domain(my_executor /* or nullptr to use default */);
}

In general, using the default domain is strongly encouraged, as you will likely get better cache locality by sharing a domain that is used by other callers in the process. If, however, you can’t avoid blocking during reader critical sections, your own custom domain should be used to avoid delaying reclamation from other updaters. A custom domain can also be used if you want update callbacks to be invoked on a specific executor.

Updates and retires

A typical reader / updater synchronization will look something like this:

void doSomethingWith(IPAddress host);

static std::atomic<ConfigData*> globalConfigData;

void reader() {
  while (true) {
    IPAddress curManagementServer;
    {
      // We're about to do some reads we want to protect; if we read a
      // pointer, we need to make sure that if the writer comes along and
      // updates it, the writer's cleanup operation won't happen until we're
      // done accessing the pointed-to data. We get a Guard on that
      // domain; as long as it exists, no function subsequently passed to
      // invokeEventually will execute.
      std::scoped_lock<folly::rcu_domain> guard(folly::rcu_default_domain());
      ConfigData* configData = globalConfigData.load(std::memory_order_consume);
      // We created a guard before we read globalConfigData; we know that the
      // pointer will remain valid until the guard is destroyed.
      curManagementServer = configData->managementServerIP;
      // RCU domain via the scoped mutex is released here; retired objects
      // may be freed.
    }
    doSomethingWith(curManagementServer);
  }
}

void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData);
    // Alternatively, in a blocking manner:
    //   folly::rcu_synchronize();
    //   delete oldConfigData;
  }
}

In the example above, a single writer updates ConfigData* in a loop, and then defers freeing it until all RCU readers have exited their read regions. The writer may use either of the following two APIs to safely defer freeing the old ConfigData*:

* folly::rcu_retire(), which schedules the old object for deletion once all current readers have exited, without blocking the writer; or
* folly::rcu_synchronize(), which blocks until all current readers have exited, after which the writer may delete the object itself.

If you expect there to be very long read regions, it may be required to use rcu_synchronize() or a periodic rcu_barrier() (described below) to avoid running out of memory due to delayed reclamation.

folly::rcu_barrier()

Another synchronization primitive provided by the folly RCU library is rcu_barrier(). Unlike rcu_synchronize(), which blocks until all outstanding readers have exited their read regions, rcu_barrier() blocks until all outstanding deleters (specified in a call to rcu_retire()) are completed.

As mentioned above, one example of where this may be useful is avoiding out-of-memory errors due to scheduling too many objects whose reclamation is delayed. Taking our example from above, we could avoid OOMs using a periodic invocation to rcu_barrier() as follows:

static std::atomic<ConfigData*> globalConfigData;

void writer() {
  uint32_t retires = 0;
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    if (retires++ % 1000 == 0) {
      folly::rcu_barrier();
    }
    folly::rcu_retire(oldConfigData);
  }
}

Custom deleter in folly::rcu_retire

When invoking folly::rcu_retire(), you may optionally also pass a custom deleter function that is invoked instead of std::default_delete:

#include <folly/logging/xlog.h>

static std::atomic<ConfigData*> globalConfigData;

void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData, [](ConfigData* obj) {
      XLOG(INFO) << "Deleting retired config " << oldConfigData->version);
      delete obj;
    });
  }
}

API limitations of updaters

Exceptions may not be thrown at any point in a retire callback. This includes both the deleter, as well as the object’s destructor. Other than this, any operation is safe from within a retired object’s destructor, including retiring other objects, or even retiring the same object as long as the custom deleter did not free it.

When using the default domain or the default executor, it is not legal to hold a lock across rcu_retire() if that lock is also acquired by the deleter. This is normally not a problem when using the default deleter delete, which does not acquire any user locks. However, even when using the default deleter, an object having a user-defined destructor that acquires locks held across the corresponding call to rcu_retire() can still deadlock.

Note as well that there is no guarantee of the order in which retire callbacks are invoked. A retire callback is guaranteed to be invoked only after all readers that were present when the callback was scheduled have exited. Otherwise, any ordering of callback invocation may occur.

Note on fork()

fork() may not be invoked in a multithreaded program where any thread other than the calling thread is in an RCU read region. Doing so will result in undefined behavior, and will likely lead to deadlock. If the forking thread is inside of an RCU read region, it must invoke exec() before exiting the read region.

Performance


std::scoped_lock<folly::rcu_domain> creation/destruction is on the order of ~5ns. By comparison, folly::SharedMutex::lock_shared() followed by folly::SharedMutex::unlock_shared() is ~26ns.

folly/SmallLocks.h

This module is currently x64 only.

This header defines two very small mutex types. These are useful in highly memory-constrained environments where contention is unlikely. The purpose of these is to allow fine-grained locking in massive data structures where memory is at a premium. Often, each record may have a spare bit or byte lying around, so sometimes these can be tacked on with no additional memory cost.

There are two types exported from this header. MicroSpinLock is a single byte lock, and PicoSpinLock can be wrapped around an integer to use a single bit as a lock. Why do we have both? Because you can’t use x64 bts on a single byte, so sizeof(MicroSpinLock) is smaller than sizeof(PicoSpinLock) can be, giving it some use cases.

Both the locks in this header model the C++11 Lockable concept. So you can use std::lock_guard or std::unique_lock to lock them in an RAII way if you want.
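
For example, a sketch of tacking a MicroSpinLock onto a record (the Record type is hypothetical):

    #include <folly/SmallLocks.h>
    #include <mutex>

    struct Record {
      folly::MicroSpinLock lock; // one byte; must be zero-initialized before use
      int value;
    };

    Record r = {}; // zero-initializes the lock
    {
      std::lock_guard<folly::MicroSpinLock> guard(r.lock);
      r.value = 42;
    }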

Additional information is in the header.

folly/Synchronized.h

folly/Synchronized.h introduces a simple abstraction for mutex-based concurrency. It replaces convoluted, unwieldy, and just plain wrong code with simple constructs that are easy to get right and difficult to get wrong.

Motivation

Many of our multithreaded C++ programs use shared data structures associated with locks. This follows the time-honored adage of mutex-based concurrency control “associate mutexes with data, not code”. Consider the following example:


    class RequestHandler {
      ...
      RequestQueue requestQueue_;
      SharedMutex requestQueueMutex_;

      std::map<std::string, Endpoint> requestEndpoints_;
      SharedMutex requestEndpointsMutex_;

      HandlerState workState_;
      SharedMutex workStateMutex_;
      ...
    };

Whenever the code needs to read or write some of the protected data, it acquires the mutex for reading or for reading and writing. For example:

    void RequestHandler::processRequest(const Request& request) {
      stop_watch<> watch;
      checkRequestValidity(request);
      std::unique_lock lock(requestQueueMutex_);
      requestQueue_.push_back(request);
      stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
      LOG(INFO) << "enqueued request ID " << request.getID();
    }

However, the correctness of the technique is entirely predicated on convention. Developers manipulating these data members must take care to explicitly acquire the correct lock for the data they wish to access. There is no ostensible error for code that:

* accesses the data without acquiring its associated mutex first;
* acquires a different mutex than the one protecting the data; or
* acquires a mutex in shared (read) mode but then modifies the data.

Introduction to folly/Synchronized.h

The same code sample could be rewritten with Synchronized as follows:

    class RequestHandler {
      ...
      Synchronized<RequestQueue> requestQueue_;
      Synchronized<std::map<std::string, Endpoint>> requestEndpoints_;
      Synchronized<HandlerState> workState_;
      ...
    };

    void RequestHandler::processRequest(const Request& request) {
      stop_watch<> watch;
      checkRequestValidity(request);
      requestQueue_.wlock()->push_back(request);
      stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
      LOG(INFO) << "enqueued request ID " << request.getID();
    }

The rewrite does at maximum efficiency what needs to be done: acquires the lock associated with the RequestQueue object, writes to the queue, and releases the lock immediately thereafter.

On the face of it, that’s not much to write home about, and not an obvious improvement over the previous state of affairs. But the features invisibly at work in the code above are as important as those that are visible.

If you need to perform several operations while holding the lock, Synchronized provides several options for doing this.

The wlock() method (or lock() if you have a non-shared mutex type) returns a LockedPtr object that can be stored in a variable. The lock will be held for as long as this object exists, similar to a std::unique_lock. This object can be used as if it were a pointer to the underlying locked object:

    {
      auto lockedQueue = requestQueue_.wlock();
      lockedQueue->push_back(request1);
      lockedQueue->push_back(request2);
    }

The rlock() function is similar to wlock(), but acquires a shared lock rather than an exclusive lock.

We recommend explicitly opening a new nested scope whenever you store a LockedPtr object, to help visibly delineate the critical section, and to ensure that the LockedPtr is destroyed as soon as it is no longer needed.

Alternatively, Synchronized also provides mechanisms to run a function while holding the lock. This makes it possible to use lambdas to define brief critical sections:

    void RequestHandler::processRequest(const Request& request) {
      stop_watch<> watch;
      checkRequestValidity(request);
      requestQueue_.withWLock([&](auto& queue) {
        // withWLock() automatically holds the lock for the
        // duration of this lambda function
        queue.push_back(request);
      });
      stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
      LOG(INFO) << "enqueued request ID " << request.getID();
    }

One advantage of the withWLock() approach is that it forces a new scope to be used for the critical section, making the critical section more obvious in the code, and helping to encourage code that releases the lock as soon as possible.

Template class Synchronized<T>

Template Parameters

Synchronized is a template with two parameters, the data type and a mutex type: Synchronized<T, Mutex>.

If not specified, the mutex type defaults to folly::SharedMutex. However, any mutex type with an interface compatible with the standard mutex types is supported. Additionally, any mutex type compatible with the extended protocol implemented in folly/synchronization/Lock.h is supported.

Synchronized provides slightly different APIs when instantiated with a shared mutex type or an upgrade mutex type than with a plain exclusive mutex. If instantiated with either of the two mutex types above (detected through the presence of a member called lock_shared()), the Synchronized object has corresponding wlock, rlock, or ulock methods to acquire different lock types. When using a shared or upgrade mutex type, these APIs ensure that callers make an explicit choice to acquire a shared, exclusive, or upgrade lock and that callers do not unintentionally lock the mutex in the incorrect mode. The rlock() APIs only provide const access to the underlying data type, ensuring that it cannot be modified when only holding a shared lock.

Constructors

The default constructor default-initializes the data and its associated mutex.

The copy constructor locks the source for reading and copies its data into the target. (The target is not locked as an object under construction is only accessed by one thread.)

Finally, Synchronized<T> defines an explicit constructor that takes an object of type T and copies it. For example:

    // Default constructed
    Synchronized<map<string, int>> syncMap1;

    // Copy constructed
    Synchronized<map<string, int>> syncMap2(syncMap1);

    // Initializing from an existing map
    map<string, int> init;
    init["world"] = 42;
    Synchronized<map<string, int>> syncMap3(init);
    EXPECT_EQ(syncMap3->size(), 1);

Assignment, swap, and copying

The copy assignment operator copies the underlying source data into a temporary with the source mutex locked, and then moves the temporary into the destination data with the destination mutex locked. This technique avoids the need to lock both mutexes at the same time. Mutexes are not copied or moved.

The move assignment operator assumes the source object is a true rvalue and does not lock the source mutex. It moves the source data into the destination data with the destination mutex locked.

swap acquires locks on both mutexes in increasing order of object address, and then swaps the underlying data. This avoids potential deadlock, which may otherwise happen should one thread do a = b while another thread does b = a.

The data copy assignment operator copies the parameter into the destination data while the destination mutex is locked.

The data move assignment operator moves the parameter into the destination data while the destination mutex is locked.

To get a copy of the guarded data, there are two methods available: void copy(T*) and T copy(). The first copies data to a provided target and the second returns a copy by value. Both operations are done under a read lock. Example:

    Synchronized<vector<string>> syncVec1, syncVec2;
    vector<string> vec;

    // Assign
    syncVec1 = syncVec2;
    // Assign straight from vector
    syncVec1 = vec;

    // Swap
    syncVec1.swap(syncVec2);
    // Swap with vector
    syncVec1.swap(vec);

    // Copy to given target
    syncVec1.copy(&vec);
    // Get a copy by value
    auto copy = syncVec1.copy();

lock()

If the mutex type used with Synchronized is a simple exclusive mutex type (as opposed to a shared mutex), Synchronized<T> provides a lock() method that returns a LockedPtr<T> to access the data while holding the lock.

The LockedPtr object returned by lock() holds the lock for as long as it exists. Whenever possible, prefer declaring a separate inner scope for storing this variable, to make sure the LockedPtr is destroyed as soon as the lock is no longer needed:

    void fun(Synchronized<vector<string>, std::mutex>& vec) {
      {
        auto locked = vec.lock();
        locked->push_back("hello");
        locked->push_back("world");
      }
      LOG(INFO) << "successfully added greeting";
    }

wlock() and rlock()

If the mutex type used with Synchronized is a shared mutex type, Synchronized<T> provides a wlock() method that acquires an exclusive lock, and an rlock() method that acquires a shared lock.

The LockedPtr returned by rlock() only provides const access to the internal data, to ensure that it cannot be modified while only holding a shared lock.

    int computeSum(const Synchronized<vector<int>>& vec) {
      int sum = 0;
      auto locked = vec.rlock();
      for (int n : *locked) {
        sum += n;
      }
      return sum;
    }

    void doubleValues(Synchronized<vector<int>>& vec) {
      auto locked = vec.wlock();
      for (int& n : *locked) {
        n *= 2;
      }
    }

This example brings us to a cautionary discussion. The LockedPtr object returned by lock(), wlock(), or rlock() only holds the lock as long as it exists. This object makes it difficult to access the data without holding the lock, but not impossible. In particular you should never store a raw pointer or reference to the internal data for longer than the lifetime of the LockedPtr object.

For instance, if we had written the following code in the examples above, it would have continued accessing the vector after the lock had been released:

    // No. NO. NO!
    for (int& n : *vec.wlock()) {
      n *= 2;
    }

The vec.wlock() return value is destroyed in this case as soon as the internal range iterators are created. The range iterators point into the vector’s data, but the lock is released immediately, before the loop body executes.

Needless to say, this is a crime punishable by long debugging nights.

Range-based for loops are slightly subtle about the lifetime of objects used in the initializer statement. Most other problematic use cases are a bit easier to spot than this, since the lifetime of the LockedPtr is more explicitly visible.

withLock()

As an alternative to the lock() API, Synchronized also provides a withLock() method that executes a function or lambda expression while holding the lock. The function receives a reference to the data as its only argument.

This has a few benefits compared to lock(): it forces a new scope for the critical section, and because no LockedPtr object is exposed, the lock cannot accidentally outlive the intended critical section.

For example, withLock() makes the range-based for loop mistake from above much harder to accidentally run into:

    vec.withLock([](auto& locked) {
      for (int& n : locked) {
        n *= 2;
      }
    });

This code does not have the same problem as the counter-example with wlock() above, since the lock is held for the duration of the loop.

When using Synchronized with a shared mutex type, it provides separate withWLock() and withRLock() methods instead of withLock().
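
For instance, a minimal sketch reusing the requestQueue_ member from the earlier example:

    // withWLock() passes the data by mutable reference
    requestQueue_.withWLock([&](auto& queue) {
      queue.push_back(request);
    });

    // withRLock() passes the data by const reference
    auto numQueued = requestQueue_.withRLock([](const auto& queue) {
      return queue.size();
    });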

ulock() and withULockPtr()

Synchronized also supports upgrading and downgrading mutex lock levels as long as the mutex type used to instantiate the Synchronized type has the same interface as the mutex types in the C++ standard library. See below for an intro to upgrade mutexes.

An upgrade lock can be acquired as usual, either with the ulock() method or with the withULockPtr() method, as follows:

    {
      // only const access allowed to the underlying object when an upgrade lock
      // is acquired
      auto ulock = vec.ulock();
      auto newSize = ulock->size();
    }

    auto newSize = vec.withULockPtr([](auto ulock) {
      // only const access allowed to the underlying object when an upgrade lock
      // is acquired
      return ulock->size();
    });

An upgrade lock acquired via ulock() or withULockPtr() can be upgraded or downgraded by calling any of the following methods on the LockedPtr proxy: moveFromUpgradeToWrite(), moveFromWriteToUpgrade(), moveFromWriteToRead(), and moveFromUpgradeToRead().

Calling these leaves the LockedPtr object on which the method was called in an invalid null state and returns another LockedPtr proxy holding the specified lock. The upgrade or downgrade is done atomically - the Synchronized object is never in an unlocked state during the lock state transition. For example:

    auto ulock = obj.ulock();
    if (ulock->needsUpdate()) {
      auto wlock = ulock.moveFromUpgradeToWrite();

      // ulock is now null

      wlock->updateObj();
    }

This “move” can also occur in the context of a withULockPtr() (withWLockPtr() or withRLockPtr() work as well!) function, as follows:

    auto newSize = obj.withULockPtr([](auto ulock) {
      if (ulock->needsUpdate()) {

        // release upgrade lock get write lock atomically
        auto wlock = ulock.moveFromUpgradeToWrite();
        // ulock is now null
        wlock->updateObj();

        // release write lock and acquire read lock atomically
        auto rlock = wlock.moveFromWriteToRead();
        // wlock is now null
        return rlock->newSize();

      } else {

        // release upgrade lock and acquire read lock atomically
        auto rlock = ulock.moveFromUpgradeToRead();
        // ulock is now null
        return rlock->newSize();
      }
    });

Intro to upgrade mutexes:

An upgrade mutex is a shared mutex with an extra state called upgrade and an atomic state transition from upgrade to unique. The upgrade state is more powerful than the shared state but less powerful than the unique state.

An upgrade lock permits only const access to shared state for doing reads. It does not permit mutable access to shared state for doing writes. Only a unique lock permits mutable access for doing writes.

An upgrade lock may be held concurrently with any number of shared locks on the same mutex. An upgrade lock is exclusive with other upgrade locks and unique locks on the same mutex - only one upgrade lock or unique lock may be held at a time.

The upgrade mutex solves the problem of doing a read of shared state and then optionally doing a write to shared state efficiently under contention. Consider this scenario with a shared mutex:

    struct MyObject {
      bool isUpdateRequired() const;
      void doUpdate();
    };

    struct MyContainingObject {
      folly::Synchronized<MyObject> sync;

      void mightHappenConcurrently() {
        // first check
        if (!sync.rlock()->isUpdateRequired()) {
          return;
        }
        sync.withWLock([&](auto& state) {
          // second check
          if (!state.isUpdateRequired()) {
            return;
          }
          state.doUpdate();
        });
      }
    };

Here, the second isUpdateRequired check happens under a unique lock. This means that the second check cannot be done concurrently with other threads doing first isUpdateRequired checks under the shared lock, even though the second check, like the first check, is read-only and requires only const access to the shared state.

This may even introduce unnecessary blocking under contention. Since the default mutex type, folly::SharedMutex, has write priority, the unique lock protecting the second check may introduce unnecessary blocking to all the other threads that are attempting to acquire a shared lock to protect the first check. This problem is called reader starvation.

One solution is to use a shared mutex type with read priority, such as folly::SharedMutexReadPriority. That can introduce less blocking under contention to the other threads attempting to acquire a shared lock to do the first check. However, that may backfire and cause threads which are attempting to acquire a unique lock (for the second check) to stall, waiting for a moment in time when there are no shared locks held on the mutex, a moment in time that may never even happen. This problem is called writer starvation.

Starvation is a tricky problem to solve in general. But we can partially side-step it in our case.

An alternative solution is to use an upgrade lock for the second check. Threads attempting to acquire an upgrade lock for the second check do not introduce unnecessary blocking to all other threads that are attempting to acquire a shared lock for the first check. Only after the second check passes, and the upgrade lock transitions atomically from an upgrade lock to a unique lock, does the unique lock introduce necessary blocking to the other threads attempting to acquire a shared lock. With this solution, unlike the solution without the upgrade lock, the second check may be done concurrently with all other first checks rather than blocking or being blocked by them.

The example would then look like:

    struct MyObject {
      bool isUpdateRequired() const;
      void doUpdate();
    };

    struct MyContainingObject {
      folly::Synchronized<MyObject> sync;

      void mightHappenConcurrently() {
        // first check
        if (!sync.rlock()->isUpdateRequired()) {
          return;
        }
        sync.withULockPtr([&](auto ulock) {
          // second check
          if (!ulock->isUpdateRequired()) {
            return;
          }
          auto wlock = ulock.moveFromUpgradeToWrite();
          wlock->doUpdate();
        });
      }
    };

Note: Some shared mutex implementations offer an atomic state transition from shared to unique and some upgrade mutex implementations offer an atomic state transition from shared to upgrade. These atomic state transitions are dangerous, however, and can deadlock when done concurrently on the same mutex. For example, if threads A and B both hold shared locks on a mutex and are both attempting to transition atomically from shared to upgrade locks, the threads are deadlocked. Likewise if they are both attempting to transition atomically from shared to unique locks, or one is attempting to transition atomically from shared to upgrade while the other is attempting to transition atomically from shared to unique. Therefore, Synchronized’s LockedPtr proxies do not expose either of these dangerous atomic state transitions even when the underlying mutex type supports them.

Timed Locking

When Synchronized is used with a mutex type that supports timed lock acquisition, lock(), wlock(), and rlock() can all take an optional std::chrono::duration argument. This argument specifies a timeout to use for acquiring the lock. If the lock is not acquired before the timeout expires, a null LockedPtr object will be returned. Callers must explicitly check the return value before using it:

    void fun(Synchronized<vector<string>>& vec) {
      {
        auto locked = vec.lock(10ms);
        if (!locked) {
          throw std::runtime_error("failed to acquire lock");
        }
        locked->push_back("hello");
        locked->push_back("world");
      }
      LOG(INFO) << "successfully added greeting";
    }
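
The shared-lock variants accept the same timeout argument. A minimal sketch, mirroring the computeSum() example from earlier:

    int computeSumWithTimeout(const Synchronized<vector<int>>& vec) {
      auto locked = vec.rlock(10ms);
      if (!locked) {
        throw std::runtime_error("failed to acquire shared lock");
      }
      int sum = 0;
      for (int n : *locked) {
        sum += n;
      }
      return sum;
    }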

unlock() and scopedUnlock()

Synchronized is a good mechanism for enforcing scoped synchronization, but it has the inherent limitation that it requires the critical section to be, well, scoped. Sometimes the code structure requires a fleeting “escape” from the iron fist of synchronization, while still inside the critical section scope.

One common pattern is releasing the lock early on error code paths, prior to logging an error message. The LockedPtr class provides an unlock() method that makes this possible:

    Synchronized<map<int, string>> dic;
    ...
    {
      auto locked = dic.rlock();
      auto iter = locked->find(0);
      if (iter == locked->end()) {
        locked.unlock();  // don't hold the lock while logging
        LOG(ERROR) << "key 0 not found";
        return false;
      }
      processValue(*iter);
    }
    LOG(INFO) << "succeeded";

For more complex nested control flow scenarios, scopedUnlock() returns an object that will release the lock for as long as it exists, and will reacquire the lock when it goes out of scope.


    Synchronized<map<int, string>> dic;
    ...
    {
      auto locked = dic.wlock();
      auto iter = locked->find(0);
      if (iter == locked->end()) {
        {
          auto unlocker = locked.scopedUnlock();
          LOG(INFO) << "Key 0 not found, inserting it."
        }
        locked->emplace(0, "zero");
      } else {
        *iter = "zero";
      }
    }

Clearly scopedUnlock() comes with specific caveats and liabilities. You must assume that during the scopedUnlock() section, other threads might have changed the protected structure in arbitrary ways. In the example above, you cannot use the iterator iter and you cannot assume that the key 0 is not in the map; another thread might have inserted it while you were bragging on LOG(INFO).

Whenever a LockedPtr object has been unlocked, whether with unlock() or scopedUnlock(), it will behave as if it is null. isNull() will return true. Dereferencing an unlocked LockedPtr is not allowed and will result in undefined behavior.
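
A minimal sketch of checking for this state, reusing the dic map from above:

    auto locked = dic.wlock();
    locked.unlock();
    assert(locked.isNull());  // true once the lock has been released
    // Dereferencing locked here would be undefined behavior.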

Synchronized and std::condition_variable

When used with a std::mutex, Synchronized supports using a std::condition_variable with its internal mutex. This allows a condition_variable to be used to wait for a particular change to occur in the internal data.

The LockedPtr returned by Synchronized<T, std::mutex>::lock() has an as_lock() method that returns a reference to a std::unique_lock<std::mutex>, which can be given to the std::condition_variable:

    Synchronized<vector<string>, std::mutex> vec;
    std::condition_variable emptySignal;

    // Assuming some other thread will put data on vec and signal
    // emptySignal, we can then wait on it as follows:
    auto locked = vec.lock();
    emptySignal.wait(locked.as_lock(),
                     [&] { return !locked->empty(); });

acquireLocked()

Sometimes locking just one object won’t be able to cut the mustard. Consider a function that needs to lock two Synchronized objects at the same time - for example, to copy some data from one to the other. At first sight, it looks like sequential wlock() calls will work just fine:

    void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
      auto lockedA = a.wlock();
      auto lockedB = b.wlock();
      ... use lockedA and lockedB ...
    }

This code compiles and may even run most of the time, but it embeds a deadly peril: if one thread calls fun(x, y) and another thread calls fun(y, x), the two threads are liable to deadlock, as each will be waiting for a lock the other is holding. This issue is a classic that applies regardless of the fact that the objects involved have the same type.

This classic problem has a classic solution: all threads must acquire locks in the same order. The actual order is not important, just the fact that the order is the same in all threads. Many libraries simply acquire mutexes in increasing order of their address, which is what we’ll do, too. The acquireLocked() function takes care of all details of proper locking of two objects and offering their innards. It returns a std::tuple of LockedPtrs:

    void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
      auto ret = folly::acquireLocked(a, b);
      auto& lockedA = std::get<0>(ret);
      auto& lockedB = std::get<1>(ret);
      ... use lockedA and lockedB ...
    }

Note that C++17 introduces structured binding syntax, which makes the returned tuple more convenient to use:

    void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
      auto [lockedA, lockedB] = folly::acquireLocked(a, b);
      ... use lockedA and lockedB ...
    }

An acquireLockedPair() function is also available, which returns a std::pair instead of a std::tuple. This is more convenient to use in many situations, until compiler support for structured bindings is more widely available.
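
For example, a short sketch using acquireLockedPair():

    void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
      auto locked = folly::acquireLockedPair(a, b);
      locked.first->push_back(1);
      locked.second->push_back(2);
    }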

Synchronizing several data items with one mutex

The library is geared at protecting one object of a given type with a mutex. However, sometimes we’d like to protect two or more members with the same mutex. Consider for example a bidirectional map, i.e. a map that holds an int to string mapping and also the converse string to int mapping. The two maps would need to be manipulated simultaneously. There are at least two designs that come to mind.

Using a nested struct

You can easily pack the needed data items in a little struct. For example:

    class Server {
      struct BiMap {
        map<int, string> direct;
        map<string, int> inverse;
      };
      Synchronized<BiMap> bimap_;
      ...
    };
    ...
    bimap_.withWLock([](auto& locked) {
      locked.direct[0] = "zero";
      locked.inverse["zero"] = 0;
    });

With this code in tow you get to use bimap_ just like any other Synchronized object, without much effort.

Using std::tuple

If you won’t stop short of using a spaceship-era approach, std::tuple is there for you. The example above could be rewritten for the same functionality like this:

    class Server {
      Synchronized<tuple<map<int, string>, map<string, int>>> bimap_;
      ...
    };
    ...
    bimap_.withWLock([](auto& locked) {
      get<0>(locked)[0] = "zero";
      get<1>(locked)["zero"] = 0;
    });

The code uses std::get with compile-time integers to access the fields in the tuple. The relative advantages and disadvantages of using a local struct vs. std::tuple are quite obvious - in the first case you need to invest in the definition, in the second case you need to put up with slightly more verbose and less clear access syntax.

Summary

Synchronized and its supporting tools offer you a simple, robust paradigm for mutual exclusion-based concurrency. Instead of manually pairing data with the mutexes that protect it and relying on convention to use them appropriately, you can benefit from encapsulation and typechecking to offload a large part of that task and to provide good guarantees.

folly/ThreadCachedInt.h

High-performance atomic increment using thread caching.

folly/ThreadCachedInt.h introduces an integer class designed for high-performance increments from multiple threads simultaneously, without loss of precision. It has two read modes: readFast gives a potentially stale value with one load, and readFull gives the exact value, but is much slower, as discussed below.

Performance


Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.h for more comprehensive benchmarks.

readFast is as fast as a single load.

readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so is significantly slower than readFast.

Usage


Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.

    ThreadCachedInt<int64_t> val;
    EXPECT_EQ(0, val.readFast());
    ++val;                        // increment in thread local counter only
    EXPECT_EQ(0, val.readFast()); // increment has not been flushed
    EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
    val.set(2);
    EXPECT_EQ(2, val.readFast());
    EXPECT_EQ(2, val.readFull());

Implementation


folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast via a simple load, but will not count any of the updates that haven’t been flushed.

In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, consider breaking up the global mutex by introducing additional Tag template parameters.
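
For instance, a minimal sketch of segmenting the mutex with a custom tag (MyCounterTag is an arbitrary, illustrative name):

    struct MyCounterTag {};  // used only to segment the global mutex
    // These counters share a mutex with each other, but not with the
    // default ThreadCachedInt<int64_t> instances elsewhere in the program.
    folly::ThreadCachedInt<int64_t, MyCounterTag> requestCount;
    folly::ThreadCachedInt<int64_t, MyCounterTag> errorCount;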

set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.

Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn’t destroyed in another thread already.

Alternate Implementations


There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array of std::atomic<int64_t>’s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.

This sounds great, but in order to get the contention low enough to match ThreadCachedInt’s performance with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt in low-contention situations (hopefully the common case, since the class is designed for high-write, low-read access patterns). readFull is about the same speed as ThreadCachedInt in high-contention environments.

Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.
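
For reference, a minimal sketch of the sharded approach (illustrative only, not the actual test implementation; requires <array>, <atomic>, <functional>, and <thread>):

    class ShardedCounter {
     public:
      void increment(int64_t amount) {
        // Hash the current thread onto a shard so concurrent threads
        // usually touch different atomics.
        size_t shard =
            std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
        counts_[shard].fetch_add(amount, std::memory_order_relaxed);
      }

      int64_t readFull() const {
        int64_t sum = 0;
        for (const auto& c : counts_) {  // must visit every shard
          sum += c.load(std::memory_order_relaxed);
        }
        return sum;
      }

     private:
      static constexpr size_t kShards = 2048;  // large, to keep contention low
      std::array<std::atomic<int64_t>, kShards> counts_{};
    };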

folly/ThreadLocal.h

Improved thread local storage for non-trivial types.

Usage


The API of ThreadLocalPtr is very close to boost::thread_specific_ptr with the notable addition of the accessAllThreads method. There is also a ThreadLocal class which is a thin wrapper around ThreadLocalPtr that manages allocation automatically (creates a new object the first time it is dereferenced from each thread).

ThreadLocalPtr simply gives you a place to put and access a pointer local to each thread such that it will be destroyed appropriately.

    {
      folly::ThreadLocalPtr<Widget> w;
      w.reset(new Widget(0), Widget::customDeleterA);
      std::thread([&w]() {
        w.reset(new Widget(1), Widget::customDeleterB);
        w.get()->mangleWidget();
      }).join(); // Widget(1) is destroyed with customDeleterB
    } // Widget(0) is destroyed with customDeleterA

Note that customDeleterB will get called with TLPDestructionMode::THIS_THREAD and customDeleterA will get called with TLPDestructionMode::ALL_THREADS. This is to distinguish between thread exit and the entire ThreadLocalPtr getting destroyed, in which case there is cleanup work that may be avoided.

The accessAllThreads interface is provided to walk all the thread local child objects of a parent. accessAllThreads initializes an accessor which holds a global lock that blocks all creation and destruction of ThreadLocal objects with the same Tag and can be used as an iterable container. Typical use is for frequent write, infrequent read data access patterns such as counters. Note that you must specify a unique Tag type so you don’t block other ThreadLocal object usage, and you should try to minimize the lifetime of the accessor so the lock is held for as short a time as possible.

The following example is a simplification of folly/ThreadCachedInt.h. It keeps track of a counter value and allows multiple threads to add to the count without synchronization. In order to get the total count, read() iterates through all the thread local values via accessAllThreads() and sums them up. class NewTag is used to break the global mutex so that this class won’t block other ThreadLocal usage when read() is called.

Note that read() holds the global mutex which blocks construction, destruction, and read() for other SimpleThreadCachedInt’s, but does not block add(). Also, since it uses the unique NewTag, SimpleThreadCachedInt does not affect other ThreadLocal usage.

    class SimpleThreadCachedInt {
      class NewTag;  // Segments the global mutex
      ThreadLocal<int, NewTag> val_;

     public:
      void add(int val) {
        *val_ += val;  // operator*() gives a reference to the thread local instance
      }

      int read() {
        int ret = 0;
        // accessAllThreads acquires the global lock
        for (const auto& i : val_.accessAllThreads()) {
          ret += i;
        }  // Global lock is released on scope exit
        return ret;
      }
    };

Implementation


We keep a __thread array of pointers to objects (ThreadEntry::elements) where each array has an index for each unique instance of the ThreadLocalPtr object. Each ThreadLocalPtr object has a unique id that is an index into these arrays so we can fetch the correct object from thread local storage very efficiently.
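
A greatly simplified sketch of this fast path (only ThreadEntry::elements is a real name; the rest is illustrative):

    struct ElementWrapper {
      void* ptr;  // this thread’s object for one ThreadLocalPtr instance
    };
    struct ThreadEntry {
      std::vector<ElementWrapper> elements;  // indexed by ThreadLocalPtr id
    };
    thread_local ThreadEntry threadEntry;

    // Each ThreadLocalPtr stores a unique id, so get() reduces to an array index.
    void* getForCurrentThread(uint32_t id) {
      return threadEntry.elements[id].ptr;
    }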

In order to prevent unbounded growth of the id space and thus huge ThreadEntry::elements arrays, for example due to continuous creation and destruction of ThreadLocalPtr objects, we keep track of all active instances by linking them together into a list. When an instance is destroyed we remove it from the chain and insert the id into freeIds_ for reuse. These operations require a global mutex, but only happen at construction and destruction time. accessAllThreads also acquires this global mutex.

We use a single global pthread_key_t per Tag to manage object destruction and memory cleanup upon thread exit because there is a finite number of pthread_key_t’s available per machine.

folly/Traits.h

Implements traits complementary to those provided in <type_traits>.

Motivation


<type_traits> is the Standard type-traits library, defining a variety of traits such as is_integral or is_floating_point that provide compile-time information about a given type.

folly/Traits.h implements traits complementing those present in the Standard.

IsRelocatable


In C++, the default way to move an object is to call the copy constructor and destroy the old copy, rather than directly copying the memory contents using memcpy(). The conservative approach to moving an object assumes that the object is not relocatable. The following two code sequences should be semantically equivalent for a relocatable type:

    void conservativeMove(T* from, T* to) {
      new (to) T(*from);  // copy-construct into the new location
      from->~T();         // destroy the old copy
    }

    void optimizedMove(T* from, T* to) {
      memcpy(to, from, sizeof(T));
    }

Very few C++ types are non-relocatable. The type defined below maintains a pointer inside an embedded buffer and hence would be non-relocatable. Moving the object by simply copying its memory contents would leave the internal pointer pointing to the old buffer.

    class NonRelocatableType {
     private:
      char buffer[1024];
      char* pointerToBuffer;
      ...
     public:
      NonRelocatableType() : pointerToBuffer(buffer) {}
      ...
    };

We can optimize the task of moving a relocatable type T by using memcpy(). IsRelocatable<T>::value describes whether a value of type T can be moved around in memory using memcpy().
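
One common way to mark your own type as relocatable is to specialize the trait; a minimal sketch for a hypothetical MyType:

    class MyType { /* trivially relocatable members only */ };

    namespace folly {
    template <>
    struct IsRelocatable<MyType> : std::true_type {};
    }  // namespace folly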

Usage


fbvector only works with relocatable objects. Unless relocatability is stated explicitly, fbvector<MySimpleType> or fbvector<MyParameterizedType> will fail to compile due to the assertion below:

    static_assert(IsRelocatable<My*Type>::value, "");

The FOLLY_ASSUME_FBVECTOR_COMPATIBLE*(type) macros can be used to state that a type is relocatable and has a nothrow constructor.
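
For example, a sketch of the intended usage, assuming the macros are invoked at namespace scope (the _1 variant is for class templates taking one type parameter):

    FOLLY_ASSUME_FBVECTOR_COMPATIBLE(MySimpleType);
    FOLLY_ASSUME_FBVECTOR_COMPATIBLE_1(MyParameterizedType);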

Similarly, a few common types, namely std::basic_string, std::vector, std::list, std::map, std::deque, std::set, std::unique_ptr, std::shared_ptr, and std::function, are already declared compatible with fbvector, so fbvector can be used directly with any of them.

std::pair can be safely assumed to be compatible with fbvector if both of its components are.

IsOneOf


std::is_same<T1, T2>::value can be used to test whether the types T1 and T2 are the same. folly::IsOneOf<T, T1, Ts...>::value can be used to test whether the type T matches one of the other template parameters T1, T2, ... Tn. Recursion is used to implement this type trait.
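
For example:

    static_assert(folly::IsOneOf<int, char, int, double>::value, "");
    static_assert(!folly::IsOneOf<float, char, int, double>::value, "");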

folly/small_vector.h

folly::small_vector<T,Int=1,...> is a sequence container that implements the small buffer optimization. It behaves similarly to std::vector, except that it does not use the heap until its size exceeds a fixed in-place capacity.

Like standard vector, it is guaranteed to use contiguous memory. (So, after it spills to the heap all the elements live in the heap buffer.)

Simple usage example:

    small_vector<int,2> vec;
    vec.push_back(0); // Stored in-place on stack
    vec.push_back(1); // Still on the stack
    vec.push_back(2); // Switches to heap buffer.

Details


This class is useful in either of the following cases:

The last two cases were the main motivation for implementing it.

There are also a couple of flags you can pass into this class template to customize its behavior. You can provide them in any order after the in-place count. They are all in the namespace small_vector_policy.

A couple more examples:

    // With space for 32 in situ unique pointers, and only using a
    // 4-byte size_type.
    small_vector<std::unique_ptr<int>, 32, uint32_t> v;

    // An inline vector of up to 256 ints which will not use the heap.
    small_vector<int, 256, NoHeap> v;

    // Same as the above, but making the size_type smaller too.
    small_vector<int, 256, NoHeap, uint16_t> v;