/**
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
folly/AtomicHashMap.h

folly/AtomicHashMap.h introduces a synchronized UnorderedAssociativeContainer implementation designed for extreme performance in heavily multithreaded environments (about 2-5x faster than tbb::concurrent_hash_map) and good memory usage properties. Find and iteration are wait-free, insert has key-level lock granularity, there is minimal memory overhead, and permanent 32-bit ids can be used to reference each element.
Although it can provide extreme performance, AtomicHashMap has some unique limitations as well.

- The space for erased elements cannot be reclaimed (they are tombstoned forever), so it's generally not a good idea to use this if you're erasing things a lot.
- Only 32- and 64-bit keys are supported - this is because they must be atomically compare-and-swap'ed.
- Growth beyond initialization reduces performance - if you don't know the approximate number of elements you'll be inserting into the map, you probably shouldn't use this class.
- Synchronization must be managed externally in order to modify values in the map after insertion. Lock pools are a common way to do this, or you may consider using folly::PackedSyncPtr<T> as your ValueT.
- Special reserved key values must be defined for empty, erased, and locked elements.
For a complete list of limitations and departures from the UnorderedAssociativeContainer concept, see folly/AtomicHashMap.h.
value_type references remain valid as long as the map itself. Note this is not true for most other probing hash maps, which will move elements when rehashing, which is necessary for them to grow. AtomicHashMap grows by chaining additional slabs, so elements never need to be moved.

Unique 32-bit ids can be used to reference elements in the map via iterator::getIndex(). This can be helpful to save memory in the rest of the application by replacing 64-bit pointers or keys.
Iterators are never invalidated. This means you can iterate through the map while simultaneously inserting and erasing. This is particularly useful for non-blocking map serialization.
Usage is similar to most maps, although note the conspicuous lack of operator[] which encourages non thread-safe access patterns.
Below is a synchronized key counter implementation that allows the counter values to be incremented in parallel with serializing all the values to a string.
class Counters {
 private:
  AtomicHashMap<int64_t, int64_t> ahm;

 public:
  explicit Counters(size_t numCounters) : ahm(numCounters) {}

  void increment(int64_t obj_id) {
    auto ret = ahm.insert(make_pair(obj_id, 1));
    if (!ret.second) {
      // obj_id already exists, increment
      NoBarrier_AtomicIncrement(&ret.first->second, 1);
    }
  }

  int64_t getValue(int64_t obj_id) {
    auto ret = ahm.find(obj_id);
    return ret != ahm.end() ? ret->second : 0;
  }

  // Serialize the counters without blocking increments
  string toString() {
    string ret = "{\n";
    ret.reserve(ahm.size() * 32);
    for (const auto& e : ahm) {
      ret += folly::to<string>(
          " [", e.first, ":", NoBarrier_Load(&e.second), "]\n");
    }
    ret += "}\n";
    return ret;
  }
};
AtomicHashMap is a composition of AtomicHashArray submaps, which implement the meat of the functionality. Only one AHA is created on initialization, and additional submaps are appended if the first one gets full. If the AHM grows, there will be multiple submaps that must be probed in series to find a given key. The more growth, the more submaps will be chained, and the slower it will get. If the initial size estimate is good, only one submap will ever be created and performance will be optimal.
AtomicHashArray is a fixed-size probing hash map (also referred to as an open addressed hash map) where hash collisions are resolved by checking subsequent elements. This means that they can be allocated in slabs as arrays of value_type elements, have excellent cache performance, and have no memory overhead from storing pointers.
The algorithm is simple - when inserting, the key is hash-mod’ed to an offset, and that element-key is atomically compare-and-swap’ed with the locked key value. If successful, the value is written and the element-key is unlocked by setting it to the input key value. If the compare fails, the next element is tried until success or the map is full.
Finds are even simpler. The key is hash-mod’ed to an offset, and the element-key is examined. If it is the same as the input key, the reference is returned, if it’s the empty key, failure is returned, otherwise the next key is tried. This can be done wait-free without any atomic instructions because the elements are always in a valid state.
Erase is done by finding the key, then compare-and-swap’ing the element-key with the reserved erased key value. If the swap succeeds, return success, otherwise return failure (the element was erased by a competing thread). If the key does not exist, return failure.
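For concreteness, here is a minimal sketch of the CAS-based probing insert described above. It assumes a fixed-size array of atomic keys with reserved empty and locked key values; the names (Cell, kEmptyKey, kLockedKey) are illustrative, not folly's actual internals.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr int64_t kEmptyKey = 0;   // reserved key: slot never used
constexpr int64_t kLockedKey = -1; // reserved key: slot being written

struct Cell {
  std::atomic<int64_t> key{kEmptyKey};
  int64_t value{0};
};

bool insert(Cell* cells, std::size_t capacity, int64_t key, int64_t value) {
  std::size_t idx = std::hash<int64_t>()(key) % capacity;
  for (std::size_t probes = 0; probes < capacity; ++probes) {
    int64_t expected = kEmptyKey;
    // Atomically claim an empty slot by CAS'ing in the locked key.
    if (cells[idx].key.compare_exchange_strong(expected, kLockedKey)) {
      cells[idx].value = value;                             // write the value...
      cells[idx].key.store(key, std::memory_order_release); // ...then unlock
      return true;
    }
    if (expected == key) {
      return false; // key already present
    }
    idx = (idx + 1) % capacity; // collision: probe the next slot
  }
  return false; // map is full
}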
folly/Benchmark.h

folly/Benchmark.h provides a simple framework for writing and executing benchmarks. Currently the framework targets only single-threaded testing (though you can internally use fork-join parallelism and measure total run time).
Using folly/Benchmark.h
is very simple. Here’s an
example:
#include <folly/Benchmark.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector) {
  // Let's insert 100 elements at the front of a vector
  vector<int> v;
  for (unsigned int i = 0; i < 100; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector) {
  // Let's insert 100 elements at the back of a vector
  vector<int> v;
  for (unsigned int i = 0; i < 100; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
Compiling and running this code produces the following on standard output:
===============================================================================
test.cpp relative ns/iter iters/s
===============================================================================
insertFrontVector 3.84K 260.38K
insertBackVector 1.61K 622.75K
===============================================================================
Let’s worry about the empty column “relative” later. The table contains, for each benchmark, the time spent per call and the converse number of calls per second. Numbers are represented in metric notation (K for thousands, M for millions etc). As expected, in this example the second function is much faster (fewer ns/iter and more iters/s).
The macro BENCHMARK
introduces a function and also adds
it to an internal array containing all benchmarks in the system. The
defined function takes no arguments and returns void
.
The framework calls the function many times to collect statistics
about it. Sometimes the function itself would want to do that
iteration—for example how about inserting n
elements
instead of 100 elements? To do the iteration internally, use
BENCHMARK
with two parameters. The second parameter is the
number of iterations and is passed by the framework down to the
function. The type of the count is implicitly unsigned
.
Consider a slightly reworked example:
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
The produced numbers are substantially different:
===============================================================================
Benchmark relative ns/iter iters/s
===============================================================================
insertFrontVector 39.92 25.05M
insertBackVector 3.46 288.89M
===============================================================================
Now the numbers indicate the speed of one single insertion because the framework assumed the user-defined function used internal iteration (which it does). So inserting at the back of a vector is more than 10 times faster than inserting at the front! Speaking of comparisons…
Choosing one or more good baselines is a crucial activity in any
measurement. Without a baseline there is little information to derive
from the sheer numbers. If, for example, you do experimentation with
algorithms, a good baseline is often an established approach (e.g. the
built-in std::sort
for sorting). Essentially all
experimental numbers should be compared against some baseline.
To support baseline-driven measurements,
folly/Benchmark.h
defines BENCHMARK_RELATIVE
,
which works much like BENCHMARK
, except it considers the
most recent lexically-occurring BENCHMARK
a baseline, and
fills the “relative” column. Say, for example, we want to use front
insertion for a vector as a baseline and see how back insertion compares
with it:
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>
using namespace std;
using namespace folly;
BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK_RELATIVE(insertBackVector, n) {
  vector<int> v;
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
This program prints something like:
===============================================================================
Benchmark relative ns/iter iters/s
===============================================================================
insertFrontVector 42.65 23.45M
insertBackVector 1208.24% 3.53 283.30M
===============================================================================
showing the 1208.24% relative speed advantage of inserting at the back compared to the front. The scale is chosen in such a way that 100% means identical speed, numbers smaller than 100% indicate the benchmark is slower than the baseline, and numbers greater than 100% indicate the benchmark is faster. For example, if you see 42%, that means the speed of the benchmark is 0.42 of the baseline speed. If you see 123%, the benchmark is 23% faster, i.e. 1.23 times the baseline speed.
To close the current benchmark group and start another, simply use
BENCHMARK
again.
If you want to draw a horizontal line of dashes (e.g. at the end of a
group or for whatever reason), use BENCHMARK_DRAW_LINE()
.
The line fulfills a purely aesthetic role; it doesn’t interact with
measurements in any way.
BENCHMARK(foo) {
  Foo foo;
  foo.doSomething();
}

BENCHMARK_DRAW_LINE();

BENCHMARK(bar) {
  Bar bar;
  bar.doSomething();
}
Sometimes benchmarking code must do some preparation work that is physically inside the benchmark function, but should not count toward its time budget. To temporarily suspend the benchmark, use the pseudo-statement BENCHMARK_SUSPEND as follows:
BENCHMARK(insertBackVector, n) {
  vector<int> v;
  BENCHMARK_SUSPEND {
    v.reserve(n);
  }
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}
The preallocation effected with v.reserve(n)
will not
count toward the total run time of the benchmark.
Only the main thread should call BENCHMARK_SUSPEND
(and
of course it should not call it while other threads are doing actual
work). This is because the timer is application-global.
If the scope introduced by BENCHMARK_SUSPEND
is not
desired, you may want to “manually” use the
BenchmarkSuspender
type. Constructing such an object
suspends time measurement, and destroying it resumes the measurement. If
you want to resume time measurement before the destructor, call
dismiss
against the BenchmarkSuspender
object.
The previous example could have been written like this:
BENCHMARK(insertBackVector, n) {
  BenchmarkSuspender braces;
  vector<int> v;
  v.reserve(n);
  braces.dismiss();
  for (unsigned int i = 0; i < n; ++i) {
    v.insert(v.end(), i);
  }
}
doNotOptimizeAway
Finally, the small utility function doNotOptimizeAway prevents compiler optimizations that may interfere with benchmarking. Call doNotOptimizeAway(var) against variables that you use for benchmarking but otherwise are useless. The compiler tends to do a good job at eliminating unused variables, and this function fools it into thinking a variable is in fact needed. Example:
BENCHMARK(fpOps, n) {
  double d = 1;
  FOR_EACH_RANGE (i, 1, n) {
    d += i;
    d -= i;
    d *= i;
    d /= i;
  }
  doNotOptimizeAway(d);
}
folly/Benchmark.h
has a simple, systematic approach to
collecting timings.
First, it organizes measurements in several large epochs, and takes the minimum over all epochs. Taking the minimum gives the closest result to the real runtime. Benchmark timings are not a regular random variable that fluctuates around an average. Instead, the real time we're looking for is one to which a variety of additive noise is applied (i.e. there is no noise that could actually shorten the benchmark time below its real value). In theory, with an infinite number of samples, the minimum of the samples is exactly the time being measured. That's why the accuracy of benchmarking increases with the number of epochs.

Clearly, in real use there will also be noise and a variety of effects caused by the running context. But the noise during the benchmark (straight setup, simple looping) is a poor model for the noise in the real application. So taking the minimum across several epochs is the most informative result.
Inside each epoch, the function measured is iterated an increasing number of times until the total runtime is large enough to make noise negligible. At that point the time is collected, and the time per iteration is computed. As mentioned, the minimum time per iteration over all epochs is the final result.
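Here is a minimal sketch of that collection strategy, under stated assumptions (the epoch count and minimum epoch duration are illustrative; folly's actual implementation differs in details):

#include <chrono>
#include <cstdint>

template <class F>
double nsPerIteration(F fn, int epochs = 1000) {
  using clock = std::chrono::steady_clock;
  const auto minEpochTime = std::chrono::milliseconds(1);
  double best = 1e300; // minimum ns/iter seen over all epochs
  for (int e = 0; e < epochs; ++e) {
    // Grow the iteration count until the epoch runs long enough that
    // timer resolution and loop overhead become negligible noise.
    for (std::uint64_t iters = 1;; iters *= 2) {
      auto start = clock::now();
      for (std::uint64_t i = 0; i < iters; ++i) {
        fn();
      }
      auto elapsed = clock::now() - start;
      if (elapsed >= minEpochTime) {
        double nsPerIter =
            std::chrono::duration<double, std::nano>(elapsed).count() / iters;
        if (nsPerIter < best) best = nsPerIter; // keep the minimum
        break;
      }
    }
  }
  return best;
}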
The timer function used is clock_gettime
with the
CLOCK_REALTIME
clock id. Note that you must use a recent
Linux kernel (2.6.38 or newer), otherwise the resolution of
CLOCK_REALTIME
is inadequate.
folly/Conv.h
folly/Conv.h
is a one-stop-shop for converting values
across types. Its main features are simplicity of the API (only the
names to
and toAppend
must be memorized),
speed (folly is significantly faster, sometimes by an order of
magnitude, than comparable APIs), and correctness.
All examples below are assumed to have included folly/Conv.h and issued using namespace folly;. Here is what you will need:
// To format as text and append to a string, use toAppend.
fbstring str;
toAppend(2.5, &str);
CHECK_EQ(str, "2.5");

// Multiple arguments are okay, too. Just put the pointer to string at the end.
toAppend(" is ", 2, " point ", 5, &str);
CHECK_EQ(str, "2.5 is 2 point 5");

// You don't need to use fbstring (although it's much faster
// for conversions and in general).
std::string stdStr;
toAppend("Pi is about ", 22.0 / 7, &stdStr);

// In general, just use to<TargetType>(sourceValue). It returns
// its result by value.
stdStr = to<std::string>("Variadic ", "arguments also accepted.");

// to<fbstring> is 2.5x faster than to<std::string> for typical workloads.
str = to<fbstring>("Variadic ", "arguments also accepted.");
Using to<Target>(value)
to convert one integral
type to another will behave as follows:
short x;
unsigned short y;
...
auto a = to<int>(x); // zero overhead conversion
auto b = to<int>(y); // zero overhead conversion
to inserts bounds checks and throws std::range_error if the target type cannot accommodate the source value. Example:

short x;
unsigned short y;
long z;
...
x = 123;
auto a = to<unsigned short>(x); // fine
x = -1;
a = to<unsigned short>(x); // THROWS
z = 2000000000;
auto b = to<int>(z); // fine
z += 1000000000;
b = to<int>(z); // THROWS
auto c = to<unsigned int>(z); // fine
As mentioned, there are two primitives for converting anything to
string: to
and toAppend
. They support the same
set of source types, literally by definition (to
is
implemented in terms of toAppend
for all types). The call
toAppend(value, &str)
formats and appends
value
to str
whereas
to<StringType>(value)
formats value
as a
StringType
and returns the result by value. Currently, the supported StringTypes are std::string and fbstring.
Both toAppend
and to
with a string type as
a target support variadic arguments. Each argument is converted in turn.
For toAppend
the last argument in a variadic list must be
the address of a supported string type (no need to specify the string
type as a template argument).
Nothing special here - integrals are converted to strings in decimal format, with a ‘-’ prefix for negative values. Example:
auto a = to<fbstring>(123);
assert(a == "123");
a = to<fbstring>(-456);
assert(a == "-456");
The conversion implementation is aggressively optimized. It converts
two digits at a time assisted by fixed-size tables. Converting a
long
to an fbstring
is 3.6x faster than using
boost::lexical_cast
and 2.5x faster than using
sprintf
even though the latter is used in conjunction with
a stack-allocated constant-size buffer.
Note that converting integral types to fbstring has a particular advantage compared to converting to std::string: no integral type (<= 64 bits) has more than 20 decimal digits
including sign. Since fbstring
employs the small string
optimization for up to 23 characters, converting an integral to
fbstring
is guaranteed to not allocate memory, resulting in
significant speed and memory locality gains. Benchmarks reveal a 2x gain
on a typical workload.
char to string conversion

Although char is technically an integral type, most of the time you want the string representation of 'a' to be "a", not 97. That's why folly/Conv.h handles char as a special case that does the expected thing. Note that signed char and unsigned char are still considered integral types.
folly/Conv.h uses V8's double conversion routines. They are accurate and fast; on typical workloads, to<fbstring>(doubleValue) is 1.9x faster than sprintf and 5.5x faster than boost::lexical_cast. (It is also 1.3x faster than to<std::string>(doubleValue).)
const char* to string conversion

For completeness, folly/Conv.h supports const char*, including string literals. The "conversion" consists, of course, of the string itself. Example:
auto s = to<fbstring>("Hello, world");
assert(s == "Hello, world");
folly/Conv.h includes three kinds of parsing routines:

- to<Type>(const char* begin, const char* end) rigidly converts the range [begin, end) to Type. These routines have drastic restrictions (e.g. they allow no leading or trailing whitespace) and are intended as an efficient back-end for more tolerant routines.
- to<Type>(stringy) converts stringy to Type. Value stringy may be of type const char*, StringPiece, std::string, or fbstring. (Technically, the requirement is that stringy implicitly converts to a StringPiece.)
- to<Type>(&stringPiece) parses with progress information: given stringPiece of type StringPiece, it parses as much as possible from it as type Type and alters stringPiece to remove the munched characters. This is easiest clarified by an example:

fbstring s = " 1234 angels on a pin";
StringPiece pc(s);
auto x = to<int>(&pc);
assert(x == 1234);
assert(pc == " angels on a pin");
Note how the routine ate the leading space but not the trailing one.
Parsing integral types is unremarkable - decimal format is expected,
optional '+'
or '-'
sign for signed types, but
no optional '+'
is allowed for unsigned types. The one
remarkable element is speed - parsing typical long
values
is 6x faster than sscanf
. folly/Conv.h
uses
aggressive loop unrolling and table-assisted SIMD-style code arrangement
that avoids integral division (slow) and data dependencies across
operations (ILP-unfriendly). Example:
= " 12345 ";
fbstring str assert(to<int>(str) == 12345);
= " 12345six seven eight";
str (str);
StringPiece pcassert(to<int>(&pc) == 12345);
assert(str == "six seven eight");
folly/Conv.h uses, again, V8's double-conversion routines as back-end. The speed is 3x faster than sscanf and 1.7x faster than in-house routines such as parse<double>. But the more important detail is accuracy - even if you do code a routine that works faster than to<double>, chances are it is incorrect and will fail in a variety of corner cases. Using to<double> is strongly recommended.
Note that if the string “NaN” (with any capitalization) is passed to
to<double>
then NaN
is returned, which
can be tested for as follows:
= "nan"; // "NaN", "NAN", etc.
fbstring str double d = to<double>(str);
if (std::isnan(d)) {
// string was a valid representation of the double value NaN
}
Note that passing “-NaN” (with any capitalization) to
to<double>
also returns NaN
.
Note that if the strings “inf” or “infinity” (with any
capitalization) are passed to to<double>
then
infinity
is returned, which can be tested for as
follows:
= "inf"; // "Inf", "INF", "infinity", "Infinity", etc.
fbstring str double d = to<double>(str);
if (std::isinf(d)) {
// string was a valid representation of one of the double values +Infinity
// or -Infinity
}
Note that passing “-inf” or “-infinity” (with any capitalization) to
to<double>
returns -infinity
rather than
+infinity
. The sign of the infinity
can be
tested for as follows:
= "-inf"; // or "inf", "-Infinity", "+Infinity", etc.
fbstring str double d = to<double>(str);
if (d == std::numeric_limits<double>::infinity()) {
// string was a valid representation of the double value +Infinity
} else if (d == -std::numeric_limits<double>::infinity()) {
// string was a valid representation of the double value -Infinity
}
Note that if an unparseable string is passed to
to<double>
then an exception is thrown, rather than
NaN
being returned. This can be tested for as follows:
= "not-a-double"; // Or "1.1.1", "", "$500.00", etc.
fbstring str double d;
try {
= to<double>(str);
d } catch (const std::range_error &) {
// string could not be parsed
}
Note that the empty string (""
) is an unparseable value,
and will cause to<double>
to throw an exception.
tryTo<T>
is the non-throwing variant of
to<T>
. It returns an
Expected<T, ConversionCode>
. You can think of
Expected
as like an Optional<T>
, but if
the conversion failed, Expected
stores an error code
instead of a T
.
tryTo<T>
has similar performance as
to<T>
when the conversion is successful. On the error
path, you can expect tryTo<T>
to be roughly three
orders of magnitude faster than the throwing to<T>
and to completely avoid any lock contention arising from stack
unwinding.
Here is how to use non-throwing conversions:
auto t1 = tryTo<int>(str);
if (t1.hasValue()) {
  use(t1.value());
}
Expected
has a composability feature to make the above
pattern simpler.
tryTo<int>(str).then([](int i) { use(i); });
folly/dynamic.h
folly/dynamic.h
provides a runtime dynamically typed
value for C++, similar to the way languages with runtime type systems
work (e.g. Python). It can hold types from a predetermined set of types
(ints, bools, arrays of other dynamics, etc), similar to something like
std::variant
, but the syntax is intended to be a little
more like using the native type directly.
Here are some code samples to get started (assumes a
using folly::dynamic;
was used):
dynamic twelve = 12; // creates a dynamic that holds an integer
dynamic str = "string"; // yep, this one is an fbstring

// A few other types.
dynamic nul = nullptr;
dynamic boolean = false;

// Arrays can be initialized with dynamic::array.
dynamic array = dynamic::array("array ", "of ", 4, " elements");
assert(array.size() == 4);
dynamic emptyArray = dynamic::array;
assert(emptyArray.empty());

// Maps from dynamics to dynamics are called objects. The
// dynamic::object constant is how you make an empty map from dynamics
// to dynamics.
dynamic map = dynamic::object;
map["something"] = 12;
map["another_something"] = map["something"] * 2;

// Dynamic objects may be initialized this way
dynamic map2 = dynamic::object("something", 12)("another_something", 24);
Any operation on a dynamic requires checking at runtime that the type
is compatible with the operation. If it isn’t, you’ll get a
folly::TypeError
. Other exceptions can also be thrown if
you try to do something impossible (e.g. if you put a very large 64-bit
integer in and try to read it out as a double).
More examples should hopefully clarify this:
dynamic dint = 42;

dynamic str = "foo";
dynamic anotherStr = str + "something"; // fine
dynamic thisThrows = str + dint; // TypeError is raised
Explicit type conversions can be requested for some of the basic types:
dynamic dint = 12345678;
dynamic doub = dint.asDouble(); // doub will hold 12345678.0
dynamic str = dint.asString(); // str == "12345678"

dynamic hugeInt = std::numeric_limits<int64_t>::max();
dynamic hugeDoub = hugeInt.asDouble(); // throws a folly/Conv.h error,
                                       // since it can't fit in a double
For more complicated conversions, see DynamicConverter.
Equality operators (==, !=) are supported for all types.

- For dynamics of the same type, the underlying equality operator shall apply.
- For comparisons between different numeric types (double and int64), numeric equality will apply - thus 2.0 == 2.
- Values of any other different types will be deemed to not be equal.
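A small illustration of these rules:

dynamic a = 2;
dynamic b = 2.0;
assert(a == b); // numeric equality across int64 and double
dynamic s = "2";
assert(a != s); // values of different non-numeric types are never equal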
Ordering operators (<, <=, >, >=) are supported for all types, except dynamic::object, which will throw if it is involved in an ordering operator.

- For dynamics of the same type, the underlying ordering operator shall apply.
- For comparisons between different numeric types (double and int64), numeric ordering will apply - thus 1.5 < 2.
- Ordering of values between other different types will maintain total ordering properties and be consistent within a given binary run, and thus safe for use in e.g. std::set. The actual ordering is undefined and could change across versions, thus a dependency should not be taken outside of the total ordering property within a given binary.
Hashing is supported by all types, and the hashes of two values will
match if they are equal per dynamic::operator==
.
As a result, numerical types have the same numerical hashing
regardless of int64 vs double - so
e.g. std::hash<dynamic>()(2)
will give the same value
as std::hash<dynamic>()(2.0)
.
You can iterate over dynamic arrays as you would over any C++ sequence container.
= dynamic::array(2, 3, "foo");
dynamic array
for (auto& val : array) {
(val);
doSomethingWith}
You can iterate over dynamic maps by calling items()
,
keys()
, values()
, which behave similarly to
the homonymous methods of Python dictionaries.
dynamic obj = dynamic::object(2, 3)("hello", "world")("x", 4);

for (auto& pair : obj.items()) {
  // Key is pair.first, value is pair.second
  processKey(pair.first);
  processValue(pair.second);
}

for (auto& key : obj.keys()) {
  processKey(key);
}

for (auto& value : obj.values()) {
  processValue(value);
}
You can find an element by key in a dynamic map using the
find()
method, which returns an iterator compatible with
items()
:
dynamic obj = dynamic::object(2, 3)("hello", "world")("x", 4);

auto pos = obj.find("hello");
// pos->first is "hello"
// pos->second is "world"

auto pos2 = obj.find("no_such_key");
// pos2 == obj.items().end()
The original motivation for implementing this type was to try to make dealing with json documents in C++ almost as easy as it is in languages with dynamic type systems (php or javascript, etc). The reader can judge whether we’re anywhere near that goal, but here’s what it looks like:
// Parsing JSON strings and using them.
std::string jsonDocument = R"({"key":12,"key2":[false, null, true, "yay"]})";
dynamic parsed = folly::parseJson(jsonDocument);
assert(parsed["key"] == 12);
assert(parsed["key2"][0] == false);
assert(parsed["key2"][1] == nullptr);

// Building the same document programmatically.
dynamic sonOfAJ = dynamic::object
    ("key", 12)
    ("key2", dynamic::array(false, nullptr, true, "yay"));

// Printing. (See also folly::toPrettyJson)
auto str = folly::toJson(sonOfAJ);
assert(jsonDocument.compare(str) == 0);
Dynamic typing is more expensive than static typing, even when you do it in C++. ;)
However, some effort has been made to keep
folly::dynamic
and the json (de)serialization at least
reasonably performant for common cases. The heap is only used for arrays
and objects, and move construction is fully supported. String formatting
internally also uses the highly performant
folly::to<>
(see folly/Conv.h
).
A trade off to keep in mind though, is that
sizeof(folly::dynamic)
is 64 bytes. You probably don’t want
to use it if you need to allocate large numbers of them (prefer static
types, etc).
Q. Why doesn’t a dynamic string support begin(), end(), and operator[]?
The value_type of a dynamic iterator is dynamic
, and
operator[]
(or the at()
function) has to
return a reference to a dynamic. If we wanted this to work for strings,
this would mean we’d have to support dynamics with a character type, and
moreover that the internal representation of strings would be such that
we can hand out references to dynamic as accessors on individual
characters. There are a lot of potential efficiency drawbacks with this,
and it seems like a feature that is not needed too often in
practice.
Q. Isn’t this just a poor imitation of the C# language feature?
Pretty much.
folly/DynamicConverter.h
When dynamic objects contain data of a known type, it is sometimes
useful to have its well-typed representation. A broad set of
type-conversions are contained in DynamicConverter.h
, and
facilitate the transformation of dynamic objects into their well-typed
format.
Simply pass a dynamic into a templated convertTo:
dynamic d = { { 1, 2, 3 }, { 4, 5 } }; // a vector of vector of int
auto vvi = convertTo<fbvector<fbvector<int>>>(d);
convertTo naturally supports conversions to arithmetic types, strings, and containers of convertible types (see DynamicConverter.h for the full set). If Type meets the container criteria, then it will be constructed by calling its InputIterator constructor.

If you want to use convertTo to convert dynamics into your own custom class, then all you have to do is provide a template specialization of DynamicConverter with the static method convert. Make sure you put it in namespace folly.
Example:
struct Token {
  int kind_;
  fbstring lexeme_;

  explicit Token(int kind, const fbstring& lexeme)
      : kind_(kind), lexeme_(lexeme) {}
};

namespace folly {
template <> struct DynamicConverter<Token> {
  static Token convert(const dynamic& d) {
    int k = convertTo<int>(d["KIND"]);
    fbstring lex = convertTo<fbstring>(d["LEXEME"]);
    return Token(k, lex);
  }
};
}
Run your concurrent code in a performant way
Wangle provides two concrete thread pools (IOThreadPoolExecutor, CPUThreadPoolExecutor) as well as building them in as part of a complete async framework. Generally you might want to grab the global executor, and use it with a future, like this:
auto f = someFutureFunction().via(getCPUExecutor()).then(...)
Or maybe you need to construct a thrift/memcache client, and need an event base:
auto f = getClient(getIOExecutor()->getEventBase())
             ->callSomeFunction(args...)
             .via(getCPUExecutor())
             .then([](Result r) { /* ... do something with result */ });
The current C++11 std::launch only has two modes: async or deferred. In a production system, neither is what you want: async will launch a new thread for every launch without limit, while deferred will defer the work until it is needed lazily, but then do the work in the current thread synchronously when it is needed.
Wangle's thread pools always launch work as soon as possible and have limits on the maximum number of tasks / threads allowed, so we will never use more threads than absolutely needed. See implementation details below about each type of executor.
Unfortunately none of the existing thread pools had every feature needed - things based on pipes are too slow. Several older ones didn't support std::function.
If you want epoll support, you need an fd - event_fd is the latest notification hotness. Unfortunately, an active fd triggers all the epoll loops it is in, leading to thundering herd - so if you want a fair queue (one queue total vs. one queue per worker thread), you need to use some kind of semaphore. Unfortunately semaphores can't be put in epoll loops, so they are incompatible with IO. Fortunately, you usually want to separate the IO and CPU bound work anyway to give stronger tail latency guarantees on IO.
Base class that contains the thread startup/shutdown/stats logic, since this is pretty disjoint from how tasks are actually run
An observer interface is provided to listen for thread start/stop events. This is useful to create objects that should be one-per-thread, but also have them work correctly if threads are added/removed from the thread pool.
PoolStats are provided to get task count, running time, waiting time, etc.
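A minimal usage sketch of the two concrete pools mentioned above (thread counts are illustrative; the observer and stats interfaces are omitted):

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/IOThreadPoolExecutor.h>

int main() {
  folly::CPUThreadPoolExecutor cpuPool(4); // 4 worker threads, one shared queue
  folly::IOThreadPoolExecutor ioPool(2);   // 2 threads, each running an EventBase

  cpuPool.add([] { /* CPU-bound work */ });
  ioPool.add([] { /* IO-bound work; may use the thread's EventBase */ });

  cpuPool.join(); // drain and wait for queued tasks
}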
folly/FBString.h
fbstring
is a drop-in replacement for
std::string
. The main benefit of fbstring
is
significantly increased performance on virtually all important
primitives. This is achieved by using a three-tiered storage strategy
and by cooperating with the memory allocator. In particular,
fbstring
is designed to detect use of jemalloc and
cooperate with it to achieve significant improvements in speed and
memory usage.
fbstring
supports 32- and 64-bit and little- and
big-endian architectures.
Small strings (<= 23 chars) are stored in-situ without memory allocation.
Medium strings (24 - 255 chars) are stored in malloc-allocated memory and copied eagerly.
Large strings (> 255 chars) are stored in malloc-allocated memory and copied lazily.
100% compatible with std::string
.
Thread-safe reference counted copy-on-write for "large" strings (> 255 chars).
Uses malloc
instead of allocators.
Jemalloc-friendly. fbstring
automatically detects if
application uses jemalloc and if so, significantly improves allocation
strategy by using non-standard jemalloc extensions.
find()
is implemented using simplified Boyer-Moore
algorithm. Casual tests indicate a 30x speed improvement over
string::find()
for successful searches and a 1.5x speed
improvement for failed searches.
Offers conversions to and from std::string.
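A small usage sketch (the storage-tier comments restate the categories above, assuming the 23-character in-situ limit):

#include <folly/FBString.h>
#include <cassert>
#include <string>

int main() {
  folly::fbstring s = "small string"; // <= 23 chars: stored in situ, no allocation
  s += ", now grown past the in-situ limit"; // medium: malloc-allocated, copied eagerly
  assert(s.find("grown") != folly::fbstring::npos); // Boyer-Moore-based find()
  std::string std_copy = s.toStdString(); // conversion to std::string
  (void)std_copy;
}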
folly/FBVector.h
Simply replacing std::vector
with
folly::fbvector
(after having included the
folly/FBVector.h
header file) will improve the performance
of your C++ code using vectors with common coding patterns. The
improvements are always non-negative, almost always measurable,
frequently significant, sometimes dramatic, and occasionally
spectacular.
folly::fbvector<int> numbers({0, 1, 2, 3});
numbers.reserve(10);
for (int i = 4; i < 10; i++) {
  numbers.push_back(i * 2);
}
assert(numbers[6] == 12);
std::vector
is the stalwart abstraction many use for
dynamically-allocated arrays in C++. It is also the best known and most
used of all containers. It may therefore seem a surprise that
std::vector
leaves important - and sometimes vital -
efficiency opportunities on the table. This document explains how our
own drop-in abstraction fbvector
improves key performance
aspects of std::vector
. Refer to
folly/test/FBVectorTest.cpp for a few benchmarks.
It is well known that std::vector
grows exponentially
(at a constant factor) in order to avoid quadratic growth performance.
The trick is choosing a good factor. Any factor greater than 1 ensures
O(1) amortized append complexity towards infinity. But a factor that’s
too small (say, 1.1) causes frequent vector reallocation, and one that’s
too large (say, 3 or 4) forces the vector to consume much more memory
than needed.
The initial HP implementation by Stepanov used a growth factor of 2;
i.e., whenever you’d push_back
into a vector without there
being room, it would double the current capacity. This was not a good
choice: it can be mathematically proven that a growth factor of 2 is
rigorously the worst possible because it never allows the vector
to reuse any of its previously-allocated memory. Despite other compilers
reducing the growth factor to 1.5, gcc has staunchly maintained its
factor of 2. This makes std::vector
cache- unfriendly and
memory manager unfriendly.
To see why that’s the case, consider a large vector of capacity C. When there’s a request to grow the vector, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk of memory next to its current chunk, copy its existing data to the new chunk, and then deallocate the old chunk. So now we have a chunk of size C followed by a chunk of size k * C. Continuing this process we’ll then have a chunk of size k * k * C to the right and so on. That leads to a series of the form (using ^^ for power):
C, C*k, C*k^^2, C*k^^3, ...
If we choose k = 2 we know that every element in the series will be strictly larger than the sum of all previous ones because of the remarkable equality:
1 + 2^^1 + 2^^2 + 2^^3... + 2^^n = 2^^(n+1) - 1
This means that any new chunk requested will be larger than all previously used chunks combined, so the vector must crawl forward in memory; it can’t move back to its deallocated chunks. But any number smaller than 2 guarantees that you’ll at some point be able to reuse the previous chunks. For instance, choosing 1.5 as the factor allows memory reuse after 4 reallocations; 1.45 allows memory reuse after 3 reallocations; and 1.3 allows reuse after only 2 reallocations.
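The arithmetic behind those reuse counts can be checked in a few lines (a sketch of the reasoning above, not folly code; note that at the moment of the m-th reallocation the chunk being copied from is still live, so only the chunks before it are reusable):

#include <cstdio>

int main() {
  const double k = 1.5; // growth factor; sizes below are in units of C
  double freed = 0;     // total size of chunks already deallocated
  double live = 1;      // current chunk (starts at size C)
  for (int m = 1; m <= 5; ++m) {
    double request = live * k; // size the m-th reallocation asks for
    std::printf("reallocation %d: request %.4fC, freed prefix %.4fC%s\n",
                m, request, freed,
                request <= freed ? " -> fits in freed space" : "");
    freed += live; // after copying, the old chunk is deallocated
    live = request;
  }
  // With k = 1.5, the request first fits in the freed prefix at the
  // fifth reallocation, i.e. reuse becomes possible after 4 reallocations.
}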
Of course, the above makes a number of simplifying assumptions about
how the memory allocator works, but definitely you don’t want to choose
the theoretically absolute worst growth factor. fbvector
uses a growth factor of 1.5. That does not impede good performance at
small sizes because of the way fbvector
cooperates with
jemalloc (below).
Virtually all modern allocators allocate memory in fixed-size quanta that are chosen to minimize management overhead while at the same time offering good coverage at low slack. For example, an allocator may choose blocks of doubling size (32, 64, 128, and so on).
As discussed above, std::vector
also needs to
(re)allocate in quanta. The next quantum is usually defined in terms of
the current size times the infamous growth constant. Because of this
setup, std::vector
has some slack memory at the end much
like an allocated block has some slack memory at the end.
It doesn’t take a rocket surgeon to figure out that an allocator-
aware std::vector
would be a marriage made in heaven: the
vector could directly request blocks of “perfect” size from the
allocator so there would be virtually no slack in the allocator. Also,
the entire growth strategy could be adjusted to work perfectly with
allocator’s own block growth strategy. That’s exactly what
fbvector
does - it automatically detects the use of
jemalloc and adjusts its reallocation strategy accordingly.
But wait, there’s more. Many memory allocators do not support in-
place reallocation, although most of them could. This comes from the now
notorious design of realloc()
to opaquely perform either
in-place reallocation or an allocate-memcpy-deallocate cycle. Such lack
of control subsequently forced all clib-based allocator designs to avoid
in-place reallocation, and that includes C++’s new
and
std::allocator
. This is a major loss of efficiency because
an in-place reallocation, being very cheap, may mean a much less
aggressive growth strategy. In turn that means less slack memory and
faster reallocations.
One particularly sensitive topic about handling C++ values is that
they are all conservatively considered non- relocatable. In
contrast, a relocatable value would preserve its invariant even if its
bits were moved arbitrarily in memory. For example, an
int32_t
is relocatable because moving its 4 bytes would
preserve its actual value, so the address of that value does not
“matter” to its integrity.
C++’s assumption of non-relocatable values hurts everybody for the benefit of a few questionable designs. The issue is that moving a C++ object “by the book” entails (a) creating a new copy from the existing value; (b) destroying the old value. This is quite vexing and violates common sense; consider this hypothetical conversation between Captain Picard and an incredulous alien:
Incredulous Alien: “So, this teleporter, how does it work?”
Picard: “It beams people and arbitrary matter from one place to
another.”
Incredulous Alien: “Hmmm… is it safe?”
Picard: “Yes,
but earlier models were a hassle. They’d clone the person to another
location. Then the teleporting chief would have to shoot the original.
Ask O’Brien, he was an intern during those times. A bloody mess, that’s
what it was.”
Only a tiny minority of objects are genuinely non-relocatable:
class Ew {
  char buffer[1024];
  char* pointerInsideBuffer;
 public:
  Ew() : pointerInsideBuffer(buffer) {}
  ...
};
The first class of designs - objects holding internal pointers, like Ew above - can always be redone at small or no cost in efficiency. The second class - objects whose observers hold aliasing pointers to them - should not be values in the first place; they should be allocated with new and manipulated using (smart) pointers. It is highly unusual for a value to have observers that alias pointers to it.
Relocatable objects are of high interest to std::vector
because such knowledge makes insertion into the vector and vector
reallocation considerably faster: instead of going to Picard’s
copy-destroy cycle, relocatable objects can be moved around simply by
using memcpy
or memmove
. This optimization can
yield arbitrarily high wins in efficiency; for example, it transforms
vector< vector<double> >
or
vector< hash_map<int, string> >
from risky
liabilities into highly workable compositions.
In order to allow fast relocation without risk, fbvector
uses a trait folly::IsRelocatable
defined in
"folly/Traits.h"
. By default,
folly::IsRelocatable::value
conservatively yields false. If
you know that your type Widget
is in fact relocatable, go
right after Widget
’s definition and write this:
// at global namespace level
namespace folly {
template <>
struct IsRelocatable<Widget> : boost::true_type {};
}
If you don’t do this, fbvector<Widget>
will fail
to compile with a static_assert
.
fbvector
uses a careful implementation all around to
make sure it doesn’t lose efficiency through the cracks. Some future
directions may be in improving raw memory copying (memcpy
is not an intrinsic in gcc and does not work terribly well for large
chunks) and in furthering the collaboration with jemalloc. Have fun!
folly/Format.h
folly/Format.h
provides a fast, powerful, type-safe,
flexible facility for formatting text, using a specification language
similar to Python’s str.format.
By default, it can format strings, numbers (integral and floating
point), and dynamically-typed folly::dynamic
objects, and
can extract values from random-access containers and string-keyed maps.
In many cases, format
is faster than sprintf
as well as being fully type-safe.
Here are some code samples to get started:
using folly::format;
using folly::sformat;
// Objects produced by format() can be streamed without creating
// an intermediary string; {} yields the next argument using default
// formatting.
std::cout << format("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"
// If you just want the string, though, you're covered.
std::string result = sformat("The answers are {} and {}", 23, 42);
// => "The answers are 23 and 42"
// To insert a literal '{' or '}', just double it.
std::cout << format("{} {{}} {{{}}}", 23, 42);
// => "23 {} {42}"
// Arguments can be referenced out of order, even multiple times
std::cout << format("The answers are {1}, {0}, and {1} again", 23, 42);
// => "The answers are 42, 23, and 42 again"
// It's perfectly fine to not reference all arguments
std::cout << format("The only answer is {1}", 23, 42);
// => "The only answer is 42"
// Values can be extracted from indexable containers
// (random-access sequences and integral-keyed maps), and also from
// string-keyed maps
std::vector<int> v {23, 42};
std::map<std::string, std::string> m { {"what", "answer"} };
std::cout << format("The only {1[what]} is {0[1]}", v, m);
// => "The only answer is 42"
// format works with pairs and tuples
std::tuple<int, std::string, int> t {42, "hello", 23};
std::cout << format("{0} {2} {1}", t);
// => "42 23 hello"
// Format supports width, alignment, arbitrary fill, and various
// format specifiers, with meanings similar to printf
// "X<10": fill with 'X', left-align ('<'), width 10
std::cout << format("{:X<10} {}", "hello", "world");
// => "helloXXXXX world"
// Field width may be a runtime value rather than part of the format string
int x = 6;
std::cout << format("{:-^*}", x, "hi");
// => "--hi--"
// Explicit arguments work with dynamic field width, as long as indexes are
// given for both the value and the field width.
std::cout << format("{2:+^*0}",
9, "unused", 456); // => "+++456+++"
// Format supports printf-style format specifiers
std::cout << format("{0:05d} decimal = {0:04x} hex", 42);
// => "00042 decimal = 002a hex"
// Formatter objects may be written to a string using folly::to or
// folly::toAppend (see folly/Conv.h), or by calling their appendTo(),
// and str() methods
std::string s = format("The only answer is {}", 42).str();
std::cout << s;
// => "The only answer is 42"
// Decimal precision usage
std::cout << format("Only 2 decimals is {:.2f}", 23.34134534535);
// => "Only 2 decimals is 23.34"
Format string (format):

"{" [arg_index] ["[" key "]"] [":" format_spec] "}"

- arg_index: index of argument to format; default = next argument. Note that a format string may have either default argument indexes or non-default argument indexes, but not both (to avoid confusion).
- key: if the argument is a container (C-style array or pointer, std::array, vector, deque, map), you may use this to select the element to format; works with random-access sequences and integer- and string-keyed maps. Multiple level keys work as well, with components separated with "."; for example, given map<string, map<string, string>> m, {[foo.bar]} selects m["foo"]["bar"].
- format_spec: format specification, see below

Format specification:

[[fill] align] [sign] ["#"] ["0"] [width] [","] ["." precision] ["."] [type]

- fill (may only be specified if align is also specified): pad with this character (' ' (space) or '0' (zero) might be useful; space is default)
- align: one of '<', '>', '=', '^':
  - '<': left-align (default for most objects)
  - '>': right-align (default for numbers)
  - '=': pad after sign, but before significant digits; used to print -0000120; only valid for numbers
  - '^': center
- sign: one of '+', '-', ' ' (space) (only valid for numbers)
  - '+': output '+' if positive or zero, '-' if negative
  - '-': output '-' if negative, nothing otherwise (default)
  - ' ' (space): output ' ' (space) if positive or zero, '-' if negative
- '#': output base prefix (0 for octal, 0b or 0B for binary, 0x or 0X for hexadecimal; only valid for integers)
- '0': 0-pad after sign, same as specifying "0=" as the fill and align parameters (only valid for numbers)
- width: minimum field width. May be '*' to indicate that the field width is given by an argument. Defaults to the next argument (preceding the value to be formatted) but an explicit argument index may be given following the '*'.
- ',' (comma): output comma as thousands' separator (only valid for integers, and only for decimal output)
- precision (not allowed for integers): number of digits after the decimal point ('f' or 'F' presentation) or number of significant digits ('g' or 'G')
- '.' (when used after precision or in lieu of precision): forces a trailing decimal point to make it clear this is a floating point value
- type: presentation format, see below

Presentation formats:

- String types (folly::StringPiece, std::string, folly::fbstring, const char*):
  - 's' (default)
- Integral types (except bool and char):
  - 'b': output in binary (base 2) ("0b" prefix if '#' specified)
  - 'B': output in binary (base 2) ("0B" prefix if '#' specified)
  - 'c': output as a character (cast to char)
  - 'd': output in decimal (base 10) (default)
  - 'o': output in octal (base 8)
  - 'O': output in octal (base 8) (same as 'o')
  - 'x': output in hexadecimal (base 16) (lower-case digits above 9)
  - 'X': output in hexadecimal (base 16) (upper-case digits above 9)
  - 'n': locale-aware output (currently same as 'd')
- bool: formatted as "true" or "false" as strings
- char: like other integral types, but the default presentation is 'c' instead of 'd'
- Floating point types (float, double; long double is not implemented):
  - 'e': scientific notation using 'e' as exponent character
  - 'E': scientific notation using 'E' as exponent character
  - 'f': fixed point
  - 'F': fixed point (same as 'f')
  - 'g': general; use either 'f' or 'e' depending on magnitude (default)
  - 'G': general; use either 'f' or 'E' depending on magnitude
  - 'n': locale-aware version of 'g' (currently same as 'g')
  - '%': percentage: multiply by 100 then display as 'f'

You can extend format for your own class by providing a specialization for folly::FormatValue. See folly/Format.h and folly/FormatArg.h for details, and the existing specialization for folly::dynamic in folly/dynamic-inl.h for an implementation example.
folly/Function.h
folly::Function
is a polymorphic function wrapper that
is not copyable and does not require the wrapped function to be copy
constructible. It is similar to std::function
, but
different with respect to some interesting features.
There are some limitations in std::function
that
folly::Function
tries to avoid. std::function
is copy-constructible and requires that the callable that it wraps is
copy-constructible as well, which is a constraint that is often
inconvenient. In most cases when using a std::function
you
don’t make use of its copy-constructibility, so you might sometimes feel
like you get back very little in return for a noticeable
restriction.
This restriction becomes apparent when trying to use a lambda
capturing a unique_ptr
(or any non-copyable type) as a
callback for a folly::Future.
std::unique_ptr<Foo> foo_ptr(new Foo);
some_future.then(
    [foo_ptr = std::move(foo_ptr)](int x) mutable { foo_ptr->setX(x); });
This piece of code did not compile before folly::Future
started using folly::Function
instead of
std::function
to store the callback. Because the lambda
captures something non-copyable (the unique_ptr
), it is not
copyable itself. And std::function
can only store copyable
callables.
The implementation of folly::Future did not make use of the
copy-constructibility of std::function
at any point. There
was no benefit from the fact that the std::function
is
copy-constructible, but the fact that it can only wrap
copy-constructible callables posed a restriction.
A workaround was available: folly::MoveWrapper
, which
wraps an object that may be non-copyable and implements copy operations
by moving the embedded object. Using a folly::MoveWrapper
,
you can capture non-copyable objects in a lambda, and the lambda itself
is still copyable and may be wrapped in a std::function
. It
is a pragmatic solution for the above problem, but you have to be a
little careful. The problem is that you can’t use a
MoveWrapper
anywhere where copy operations are assumed to
behave like actual copy operations. Also, a
folly::MoveWrapper<std::unique_ptr<T>>
essentially behaves like auto_ptr<T>
. Ask yourself
whether you’d want to use lots of auto_ptr
s in your
codebase. And the original question still persists: we very often don’t
benefit from copy-constructibility of std::function
, so why
do we have to live with this restriction? I.e. why do we have to use
MoveWrapper
?
folly::Function
is an actual solution to this problem,
as it can wrap non-copyable callables, at the cost of not being
copy-constructible, which more often than not is not a relevant
restriction. folly::Future
now uses
folly::Function
to store callbacks, so the good news is:
the code example from the top of this note is becoming a perfectly valid
way to use future callbacks. The code compiles and behaves as you would
expect.
Here are more details about folly::Function
: much like
std::function
, it wraps an arbitrary object that can be
invoked like a given function type. E.g. a
folly::Function<int(std::string, double)>
can wrap
any callable object that returns an int
(or something that
is convertible to an int
) when invoked with a
std::string
and a double
argument. The
function type is a template parameter of folly::Function
,
but the specific type of the callable is not.
Other than copyability, there is one more significant difference
between std::function
and folly::Function
, and
it concerns const-correctness. std::function
does not
enforce const-correctness: it allows you to store mutable callables
(i.e. callables that may change their inner state when executed, such as
a mutable lambda) and call them in a const context (i.e. when you only
have access to a const reference to the std::function
object). For example:
class FooBar {
public:
void call_func() const {
func_();
}
private:
std::function<void()> func_;
};
The call_func
member function is declared const.
However, when it calls func_()
, it may change the inner
state of func_
, and thereby the inner state of the
FooBar
object. Inside the FooBar
class,
func_
is like a non-const method that is callable from
const methods.
Some people consider std::function
in the standard
broken with respect to this. (Paper N4348 explains this problem in more
detail.) It also lists possible ways to fix the problem.
folly::Function
, however, goes a different way: you have to
declare whether you want to store a const function, in which case you
can invoke any reference (const or non-const) of the
folly::Function
, or a non-const (mutable) function, in
which case you need a non-const reference to the
folly::Function
to be able to invoke it. In the above
example, let’s say that func_
stores a const function,
which makes it okay that it gets invoked from call_func
(a
const method). Instead of std::function
, you could use
folly::Function<void() const>
for the
func_
member.
Const-ness is part of a function type. To illustrate:
class Foo {
public:
int operator()() { return 1; }
int operator()(char const*) { return 2; }
int operator()(int) { return 3; }
int operator()(int) const { return 4; }
int operator()(int, int) const { return 5; }
};
You can overload methods multiple times using different argument
signatures. Const-ness is part of that signature, so even for the same
set of argument types you can overload a const and a non-const version.
It’s not even particularly unusual to do that. Take for instance the
begin()
method of STL container types: begin()
returns an iterator
, begin() const
returns a
const_iterator
. folly::Function
allows you to
select a specific overload easily:
folly::Function<int()> uf1 = Foo();
// uf1() returns 1
folly::Function<int(char const*)> uf2 = Foo();
// uf2() returns 2
folly::Function<int(int)> uf3 = Foo();
// uf3() returns 3
folly::Function<int(int) const> uf4 = Foo();
// uf4() returns 4
folly::Function<int(int, int) const> uf5 = Foo();
// uf5() returns 5
If cfoo
is a const-reference to a Foo
object, cfoo(int)
returns 4. If foo
is a
non-const reference to a Foo
object, foo(int)
returns 3. Normal const-to-non-const conversion behavior applies: if you
call foo(int, int)
it will return 5: a non-const reference
will invoke the const method if no non-const method is defined. Which
leads to the following behavior:
folly::Function<int(int, int)> uf5nc = Foo();
// uf5nc() returns 5
If you are wondering if the introduction of const function types means that you have to change lots of normal function types to const function types if you want to use folly::Function: not really, or at least not as much as you might think. There are only two reasons to use a folly::Function with a const function type:

- a callable object defines both const and non-const operator() and you explicitly want to select the const one
- you need to invoke a folly::Function from a const context (i.e. you only have a const reference to the folly::Function)
In practice, you will not need the const variant very often. Adding const to a function type adds a restriction for the callable: it must not change its inner state when invoked. If you don’t care whether it does or not, don’t worry about const!
A folly::Function<R(Args...) const>
can be
converted into a folly::Function<R(Args...)>
: either
way the stored callable will not change its inner state when invoked.
The former type expresses and guarantees that, the latter does not. When
you get rid of the const, the selected function stays the same:
folly::Function<int(int)> uf4nc = std::move(uf4);
// uf4nc() returns 4, not 3!
If you want to go the other way, you are talking about a (potentially dangerous) const cast: a callable that may or may not change its inner state is declared as one that guarantees not to do that. Proceed at your own risk! This conversion does not happen implicitly, though:
folly::Function<int() const> uf1c = folly::constCastFunction(std::move(uf1));
// uf1c() returns 1
Admittedly, seeing const function types as template parameters is
unfamiliar. As far as I am aware it happens nowhere in the standard
library. But it is the most consistent way to deal with the issue of
const-correctness here. Const qualifiers are part of a function type, as
a matter of fact. If you require a const-qualified function to be
wrapped in a folly::Function
, just declare it as that! More
often than not you will find that you do not need the const qualifier.
While folly::Function was being implemented, a good set of unit tests
existed before const function types were introduced. Not a single one
of those unit tests had to be changed: they all compiled and passed
after the introduction of const function types. New ones were added to
test const-correctness, of course. But in your day-to-day use of
folly::Function you won't have to worry about const very often.
Like most implementations of std::function
,
folly::Function
stores small callable objects in-line and
larger callable objects will be stored on the heap.
folly::Function
returns the size of the allocation it has
made from its heapAllocatedMemory()
member function;
naturally it will return 0
when the callable is stored
in-line.
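To illustrate, here is a small sketch (the 1024-byte capture is just an arbitrary size chosen to exceed the in-line buffer, whose exact capacity is an implementation detail):

#include <folly/Function.h>

#include <array>
#include <cassert>

int main() {
  // A captureless lambda is small and is stored in-line:
  folly::Function<int()> small = [] { return 1; };
  assert(small.heapAllocatedMemory() == 0);

  // A large capture exceeds the in-line buffer and is heap-allocated:
  std::array<char, 1024> big{};
  folly::Function<int()> large = [big] { return int(big[0]); };
  assert(large.heapAllocatedMemory() > 0);
}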
Folly Futures is an async C++ framework inspired by Twitter’s Futures implementation in Scala (see also Future.scala, Promise.scala, and friends), and loosely builds upon the existing but anemic Futures code found in the C++11 standard (std::future) and boost::future (especially >= 1.53.0). Although inspired by the C++11 std::future interface, it is not a drop-in replacement because some ideas don’t translate well enough to maintain API compatibility.
The primary difference from std::future is that you can attach
callbacks to Futures (with thenValue
or
thenTry
), under the control of an executor to manage where
work runs, which enables sequential and parallel composition of Futures
for cleaner asynchronous code.
#include <folly/futures/Future.h>
#include <folly/executors/ThreadedExecutor.h>
using namespace folly;
using namespace std;

void foo(int x) {
  // do something with x
  cout << "foo(" << x << ")" << endl;
}

// ...

folly::ThreadedExecutor executor;
cout << "making Promise" << endl;
Promise<int> p;
Future<int> f = p.getSemiFuture().via(&executor);
auto f2 = move(f).thenValue(foo);
cout << "Future chain made" << endl;

// ... now perhaps in another event callback

cout << "fulfilling Promise" << endl;
p.setValue(42);
move(f2).get();
cout << "Promise fulfilled" << endl;
This would print:
making Promise
Future chain made
fulfilling Promise
foo(42)
Promise fulfilled
In addition to this document, there is a blog post on code.facebook.com (June 2015).
This brief guide covers the basics. For a more in-depth coverage skip to the appropriate section.
Let’s begin with an example using an imaginary simplified Memcache client interface:
using std::string;

class MemcacheClient {
 public:
  struct GetReply {
    enum class Result {
      FOUND,
      NOT_FOUND,
      SERVER_ERROR,
    };

    Result result;
    // The value when result is FOUND,
    // the error message when result is SERVER_ERROR or CLIENT_ERROR,
    // undefined otherwise
    string value;
  };

  GetReply get(string key);
};
This API is synchronous, i.e. when you call get()
you
have to wait for the result. This is very simple, but unfortunately it
is also very easy to write very slow code using synchronous APIs.
Now, consider this traditional asynchronous signature for the same operation:
int async_get(string key, std::function<void(GetReply)> callback);
When you call async_get()
, your asynchronous operation
begins and when it finishes your callback will be called with the
result. Very performant code can be written with an API like this, but
for nontrivial applications the code devolves into a special kind of
spaghetti code affectionately referred to as “callback hell”.
The Future-based API looks like this:
SemiFuture<GetReply> future_get(string key);
A SemiFuture<GetReply>
or
Future<GetReply>
is a placeholder for the
GetReply
that we will eventually get. For most of the
descriptive text below, Future can refer to either
folly::SemiFuture
or folly::Future
as the
former is a safe subset of the latter. A Future usually starts life out
“unfulfilled”, or incomplete, i.e.:
fut.isReady() == false
fut.value() // will throw an exception because the Future is not ready
At some point in the future, the Future
will have been
fulfilled, and we can access its value.
fut.isReady() == true
GetReply& reply = fut.value();
Futures support exceptions. If the asynchronous producer fails with an exception, your Future may represent an exception instead of a value. In that case:
fut.isReady() == true
fut.value() // will rethrow the exception
Just what is exceptional depends on the API. In our example we have
chosen not to raise exceptions for SERVER_ERROR
, but
represent this explicitly in the GetReply
object. On the
other hand, an astute Memcache veteran would notice that we left
CLIENT_ERROR
out of GetReply::Result
, and
perhaps a CLIENT_ERROR
would have been raised as an
exception, because CLIENT_ERROR
means there’s a bug in the
library and this would be truly exceptional. These decisions are
judgement calls by the API designer. The important thing is that
exceptional conditions (including and especially spurious exceptions
that nobody expects) get captured and can be handled higher up the
“stack”.
So far we have described a way to initiate an asynchronous operation via an API that returns a Future, and then sometime later after it is fulfilled, we get its value. This is slightly more useful than a synchronous API, but it’s not yet ideal. There are two more very important pieces to the puzzle.
First, we can aggregate Futures, to define a new Future that completes after some or all of the aggregated Futures complete. Consider two examples: fetching a batch of requests and waiting for all of them, and fetching a group of requests and waiting for only one of them.
MemcacheClient mc;

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto all = collectAll(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto any = collectAny(futs.begin(), futs.end());

vector<SemiFuture<GetReply>> futs;
for (auto& key : keys) {
  futs.push_back(mc.future_get(key));
}
auto anyv = collectAnyWithoutException(futs.begin(), futs.end());
all and any are Futures (for the exact type and usage see the header
files). They will be complete when all/one of futs are complete,
respectively. (There is also collectN() for when you need some of them,
and collectAnyWithoutException for when you need the first value that
completes without an exception.)
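For instance, a sketch of consuming the aggregate result (assuming a ThreadedExecutor like the one above; collectAll yields a future of std::vector<folly::Try<GetReply>>, so each element must be inspected individually):

std::move(all).via(&executor).thenValue(
    [](std::vector<folly::Try<GetReply>> replies) {
      for (auto& t : replies) {
        if (t.hasValue()) {
          // use t.value()
        } else {
          // t.exception() holds the error
        }
      }
    });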
Second, we can associate a Future with an executor. An executor
specifies where work will run, and we detail this more later. In
summary, given an executor we can convert a SemiFuture
to a
Future
with an executor, or a Future
on one
executor to a Future
on another executor.
For example:
folly::ThreadedExecutor executor;
SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);
Once an executor is attached, a Future
allows
continuations to be attached and chained together monadically. An
example will clarify:
SemiFuture<GetReply> semiFut = mc.future_get("foo");
Future<GetReply> fut1 = std::move(semiFut).via(&executor);

Future<string> fut2 = std::move(fut1).thenValue(
    [](GetReply reply) {
      if (reply.result == MemcacheClient::GetReply::Result::FOUND)
        return reply.value;
      throw SomeException("No value");
    });

Future<Unit> fut3 = std::move(fut2)
    .thenValue([](string str) {
      cout << str << endl;
      return str; // pass the string along so the next callback sees Try<string>
    })
    .thenTry([](folly::Try<string> strTry) {
      cout << strTry.value() << endl;
    })
    .thenError(folly::tag_t<std::exception>{}, [](std::exception const& e) {
      cerr << e.what() << endl;
    });
That example is a little contrived but the idea is that you can transform a result from one type to another, potentially in a chain, and unhandled errors propagate. Of course, the intermediate variables are optional.
Using .thenValue or .thenTry to add callbacks is idiomatic. It brings
all the code into one place, which avoids callback hell. .thenValue
appends a continuation that takes T&& for some Future<T>; an error
bypasses the callback and is passed along to the next one. .thenTry
takes a callback taking folly::Try<T>, which encapsulates both value
and exception. thenError(tag_t<ExceptionType>{}, ...) bypasses a value
and only runs if there is an exception; the ExceptionType template
parameter to the tag type allows filtering by exception type. The
tag_t<ExceptionType>{} argument is optional: if no tag is passed, the
function will be parameterized with a folly::exception_wrapper. For
C++17 there is a global inline variable, and folly::tag<ExceptionType>
may be passed directly without explicit construction.
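As a sketch of the untagged form (fut here is an assumed Future<string>; the callback receives a folly::exception_wrapper and must produce a value of the future's type):

Future<string> recovered = std::move(fut)
    .thenError([](folly::exception_wrapper&& ew) {
      cerr << ew.what() << endl;
      return string("fallback");
    });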
Up to this point we have skirted around the matter of waiting for
Futures. You may never need to wait for a Future, because your code is
event-driven and all follow-up action happens in a then-block. But if
you want to have a batch workflow, where you initiate a batch of
asynchronous operations and then wait for them all to finish at a
synchronization point, then you will want to wait for a Future. Futures
have a blocking method called wait() that does exactly that and
optionally takes a timeout.
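A small sketch, reusing the MemcacheClient from earlier (get() blocks until the result is ready and rethrows if the Future holds an exception):

SemiFuture<GetReply> semi = mc.future_get("foo");
// Block until the result is available (wait() optionally takes a timeout):
std::move(semi).wait();
// Or block and extract the value in one step:
GetReply reply = mc.future_get("bar").get();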
Futures are partially threadsafe. A Promise or Future can migrate
between threads as long as there’s a full memory barrier of some sort.
Future::thenValue
and Promise::setValue
(and
all variants that boil down to those two calls) can be called from
different threads. But, be warned that you might be
surprised about which thread your callback executes on. Let’s consider
an example, where we take a future straight from a promise, without
going via the safer SemiFuture, and where we therefore have a
Future
that does not carry an executor. This is in general
something to avoid.
// Thread A
Promise<Unit> p;
auto f = p.getFuture();

// Thread B
std::move(f).thenValue(x).thenValue(y).thenTry(z);

// Thread A
p.setValue();
This is legal and technically threadsafe. However, it is important to
realize that you do not know in which thread x
,
y
, and/or z
will execute. Maybe they will
execute in Thread A when p.setValue()
is called. Or, maybe
they will execute in Thread B when f.thenValue
is called.
Or, maybe x
will execute in Thread A, but y
and/or z
will execute in Thread B. There’s a race between
setValue
and then
—whichever runs last will
execute the callback. The only guarantee is that one of them will run
the callback.
For safety, .via
should be preferred. We can chain
.via
operations to give very strong control over where
callbacks run:
std::move(aFuture)
.thenValue(x)
.via(e1).thenValue(y1).thenValue(y2)
.via(e2).thenValue(z);
x
will execute in the context of the executor associated
with aFuture
. y1
and y2
will
execute in the context of e1
, and z
will
execute in the context of e2
. If after z
you
want to get back to the original context, you need to get there with a
call to via
passing the original executor.
If you are wrapping an asynchronous operation, or providing an
asynchronous API to users, then you will want to make
Promise
s. Every Future has a corresponding Promise (except
Futures that spring into existence already completed, with
makeFuture()
). Promises are simple: you make one, you
extract the Future, and you fulfill it with a value or an exception.
Example:
Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setValue(42);

f.isReady() == true
f.value() == 42
and an exception example:
Promise<int> p;
SemiFuture<int> f = p.getSemiFuture();

f.isReady() == false

p.setException(std::runtime_error("Fail"));

f.isReady() == true
f.value() // throws the exception
It’s good practice to use setWith which takes a function and automatically captures exceptions, e.g.
Promise<int> p;
p.setWith([]{
  try {
    // do stuff that may throw
    return 42;
  } catch (MySpecialException const& e) {
    // handle it
    return 7;
  }
  // Any exceptions that we didn't catch, will be caught for us
});
folly/GroupVarint.h
folly/GroupVarint.h
is an implementation of
variable-length encoding for 32- and 64-bit integers using the Group
Varint encoding scheme as described in Jeff Dean’s WSDM
2009 talk and in Information
Retrieval: Implementing and Evaluating Search Engines.
Briefly, a group of four 32-bit integers is encoded as a sequence of variable length, between 5 and 17 bytes; the first byte encodes the length (in bytes) of each integer in the group. A group of five 64-bit integers is encoded as a sequence of variable length, between 7 and 42 bytes; the first two bytes encode the length (in bytes) of each integer in the group.
GroupVarint.h defines a few classes:

* GroupVarint<T>, where T is uint32_t or uint64_t: basic encoding /
  decoding interface, mainly aimed at encoding / decoding one group at
  a time.

* GroupVarintEncoder<T, Output>, where T is uint32_t or uint64_t, and
  Output is a functor that accepts StringPiece objects as arguments:
  streaming encoder. Add values one at a time, and they will be flushed
  to the output one group at a time. Handles the case where the last
  group is incomplete (the number of integers to encode isn't a
  multiple of the group size).

* GroupVarintDecoder<T>, where T is uint32_t or uint64_t: streaming
  decoder. Extract values one at a time. Handles the case where the
  last group is incomplete.
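As a quick illustration, here is a hedged sketch of encoding and decoding a single group of 32-bit values (signatures simplified; consult the header for the exact interface):

#include <folly/GroupVarint.h>

void roundTrip() {
  // One group of four 32-bit values occupies between 5 and 17 bytes.
  char buf[17];
  folly::GroupVarint32::encode(buf, 1, 100, 100000, 1000000000);

  uint32_t a, b, c, d;
  folly::GroupVarint32::decode(buf, &a, &b, &c, &d);
  // a == 1, b == 100, c == 100000, d == 1000000000
}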
The 32-bit implementation is significantly faster than the 64-bit implementation; on platforms supporting the SSSE3 instruction set, we use the PSHUFB instruction to speed up lookup, as described in SIMD-Based Decoding of Posting Lists (CIKM 2011).
For more details, see the header file
folly/GroupVarint.h
and the associated test file
folly/test/GroupVarintTest.cpp
.
folly/stats/Histogram.h
Histogram
Histogram.h
defines a simple histogram class, templated
on the type of data you want to store. This class is useful for tracking
a large stream of data points, where you want to remember the overall
distribution of the data, but do not need to remember each data point
individually.
Each histogram bucket stores the number of data points that fell in the bucket, as well as the overall sum of the data points in the bucket. Note that no overflow checking is performed, so if you have a bucket with a large number of very large values, it may overflow and cause inaccurate data for this bucket. As such, the histogram class is not well suited to storing data points with very large values. However, it works very well for smaller data points such as request latencies, request or response sizes, etc.
In addition to providing access to the raw bucket data, the
Histogram
class also provides methods for estimating
percentile values. This allows you to estimate the median value (the
50th percentile) and other values such as the 95th or 99th
percentiles.
All of the buckets have the same width. The number of buckets and bucket width is fixed for the lifetime of the histogram. As such, you do need to know your expected data range ahead of time in order to have accurate statistics. The histogram does keep one bucket to store all data points that fall below the histogram minimum, and one bucket for the data points above the maximum. However, because these buckets don’t have a good lower/upper bound, percentile estimates in these buckets may be inaccurate.
HistogramBuckets
The Histogram
class is built on top of
HistogramBuckets
. HistogramBuckets
provides an
API very similar to Histogram
, but allows a user-defined
bucket class. This allows users to implement more complex histogram
types that store more than just the count and sum in each bucket.
When computing percentile estimates HistogramBuckets
allows user-defined functions for computing the average value and data
count in each bucket. This allows you to define more complex buckets
which may have multiple different ways of computing the average value
and the count.
For example, one use case could be tracking timeseries data in each bucket. Each set of timeseries data can have independent data in the bucket, which can show how the data distribution is changing over time.
Say we have code that sends many requests to remote services, and we want to generate a histogram showing how long the requests take. The following code will initialize a histogram with 50 buckets, tracking values between 0 and 5000. (There are 50 buckets since the bucket width is specified as 100. If the bucket width is not an even multiple of the histogram range, the last bucket will simply be shorter than the others.)
folly::Histogram<int64_t> latencies(100, 0, 5000);
The addValue() method is used to add values to the histogram. Each time a request finishes we can add its latency to the histogram:
latencies.addValue(now - startTime);
You can access each of the histogram buckets to display the overall distribution. Note that bucket 0 tracks all data points that were below the specified histogram minimum, and the last bucket tracks the data points that were above the maximum.
auto numBuckets = latencies.getNumBuckets();
cout << "Below min: " << latencies.getBucketByIndex(0).count << "\n";
for (unsigned int n = 1; n < numBuckets - 1; ++n) {
  cout << latencies.getBucketMin(n) << "-" << latencies.getBucketMax(n)
       << ": " << latencies.getBucketByIndex(n).count << "\n";
}
cout << "Above max: "
     << latencies.getBucketByIndex(numBuckets - 1).count << "\n";
You can also use the getPercentileEstimate()
method to
estimate the value at the Nth percentile in the distribution. For
example, to estimate the median, as well as the 95th and 99th percentile
values:
int64_t median = latencies.getPercentileEstimate(0.5);
int64_t p95 = latencies.getPercentileEstimate(0.95);
int64_t p99 = latencies.getPercentileEstimate(0.99);
Note that Histogram
and HistogramBuckets
objects are not thread-safe. If you wish to access a single
Histogram
from multiple threads, you must perform your own
locking to ensure that multiple threads do not access it at the same
time.
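One option, sketched below, is to guard the histogram with folly::Synchronized (described later in this document); the exact locking strategy is up to you:

#include <folly/Synchronized.h>
#include <folly/stats/Histogram.h>

folly::Synchronized<folly::Histogram<int64_t>> latencies{
    folly::Histogram<int64_t>(100, 0, 5000)};

void recordLatency(int64_t value) {
  latencies.wlock()->addValue(value);
}

int64_t estimateMedian() {
  // rlock() is sufficient here: percentile estimation is read-only.
  return latencies.rlock()->getPercentileEstimate(0.5);
}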
folly/
Below is a list of (some) Folly components in alphabetical order, along with a brief description of each.
Arena.h, ThreadCachedArena.h
Simple arena for memory allocation: multiple allocations get freed all at once. With threaded version.
AtomicHashMap.h, AtomicHashArray.h, AtomicLinkedList.h, …

High-performance atomic data structures. Many of these are built with very specific tradeoffs and constraints in mind that make them faster than their more general counterparts. Each header should contain information about what these tradeoffs are.
Baton.h
A Baton allows a thread to block once and be awoken: it captures a
single handoff. It is essentially a (very small, very fast) semaphore
that supports only a single call to sem_post and a single call to
sem_wait.
Benchmark.h
A small framework for benchmarking code. Client code registers benchmarks, optionally with an argument that dictates the scale of the benchmark (iterations, working set size etc). The framework runs benchmarks (subject to a command-line flag) and produces formatted output with timing information.
Bits.h
Various bit manipulation utilities optimized for speed; includes functions that wrap the ffsl(l) primitives in a uniform interface.
ConcurrentSkipList.h
An implementation of the structure described in A Provably Correct Scalable Concurrent Skip List by Herlihy et al.
Conv.h
A variety of data conversion routines (notably to and from string), optimized for speed and safety.
Demangle.h
Pretty-printing C++ types.
DiscriminatedPtr.h
Similar to std::variant
, but restricted to pointers
only. Uses the highest-order unused 16 bits in a pointer as
discriminator. So
sizeof(DiscriminatedPtr<int, string, Widget>) == sizeof(void*)
.
dynamic.h
Dynamically-typed object, created with JSON objects in mind.
DynamicConverter.h
is a utility for efficiently converting
from a dynamic
to a more concrete structure when the scheme
is known (e.g. json -> map<int,int>
).
EvictingCacheMap.h
A simple LRU hash map.
FBString.h
A drop-in implementation of std::string
with a variety
of optimizations.
FBVector.h
A mostly drop-in implementation of std::vector
with a
variety of optimizations.
File.h
A C++ abstraction around files.
Fingerprint.h
Rabin fingerprinting.
Function.h
A polymorphic wrapper for callables similar to
std::function
but not copyable and therefore able to wrap
non-copyable callables, such as lambdas that capture move-only types
like std::unique_ptr
or folly::Promise
.
futures/
Futures is a framework for expressing asynchronous code in C++ using the Promise/Future pattern.
Format.h
Python-style formatting utilities.
gen/
This library makes it possible to write declarative comprehensions for processing sequences of values efficiently in C++ akin to C#’s LINQ.
GroupVarint.h
Group Varint encoding for 32-bit values.
IPAddress.h
A collection of utilities to deal with IPAddresses, including ipv4 and ipv6.
io/
A collection of useful abstractions for high-performance I/O. This is heavily relied upon in Facebook's internal networking code.
Hash.h
Various popular hash function implementations.
Histogram.h
A simple class for collecting histogram data.
IntrusiveList.h
Convenience type definitions for using
boost::intrusive_list
.
json.h
JSON serializer and deserializer. Uses dynamic.h
.
Likely.h
Wrappers around __builtin_expect
.
Malloc.h, Memory.h
Memory allocation helpers, particularly when using jemalloc.
MicroSpinLock.h
A really, really small spinlock for fine-grained locking of lots of teeny-tiny data.
MPMCQueue.h

MPMCQueue<T> is a high-performance bounded concurrent queue that supports multiple producers and multiple consumers. The additional utility MPMCPipeline.h is an extension that lets you chain several queues together with processing steps in between.
PackedSyncPtr.h
A highly specialized data structure consisting of a pointer, a 1-bit spin lock, and a 15-bit integral, all inside one 64-bit word.
Poly.h
A class template that makes it relatively easy to define a type-erasing polymorphic object wrapper.
Preprocessor.h
Necessarily evil stuff.
ProducerConsumerQueue.h
Lock free single-reader, single-writer queue.
Random.h
Defines only one function—randomNumberSeed()
.
Range.h
Boost-style range facility and the StringPiece
specialization.
RWSpinLock.h
Fast and compact reader-writer spin lock.
ScopeGuard.h
C++11 incarnation of the old ScopeGuard idiom.
Singleton.h
A singleton to rule the singletons. This is an attempt to insert a
layer between C++ statics and the fiasco that ensues, so that things can
be created, and destroyed, correctly upon program creation, program end
and sometimes dlopen
and fork
.
Singletons are bad for you, but this may help.
SmallLocks.h
Very small spin locks (1 byte and 1 bit).
small_vector.h
Vector with the small buffer optimization and an optional embedded
PicoSpinLock
.
sorted_vector_types.h
Collections similar to std::map
but implemented as
sorted vectors.
stats/
A collection of efficient utilities for collecting statistics: * time series counters, gauges, histograms, and quantiles; * single-pass mean and variance.
StlAllocator.h
STL allocator wrapping a simple allocate/deallocate interface.
String.h
String utilities that connect folly::fbstring
with
std::string
.
Subprocess.h
Subprocess library, modeled after Python’s subprocess module.
Synchronized.h
High-level synchronization library.
System.h
Demangling and errno utilities.
ThreadCachedInt.h
High-performance atomic increment using thread caching.
ThreadLocal.h
Improved thread local storage for non-trivial types.
TimeoutQueue.h
Queue with per-item timeout.
Traits.h
Type traits that complement those defined in the standard C++11 header <type_traits>.
Unicode.h
Defines the codePointToUtf8
function.
Uri.h
A collection of utilities to deal with URIs.
folly/PackedSyncPtr.h
A highly specialized data structure consisting of a pointer, a 1-bit
spin lock, and a 15-bit integral packed into
sizeof(void*)
.
Typical application is for microsharding of many elements within containers. Because there is no memory overhead, an arbitrarily large number of locks can be used to minimize lock contention with no memory penalty. Additionally, excellent cache performance is obtained by storing the lock inline with the pointer (no additional cache miss or false sharing). Finally, because it uses a simple spinlock mechanism, the cost of acquiring an uncontended lock is minimal.
This is not a “smart” pointer: nothing automagical is going on here. Locking is up to the user. Resource deallocation is up to the user. Locks are never acquired or released outside explicit calls to lock() and unlock().
Change the value of the raw pointer with set(), but you must hold the lock when calling this function if multiple threads could be using it.
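A minimal usage sketch (init() must be called before any other operation; the int payload is arbitrary):

#include <folly/PackedSyncPtr.h>

folly::PackedSyncPtr<int> p;
p.init();           // must be called before first use

p.lock();           // acquire the embedded 1-bit spinlock
p.set(new int(5));  // hold the lock if other threads may be using p
p.setExtra(1);      // the 15-bit integral field
p.unlock();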
Here is an example of using a PackedSyncPtr to build a synchronized
vector with no memory overhead: the spinlock and size are stored in the
16 unused bits of the pointer, the rest of which points to the actual data.
See folly/small_vector.h
for a complete implementation of
this concept.
template<typename T>
class SyncVec {
  PackedSyncPtr<T> base;
 public:
  SyncVec() { base.init(); }
  void push_back(const T& t) {
    base.set(
      static_cast<T*>(realloc(base.get(), (base.extra() + 1) * sizeof(T))));
    base[base.extra()] = t;
    base.setExtra(base.extra() + 1);
  }
  size_t size() const {
    return base.extra();
  }
  void lock() {
    base.lock();
  }
  void unlock() {
    base.unlock();
  }
  T* begin() const {
    return base.get();
  }
  T* end() const {
    return base.get() + base.extra();
  }
};
This is using an x64-specific detail about the effective virtual address space. Long story short: the upper two bytes of all our pointers will be zero in reality—and if you have a couple billion such pointers in core, it makes pretty good sense to try to make use of that memory. The exact details can be perused here:
http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses
folly/Poly.h
Poly
is a class template that makes it relatively easy
to define a type-erasing polymorphic object wrapper.
std::function
is one example of a type-erasing
polymorphic object wrapper; folly::exception_wrapper
is
another. Type-erasure is often used as an alternative to dynamic
polymorphism via inheritance-based virtual dispatch. The distinguishing
characteristics of type-erasing wrappers are duck typing (a type needs
only to satisfy the wrapper's interface, not inherit from a common base
class) and value semantics. Inheritance-based polymorphism typically
forces objects onto the heap, which leads to a proliferation of
shared_ptrs and unique_ptrs in APIs, complicating their point-of-use.
APIs that take type-erasing wrappers, on the other hand, can often
store small objects in-situ, with no dynamic allocation. The memory
management, if any, is handled for you, and leads to cleaner APIs:
consumers of your API don't need to pass
shared_ptr<AbstractBase>; they can simply pass any object that
satisfies the interface you require. (std::function is a particularly
compelling example of this benefit. Far worse would be an
inheritance-based callable solution like
shared_ptr<ICallable<void(int)>>.)

folly::Poly
Defining a polymorphic wrapper with Poly is a matter of defining two things:

* an interface, consisting of public member functions, and
* a mapping from a concrete type to a set of member function bindings.
Below is a simple example program that defines a
drawable
wrapper for any type that provides a
draw
member function. (The details will be explained
later.)
// This example is an adaptation of one found in Louis Dionne's dyno library.
#include <folly/Poly.h>
#include <iostream>
struct IDrawable {
// Define the interface of something that can be drawn:
template <class Base> struct Interface : Base {
void draw(std::ostream& out) const { folly::poly_call<0>(*this, out);}
};
// Define how concrete types can fulfill that interface (in C++17):
template <class T> using Members = folly::PolyMembers<&T::draw>;
};
// Define an object that can hold anything that can be drawn:
using drawable = folly::Poly<IDrawable>;
struct Square {
void draw(std::ostream& out) const { out << "Square\n"; }
};
struct Circle {
void draw(std::ostream& out) const { out << "Circle\n"; }
};
void f(drawable const& d) {
  d.draw(std::cout);
}

int main() {
  f(Square{}); // prints Square
  f(Circle{}); // prints Circle
}
The above program prints:
Square
Circle
Here is another (heavily commented) example of a simple
implementation of a std::function
-like polymorphic wrapper.
Its interface has only a single member function:
operator()
// An interface for a callable object of a particular signature, Fun
// (most interfaces don't need to be templates, FWIW).
template <class Fun>
struct IFunction;
template <class R, class... As>
struct IFunction<R(As...)> {
// An interface is defined as a nested class template called
// Interface that takes a single template parameter, Base, from
// which it inherits.
template <class Base>
struct Interface : Base {
// The Interface has public member functions. These become the
// public interface of the resulting Poly instantiation.
// (Implementation note: Poly<IFunction<Sig>> will publicly
// inherit from this struct, which is what gives it the right
// member functions.)
    R operator()(As... as) const {
      // The definition of each member function in your interface will
      // always consist of a single line dispatching to folly::poly_call<N>.
      // The "N" corresponds to the N-th member function in the
      // list of member function bindings, Members, defined below.
      // The first argument will always be *this, and the rest of the
      // arguments should simply forward (if necessary) the member
      // function's arguments.
      return static_cast<R>(
          folly::poly_call<0>(*this, std::forward<As>(as)...));
    }
  };
// The "Members" alias template is a comma-separated list of bound
// member functions for a given concrete type "T". The
// "FOLLY_POLY_MEMBERS" macro accepts a comma-separated list, and the
// (optional) "FOLLY_POLY_MEMBER" macro lets you disambiguate overloads
// by explicitly specifying the function signature the target member
// function should have. In this case, we require "T" to have a
// function call operator with the signature `R(As...) const`.
//
// If you are using a C++17-compatible compiler, you can do away with
// the macros and write this as:
//
// template <class T>
// using Members =
// folly::PolyMembers<folly::sig<R(As...) const>(&T::operator())>;
//
// And since `folly::sig` is only needed for disambiguation in case of
// overloads, if you are not concerned about objects with overloaded
// function call operators, it could be further simplified to:
//
// template <class T>
// using Members = folly::PolyMembers<&T::operator()>;
//
  template <class T>
  using Members = FOLLY_POLY_MEMBERS(
      FOLLY_POLY_MEMBER(R(As...) const, &T::operator()));
};
// Now that we have defined the interface, we can pass it to Poly to
// create our type-erasing wrapper:
template <class Fun>
using Function = Poly<IFunction<Fun>>;
Given the above definition of Function
, users can now
initialize instances of (say) Function<int(int, int)>
with function objects like std::plus<int>
and
std::multiplies<int>
, as below:
Function<int(int, int)> fun = std::plus<int>{};
assert(5 == fun(2, 3));

fun = std::multiplies<int>{};
assert(6 == fun(2, 3));
With C++17, defining an interface to be used with Poly
is fairly straightforward. As in the Function
example
above, there is a struct with a nested Interface
class
template and a nested Members
alias template. No macros are
needed with C++17.
Imagine we were defining something like a Java-style iterator. If we are using a C++17 compiler, our interface would look something like this:
template <class Value>
struct IJavaIterator {
template <class Base>
struct Interface : Base {
    bool Done() const { return folly::poly_call<0>(*this); }
    Value Current() const { return folly::poly_call<1>(*this); }
    void Next() { folly::poly_call<2>(*this); }
};
// NOTE: This works in C++17 only:
template <class T>
using Members = folly::PolyMembers<&T::Done, &T::Current, &T::Next>;
};
template <class Value>
using JavaIterator = Poly<IJavaIterator<Value>>;
Given the above definition, JavaIterator<int>
can
be used to hold instances of any type that has Done
,
Current
, and Next
member functions with the
correct (or compatible) signatures.
The presence of overloaded member functions complicates this picture.
Often, property members are faked in C++ with const
and
non-const
member function overloads, like in the interface
specified below:
struct IIntProperty {
template <class Base>
struct Interface : Base {
int Value() const { return folly::poly_call<0>(*this); }
void Value(int i) { folly::poly_call<1>(*this, i); }
};
// NOTE: This works in C++17 only:
  template <class T>
  using Members = folly::PolyMembers<
      folly::sig<int() const>(&T::Value),
      folly::sig<void(int)>(&T::Value)>;
};

using IntProperty = Poly<IIntProperty>;
Now, any object that has Value
members of compatible
signatures can be assigned to instances of IntProperty.
Note how folly::sig
is used to disambiguate the
overloads of &T::Value
.
In C++14, the nice syntax above doesn’t work, so we have to resort to macros. The two examples above would look like this:
template <class Value>
struct IJavaIterator {
template <class Base>
struct Interface : Base {
    bool Done() const { return folly::poly_call<0>(*this); }
    Value Current() const { return folly::poly_call<1>(*this); }
    void Next() { folly::poly_call<2>(*this); }
};
// NOTE: This works in C++14 and C++17:
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Done, &T::Current, &T::Next);
};
template <class Value>
using JavaIterator = Poly<IJavaIterator<Value>>;
and
struct IIntProperty {
template <class Base>
struct Interface : Base {
int Value() const { return folly::poly_call<0>(*this); }
void Value(int i) { return folly::poly_call<1>(*this, i); }
};
// NOTE: This works in C++14 and C++17:
  template <class T>
  using Members = FOLLY_POLY_MEMBERS(
      FOLLY_POLY_MEMBER(int() const, &T::Value),
      FOLLY_POLY_MEMBER(void(int), &T::Value));
};
using IntProperty = Poly<IIntProperty>;
One typical advantage of inheritance-based solutions to runtime
polymorphism is that one polymorphic interface could extend another
through inheritance. The same can be accomplished with type-erasing
polymorphic wrappers. In the Poly
library, you can use
folly::PolyExtends
to say that one interface extends
another.
struct IFoo {
template <class Base>
struct Interface : Base {
void Foo() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Foo);
};
// The IFooBar interface extends the IFoo interface
struct IFooBar : PolyExtends<IFoo> {
template <class Base>
struct Interface : Base {
void Bar() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = FOLLY_POLY_MEMBERS(&T::Bar);
};
using FooBar = Poly<IFooBar>;
Given the above definition, instances of type FooBar
have both Foo()
and Bar()
member
functions.
The sensible conversions exist between a wrapped derived type and a
wrapped base type. For instance, assuming IDerived
extends
IBase
with PolyExtends
:
Poly<IDerived> derived = ...;
Poly<IBase> base = derived; // This conversion is OK.
As you would expect, there is no conversion in the other direction,
and at present there is no Poly
equivalent to
dynamic_cast
.
Sometimes you don’t need to own a copy of an object; a reference will
do. For that you can use Poly
to capture a
reference to an object satisfying an interface rather than the
whole object itself. The syntax is intuitive.
int i = 42;
// Capture a mutable reference to an object of any IRegular type:
Poly<IRegular &> intRef = i;

assert(42 == folly::poly_cast<int>(intRef));
// Assert that we captured the address of "i":
assert(&i == &folly::poly_cast<int>(intRef));
A reference-like Poly
has a different interface than a
value-like Poly
. Rather than calling member functions with
the obj.fun()
syntax, you would use the
obj->fun()
syntax. This is for the sake of
const
-correctness. For example, consider the code
below:
struct IFoo {
  template <class Base>
  struct Interface : Base {
    void Foo() { folly::poly_call<0>(*this); }
  };
  template <class T>
  using Members = folly::PolyMembers<&T::Foo>;
};

struct SomeFoo {
  void Foo() { std::printf("SomeFoo::Foo\n"); }
};

SomeFoo foo;
Poly<IFoo &> const anyFoo = foo;
anyFoo->Foo(); // prints "SomeFoo::Foo"
Notice in the above code that the Foo
member function is
non-const
. Notice also that the anyFoo
object
is const
. However, since it has captured a
non-const
reference to the foo
object, it
should still be possible to dispatch to the non-const
Foo
member function. When instantiated with a reference
type, Poly
has an overloaded operator->
member that returns a pointer to the IFoo
interface with
the correct const
-ness, which makes this work.
The same mechanism also prevents users from calling
non-const
member functions on Poly
objects
that have captured const
references, which would violate
const
-correctness.
Sensible conversions exist between non-reference and reference
Poly
s. For instance:
Poly<IRegular> value = 42;
Poly<IRegular &> mutable_ref = value;
Poly<IRegular const &> const_ref = mutable_ref;
assert(&poly_cast<int>(value) == &poly_cast<int>(mutable_ref));
assert(&poly_cast<int>(value) == &poly_cast<int>(const_ref));
If you wanted to write the interface
ILogicallyNegatable
, which captures all types that can be
negated with unary operator!
, you could do it as we’ve
shown above, by binding &T::operator!
in the nested
Members
alias template, but that has the problem that it
won’t work for types that have defined unary operator!
as a
free function. To handle this case, the Poly
library lets
you use a free function instead of a member function when creating a
binding.
With C++17 you may use a lambda to create a binding, as shown in the example below:
struct ILogicallyNegatable {
template <class Base>
struct Interface : Base {
bool operator!() const { return folly::poly_call<0>(*this); }
};
template <class T>
using Members = folly::PolyMembers<
+[](T const& t) -> decltype(bool(!t)) { return bool(!t); }>;
};
This requires some explanation. The unary operator+
in
front of the lambda is necessary! It causes the lambda to decay to a
C-style function pointer, which is one of the types that
folly::PolyMembers
accepts. The decltype
in
the lambda return type is also necessary. Through the magic of SFINAE,
it will cause Poly<ILogicallyNegatable>
to reject any
types that don’t support unary operator!
.
If you are using a free function to create a binding, the first
parameter is implicitly the this
parameter. It will receive
the type-erased object.
If you are using a C++14 compiler, the definition of
ILogicallyNegatable
above will fail because lambdas are not
constexpr
. We can get the same effect by writing the lambda
as a named free function, as shown below:
struct ILogicallyNegatable {
template <class Base>
struct Interface : Base {
bool operator!() const { return folly::poly_call<0>(*this); }
};
template <class T>
static auto negate(T const& t)
-> decltype(bool(!t)) { return bool(!t); }
template <class T>
using Members = FOLLY_POLY_MEMBERS(&negate<T>);
};
As with the example that uses the lambda in the preceding section,
the first parameter is implicitly the this
parameter. It
will receive the type-erased object.
What if you want to create an IAddable
interface for
things that can be added? Adding requires two objects, both of
which are type-erased. This interface requires dispatching on both
objects, doing the addition only if the types are the same. For this we
make use of the PolySelf
template alias to define an
interface that takes more than one object of the erased type.
struct IAddable {
template <class Base>
struct Interface : Base {
    friend PolySelf<Base>
    operator+(PolySelf<Base> const& a, PolySelf<Base> const& b) {
      return folly::poly_call<0>(a, b);
    }
};
template <class T>
using Members = folly::PolyMembers<
+[](T const& a, T const& b) -> decltype(a + b) { return a + b; }>;
};
Given the above definition of IAddable
we would be able
to do the following:
Poly<IAddable> a = 2, b = 3;
Poly<IAddable> c = a + b;
assert(poly_cast<int>(c) == 5);
If a
and b
stored objects of different
types, a BadPolyCast
exception would be thrown.
If you want to store move-only types, then your interface should
extend the poly::IMoveOnly
interface.
Poly will store "small" objects in an internal buffer, avoiding the
cost of dynamic allocations. At present, this size is not configurable;
it is pegged at the size of two doubles.
Poly
objects are always nothrow movable. If you store an
object in one that has a potentially throwing move constructor, the
object will be stored on the heap, even if it could fit in the internal
storage of the Poly
object. (So be sure to give your
objects nothrow move constructors!)
Poly implements type-erasure in a manner very similar to how the
compiler accomplishes virtual dispatch. Every Poly object contains a
pointer to a table of function pointers. Member function calls involve
a double indirection: one through the v-pointer, and another indirect
call through the function pointer it selects.
folly/ProducerConsumerQueue.h
The folly::ProducerConsumerQueue
class is a one-producer
one-consumer queue with very low synchronization overhead.
The queue must be created with a fixed maximum size (and allocates that many cells of sizeof(T)), and it provides just a few simple operations:
* read: Attempt to read the value at the front of the queue into a
  variable; returns false iff the queue was empty.
* write: Emplace a value at the end of the queue; returns false iff the
  queue was full.
* frontPtr: Retrieve a pointer to the item at the front of the queue,
  or nullptr if it is empty.
* popFront: Remove the item from the front of the queue (queue must not
  be empty).
* isEmpty: Check if the queue is empty.
* isFull: Check if the queue is full.
* sizeGuess: Returns the number of entries in the queue. Because of the
  way we coordinate threads, this guess could be slightly wrong when
  called by the producer/consumer thread, and it could be wildly
  inaccurate if called from any other threads. Hence, only call from
  producer/consumer threads!

All of these operations are wait-free. The read operations (including
frontPtr
and popFront
) and write operations
must only be called by the reader and writer thread, respectively.
isFull
, isEmpty
, and sizeGuess
may be called by either thread, but the return values from
read
, write
, or frontPtr
are
sufficient for most cases.
write
may fail if the queue is full, and
read
may fail if the queue is empty, so in many situations
it is important to choose the queue size such that the queue filling or
staying empty for long is unlikely.
A toy example that doesn’t really do anything useful:
folly::ProducerConsumerQueue<folly::fbstring> queue{size};

std::thread reader([&queue] {
  for (;;) {
    folly::fbstring str;
    while (!queue.read(str)) {
      // spin until we get a value
      continue;
    }
    sink(str);
  }
});

// producer thread:
for (;;) {
  folly::fbstring str = source();
  while (!queue.write(str)) {
    // spin until the queue has room
    continue;
  }
}
Alternatively, the consumer may be written as follows to use the ‘front’ value in place, thus avoiding moves or copies:
std::thread reader([&queue] {
  for (;;) {
    folly::fbstring* pval;
    do {
      pval = queue.frontPtr();
    } while (!pval); // spin until we get a value
    sink(*pval);
    queue.popFront();
  }
});
folly/synchronization/Rcu.h
C++ read-copy-update (RCU) functionality in folly.
Read-copy-update (RCU) is a low-overhead synchronization mechanism that provides guaranteed ordering between operations on shared data. In the simplest usage pattern, readers enter a critical section, view some state, and leave the critical section, while writers modify shared state and then defer some cleanup operations.
Proper use of the APIs provided by folly RCU will guarantee that a cleanup operation that is deferred during a reader critical section will not be executed until after that critical section is over.
The main synchronization primitive in folly RCU is a
folly::rcu_domain
. A folly::rcu_domain
is a
“universe” of deferred execution. Each domain has an executor on which
deferred functions may execute. folly::rcu_domain satisfies the
requirements of BasicLockable, and readers may enter a read region in a
folly::rcu_domain by treating the domain as a mutex type that can be
wrapped by C++ locking primitives.
For example, to enter and exit a read region using non-movable RAII
semantics, you could use an std::scoped_lock
:
{
  std::scoped_lock<folly::rcu_domain> lock(folly::rcu_default_domain());
  protectedData.read();
}
Alternatively, if you need to be able to move the lock, you could use
std::unique_lock
:
class ProtectedData {
 private:
  std::unique_lock<folly::rcu_domain> lock_;
  void* data_;
};
There is a global, default domain that can be accessed using
folly::rcu_default_domain()
as in the example above. If
required, you can also create your own domain:
{
  Executor* my_executor = getExecutor();
  folly::rcu_domain domain(my_executor /* or nullptr to use default */);
}
In general, using the default domain is strongly encouraged, as you will likely get better cache locality by sharing a domain that is used by other callers in the process. If, however, you can't avoid blocking during reader critical sections, your own custom domain should be used to avoid delaying reclamation from other updaters. A custom domain can also be used if you want update callbacks to be invoked on a specific executor.
A typical reader / updater synchronization will look something like this:
void doSomethingWith(IPAddress host);

static std::atomic<ConfigData*> globalConfigData;

void reader() {
  while (true) {
    IPAddress curManagementServer;
    {
      // We're about to do some reads we want to protect; if we read a
      // pointer, we need to make sure that if the writer comes along and
      // updates it, the writer's cleanup operation won't happen until we're
      // done accessing the pointed-to data. We get a Guard on that
      // domain; as long as it exists, no function subsequently passed to
      // invokeEventually will execute.
      std::scoped_lock<folly::rcu_domain> guard(folly::rcu_default_domain());
      ConfigData* configData = globalConfigData.load(std::memory_order_consume);
      // We created a guard before we read globalConfigData; we know that the
      // pointer will remain valid until the guard is destroyed.
      curManagementServer = configData->managementServerIP;
      // RCU domain via the scoped mutex is released here; retired objects
      // may be freed.
    }
    doSomethingWith(curManagementServer);
  }
}
void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData);
    // Alternatively, in a blocking manner:
    //   folly::rcu_synchronize();
    //   delete oldConfigData;
  }
}
In the example above, a single writer updates
ConfigData*
in a loop, and then defers freeing it until all
RCU readers have exited their read regions. The writer may use either of
the following two APIs to safely defer freeing the old
ConfigData*
:
* rcu_retire(): schedules the asynchronous deletion of the
  oldConfigData pointer once all readers have exited their read
  regions.
* rcu_synchronize(): blocks the calling thread until all readers have
  exited their read regions, at which point the pointer is safe to
  delete.

If you expect there to be very long read regions, it may be necessary
to use rcu_synchronize() or a periodic rcu_barrier() (described below)
to avoid running out of memory due to delayed reclamation.
Another synchronization primitive provided by the folly RCU library
is rcu_barrier()
. Unlike rcu_synchronize()
,
which blocks until all outstanding readers have exited their read
regions, rcu_barrier()
blocks until all outstanding
deleters (specified in a call to rcu_retire()
) are
completed.
As mentioned above, one example of where this may be useful is
avoiding out-of-memory errors due to scheduling too many objects whose
reclamation is delayed. Taking our example from above, we could avoid
OOMs using a periodic invocation to rcu_barrier()
as
follows:
static std::atomic<ConfigData*> globalConfigData;

void writer() {
  uint32_t retires = 0;
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    if (retires++ % 1000 == 0) {
      folly::rcu_barrier();
    }
    folly::rcu_retire(oldConfigData);
  }
}
When invoking folly::rcu_retire()
, you may optionally
also pass a custom deleter function that is invoked instead of std::default_delete
:
#include <folly/logging/xlog.h>

static std::atomic<ConfigData*> globalConfigData;

void writer() {
  while (true) {
    std::this_thread::sleep_for(std::chrono::seconds(60));
    ConfigData* oldConfigData = globalConfigData.load(std::memory_order_relaxed);
    ConfigData* newConfigData = loadConfigDataFromRemoteServer();
    globalConfigData.store(newConfigData, std::memory_order_release);
    folly::rcu_retire(oldConfigData, [](ConfigData* obj) {
      XLOG(INFO) << "Deleting retired config " << obj->version;
      delete obj;
    });
  }
}
Exceptions may not be thrown at any point in a retire callback. This includes both the deleter, as well as the object’s destructor. Other than this, any operation is safe from within a retired object’s destructor, including retiring other objects, or even retiring the same object as long as the custom deleter did not free it.
When using the default domain or the default executor, it is not legal
to call rcu_retire() while holding a lock that the deleter also
acquires. This is normally not a problem when using the default deleter
delete, which does not acquire any user locks. However, even when using
the default deleter, an object whose user-defined destructor acquires
locks held across the corresponding call to rcu_retire() can still
deadlock.
Note as well that there is no guarantee of the order in which retire callbacks are invoked. A retire callback is guaranteed to be invoked only after all readers that were present when the callback was scheduled have exited. Otherwise, any ordering of callback invocation may occur.
fork()
may not be invoked in a multithreaded program
where any thread other than the calling thread is in an RCU read region.
Doing so will result in undefined behavior, and will likely lead to
deadlock. If the forking thread is inside of an RCU read region, it must
invoke exec()
before exiting the read region.
std::scoped_lock<folly::rcu_domain>
creation/destruction is on the order of ~5ns. By comparison,
folly::SharedMutex::lock_shared()
followed by
folly::SharedMutex::unlock_shared()
is ~26ns.
folly/SmallLocks.h
This module is currently x64 only.
This header defines two very small mutex types. These are useful in highly memory-constrained environments where contention is unlikely. The purpose of these is to allow fine-grained locking in massive data structures where memory is at a premium. Often, each record may have a spare bit or byte lying around, so sometimes these can be tacked on with no additional memory cost.
There are two types exported from this header.
MicroSpinLock
is a single byte lock, and
PicoSpinLock
can be wrapped around an integer to use a
single bit as a lock. Why do we have both? Because you can’t use x64
bts
on a single byte, so sizeof(MicroSpinLock)
is smaller than sizeof(PicoSpinLock)
can be, giving it some
use cases.
Both the locks in this header model the C++11 Lockable concept. So
you can use std::lock_guard
or
std::unique_lock
to lock them in an RAII way if you
want.
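For example, a minimal sketch (the Record struct is hypothetical; a MicroSpinLock must be zero-initialized, or have init() called, before first use):

#include <folly/SmallLocks.h>

#include <mutex>

struct Record {
  folly::MicroSpinLock lock_; // one byte; zero means unlocked
  int value_ = 0;
};

Record r{}; // zero-initialization leaves the lock in the unlocked state

void bump() {
  std::lock_guard<folly::MicroSpinLock> g(r.lock_);
  ++r.value_;
}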
Additional information is in the header.
folly/Synchronized.h
folly/Synchronized.h introduces a simple abstraction for mutex-based
concurrency. It replaces convoluted, unwieldy, and just plain wrong
code with simple constructs that are easy to get right and difficult to
get wrong.
Many of our multithreaded C++ programs use shared data structures associated with locks. This follows the time-honored adage of mutex-based concurrency control “associate mutexes with data, not code”. Consider the following example:
class RequestHandler {
  ...
  RequestQueue requestQueue_;
  SharedMutex requestQueueMutex_;

  std::map<std::string, Endpoint> requestEndpoints_;
  SharedMutex requestEndpointsMutex_;

  HandlerState workState_;
  SharedMutex workStateMutex_;
  ...
};
Whenever the code needs to read or write some of the protected data, it acquires the mutex for reading or for reading and writing. For example:
void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  std::unique_lock lock(requestQueueMutex_);
  requestQueue_.push_back(request);
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
However, the correctness of the technique is entirely predicated on convention. Developers manipulating these data members must take care to explicitly acquire the correct lock for the data they wish to access. There is no ostensible error for code that:

* manipulates a piece of data without acquiring its lock first
* acquires a different lock instead of the intended one
* acquires a lock in read mode but modifies the guarded data
* acquires a lock in read-write mode although it only needs const access to the guarded data

folly/Synchronized.h
The same code sample could be rewritten with
Synchronized
as follows:
class RequestHandler {
  ...
  Synchronized<RequestQueue> requestQueue_;
  Synchronized<std::map<std::string, Endpoint>> requestEndpoints_;
  Synchronized<HandlerState> workState_;
  ...
};

void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  requestQueue_.wlock()->push_back(request);
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
The rewrite does at maximum efficiency what needs to be done:
acquires the lock associated with the RequestQueue
object,
writes to the queue, and releases the lock immediately thereafter.
On the face of it, that's not much to write home about, and not an obvious improvement over the previous state of affairs. But the features at work invisibly in the code above are as important as those that are visible: if you tried to use requestQueue_ without acquiring the lock, you wouldn't be able to; it is virtually impossible to access the queue without acquiring the correct lock.

If you need to perform several operations while holding the lock,
Synchronized
provides several options for doing this.
The wlock()
method (or lock()
if you have a
non-shared mutex type) returns a LockedPtr
object that can
be stored in a variable. The lock will be held for as long as this
object exists, similar to a std::unique_lock
. This object
can be used as if it were a pointer to the underlying locked object:
{
  auto lockedQueue = requestQueue_.wlock();
  lockedQueue->push_back(request1);
  lockedQueue->push_back(request2);
}
The rlock()
function is similar to wlock()
,
but acquires a shared lock rather than an exclusive lock.
We recommend explicitly opening a new nested scope whenever you store
a LockedPtr
object, to help visibly delineate the critical
section, and to ensure that the LockedPtr
is destroyed as
soon as it is no longer needed.
Alternatively, Synchronized
also provides mechanisms to
run a function while holding the lock. This makes it possible to use
lambdas to define brief critical sections:
void RequestHandler::processRequest(const Request& request) {
  stop_watch<> watch;
  checkRequestValidity(request);
  requestQueue_.withWLock([&](auto& queue) {
    // withWLock() automatically holds the lock for the
    // duration of this lambda function
    queue.push_back(request);
  });
  stats_->addStatValue("requestEnqueueLatency", watch.elapsed());
  LOG(INFO) << "enqueued request ID " << request.getID();
}
One advantage of the withWLock()
approach is that it
forces a new scope to be used for the critical section, making the
critical section more obvious in the code, and helping to encourage code
that releases the lock as soon as possible.
Synchronized<T>
Synchronized
is a template with two parameters, the data
type and a mutex type: Synchronized<T, Mutex>
.
If not specified, the mutex type defaults to
folly::SharedMutex
. However, any mutex type with an
interface compatible with the standard mutex types is supported.
Additionally, any mutex type compatible with the extended protocol
implemented in folly/synchronization/Lock.h
is
supported.
Synchronized provides slightly different APIs when instantiated with a
shared mutex type or an upgrade mutex type than with a plain exclusive
mutex. If instantiated with either of the two mutex types above
(detected through the presence of a member called lock_shared()), the
Synchronized object has corresponding wlock, rlock, or ulock methods to
acquire different lock types. When using a shared or upgrade mutex
type, these APIs ensure that callers make an explicit choice to acquire
a shared, exclusive, or upgrade lock and that callers do not
unintentionally lock the mutex in the incorrect mode. The rlock() APIs
only provide const access to the underlying data type, ensuring that it
cannot be modified when only holding a shared lock.
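A hedged sketch of the upgrade-lock workflow (the map and function are hypothetical; ulock() permits concurrent readers while reserving the right to upgrade to exclusive access via moveFromUpgradeToWrite()):

folly::Synchronized<std::map<std::string, int>> counts;

void incrementIfPresent(const std::string& key) {
  auto ulocked = counts.ulock(); // upgrade lock: read access only
  if (ulocked->count(key) != 0) {
    // Upgrade to an exclusive lock only when we actually need to write:
    auto wlocked = ulocked.moveFromUpgradeToWrite();
    (*wlocked)[key]++;
  }
}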
The default constructor default-initializes the data and its associated mutex.
The copy constructor locks the source for reading and copies its data into the target. (The target is not locked as an object under construction is only accessed by one thread.)
Finally, Synchronized<T> defines an explicit constructor that takes an object of type T and copies it. For example:
// Default constructed
Synchronized<map<string, int>> syncMap1;

// Copy constructed
Synchronized<map<string, int>> syncMap2(syncMap1);

// Initializing from an existing map
map<string, int> init;
init["world"] = 42;
Synchronized<map<string, int>> syncMap3(init);
EXPECT_EQ(syncMap3->size(), 1);
The copy assignment operator copies the underlying source data into a temporary with the source mutex locked, and then moves the temporary into the destination data with the destination mutex locked. This technique avoids the need to lock both mutexes at the same time. Mutexes are not copied or moved.
The move assignment operator assumes the source object is a true rvalue and does not lock the source mutex. It moves the source data into the destination data with the destination mutex locked.
swap acquires locks on both mutexes in increasing order of object address, and then swaps the underlying data. This avoids potential deadlock, which may otherwise happen should one thread do a = b while another thread does b = a.
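That ordering can be sketched as follows (an illustrative sketch only, not folly’s actual implementation; it assumes a shared mutex type so wlock() is available):
template <class T>
void swapByAddressOrder(Synchronized<T>& a, Synchronized<T>& b) {
  if (&a == &b) {
    return; // self-swap: locking the same mutex twice would deadlock
  }
  // always lock the lower-addressed object first
  if (&a < &b) {
    auto lockedA = a.wlock();
    auto lockedB = b.wlock();
    std::swap(*lockedA, *lockedB);
  } else {
    auto lockedB = b.wlock();
    auto lockedA = a.wlock();
    std::swap(*lockedA, *lockedB);
  }
}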
The data copy assignment operator copies the parameter into the destination data while the destination mutex is locked.
The data move assignment operator moves the parameter into the destination data while the destination mutex is locked.
To get a copy of the guarded data, there are two methods available: void copy(T*) and T copy(). The first copies data to a provided target and the second returns a copy by value. Both operations are done under a read lock. Example:
Synchronized<vector<string>> syncVec1, syncVec2;
vector<string> vec;

// Assign
syncVec1 = syncVec2;
// Assign straight from vector
syncVec1 = vec;

// Swap
syncVec1.swap(syncVec2);
// Swap with vector
syncVec1.swap(vec);

// Copy to given target
syncVec1.copy(&vec);
// Get a copy by value
auto copy = syncVec1.copy();
lock()
If the mutex type used with Synchronized is a simple exclusive mutex type (as opposed to a shared mutex), Synchronized<T> provides a lock() method that returns a LockedPtr<T> to access the data while holding the lock.
The LockedPtr object returned by lock() holds the lock for as long as it exists. Whenever possible, prefer declaring a separate inner scope for storing this variable, to make sure the LockedPtr is destroyed as soon as the lock is no longer needed:
void fun(Synchronized<vector<string>, std::mutex>& vec) {
  {
    auto locked = vec.lock();
    locked->push_back("hello");
    locked->push_back("world");
  }
  LOG(INFO) << "successfully added greeting";
}
wlock() and rlock()
If the mutex type used with Synchronized is a shared mutex type, Synchronized<T> provides a wlock() method that acquires an exclusive lock, and an rlock() method that acquires a shared lock.
The LockedPtr returned by rlock() only provides const access to the internal data, to ensure that it cannot be modified while only holding a shared lock.
int computeSum(const Synchronized<vector<int>>& vec) {
  int sum = 0;
  auto locked = vec.rlock();
  for (int n : *locked) {
    sum += n;
  }
  return sum;
}

void doubleValues(Synchronized<vector<int>>& vec) {
  auto locked = vec.wlock();
  for (int& n : *locked) {
    n *= 2;
  }
}
This example brings us to a cautionary discussion. The LockedPtr object returned by lock(), wlock(), or rlock() only holds the lock as long as it exists. This object makes it difficult to access the data without holding the lock, but not impossible. In particular you should never store a raw pointer or reference to the internal data for longer than the lifetime of the LockedPtr object.
For instance, if we had written the following code in the examples above, this would have continued accessing the vector after the lock had been released:
// No. NO. NO!
for (int& n : *vec.wlock()) {
  n *= 2;
}
The vec.wlock() return value is destroyed in this case as soon as the internal range iterators are created. The range iterators point into the vector’s data, but the lock is released immediately, before executing the loop body.
Needless to say, this is a crime punishable by long debugging nights.
Range-based for loops are slightly subtle about the lifetime of objects used in the initializer statement. Most other problematic use cases are a bit easier to spot than this, since the lifetime of the LockedPtr is more explicitly visible.
withLock()
As an alternative to the lock() API, Synchronized also provides a withLock() method that executes a function or lambda expression while holding the lock. The function receives a reference to the data as its only argument. This has a few benefits compared to lock():
The lambda expression requires its own nested scope, making critical sections more visible in the code. Callers are recommended to define a new scope when using lock() if they choose to, but this is not required. withLock() ensures that a new scope must always be defined.
Because a new scope is required, withLock() also helps encourage users to release the lock as soon as possible. Because the critical section scope is easily visible in the code, it is harder to accidentally put extraneous code inside the critical section without realizing it.
For example, withLock() makes the range-based for loop mistake from above much harder to accidentally run into:
vec.withLock([](auto& locked) {
  for (int& n : locked) {
    n *= 2;
  }
});
This code does not have the same problem as the counter-example with wlock() above, since the lock is held for the duration of the loop.
When using Synchronized with a shared mutex type, it provides separate withWLock() and withRLock() methods instead of withLock().
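For instance, a read-only pass can use withRLock() while mutation goes through withWLock() (a brief sketch, reusing the Synchronized<vector<int>> vec from above):
// Shared lock: const access only
vec.withRLock([](const auto& locked) {
  LOG(INFO) << "current size: " << locked.size();
});

// Exclusive lock: mutation allowed
vec.withWLock([](auto& locked) {
  locked.push_back(42);
});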
ulock() and withULockPtr()
Synchronized also supports upgrading and downgrading mutex lock levels as long as the mutex type used to instantiate the Synchronized type has the same interface as the mutex types in the C++ standard library. See below for an intro to upgrade mutexes.
An upgrade lock can be acquired as usual either with the ulock() method or the withULockPtr() method as so:
{
  // only const access allowed to the underlying object when an upgrade lock
  // is acquired
  auto ulock = vec.ulock();
  auto newSize = ulock->size();
}

auto newSize = vec.withULockPtr([](auto ulock) {
  // only const access allowed to the underlying object when an upgrade lock
  // is acquired
  return ulock->size();
});
An upgrade lock acquired via ulock() or withULockPtr() can be upgraded or downgraded by calling any of the following methods on the LockedPtr proxy:
moveFromUpgradeToWrite()
moveFromWriteToUpgrade()
moveFromWriteToRead()
moveFromUpgradeToRead()
Calling these leaves the LockedPtr object on which the method was called in an invalid null state and returns another LockedPtr proxy holding the specified lock. The upgrade or downgrade is done atomically - the Synchronized object is never in an unlocked state during the lock state transition. For example:
auto ulock = obj.ulock();
if (ulock->needsUpdate()) {
  auto wlock = ulock.moveFromUpgradeToWrite();
  // ulock is now null
  wlock->updateObj();
}
This “move” can also occur in the context of a withULockPtr() (withWLockPtr() or withRLockPtr() work as well!) function as so:
auto newSize = obj.withULockPtr([](auto ulock) {
  if (ulock->needsUpdate()) {
    // release upgrade lock and acquire write lock atomically
    auto wlock = ulock.moveFromUpgradeToWrite();
    // ulock is now null
    wlock->updateObj();

    // release write lock and acquire read lock atomically
    auto rlock = wlock.moveFromWriteToRead();
    // wlock is now null
    return rlock->newSize();
  } else {
    // release upgrade lock and acquire read lock atomically
    auto rlock = ulock.moveFromUpgradeToRead();
    // ulock is now null
    return rlock->newSize();
  }
});
An upgrade mutex is a shared mutex with an extra state called upgrade and an atomic state transition from upgrade to unique. The upgrade state is more powerful than the shared state but less powerful than the unique state.
An upgrade lock permits only const access to shared state for doing reads. It does not permit mutable access to shared state for doing writes. Only a unique lock permits mutable access for doing writes.
An upgrade lock may be held concurrently with any number of shared locks on the same mutex. An upgrade lock is exclusive with other upgrade locks and unique locks on the same mutex - only one upgrade lock or unique lock may be held at a time.
The upgrade mutex solves the problem of doing a read of shared state and then optionally doing a write to shared state efficiently under contention. Consider this scenario with a shared mutex:
struct MyObject {
  bool isUpdateRequired() const;
  void doUpdate();
};

struct MyContainingObject {
  folly::Synchronized<MyObject> sync;

  void mightHappenConcurrently() {
    // first check
    if (!sync.rlock()->isUpdateRequired()) {
      return;
    }
    sync.withWLock([&](auto& state) {
      // second check
      if (!state.isUpdateRequired()) {
        return;
      }
      state.doUpdate();
    });
  }
};
Here, the second isUpdateRequired check happens under a unique lock. This means that the second check cannot be done concurrently with other threads doing first isUpdateRequired checks under the shared lock, even though the second check, like the first check, is read-only and requires only const access to the shared state.
This may even introduce unnecessary blocking under contention. Since the default mutex type, folly::SharedMutex, has write priority, the unique lock protecting the second check may introduce unnecessary blocking to all the other threads that are attempting to acquire a shared lock to protect the first check. This problem is called reader starvation.
One solution is to use a shared mutex type with read priority, such as folly::SharedMutexReadPriority. That can introduce less blocking under contention to the other threads attempting to acquire a shared lock to do the first check. However, that may backfire and cause threads which are attempting to acquire a unique lock (for the second check) to stall, waiting for a moment in time when there are no shared locks held on the mutex, a moment in time that may never even happen. This problem is called writer starvation.
Starvation is a tricky problem to solve in general. But we can partially side-step it in our case.
An alternative solution is to use an upgrade lock for the second check. Threads attempting to acquire an upgrade lock for the second check do not introduce unnecessary blocking to all other threads that are attempting to acquire a shared lock for the first check. Only after the second check passes, and the upgrade lock transitions atomically from an upgrade lock to a unique lock, does the unique lock introduce necessary blocking to the other threads attempting to acquire a shared lock. With this solution, unlike the solution without the upgrade lock, the second check may be done concurrently with all other first checks rather than blocking or being blocked by them.
The example would then look like:
struct MyObject {
  bool isUpdateRequired() const;
  void doUpdate();
};

struct MyContainingObject {
  folly::Synchronized<MyObject> sync;

  void mightHappenConcurrently() {
    // first check
    if (!sync.rlock()->isUpdateRequired()) {
      return;
    }
    sync.withULockPtr([&](auto ulock) {
      // second check
      if (!ulock->isUpdateRequired()) {
        return;
      }
      auto wlock = ulock.moveFromUpgradeToWrite();
      wlock->doUpdate();
    });
  }
};
Note: Some shared mutex implementations offer an atomic state transition from shared to unique and some upgrade mutex implementations offer an atomic state transition from shared to upgrade. These atomic state transitions are dangerous, however, and can deadlock when done concurrently on the same mutex. For example, if threads A and B both hold shared locks on a mutex and are both attempting to transition atomically from shared to upgrade locks, the threads are deadlocked. Likewise if they are both attempting to transition atomically from shared to unique locks, or one is attempting to transition atomically from shared to upgrade while the other is attempting to transition atomically from shared to unique. Therefore, Synchronized’s LockedPtr proxies do not expose either of these dangerous atomic state transitions even when the underlying mutex type supports them.
When Synchronized is used with a mutex type that supports timed lock acquisition, lock(), wlock(), and rlock() can all take an optional std::chrono::duration argument. This argument specifies a timeout to use for acquiring the lock. If the lock is not acquired before the timeout expires, a null LockedPtr object will be returned. Callers must explicitly check the return value before using it:
void fun(Synchronized<vector<string>>& vec) {
  {
    auto locked = vec.lock(10ms);
    if (!locked) {
      throw std::runtime_error("failed to acquire lock");
    }
    locked->push_back("hello");
    locked->push_back("world");
  }
  LOG(INFO) << "successfully added greeting";
}
unlock() and scopedUnlock()
Synchronized is a good mechanism for enforcing scoped synchronization, but it has the inherent limitation that it requires the critical section to be, well, scoped. Sometimes the code structure requires a fleeting “escape” from the iron fist of synchronization, while still inside the critical section scope.
One common pattern is releasing the lock early on error code paths, prior to logging an error message. The LockedPtr class provides an unlock() method that makes this possible:
Synchronized<map<int, string>> dic;
...
{
  auto locked = dic.rlock();
  auto iter = locked->find(0);
  if (iter == locked->end()) {
    locked.unlock(); // don't hold the lock while logging
    LOG(ERROR) << "key 0 not found";
    return false;
  }
  processValue(*iter);
}
LOG(INFO) << "succeeded";
For more complex nested control flow scenarios, scopedUnlock() returns an object that will release the lock for as long as it exists, and will reacquire the lock when it goes out of scope.
Synchronized<map<int, string>> dic;
...
{
  auto locked = dic.wlock();
  auto iter = locked->find(0);
  if (iter == locked->end()) {
    {
      auto unlocker = locked.scopedUnlock();
      LOG(INFO) << "Key 0 not found, inserting it.";
    }
    locked->emplace(0, "zero");
  } else {
    iter->second = "zero";
  }
}
Clearly scopedUnlock() comes with specific caveats and liabilities. You must assume that during the scopedUnlock() section, other threads might have changed the protected structure in arbitrary ways. In the example above, you cannot use the iterator iter and you cannot assume that the key 0 is not in the map; another thread might have inserted it while you were bragging on LOG(INFO).
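The safe pattern is therefore to re-derive any iterators once the lock has been reacquired (a minimal sketch, reusing the dic map from above):
auto locked = dic.wlock();
{
  auto unlocker = locked.scopedUnlock();
  // other threads may insert, erase, or rehash here
}
// look up again rather than reusing a pre-unlock iterator
auto iter = locked->find(0);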
Whenever a LockedPtr object has been unlocked, whether with unlock() or scopedUnlock(), it will behave as if it is null. isNull() will return true. Dereferencing an unlocked LockedPtr is not allowed and will result in undefined behavior.
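A quick illustration of the null behavior (a sketch, again using the dic map from the earlier examples):
auto locked = dic.wlock();
locked.unlock();
assert(locked.isNull()); // true once the lock has been released
// (*locked) or locked->find(0) here would be undefined behavior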
Synchronized and std::condition_variable
When used with a std::mutex, Synchronized supports using a std::condition_variable with its internal mutex. This allows a condition_variable to be used to wait for a particular change to occur in the internal data.
The LockedPtr returned by Synchronized<T, std::mutex>::lock() has an as_lock() method that returns a reference to a std::unique_lock<std::mutex>, which can be given to the std::condition_variable:
Synchronized<vector<string>, std::mutex> vec;
std::condition_variable emptySignal;

// Assuming some other thread will put data on vec and signal
// emptySignal, we can then wait on it as follows:
auto locked = vec.lock();
emptySignal.wait(locked.as_lock(),
                 [&] { return !locked->empty(); });
acquireLocked()
Sometimes locking just one object won’t be able to cut the mustard. Consider a function that needs to lock two Synchronized objects at the same time - for example, to copy some data from one to the other. At first sight, it looks like sequential wlock() calls will work just fine:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto lockedA = a.wlock();
auto lockedB = b.wlock();
... use lockedA and lockedB ...
}
This code compiles and may even run most of the time, but embeds a deadly peril: if one thread calls fun(x, y) and another thread calls fun(y, x), then the two threads are liable to deadlock as each thread will be waiting for a lock the other is holding. This issue is a classic that applies regardless of the fact that the objects involved have the same type.
This classic problem has a classic solution: all threads must acquire locks in the same order. The actual order is not important, just the fact that the order is the same in all threads. Many libraries simply acquire mutexes in increasing order of their address, which is what we’ll do, too. The acquireLocked() function takes care of all details of proper locking of two objects and offering their innards. It returns a std::tuple of LockedPtrs:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto ret = folly::acquireLocked(a, b);
auto& lockedA = std::get<0>(ret);
auto& lockedB = std::get<1>(ret);
... use lockedA and lockedB ...
}
Note that C++17 introduces structured binding syntax which will make the returned tuple more convenient to use:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
auto [lockedA, lockedB] = folly::acquireLocked(a, b);
... use lockedA and lockedB ...
}
An acquireLockedPair() function is also available, which returns a std::pair instead of a std::tuple. This is more convenient to use in many situations, until compiler support for structured bindings is more widely available.
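A brief sketch of the pair-returning variant, with the same signature as the examples above:
void fun(Synchronized<vector<int>>& a, Synchronized<vector<int>>& b) {
  auto ret = folly::acquireLockedPair(a, b);
  ret.first->push_back(1);  // locked view of a
  ret.second->push_back(2); // locked view of b
}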
The library is geared at protecting one object of a given type with a mutex. However, sometimes we’d like to protect two or more members with the same mutex. Consider for example a bidirectional map, i.e. a map that holds an int to string mapping and also the converse string to int mapping. The two maps would need to be manipulated simultaneously. There are at least two designs that come to mind.
struct
You can easily pack the needed data items in a little struct. For example:
class Server {
  struct BiMap {
    map<int, string> direct;
    map<string, int> inverse;
  };
  Synchronized<BiMap> bimap_;
  ...
};
...
bimap_.withLock([](auto& locked) {
  locked.direct[0] = "zero";
  locked.inverse["zero"] = 0;
});
With this code in tow you get to use bimap_ just like any other Synchronized object, without much effort.
std::tuple
If you won’t stop short of using a spaceship-era approach, std::tuple is there for you. The example above could be rewritten for the same functionality like this:
class Server {
  Synchronized<tuple<map<int, string>, map<string, int>>> bimap_;
  ...
};
...
bimap_.withLock([](auto& locked) {
  get<0>(locked)[0] = "zero";
  get<1>(locked)["zero"] = 0;
});
The code uses std::get with compile-time integers to access the fields in the tuple. The relative advantages and disadvantages of using a local struct vs. std::tuple are quite obvious - in the first case you need to invest in the definition, in the second case you need to put up with slightly more verbose and less clear access syntax.
Synchronized and its supporting tools offer you a simple, robust paradigm for mutual exclusion-based concurrency. Instead of manually pairing data with the mutexes that protect it and relying on convention to use them appropriately, you can benefit from encapsulation and typechecking to offload a large part of that task and to provide good guarantees.
folly/ThreadCachedInt.h
High-performance atomic increment using thread caching.
folly/ThreadCachedInt.h introduces an integer class designed for high performance increments from multiple threads simultaneously without loss of precision. It has two read modes: readFast gives a potentially stale value with one load, and readFull gives the exact value, but is much slower, as discussed below.
Increment performance is up to 10x greater than std::atomic_fetch_add in high contention environments. See folly/test/ThreadCachedIntTest.h for more comprehensive benchmarks.
readFast is as fast as a single load. readFull, on the other hand, requires acquiring a mutex and iterating through a list to accumulate the values of all the thread local counters, so is significantly slower than readFast.
Create an instance and increment it with increment or the operator overloads. Read the value with readFast for quick, potentially stale data, or readFull for a more expensive but precise result. There are additional convenience functions as well, such as set.
ThreadCachedInt<int64_t> val;
EXPECT_EQ(0, val.readFast());
++val;                        // increment in thread local counter only
EXPECT_EQ(0, val.readFast()); // increment has not been flushed
EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
val.set(2);
EXPECT_EQ(2, val.readFast());
EXPECT_EQ(2, val.readFull());
folly::ThreadCachedInt uses folly::ThreadLocal to store thread specific objects that each have a local counter. When incrementing, the thread local instance is incremented. If the local counter passes the cache size, the value is flushed to the global counter with an atomic increment. It is this global counter that is read with readFast via a simple load, but it will not count any of the updates that haven’t been flushed.
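A minimal sketch of that increment path (illustrative only; the member names cache_, cacheSize_, and globalCount_ are assumptions for this sketch, not folly’s actual members):
void increment(int64_t amount) {
  int64_t& local = *cache_; // this thread's folly::ThreadLocal counter
  local += amount;
  if (local >= cacheSize_) {
    // flush to the shared counter; this is the only atomic operation
    globalCount_.fetch_add(local, std::memory_order_relaxed);
    local = 0;
  }
}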
In order to read the exact value, ThreadCachedInt uses the extended readAllThreads() API of folly::ThreadLocal to iterate through all the references to all the associated thread local object instances. This currently requires acquiring a global mutex and iterating through the references, accumulating the counters along with the global counter. This also means that the first use of the object from a new thread will acquire the mutex in order to insert the thread local reference into the list. By default, there is one global mutex per integer type used in ThreadCachedInt. If you plan on using a lot of ThreadCachedInts in your application, consider breaking up the global mutex by introducing additional Tag template parameters, as sketched below.
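For instance (a sketch; the tag types here are hypothetical):
struct QueryCounterTag {};
struct CacheCounterTag {};

// Each distinct Tag gets its own global mutex and thread local list,
// so readFull() on one counter does not contend with the other.
folly::ThreadCachedInt<int64_t, QueryCounterTag> queryCount;
folly::ThreadCachedInt<int64_t, CacheCounterTag> cacheCount;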
set simply sets the global counter value, and marks all the thread local instances as needing to be reset. When iterating with readFull, thread local counters that have been marked as reset are skipped. When incrementing, thread local counters marked for reset are set to zero and unmarked for reset.
Upon destruction, thread local counters are flushed to the parent so that counts are not lost after increments in temporary threads. This requires grabbing the global mutex to make sure the parent itself wasn’t destroyed in another thread already.
There are of course many ways to skin a cat, and you may notice there is a partial alternate implementation in folly/test/ThreadCachedIntTest.cpp that provides similar performance. ShardedAtomicInt simply uses an array of std::atomic<int64_t>’s and hashes threads across them to do low-contention atomic increments, and readFull just sums up all the ints.
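The idea looks roughly like this (an illustrative sketch, not the test file’s exact code):
#include <array>
#include <atomic>
#include <cstdint>
#include <thread>

class ShardedAtomicInt {
  std::array<std::atomic<int64_t>, 2048> shards_{};

 public:
  void increment(int64_t v) {
    // hash the current thread across the shards to keep contention low
    size_t idx =
        std::hash<std::thread::id>()(std::this_thread::get_id()) % shards_.size();
    shards_[idx].fetch_add(v, std::memory_order_relaxed);
  }

  int64_t readFull() const {
    // lock-free, but has to walk every shard
    int64_t sum = 0;
    for (const auto& s : shards_) {
      sum += s.load(std::memory_order_relaxed);
    }
    return sum;
  }
};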
This sounds great, but in order to get the contention low enough to get similar performance as ThreadCachedInt with 24 threads, ShardedAtomicInt needs about 2000 ints to hash across. This uses about 20x more memory, and the lock-free readFull has to sum up all 2048 ints, which ends up being about 50x slower than ThreadCachedInt in low contention situations, which is hopefully the common case since it’s designed for high-write, low-read access patterns. Performance of readFull is about the same speed as ThreadCachedInt in high contention environments.
Depending on the operating conditions, it may make more sense to use one implementation over the other. For example, a lower contention environment will probably be able to use a ShardedAtomicInt with a much smaller array without hurting performance, while improving memory consumption and perf of readFull.
folly/ThreadLocal.h
Improved thread local storage for non-trivial types.
~4x faster in heavily multi-threaded scenarios than boost::thread_specific_ptr.
Similar speed as using pthread_getspecific directly, but only consumes a single pthread_key_t per Tag template param.
Expands the thread_specific_ptr API with accessAllThreads and extended custom deleter support.
The API of ThreadLocalPtr is very close to boost::thread_specific_ptr with the notable addition of the accessAllThreads method. There is also a ThreadLocal class which is a thin wrapper around ThreadLocalPtr that manages allocation automatically (creates a new object the first time it is dereferenced from each thread).
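For example, the wrapper can be used like this (a minimal sketch, assuming Widget is default-constructible):
folly::ThreadLocal<Widget> w;
w->mangleWidget(); // this thread's Widget, default-constructed on first access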
ThreadLocalPtr simply gives you a place to put and access a pointer local to each thread such that it will be destroyed appropriately.
{
  folly::ThreadLocalPtr<Widget> w;
  w.reset(new Widget(0), Widget::customDeleterA);
  std::thread([&w]() {
    w.reset(new Widget(1), Widget::customDeleterB);
    w.get()->mangleWidget();
  }).join(); // Widget(1) is destroyed with customDeleterB
} // Widget(0) is destroyed with customDeleterA
Note that customDeleterB will get called with TLPDestructionMode::THIS_THREAD and customDeleterA will get called with TLPDestructionMode::ALL_THREADS. This is to distinguish between thread exit vs. the entire ThreadLocalPtr getting destroyed, in which case there is cleanup work that may be avoided.
The accessAllThreads interface is provided to walk all the thread local child objects of a parent. accessAllThreads initializes an accessor which holds a global lock that blocks all creation and destruction of ThreadLocal objects with the same Tag and can be used as an iterable container. Typical use is for frequent write, infrequent read data access patterns such as counters. Note that you must specify a unique Tag type so you don’t block other ThreadLocal object usage, and you should try to minimize the lifetime of the accessor so the lock is held for as short a time as possible.
The following example is a simplification of folly/ThreadCachedInt.h. It keeps track of a counter value and allows multiple threads to add to the count without synchronization. In order to get the total count, read() iterates through all the thread local values via accessAllThreads() and sums them up. class NewTag is used to break the global mutex so that this class won’t block other ThreadLocal usage when read() is called.
Note that read() holds the global mutex which blocks construction, destruction, and read() for other SimpleThreadCachedInt’s, but does not block add(). Also, since it uses the unique NewTag, SimpleThreadCachedInt does not affect other ThreadLocal usage.
class SimpleThreadCachedInt {
  class NewTag; // Segments the global mutex
  ThreadLocal<int, NewTag> val_;

 public:
  void add(int val) {
    *val_ += val; // operator*() gives a reference to the thread local instance
  }

  int read() {
    int ret = 0;
    // accessAllThreads acquires the global lock
    for (const auto& i : val_.accessAllThreads()) {
      ret += i;
    } // Global lock is released on scope exit
    return ret;
  }
};
We keep a __thread array of pointers to objects (ThreadEntry::elements) where each array has an index for each unique instance of the ThreadLocalPtr object. Each ThreadLocalPtr object has a unique id that is an index into these arrays so we can fetch the correct object from thread local storage very efficiently.
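An illustrative sketch of that fast path (simplified; not folly’s actual data structures):
struct ThreadEntry {
  std::vector<void*> elements; // one slot per live ThreadLocalPtr id
};
thread_local ThreadEntry threadEntry;

// Fetching this thread's object for the ThreadLocalPtr with a given id
// needs no locking: only the owning thread touches its own array.
void* getElement(uint32_t id) {
  auto& elems = threadEntry.elements;
  return id < elems.size() ? elems[id] : nullptr;
}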
In order to prevent unbounded growth of the id space and thus huge ThreadEntry::elements arrays, for example due to continuous creation and destruction of ThreadLocalPtr objects, we keep track of all active instances by linking them together into a list. When an instance is destroyed we remove it from the chain and insert the id into freeIds_ for reuse. These operations require a global mutex, but only happen at construction and destruction time. accessAllThreads also acquires this global mutex.
We use a single global pthread_key_t per Tag to manage object destruction and memory cleanup upon thread exit because there is a finite number of pthread_key_t’s available per machine.
folly/Traits.h
Implements traits complementary to those provided in <type_traits>.
Implements the IsRelocatable trait.
Implements the IsOneOf trait.
<type_traits> is the Standard type-traits library defining a variety of traits such as is_integral or is_floating_point. This helps to gain more information about a given type.
folly/Traits.h implements traits complementing those present in the Standard.
In C++, the default way to move an object is by calling the copy constructor and destroying the old copy instead of directly copying the memory contents by using memcpy(). The conservative approach of moving an object assumes that the copied object is not relocatable. The two following code sequences should be semantically equivalent for a relocatable type:
void conservativeMove(T* from, T* to) {
  new (to) T(*from);
  (*from).~T();
}

void optimizedMove(T* from, T* to) {
  memcpy(to, from, sizeof(T));
}
Very few C++ types are non-relocatable. The type defined below maintains a pointer inside an embedded buffer and hence would be non-relocatable. Moving the object by simply copying its memory contents would leave the internal pointer pointing to the old buffer.
class NonRelocatableType {
 private:
  char buffer[1024];
  char* pointerToBuffer;
  ...
 public:
  NonRelocatableType() : pointerToBuffer(buffer) {}
  ...
};
We can optimize the task of moving a relocatable type T using memcpy.
IsRelocatable
Declaring types
template <class T1, class T2>
class MyParameterizedType;
class MySimpleType;
Declaring a type as relocatable
Appending the lines below after the definition of My*Type (MyParameterizedType or MySimpleType) will declare it as relocatable:
/* Definition of My*Type goes here */
// global namespace (not inside any namespace)
namespace folly {
// defining specialization of IsRelocatable for MySimpleType
template <>
struct IsRelocatable<MySimpleType> : std::true_type {};
// defining specialization of IsRelocatable for MyParameterizedType
template <class T1, class T2>
struct IsRelocatable<MyParameterizedType<T1, T2>>
: ::std::true_type {};
}
To make it easy to state assumptions for a regular type or a family of parameterized type, various macros can be used as shown below.
Stating that a type is Relocatable using a macro
// global namespace
namespace folly {
// For a Regular Type
FOLLY_ASSUME_RELOCATABLE(MySimpleType);
// For a Parameterized Type
FOLLY_ASSUME_RELOCATABLE(MyParameterizedType<T1, T2>);
}
fbvector only works with relocatable objects. If assumptions are not stated explicitly, fbvector<MySimpleType> or fbvector<MyParameterizedType> will fail to compile due to the assertion below:
static_assert(IsRelocatable<My*Type>::value, "");
The FOLLY_ASSUME_FBVECTOR_COMPATIBLE*(type) macros can be used to state that a type is relocatable and has a nothrow constructor.
Stating that a type is fbvector-compatible using macros, i.e. relocatable with a nothrow default constructor:
// at global level, i.e. no namespace
// macro for regular type
FOLLY_ASSUME_FBVECTOR_COMPATIBLE(MySimpleType)
// macro for types having 2 template parameters (MyParameterizedType)
FOLLY_ASSUME_FBVECTOR_COMPATIBLE_2(MyParameterizedType)
Similarly, the FOLLY_ASSUME_FBVECTOR_COMPATIBLE_1(MyTypeHavingOneParameter) macro is for the family of parameterized types having 1 parameter, FOLLY_ASSUME_FBVECTOR_COMPATIBLE_3(MyTypeHavingThreeParameters) is for those having 3 parameters, and FOLLY_ASSUME_FBVECTOR_COMPATIBLE_4(MyTypeHavingFourParameters) is for those having 4 parameters.
A few common types, namely std::basic_string, std::vector, std::list, std::map, std::deque, std::set, std::unique_ptr, std::shared_ptr, and std::function, which are compatible with fbvector, are already instantiated and declared compatible with fbvector. fbvector can be directly used with any of these C++ types.
std::pair can be safely assumed to be compatible with fbvector if both of its components are.
std::is_same<T1, T2>::value can be used to test if the types T1 and T2 are the same.
folly::IsOneOf<T, T1, Ts...>::value can be used to test if the type T matches the type of one of the other template parameters, T1, T2, ...Tn. Recursion is used to implement this type trait.
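Both traits are compile-time only, as in this small sketch:
static_assert(std::is_same<int, int>::value, "int is int");
static_assert(!std::is_same<int, long>::value, "int is not long");

static_assert(folly::IsOneOf<int, char, int, long>::value,
              "int appears in the list");
static_assert(!folly::IsOneOf<float, char, int, long>::value,
              "float does not appear in the list");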
folly/small_vector.h
folly::small_vector<T, Int=1, ...> is a sequence container that implements small buffer optimization. It behaves similarly to std::vector, except until a certain number of elements are reserved it does not use the heap.
Like standard vector, it is guaranteed to use contiguous memory. (So, after it spills to the heap all the elements live in the heap buffer.)
Simple usage example:
<int,2> vec;
small_vector.push_back(0); // Stored in-place on stack
vec.push_back(1); // Still on the stack
vec.push_back(2); // Switches to heap buffer. vec
This class is useful in any of the following cases:
Short-lived stack vectors with few elements (or maybe with a usually-known number of elements), if you want to avoid malloc.
If the vector(s) are usually under a known size and lookups are very common, you’ll save an extra cache miss in the common case when the data is kept in-place.
You have billions of these vectors and don’t want to waste space on std::vector’s capacity tracking. This vector lets malloc track our allocation capacity. (Note that this slows down the insertion/reallocation code paths significantly; if you need those to be fast you should use fbvector.)
The last two cases were the main motivation for implementing it.
There are also a couple of flags you can pass into this class template to customize its behavior. You can provide them in any order after the in-place count. They are all in the namespace small_vector_policy.
NoHeap - Avoid the heap entirely. (Throws std::length_error if you would’ve spilled out of the in-place allocation.)
<Any integral type> - customizes the amount of space we spend on tracking the size of the vector.
A couple more examples:
// With space for 32 in situ unique pointers, and only using a
// 4-byte size_type.
<std::unique_ptr<int>, 32, uint32_t> v;
small_vector
// A inline vector of up to 256 ints which will not use the heap.
<int, 256, NoHeap> v;
small_vector
// Same as the above, but making the size_type smaller too.
<int, 256, NoHeap, uint16_t> v; small_vector