Saturday, 6 January 2018

Designing Async Task Dispatch Library From Scratch (Part-1)

What the series is about

  • Understanding how generators work
  • Understanding how a future object works
  • Creating a task for asynchronous execution
  • Creating a simple but almost complete tasking and networking library

What the series is _not_ about

  • Explaining the basic usage of generators. The reader is expected to have used them before.
  • Creating a library rivalling asyncio or anything else. I am only doing this to understand the whole thing better from a system programmer's perspective.

The 'yield' Keyword

Let's take the simple generator below:

def my_gen():
    ret = yield 5
    return ret

A few questions that come to my mind are:

  • What happens when the 'yield' expression gets executed?
  • What do we get as the return value of this generator-function?
  • What more is there to this generator-function other than just yielding 5?
To answer the above questions and get enlightened, we need to know how a generator-function is different from a normal function.

Difference between a generator-function and a regular function

NOTE: Generators are of iterable types. Iterables as a concept is something we would not be looking at in this post.

A few of the most important differences are listed below. There are more differences than these, but they are out of scope for the current topic.

  1. Suspension and resume capability

          Unlike a regular function, a generator-function can be suspended in the middle of its execution and resumed from the same point. That means all its state is saved before suspension so that it can be reused when the generator-function is resumed. It's the presence of the yield expression inside a function that provides this capability. The yield is basically the suspension and resumption point.

  2. Calling a generator-function returns a generator

          Unlike a regular function, which returns some value (or nothing) on execution, a generator-function returns a generator object. This generator object is what we will use or abuse to control how the generator-function should execute.

  3. Suspension and resumption of a generator-function can be controlled

          As said above, using the generator object one can control how to resume it and what value to send to it from the client side. We will see a lot more on this in the next section.

Deconstructing a generator-function

Lets take this hello-world example:

def my_gen():
    val = yield "hello, world!"
    return val

There are two ways to work with this generator-function. Let's see the hard way first:

# Prime the generator
g = my_gen()

# Get the yielded value
res = next(g)
print ("my_gen yielded result: {}".format(res))

try:
    next(g)
except StopIteration as e:
    print ("my_gen generator function returned: {}".format(e.value))

This would print:

my_gen yielded result: hello, world!
my_gen generator function returned: None

Hmm... what's going on? The assignment clearly did not work, as the return value is printed as "None"!
Now is a good time to explain controlling a generator-function using the generator object.

A generator instance has 3 methods defined on it:

  • next
  • send
  • throw

The next method proceeds to the next 'yield' statement present in the function and executes whatever expression is on its right side. The result of that expression is passed to the caller of the next method, not to the variable on the left-hand side of the assignment statement.

The throw method is what the client/scheduler program will use to raise exceptions inside the generator-function.

The send method is what makes the generator behave as a coroutine, thanks to PEP-342.
It is similar to next in that it proceeds to the next yield statement in the function, but, as the name suggests, it can send in a value which becomes the result of the yield expression and hence gets assigned to the left-hand side of the assignment statement.
That means it's completely under my control what the value of the variable val gets set to? Yup!!

See this in action:

g = my_gen()
next(g) # Yields "hello, world!"
try:
    g.send(42) # Sends in a value 42 which gets assigned to val and does next(g)
except StopIteration as e:
    print ("my_gen returned: {}".format(e.value))

It prints:
my_gen returned: 42

NOTE: In the above example I am using the free function next instead of the member function. The free function operates on an iterator (which the generator is) and calls its __next__ method.

Yeah, fetching the return value of a generator-function is pretty weird. Agreed. It is set as the value field of the StopIteration exception.  Remember this.

Isn't this freakishly powerful? In fact, this ability to "send" values to the generator is what makes it a coroutine, and it is what is used for creating a tasking system like asyncio.

So, let's reiterate what is happening with the above simple generator-function:

  1. Prime the generator by calling the function first. That gives us a generator object to work with.
  2. Calling next on the generator object will run to the next yield statement in the function and return the value of the expression on its right side. At this point the function is suspended. It needs to be resumed by calling any of the 3 methods next/send/throw.
  3. In this suspended state, control is back with the client. We want to pass some value to the generator. If the yield is used in an assignment statement, then the sent value gets assigned to it; otherwise it just behaves like a regular next call.

Adding a pictorial view if it helps at all:



As can be seen in the above picture, I have divided the generator function into two blocks.
The client/scheduler code dealing with the generator does below things:
a. Prime the generator function to get the generator object.

b. Do a next on it, which calls the __next__ method of the generator, which in turn goes to the first yield statement inside the generator function and returns the output of the RHS expression, None if no RHS expression is present. In our example, it returns 5.

c. Now we want to set the value of res to something meaningful. This is done by the send method, which places the value 42 on res and also calls the __next__ method of the generator.

d. Since no more yields are present, it will throw a StopIteration exception having the return value of the generator function set in its value member.


I still don't see the point

So, how are all these freaky things going to make me any more useful as a programmer, you ask.
Let's see a small example of some database driver that you wrote. The task is to fetch some rows from the DB and do something with them.

The client of your basic DB driver would probably write something like this:

class DDBResultFetcher(object):
    def __init__(self, rows_cb):
        self.instance = None
        self.is_connected = False
        self._cb = rows_cb
    
    def connect(self):
        if not self.instance:
            self.instance = MyDBDriver.get_correct_instance()
        
        if self.connected:
            return
        
        self.instance.connect()
        self.is_connected = True
    
    @property
    def connected(self):
        return self.is_connected
    
    def on_rows_cb(self, rows):
        self._cb(rows)
    
if __name__ == "__main__":
    
    def manage_rows(rows):
        for row in rows:
            #Do something with the rows
            ...
            
    db_f = DDBResultFetcher(manage_rows)
    db_f.connect()
    
    db_f.run()
    
    ##Additional code to check if the results have been completely fetched
    ##or not
    ...
           
Even though I have not added many error or exception checks, and many cases are still unhandled, the code a user has to write is already quite a lot, error prone, and can become unmanageable.

Here is how a generator/coroutine based DB driver could make my code look:

def fetch_results_from_db():
    instance = MyDBDriver.get_correct_instance()
    
    if not instance.connected():
        instance.connect()
    
    while True:
        rows = yield instance
        if not rows:
            break
        # Do something with the rows
        ...

My user code is so much simpler now! The onus is now on the library writer to provide mechanisms to interact with this generator-function, using the yield suspension and resumption to send in the rows obtained from the DB.

How to do that ? Wait for the next instalment of the series...

NOTE: My intention here is not to propagate the idea that "free functions are better than classes". No, it's just an example.


Bonus Section: But how does it all work behind the scenes?

We will look here at some disassembled output and some C code to gain a little better understanding of what is going on. It's for those people who can't sleep without seeing a bit of assembly or C/C++ code.




Or (the easier way)

In [7]: dis.dis(gen_obj.gi_code)
  2           0 LOAD_CONST               1 (5)
              3 YIELD_VALUE
              4 STORE_FAST               0 (x)

  3           7 LOAD_FAST                0 (x)
             10 RETURN_VALUE


There are lots of resources on the internet where you can learn what the above output means, but I will briefly mention the significance of each column.
As can be seen, there are 5 columns:

  • Column 1 : The line number in the Python source file which this group of bytecode instructions represents.
  • Column 2 : The offset in the byte code (gen_obj.gi_code.co_code) where the opcode is present.
  • Column 3 : The opcode name.
  • Column 4 : The index into the attributes of the code object. It's a bit more involved, but out of scope for this discussion.
  • Column 5 : The constants and the variable names resolved from the opcode's argument. This is only for helping mere humans.

Now, with this much information at hand, let's take a look at the disassembly output again.
  • Load a constant value '5'.
  • Yield. Here is where the current frame is saved along with all its context, which lives in the gi_frame object.
  • On issuing next or send, resume the saved frame and store the sent value ('None' in case of next and send without any argument) into the variable named 'x'.
  • Read the value in variable 'x' and return it.

More Fun with GDB

Now we know that the stack frame for a generator function is persisted on heap so that it can be loaded and resumed again.

NOTE: From whatever I could understand from the CPython code, all stack frames are allocated on the heap. But of course, nothing should be written relying on that assumption.

Now we will see what happens when I send a value to a generator. The things we expect to happen are:
  • Load the frame associated with the generator.
  • Send the value ( ? )
  • Run the frame till next yield or throw StopIteration exception
  • In case of yield, go back to the previous frame of execution.
For this, I have a put a breakpoint on _PyGen_Send function which can be found in Objects/genobject.c source of CPython.  It internally calls gen_send_ex which is present in the same source file.

The source code we are debugging:

def my_gen():
    x = yield

g = my_gen()
next(g)
g.send(42)


Lets see what happens in gen_send_ex. We will only see the relevant portions of the code.


static PyObject *
gen_send_ex(PyGenObject *gen, PyObject *arg, int exc)
{
    PyThreadState *tstate = PyThreadState_GET();
    PyFrameObject *f = gen->gi_frame;
    PyObject *result;
     .
     .
     .
}

We are getting the current thread state and the frame associated with the generator object.


    if (f->f_lasti == -1) {
        if (arg && arg != Py_None) {
            char *msg = "can't send non-None value to a "
                        "just-started generator";
            if (PyCoro_CheckExact(gen))
                msg = "can't send non-None value to a "
                      "just-started coroutine";
            PyErr_SetString(PyExc_TypeError, msg);
            return NULL;
        }
    } else {
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

Here we are checking the value of the last executed instruction on the generator function frame (f->f_lasti). If it's set to -1 it means the generator has not moved to its first yield statement.

Let's check the state of our generator stack frame (which is the variable 'f'):

(gdb) p *f
$18 = {
  ob_base = {
    ob_base = {
      _ob_next = 0x7fcf85618d48,
      _ob_prev = 0x7fcf855825a8,
      ob_refcnt = 1,
      ob_type = 0x88cc20 
    },
    ob_size = 20
  },
  f_back = 0x0,
  f_code = 0x7fcf86660100,
  f_builtins = 0x7fcf86784b48,
  f_globals = 0x7fcf8673ac98,
  f_locals = 0x0,
  f_valuestack = 0x2916478,
  f_stacktop = 0x2916478,
  f_trace = 0x0,
  f_exc_type = 0x0,
  f_exc_value = 0x0,
  f_exc_traceback = 0x0,
  f_gen = 0x7fcf855825a8,
  f_lasti = 3,
  f_lineno = 1,
  f_iblock = 0,
  f_executing = 0 '\000',
  f_blockstack = {{
      b_type = -875836469,
      b_handler = -875836469,
      b_level = -875836469
    } },
  f_localsplus = {0x0}
}

Check the f_lasti and f_lineno fields: f_lasti = 3 says the last executed instruction was the YIELD_VALUE at bytecode offset 3, which is true; we are suspended at the yield statement, the first line of our generator function my_gen.

Since our generator has already reached the yield (because we already executed next on the generator before sending any value), let's focus on the else portion:

    ...
    } else {
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

Hmm... it's just putting the passed value on top of the stack! Now that should result in a STORE, right? Let's check the disassembly:

In [3]: dis.dis(my_gen)
  2           0 LOAD_CONST               0 (None)
              3 YIELD_VALUE
              4 STORE_FAST               0 (x)
              7 LOAD_CONST               0 (None)
             10 RETURN_VALUE

Yup, that looks about right and as expected. The sent value gets stored in the variable x.

From here, Python should execute the current generator frame and do some bookkeeping to get back to the frame which called the generator's next/send once the generator yields again or finishes.

    /* Generators always return to their most recent caller, not
     * necessarily their creator. */
    Py_XINCREF(tstate->frame);
    assert(f->f_back == NULL);
    f->f_back = tstate->frame;

    gen->gi_running = 1;
    result = PyEval_EvalFrameEx(f, exc);
    gen->gi_running = 0;

The caller's frame (tstate->frame) is saved as the generator frame's f_back, and PyEval_EvalFrameEx will continue with the execution of the generator frame.

I have jumped through lots of details here but this should be a good start for anyone interested in digging deeper.

Saturday, 8 October 2016

What you need _not_ know about std::function - Part 3

Performance of std::function

In the previous post we saw the implementation of std::function in libstdc++ (g++-6.2) in detail and discussed briefly the libc++ implementation of the same. And now, it's time to look at what the performance stats have to say.

The platforms and the compilers used for running the performance tests:
  1. Mac OS Yosemite  - clang++ 3.7
  2. Ubuntu 16.04 VM  - g++-6.2
  3. Ubuntu 16.04 VM  - clang++-3.8 (with libc++)
For all tests, the code was compiled with O2 level optimization settings.

I will be trying to benchmark std::function against:
  1. Virtual function call
  2. C++11 version of impossibly fast delegate (From StackExchange)
The codes are available at my github repo.

NOTE:
  1. I would be using 'volatile' in the benchmarking code and stronger data dependency inside tight loops. The use of volatile is to prevent the compiler from optimizing away the code altogether.
  2. I am including 'iostream' header. :)
  3. I wasn't really planning to benchmark the 'fast delegate' until I saw THIS reddit post. That left me wondering why std::function was not used.

A note about "Impossibly Fast Delegate"

The implementation simply stores the callable object (pure function pointers and member function pointers) using template hackery and invokes it whenever the delegate's call operator is invoked, nothing more. This scheme would not work for people who want:
  1. The stored callable to remain valid beyond the lifetime of the class object whose member function needs to be invoked.
  2. Other kinds of callables. The original implementation did not work for any other type of callback; the C++11 implementation does, but in a naive way, i.e. it allocates the handler on the heap. There is no SBO (small buffer optimization) done.
So, one must be careful while comparing the fast delegate with std::function, for the simple reason that std::function can do more things than the "fast delegate", and it certainly will have additional costs if what you want is point #1.

Also, I found a post on the web doing some completely ridiculous benchmarking of boost::function against the fast delegate. The author went on creating a new instance of boost::function from a lambda for every iteration, but did not do the same for the fast delegate! I am certainly not planning to make such naive mistakes. I will be keeping the tests as simple as possible to avoid any kind of oversight.

First Round

Here, I am just going to call the callbacks 10,000,000 times and measure the total time taken for all 3 on the above mentioned platforms.

Code for Virtual function call:
#include <iostream>
#include <chrono>

constexpr static const int ITER_COUNT = 10000000;

class Base {
public:
  virtual volatile int fun_function(volatile int) = 0;
};

class Derived final: public Base {
public:
  volatile int fun_function(volatile int i) {
    return ++i;
  }
};

int main(int argc, char* argv[]) {
  Derived d;
  Base * b = nullptr;
  b = &d;
  volatile int v = 0;

  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = b->fun_function(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';
}


Code for std::function:

#include <iostream>
#include <chrono>
#include <functional>

constexpr static const int ITER_COUNT = 10000000;

volatile int fun_function(volatile int v)
{
  return ++v;
}

int main() {
  std::function<volatile int(volatile int)> f(fun_function);

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = f(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout <<
    std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';

  return 0;
}

Code for "Impossibly fast delegate":

#include <iostream>
#include <chrono>
#include "../impl_fast_delegate.hpp"

constexpr static const int ITER_COUNT = 10000000;

volatile int fun_function(volatile int v)
{
  return ++v;
}

int main() {
  delegate<volatile int (volatile int)> del(fun_function);

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = del(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout <<
    std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';
  return 0;
}


And the results are...


My analysis:
  1. Hmm... in all cases the virtual function wins by a massive margin, except for clang on Ubuntu.
  2. The fast delegate is a bit faster on both Mac and Ubuntu with clang, but not with g++-6.2, where std::function wins against the fast delegate. For now, I will consider both std::function and the fast delegate equally fast until we do some more micro-benchmarking.
  3. The results for the virtual function on Ubuntu/g++ are very surprising. What makes it do the job so much faster? Let's have a look at its assembly (with my comments):
        movl    $10000000, %edx  // Setting up the counter value in edx
        movq    %rax, %rbx
        .p2align 4,,10
        .p2align 3
.L3:
        movl    4(%rsp), %eax    // Load the volatile 'v' into eax
        addl    $1, %eax         // Add 1 to the value
        subl    $1, %edx         // Subtract the counter
        movl    %eax, 4(%rsp)    // Save the new value to its actual place
        jne     .L3              // Loop

Well, it seems the compiler was able to devirtualize the virtual call and further optimize it to make all the computations without calling any function!! That's brilliant, but it means more work for me.

Let's change the main function for the virtual function code to:
int main(int argc, char* argv[]) {
  Derived d;
  Dummy u;
  Base * b = nullptr;

  if (*argv[1] == '0') {
    b = &d;
  } else {
    b = &u;
  }

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = b->fun_function(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';
}

I simply added one more derived class (Dummy) and made the choice of which one the base class pointer points to a runtime decision based on a command-line argument.

Let's rerun the tests and see if anything changes:


Yeah, we now have a reasonable measurement for the virtual function on the Ubuntu/g++ combo. But the general analysis is:

Virtual Function Call << std::function <~ Fast Delegate


What makes std::function so much slower than Virtual Function ?

Let's have a look at their corresponding assembly code (on Ubuntu/g++-6.2). (I just hope my comments on the assembly instructions are not way off.)

// Assembly for the calling part for Virtual Function
        movq    %rax, %r12
        .p2align 4,,10
        .p2align 3
.L5:
        movq    0(%rbp), %rax  // rax probably points to the VT
        movl    12(%rsp), %esi // Move volatile 'v' to esi
        movq    %rbp, %rdi     // Move the 'this' pointer to rdi
        call    *(%rax)
        subl    $1, %ebx
        movl    %eax, 12(%rsp) // The new result stored in eax is stored back.
        jne     .L5


// Assembly for the calling part for std::function
        movq    %rax, %rbp
        .p2align 4,,10
        .p2align 3
.L11:
        cmpq    $0, 32(%rsp)   // Test if callable object is stored
        movl    8(%rsp), %eax  // Read the volatile 'v' and store it in eax
        movl    %eax, 12(%rsp) // Store the volatile in eax at a different offset from stack pointer
        je      .L27           // Jump to exception handling code
        leaq    12(%rsp), %rsi // Store the volatile value in rsi
        leaq    16(%rsp), %rdi // Store the Any_data union address in rdi
.LEHB0:
        call    *40(%rsp)      // Call the function callable target
        subl    $1, %ebx
        movl    %eax, 8(%rsp)
        jne     .L11


As can be seen, std::function requires more instructions to be executed before making the call to the target callable: 6 instructions against 3 for the virtual function.

One thing to note is that, for the std::function case, there is a test against zero in each iteration. This comes from the below function:
    template<typename _Res, typename... _ArgTypes>
    _Res
    function<_Res(_ArgTypes...)>::
    operator()(_ArgTypes... __args) const
    {
      if (_M_empty())
        __throw_bad_function_call();
      return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
    }

Let's see what happens if we avoid doing this check.


Well, not much difference, or no difference at all, after removing the check! I am sure I have read somewhere that removing the check gives a speedup. I am not sure with which compiler or on which platform, but it surely has no effect on Ubuntu using g++-6.2 with O2 optimization.

Let's check the new assembly code just for the hell of it:
.L10:
        movl    8(%rsp), %eax
        leaq    12(%rsp), %rsi
        leaq    16(%rsp), %rdi
        movl    %eax, 12(%rsp)
        call    *40(%rsp)
        subl    $1, %ebx
        movl    %eax, 8(%rsp)
        jne     .L10

As can be seen, we no longer have the instructions for the test against zero and for throwing the exception. So, I am assuming it's the two leaq instructions which are affecting the performance of std::function against a virtual function call.

!!!!  I wonder if the compiler could have optimized the 2 leaq instructions outside the loop, as they are not changed at all. That would make the instruction cache more hot.  !!!!


Second Round 

Till now, we saw benchmarks based on just using the chrono library and analyzing the assembly. Let's pimp up the setup. I will be using the google benchmark library (on both Mac and Ubuntu) and perf (only on Ubuntu).

The tests are:
  1. BM_std_function_basic: Same test which we did in first round.
  2. BM_std_function_heavy: Test which disables SBO i.e with a functor having size big enough which cannot be stored in the internal buffer.
  3. BM_std_function_poly_cb: Test involving member function callback. Kind of similar to what has been done here.
  4. BM_virtual_function_basic/48: Same test which we did in first round.
  5. BM_imp_fast_delegate_basic: Same as #1 but using fast delegate.
  6. BM_imp_fast_delegate_heavy: Same as #2 but using fast delegate.
  7. BM_imp_fast_delegate_poly_cb: Same as #3 but using fast delegate.

Result on Mac Os using clang++-3.7:



Hmm... std::function seems to beat the 'fast delegate' in both the #2 and #3 tests. Let's look at what is happening on Ubuntu...

Result on Ubuntu using g++-6.2:




Result on Ubuntu using clang++-3.8 using libc++:




Hmm... the g++ one certainly did optimize out tests #3 and #7 (or maybe not), but I am not interested in that now. What particularly caught my attention is the 11 ns difference in test #2 between g++ and clang++.

Other than that, from the above basic test results, I see no reason to go for another delegate implementation instead of using std::function. There may be other interesting test cases to try out, but this is all I have time for as of now :( (This is a way of indirectly asking the reader to help contribute more tests :) ).


To go even further beyond....

Can't help but to post this link when the heading is something like this...



Now, let's see why we have an 11 ns difference for the 'heavy' test. By the way, I used 'perf' as well to get something out of it, but there too I ended up annotating the callbacks to look at the assembly. So, assembly is all I have to figure out this difference.

Assembly for the clang version:

.LBB2_3:                                
        movq    8(%rbx), %rax
        leaq    1(%rax), %rcx
        movq    %rcx, 8(%rbx)
        cmpq    72(%rbx), %rax
        jae     .LBB2_8
# BB#4:                                 
        movl    $40, %edi
        callq   _Znwm         // This is where it calls the operator new
        movq    $_ZTVNSt3__110__function6__funcI10BigFunctorNS_9allocatorIS2_EEFbPKcEEE+16, (%rax)
                              // Store the Virtual Table in rax
        movq    %rax, 32(%rsp)
        xorps   %xmm0, %xmm0
        movups  %xmm0, 20(%rax)
        movl    $0, 36(%rax)
        movq    %r15, 8(%rax)
        movl    $1735289202, 16(%rax)   
        #APP
        #NO_APP
        movq    32(%rsp), %rdi
        cmpq    %r14, %rdi
        je      .LBB2_5
# BB#6:                                 
        testq   %rdi, %rdi
        jne     .LBB2_7
        jmp     .LBB2_1
        .align  16, 0x90
.LBB2_5:                                
        movq    (%rsp), %rax // Prepare for the function call i.e make the arguments ready in register
        movq    %r14, %rdi
        callq   *32(%rax)    // Call the function, remember rax is now pointing to VT ?
        jmp     .LBB2_1
.LBB2_2:                                
        movq    %rbx, %rdi
        callq   _ZN9benchmark5State16StartKeepRunningEv
        jmp     .LBB2_3



Assembly for g++ version:

.L95:
        movq    8(%rbx), %rax
        cmpq    72(%rbx), %rax
        leaq    1(%rax), %rdx
        movq    %rdx, 8(%rbx)
        jnb     .L111
        movl    $32, %edi
        movq    $0, 32(%rsp)
.LEHB3:
        call    _Znwm          // Invoke the operator new
.LEHE3:
        leaq    8(%rsp), %rsi
        leaq    16(%rsp), %rdi

        // !! This is the start where its trying to zero out
        // the buffer inside the functor          

        movb    $0, (%rax)
        movb    $0, 1(%rax)
        movb    $0, 2(%rax)
        movb    $0, 3(%rax)
        movb    $0, 4(%rax)
        movb    $0, 5(%rax)
        movb    $0, 6(%rax)
        movb    $0, 7(%rax)
        movb    $0, 8(%rax)
        movb    $0, 9(%rax)
        movb    $0, 10(%rax)
        movb    $0, 11(%rax)
        movb    $0, 12(%rax)
        movb    $0, 13(%rax)
        movb    $0, 14(%rax)
        movb    $0, 15(%rax)
        movb    $0, 16(%rax)
        movb    $0, 17(%rax)
        movb    $0, 18(%rax)
        movb    $0, 19(%rax)
        movb    $0, 20(%rax)
        movb    $0, 21(%rax)
        movb    $0, 22(%rax)
        movb    $0, 23(%rax)
        movb    $0, 24(%rax)
        movb    $0, 25(%rax)
        movb    $0, 26(%rax)
        movb    $0, 27(%rax)
        movb    $0, 28(%rax)
        movb    $0, 29(%rax)
        movb    $0, 30(%rax)
        movb    $0, 31(%rax)

        movq    %rax, 16(%rsp)
        movq    $_ZNSt17_Function_handlerIFbPKcE10BigFunctorE9_M_invokeERKSt9_Any_dataOS1_, 40(%rsp)
        movq    $_ZNSt14_Function_base13_Base_managerI10BigFunctorE10_M_managerERSt9_Any_dataRKS3_St18_Manager_operation, 32(%rsp)
        movq    $.LC2, 8(%rsp)
        call    _ZNSt17_Function_handlerIFbPKcE10BigFunctorE9_M_invokeERKSt9_Any_dataOS1_ // Call the target callable
        movq    32(%rsp), %rax
        testq   %rax, %rax
        je      .L101
        leaq    16(%rsp), %rsi
        movl    $3, %edx
        movq    %rsi, %rdi
        call    *%rax         // Another call is part of destructor, it calls the _M_manager() function
        cmpb    $0, (%rbx)
        jne     .L95
.L110:
        movq    %rbx, %rdi
.LEHB4:
        call    _ZN9benchmark5State16StartKeepRunningEv
        jmp     .L95


Do you see all those per-byte memsets to 0? This is the code resetting the 32 byte buffer inside the functor. I never asked the code to do that anywhere, so why is g++ hell bent on doing it? Is this a bug in g++? This might be the reason for the additional 10-11 ns difference. Well, I will try to open a bug report anyway.

Another observable difference from the assembly code is that each iteration executes 2 'call' instructions. One is for the actual target function call and the other is for calling the '_M_manager' function inside the destructor, which, based upon the operation passed, does the required action. This is a fairly costly operation as well. So, if you are using g++'s libstdc++ implementation of std::function, watch out for the above 2 costs.

Sentinel

And that is it! I am done with my intended plan of writing this 3 part blog post about std::function. And I have some takeaways:
  1. Do not shy away from using raw pointers to functions or member function pointers if they can solve your specific problem. It is highly unlikely that you need a generic callable everywhere. If you are not at all worried about performance, then go ahead and use std::function all over the place (responsibly).
  2. Before considering any other such 'fast' delegates, consider using std::function. If in doubt, measure. If you see anything wrong in my measurements, do let me know in the comments; I will try to add them to my tests and see the result.
  3. If you just want to pass a callback to another function, which would then be executed immediately within its call stack, then probably something like the below would be more efficient:
template <typename Handler, typename... Args>
auto callback_hndler(Handler&& h, Args&&... args) {
  return std::forward<Handler>(h)(std::forward<Args>(args)...);
}

callback_hndler([](auto a, auto b) { something(); }, aa, bb);

         In such cases, making the signature of 'callback_hndler' take an 'std::function' instead would be a bad decision.
ASIO does something similar, in a broad sense, for its async operations, but then it goes on to allocate the passed handler on the heap, since it has to perform the operation on a different call stack. One can use std::function in such scenarios, as it does the lifetime management of the callback for you.

There are few more points/examples that I would have liked to give, but this much is probably already more than enough for knowing about std::function.  Guess what ? By this time Goku would have probably already transformed into SSJ3 or is he still screaming ? :)


Sunday, 4 September 2016

What you need _not_ know about std::function - Part 2


  • Part-1 : Getting to know the basics
  • Part-3 : Performance benchmarks


In the previous post we saw some basic type traits and metafunctions that are used in the implementation of std::mem_fn and std::function. The main goal of this post is to understand the implementation of std::function in libstdc++.

Also, I would have liked to talk about things like:
  • The performance penalty of using std::function.
  • When to use std::function.
  • Some performance benchmarks.
I will leave the above topics as the centre of discussion in the third part of the series (This was supposed to be the sentinel post as per my original plan).  Discussing them here will just make this post a torture to read.

A naive implementation

Library development in C++ is tough. It becomes even more intricate when the library is supposed to be generic. The most difficult, or rather detail-oriented, part is the type system itself. Dealing with types together with their qualifications results in a lot of metaprogramming boilerplate, and if you forget something it can lead to some nasty UB.

So, let's first try to implement our own but rather very simple version of std::function. I will be skipping a-great-many required details to keep the code small.

The requirements

1. Should be able to invoke any callable target.
2. The underlying type of the callable object must be hidden, i.e. the only type information that should be visible is the call signature.
3. The target callable object must be valid as long as the function object is valid.

I have kept the requirements for my implementation to a bare minimum, to keep away from the metaprogramming hackery which we will be seeing anyway while going through the implementation of std::function.

The first attempt

This version just implements the basic interface of Function and the specialization for function pointers. In the later versions (or attempts), the interface of Function remains more or less the same; only the specializations required to handle the other kinds of callable objects are added.

template <typename> class Function;

template <typename Ret, typename... Args>
class Function<Ret(Args...)>
{
public:
  using implementation = detail::function_impl_base<Ret, Args...>;

  //Default constructor
  Function() = default;

  template <typename Callable>
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<Callable>>(
                   std::move(f), detail::constructor_tag()))
  {}

  // Copy constructor
  Function(const Function& other)
  {
    this->impl_base_.reset(other.impl_base_->clone());
  }

  // Copy Assignment
  Function& operator=(Function other)
  {
    this->impl_base_.reset(other.impl_base_.release());
    return *this;
  }

  template <typename Callable>
  Function& operator=(Callable cb)
  {
    this->impl_base_ =
      std::make_unique<detail::func_impl<Callable>>(
                std::move(cb), detail::constructor_tag());
    return *this;
  }

  Ret operator()(Args&&... args)
  {
    assert (impl_base_.get());
    return (*impl_base_)(std::forward<Args>(args)...);
  }

private:
  std::unique_ptr<implementation> impl_base_ = nullptr;
};

Nothing too fancy. We have the same signature style as std::function, i.e. the type Function is not aware of the type of the call target. That means that for a function pointer, member function pointer or functor, the syntax of the Function object declaration is exactly the same (provided the call signature is the same, of course), even though the actual types of these callable objects differ. This is achieved using a technique called 'Type Erasure'.
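For reference, std::function itself shows exactly this uniformity: one declared type serves every kind of callable. A short sketch (standard library only; the helper name apply_all is mine):

```cpp
#include <functional>

int twice(int x) { return 2 * x; }

struct Doubler {
    int operator()(int x) const { return 2 * x; }
};

// The declared type std::function<int(int)> stays the same no matter
// which kind of callable is assigned to it.
int apply_all(int x) {
    std::function<int(int)> f = twice;        // function pointer
    int a = f(x);
    f = Doubler{};                            // functor
    int b = f(x);
    f = [](int v) { return 2 * v; };          // lambda
    int c = f(x);
    return a + b + c;
}
```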

Type Erasure

I am not intending to explain type-erasure in detail here. There is already a lot of content on it floating around the web. boost::any is one of the most basic examples of type-erasure. shared_ptr and std::function are the others in the standard library that I know of that make use of this technique.

Type-erasure in its basic form is implemented by making use of polymorphism and thus requires dynamic memory allocation. This is how my implementation of Function does the type-erasure.
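To make the polymorphism-based idea concrete, here is a minimal boost::any-style eraser (a hypothetical sketch of mine, not the article's Function): the concrete type T is known only inside the templated constructor and survives behind a virtual interface.

```cpp
#include <memory>
#include <utility>

class ErasedValue {
    struct Concept {                        // type-agnostic interface
        virtual ~Concept() = default;
        virtual std::unique_ptr<Concept> clone() const = 0;
    };
    template <typename T>
    struct Model final : Concept {          // one Model per erased type
        explicit Model(T v) : value(std::move(v)) {}
        std::unique_ptr<Concept> clone() const override {
            return std::unique_ptr<Concept>(new Model(value));
        }
        T value;
    };
    std::unique_ptr<Concept> self_;

public:
    template <typename T>
    ErasedValue(T v) : self_(new Model<T>(std::move(v))) {}
    ErasedValue(const ErasedValue& other) : self_(other.self_->clone()) {}

    // Getting the value back requires naming the exact type again;
    // dynamic_cast verifies it at run time.
    template <typename T>
    T* get() const {
        auto* m = dynamic_cast<Model<T>*>(self_.get());
        return m ? &m->value : nullptr;
    }
};
```

Note that the greedy templated constructor here has the same problem the article solves with constructor_tag: it can shadow the copy constructor for non-const lvalues.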

In the above code function_impl_base is the base class defining the interface. The derived classes implementing the specializations for different types of Callable types must implement the functions provided by the interface.

namespace detail {

  struct constructor_tag {};

  template <typename Ret, typename... Args>
  class function_impl_base
  {
  public:
    virtual Ret operator()(Args&&... args) const = 0;
    virtual function_impl_base* clone() = 0;
    virtual ~function_impl_base() {}
  };

  template <typename T> class func_impl;

  // Specialization for Function pointers
  template <typename Ret, typename... Args>
  class func_impl<Ret(*)(Args...)> final : public function_impl_base<Ret, Args...>
  {
  public:
    using callable_t = Ret(*)(Args...);

    /*
     * This is a greedy constructor. constructor_tag is just used so as
     * to not interfere with regular copy-constructor.
     */
    template <typename F>
    func_impl(F&& callable, constructor_tag): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;

    func_impl& operator=(const func_impl&) = default;

    Ret operator()(Args&&... args) const
    {
      return callable_(std::forward<Args>(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

}


Let's now look at the constructor of Function in a bit more detail:
template <typename Callable>
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<Callable>>(std::move(f), detail::constructor_tag()))
  {}

The constructor is a template constructor and is the only part of the Function class (let's forget about the assignment and copy constructors for the time being) which actually knows the type of the callable object. This knowledge is lost as soon as the constructor exits. But we need to preserve this information in some form, right? This is where inheritance comes into the picture: the concrete type is captured in the derived class func_impl, and access to it is preserved behind the virtual interface. This is exactly the purpose of function_impl_base.
The class func_impl<Ret(*)(Args...)>  inherits from function_impl_base and implements the call operator.

The final attempt

Not much difference, only added specialization for member function pointer and functors:

namespace detail {

  template <typename Ret, typename... Args>
  class function_impl_base
  {
  public:
    virtual Ret operator()(Args&&... args) = 0;
    virtual function_impl_base* clone() = 0;
    virtual ~function_impl_base() {}
  };

  template <typename Signature, typename Functor> class func_impl;

  // Specialization for Function pointers
  template <typename Ret, typename... Args>
  class func_impl<Ret(Args...), Ret(*)(Args...)> final
            : public function_impl_base<Ret, Args...>
  {
  public:
    using callable_t = Ret(*)(Args...);

    // The need for constructor tag removed in this version
    func_impl(callable_t callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;

    func_impl& operator=(const func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return callable_(std::forward<Args>(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

  // Specialization for Functors
  template <typename Functor, typename Ret, typename... Args>
  class func_impl<Ret(Args...), Functor>
            : public function_impl_base<Ret, Args...>
  {
  public:
    using callable_t = Functor;

    func_impl(Functor callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;
    func_impl& operator=(const func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return callable_(std::forward<Args>(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

  // Specialization for Member function pointers
  // Leveraging the use of mem_fn
  template <typename Class, typename Member, typename Ret, typename... Args>
  class func_impl<Ret(Args...), Member Class::*>
            : public function_impl_base<Ret, Args...>
  {
  public:
    using callable_t = Member (Class::*);

    func_impl(callable_t callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;
    func_impl& operator=(const func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return call(std::forward<Args>(args)...);
    }

    template <typename ClassType, typename... Params>
    Ret call(ClassType obj, Params&&... args)
    {
      return std::mem_fn(callable_)(obj, std::forward<Params>(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

}

template <typename> class Function;

template <typename Ret, typename... Args>
class Function<Ret(Args...)>
{
public:
  using implementation = detail::function_impl_base<Ret, Args...>;
  using call_signature = Ret(Args...);

  //Default constructor
  Function() = default;

  // Copy constructor
  Function(const Function& other)
  {
    this->impl_base_.reset(other.impl_base_->clone());
  }

  // Copy Assignment
  Function& operator=(Function other)
  {
    this->impl_base_.reset(other.impl_base_.release());
    return *this;
  }

  template <typename Callable>
  Function& operator=(Callable cb)
  {
    this->impl_base_ =
      std::make_unique<detail::func_impl<call_signature, Callable>>(std::move(cb));
    return *this;
  }

  // constructor
  template <typename Callable>
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<call_signature, Callable>>(std::move(f)))
  {
  }

  Ret operator()(Args&&... args)
  {
    assert (impl_base_.get());
    return (*impl_base_)(std::forward<Args>(args)...);
  }

  explicit operator bool() const noexcept
  {
    return impl_base_ != nullptr;
  }

private:
  std::unique_ptr<implementation> impl_base_ = nullptr;
};
  

The above code can be accessed from HERE.

Issues with the custom implementation

1. Memory allocation. Dynamic memory allocation is costly. It is probably OK if you know you are not going to create many instances of the Function object and pass them around, but it's a big no-no in performance critical code. In some cases this can be resolved by doing SBO (Small Buffer Optimization), which is what almost all implementations do.

2. No type checking at all. Since the type of the actual callable object gets erased in the constructor and assignment operators, it is better to check whether the user is using the function object correctly. If not, it should be a compile time error rather than leading to subtle undefined behaviour in the code.

3. Overhead of a virtual function call plus that of an additional call to the callable object. This can be pretty bad if you are going to call your implementation functions via Function.

4. The cost of generic programming. This is my personal opinion: a customized solution can be made faster than the above Function implementation, but when you have to expose something like Function as part of your interface to client code, then this is the way to do it and improve upon it. This is something I would like to touch upon in the final part of this series.

The Implementation of std::function 

In this section we will see the implementation of std::function in libstdc++ 6.1. We will make use of some of the type traits discussed in Part-1 and will see some new ones here. Let's first look at some concepts and tricks individually and connect them up later.

Location Invariance

The metafunction for this check is:

template <typename T>
struct is_location_invariant
     : is_trivially_copyable<T>::type
{ };

This is one of the checks made to determine whether SBO can be performed on the passed callable type. It is mainly relevant for functors, since function pointers and member function pointers are trivially copyable anyway. So, make sure your functor satisfies the TriviallyCopyable concept.

For the lazy people (from cppreference) :
  • Every copy constructor is Trivial or deleted (since C++14)
  • Every move constructor is Trivial or deleted (since C++14)
  • Every copy assignment operator is Trivial or deleted (since C++14)
  • Every move assignment operator is Trivial or deleted (since C++14)
  • at least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted
(since C++14)
  • Trivial non-deleted (since C++14) destructor
This implies that the class has no virtual functions or virtual base classes.
Scalar types and arrays of TriviallyCopyable objects are TriviallyCopyable as well, as well as the const-qualified (but not volatile-qualified) versions of such types.
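As a quick check, std::is_trivially_copyable makes the distinction easy to see. The two functors below are illustrative examples of mine:

```cpp
#include <string>
#include <type_traits>

// Plain data members and defaulted special members: trivially copyable,
// so it passes the location-invariance check.
struct SmallFunctor {
    int factor;
    int operator()(int x) const { return factor * x; }
};

// A std::string member gives this functor a non-trivial copy constructor
// and destructor, so it fails the check.
struct BigFunctor {
    std::string name;
    int operator()(int x) const { return static_cast<int>(name.size()) + x; }
};

static_assert(std::is_trivially_copyable<SmallFunctor>::value,
              "satisfies the location-invariance requirement");
static_assert(!std::is_trivially_copyable<BigFunctor>::value,
              "fails the location-invariance requirement");
```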


No-Copy Types

This is nothing but the set of callable types that do not need to be copied:

  class Undefined_class;

  union Nocopy_types
  {
    void*       M_object;
    const void* M_const_object;
    void (*M_function_pointer)();
    void (Undefined_class::*M_member_pointer)();
  };

This basically means the below types are no-copy types:

  1. Pointer to a callable object.
  2. Pointer to a const callable object.
  3. A function pointer.
  4. A member function pointer.
The lack of any template parameters should give you a hint that the actual types mentioned inside the union are not going to be used. Maybe it's the size of the union that matters? We will get to that shortly.
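Indeed it is the size (and alignment) of the union that matters. A quick sizeof check on a mirror of the union (member names shortened; the exact value is ABI-dependent):

```cpp
#include <cstdio>

class Undefined_class;

// Mirror of libstdc++'s Nocopy_types union.
union Nocopy_types {
    void*       object;
    const void* const_object;
    void (*function_pointer)();
    void (Undefined_class::*member_pointer)();
};

// On the Itanium C++ ABI (g++/clang on x86-64) a pointer to member
// function is two words, so the union is usually 16 bytes.
void print_size() {
    std::printf("sizeof(Nocopy_types) = %zu\n", sizeof(Nocopy_types));
}
```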

Small Buffer Optimization

Let's see the data structure employed by libstdc++ for the all-important small buffer optimization:

  union Any_data
  {
    void*       M_access()       { return &M_pod_data[0]; }
    const void* M_access() const { return &M_pod_data[0]; }

    template <typename T>
    T& M_access() {
      return *static_cast <T*>(M_access()); 
    }

    template <typename T >
    const T& M_access() const {
      return *static_cast<const T*>(M_access());
    }
      
    Nocopy_types M_unused;
    char M_pod_data[sizeof(Nocopy_types)];
  };

It is just a char buffer of sizeof(Nocopy_types) bytes, which is typically 16 bytes on a 64 bit platform (a pointer to member function is two words in the Itanium ABI). This is the other constraint on the functor, besides location invariance, for the implementation to perform SBO.
If the size of your functor is more than these 16 bytes, then you will have to pay for a dynamic allocation.

The Function_base class

This class deals with the small buffer, Any_data, directly. It makes the decision of what gets stored in Any_data, i.e. whether it is the callable object itself or, after allocating the callable on the heap, just its memory location.

There are 2 inner classes named Base_manager and Ref_manager (a specialization for functors wrapped in reference_wrapper), which do the actual dealing with Any_data.

Let's dig into the implementation of Base_manager, which gives us a good idea of what is happening behind the scenes.

      template<typename Functor>
      class Base_manager
      {
      protected:
        static const bool stored_locally =
        (is_location_invariant<Functor>::value
         && sizeof(Functor) <= M_max_size  // sizeof(Nocopy_types): 16 bytes on 64 bit
         && alignof(Functor) <= M_max_align
         && (M_max_align % alignof(Functor) == 0));

        typedef integral_constant<bool, stored_locally> Local_storage;

        // Retrieve a pointer to the function object
        static Functor*
        M_get_pointer(const Any_data& source)
        {
          const Functor* ptr =
            stored_locally? std::addressof(source.M_access<Functor>())
            /* have stored a pointer */ : source.M_access<Functor*>();
          return const_cast<Functor*>(ptr);
        }

        // Clone a location-invariant function object that fits within
        // an _Any_data structure.
        static void
        M_clone(Any_data& dest, const Any_data& source, true_type)
        {
          new (dest.M_access()) Functor(source.M_access<Functor>());
        }

        // Clone a function object that is not location-invariant or
        // that cannot fit into an _Any_data structure.
        static void
        M_clone(Any_data& dest, const Any_data& source, false_type)
        {
          dest.M_access<Functor*>() =
            new Functor(*source.M_access<Functor*>());
        }

        // Destroying a location-invariant object may still require
        // destruction.
        static void
        M_destroy(Any_data& victim, true_type)
        {
          victim.M_access<Functor>().~Functor();
        }

        // Destroying an object located on the heap.
        static void
        M_destroy(Any_data& victim, false_type)
        {
          delete victim.M_access<Functor*>();
        }

      public:
        static bool
        M_manager(Any_data& dest, const Any_data& source,
                   Manager_operation op)
        {
         //skipped
         ........
        }

        static void
        M_init_functor(Any_data& functor, Functor&& f)
        { M_init_functor(functor, std::move(f), Local_storage()); }


      private:
        static void
        M_init_functor(Any_data& functor, Functor&& f, true_type)
        { new (functor.M_access()) Functor(std::move(f)); }

        static void
        M_init_functor(Any_data& functor, Functor&& f, false_type)
        { functor.M_access<Functor*>() = new Functor(std::move(f)); }
      };

Now that's something. The integral_constant Local_storage is true_type if the functor can be stored inside the Any_data buffer and false_type otherwise, and the other member functions get selected based on that using tag-dispatching. For example, check out the two overloads of the M_clone member function.
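Tag-dispatching in isolation looks like this (a standalone sketch with made-up names, mirroring how Local_storage selects an overload at compile time):

```cpp
#include <type_traits>

// Two overloads, selected at compile time by the tag argument:
// 1 = stored in the local buffer, 0 = stored on the heap.
inline int where(std::true_type)  { return 1; }
inline int where(std::false_type) { return 0; }

template <typename Functor>
int storage_for() {
    // Same shape as libstdc++'s Local_storage decision (simplified:
    // only the size constraint is checked here).
    using Local_storage = std::integral_constant<
        bool, sizeof(Functor) <= 2 * sizeof(void*)>;
    return where(Local_storage{});
}

struct Small { void* p; };
struct Large { char buf[64]; };
```

Since the tag is a distinct type per value, the unused overload is never even instantiated for a given Functor.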

Let's look at functions M_init_functor and M_get_pointer in bit more detail to understand how the callable object is stored and retrieved by std::function.
 static void
 M_init_functor(Any_data& functor, Functor&& f)
 { M_init_functor(functor, std::move(f), Local_storage()); }

 static void
 M_init_functor(Any_data& functor, Functor&& f, true_type)
 { new (functor.M_access()) Functor(std::move(f)); }

 static void
 M_init_functor(Any_data& functor, Functor&& f, false_type)
 { functor.M_access<Functor*>() = new Functor(std::move(f)); }
 

Considering the passed callable object meets the requirements for being stored locally, Local_storage would be true_type, hence the second M_init_functor overload gets called, which uses placement new to construct the object in place inside the buffer itself.

If the callable object does not meet the requirements for being stored locally, i.e. it is either not TriviallyCopyable or its size exceeds what is allowed (16 bytes on a typical 64 bit platform), then the callable object is created on the heap and its memory location gets stored in the Any_data buffer. This is what the third M_init_functor overload does.
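The local-storage path boils down to placement new into a raw byte buffer plus a manual destructor call, which can be sketched in isolation like this (the names are mine):

```cpp
#include <new>

struct Counter {
    int n;
    int operator()() const { return n + 1; }
};

// Construct a functor in place inside a byte buffer (no heap allocation),
// call it, then destroy it explicitly -- the same steps M_init_functor
// and M_destroy perform on Any_data.
int demo_local_storage() {
    alignas(Counter) unsigned char buf[sizeof(Counter)];
    Counter* c = new (buf) Counter{41};   // placement new: no allocation
    int r = (*c)();
    c->~Counter();                        // manual destruction
    return r;
}
```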

 static Functor*
 M_get_pointer(const Any_data& source)
 {
   const Functor* ptr =
        stored_locally ? std::addressof(source.M_access<Functor>())
             : source.M_access<Functor*>();
   return const_cast<Functor*>(ptr);
 }

This should now be easier to reason about. Depending upon whether the callable object is stored locally or not, M_get_pointer gets the address of the callable object. The other member functions are built upon the same technique.

The Ref_manager is basically just a specialization to take care of a functor passed via a reference_wrapper. This is something one should consider doing if the function object is bigger than the small buffer (16 bytes on a typical 64 bit platform) AND the callable object outlives the std::function object.
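For example (a sketch; the functor below is mine), wrapping a large functor with std::ref means only a pointer-sized reference_wrapper has to be stored, which always fits the buffer, at the cost of the caller managing the functor's lifetime:

```cpp
#include <functional>

// Far bigger than the small-object buffer, so assigning it by value
// would force a heap allocation.
struct BigTable {
    long data[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    long operator()(int i) const { return data[i]; }
};

long lookup_via_ref(int i) {
    BigTable table;
    // std::ref stores only a reference_wrapper (one pointer) inside f;
    // 'table' must outlive 'f'.
    std::function<long(int)> f = std::ref(table);
    return f(i);
}
```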

So, having seen Base_manager and Ref_manager class, lets see how Function_base class looks like:

 class Function_base {
 public:     
   static const std::size_t M_max_size = sizeof(Nocopy_types);
   static const std::size_t M_max_align = alignof(Nocopy_types);

   template <typename Functor>
   class Base_manager { ... };

   template <typename Functor>
   class Ref_manager : public Base_manager <Functor*>
   {
     .....
   };

   typedef bool (*Manager_type)(Any_data&, const Any_data&,
                                Manager_operation);

   Any_data     M_functor;
   Manager_type M_manager;
 }; 

As one can see, Manager_type is a function pointer pointing to either Base_manager::M_manager or Ref_manager::M_manager function. These functions basically fill the destination Any_data with the stored callable object from the source Any_data based upon the operation. Something like below:

        static bool
        M_manager(Any_data& dest, const Any_data& source,
                  Manager_operation op)
        {
          switch (op)
          {
            case get_functor_ptr:
              dest.M_access <Functor*>() = M_get_pointer(source);
              break;

            case clone_functor:
              M_clone(dest, source, Local_storage());
              break;

            case destroy_functor:
              M_destroy(dest,Local_storage());
              break;
            }
          return false;
        }


Function Handlers

The function handlers are similar in functionality to func_impl in my custom implementation, i.e. they invoke the actual callable object. The classes we have seen so far only manage the storage part; they are not concerned with invoking the call operator.

Function handler classes are derived from either Function_base::Base_manager or Function_base::Ref_manager based upon the functor type. The function handlers are also specialized for plain function pointers, member pointers and for functors. 

We will be seeing only few of those specialization.

1. Specialization for plain function pointer.

    template<typename Signature, typename Functor>
    class Function_handler;

  template<typename Res, typename Functor, typename... ArgTypes>
    class Function_handler<Res(ArgTypes...), Functor>
    : public Function_base::Base_manager<Functor>
    {
      typedef Function_base::Base_manager<Functor> Base;

    public:
      static Res
      M_invoke(const Any_data& functor, ArgTypes&&... args)
      {
        return (*Base::M_get_pointer(functor))(
            std::forward<ArgTypes>(args)...);
      }
    };

2. Specialization for Member pointer.

   template<typename Class, typename Member, typename Res,
           typename... ArgTypes>
    class Function_handler<Res(ArgTypes...), Member Class::*>
    : public Function_handler<void(ArgTypes...), Member Class::*>
    {
      typedef Function_handler<void(ArgTypes...), Member Class::*>
        Base;

     public:
      static Res
      M_invoke(const Any_data& functor, ArgTypes&&... args)
      {
        return std::mem_fn(Base::M_get_pointer(functor)->value)(
            std::forward<ArgTypes>(args)...);
      }
    };

This is very similar to what we did for member pointers in our custom implementation i.e let std::mem_fn do the heavy lifting of taking care of both member data pointer and member function pointer.


Design uptill now

Specialization of a regular functor is also not much different than what we saw for plain function pointer. The design is like below:
1. Function_base is responsible for storing and retrieving the pointers to the callable object.
2. The function handler objects fetches the pointer to the callable objects from Function_base and invokes them with the passed arguments.

The std::function class

This is what we all have been waiting for. But you would be surprised (or not) to know that we have already covered all the heavy lifting! The function class is more about the interface it provides, the type checking and the conversions.

The basic interface is pretty close to what we implemented in our custom solution, so we will be only looking at the interesting parts i.e the things we left out from our custom implementation.

A stripped down version of std::function:

 template <typename Res, typename... ArgTypes>
 class function<Res(ArgTypes...)>
     : private Function_base
 {
   typedef Res Signature(ArgTypes...);
 public:
   function(): Function_base() {}

   template <typename Functor,
                typename = Requires<not<is_same<Functor, function>>, void>,
                typename = Requires<Callable<Functor>, void>>
   function(Functor);

   Res operator()(ArgTypes... args) const;

 private:
   using Invoker_type = Res (*)(const Any_data&, ArgTypes&&...);
   Invoker_type M_invoker;
 };

 template <typename Res, typename... ArgTypes>
 template <typename Functor, typename, typename>
 function<Res(ArgTypes...)>::
 function(Functor f)
      : Function_base()
 {
      typedef Function_handler <Signature, Functor> My_handler;

      My_handler::M_init_functor(M_functor, std::move(f));
      M_invoker = &My_handler::M_invoke;
      M_manager = &My_handler::M_manager;
 }


This is the part of the code which I found pretty interesting. It is interesting because it does not make use of inheritance the way we did in our custom implementation. With this kind of implementation, no dynamic allocation is required unless the nature of the functor demands it.

The Functor parameter determines which Function_handler class is used. The invoker, which is a plain function pointer, is set to the correct M_invoke function of that handler.
Thus, a single call to the callable object requires at most 2 pointer dereferences, no virtual calls and no dynamic allocation (depending on your Functor, as we have already seen).
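The dispatch scheme can be boiled down to the following sketch (TinyFunction and its members are hypothetical names; unlike std::function it stores a non-owning pointer to the target, purely to show the invoker mechanism):

```cpp
struct TinyFunction {
    using Invoker = int (*)(void*, int);

    void*   target  = nullptr;
    Invoker invoker = nullptr;

    // One instantiation per target type; this is the M_invoke analogue.
    template <typename F>
    static int invoke_impl(void* t, int arg) {
        return (*static_cast<F*>(t))(arg);
    }

    template <typename F>
    void assign(F& f) {
        target  = &f;                 // non-owning, for brevity
        invoker = &invoke_impl<F>;    // plain function pointer, no vtable
    }

    int operator()(int arg) const { return invoker(target, arg); }
};

int demo_dispatch() {
    auto doubler = [](int x) { return 2 * x; };
    TinyFunction tf;
    tf.assign(doubler);
    return tf(21);                    // two indirections, no virtual call
}
```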

Type checking on assigning target

If we look at the constructor (or copy or assignment) template parameters a little bit more closely, we will see something like:
Callable <Functor>

Let us see what it is:
 template<typename From, typename To>
 using check_func_return_type
      = or<is_void<To>, is_same<From, To>, is_convertible<From, To>>;

 template<typename Func,
          typename Res2 = typename result_of<Func(ArgTypes...)>::type>
 struct Callable : check_func_return_type<Res2, Res> { };

The idea is to check whether the provided callable type yields a return type (Res2) when invoked with the argument types passed as the template parameters of std::function and that return type i.e Res2 is compatible with what was provided in the std::function declaration or in its template parameter.
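A simplified version of this check can be written with standard traits (a sketch of mine that omits the void case and the member-pointer cases which mem_fn handles):

```cpp
#include <type_traits>
#include <utility>

// True when calling Func with Args... yields something convertible
// to the declared return type Res.
template <typename Func, typename Res, typename... Args>
using is_acceptable_target = std::is_convertible<
    decltype(std::declval<Func>()(std::declval<Args>()...)), Res>;

long widen(int x) { return x; }

static_assert(is_acceptable_target<decltype(&widen), double, int>::value,
              "long converts to double");
static_assert(!is_acceptable_target<decltype(&widen), void*, int>::value,
              "long does not convert to void*");
```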

The reason for adding is_same is pretty interesting. For details check THIS bug report.

I am not sure why is_convertible is required, since its failure would not result in the lookup of a constructor with a different signature, as there is only one definition of the constructor. The only advantage I see is that any wrong usage of std::function, i.e. assigning it a target with an incompatible signature, results in an error at the construction site rather than at the call site. If anybody knows about any other advantage, please do let me know.

For example, see the usage of func1 and func2 and the error it yields using both my custom implementation (which does not use conversion checks) and std::function.

Error with My implementation:
./my_function.hpp:72:14: error: ambiguous conversion from derived class 'D' to base class 'A':
    struct D -> struct B -> struct A
    struct D -> struct C -> struct A
      return callable_(std::forward<Args>(args)...);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As one can see, the error is reported at the point where the callable target is actually invoked. This is not good, since it points to an issue in the library implementer's code, which is actually not the case. A good C++ library must try its best to point the error to the user's code rather than to its own internal implementation.

Error with std::function:
inheritance.cc:45:5: error: no viable overloaded '='
  f = func2;
  ~ ^ ~~~~~
It points to the user code which messed it up. This is one nice advantage of using is_convertible when assigning a target.

A note about libcxx

The libcxx implementation of std::function looks a lot more straightforward than the libstdc++ one, and it provides a buffer of 24 bytes (on a 64 bit arch) for the SBO, which is 1.5 times the 16 bytes that libstdc++ offers. Also, libcxx supports allocators other than global new (I am not sure whether the standard mandates that or not up to C++14), which libstdc++ currently does not.

Another difference w.r.t. libstdc++ is that the libcxx version makes use of a virtual function call instead of a function pointer as in libstdc++. Now, I wouldn't state that a function pointer always performs faster than a virtual function call, but I remember seeing performance measurements of both kinds of call, and a call via function pointer was slightly faster. Don't get all paranoid about that though: as we will see in the final part of the blog series, std::function had better not be present in tight code where performance matters most. In all other cases, the difference (if any) shouldn't matter at all.

One problem that I see with this non-standardized way of implementing SBO is a performance regression when moving from one standard library implementation to another, i.e. from an implementation with a larger internal buffer to one with a smaller buffer (e.g. libc++ -> libstdc++). This would certainly waste quite some time of the developer debugging the issue (as in most cases this developer would not be the one who wrote the code :))





Friday, 24 June 2016

What you need _not_ know about std::function - Part 1


  • Part-2 : Implementation of std::function
  • Part-3 : Performance benchmarks


std::function is cool. No doubt about that. It provides you with a uniform interface for creating callable objects, irrespective of whether it is constructed from a function pointer, member function pointer, functor or functionoid. I assume that everyone reading this has used it at least once. The mechanics of how it actually works are near magic.

In this post and in the second part of it, I will try my best to dissect the functional header file provided by libstdc++. This post will cover the basic type traits that will be required, std::mem_fn and a few other technical things, and in the next post I will discuss the implementation of std::function.

The only part I am not going to touch from that file is std::bind. It's dark magic. Let's leave it at that.

Few notes:
1. The code shown is mostly a prettified version of the libstdc++ implementation.
2. No guarantee about the correctness of my understanding. Comments are welcome if you see anything wrong in the explanation.



Some Basic Definitions

An expression for a function call, as defined by the standard, is:
postfix-expression ( expression-list <optional> ) 
Few important definitions from the standard (Section 20.8.*)

1. A Function Object Type is the type of the postfix-expression in the function call expression (see above).
2. A Function Object is an object of a function object type.
3. Call Signature is the name of a return type followed by a parenthesized comma-separated list of zero or more argument types.
Example: Consider the function:
int foo (char c) {...}
Call signature for the above is:
int (char);
4. A Callable Type is a function object type or a pointer to member.
5. A Callable Object is an object of a callable type.
6. A Call Wrapper Type is a type that holds a callable object and supports a call operation that forwards to that object.
Below shows a very simple (but useless) example of such a call wrapper:
/* 
A wrapper around a callable type which
returns void and takes an int.
*/
template <typename CallableType>
struct VoidCallWrapper {
  void operator()(int v) {
    callable_(v);
  }

  CallableType callable_; // Target Object
};
7. A Call Wrapper is an object of call wrapper type.

8. A target object is the callable object held by a call wrapper.

As you can see, unlike C, which has only one callable type that I know of (the function pointer), C++ has a lot more variety of callable types:
1. Function pointer
2. Member function pointer
3. Member data
4. Functor
5. Functionoid

Having so many options definitely gives more flexibility, but it can also cause a lot of inconvenience due to the differences in their syntaxes. One can easily use a template to hide the syntax differences, and as we will see later, that's what one should do when we do not need to store the callable object. But when we need to store a callable object that must remain valid across call stacks, we should use std::function. It has only the call signature as part of its type information; all the rest is erased (type-erasure). So it is a kind of generic function pointer. All this will be the meat of discussion in one of the coming sections.

Deep Dive into Type-Traits

Type traits are basically a compile-time introspection mechanism for getting some characteristics of a type (mostly passed as a template parameter).
A quick example of a type-trait which checks whether the passed type has a member-type named value_type:

template <typename... T>
using void_t = void;

template <typename T, typename = void_t<>>
struct has_value_type : std::false_type {};

template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> :
  std::true_type {};

int main() {
    static_assert (!has_value_type<int>::value, "Integer cannot have a value type!!");
    static_assert (has_value_type<std::string::iterator>::value, 
                                                  "String iterator must have value_type!!");
    return 0; 
}

The example shown makes use of the void_t trick.

Now, let's understand the type_traits used by std::mem_fn and std::function one by one.

NOTE: Not all the specializations for a type-trait would be covered in this post. The specializations for c-v qualification(s) and varargs are omitted for the sake of clarity and less typing.

Result Type for a Callable Type

A type-trait to know what the callable type returns when its object is called.
template<typename Functor>
struct Weak_result_type
    : Weak_result_type_impl<Functor>
{ };


As said earlier, the template has magically removed all the syntax differences between different callable types. All the type specific details are found out by doing partial specialization of Weak_result_type_impl struct for each callable type:

1. Specialization for regular function call signature
template <typename Res, typename... ArgTypes>
struct Weak_result_type_impl <Res(ArgTypes...)>
{ typedef Res result_type; };

This is simple: the partially specialized template parameter has pretty much the same signature as the call expression. Template argument deduction settles on this specialization for a plain function because it matches best (or rather, no other specialization matches).

Let's see an example of how this would work:
// This is not how the actual definition of Weak_result_type_impl
// is. It's just here for example.
template <typename F> struct Weak_result_type_impl;

template <typename Res, typename... ArgTypes>
struct Weak_result_type_impl <Res(ArgTypes...)>
{ typedef Res result_type; };

template <typename Functor>
struct Weak_result_type
    : Weak_result_type_impl <Functor>
{ };

int foobar() {
  return 42;
}

int main() {
  static_assert (
       std::is_same<Weak_result_type<decltype(foobar)>::result_type, int>::value,  
               "foobar should always return an int!!");
  return 0;
}

This should give a good idea of how the rest of the partial specializations work.

2. Specialization for function reference
template <typename Res, typename... ArgTypes>
struct Weak_result_type_impl<Res(&)(ArgTypes...)>
{ typedef Res result_type; };

3. Specialization for function pointer
template <typename Res, typename... ArgTypes>
struct Weak_result_type_impl <Res(*)(ArgTypes...)>
{ typedef Res result_type; };

4. Specialization for member function pointer
template <typename Res, typename Class, typename... ArgTypes>
struct Weak_result_type_impl <Res (Class::*)(ArgTypes...)>
{ typedef Res result_type; };

5. Specialization for Functor having a result_type  as a member type
template <bool Has_result_type, typename Functor>
struct Maybe_get_result_type {};

template <typename Functor>
struct Maybe_get_result_type <true, Functor>
{ typedef typename Functor::result_type result_type; };

template <typename Functor>
struct Weak_result_type_impl
    : Maybe_get_result_type <has_result_type<Functor>::value, Functor>
{ };

So, for a functor, it is important that it has a member type called result_type in order for this type trait to work.
Well, that's kind of not-so-generic, isn't it? Yes, there are other ways to get the result_type from a functor:
template <typename T, typename = void_t<>> struct Functor_return_type {};

template <typename T, typename... Args>
struct Functor_return_type<T(Args...),
                           void_t<decltype(
                                    std::declval<T>().operator()(
                                                    std::declval<Args>()...
                                  )  )
                           >>
{
 using result_type = decltype(std::declval<T>().operator()(std::declval<Args>()...));
};

class Functor {
public:
  int operator()(int v) {
    return 42;
  }
};

int main() {
  static_assert (std::is_same<Functor_return_type<Functor(int)>::result_type, int>::value, "");
  return 0;
}

This is somewhat what std::result_of does, which is a bit more complicated than what we have done here. The point was just to show that there are ways to get the return type from the functor or target callable object, but with a different interface (we had to provide the expected argument types above).

6. Metafunction to check if all is true

This basically checks whether all the conditions passed into a metafunction evaluate to true. libstdc++ makes use of '__and_', but here we will use the bool-trick, which I found on StackOverflow.
template <bool... B>
struct bool_pack {};

template <bool... V>
using all_true = std::is_same<bool_pack<true, V...>, bool_pack<V..., true>>;

We will see a use case for it soon.

7. Metafunction to check if parameters passed are convertible to another.

We do have a standard meta function std::is_convertible<From, To> which evaluates to true if type From is implicitly convertible to To.

AllConvertible meta function works for variadic sets of template parameters and evaluates to true if and only if all the parameters provided in the first set are convertible into its respective parameters provided in the second set.
// To store the parameters as a list
template <typename... T> 
struct Pack : std::integral_constant<size_t, sizeof...(T)>
{};

template <typename From, typename To, bool = From::value == To::value>
struct AllConvertible : std::false_type
{ };

template <typename... From, typename... To>
struct AllConvertible<Pack<From...>, Pack<To...>, true>
    : all_true<std::is_convertible<From, To>::value...>
{ };
 

8. Metafunction Require

The Require metafunction takes in a set of trait checks (their ::value members are expanded into the bool-trick) and fails to compile if any check is false. Thus it can be used for SFINAE, but in mem_fn and std::function it is mainly used to produce hard compile-time errors.
template <typename... Cond>
using Require = typename std::enable_if<all_true<Cond::value...>::value>::type;

What is std::mem_fn ?

In C++ we can have pointers to a class's member functions and also pointers to its data members. Unlike a regular function pointer, a member function pointer is useless without an object to invoke it on. So whenever you use a member function pointer as a callback, you also have to provide the instance on which it should be called. And if you want to use a pointer to a data member, you have to use a slightly different dereferencing syntax. Also, member function/data pointer syntax is not pretty to look at, but that's a secondary problem.

So, what mem_fn provides us is a consistent interface for creating a callable object from both member function pointers and member data pointers.
What do we get from that? We can now use them with STL algorithms:
class Test {
public:
  void take() {
    std::cout << "Passed\n";
  }
};

int main() {
  std::array<Test, 3> arr = {Test(), Test(), Test()};
  std::for_each(std::begin(arr), std::end(arr), std::mem_fn(&Test::take));
  return 0;
}

It wouldn't be this easy (and beautiful) had we used raw member function pointers. 
Wait, I lied: there is a much more elegant (and preferred) way to do this using a lambda expression (note the reference must be non-const, since take() is not a const member function):
  std::for_each(std::begin(arr), std::end(arr), [](Test& t){ t.take(); });

Now, since mem_fn returns a callable object, it can also be used with std::function


std::mem_fn Implementation details


mem_fn is a template function and has the below signature:
template <typename Tp, typename Class>
Mem_fn<Tp Class::*>
mem_fn(Tp Class::*) noexcept;

It takes a member function pointer or a member data pointer and returns an object of the class template Mem_fn (capital 'M').

Definition of mem_fn:
template <typename Tp, typename Class>
Mem_fn <Tp Class::*>
mem_fn(Tp Class::* pm) noexcept
{
  return Mem_fn<Tp Class::*>(pm);
}

Pretty much the stuff explained before; nothing interesting yet. The interesting part is how Mem_fn provides a consistent interface for both member function pointers and member data pointers.

In the real implementation, Mem_fn has several specializations to deal with cv-qualifications, but as mentioned before, we will look at only one specialization, i.e. the one without cv-qualification involved.

Also, Mem_fn is partially specialized to deal with pointers to member function and pointers to data members. This is how it provides a consistent call operator for both member function pointer and member data pointer.

Specialization for Member function

template <typename T> class Mem_fn;

template <typename RetType, typename Class, typename... ArgTypes>
class Mem_fn<RetType (Class::*) (ArgTypes...)>
{
  using Functor = RetType (Class::*) (ArgTypes...);

public:
  explicit Mem_fn(Functor mf): func_(mf) { }

  // Handle lvalue reference to an object
  template <typename... Args,
            typename = typename std::enable_if<
              AllConvertible<Pack<Args...>, Pack<ArgTypes...>>::value>::type
           >
  RetType operator()(Class& obj, Args&&... args)
  {
    return (obj.*func_)(std::forward<Args>(args)...);
  }

  // Handle rvalue reference
  template <typename... Args,
            typename = typename std::enable_if<
              AllConvertible<Pack<Args...>, Pack<ArgTypes...>>::value>::type
           >
  RetType operator()(Class&& obj, Args&&... args)
  {
    return (std::move(obj).*func_)(std::forward<Args>(args)...);
  }

  // Handle plain pointer to an object
  template <typename... Args,
            typename = typename std::enable_if<
              AllConvertible<Pack<Args...>, Pack<ArgTypes...>>::value>::type
           >
  RetType operator()(Class* obj, Args&&... args)
  {
    return (obj->*func_)(std::forward<Args>(args)...);
  }

  // Handle reference wrapper (T must be Class or derived from it)
  template <typename T, typename... Args,
            typename = typename std::enable_if<
              AllConvertible<Pack<Args...>, Pack<ArgTypes...>>::value>::type,
            typename std::enable_if<std::is_base_of<Class, T>::value>::type* = nullptr
           >
  RetType operator()(std::reference_wrapper<T> ref, Args&&... args)
  {
    return operator()(ref.get(), std::forward<Args>(args)...);
  }

  // Handle smart pointer
  // Maybe sometime later...... :)


private:
  // The member function pointer
  Functor func_;
};

With the knowledge of the type traits we saw earlier, it should be pretty easy to follow the template constraints put on the member functions. This is one of the ways C++ template metaprogramming offers tight type-system checks.

Behaviour with Ref Qualified member functions

This is pretty basic stuff, except for one important behaviour/optimization which I surely would have missed implementing (but which actually doesn't work, see the next paragraph) had I had to implement this class. I am specifically talking about the overload for rvalue references. Notice how the member function is called on the moved object:
 return (std::move(obj).*func_)(std::forward<Args>(args)...);

At first I expected it to call the rvalue ref-qualified version of the function (if provided, of course). But that does not seem to be the case. The ref-qualifier is tightly bound to the member function signature, and unless it is explicitly part of the pointer's type, the rvalue-qualified overload will not be called.
Checkout this SO question for more details.


A peek into std::function

As the holy standard puts it:
The function class template provides polymorphic wrappers that generalize the notion of a function pointer. Wrappers can store, copy and call arbitrary callable objects, given a call signature, allowing functions to be first-class objects.
In the next part we will see how std::function is really implemented. So, stay tuned...

Part-2 of the series is HERE !