From Mordor, with love

Saturday, 31 March 2018

Rust: First impressions from a C++ developer

So, even I started learning Rust 2 weeks back (yay!!) whenever I got free time. And all the time that I spent learning it has been worthwhile.

This is not going to be a deep technical post, but just my impressions about Rust from where I come from (C++).

My Impressions

Getting Started

Rust is a sufficiently complex language in that, it is not as easy as Go or Python to just pickup and start. Having said that, one can still just code something useful in a day, but a language which is claiming to be a real systems language needs to be understood a lot better.
So, for me it would be the regular steps of learning the language and start coding basic data structures and algorithms from scratch and then start contributing to some project or create my own.
There are some pain points which I will come down to later.

Borrowing & Ownership

Even though I have been programming in C++ since before C++11 came into picture and understand and make use of its sophisticated / advanced features, I am still not comfortable enough to say that I am an expert in C++ with a straight face. So, coming from this background I did not have any difficulty in understanding the Borrowing and Ownership concepts. Those are the things we strive for in our modern C++ code. Who doesn't love RAII ?
But striving to achieve is lot different than being enforced by the compiler!
It is really amazing if access to invalid memory address is caught at compile time. I was really happy. With C++, there are quite a lot of simple cases which still today cannot be detected by the "-f sanitizeXXX" options.
This is such a huge win and advantage! I must admit that early in my career, I did return a reference to a local variable and that got passed through code review as well! (The changes were quite big)

Complexity

This again depends from where you are coming from. If one is coming from Node/JS/Python or even plain C background, they might get overwhelmed with the concepts.
And one of the main reason is because the compiler wont budge unless you do the right thing!! :)
For comparison, it's easier for an average C programmer to write in C++ because they continue to write in C!! They do get overwhelmed if made to code in true C++ style because thats when they understand the concepts involved in getting you Zero Cost Abstractions.
So, for me it is not at all as complex like C++. But still, I have a lot much left to learn.

Concurrency / Threading / Network library

It would have been awesome if coroutines was there. But we will be getting it soon I suppose as LLVM has it.
But then, it has Go like Channels! That itself is a very simple abstraction to get your things done quickly.
Working with the Socket level abstractions were also pretty easy. I am yet to dig into more details of it and then to MIO and Tokio libraries.
Not much to report here as I do not have much experience with the provided abstractions.

Lifetimes

This is still something which I don't think I understand to its fullest especially the Variance and Subtyping. In general, Covariant, Contravariant and Invariant types are kind of easy to understand, but at a deeper level on how to use the relevant markers is till something I do not understand.
I ran into this especially while trying to implement Vector data structure as put out in the nomicon book. We will see more about it bit later.
Otherwise lifetimes as a concept for linear type is pretty cool and might give some headache if one is going to use references a lot in uncommon ways which prohibits lifetime ellision.

Allocation

Most of what I will be putting here are questions to the fellow Rust developers. I did not find answers to those when I needed it. So don't consider it as something "you can't do in Rust". It is something which "I didn't get how to use in Rust".

As a systems programmer, I do want to have access to the base allocator and write my own Allocators.
While writing my Vector library, the nomicon book says to use "alloc" crate, but that can't be used in stable release. Then I rustup'ed to nightly release and still could not use it! The version was 1.26.
I want to use Allocators in stable release. How ?
I know there are example on how to (de)allocate using Vec. I am somehow not comfortable with that. Is it really Zero Cost ?
What I have done:

    ///
    fn allocate(count : usize) -> *mut T
    {
        let mut v: Vec<T> = Vec::with_capacity(count);
        let ptr = v.as_mut_ptr();
        mem::forget(v);
        return ptr;
    }
    ///
    fn deallocate(ptr: *mut T, len: usize, cap: usize)
    {
        unsafe {
            mem::drop(Vec::from_raw_parts(ptr, len, cap));
        }
    }

This is really not something I would like to continue doing.

Exceptions

To be honest, I am glad its not there. I clearly know about when to use exceptions and when to use return codes and blah blah.... But still. Writing exception safe code in C++ is still a challenge especially when you are designing data structures.
Result<T, E> is something I prefer compared to exceptions.
Its been hard for me while writing libraries to support both exceptions and error_code for the users who might prefer one over the other. I will probably start using Expected or Outcome whenever its available.
Do I need to even mention about the pattern matching support ? Its awesome!! Took me back to my days when I used to be Erlang developer.

Overall

It's a YES!!
There are still a lot of things I really liked but have not talked about, for eg: Traits, Enums on steroids, Pattern Matching, Functional Programming Style and generic programming. All that maybe for another day.
But sadly, its still going to be a language for my side projects.
C++ would still be my go to language for foreseeable future. With all the new features and language additions coming (creeping) in, there is barely enough time to switch.
Rust obviously has huge potential. But still not quite there to replace C/C++ completely. But I really hope C dies and see people (mostly average software developers) write more structured code.

Sunday, 4 March 2018

Designing Async Task Dispatch Library From Scratch (Part-3)

This is the final part of the series, where we will just creating task queueing system which will be more or less like asyncio BaseEventLoop. To reduce the complexity of the program, we would not be dealing with timed tasks or socket events.
(I had initially planned to do sockets as well, but I have another project in mind to do the same with C++ coroutines. Wait for that!)

Previous parts of the series:

A Task Queue

A tasking system in its simplest form is a queue of callbacks which needs to be called. On top of that, it would provide some APIs to the user to add some tasks or to wait till the task finishes.

What we will be achieving to do is something like below:

def task1():
    print ("Done with task1")

def my_gen():
    print ("Starting coro/gen my_gen")
    yield None
    print ("Finished coro/gen my_gen")

if __name__ == "__main__":
    print ("Starting test")
    print ("-------------")

    #Instantiate a run loop instance
    loop = TaskLoop()

    #Submit a task to the run loop
    loop.call_soon(task1)
    loop.run_once()

    #start a coroutine and finish it
    loop.run_until_complete(my_gen())

    #Stop the run loop
    loop.stop()

The example basically shows how the task loop works with a regular function and a coroutine:

call_soon API only takes regular functions as its argument. Anything else, it will throw an exception.
run_until_complete API takes only future objects and coroutines and will wait until the task is run to its completion.
run_once will execute the first task in the task queue if present.
stop will stop the task loop from executing any tasks from the task queue.

Introducing Handles

Well, I lied about queue of tasks in the previous section. Its is actually a queue of Handle objects.

A Handle is nothing but a stored function object along with its arguments so that it can be invoked at a later time.

class Handle(object):
    """
    A wrapper for holding a callback and the
    arguments it takes.
    """
    def __init__(self, fn, loop, args):
        """
        Constructor.
        Arguments:
            fn   : The callback function. Has to be a regular function.
            loop : The loop which would call the handler.
            args : The argument pack the callback expects.
        """
        self._cb = fn
        self._args = args
        self._loop = loop
        #Set to true if the handler is cancelled
        #Cancelled handlers are not executed by the loop
        self._cancelled = False
        #Set to true if the handler has finished execution
        self._finished = False

    def __call__(self):
        """
        The function call operator.
        """
        assert self._finished == False
        try:
            self._cb(*self._args)
        except Exception as e:
            self._finished = True
            raise e

    def cancel(self):
        """
        Cancel the handle
        """
        assert self._cancelled == False
        if not self._finished:
            self._cancelled = True
        return

    def is_cancelled(self):
        """
        """
        return self._cancelled

As can be safely deducted, a handle stores only regular functions. Futures and coroutines will not be stored in it.

So you ask, how does the task loop then work with futures/coroutines ? Lets look at the implementation of run_until_complete.

run_until_complete decoded

Lets dive into its implementation directly:

    def run_until_complete(self, coro_or_future):
        """
        Run the event loop until the future
        is ready with result or exception.
        """

        if self._loop_running:
            raise RuntimeError("loop already running")

        #Transform into a future or task(inherits from Future)
        task_or_fut = transform_future(coro_or_future, self)
        assert task_or_fut is not None

        is_new_task = not isinstance(task_or_fut, Future)
        #Add a callback to stop loop once the future is ready
        task_or_fut.add_done_callback(self._stop_loop_fut)

        try:
            self.run_forever()
        except Exception as e:
            raise e

This looks a bit different from what we achieved in our previous part of this series. The difference is because of transform_future. This is basically what asyncio.ensure_future is. It takes in either a coroutine generator object or a Future. If its a future it returns it (behaves like an identity function), else, if it is a coroutine, it wraps it inside a Task object and returns that task.

Once the future is ready, _stop_loop_fut will get called which would throw StopLoopException to notify the main loop.

def transform_future(supposed_future_coro, loop):
    """
    Loop works best with futures or with Tasks
    which wraps a generator object.
    So try to introspect the type of
    `supposed_future_coro` argument and do the best
    possible conversion of it.
    1. If its is already a type of Future, then return as it is.
    2. If it is a generator, then wrap it in Task object and
       return the task instance.

    loop object is for creating the task so that it
    can add a callback.
    """
    if isinstance(supposed_future_coro, Future):
        """
        It is a future object. Nothing to do.
        """
        if loop != supposed_future_coro._loop:
            raise RuntimeError("loop provided is not equal to the loop of future")
        return supposed_future_coro
    elif inspect.isgenerator(supposed_future_coro):
        #Create a task object
        t = loop.add_task(supposed_future_coro)
        return t
    else:
        raise TypeError("Cannot create an asynchronous execution unit from the provided arg")

    assert (False and "Code Not Reached")
    return None

In the above source, the task creation logic is handled inside the loops add_task method, bit all it does is really create a Task object.

The constructor of the task object adds the _step method to the loop which gets called later.

class Task(Future):
    """
    Wraps a generator yielding a Future
    object and abstracts away the handling of future API

    This version adds loop to the task
    """
    def __init__(self, loop, gen):
        """
        """
        super().__init__(loop)
        assert (loop is not None)
        self._loop = loop
        self.gen = gen
        self._loop.call_soon(self.step)

And from here on everything goes as per what we have already seen in the previous posts. Only change is that the calling of the Task method is now automated to be don by the task loop.

The complete implementation of the task loop is:

class TaskLoop(object):
    """
    """
    def __init__(self):
        #A deque of handles
        #CPython implementation of deque is thread-safe
        self._ready = deque()
        self._loop_running = False
        pass

    def total_tasks(self):
        """
        """
        return len(self._ready)

    def run_once(self):
        """
        Execute a task from the ready queue
        """
        if len(self._ready) == 0:
            #There are no tasks to be executed
            return

        hndl = self._ready.pop()
        #Call the handler which invokes the
        #wrapped function
        hndl()

    def run_forever(self):
        """
        This is the main loop which runs forever.

        TODO: Currently just runs infinitely. Should run
        only till some work is there.
        """
        self._loop_running = True

        while True:
            try:
                self.run_once()
            except StopLoopException:
                self._loop_running = False
                break
            except Exception as e:
                print ("Error in run_forever: {}".format(str(e)))

        return

    def run_until_complete(self, coro_or_future):
        """
        Run the event loop until the future
        is ready with result or exception.
        """

        if self._loop_running:
            raise RuntimeError("loop already running")

        #Transform into a future or task(inherits from Future)
        task_or_fut = transform_future(coro_or_future, self)
        assert task_or_fut is not None

        is_new_task = not isinstance(task_or_fut, Future)
        #Add a callback to stop loop once the future is ready
        task_or_fut.add_done_callback(self._stop_loop_fut)

        try:
            self.run_forever()
        except Exception as e:
            raise e

    def add_task(self, gen_obj):
        """
        Creates a task object and returns it
        """
        t = Task(self, gen_obj)
        return t

    def call_soon(self, cb, *args):
        """
        Add the callback to the loops ready queue
        for it to be executed.
        """
        if inspect.isgeneratorfunction(cb):
            raise NotRegularFunction()

        hndl = Handle(cb, self, args)
        self._ready.append(hndl)
        return hndl

    def stop(self):
        """
        Stop the loop as soon as possible
        """
        self.call_soon(self._stop_loop)

    def _stop_loop(self):
        """
        Raise the stop loop exception
        """
        raise StopLoopException("stop the loop")

    def _stop_loop_fut(self, fut):
        """
        """
        assert fut is not None
        raise StopLoopException("stop the loop")

The complete source code for the series can be found at
https://github.com/arun11299/PyCoroutines

Hope you enjoyed it!

Saturday, 3 February 2018

Designing Async Task Dispatch Library From Scratch (Part-2)

Working with Futures, Executors and Tasks

Part-1 Generators in detail

What is a Future

In layman terms, a future is an object which can hold the value or result of some computation done asynchronously. What does that mean ?

Consider the below example

def some_task():
    time.sleep(5)
    return 42

ret1 = some_task()              #ret1 will be 42
ret2 = async_compute(some_task) #what should be ret2 ?

ret1 would be assigned with some value only after 5 seconds and during that time the thread executing the code would be blocked and cannot do anything else. This is called synchronous way of doing things. Other examples of blocking calls from your network programming 101 are accept, connect, read, write etc.

But what about ret2 ? If we are to do asynchronous programming we cannot block the executing thread to block for 5 seconds, we can do other tasks meanwhile. To achieve that, async_compute might very well dispatch off the task to another thread. If we do that, there is no way we can return the return value of some_task to ret2 (Hold off your idea about using coroutines. We will get there).

So what async_compute can return is a placeholder where the computed result would be put at some future time. This placeholder is what commonly known as future!

There are different ways to get the computed result out from a future object:

Via a blocking call
Via callbacks

Blocking call example:

ret = async_compute(some_task)
assert isinstance(ret, Future)

print (ret.result()) # Will print 42

This example is still equivalent to the blocking example as the call to result would block till the result from the asynchronous computation is not available. In our example it would block for 5 seconds.

Callback example:

def print_result(fut):
    assert isinstance(fut, Future)
    print (fut.result()) # This will not block

ret = async_compute(some_task)
assert isinstance(ret, Future)

ret.add_done_callback(print_result)

print ("Continuing with other tasks")

What we are doing here is providing a callback to the future which would get executed when the result of the asynchronous computation is made available to the future.

Future - Building Blocks

In this section we will see what are the requirements that a Future type should have and try to create one.

States of a future object
We will be discussing only the bare minimum states required for a future. There may be other states involved based upon the implementation and feature provision.

Pending - The future is constructed with this state. It basically indicates that the asynchronous computation has either not started or finished yet.
Cancelled - The future is cancelled and all the completion callbacks are called. The future that we would be implementing does not have the ability to cancel the running asynchronous process though.
Finished - The future has the result of the asynchronous operation ready. A future is ready when the result of the async operation is available or if the async operation finished with an exception.

Client APIs of a future type

add_done_callback - For calling a function when the future state is 'Finished' or 'Cancelled'. The function must take only one argument which is the future object. The client code can access the result or the exception from the passed future argument. The purpose of adding callbacks or done callbacks to a future is to do some action in response to the completion of the async operation or many say it chaining of operations.

For eg:

def some_long_task():
    time.sleep(10)
    return 43

def another_long_task():
    time.sleep(5)
    return 68

def after_another_long_task(fut):
    res = fut1.result()
    print ("All computations finished. Result is {}".format(res))


def after_long_task(fut):
    res = fut.result() # This would not block, result is already available
    # Trigger another asynchronous computation
    fut1 = async_compute(another_long_task)
    fut1.add_done_callback(after_another_long_task)

fut = async_compute(some_long_task)
fut.add_done_callback(after_long_task)

2. set_result - This API must usually be only called by the asynchronous operation which created the future. It basically stores the computed result into the future, sets the state to 'Finished' and calls all the added callbacks.

3. set_exception - If the asynchronous executor encounters an exception while performing our operation, it will set the exception inside the future by calling this function. The state of the future would be still set to 'Finished' and will call all the added done callbacks.

4. cancel - Cancels the future by setting the state to 'Cancelled' and calls all the added done callbacks.

5. result - Blocks (or upto provided timeout) until the result is available.

6. exception - Blocks (or upto provided timeout) until the exception is set in the future.

These APIs needs to be thread safe if they are to be called from different threads. We will be using Mutex + Condition variable to achieve it.

An Implementation

# coding: utf-8

# In[11]:


import threading
import time


# Future states:
#
# An instance of the future could be in any of these states.
# 1. PENDING : The future does not have the result/exception for the corresponding work.
# 2. RUNNING : TBD
# 3. CANCELLED : The future got cancelled before the result of the associated work was computed.
# 4. FINISHED : The associated work got finished and resulted in giving out a value or exception.

# In[12]:



PENDING   = 'pending'
RUNNING   = 'running'
CANCELLED = 'cancelled'
FINISHED  = 'finished'

FUTURE_STATES = [
    PENDING,
    RUNNING,
    CANCELLED,
    FINISHED
]


# In[13]:


class FutureCancelledError(Exception):
    """"""
    def __init__(self):
        pass


class FutureTimeoutError(Exception):
    """"""
    def __init__(self):
        pass



# In[14]:


class Future(object):
    def __init__(self):
        """"""
        self._state = PENDING
        self._condition = threading.Condition()
        self._done_callbacks = []

        self._result = None
        self._exception = None

    def add_done_callback(self, cb):
        """
        Add the callback to be executed when the future state becomes
        cancelled/finished
        """
        with self._condition:
            if self._state not in [CANCELLED, FINISHED]:
                self._done_callbacks.append(cb)
                return
        #Call immediately if result/exception already set
        cb(self)


    def result(self, timeout=None):
        """
        Blocking call on the calling thread.

        timeout: time to wait for the result to be ready.

        Throws:
        FutureCancelledError if the state of future was CANCELLED
        or became CANCELLED later.

        FutureTimeoutError if future did not become ready before
        the timeout.
        """
        with self._condition:
            if self._state in [CANCELLED]:
                raise FutureCancelledError()

            if self._state == FINISHED:
                # Already done, return the result
                return self._result

            self._condition.wait(timeout)

            if self._state in [CANCELLED]:
                raise FutureCancelledError()

            if self._state == FINISHED:
                return self._result
            else:
                return FutureTimeoutError()
        pass

    def exception(self, timeout=None):
        """
        Blocking call on the calling thread.
        """
        with self._condition:
            if self._state in [CANCELLED]:
                raise FutureCancelledError()

            if self._state == FINISHED:
                #Already done. Return the exception
                return self._exception

            self._condition.wait(timeout)

            if self._state in [CANCELLED]:
                raise FutureCancelledError()

            if self._state == FINISHED:
                return self._exception
            else:
                raise FutureTimeoutError()


    def done(self):
        """Future is finished"""
        with self._condition:
            return self._state in [CANCELLED, FINISHED]

    def cancelled(self):
        """ Is the future cancelled or not"""
        with self._condition:
            return self._state == CANCELLED

    def cancel(self):
        """Cancel the future if not already finished or running"""
        with self._condition:
            if self._state in [RUNNING, FINISHED]:
                return False
            self._set_state(CANCELLED)

            self._condition.notify_all()

        self._execute_done_callbacks()
        return True

    def set_result(self, result):
        """
        Sets the result of the work associated with this future.
        """
        with self._condition:
            self._result = result
            self._state = FINISHED

            self._condition.notify_all()

        self._execute_done_callbacks()


    def set_exception(self, exp):
        """
        Sets the exception that occurred while performing
        the work associated with this future.
        """
        with self._condition:
            self._exception = exp
            self._state = FINISHED

            self._condition.notify_all()

        self._execute_done_callbacks()

    def _set_state(self, state):
        """
        Sets the state.
        Assumes that lock is taken
        """
        self._state = state

    def _execute_done_callbacks(self):
        for cb in self._done_callbacks:
            try:
                cb(self)
            except Exception as e:
                print ("ERROR: {}".format(str(e)))

Executors

Now that we understand futures and have an implementation of it ready with us, let us use it. For that, we need to have some way to submit our tasks to "some" thing which would return us with a future object instance corresponding the to the submitted task.
This "some" thing is what we call as Executors. They are responsible for execution of the submitted tasks by the client code. There are different types of executors that one can provide:

Thread Executor
Thread Pool Executor
Multi Process Executor
Queuing Executor

NOTE: There may be more varieties of executors that I am not aware of.

For demonstration purposes we will use an extremely simple executor interface.

class Executor(object):
    """
    An interface for Executor.
    A concrete implementation of this class
    is expected to override all methods of this class.
    """
    def __init__(self):
        pass

    def submit(self, task, *args, **kwargs):
        """
        Takes a task to be executed with the arguments
        it requires.
        Returns a Future object instance.
        """
        raise NotImplementedError()

With this interface in hand let us create a simple Thread executor, which will simply execute the submitted task on a new thread.

class ThreadedExecutor(Executor):
    def __init__(self):
        super(self).__init__()
        self._thread = threading.Thread(target=self._runner)

    def submit(self, task, *args, **kwargs):
        """
        Create a new thread and run the _runner function.
        Execute the task in the _runner and return the 
        future.
        """
        f = Future()
        t = threading.Thread(target=self._runner, args=(f, task, args, kwargs,))
        t.start()
        return f

    def _runner(self, fut, task, args, kwargs):
        v = task(*args, **kwargs)
        fut.set_result(v)
        pass

This is pretty easy to understand now. Inside the submit function we do the following:

Create an instance of the Future.
Create a new thread with target function to run as _runner. Pass the future instance, task to the _runner.
Return the future. The client code now has the future.
On a separate thread _runner executes the task and once complete will store the result in the passed future instance. This results in execution of all the done callbacks methods registered with that future as explained in the previous section.

NOTE: Exception handling is not shown in the above example.
Now, the client code:

def my_long_running_task():
    time.sleep(100)
    return 42

def print_task(future):
    print (future.result()) #Result will not block here.

def schedule_task():
    future = ThreadedExecutor().submit(my_long_running_task)
    future.add_done_callback(print_result)

if __name__ == "__main__":
    schedule_task()

I hope it the use of futures make more sense in the context of Executors.

The 'Task' at hand

This is where we get back to coroutines and see its brilliance in action.

NOTE: The things I am presenting here is basically influenced by David Bleazy's talk on coroutines. You should definitely check that out.

The previous example using future with executor was good in that we could do asynchronous computation in fairly straightforward manner, but still it is not easy. To get the result at the call site i.e. where the executor was run, we either need to poll continuously to check if the future is ready with result or get blocked by calling result on the future object. Since we hate blocking or polling, we attached a done callback to the future.

We need something like this:

def perform_async_task():
    result = <magic> ThreadedExecutor().submit(my_long_running_task)
    print (result)

Isn't that sweet ? It is almost like a synchronous code. Assuming it works like this, if you think deeper you can see that the function when called is getting executed in the context of 2 threads. The thread which calls the perform_async_task function and the thread which executes the my_long_running_task function.

def perform_async_task():
    result = <magic> ThreadedExecutor().submit(my_long_running_task) <-- Executor thread
    print (result)  <---------- Client calling thread

Mind blown ? Lets see how to get this actually working. What should we replace the "<magic>" tag with ? Since we are going to talk about coroutines, lets replace it with a yield for now :)

In the part-1 of the series we have already seen how to control a generator function from outside using the generator handle using the next and send functions. If you have not or forgot I highly recommend going through that again.

With that knowledge we know that RHS expression of a yield statement is what is executed on next (or send based upon the state of generator) and the value sent using send is what gets assigned to the LHS of the assignment statement.
The thing to note here is that some statement like

result = yield ThreadedExecutor().submit(my_long_running_task)

requires 2 separate operations to finish (next and then send). Until then the coroutine is in suspended state. This is what makes it possible to write something like above. Lets see how it would work:

Consider 'G' is the generator object that we get on calling perform_async_task since it has a yield statement now (instead of magic).
future = next(G) would give me the future returned by the submit function of the Thread executor. Why ? Because thats the RHS expression of the yield.
future.add_done_callback(send_result) To the obtained future we will add a done callback which will get called when the executor has done its job.
And send_result is:

def send_result(future):
    send(G, future.result())

So when the send_result gets called, it will send the result wrapped inside future to the generator which is what get assigned to the LHS variable. Cool!!! Got it ?

Now we know what needs to be done manually to get the desired behaviour. Lets put it nicely and automate it using a class named Task.

Lets first jump into an implementation of the Task itself:

class Task(object):
    """
    Wraps a generator yielding a Future
    object and abstracts away the handling of future API
    """
    def __init__(self, gen):
        self.gen = gen

    def step(self, snd_val=None, exp=None):
        """"""
        try:
            if exp:
                self.gen.throw(exp)
            else:
                fut = self.gen.send(snd_val)
                fut.add_done_callback(self._fut_done_cb)

        except StopIteration as e:
            return e.value

    def _fut_done_cb(self, fut):
        try:
            result = fut.result()
            self.step(result, None)
        except Exception as e:
            self.step(None, e)
        pass

This is what in essence an asyncio Task is. It surely does more stuff, but this is the bare minimum stuff that gets the work done.

Lets explain its working through an example.

def perform_async_task():
    result = yield ThreadedExecutor().submit(my_long_running_task)
    return reult

t = Task(perform_async_task())
t.step()

## OR

Task(perform_async_task()).step()

As one can see, we have got our asynchronous code look like a synchronous one without having any hairy callbacks and futures involved in the client code. All of those details are handled by the task.

What really happening is:

The constructor of the task takes the generator object and stores it. That make the Task a little more than a wrapper over a generator.
Now we call the step function of the task. The step function calls generator send function with None value for the first time which is basically next call on the generator.
This results in the execution of the RHS expression of the yield which is the executor submit function.
The submit function returns the future object associated with the asynchronous operation. This is what fut variable holds inside step function.
To the obtained future, we add a done callback which is basically the second member function of the Task class, _fut_done_cb. What this means is that, when the asynchronous operation finishes, _fut_done_cb will get called.
Note that until _fut_done_cb is called the coroutine perform_async_task is suspended. It is not blocked.
So when the asynchronous operation finishes, _fut_done_cb gets executed, which gets the result from the future and calls the step function again with the obtained result.
So, the step function executes again calls the send function on the generator, but this time with the result of the async operation instead of None.
The above send actually sets the result of yield to the LHS inside perform_async_task.
With that due to the absence of any further yields inside perform_async_task, send will throw a StopIteration exception.
The return value of perform_async_task will be returned by the step function.

Try it for yourself!

Saturday, 6 January 2018

Designing Async Task Dispatch Library From Scratch (Part-1)

What the series is about

Understanding how generators work
Understand how future object works
How to create a task for asynchronous execution
Creating a simple but almost complete tasking and networking library

What the series is _not_ about

Understanding basic usage of generators. Reader is expected to have used it before.
Creating a library rivalling asyncio or anything else. I am only doing this to understand the whole thing better from a system programmers perspective.

The 'yield' Keyword

Lets take the below simple example of a generator:

def my_gen():
    ret = yield 5
    return ret

Few questions that comes to my mind are:

What happens when the 'yield' expression gets executed ?
What do we get as the return value of this generator-function ?
What more is there to this generator-function other than just yielding 5 ?

To answer above questions and to get enlightened we need to know how a generator-function is different from a normal function.

Difference between a generator-function and a regular function

NOTE: Generators are of iterable types. Iterables as a concept is something we would not be looking at in this post.

Few of the very important differences are listed below. There are more difference than what are listed below but out of scope for the current topic.

Suspension and resume capability

Unlike regular function, a generator-function can be suspended in between its execution and can be resumed from the same point. That means all the states are saved before getting suspended so that they can be reused when the generator-function is resumed. Its the presence of the yield expression inside a function that provides this capability. The yield is basically the suspension and resumption point.

2. Returns a generator on calling a generator-function

Unlike regular function which returns some value (or nothing) on execution, a generator-function returns a generator object. This generator object is what we will make use or abuse to control how the generator-function should execute.

3. Suspension and resumption of generator-function can be controlled

As said above, using the generator object one can control ow to resume and what value to send to it from the client side. We will see a lot more on this in the next section.

Deconstructing a generator-function

Lets take this hello-world example:

def my_gen():
    val = yield "hello, world!"
    return val

There are 2 ways that you can work with this generator-function. Lets see the hard way first:

# Prime the generator
g = my_gen()

#Get the yield value
res = next(g)
print ("my_gen yielded result: {}".format(res))

try:
    next(g)
except StopIteration as e:
    print ("my_gen generator function returned: {}".format(e.value))

This would print:

my_gen yielded result: hello, world!
my_gen generator function returned: None

Hmm...whats going on ? The assignment clearly did not work as the return value is printed as "None"!
Now is a good time to explain about controlling a generator-function using the generator object.

A generator instance has 3 methods defined on it:

next
send
throw

The next method will proceed to the next 'yield' statement present in the function and executes whatever expression is there to its right side. The return of which is passed to the called of the next method not to the variable on the left hand side of the assignment statement.

The throw method is what will be used by the client/scheduler program to pass exceptions to the generator-function.

The send method is what makes the generator behave as a coroutine. Thanks to PEP-342.

It is similar to what next does in proceeding to the next yield statement in the function, but as the name suggests, it can send a value as the result of the execution of the right hand side of the assignment statement.

That means, its completely under my control on what to set the value of the variable val as ? Yup!!

See this in action:

g = my_gen()
next(g) # Yields "hello, world!"
try:
    g.send(42) # Sends in a value 42 which gets assigned to val and does next(g)
except StopIteration as e:
    print ("my_gen returned: {}".format(e.value))

It prints:

my_gen returned: 42

NOTE: In the above example I am using the free function next instead of the member function. The free function operates on an iterable object which the generator is and calls its __next__ method.

Yeah, fetching the return value of a generator-function is pretty weird. Agreed. It is set as the value field of the StopIteration exception. Remember this.

Isn't this pretty freakishly powerful ? In fact this ability to "send" values to the generator is what makes it a coroutine and is used for creating a tasking system like asyncio.

So, lets reiterate on what is happening with the above simple generator-function:

Prime the generator by calling the function first. That gives us a generator object to work with.
Calling next on the generator object will go to the next yield statement in the function and return the output of its expression to its right side. At this point the function is suspended. It needs to be resumed by calling any of the 3 methods next/send/resume.
At this suspended state, the control is back to client. We want to pass some random value to the generator. If the yield is used as an assignment statement, then the sent value would get assigned to it, otherwise it would just behave like regular next call.

Adding a pictorial view if it helps at all:

As can be seen in the above picture, I have divided the generator function into two blocks.
The client/scheduler code dealing with the generator does below things:
a. Prime the generator function to get the generator object.

b. Do a next on it, which calls the __next__ method of the generator, which in turn goes to the first yield statement inside the generator function and returns the output of the RHS expression, None if no RHS expression is present. In our example, it returns 5.

c. Now we want to set the value of res to something meaningful. This is done by the send method, which places the value 42 on res and also calls the __next__ method of the generator.

d. Since no more yields are present, it will throw a StopIteration exception having the return value of the generator function set in its value member.

I still don't see the point

So, how is all these freaky things going to make me any useful as a programmer you ask.
Lets see a small example of some database driver that you wrote. The task is to fetch some rows from the DB and do something with the rows.

The client of your basic DB driver would probably write something like this:

class DDBResultFetcher(object):
    def __init__(self, rows_cb):
        self.instance = None
        self.is_connected = False
        self._cb = rows_cb
    
    def connect(self):
        if not self.instance:
            instance = MyDBDriver.get_correct_instance()
        
        if self.connected:
            return
        
        self.instance.connect()
        self.is_connected = True
    
    @property
    def connected(self):
        return is_connected
    
    def on_rows_cb(self, rows):
        self._cb(rows)
    
if __name__ == "__main__":
    
    def manage_rows(rows):
        for row in rows:
            #Do something with the rows
            ....
            
    db_f = DDBResultFetcher(manage_rows)
    db_f.connect()
    
    db_f.run()
    
    ##Additional code to check if the results have been completely fetched
    ##or not
    ......

Even though I have not added many error or exception checks and still many handling of cases missing, the code to write from a users perspective is quite a lot, error prone and can become unmanageable.

How would a generator/coroutine based DB driver can make my code look like:

def fetch_results_from_db():
    instance = MyDBDriver.get_correct_instance()
    
    if not instance.connected():
        instance.connect()
    
    while True:
        rows = yield instance
        if not rows:
            break
        # Do something with the rows
        ...

My user code is so much simpler now! The onus is now on the library writer to provide mechanisms to interact with this generator-function to use the yield suspension and resumption to send the rows obtained from the DB.

How to do that ? Wait for the next instalment of the series...

NOTE: My intention here is not to propagate the idea that "free functions are better than classes". No, its just an example.

Bonus Section: But still how its works behind the scenes ?

We will look here at some disassembled output and some C code to gain a little more better understanding of what is going on. It's for those people who cant sleep without seeing a bit of assembly or C/C++ code.

Or (the easier way)

In [7]: dis.dis(gen_obj.gi_code)
  2           0 LOAD_CONST               1 (5)
              3 YIELD_VALUE
              4 STORE_FAST               0 (x)

  3           7 LOAD_FAST                0 (x)
             10 RETURN_VALUE

There are lots of resources on the internet where you can understand what the above output means, but I will just briefly mention about the significance of each column.
As can be seen, there are 5 columns:

Column 1 : The line number in the python source file which the bunch of assembly statements represent.
Column 2 : The offset in the byte code (gen_obj.gi_code.co_code) where the opcode is present.
Column 3 : The opcode name.
Column 4 : The index into the attributes of the code object. Its bit more involved, but out of scope for this discussion.
Column 5 : The constants and the variable name based upon the opcode. This is only for helping mere humans.

Now, with this much information at hand lets take a look at the disassemble output again.

Load a constant value '5'.
Yield. Here is where the current frame is saved along with all its context which is in the gi_code object.
On issuing next or send, resume the saved frame and store the sent value ('None' in case of next and send without any argument) into the variable named 'x'.
Read the value in variable 'x' and return it.

More Fun with GDB

Now we know that the stack frame for a generator function is persisted on heap so that it can be loaded and resumed again.

NOTE: From whatever I could understand from the cPython code, all stack frames are allocated on heap. But of course nothing should be written on that assumption.

We will see now what happens when I am sending a value to a generator. Things we are expecting to happen are:

Load the frame associated with the generator.
Send the value ( ? )
Run the frame till next yield or throw StopIteration exception
In case of yield, go back to the previous frame of execution.

For this, I have a put a breakpoint on _PyGen_Send function which can be found in Objects/genobject.c source of CPython. It internally calls gen_send_ex which is present in the same source file.

The source code we are debugging:

def my_gen():
    x = yield

g = my_gen()
next(g)
g.send(42)

Lets see what happens in gen_send_ex. We will only see the relevant portions of the code.

static PyObject *
gen_send_ex(PyGenObject *gen, PyObject *arg, int exc)
{
    PyThreadState *tstate = PyThreadState_GET();
    PyFrameObject *f = gen->gi_frame;
    PyObject *result;
     .
     .
     .
}

We are getting the current thread state and the frame associated with the generator object.

    if (f->f_lasti == -1) {
        if (arg && arg != Py_None) {
            char *msg = "can't send non-None value to a "
                        "just-started generator";
            if (PyCoro_CheckExact(gen))
                msg = "can't send non-None value to a "
                      "just-started coroutine";
            PyErr_SetString(PyExc_TypeError, msg);
            return NULL;
        }
    } else {
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

Here we are checking the value of the last executed instruction on the generator function frame (f->f_lasti). If it's set to -1 it means the generator has not moved to its first yield statement.

Lets check the state of our generator stack frame (which is variable 'f'):

(gdb) p *f
$18 = {
  ob_base = {
    ob_base = {
      _ob_next = 0x7fcf85618d48,
      _ob_prev = 0x7fcf855825a8,
      ob_refcnt = 1,
      ob_type = 0x88cc20 
    },
    ob_size = 20
  },
  f_back = 0x0,
  f_code = 0x7fcf86660100,
  f_builtins = 0x7fcf86784b48,
  f_globals = 0x7fcf8673ac98,
  f_locals = 0x0,
  f_valuestack = 0x2916478,
  f_stacktop = 0x2916478,
  f_trace = 0x0,
  f_exc_type = 0x0,
  f_exc_value = 0x0,
  f_exc_traceback = 0x0,
  f_gen = 0x7fcf855825a8,
  f_lasti = 3,
  f_lineno = 1,
  f_iblock = 0,
  f_executing = 0 '\000',
  f_blockstack = {{
      b_type = -875836469,
      b_handler = -875836469,
      b_level = -875836469
    } },
  f_localsplus = {0x0}
}

Check the highlighted portion. The blue field says that we are at the first line of the frame, which is true, we are at the yield statement which is first line of our generator function my_gen.

Since our generator has already reached the yield (because we already executed next on the generator before sending any value), lets focus on the else portion:

    ...
    } else {
        /* Push arg onto the frame's value stack */
        result = arg ? arg : Py_None;
        Py_INCREF(result);
        *(f->f_stacktop++) = result;
    }

Hmm...its just putting the passed value on the top of the stack! Now that should result in a STORE, right ? Lets check the disassembly:

In [3]: dis.dis(my_gen)
  2           0 LOAD_CONST               0 (None)
              3 YIELD_VALUE
              4 STORE_FAST               0 (x)
              7 LOAD_CONST               0 (None)
             10 RETURN_VALUE

Yup, thats looks about correct and as expected. The sent value is getting stored in variable x.

From here, python should execute the current generator frame and should do some bookkeeping to get back to the previous frame which called the generator next/send once it yields again or finishes.

    /* Generators always return to their most recent caller, not
     * necessarily their creator. */
    Py_XINCREF(tstate->frame);
    assert(f->f_back == NULL);
    f->f_back = tstate->frame;

    gen->gi_running = 1;
    result = PyEval_EvalFrameEx(f, exc);
    gen->gi_running = 0;

The current frame is saved in the generator's frame as highlighted and PyEval_EvalFrameEx will continue with the execution of the frame.

I have jumped through lots of details here but this should be a good start for anyone interested in digging deeper.

Part-2 About Futures, Executors and Tasks

Saturday, 8 October 2016

What you need _not_ know about std::function - Part 3

Performance of std::function

In the previous post we saw the implementation of std::function in libstd++ (g++-6.2) in detail and discussed briefly on the libc++ implementation of the same. And now, its time to look at what the performance stats has to say.

The platforms and the compilers used for running the performance tests:

Mac OS Yosemite - clang++ 3.7
Ubuntu 16.04 VM - g++-6.2
Ubuntu 16.04 VM - clang++-3.8 (with libc++)

For all tests, the code was compiled with O2 level optimization settings.

I will be trying to benchmark std::function against:

Virtual function call
C++11 version of impossibly fast delegate (From StackExchange)

The codes are available at my github repo.

NOTE:

I would be using 'volatile' in the benchmarking code and stronger data dependency inside tight loops. The use of volatile is to prevent the compiler from optimizing away the code altogether.
I am including 'iostream' header. :)
I wasn't really planning to benchmark the 'fast delegate' until I saw THIS reddit post. That left me wandering why std::function was not used.

A note about "Impossibly Fast Delegate"

The implementation simply stores the callable object (pure function pointers and member function pointers) using template hackery and later invoked whenever the delegate call function is invoked, nothing more. This scheme would not work for people who want:

The stored callable to remain valid outside the lifetime of the class object whose member function needs to be invoked.
The original implementation did not work for any other types of callbacks. But the c++11 implementation does it in a naive way, i.e to allocate the handler on heap. There is no SBO done.

So, one must be careful while doing comparison of the fast delegate with std::function, just for the reason that std::function can do more things that the "fast delegate" and certainly it would have additional costs if what you want is point #1.

Also, I found on web doing some completely ridiculous benchmarking of boost::function with the fast delegate. The author simply went on creating a new instance of boost::function from a lambda for every iteration, but does not do the same for fast delegate ! I am certainly not planning to do such naive mistakes. I will be keeping the tests as simple as possible to avoid any kind of oversight.

First Round

Here, I am just going to call the callbacks '10000000' number of times and find the total time taken for all 3 on the above mentioned platforms.

Code for Virtual function call:

#include <iostream>
#include <chrono>

constexpr static const int ITER_COUNT = 10000000;

class Base {
public:
  virtual volatile int fun_function(volatile int) = 0;
};

class Derived final: public Base {
public:
  volatile int fun_function(volatile int i) {
    return ++i;
  }
};

int main(int argc, char* argv[]) {
  Derived d;
  Base * b = nullptr;
  b = &d;
  volatile int v = 0;

  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = b->fun_function(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';
}

Code for std::function:

#include <iostream>
#include <chrono>
#include <functional>

constexpr static const int ITER_COUNT = 10000000;

volatile int fun_function(volatile int v)
{
  return ++v;
}

int main() {
  std::function<volatile int(volatile int)> f(fun_function);

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = f(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout <<
    std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';

  return 0;
}

Code for "Impossibly fast delegate":

#include <iostream>
#include <chrono>
#include "../impl_fast_delegate.hpp"

constexpr static const int ITER_COUNT = 10000000;

volatile int fun_function(volatile int v)
{
  return ++v;
}

int main() {
  delegate<volatile int (volatile int)> del(fun_function);

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = del(v);
  }

  auto end = std::chrono::high_resolution_clock::now();
  std::cout <<
    std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << '\n';
  return 0;
}

And the results are...

My analysis:

Hmm..in all cases virtual function wins by a massive margin, except for clang on ubuntu.
The fast delegate is bit faster on both Mac and ubuntu with clang but not on g++-6.2, where std::function wins against the fast delegate. For now, I will consider both std::function and fast delegate equally fast until we do some more micro-benchmarking.
The results for virtual function on Ubuntu/g++ is very surprising. What makes it do the job so much faster ? Lets have a look at its assembly (with my comments):

        movl    $10000000, %edx  // Setting up the counter value in edx
        movq    %rax, %rbx
        .p2align 4,,10
        .p2align 3
.L3:
        movl    4(%rsp), %eax    // Load the volatile 'v' into eax
        addl    $1, %eax         // Add 1 to the value
        subl    $1, %edx         // Subtract the counter
        movl    %eax, 4(%rsp)    // Save the new value to its actual place
        jne     .L3              // Loop

Well, it seems that compiler was successfully able to devirtualize the virtual call and further optimize it to make all the computations without calling any function!! Thats brilliant, but more work for me.

Let's change the main function for the virtual function code to:

int main(int argc, char* argv[]) {
  Derived d;
  Dummy u;
  Base * b = nullptr;

  if (*argv[1] == '0') {
    b = &d;
  } else {
    b = &u;
  }

  volatile int v = 0;
  auto start = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < ITER_COUNT; i++) {
    v = b->fun_function(v);
  }

  auto end = std::chrono::high_resolution_clock::now();

}

I simply added one more derived function and made the decision to point to any one of them via a base class pointer a runtime decision based on command argument.

Lets rerun the tests and see if it changes anything:

Yeah, we now do have a reasonable measurement for virtual function on Ubuntu/g++ combo. But the general analysis is:

Virtual Function Call << std::function <~ Fast Delegate

What makes std::function so much slower than Virtual Function ?

Lets have a look at there corresponding assembly code (on Ubuntu/g++-6.2) (I just hope my comments to the assembly instructions are not way off):

// Assembly for the calling part for Virtual Function
        movq    %rax, %r12
        .p2align 4,,10
        .p2align 3
.L5:
        movq    0(%rbp), %rax  // rax probably points to the VT
        movl    12(%rsp), %esi // Move volatile 'v' to esi
        movq    %rbp, %rdi     // Move the 'this' pointer to rdi
        call    *(%rax)
        subl    $1, %ebx
        movl    %eax, 12(%rsp) // The new result stored in eax is stored back.
        jne     .L5

// Assembly for the calling part for std::function
        movq    %rax, %rbp
        .p2align 4,,10
        .p2align 3
.L11:
        cmpq    $0, 32(%rsp)   // Test if callable object is stored
        movl    8(%rsp), %eax  // Read the volatile 'v' and store it in eax
        movl    %eax, 12(%rsp) // Store the volatile in eax at a different offset from stack pointer
        je      .L27           // Jump to exception handling code
        leaq    12(%rsp), %rsi // Store the volatile value in rsi
        leaq    16(%rsp), %rdi // Store the Any_data union address in rdi
.LEHB0:
        call    *40(%rsp)      // Call the function callable target
        subl    $1, %ebx
        movl    %eax, 8(%rsp)
        jne     .L11

As can be seen, std::function requires more instructions to be executed before making a call to the target callable, 6 instructions again 3 for virtual function.

One thing to not is that, for the std::function case, there is a test against zero for each iteration. This is coming from the below function:

    template <typename Res, typename... ArgTypes>
    Res
    function <Res(ArgTypes...)>::
    operator()(ArgTypes... args) const
    {
      if (M_empty())
        throw_bad_function_call();
      return M_invoker(M_functor, std::forward<ArgTypes>(args)...);
    }

Let's see what happens if we avoid doing this check.

Well, not much difference or no difference at all after removing the check! I am sure I have read it somewhere that removing the check gives increase speedup. Not sure using which compiler and on which platform, but surely it has no effect on Ubuntu using g++-6.2 with O2 optimization.

Let's check the new assembly code just for the hell of it:

.L10:
        movl    8(%rsp), %eax
        leaq    12(%rsp), %rsi
        leaq    16(%rsp), %rdi
        movl    %eax, 12(%rsp)
        call    *40(%rsp)
        subl    $1, %ebx
        movl    %eax, 8(%rsp)
        jne     .L10

As can be seen, we no longer have the instructions to make the test against zero and for throwing exceptions. So, I am assuming its the two leaq instructions which is affecting the performance in case of std::function against a virtual function call.

!!!! I wonder if the compiler could have optimized the 2 leaq instructions outside the loop, as they are not changed at all. That would make the instruction cache more hot. !!!!

Second Round

Till now, we saw the benchmarks based upon just using the chrono library and by analyzing the assembly. Let's pimp up the setup. I will be using google benchmark library (on both mac and ubuntu) and perf (only on ubuntu).

The tests are:

BM_std_function_basic: Same test which we did in first round.
BM_std_function_heavy: Test which disables SBO i.e with a functor having size big enough which cannot be stored in the internal buffer.
BM_std_function_poly_cb: Test involving member function callback. Kind of similar to what has been done here.
BM_virtual_function_basic/48: Same test which we did in first round.
BM_imp_fast_delegate_basic: Same as #1 but using fast delegate.
BM_imp_fast_delegate_heavy: Same as #2 but using fast delegate.
BM_imp_fast_delegate_poly_cb: Same as #3 but using fast delegate.

Result on Mac Os using clang++-3.7:

Hmm..std::function seems to beat 'fast delegate' in both #2 and #3 test. Lets look at what is happening on Ubuntu...

Result on Ubuntu using g++-6.2:

Result on Ubuntu using clang++-3.8 using libc++:

Hmm...g++ one certainly did optimize out the #3 and #7 test (or may be not), but I am not interested in that now. What particularly caught my attention is the 11 ns difference in the #2nd test between g++ and clang++.

Other than that, from the above basic test results, I see no reason to go for another delegate implementation rather than using std::function. There may be other interesting test cases to try out, but this is all I have time for as of now :( (This is a way to indirectly ask help from reader to contribute more tests :) ) .

To go even further beyond....

Can't help but to post this link when the heading is something like this...

Now, let's see why we have a 11 ns difference for the 'heavy' test. Between, I used 'perf' as well to get something out of it, but in that too I ended up annotating the callbacks to show the assembly. So, assembly is all I have to figure out this difference.

Assembly for the clang version:

.LBB2_3:                                
        movq    8(%rbx), %rax
        leaq    1(%rax), %rcx
        movq    %rcx, 8(%rbx)
        cmpq    72(%rbx), %rax
        jae     .LBB2_8
# BB#4:                                 
        movl    $40, %edi
        callq   _Znwm         // This is where it calls the operator new
        movq    $_ZTVNSt3__110__function6__funcI10BigFunctorNS_9allocatorIS2_EEFbPKcEEE+16, (%rax)
                              // Store the Virtual Table in rax
        movq    %rax, 32(%rsp)
        xorps   %xmm0, %xmm0
        movups  %xmm0, 20(%rax)
        movl    $0, 36(%rax)
        movq    %r15, 8(%rax)
        movl    $1735289202, 16(%rax)   
        #APP
        #NO_APP
        movq    32(%rsp), %rdi
        cmpq    %r14, %rdi
        je      .LBB2_5
# BB#6:                                 
        testq   %rdi, %rdi
        jne     .LBB2_7
        jmp     .LBB2_1
        .align  16, 0x90
.LBB2_5:                                
        movq    (%rsp), %rax // Prepare for the function call i.e make the arguments ready in register
        movq    %r14, %rdi
        callq   *32(%rax)    // Call the function, remember rax is now pointing to VT ?
        jmp     .LBB2_1
.LBB2_2:                                
        movq    %rbx, %rdi
        callq   _ZN9benchmark5State16StartKeepRunningEv
        jmp     .LBB2_3

Assembly for g++ version:

.L95:
        movq    8(%rbx), %rax
        cmpq    72(%rbx), %rax
        leaq    1(%rax), %rdx
        movq    %rdx, 8(%rbx)
        jnb     .L111
        movl    $32, %edi
        movq    $0, 32(%rsp)
.LEHB3:
        call    _Znwm          // Invoke the operator new
.LEHE3:
        leaq    8(%rsp), %rsi
        leaq    16(%rsp), %rdi

        // !! This is the start where its trying to zero out
        // the buffer inside the functor          

        movb    $0, (%rax)
        movb    $0, 1(%rax)
        movb    $0, 2(%rax)
        movb    $0, 3(%rax)
        movb    $0, 4(%rax)
        movb    $0, 5(%rax)
        movb    $0, 6(%rax)
        movb    $0, 7(%rax)
        movb    $0, 8(%rax)
        movb    $0, 9(%rax)
        movb    $0, 10(%rax)
        movb    $0, 11(%rax)
        movb    $0, 12(%rax)
        movb    $0, 13(%rax)
        movb    $0, 14(%rax)
        movb    $0, 15(%rax)
        movb    $0, 16(%rax)
        movb    $0, 17(%rax)
        movb    $0, 18(%rax)
        movb    $0, 19(%rax)
        movb    $0, 20(%rax)
        movb    $0, 21(%rax)
        movb    $0, 22(%rax)
        movb    $0, 23(%rax)
        movb    $0, 24(%rax)
        movb    $0, 25(%rax)
        movb    $0, 26(%rax)
        movb    $0, 27(%rax)
        movb    $0, 28(%rax)
        movb    $0, 29(%rax)
        movb    $0, 30(%rax)
        movb    $0, 31(%rax)

        movq    %rax, 16(%rsp)
        movq    $_ZNSt17_Function_handlerIFbPKcE10BigFunctorE9_M_invokeERKSt9_Any_dataOS1_, 40(%rsp)
        movq    $_ZNSt14_Function_base13_Base_managerI10BigFunctorE10_M_managerERSt9_Any_dataRKS3_St18_Manager_operation, 32(%rsp)
        movq    $.LC2, 8(%rsp)
        call    _ZNSt17_Function_handlerIFbPKcE10BigFunctorE9_M_invokeERKSt9_Any_dataOS1_ // Call the target callable
        movq    32(%rsp), %rax
        testq   %rax, %rax
        je      .L101
        leaq    16(%rsp), %rsi
        movl    $3, %edx
        movq    %rsi, %rdi
        call    *%rax         // Another call is part of destructor, it calls the _M_manager() function
        cmpb    $0, (%rbx)
        jne     .L95
.L110:
        movq    %rbx, %rdi
.LEHB4:
        call    _ZN9benchmark5State16StartKeepRunningEv
        jmp     .L95

Do you see all those per byte memset to 0 ? This is the code doing a reset of the 32 byte buffer inside the functor. I never asked the code to do it anywhere, then why is g++ hell bent on doing it ? Is this a bug in g++? This is might be the reason for the additional 10-11 ns difference. Well, I will try to open a bug report anyways.

Another observable difference from the assembly code is that each iteration executes 2 'call' instructions. One is for the actual target function call and the other is for calling the '_M_manager' function inside the destructor, which based upon the operation passed does the required action. This is a fairly costly operation as well. So, if you are using g++'s libstd++ implementation of std::function, watch out for the above 2 costs.

Sentinel

And that is it! I am done with my intended plan of writing this 3 part blog post about std::function. And I have some take aways:

Do not shy away from using raw pointer to functions or member function pointers if it can solve your specific problem. It is highly unlikely one would be writing a generic function everywhere. If you are not at all worried about performance, then go ahead use std::function all over the place (responsibly).
Before considering any such other 'fast' delegates, consider using std::function. If in doubt, measure. If you see anything wrong in my measurements, let me surely know in the comments, I will try to add them in my tests and see the result.
If you just want to pass a callback to another function, which would then be executed immediately within its call stack, then probably something like below would be more efficient:

template <typename Handler, typename... Args>
auto callback_hndler(Handler&& h, Args&&... args) {
  return (std::forward<Handler>(h))(std::forward<Args> args...);
}

callback_hndler([](auto a, auto b) { something(); }, aa, bb );

In such cases making the signature of 'callback_hndler' to take an 'std::function' instead would be a bad decision.
ASIO does something similar in a broad sense for its async operations, but then it goes on to allocate the passed handler on heap since it has to perform the operation on a different call stack. One can use std::function in such scenarios as it lifetime management of the callback for you.

There are few more points/examples that I would have liked to give, but this much is probably already more than enough for knowing about std::function. Guess what ? By this time Goku would have probably already transformed into SSJ3 or is he still screaming ? :)

Sunday, 4 September 2016

What you need _not_ know about std::function - Part 2

Part-1 : Getting to know the basics
Part-3 : Performance benchmarks

In the previous post we saw some basic type traits and metafunctions that are used in the implementation of std::mem_fn and std::function. The main gist of this post is to understand the implementation of std::function in libstd++.

Also, I would have liked to talk about things like:

The performance penalty of using std::function.
When to use std::function.
Some performance benchmarks.

I will leave the above topics as the centre of discussion in the third part of the series (This was supposed to be the sentinel post as per my original plan). Discussing them here will just make this post a torture to read.

A naive implementation

Library development in C++ is tough. It becomes even more intricate when it is supposed to be generic. The most difficult part or rather detail oriented part is the type system itself. Dealing with the types together with their qualifications results in lot of metaprogramming boilerplate code and if you forget something it could lead to some nasty UB.

So, let's first try to implement our own but rather very simple version of std::function. I will be skipping a-great-many required details to keep the code small.

The requirements

1. Should be able to invoke any callable target.
2. The underlying type of the callable object must be hidden i.e the only type information that should be visible is the pure function signature only.
3. The target callable object must be valid as long as the function object is valid.

I have kept bare minimum requirements for my implementation to keep off from doing metaprogramming hackery which we would be seeing anyways while going through the implementation of std::function.

The first attempt

This version just implements the basic interface of Function and also implements the specialization for function pointer. In the later versions (or attempts), the interface developed for Function would more or less remain the same only the specialization required to handle other types of callable objects would be handled.

template <typename> class Function;

template <typename Ret, typename... Args>
class Function<Ret(Args...)>
{
public:
  using implementation = detail::function_impl_base<Ret, Args...>;

  //Default constructor
  Function() = default;

  template <typename Callable>
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<Callable>>(
                   std::move(f), detail::constructor_tag()))
  {}

  // Copy constructor
  Function(const Function& other)
  {
    this->impl_base_.reset(other.impl_base_->clone());
  }

  // Copy Assignment
  Function& operator=(Function other)
  {
    this->impl_base_.reset(other.impl_base_.release());
    return *this;
  }

  template <typename Callable>
  Function& operator=(Callable cb)
  {
    this->impl_base_ =
      std::make_unique<detail::func_impl<Callable>>(
                std::move(cb), detail::constructor_tag());
    return *this;
  }

  Ret operator()(Args&&... args)
  {
    assert (impl_base_.get());
    return (*impl_base_)(std::forward<Args>(args)...);
  }

private:
  std::unique_ptr<implementation> impl_base_ = nullptr;
};

Nothing much fancy. We have the same signature as that of std::function i.e the type Function is not aware about the type of the call target. That means that for function pointer or member function pointer or functor, the syntax of the Function object declaration would be exactly the same (provided the function signature is same of course) even though the actual types of all these callable objects are different. This is achieved using a technique called 'Type Erasure'.

Type Erasure

I am not intending to explain type-erasure in detail here. There are already a lot of content on it floating on the web. boost::any is one of the very basic example of type-erasure. shared_ptr, std::function are the other that I know of in the standard library that makes use of this technique.

Type-erasure in it's basic form is implemented by making use of polymorphism and thus requires dynamic memory allocation. This is how my implementation of Function does the type-erasure.

In the above code function_impl_base is the base class defining the interface. The derived classes implementing the specializations for different types of Callable types must implement the functions provided by the interface.

namespace detail {

  struct constructor_tag {};

  template <typename Ret, typename... Args>
  class function_impl_base
  {
  public:
    virtual Ret operator()(Args&&... args) const = 0;
    virtual function_impl_base* clone() = 0;
    virtual ~function_impl_base() {}
  };

  template <typename T> class func_impl;

  // Specialization for Function pointers
  template <typename Ret, typename... Args>
  class func_impl<Ret(*)(Args...)> final : public function_impl_base<Ret, Args...>
  {
  public:
    using callable_t = Ret(*)(Args...);

    /*
     * This is a greedy constructor. constructor_tag is just used so as
     * to not interfere with regular copy-constructor.
     */
    template <typename F>
    func_impl(F&& callable, constructor_tag): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;

    func_impl& operator=(func_impl&) = default;

    Ret operator()(Args&&... args) const
    {
      return callable_(std::forward<Args>(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

}

Let's now look at the constructor of Function in a bit more detail:

template <typename Callable>
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<Callable>>(std::move(f), detail::constructor_tag()))
  {}

The constructor is a template constructor and is the only part of Function class (lets forget about assignment and copy constructors for the time being) which actually knows about the type of the callable object. But this knowledge is lost as soon as the constructor exits. But, we need to save this information in some form, right ? This is where inheritance comes into picture. Using inheritance, the type of the information is saved as RTTI (Run Time Type Information) probably in the vtable. This is exactly the purpose of function_impl_base.
The class func_impl<Ret(*)(Args...)> inherits from function_impl_base and implements the call operator.

The final attempt

Not much difference, only added specialization for member function pointer and functors:

namespace detail {

  template <typename Ret, typename... Args>gt;
  class function_impl_base
  {
  public:
    virtual Ret operator()(Args&&... args) = 0;
    virtual function_impl_base* clone() = 0;
    virtual ~function_impl_base() {}
  };

  template <typename Signature, typename Functor>gt; class func_impl;

  // Specialization for Function pointers
  template <typename Ret, typename... Args>gt;
  class func_impl<Ret(Args...), Ret(*)(Args...)>gt; final
            : public function_impl_base<Ret, Args...>gt;
  {
  public:
    using callable_t = Ret(*)(Args...);

    // The need for constructor tag removed in this version
    func_impl(callable_t callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;

    func_impl& operator=(func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return callable_(std::forward<Args>gt;(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

  // Specialization for Functors
  template <typename Functor, typename Ret, typename... Args>gt;
  class func_impl<Ret(Args...), Functor>gt;
            : public function_impl_base<Ret, Args...>gt;
  {
  public:
    using callable_t = Functor;

    func_impl(Functor callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;
    func_impl& operator=(func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return callable_(std::forward<Args>gt;(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
      callable_t callable_;
  };

  // Specialization for Member function pointers
  // Leveraging the use of mem_fn
  template <typename Class, typename Member, typename Ret, typename... Args>gt;
  class func_impl<Ret(Args...), Member Class::*>gt;
            : public function_impl_base<Ret, Args...>gt;
  {
  public:
    using callable_t = Member (Class::*);

    func_impl(callable_t callable): callable_(std::move(callable))
    {}

    func_impl(const func_impl&) = default;
    func_impl& operator=(func_impl&) = default;

    Ret operator()(Args&&... args)
    {
      return call(std::forward<Args>gt;(args)...);
    }

    template <typename ClassType, typename... Params>gt;
    Ret call(ClassType obj, Params&&... args)
    {
      return std::mem_fn(callable_)(obj, std::forward<Params>gt;(args)...);
    }

    func_impl* clone()
    {
      return (new func_impl(*this));
    }

  private:
    callable_t callable_;
  };

}

template <typename>gt; class Function;

template <typename Ret, typename... Args>gt;
class Function<Ret(Args...)>gt;
{
public:
  using implementation = detail::function_impl_base<Ret, Args...>gt;;
  using call_signature = Ret(Args...);

  //Default constructor
  Function() = default;

  // Copy constructor
  Function(const Function& other)
  {
    this->gt;impl_base_.reset(other.impl_base_->gt;clone());
  }

  // Copy Assignment
  Function& operator=(Function other)
  {
    this->gt;impl_base_.reset(other.impl_base_.release());
    return *this;
  }

  template <typename Callable>gt;
  Function& operator=(Callable cb)
  {
    this->gt;impl_base_ =
      std::make_unique<detail::func_impl<call_signature, Callable>gt;>gt;(std::move(cb));
    return *this;
  }

  // constructor
  template <typename Callable>gt;
  Function(Callable f): // Requires callable be copyable
    impl_base_(std::make_unique<detail::func_impl<call_signature, Callable>gt;>gt;(std::move(f)))
  {
  }

  Ret operator()(Args&&... args)
  {
    assert (impl_base_.get());
    return (*impl_base_)(std::forward<Args>gt;(args)...);
  }

  explicit operator bool() const noexcept
  {
    return impl_base_ != nullptr;
  }

private:
  std::unique_ptr<implementation>gt; impl_base_ = nullptr;
};

The above code can be accessed from HERE.

Issues with the custom implementation

1. Memory allocation. Dynamic memory allocation is costly. It is probably ok if you know you are not going to create many instances of Function object and pass it around. But it's a big no-no in a performance critical code. This can be resolved in some cases by doing SBO (Small Buffer Optimization), which is what almost all implementation does.

2. No type checking at all. Since the type of the actual callable object gets erased after constructor and assignment operators. It is better to check whether the user is correctly using the function object. If not, it should be a compile time error rather than leading to subtle undefined behaviour in the code.

3. Overhead of a virtual function call and that of an additional call to the callable object. This can be pretty bad, if you going to call your implementation functions via Function.

4. The cost of generic programming. This is my personal opinion, customized solution can be made faster than the above Function implementation, but when you have to use something like Function as part of your interface to client code, then this is the way we have to do and improve upon it. This is something which I would like to touch upon in the final part of this series.

The Implementation of std::function

In this section we will see the implementation of std::function in libstd++ 6.1. We will be making use of some of the type traits discussed in Part-1 and will see some new ones here. Let's first lookup at some concepts and tricks individually and connect them up later.

Location Invariance

The metafunction for this check is:

template <typename T>
struct is_location_invariant
     : is_trivially_copyable<T>::type
{ };

This is one of the checks that is made to determine if SBO can be performed on the passed callable type or not. This is mainly relevant for Functors, since function pointer and member function pointers are trivially copyable. So, it is necessary that one must make sure that his/her functor implements the TriviallyCopyable concept.

For the lazy people (from cppreference) :

Every copy constructor is Trivial or deleted (since C++14)

Every move constructor is Trivial or deleted (since C++14)

Every copy assignment operator is Trivial or deleted (since C++14)

Every move assignment operator is Trivial or deleted (since C++14)

at least one copy constructor, move constructor, copy assignment operator, or move assignment operator is non-deleted

(since C++14)

Trivial non-deleted (since C++14) destructor

This implies that the class has no virtual functions or virtual base classes.
Scalar types and arrays of TriviallyCopyable objects are TriviallyCopyable as well, as well as the const-qualified (but not volatile-qualified) versions of such types.

No-Copy Types

This is nothing but the types of callables that does not need to be copied:

  class Undefined_class;

  union Nocopy_types
  {
    void*       M_object;
    const void* M_const_object;
    void (*M_function_pointer)();
    void (Undefined_class::*M_member_pointer)();
  };

This basically means, below types are no-copy types:-

Pointer to a callable object.
Pointer to a const callable object.
A function pointer.
A member function pointer.

The lack of any template parameters should give you an hint that the actual types mentioned inside the union is not going to be used. Maybe it's the size of the union ? We will get to that shortly.

Small Buffer Optimization

Let's see the data structure employed for the all important small buffer optimization by libstd++

  union Any_data
  {
    void*       M_access()       { return &M_pod_data[0]; }
    const void* M_access() const { return &M_pod_data[0]; }

    template <typename T>
    T& M_access() {
      return *static_cast <T*>(M_access()); 
    }

    template <typename T >
    const T& M_access() const {
      return *static_cast<const T*>(M_access());
    }
      
    Nocopy_types M_unused;
    char M_pod_data[sizeof(Nocopy_types)];
  };

It is just a char buffer having a size of sizeof(Nocopy_types) bytes, which should be 8 bytes on 64 bit platform. This is another constraint that the functor must have other than the location invariance, for the implementation to perform SBO.
If the size of your functor is more than 8 bytes, then you will have to pay for one dynamic allocation.

The Function_base class

This class basically deals with the small buffer or Any_data directly. It makes the decision of what needs to be stored in Any_data i.e whether it should be the callable object as it is or the memory location of the callable object after allocating it on heap.

There are 2 inner classes named Base_manager and Ref_manager (a specialization for reference_wrapped functors), which does the actual dealing with Any_data.

Lets dig into the implementation of Base_manager, which would give us a lot of idea on what is happening behind the scenes.

      template<typename Functor>
      class Base_manager
      {
      protected:
        static const bool stored_locally =
        (is_location_invariant<Functor>::value
         && sizeof(Functor) <= M_max_size  // 8 bytes on 64 bit
         && alignof(Functor) <= M_max_align
         && (M_max_align % alignof(Functor) == 0));

        typedef integral_constant<bool, stored_locally> Local_storage;

        // Retrieve a pointer to the function object
        static Functor*
        M_get_pointer(const Any_data& source)
        {
          const Functor* ptr =
            stored_locally? std::addressof(source.M_access<Functor>())
            /* have stored a pointer */ : source.M_access<Functor*>();
          return const_cast<Functor*>(ptr);
        }

        // Clone a location-invariant function object that fits within
        // an _Any_data structure.
        static void
        M_clone(Any_data& dest, const Any_data& source, true_type)
        {
          new (dest.M_access()) Functor(source.M_access<Functor>());
        }

        // Clone a function object that is not location-invariant or
        // that cannot fit into an _Any_data structure.
        static void
        M_clone(Any_data& dest, const Any_data& source, false_type)
        {
          dest.M_access<Functor*>() =
            new Functor(*source.M_access<Functor*>());
        }

        // Destroying a location-invariant object may still require
        // destruction.
        static void
        M_destroy(Any_data& victim, true_type)
        {
          victim.M_access<Functor>().~Functor();
        }

        // Destroying an object located on the heap.
        static void
        M_destroy(Any_data& victim, false_type)
        {
          delete victim.M_access<Functor*>();
        }

      public:
        static bool
        M_manager(Any_data& dest, const Any_data& source,
                   Manager_operation op)
        {
         //skipped
         ........
        }

        static void
        M_init_functor(Any_data& functor, Functor&& f)
        { M_init_functor(functor, std::move(f), Local_storage()); }


      private:
        static void
        M_init_functor(Any_data& functor, Functor&& f, true_type)
        { new (functor.M_access()) Functor(std::move(f)); }

        static void
        M_init_functor(Any_data& functor, Functor&& f, false_type)
        { functor.M_access<Functor*>() = new Functor(std::move(f)); }
      };

Now that's something. The integral constant Local_storage is set to true if the functor can be stored inside the Any_data buffer else it will be false and the other member functions gets called based on that using tag-dispatching. For eg. check out the definitions for M_clone member function.

Let's look at functions M_init_functor and M_get_pointer in bit more detail to understand how the callable object is stored and retrieved by std::function.

 static void
 M_init_functor(Any_data& functor, Functor&& f)
 { M_init_functor(functor, std::move(f), Local_storage()); }

 static void
 M_init_functor(Any_data& functor, Functor&& f, true_type)
 { new (functor.M_access()) Functor(std::move(f)); }

 static void
 M_init_functor(Any_data& functor, Functor&& f, false_type)
 { functor.M_access<Functor*>() = new Functor(std::move(f)); }

Considering the passed callable object meets the requirement for being stored on stack, Local_storage would be of true_type, hence the second M_init_functor will get called, which makes use of placement new operator to construct the object on the stack itself.

If the callable object does not meet the requirement for being stored locally i.e it is either not TriviallyCopyable or has data members making its size more than what is allowed (i.e 8 bytes on 64 bit platform), then the callable object is created on heap and its memory location gets stored in the Any_data buffer. This is what the third M_init_functor function does.

 static Functor*
 M_get_pointer(const Any_data& source)
 {
   const Functor* ptr =
        stored_locally ? std::addressof(source.M_access<Functor>())
             : source.M_access<Functor*>();
   return const_cast<Functor*>(ptr);
 }

This must be now easier to reason about. Depending upon whether the callable object is stored locally or not, M_get_pointer gets the address of the callable object. The other member functions are just based upon this technique.

The Ref_manager is just basically a specialization to take care of functor passed via a reference_wrapper. Which is something one must consider to do if their function object size is greater than 8 bytes (on 64 bit platform) AND if the scope of the callable object outlives the scope of the std::function object.

So, having seen Base_manager and Ref_manager class, lets see how Function_base class looks like:

 class Function_base {
 public:     
   static const std::size_t M_max_size = sizeof(Nocopy_types);
   static const std::size_t M_max_align = alignof(Nocopy_types);

   template <typename Functor>
   class Base_manager { ... };

   template <typename Functor>
   class Ref_manager : public Base_manager <Functor*>
   {
     .....
   };

   typedef bool (*Manager_type)(Any_data&, const Any_data&,
                                Manager_operation);

   Any_data     M_functor;
   Manager_type M_manager;
 };

As one can see, Manager_type is a function pointer pointing to either Base_manager::M_manager or Ref_manager::M_manager function. These functions basically fill the destination Any_data with the stored callable object from the source Any_data based upon the operation. Something like below:

        static bool
        M_manager(Any_data& dest, const Any_data& source,
                  Manager_operation op)
        {
          switch (op)
          {
            case get_functor_ptr:
              dest.M_access <Functor*>() = M_get_pointer(source);
              break;

            case clone_functor:
              M_clone(dest, source, Local_storage());
              break;

            case destroy_functor:
              M_destroy(dest,Local_storage());
              break;
            }
          return false;
        }

Function Handlers

The function handlers are kind of similar in functionality to func_impl in my custom implementation i.e to invoke the actual callable object. The classes which we saw till now were to only manage the storage part, they were not concerned with invoking the call operator.

Function handler classes are derived from either Function_base::Base_manager or Function_base::Ref_manager based upon the functor type. The function handlers are also specialized for plain function pointers, member pointers and for functors.

We will be seeing only few of those specialization.

1. Specialization for plain function pointer.

    template<typename Signature, typename Functor>
    class Function_handler;

  template<typename Res, typename Functor, typename... ArgTypes>
    class Function_handler<Res(ArgTypes...), Functor>
    : public Function_base::Base_manager<Functor>
    {
      typedef Function_base::Base_manager<Functor> Base;

    public:
      static Res
      M_invoke(const Any_data& functor, ArgTypes&&... args)
      {
        return (*Base::M_get_pointer(functor))(
            std::forward<ArgTypes>(args)...);
      }
    };

2. Specialization for Member pointer.

   template<typename Class, typename Member, typename Res,
           typename... ArgTypes>
    class Function_handler<Res(ArgTypes...), Member Class::*>
    : public Function_handler<void(ArgTypes...), Member Class::*>
    {
      typedef Function_handler<void(ArgTypes...), Member Class::*>
        Base;

     public:
      static Res
      M_invoke(const Any_data& functor, ArgTypes&&... args)
      {
        return std::mem_fn(Base::M_get_pointer(functor)->value)(
            std::forward<ArgTypes>(args)...);
      }
    };

This is very similar to what we did for member pointers in our custom implementation i.e let std::mem_fn do the heavy lifting of taking care of both member data pointer and member function pointer.

Design uptill now

Specialization of a regular functor is also not much different than what we saw for plain function pointer. The design is like below:
1. Function_base is responsible for storing and retrieving the pointers to the callable object.
2. The function handler objects fetches the pointer to the callable objects from Function_base and invokes them with the passed arguments.

The std::function class

This is what we all have been waiting for.But, you would be surprised (or not) to know that we have already covered all the heavy lifting part ! function class is more about the interface it provides, the type checkings and conversions.

The basic interface is pretty close to what we implemented in our custom solution, so we will be only looking at the interesting parts i.e the things we left out from our custom implementation.

A stripped down version of std::function:

 template < typename Res, typename... Args >
 class function<Res(Args...)>
     : private Function_base
 {
   typedef Res Signature(Args...);
 public:
   function(): Function_base() {}

   template <typename Functor,
                typename = Requires<not<is_same<Functor, function>>, void>,
                typename = Requires<Callable<Functor>, void>>
   function(Functor);

   Res operator()(ArgTypes... args) const;

 private:
   using _Invoker_type = Res (*)(const Any_data&, ArgTypes&&...);
   Invoker_type M_invoker;
 };

 template <typename Res, typename... ArgTypes>
 template <typename Functor, typename, typename>
 function<Res(ArgTypes...)>::
 function(Functor f)
      : Function_base()
 {
      typedef Function_handler <Signature, Functor> My_handler;

      My_handler::M_init_functor(M_functor, std::move(f));
      M_invoker = &My_handler::M_invoke;
      M_manager = &My_handler::M_manager;
 }

This is the part of code which I found pretty interesting. It is interesting because, it does not make use of inheritance the way we used in our custom implementation. With this kind of implementation, it is not required to do any dynamic allocation unless required because of the nature of functor.

The Functor parameter will determine what kind of Function_handler class we need to use. The invoker, which is a function pointer will be set to the correct M_invoke function of the function handler.
Thus, a single call to a callable object would require at most 2 pointer dereferences, no virtual calls and no dynamic allocation (depending on your Functor as we have already seen).

Type checking on assigning target

If we look at the constructor (or copy or assignment) template parameters a little bit more closely, we will see something like:

Callable <Functor>

Let us see what it is:

 template<typename From, typename To>
 using check_func_return_type
      = or<is_void<To>, is_same<From, To>, is_convertible<From, To>>;

 template<typename Func,
          typename Res2 = typename result_of<Func(ArgTypes...)>::type>
 struct Callable : check_func_return_type<Res2, Res> { };

The idea is to check whether the provided callable type yields a return type (Res2) when invoked with the argument types passed as the template parameters of std::function and that return type i.e Res2 is compatible with what was provided in the std::function declaration or in its template parameter.

The reason for adding is_same is pretty interesting. For details check THIS bug report.

I am not sure why is_convertible is required since its failure would not result in lookup of a constructor with different signature as there is only one definition of the constructor. The only advantage I see is, any wrong usage of std::function i.e. assigning it with a target with incompatible arguments would result in error thrown at the constructor itself, rather than at the calling site. If anybody knows about any other advantage, then please do let me know.

For example, see the usage of func1 and func2 and the error it yields using both my custom implementation (which does not use conversion checks) and std::function.

Error with My implementation:

./my_function.hpp:72:14: error: ambiguous conversion from derived class 'D' to base class 'A':
struct D -> struct B -> struct A
struct D -> struct C -> struct A
return callable_(std::forward<Args>(args)...);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As one can see, the error is thrown at the point where we are actually invoking the callable target. This is not good since it is pointing an issue in the library implementers code, which is actually not the case. A good C++ library must try its best to point the error in the users code rather than to its own internal implementation.

Error with std::function implementation:

inheritance.cc:45:5: error: no viable overloaded '='
f = func2;
~ ^ ~~~~~

It points to the user code which messed it up. This is one nice advantage of using is_convertible which assigning a target.

A note about libcxx

The libcxx implementation of std::function looked a lot more straightforward than the libstd++ implementation and they provide a buffer size of 24 bytes (on 64 bit arch) for the SBO, which is 3 times more than what libstd++ offers. Also, libcxx supports allocator other than global new (I am not sure whether standard mandates that or not till c++14), which libstd++ does not currently.

Another difference w.r.t libstd++ is that the libcxx version makes use of virtual function call instead of function pointer as in libstd++. Now, I wouldn't state that function pointer would perform faster than a virtual function call, but I remember seeing the performance measurements of both types of call and a call via function pointer was slightly faster than a virtual function call. Now, don't get all paranoid by that, as we will see in the final part of the blog series, std::function should better be not present in a tight code where performance matters most. In all other cases, the difference (if at all) shouldn't matter at all.

One problem that I see with non-standard way of implementing SBO is performance regression when going from one standard library implementation to another i.e. while moving from a library implementation that has larger internal buffer size than to a one having less (eg: libc++ -> libstd++). This would certainly waste quite some time of the developer who is debugging the issue (as in most cases this developer would not be the one who wrote the code :))