Thursday, October 26, 2023

Explicit Template Instantiation: Reusability and Modularity

In my previous post, I discussed how to reduce compilation times in C++11 using extern templates when the same template instantiation appears in multiple source files.

A few days ago, a reader of the blog asked me for some recommendations on how to choose between extern templates and explicit template instantiation in a source file. Indeed, the latter is an alternative approach to the problem of reducing compile times caused by multiple instantiations of the same template scattered among many source files.

In this post, I will show this technique and point out the differences from the solution explained in the previous post. Finally, I will outline some design implications of the explicit template instantiation approach.

How Does It Work?

The technique consists of:

  • a header file with the declaration of the template function/class,
  • a source file with the function body and the explicit instantiations.

Using the same example of the previous post, the code becomes:


///////////////////////////////
// bigfunction.h

template<typename T>
void BigFunction();


///////////////////////////////
// bigfunction.cpp

#include "bigfunction.h"

template<typename T>
void BigFunction()
{
    // body
}

template void BigFunction<int>();


///////////////////////////////
// f1.cpp

#include "bigfunction.h"

void f1()
{
    ...
    BigFunction<int>();
}


///////////////////////////////
// f2.cpp

#include "bigfunction.h"

void f2()
{
    ...
    BigFunction<int>();
}

In this way, the compiler generates the code exactly once.
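The same technique works for class templates too: the class definition stays in the header, but the member function bodies move to the source file together with an explicit instantiation of the whole class. A minimal sketch (the names are illustrative, not taken from the original example):

```cpp
// bigclass.h
template<typename T>
class BigClass
{
public:
    T Twice(T v);   // declared here, defined in bigclass.cpp
};

// bigclass.cpp
template<typename T>
T BigClass<T>::Twice(T v)
{
    return v + v;   // stands in for a large member function body
}

// Explicit instantiation: generates every member of BigClass<int>
// here, exactly once, for the whole project.
template class BigClass<int>;
```

Clients that include bigclass.h can then use BigClass<int> from any translation unit, while its member functions are compiled only in bigclass.cpp.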

Drawbacks

On the other hand, using the above technique:

  • we prevent the compiler from expanding the code inline,
  • the template becomes inextensible and less reusable (if it’s part of a library, or could become part of one in the future).

While the first point is straightforward, the second might need some extra explanation.

The original problem we’re trying to solve is caused by the fact that we’re using the template in multiple translation units of the same project. This likely means that our template is useful and reusable, maybe in many projects: that’s why it should be considered part of a library. Even if the template is not shared with other projects but we’re reusing it multiple times in a large application, we should maintain it as if it were a library. This has to do with reusability.

Using the technique presented in this post inside a library forces us to decide in advance which template arguments we want to provide to the users (because all explicit instantiations go in the very same source file where the template body is implemented). Unfortunately, this entirely breaks the extensibility of the library, because users cannot instantiate the template with any other types. And this is a shame, because extensibility is the main raison d’être of templates.

And keep in mind that a template is reusable primarily because it’s extensible. If for some reason a template library is not extensible, it becomes also less reusable because you cannot use it with types unplanned by the library author.

So, this solution reduces the reusability of the template, in addition to making it inextensible. However, the reusability property is not completely lost, because sometimes templates are designed for a predefined set of arguments only (not that often, to tell the truth).
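As a hypothetical illustration of a template designed for a predefined set of arguments (the names are invented for this sketch), consider a helper that the author only ever intends to support for two integer types:

```cpp
// scale.h
template<typename T>
T ScaleByTen(T value);           // declaration only: the body is hidden

// scale.cpp
template<typename T>
T ScaleByTen(T value)
{
    return value * 10;           // stand-in for an expensive function body
}

// The complete, intentional list of supported types:
template int  ScaleByTen<int>(int);
template long ScaleByTen<long>(long);
```

Here the lack of extensibility is deliberate: a call to ScaleByTen with, say, double would fail at link time, making the supported set explicit.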

Choices, as Usual

A rule of thumb for deciding between extern templates and explicit instantiation might be:

  • use extern templates when the template is part of a library or should be considered a library,
  • use extern templates when you want the template to be expanded inline,
  • otherwise, define the template in a .cpp file and add all the explicit instantiations you’re going to use in your application.

The last point also reveals that the “explicit instantiation” approach greatly reduces a project's modularity. Since you need to put all the template instantiations in one artifact (the source file), adding new features to the application will likely result in adding/updating more than just one artifact.

On the contrary, in my projects I always aim for an “incremental design”, i.e., a design where a new requirement can be implemented by just adding/updating a single artifact (e.g., a single .cpp file). An “incremental design” brings additional positive properties that I will elaborate on in a future post, but it cannot be achieved if the application has a single center (the .cpp file) entangled with an unstable, unbounded set (the template arguments).

Wednesday, October 25, 2023

Reduce Compilation times with extern Template

Originally published on Simplify C++!, Arne Mertz’s blog on clean and maintainable C++.


Over the past few years, the compilation times of C++ projects have increased significantly, despite the availability of fast computers with multiple CPU/cores and more RAM.

This escalation can be attributed largely to:

  • The shift of certain elaborations from run-time to compile-time via templates and constexpr.
  • The rise in the number of header-only libraries.

While the first factor is unavoidable and desirable, the second can be seen as a questionable trend driven primarily by the convenience of distributing header-only libraries. Being myself guilty of having developed a few header-only libraries, however, I won’t delve into this issue here :-)

In some scenarios, build times can be mitigated using techniques such as enhancing modularity, disabling optimizations, using the pimpl idiom, leveraging forward declarations, and using precompiled headers, among others.

Additionally, C++11 introduced extern template declarations (n1448) that, to some extent, can help speed up compilation times. This concept is akin to extern data declaration and directs the compiler to refrain from instantiating the template in the current translation unit.

How Does extern Template Declaration Work?

The simplest way to figure out how extern template declarations work is to reason over a code snippet. Consider these files:


// bigfunction.h

template<typename T>
void BigFunction()
{
    // function body
}

// f1.cpp

#include "bigfunction.h"

void f1()
{
    ...
    BigFunction<int>();
}

// f2.cpp

#include "bigfunction.h"

void f2()
{
    ...
    BigFunction<int>();
}

This code will lead to the generation of the following object files (on Linux, you can verify this using the nm utility):


> nm -g -C --defined-only *.o

f1.o:
00000000 W void BigFunction<int>()
00000000 T f1()

f2.o:
00000000 W void BigFunction<int>()
00000000 T f2()

When these two object files are linked together, one instance of BigFunction<int>() is discarded (that’s what the “W” symbol type that nm prints next to the function means: a weak symbol). Thus, the compilation time spent generating BigFunction<int>() multiple times is wasted.

To mitigate this redundancy, the extern keyword can be employed:


// bigfunction.h

template<typename T>
void BigFunction()
{
    // function body
}

// f1.cpp

#include "bigfunction.h"

void f1()
{
    ...
    BigFunction<int>();
}

// f2.cpp

#include "bigfunction.h"

extern template void BigFunction<int>();

void f2()
{
    ...
    BigFunction<int>();
}

Resulting in:


> nm -g -C --defined-only *.o

f1.o:
00000000 W void BigFunction<int>()
00000000 T f1()

f2.o:
00000000 T f2()

The same principle extends to template classes, utilizing this syntax:



// bigclass.h

template<typename T>
class BigClass
{
    // class implementation
};

// f1.cpp

#include "bigclass.h"

void f1()
{
    ...
    BigClass<int> bc;
}

// f2.cpp

#include "bigclass.h"

extern template class BigClass<int>;

void f2()
{
    ...
    BigClass<int> bc;
}

Missing Pieces

Unfortunately, it's not as straightforward as it appears.

For instance, when attempting to compile the aforementioned code with optimization enabled (e.g., -O2 on gcc or clang), the linker might report that BigFunction<int>() is undefined. Why?

The problem is that when f1.cpp is compiled with optimization enabled, the template function is expanded inline at the call site instead of being emitted as a standalone symbol, so when the linker processes the f2 object file, it can no longer find the function.

To address this, you can utilize the nm utility to check the symbols exported by the object files, and verify that the issue here is the inline expansion of the function:


> nm -g -C --defined-only *.o

f1.o:
00000000 T f1()

f2.o:
00000000 T f2()

In f1.o the symbol is missing due to optimization, and in f2.o the symbol is missing due to the extern clause.

If you’re using gcc, you can get further evidence of this by trying:


// bigfunction.h

template<typename T>
void __attribute__ ((noinline)) BigFunction()
{
    // body
}

Here, the gcc-specific attribute noinline prevents inline expansion, so that the linker can find it and not complain anymore.

A Global Approach

The gcc-specific attribute noinline is obviously not the ultimate solution to our problem.

A point worth noting here is that the strategy to reduce compilation time is relative to an entire project, and so is the usage of the extern template clause.

A project-wide strategy to capitalize on the extern template mechanism while ensuring that all necessary code is generated for linking might involve:

  1. Including a header file with the extern template clause in every translation unit where the template appears.
  2. Adding a source file to the project containing explicit instantiation.

// bigfunction.h

template<typename T>
void BigFunction()
{
    // function body
}

extern template void BigFunction<int>();


// bigfunction.cpp

#include "bigfunction.h"

template void BigFunction<int>();

// f1.cpp

#include "bigfunction.h"

void f1()
{
    ...
    BigFunction<int>();
}

// f2.cpp

#include "bigfunction.h"

void f2()
{
    ...
    BigFunction<int>();
}

This approach is also applicable when the template function/class is part of a third-party library. In such cases, it suffices to add your own header file that includes the library header and introduces the extern template clause.


// third_party_bigfunction.h

template<typename T>
void BigFunction()
{
    // function body
}

// bigfunction.h

#include <third_party_bigfunction.h>

extern template void BigFunction<int>();

// bigfunction.cpp

#include "bigfunction.h"

template void BigFunction<int>();

// f1.cpp

#include "bigfunction.h"

void f1()
{
    ...
    BigFunction<int>();
}

// f2.cpp

#include "bigfunction.h"

void f2()
{
    ...
    BigFunction<int>();
}

Summary

Reducing compile times using extern template is a project-scope strategy. You should identify the most expensive templates that are instantiated in many translation units, and find a way to tell the build system to compile them just once.

But let’s consider for a moment what we’ve done in the previous paragraph.

We had a template function/class. To minimize the build time, we decided to instantiate it only once for a given template parameter. In doing so, we forced the compiler to generate the function/class exactly once for that parameter, preventing inline expansion (and possibly giving up a run-time optimization). However, if the compiler decided to inline a function, chances are it was not that big, meaning that, after all, we don’t save much build time by compiling it only once.

Anyway, if you’re determined to have it both ways, you can try enabling the link-time optimization flag (-flto on gcc): it performs global optimizations (e.g., inlining) with visibility of the whole project. Of course, this in turn slows down the build process, but you get your function template inlined yet instantiated only once.

Bottom line: programming inevitably involves trade-offs between conflicting aspects, and you should measure carefully whether a template function is slowing down your build (because e.g., it’s instantiated with the same parameter in many compilation units) or your run-time execution (because e.g., it’s called in just one location but in a tight loop) and – above all – consider your priorities.

After all, the observation “premature optimization is the root of all evil” and the rule that immediately follows “measure before optimize” can also be extended to compile times. By carefully measuring the impact of extern template on both build times and run times, you can make informed decisions to strike a balance between optimization for compilation and execution.

At the end of the day, it is inevitable that we decide whether to optimize for compilation or execution. After all, that’s exactly what I wrote at the very beginning of this article: one of the methods to speed up build time is to turn off optimizations :-)

Tuesday, October 17, 2023

Flip Model: A Design Pattern

This article was originally published in the December 2018 issue of ACCU Overload Journal.


In this article, I will introduce a design solution that I have utilized multiple times in the past while working on applications designed to diagnose complex distributed systems. This solution has proven effective across various contexts and continues to demonstrate its robustness in various operational systems.

While I am aware that other developers have employed this pattern, my research has shown a lack of references to it in the existing literature. This motivated me to document and discuss it here.

After careful consideration, I have chosen to present this pattern in the familiar format of a Design Pattern. I believe that this approach remains a convenient way to discuss both software and architectural design, which are fundamental topics in Software Engineering and should not be overshadowed by more mundane concerns.

Moreover, I recognize that some novice developers may not be familiar with the groundbreaking book "Design Patterns" which has left a significant impact. With this article, I hope to bridge that knowledge gap and inspire curiosity about design patterns and software design in general.

In the following sections, I will detail the pattern following the well-established documentation structure outlined in the original book. For further reference on this structure, you can refer to the 'documentation' section in the Wikipedia article or, even better, read the original book.

Pattern name and Classification

Flip Model (behavioral).

Intent

The pattern allows multiple clients to read a complex data model that is continuously updated by a unique producer, in a thread-safe fashion.

Also known as

Model publisher, Pressman, Newsagent.

Motivation (Forces)

There are instances where it becomes necessary to decouple the utilization of a complex data structure from its source. This decoupling should enable each actor to operate at its own pace without causing interference.

Consider, for example, an application that periodically retrieves information from a vast sensor network to perform some kind of statistical elaboration on the collected data set and send alarms when some criteria are met. The data collected from the sensor network is structured in a complex lattice of objects resembling the ones you would find in the physical world so that the elaboration modules can navigate the data in a more natural way. The retrieval operation is a long, complex task, involving several network protocols, that is completely uncorrelated from the statistical analysis and alarms evaluation, and can possibly run in separated threads. Moreover, data retrieval and its usage have different timing (e.g., the sensor network is scanned every 5 minutes, while the statistical elaboration is performed on request by a human operator on the most recent collected dataset).

In this scenario, how can all the modules of the application work together on the same data structure? How can all the clients use the most updated data available in a consistent fashion? And how can the application get rid of the old data when it is no longer needed?

The core idea of the Flip Model pattern involves passing the sensor data structure from the producer to the consumers using two shared pointers (in C++) or two variables (in garbage-collected languages). One variable (termed filling) holds the data structure currently being acquired, while the other (termed current) holds the most recently completed acquisition.

A class named SensorNetwork determines when to initiate a new acquisition and replaces current with filling when the acquisition process concludes. When a client needs to perform some tasks on the data acquired, it contacts SensorNetwork, which returns current (i.e., the most recent data acquired). An instance of the SensorAcquisition class is kept alive and unchanged during the whole time a client holds the smart pointer (and the same is still valid in garbage collected languages).

Both data acquisition (performed by SensorAcquisition) and its reading (performed by the various clients: Statistics, ThresholdMonitor and WebService) can potentially run in multiple threads. The safety of the code is ensured through the following observations:

  • A SensorAcquisition object can be modified only by the thread of SensorNetwork, and never changed after it becomes public (i.e., the smart-pointer current is substituted by filling).
  • The smart pointer exchange is protected by a mutex.

It is worth noting that here the mutex is required because std::shared_ptr provides a synchronization mechanism that protects its control-block but not the shared_ptr instance. Thus, when multiple threads access the same shared_ptr and any of those accesses use a non-const member function, you need to provide explicit synchronization. Unfortunately, our code falls exactly under that case since the method SensorNetwork::ScanCompleted assigns the shared_ptr to a new value.

However, if the presence of a mutex makes you feel back in the eighties, please see the ‘Implementation’ section for some modern alternatives.

The figure below illustrates a typical class structure for the Flip Model pattern:

Applicability

Use Flip Model when:

  • Dealing with a complex data structure that updates slowly.
  • Multiple clients need to asynchronously read the most recent data consistently.
  • Older information should be discarded when it is no longer needed.

Structure

The following figure shows the pattern structure.

Participants

  • Snapshot (SensorAcquisition)
    • Holds the whole set of data acquired from the source.
    • Performs a complete scan.
    • Possibly provides const function members to query the acquisition.
    • Possibly is a set of (heterogeneous) linked objects (e.g., a list of Measure objects).
  • Source (SensorNetwork)
    • Periodically creates a new Snapshot and requests it to perform a scan.
    • Provides the latest complete scan to its clients.
  • Client (WebService, ThresholdMonitor, Statistics)
    • Requests the Source for the latest available Snapshot and utilizes it (in read-only mode).

Collaborations

  • Source periodically creates a new Snapshot instance, assigns it to the filling shared_ptr, and commands it to start the acquisition.
  • Upon completion of the acquisition, Source performs the assignment current = filling, protected by a mutex. If no client is holding the previous current, the Snapshot it pointed to is automatically destroyed (by the shared pointer).
  • When a client needs the most recent Snapshot, it invokes Source::GetLastSnapshot() to obtain current.

The figure below illustrates the collaborations between a client, a source and the snapshots it creates.


Consequences

  • Flip Model decouples the producer from the readers: the producer can proceed with the (slow) update of the data, and each reader always gets the most recent version.
  • Synchronization: producer and readers can run in different threads.
  • Flip Model ensures coherence across all data structures read by a reader at any given moment, without the need to lock these structures for a long time.
  • Memory consumption is minimized while ensuring that each reader has coherent access to the most recent snapshot.

Implementation

Here are eight considerations to keep in mind while implementing the Flip Model pattern:

  1. Acquisition can be initiated periodically (as illustrated in the example) or continuously (immediately after the previous acquisition concludes). In the former case, the scan period must exceed the scan duration; otherwise, an acquisition still in progress is discarded when the timer fires again.
  2. While the pattern is described using C++, it can also be implemented in languages with garbage collection. In C++, std::shared_ptr is necessary to ensure that a Snapshot is deleted when no clients are using it and Source has a more recent snapshot ready. In languages with garbage collection, the collector will take care of deleting old snapshots when they’re no longer in use (though this happens at an unspecified time, potentially leading to a buildup of unused snapshots in memory).
  3. For the std::shared_ptr (or garbage collection) mechanism to work correctly (i.e., old snapshots are deleted) clients must use Source::GetLastSnapshot() each time they require a snapshot.
  4. The Snapshot (and the application in general) can be synchronous or asynchronous. In the former scenario, the Snapshot::Scan method is a blocking function, and the caller (Source) must wait until the data structure is complete before acquiring the mutex and assigning filling to current. Within a synchronous application, clients will run in other threads. In the latter scenario, the Snapshot::Scan method initiates the acquisition and promptly returns. When the data structure is complete, an event notification mechanism (e.g., events, callbacks, signals) announces the end of the operation to Source, which can finally acquire the mutex and assign filling to current. Asynchronous applications can be single-threaded or multi-threaded.
  5. The pattern supports every concurrency model: from single-threaded (in a fully asynchronous application) to maximum parallelization (when the acquisition and each client operate in their own threads). When acquisition and snapshot usage occur in separate threads, a synchronization mechanism is necessary to safeguard the shared_ptr. While the simplest solution involves using a mutex, C++11 and later offer the overloaded free functions std::atomic_load/std::atomic_store for std::shared_ptr (and, since C++20, std::atomic<std::shared_ptr>). It's worth noting that these atomic functions might not be lock-free (in fact, my tests with the latest GCC version show that they're not), potentially resulting in performance inferior to the mutex-based version. A better solution could involve using an atomic int as a key to select the correct shared_ptr (refer to the ‘Sample code’ section for more details).
  6. The objects composing Snapshot (usually a huge complex data structure) can be deleted and recreated in each scan cycle. Alternatively, a pool of objects can be utilized (in this case, the shared_ptr must be replaced with a reference-counted pool object handler).
  7. It's crucial to recognize that Snapshot (and the classes it represents) is immutable. After its creation and the scan is completed, clients can only read from it. Once a new snapshot is available, the old one is deleted, and clients begin reading the new snapshot. This approach greatly enhances concurrency, as multiple clients across different threads can read the same snapshot without requiring locks.
  8. Beware of stupid classes! Snapshot (and the classes it comprises) should not merely be passive data containers. Each class should at least contribute to retrieving its own data, and one could also consider adding methods and facilities to use the data.
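As a sketch of point 5, the mutex can be replaced by the C++11 free functions std::atomic_store/std::atomic_load for std::shared_ptr (deprecated in C++20 in favor of std::atomic<std::shared_ptr<T>>). The names below are illustrative stand-ins for the classes of the example:

```cpp
#include <atomic>
#include <memory>

struct Snapshot { int value = 0; };   // stands in for SensorAcquisition

class Source
{
public:
    // Writer side: publish a completed snapshot atomically.
    void Publish(std::shared_ptr<Snapshot> filled)
    {
        std::atomic_store(&current, std::move(filled));
    }
    // Reader side: grab the latest snapshot atomically.
    std::shared_ptr<Snapshot> GetLastSnapshot() const
    {
        return std::atomic_load(&current);
    }
private:
    std::shared_ptr<Snapshot> current;
};
```

Note that these overloads are not guaranteed to be lock-free, so they are not automatically faster than the mutex-based version shown in the sample code.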

Sample Code

The C++ code below provides an outline of the Flip Model pattern's implementation based on the description provided in the 'Motivation' section:



class SensorAcquisition
{
public:
    // interface for clients
    const SomeComplexDataStructure& Data() const { /* ... */ }
    // interface for SensorNetwork
    template<typename Handler>
    void Scan(Handler h) { /* ... */ }
};

class SensorNetwork
{
public:
    SensorNetwork() :
        timer( [this](){ OnTimerExpired(); } )
    {
        timer.Start(10s);
    }
    shared_ptr<SensorAcquisition> GetLastMeasure() const
    {
        lock_guard lock(mtx);
        return current;
    }
private:
    void OnTimerExpired()
    {
        filling = make_shared<SensorAcquisition>();
        filling->Scan([this](){ OnScanCompleted(); });
    }
    void OnScanCompleted()
    {
        lock_guard lock(mtx);
        current = filling;
    }

    PeriodicTimer timer;
    shared_ptr<SensorAcquisition> filling;
    shared_ptr<SensorAcquisition> current;
    mutable mutex mtx; // protect "current"
};

class Client
{
public:
    Client(const SensorNetwork& sn) : sensors(sn) {}

    // possibly executed in another thread
    void DoSomeWork()
    {
        auto measure = sensors.GetLastMeasure();
        // do something with measure
        // ...
    }
private:
    const SensorNetwork& sensors;
};

It's worth noting that the inclusion of a mutex in the code above is intended for clarity. The subsequent code snippet demonstrates a lock-free alternative:



class SensorNetwork
{
public:
  SensorNetwork() :
    timer( [this](){ OnTimerExpired(); } )
  {
    // just to be sure :-)
    static_assert(atomic_uint::is_always_lock_free); // c++17

    timer.Start(10s);
  }
  shared_ptr<SensorAcquisition> GetLastMeasure() const
  {
    assert(current < 2);
    return measures[current];
  }
private:
  void OnTimerExpired()
  {
    // filling = 1-current
    assert(current < 2);
    measures[1-current] = make_shared<SensorAcquisition>();
    measures[1-current]->Scan([this](){ OnScanCompleted(); });
  }
  void OnScanCompleted()
  {
    current.fetch_xor(1); // current = 1-current
  }

  PeriodicTimer timer;
  shared_ptr<SensorAcquisition> measures[2];
  // c++14
  atomic_uint current{0}; // filling = 1-current
};

Just in case you were wondering: yes, you do need an atomic integer type here, even though only one thread writes it (have a look at the C++ memory model to go down the rabbit hole).
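A minimal sketch of the point: a plain unsigned int read and written concurrently is a data race (undefined behavior), while std::atomic makes the single-writer/multi-reader index exchange well-defined:

```cpp
#include <atomic>

// The index flips between 0 and 1: the writer thread calls Flip() when a
// scan completes, and reader threads call Current() at any time. With a
// plain unsigned int this would be a data race; std::atomic_uint makes
// the accesses legal and guarantees readers see either the old or the
// new index, never a torn value.
std::atomic_uint current{0};

void Flip()        { current.fetch_xor(1); }   // writer: 0 <-> 1
unsigned Current() { return current.load(); }  // readers
```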

Known Uses

Flip Model is used to retrieve the periodic diagnosis of network objects in several applications I worked on. Unfortunately, confidentiality constraints prevent me from sharing specific details about these projects.

Related Patterns

  • The pattern is somewhat similar to ‘Ping Pong Buffer’ (also known as “Double Buffer” in computer graphics), but Flip Model allows multiple clients to read the state, each at its convenient pace. Moreover, in Flip Model, there can be multiple data structures simultaneously, while in ‘Ping Pong Buffer’/‘Double Buffer’ there are always two buffers (one for writing and the other for reading). Finally, in ‘Ping Pong Buffer’/‘Double Buffer’, buffers are swapped, while in Flip Model the data structures are passed from the writer to the readers and eventually deleted.
  • Snapshot can serve as a “Façade” for a complex data structure.
  • Source may utilize a “Strategy” to change update policies (e.g., periodic versus continuous)

Monday, October 16, 2023

Add an Interactive Command Line to Your Applications

In my work, I often develop applications meant to run for a long period of time, without a classic interaction with the user (i.e., without a proper UI). Sometimes these applications run on a server or a custom board, but they are rarely desktop applications. Another characteristic is that they’re not CPU-bound applications, i.e., they’re not the kind of number-crunching applications that you start and then wait for output.

Soon enough, I realized that it’s very useful to have some sort of console to interact with my applications, particularly the embedded kind where the software is running around the clock so that you can easily monitor, configure, and manage your system. After all, this idea is nothing new: Cisco routers, for example, are known for their command line interface, and the same goes for many devices that run unattended.

If you’re working on this kind of software, you should definitely consider adding a command line interface, at least for debugging purposes. For example, wouldn’t it be great to connect to your embedded software with a telnet client to query its internal state, view a dump of some internal structure, modify the log level at runtime, change the working mode, enable or disable some modules, or load and unload plugins?

When you’re in production, the benefits of an interactive (and possibly remote) command line are obvious. But consider the initial phases of development, too. When you’re writing the first prototype of an application doing I/O on custom devices, chances are that you start writing some core features to communicate with the real world: implementing some protocol, some module that does I/O, and so on. How do you test them? You can write a main function that drives your piece of software but, except for simple cases, you also need some sort of state machine to interact correctly with the hardware. So, either you develop a state machine or you use a more interactive solution. A CLI, for example :-) 

Let’s imagine you’re writing a protocol stack to speak with a legacy machine. You’ve got primitives to tell the machine to start the electric motor, stop it, and change direction, as well as notifications for all the meaningful events. Now you can add a command line interface to your main function and register a command for each primitive. In this way, you can immediately begin to test arbitrarily complex use cases on your module, in an incremental manner (of course, this requires being able to write incremental, modular code too, but that’s a subject for another post :-)

Reinventing the Wheel, as Usual…

So, I needed a CLI for my C++ projects. Like every good developer, when I need something I first look at open-source libraries to see if something exists that works for me. Unfortunately, I couldn’t find anything that completely fit my needs. In particular, the vast majority of the available libraries work only on Linux, or they aren’t libraries at all but applications to which you have to hook external programs as commands. None of them provides remote sessions. Few are written in C++. None of them is in modern C++.

Eventually, I wrote my own library, in C++14. It’s available on my GitHub page. It’s production-quality code and has been used in several industrial projects.

A brief summary of features:

  • C++14
  • Cross-platform (Linux and Windows tested)
  • Menus and submenus
  • Remote sessions (telnet)
  • Command history (navigation with arrow keys)
  • Autocompletion (with TAB key)
  • Async interface
  • Colors

It has a dependency on boost::asio to provide an asynchronous interface (for those cases when you want a single-thread application) and to implement the telnet server.

The library is all I needed for my projects. When I have a remote board running my software, I find it very handy to telnet on a specific port of the board to get a shell on my application. I can have a look at the internal state of the software, increase the log level when something strange happens, and even give commands to change the behavior or reset the state if something goes wrong.

Show Me Some Code!

Just to show you the syntax of the library, this is an example of a working application providing both a local prompt and a remote command line interface, with menus and submenus:


#include "cli/clilocalsession.h"
#include "cli/remotecli.h"
#include "cli/cli.h"
 
using namespace cli;
using namespace std;
 
int main()
{
    // setup cli
 
    auto rootMenu = make_unique< Menu >( "cli" );
    rootMenu -> Add(
            "hello",
            [](std::ostream& out){ out << "Hello, world\n"; },
            "Print hello world" );
    rootMenu -> Add(
            "hello_everysession",
            [](std::ostream&)
                { Cli::cout() << "Hello, everybody" << std::endl; },
            "Print hello everybody on all open sessions" );
    rootMenu -> Add(
            "answer",
            [](int x, std::ostream& out)
                { out << "The answer is: " << x << "\n"; },
            "Print the answer to Life, the Universe and Everything");
    rootMenu -> Add(
            "color",
            [](std::ostream& out)
                { out << "Colors ON\n"; SetColor(); },
            "Enable colors in the cli" );
    rootMenu -> Add(
            "nocolor",
            [](std::ostream& out)
                { out << "Colors OFF\n"; SetNoColor(); },
            "Disable colors in the cli" );
 
    auto subMenu = make_unique< Menu >( "sub" );
    subMenu -> Add(
            "hello",
            [](std::ostream& out)
                { out << "Hello, submenu world\n"; },
            "Print hello world in the submenu" );
    subMenu -> Add(
            "demo",
            [](std::ostream& out){ out << "This is a sample!\n"; },
            "Print a demo string" );
 
    auto subSubMenu = make_unique< Menu >( "subsub" );
    subSubMenu -> Add(
            "hello",
            [](std::ostream& out)
                { out << "Hello, subsubmenu world\n"; },
            "Print hello world in the sub-submenu" );
    subMenu -> Add( std::move(subSubMenu) );
 
    rootMenu -> Add( std::move(subMenu) );
 
    Cli cli( std::move(rootMenu) );
    // global exit action
    cli.ExitAction( [](auto& out)
        { out << "Goodbye and thanks for all the fish.\n"; } );
 
    boost::asio::io_service ios;
 
    // setup local session (gives an application prompt)
 
    CliLocalTerminalSession localSession(cli, ios, std::cout, 200);
    localSession.ExitAction(
        [&ios](auto& out) // session exit action
        {
            out << "Closing App...\n";
            ios.stop();
        }
    );
 
    // setup server (for telnet sessions on port 5000)
 
    CliTelnetServer server(ios, 5000, cli);
    // exit action for all the connections
    server.ExitAction( [](auto& out) 
        { out << "Terminating this session...\n"; } );
    ios.run();
 
    return 0;
}

If you don’t need the remote console and a synchronous application is enough, you can simplify the setup and get rid of Boost.Asio altogether:


#include "cli/cli.h"
#include "cli/clifilesession.h"
 
using namespace cli;
using namespace std;
 
 
int main()
{
    // setup cli
 
    auto rootMenu = make_unique< Menu >( "cli" );
    rootMenu -> Add(
            "hello",
            [](std::ostream& out){ out << "Hello, world\n"; },
            "Print hello world" );
    rootMenu -> Add(
            "hello_everysession",
            [](std::ostream&)
                { Cli::cout() << "Hello, everybody" << std::endl; },
            "Print hello everybody on all open sessions" );
    rootMenu -> Add(
            "answer",
            [](int x, std::ostream& out)
                { out << "The answer is: " << x << "\n"; },
            "Print the answer to Life, the Universe and Everything");
    rootMenu -> Add(
            "color",
            [](std::ostream& out)
                { out << "Colors ON\n"; SetColor(); },
            "Enable colors in the cli" );
    rootMenu -> Add(
            "nocolor",
            [](std::ostream& out)
                { out << "Colors OFF\n"; SetNoColor(); },
            "Disable colors in the cli" );
 
    auto subMenu = make_unique< Menu >( "sub" );
    subMenu -> Add(
            "hello",
            [](std::ostream& out)
                { out << "Hello, submenu world\n"; },
            "Print hello world in the submenu" );
    subMenu -> Add(
            "demo",
            [](std::ostream& out){ out << "This is a sample!\n"; },
            "Print a demo string" );
 
    auto subSubMenu = make_unique< Menu >( "subsub" );
    subSubMenu -> Add(
            "hello",
            [](std::ostream& out)
                { out << "Hello, subsubmenu world\n"; },
            "Print hello world in the sub-submenu" );
    subMenu -> Add( std::move(subSubMenu) );
 
    rootMenu -> Add( std::move(subMenu) );
 
 
    Cli cli( std::move(rootMenu) );
    // global exit action
    cli.ExitAction( [](auto& out)
        { out << "Goodbye and thanks for all the fish.\n"; } );
 
    CliFileSession input(cli);
    input.Start();
 
    return 0;
}

Just for Embedded Applications?

Of course not. The cli library has several uses beyond the remote supervision of embedded software: it can serve as a console for UI-less applications (game servers, for example), as a way to configure network devices, or as a flexible tool to test library code.

You can put a CLI in basically every application: sooner or later, it will come in handy.