
Relaxed Semantics: d-CBO and Elastic 2D Designs

An out-of-order relaxed data structure is one where some operations may deviate from the strict sequential semantics. For example, a k-out-of-order FIFO queue only requires its dequeues to return one of the k + 1 oldest items, instead of the oldest one. Relaxation is attractive because it can lead to significantly faster data structures, especially in concurrent settings. In essence, one can trade ordering quality for performance.

This repository currently covers a majority of the state-of-the-art out-of-order FIFO queues, as well as some stacks and counters, together with a selection of non-relaxed implementations to compare against. It covers both k-out-of-order data structures (including ones with elastic relaxation), where the 2D designs are state-of-the-art, and data structures with randomized out-of-order relaxation, where the d-CBO designs, based on the choice-of-two technique, perform best.

This repository is originally based on the ASCYLIB framework, and memory is managed using SSMEM, which is a simple object-based memory allocator with epoch-based garbage collection.

Related Publications

Designs

The src folder includes the data structure implementations. Each implementation has a README.md file with additional information about its origin and author.

d-Choice Balanced Operations (d-CBO) Queues

These relaxed queues use d-choice load balancing to distribute operations across sub-queues in a way that achieves low relaxation errors. The d-CBO queues balance operation counts and are introduced in the PPoPP'25 paper Balanced Allocations over Efficient Queues. All d-CBO implementations can also be compiled to d-CBL variants that instead balance the sub-queue lengths, as done by the d-RA queue from the earlier paper Fast and Scalable, Lock-free k-FIFO Queues. There are also Simple d-CBO implementations, which use external operation counters and give up empty-linearizability in order to be completely generic over the sub-queue selection.

Static 2D Designs

These designs are described at a high level in the DISC paper and form the foundation of the 2D framework. They have received some optimizations in conjunction with later publications.

Elastic 2D Designs

These designs extend the 2D stack and queue to encompass elastic relaxation, meaning that their degree of relaxation can be changed (either manually or by a dynamic controller) at runtime. They are described in the upcoming Euro-Par paper.

Additional Relaxed Designs

These are implementations of other relaxed data structures. There are also a few additional ones in ./src/.

External Strict Designs

These are implementations, or copies, of external non-relaxed data structures which can be used as baselines when evaluating relaxed designs.

Usage

Simply clone the repository and run make to compile all implementations, using their default tests and switches. You can then find and run the respective data structure benchmark binaries in bin/.
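
For example, a minimal first session might look like the following sketch (the clone URL and directory are placeholders, and 2Dc-queue is just one of the available binaries):

git clone <repository-url>
cd <repository-directory>
make
./bin/2Dc-queue -d 1000 -n 8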

The default benchmark is a synthetic test where each thread repeatedly flips a coin to either insert or remove an item. The binary takes several CLI flags, all of which are described by running it with -h, e.g. make 2Dc-queue && ./bin/2Dc-queue -h. However, the most important arguments to get started are the following (an example invocation follows the list):

  • -d: The duration in ms to run the experiment,
  • -i: The number of items to insert before starting the benchmark,
  • -n: The number of threads to use.
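
For example, the following sketch (the values are arbitrary and only meant as a starting point) runs the 2D queue benchmark for two seconds on 16 threads with 1024 items pre-inserted:

./bin/2Dc-queue -d 2000 -i 1024 -n 16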

For the 2D data structures, you might want to start by adjusting the following parameters, which together control the relaxation bound (an example follows the list):

  • -l: The depth of the 2D window,
  • -w: The width of the 2D window.
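
For example, a sketch with a deeper and wider window (the specific values are arbitrary; how they translate into a relaxation bound is described in the 2D papers):

./bin/2Dc-queue -d 2000 -i 1024 -n 16 -l 64 -w 8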

For the d-CBO queues, you similarly adjust the width (as with most relaxed designs) and additionally control the sample size (an example follows the list):

  • -w: The number of sub-queues,
  • -c: The d in the name, specifies the number of sub-queues to sample for each operation.
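
For example (the binary name below is a placeholder; check bin/ after compiling for the exact d-CBO queue names), running with 64 sub-queues and d = 2:

./bin/<d-CBO-queue-binary> -d 2000 -i 1024 -n 16 -w 64 -c 2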

Prerequisites

The code is designed to run on Linux on x86-64 machines, such as those from Intel or AMD. This is partly due to the memory ordering assumed from the processor, and partly due to the use of 128-bit compare-and-swap instructions in some data structures. Even if the code runs on other architectures, some relaxation bounds will likely not hold, due to additional possible reorderings.

Furthermore, you need gcc and make to compile the tests. To run the helper scripts in scripts/, you need bc and Python 3. Run the following command to install the required Python packages:

pip3 install numpy==1.26.3 matplotlib==3.8.2 scipy==1.12.0
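
The compilers and other tools can be installed from your distribution's package manager; on Debian or Ubuntu (an assumption, package names differ on other distributions), for example:

sudo apt-get install build-essential bc python3 python3-pip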

Docker environment

There is a Dockerfile set up with the required packages for running the benchmarks, as well as plotting the results with the helper scripts. After setting up Docker, you can build the image by running

docker build -t relax-benchmarks .

Then you can either run the container directly to recreate the figures from the latest paper (the output ends up in results/)

docker run --rm --hostname=example-pinning -v ./results:/app/results relax-benchmarks

or enter it interactively to run whatever tests you want

docker run -it --rm --hostname=example-pinning -v ./results:/app/results relax-benchmarks bash

Thread pinning is important for the results, and these benchmarks use the hostname to determine the order in which threads are pinned. Here we set the hostname to example-pinning, as there is a thread pinning configuration set up for that name which often performs fine. However, it is recommended to follow the last section of this README to set up a better thread pinning configuration for your machine.
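
Once a pinning configuration for your own machine has been added (see the Thread pinning section below), a hedged sketch is to pass your machine's hostname instead, assuming the container selects the pinning configuration from the hostname in the same way as the example-pinning setup:

docker run -it --rm --hostname=$(uname -n) -v ./results:/app/results relax-benchmarks bash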

Recreating paper plots

We have included scripts to re-run the experiments used in the recent publications. You might have to adjust parameters in the scripts, such as the number of threads, and configure thread pinning as described below. Once that is done, the main differences from the plots in the papers will be cosmetic and caused by differences in hardware.

These scripts are available in ./scripts/ and will output their plots and results into the results folder when done.

Compilation details

Either navigate to a data structure directory and run make, or run make <data structure name> from the top level, which compiles the data structure tests with the default settings. You can further set environment variables, such as make VERSION=O3 GC=1 INIT=one, to modify the compilation. For all possible compilation switches, see ./common/Makefile.common as well as the individual Makefile for each test. Here are the most common ones, followed by a couple of example invocations:

  • VERSION defines the optimisation level, e.g. VERSION=O3. It takes one of the following values:
    • DEBUG compile with optimisation zero and debug flags
    • SYMBOL compile with optimisation level three and -g
    • O0 compile with optimisation level zero
    • O1 compile with optimisation level one
    • O2 compile with optimisation level two
    • O3 compile with optimisation level three (Default)
    • O4 compile with optimisation level three, and without any asserts
  • GC defines if deleted nodes should be recycled GC=1 (Default) or not GC=0.
  • INIT defines whether data structure initialization should be performed by all active threads (INIT=all, the default) or by a single thread (INIT=one).
  • RELAXATION_ANALYSIS can be set for relaxed designs to measure the relaxation errors of an execution. There are two methods, and not every design supports both:
    • LOCK measures the relaxation by encapsulating every linearization in a lock, calculating the error exactly at the cost of measuring an execution with essentially no parallelism. Good for validating hard upper bounds, such as those of the 2D data structures.
    • TIMER measures the relaxation by approximately timestamping every operation. It has only a small effect on the execution profile, but cannot be used for worst-case measurements due to the approximate nature of the timestamps.
  • TEST can be used to change the benchmark. It has been used, for example, with the d-CBO queues to test a BFS graph traversal, and with the elastic data structures to test dynamic scenarios. Further switches can be seen in the individual Makefile of each data structure.
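
For example, the following sketches combine some of these switches (the exact RELAXATION_ANALYSIS value is an assumption based on the two methods named above; check the individual Makefiles for the precise spelling):

make 2Dc-queue VERSION=DEBUG GC=0
make 2Dc-queue RELAXATION_ANALYSIS=TIMER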

Directory description

  • src/: Contains the data structures' source code.
  • scripts/: Contains supporting scripts, such as ones aggregating several benchmark runs into plots.
  • results/: Default output folder for the scripts in scripts/.
  • include/: Contains support files, e.g. the basic window framework (2Dc-window.c).
  • common/: Contains shared make definition files.
  • external/: Contains external libraries such as the ssmem library.
  • bin/: Contains the binary files for the compiled data structure benchmarks.

Thread pinning

All tests will pin each software pthread to a hardware thread, using the function pthread_setaffinity_np. By default, the threads are pinned sequentially to hardware threads 0, 1, 2, 3, and so on. However, the numbering of the hardware threads depends on the specific CPU used, and you might often want to use a different order than the default one. For example, on a dual-socket Intel Xeon E5-2695 v4 system, the even hardware threads are on the first socket and the odd ones on the second, so the default order spreads consecutive threads across both sockets, which you may not want if you intend to benchmark intra-socket behavior.

Here is a short step-by-step instruction for how to add a machine-specific pinning order:

  • First, look at the output of e.g. lscpu and lstopo (here we care about the P#<...> numbers) to understand the hardware topology.
  • Then add an entry for your machine in common/Makefile.common. For example, copy the one for the athena machine (starting with ifeq ($(PC_NAME), athena)...), but change athena in that line to the name of your computer (see the output of uname -n), and change ATHENA in -DATHENA to a similar identifier for your computer (to be used in include/utils.h).
  • Finally, add a matching entry for that identifier in include/utils.h. This entry primarily defines the order in which the software threads are pinned to hardware threads. Here again, you can look at the entry for ATHENA for inspiration. There are three memory layouts, but the default one (the bottom-most) is the most important to add; it should pin one thread to each core in a socket before continuing with SMT threads, and only then proceed to the next socket.
    • For a simpler example, see e.g. ITHACA.

Now all tests will use this pinning order. You can validate a pinning order by running a test that does not allocate all hardware threads and inspecting the output of htop during the run.
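
For example (a sketch with arbitrary values), run a benchmark on only a few of the hardware threads and watch in htop which ones become busy:

./bin/2Dc-queue -d 10000 -n 4 &
htop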