Welcome

I find compilers, programming languages and performance really interesting.

My personal notes and blog posts are enumerated on the left, or you can use the search bar at the top.

I work on the NVHPC C, C++ and Fortran compilers at NVIDIA.



These views do not in any way represent those of NVIDIA or any other organization or institution that I am professionally associated with. These views are entirely my own.

Debugging in Parallel

Let’s say you have a compiler pass that performs some transformation. You check in a change to the compiler…

void my_pass(ast_t *ast) {
    // perform some transformation
}

…and you get a bug report back.

Something is broken, but it only shows up in a huge translation unit, and your pass runs thousands of times. How do you reduce the problem?

Constrain the Problem

I split the code from before and after my patch into two new functions, and have the original function dispatch to one or the other based on an environment variable:

void old_pass(ast_t *ast) { /* ... */ }
void new_pass(ast_t *ast) { /* ... */ }

void my_pass(ast_t *ast) {
    static int num_calls = 1;
    const char *env = std::getenv("USE_NEW_CODE");
    fprintf(stderr, "num_calls=%d\n", num_calls);

    // Fall back to the old code path when the variable is unset.
    if (env && num_calls == std::atoi(env)) {
        new_pass(ast);
    } else {
        old_pass(ast);
    }

    num_calls++;
}

I can then change from the command line which code path the compiler will use when my pass is run:

# Disable my new change
USE_NEW_CODE=0 build/bin/clang ./bug.c -O3

# Enable my new change only on the first call
USE_NEW_CODE=1 build/bin/clang ./bug.c -O3

Tip

Rather than using environment variables, the same can be accomplished with LLVM’s cl::opt command line options. opt has the command line flag -opt-bisect-limit=<limit> for bisecting LLVM passes, and you can do the same thing in your own pass.
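A minimal sketch of such an option, assuming your pass is built inside the LLVM tree (the flag name here is hypothetical):

#include "llvm/Support/CommandLine.h"

// Hypothetical flag: the call index at which the new code path is enabled.
static llvm::cl::opt<int> UseNewCode(
    "use-new-code",
    llvm::cl::desc("Enable the new code path on the Nth run of my pass"),
    llvm::cl::init(0));

// Inside the pass, gate the transformation just like the environment
// variable version above:
//
//   if (++num_calls == UseNewCode)
//     new_pass(ast);
//   else
//     old_pass(ast);

You can then pass the flag directly (e.g. via -mllvm when driving the pass through clang) instead of exporting an environment variable.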

If we then turn this build and run step into a script that runs in a temporary directory, we’re almost ready to test the entire problem space in parallel:

$ cat >test.sh <<'EOD'
#!/usr/bin/env bash
start=$PWD
pushd $(mktemp -d)
USE_NEW_CODE=$1 $start/build/bin/clang $start/bug.c -O3
./a.out && echo $* pass || echo $* fail
EOD
$ chmod +x test.sh

Now, we could run this script from 1 up to the number of times your optimization kicks in for the failing test case, but we can do better. We can use GNU parallel1 to test the entire problem space on all our cores:

$ mkdir -p logs
$ seq 1000 | parallel --bar -j `nproc` ./test.sh {} '&>' logs/{}.txt
100% 1000:0=0s 1000
$ grep -r fail logs/
logs/314.txt:314 fail
logs/501.txt:501 fail

This gives you every individual instance in which your new code caused the failure (in case there were multiple). If no single instance reproduces the failure on its own, you can instead bisect by enabling your new code for a range of calls, using a minimum and a maximum instead of a single index.
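A sketch of what that range check might look like, using hypothetical USE_NEW_MIN and USE_NEW_MAX environment variables:

void my_pass(ast_t *ast) {
    static int num_calls = 1;
    const char *min_env = std::getenv("USE_NEW_MIN");
    const char *max_env = std::getenv("USE_NEW_MAX");
    int lo = min_env ? std::atoi(min_env) : 0;
    int hi = max_env ? std::atoi(max_env) : 0;

    // Enable the new code path only for calls inside [lo, hi].
    if (num_calls >= lo && num_calls <= hi) {
        new_pass(ast);
    } else {
        old_pass(ast);
    }

    num_calls++;
}

Halving the range on each run narrows the failure down in a logarithmic number of runs instead of one run per call.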

Tip

creduce, along with LLVM’s bugpoint and llvm-reduce, can also be helpful, but not when your test application is segfaulting: creduce tends to create all kinds of unrelated segfaults, and it works best when the failure produces some specific output you can grep for.
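When the failure does produce a distinctive diagnostic, the interestingness test you hand to creduce can gate on that exact text rather than on any crash. A sketch (the assertion message here is made up):

#!/usr/bin/env bash
# Interesting (exit 0) only if the compiler emits the specific assertion
# we are chasing, not just any segfault or error.
build/bin/clang -O3 -c bug.c 2>&1 | grep -q 'Assertion `isValid()` failed'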

Linux Performance Analysis

Perf analysis is super interesting to me - why does an application run faster or slower under certain conditions? Why does one compiler (or compiler switch) produce a faster application than any other? I want to know what tricks my compiler is doing to speed up my app.

This post is an example performance analysis of an application called POV-Ray. I explain my benchmark choice in the section on POV-Ray.

Approaches

There are two ways I think about approaching performance analysis: a top-down approach and a bottom-up approach. I use perf for both of these approaches, so we’ll start with an overview of perf and then apply these approaches to povray.

Key Terms

Top-down approach: look at the application starting at the root of the call stack. What does main() look like? What is the application doing at an extremely high level?

Bottom-up approach: Look at the fine-grained details of the application. What instructions are being executed? Is the application memory-, network-, or compute-bound? Where are these instructions coming from in the source?

The Linux Perf Tool

So how do we see into the guts of this app as it’s running?

IMO the best place to start (and often finish) is with the perf tool1. Perf is part of the Linux kernel project, so it’s supported on all Linux platforms.

If you don’t already have it, you can probably install it from your package manager as linux-tools-common:

sudo apt install linux-tools-common linux-tools-`uname -r`

Perf has lots of commands, but the main two you’ll need to interact with are perf-record and perf-report. The workflow is generally:

; perf stat -- ./a.out

# This leaves the recorded data in ./perf.data
; perf record -- ./a.out
; perf report

Perf report helps you drill into the call stack to see where samples were recorded in the application, even down to the assembly instructions that corresponded to samples.

Perf Events and Perf List

Note that in the previous section I said perf report helps you view where samples were recorded and not where time was spent; perf watches for events and takes periodic samples of what’s happening on the system when it wakes up. These samples do not necessarily indicate where user-time is being spent.

Depending on your system, kernel configuration, and the configuration of perf itself, you’ll have different events available to profile.

Run perf list2 to get a view of all the sampling events you can use on your system:

; perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
...

The list of samplable events is rather long and often has architecture- and CPU-specific entries, so I’ll leave it as an exercise for the reader to see what perf events are available on your system, and to learn what they all mean.

The -F flag tells perf what observation frequency it should use when recording samples - often -F 99 (for 99 hertz) is a good place to start; you get enough data to gain insights without being overwhelmed. You can always turn it down for longer-running applications or when you’re sampling many different events.
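For example, a 99 Hz cycles recording might look like this (the same flags show up in the fuller perf record command later in this post):

; perf record -F 99 -e cycles -- ./a.out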

Perf Stat

The best place to start with perf is often perf stat. This command gives a brief overview of total samples of events. If something from perf stat’s report stands out, you can use perf record with that event to drill into the sources of those samples.

A perf stat run might look like this:

; perf stat -- ./a.out
 Performance counter stats for './a.out':

         21,829.89 msec task-clock                #    0.963 CPUs utilized          
             7,097      context-switches          #  325.105 /sec                   
                 1      cpu-migrations            #    0.046 /sec                   
             5,062      page-faults               #  231.884 /sec                   
    70,001,621,188      cycles                    #    3.207 GHz                    
   155,086,020,805      instructions              #    2.22  insn per cycle         
     9,013,464,722      branches                  #  412.896 M/sec                  
        49,795,347      branch-misses             #    0.55% of all branches        

      22.661088635 seconds time elapsed

      21.785643000 seconds user
       0.051956000 seconds sys

Perf Record

perf record is the primary command for recording samples about your application or system.

My perf record commands usually look like this:

; export \
    APP=./a.out \
    FREQ=99 \
    EVENTS="cycles,instructions,branches,loads,task-clock"
; perf record \
    --output perf-$APP.data \
    --call-graph fp \
    -F $FREQ -e $EVENTS \
    -- taskset 0x2 ./a.out >/dev/null

I’m using --call-graph fp because I want perf to record callgraph information using the frame pointer - this is why you must often build your application with the -fno-omit-frame-pointer compiler flag (more on that later).

I’m also using taskset 0x2 because I only want the app to run on a single core in this example; perf can also record data for everything running on your entire system if you would like it to - or just on a specific core or for a specific application.

Perf Report

perf report will give you a TUI report like this by default:

Samples: 88K of event 'cycles', Event count (approx.): 72137516526
  Children      Self  Command  Shared Object              Symbol
+   99.61%     0.00%  povray   libboost_thread.so.1.74.0  [.] 0x00007f61e2d6f0cb
+   99.54%     0.00%  povray   povray                     [.] pov::Task::TaskThread
+   97.41%     0.03%  povray   povray                     [.] pov::Trace::ComputeTextureColour
+   97.40%     0.06%  povray   povray                     [.] pov::Trace::ComputeOneTextureColour
...

Notice the event used for the report is given in the first line.

perf report --stdio gives the same information initially, but with all the call stacks expanded, which can get overwhelming. For the 20-second recording I took for this example, the stdio output of perf report was over 10k lines long:

; perf report --stdio|wc -l
10010

From inside the TUI you can press h to get a list of all the available commands, so I won’t enumerate them here.

I usually run perf report with the -G flag, which is shorthand for --inverted, meaning the callgraph representation is inverted.

You may have noticed that the snippet from perf report I pasted above starts with two columns: Self and Children.

The Self and Children columns

The Children column indicates the percentage of samples taken in that stack frame or any of its children - meaning any samples recorded while in this stack frame or in any function called from it.

The Self column is more significant: it indicates what percentage of samples were taken in the given stack frame only - meaning instructions coming from that function alone, and not any functions it calls.

The main() function is always at the top, since it calls all other functions. However, unless your entire program was inlined into the main routine, its Self column is likely very low, since most of the work is probably happening elsewhere.

FlameGraph

I mention Brendan Gregg3 a few times in this post, and you should get familiar with him and his work. His blog has many pearls and he might have a one-liner for exactly your use case.

One of his other contributions is the FlameGraph repo4.

Remember how our perf report contains over 10k lines of reporting for just a single application running for ~20 seconds? His flamegraph repo gives us a way to visualize and gain insights from all of that data at a very high level by creating a flamegraph from perf’s recorded data.

Note

The FlameGraph repo actually knows how to deal with other profilers too, like DTrace and SystemTap.

A workflow for generating a flamegraph might look like this:

# build and profile your application
; make
; perf record --call-graph fp -- ./a.out

; git clone https://github.com/brendangregg/FlameGraph ../FlameGraph

; perf script \
    | ../FlameGraph/stackcollapse-perf.pl \
    | ../FlameGraph/flamegraph.pl \
    > flamegraph.svg

Note

The FlameGraph scripts have actually been merged into the Linux kernel’s repo, so perf built for a newer kernel has FlameGraph as a built-in script, used like so:

; perf script flamegraph -- ./a.out

# alternatively...
; perf record -- ./a.out
; perf script report flamegraph

This requires python scripting support built into perf, which my perf build does not have, so I can’t test it myself. I still use the scripts from Brendan’s repo.

POV-Ray

POV-Ray5 is a 3D graphics program commonly used for benchmarking - it’s part of CPU benchmarking suites from OpenBenchmarking6 and SPEC 20177, which means a few things:

  1. It’s reasonably well-optimized.

    Compiler writers and hardware vendors don’t care too much about benchmarking silly code that doesn’t represent what users will actually be running.

  2. It’s cross-platform

    Part of its utility is that we can compare performance across hardware vendors.

  3. It’s well-supported by most/all compilers

    Compiler authors and hardware vendors care about how well POV-Ray runs on their tech, so we can assume they’ve put effort into handling POV-Ray’s code well and ensuring it builds with their compilers.

  4. It doesn’t rely too much on libraries.

    OpenBenchmarking and SPEC suites are especially useful for benchmarking because they are mostly self-contained.

Building POV-Ray

POV-Ray is open source, so we can download it and build it ourselves:

; git clone --branch latest-stable git@github.com:POV-Ray/povray.git
; cd povray
; (cd unix; ./preinstall.sh)

We will build the app with some debug information enabled so we have more visibility into the app’s behavior as it runs:

; ./configure \
    --disable-strip \
    --prefix=$PWD/../povray-gcc-12/ \
    COMPILED_BY="Asher Mancinelli on $(date)" \
    CFLAGS='-fno-omit-frame-pointer' CXXFLAGS='-fno-omit-frame-pointer' \
    CC=gcc-12 CXX=g++-12
; ./unix/povray --version |& grep flags
  Compiler flags:      -pipe -Wno-multichar -Wno-write-strings -fno-enforce-eh-specs -Wno-non-template-friend -g -pg -O3 -ffast-math -march=native -fno-omit-frame-pointer

Frame Pointer

You’ll notice I used the unfortunately-named -fno-omit-frame-pointer. This tells the compiler to maintain the frame pointer in the frame pointer register (rbp on x86_64 systems); the compiler might otherwise reuse that register as a general-purpose register, but we’re going to tell the perf tool to use the frame pointer for building its analyses, so we need to keep it around.

Once we have the app built, we can run the standard benchmark (this takes a while):

; make -j `nproc` install
; ./unix/povray --benchmark </dev/null
...
Render Options
  Quality:  9
  Bounding boxes.......On   Bounding threshold: 3
  Antialiasing.........On  (Method 1, Threshold 0.300, Depth 3, Jitter 0.30,
 Gamma 2.50)
==== [Rendering...] ========================================================
Rendered 15360 of 262144 pixels (5%)

Further Reading

Truly, read the manpages. The perf man pages could be more thorough and some commands are not exceptionally well-documented (looking at you, perf diff), but they are invaluable resources.

Search for Brendan Gregg on YouTube, he has plenty of great talks there. For example: Give me 15 minutes and I’ll change your view of Linux tracing

References

Values Statement

TL;DR

These values motivate and inform how I work.

They are values, meaning I aspire to live up to them and they are at times in tension with each other.

They are: curiosity, honesty, rigor, communication, empathy.

Curiosity

  • I choose to take an orientation of curiosity toward any problem I work on, replacing feelings of frustration, anger, and disappointment with curiosity when possible.

Honesty

  • My coworkers should expect everything I say to be my complete and honest understanding.
  • I ask when I don’t understand something.
  • When I present a solution, I point out what I do and don’t understand. I am not afraid to be wrong or change my opinion, publicly.
  • This includes times when honesty is uncomfortable or inconvenient, including when I have made a mistake.

Rigor

  • Any artifact of my work should clearly demonstrate rigor and thoughtfulness.

  • When describing issues or summarizing an investigation, I should include the possible solutions that I see and give a recommendation. If someone else must read my report, come up with the possible solutions, and give a recommendation themselves, then that report was unfinished. Whoever assigns me a problem should expect a rigorous investigation, a range of possible solutions, and a preferred recommendation.

  • Rigor implies good engineering. I write quality code.

Effective Communication

  • Effective written and spoken communication must be a core competency. Emails, comments, code review, personal messages and official documents convey the appropriate tone and level of detail.
  • Comments in software are the guideposts for future developers, most importantly myself.
  • When speaking, I am either prepared to speak or explain that I will speak more at a later time when I am prepared.

Empathy

  • In any communication, I consider how it will be received, who is receiving it, and how they may feel about it.

  • The code I write considers the user and the next developer to read or modify it. Have empathy for the developer who maintains your code, because it will most likely be you.

  • This includes writing tools for tasks that others also perform. Share the value of your work with others and deduplicate where possible. Make others’ lives better.


What I need to remember when scripting

Shell/Scripting

GNU Parallel

  seq 200 | parallel --bar ./t.sh {} '>' ajm-{}.txt

Used with some test script like this to find corner cases:

  $ cat t.sh
  init=$PWD
  cd $(mktemp -d)
  echo $1
  $init/exe $1 &> stdout
  diff stdout $init/correct && echo pass

This way I can run the executable N times and then recursively grep the output directory to find any cases that passed/failed.

Bash REGEX

Use $BASH_REMATCH or ${BASH_REMATCH[@]} with the regex match operator in a test expression:

$ [[ " -c -fPIC /path/to/file.F90 " =~ [^\ ]+/file.F90 ]] && echo ${BASH_REMATCH[@]}
/path/to/file.F90

more bash regex stuff

CLI

exe=$0
usage() {
    cat <<EOD
usage: $exe [-p port] [-h host] -- command
EOD
    exit 1
}

port=6666
cmd=""
host=""
while [[ $# -ne 0 ]]; do
  case $1 in
    -p) port=$2; shift;;
    -h) host=$2; shift;;
    --) shift;cmd="$*";shift $#;;
    *) usage;;
  esac
  shift
done

Strings

global sub

  $ var="some text to work with, more text"
  $ echo ${var//text/content}
  some content to work with, more content

previous command with some replacement

  $ echo one two three one
  one two three one
  $ !!:gs/one/five
  $ echo five two three five
  five two three five

local sub

  $ echo ${var/text/content}
  some content to work with, more text

%% deletes the LONGEST match - be careful with ./file.txt, since ${file%%.*} matches starting from the leading ./ and removes the whole thing.

  $ file=log.txt
  $ echo ${file%%.*}
  log

  $ file=./log.txt
  $ echo ${file%.*}
  ./log

string casing

  x="HELLO"
  echo $x  # HELLO

tolower

  y=${x,,}
  echo $y  # hello

toupper

  z=${y^^}
  echo $z  # HELLO

Linux documentation on string manipulation

Perl

One-Liners

[Other perl one-liners](https://learnbyexample.github.io/learn_perl_oneliners/one-liner-introduction.html)

print line number and line of matching regex

perl -ne 'printf "%8d %s", $., $_ if /pattern/'

in-place replacement of input

perl -pi.bk -E 's/, !dbg !\d+//' good.llvm
      |      |                 | input file
      |      | regex/code/replacement
      | backup file extension to use

Getopt


Misc

Print the command to be run, run the arguments in a shell, and print all the lines of stderr/stdout:

use feature 'say';

sub sh {
    my $cmd = "@_";
    chomp($cmd);
    say $cmd;
    # Merge stderr into stdout so both streams are printed
    open my $fh, '-|', "$cmd 2>&1" or die "failed to run '$cmd': $!";
    print while (<$fh>);
    close $fh;
    say "ec=$?";
}

Editors and Tools

Editor

My editing workflow starts with sshing into a box and reattaching to a tmux session:

ssh myhost
tmux attach

I have an ssh alias set up to forward certain ports as well:

$ cat ~/.ssh/config
Host myhost
  User ashermancinelli
  HostName myhost.domain.com
  LocalForward 6666 localhost:6666
  LocalForward 6667 localhost:6667
  LocalForward 6668 localhost:6668
  LocalForward 6669 localhost:6669

From inside the tmux session, I usually have several sessions each with a few shell windows and a neovim server running:

nvim --headless --listen localhost:6666

Then I can attach my editor1 to the remote editing session from my local terminal:

neovide --no-multigrid --fork --server localhost:6666

Tools

I use Spack2 for nearly everything.

I have a set of tools I need on every machine, which I install with Spack. At the time of writing, this is the list:

  • ripgrep
  • the-silver-searcher
  • bat
  • fzf
  • fd
  • rlwrap
  • neovim@master
  • lua
  • vim
  • perl
  • python
  • py-pip

I install all of these like so:

spack \
  --config concretizer:targets:granularity:generic \
  install -j `nproc` \
  ripgrep the-silver-searcher bat fzf fd rlwrap neovim@master lua vim perl python py-pip \
  --fresh

Setting the target granularity to generic means I can usually install once for each architecture in the cluster I’m working on, however sometimes OS incompatibilities mean I need to reinstall a few times.

From a script, I load all the default packages I need by just choosing the first one that matches my current generic architecture.

garch=`spack arch --platform`-`spack arch --operating-system`-`spack arch --generic-target`
for pkg in ${packages[@]}; do eval `spack load --sh --first $pkg arch=$garch`; done

The better way to do this would be to use environments3, but installing them as individual packages makes it easier to reuse all my scripts for different architectures and operating systems rather than creating environments for each. I can just update my list of default packages and reinstall as-needed without activating an environment, reconcretizing and reinstalling every time.

Scattered notes from learning about the implementation of VLA.

What is VLA?

Variable-length arrays are dynamic, stack-allocated arrays. The compiler needs to increase the stack size in the current stack frame to allocate enough space for the array. Assuming negative stack-growth like on x86, the compiler will decrease the stack pointer sufficiently to store the array.

This is almost identical to alloca. Both alloca and VLAs are essentially primitives to modify the stack pointer.

Eg:

  // Subtracts N * sizeof(int) from the current stack pointer and returns the new sp
  int *curr_sp = alloca(N * sizeof(int));

  // equivalent to
  int curr_sp[N];

One key difference between the two:

Quote

The memory alloca() returns is valid as long as the current function persists. The lifetime of the memory occupied by a VLA is valid as long as the VLA’s identifier remains in scope. You can alloca memory in a loop for example and use the memory outside the loop, a VLA would be gone because the identifier goes out of scope when the loop terminates.
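A minimal sketch of that difference (compiled as C99; the pointer use after the loop exists only to illustrate the lifetimes):

#include <alloca.h>

void lifetimes(int n) {
    char *p = 0;
    for (int i = 0; i < n; i++) {
        p = alloca(n);  // valid until lifetimes() returns (and stacks up each iteration)
        char vla[n];    // valid only while this iteration's scope is live
        vla[0] = 'a';
        p[0] = 'b';
    }
    // p still points at valid stack memory here, but every vla from the
    // loop is out of scope and must not be used.
    if (n > 0) p[0] = 'c';
}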

Memory Layout

Because the stack grows down on most platforms while arrays are addressed sequentially upwards, the address of the first element of a VLA (or the pointer returned by alloca) will be the value of the stack pointer after it has been decremented.

Element 0 of the array or alloca-allocated memory therefore sits immediately above the new stack pointer, and addresses increase sequentially until the end of the array. Accessing past the end of the array will run into previously declared stack variables.

When the function returns, the stack space will be available for subsequent function calls to use automatically, so there is no need to explicitly free memory allocated by VLA/alloca.

Examples

GCC Documentation

These arrays are declared like any other automatic arrays, but with a length that is not a constant expression. The storage is allocated at the point of declaration and deallocated when the block scope containing the declaration exits.

// ./vla <size>
#include <stdlib.h>

int main(int argc, char** argv) {
  int len = atoi(argv[1]);
  int array[len];
  for (int i=0; i<len; i++)
    array[i] = i;
}

Declaring the array decrements the stack pointer enough to provide memory for the array:

{% include vla/inspect-stack.c %}
$ uname -a
Linux carbon 5.15.0-71-generic #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ gcc inspect-stack-vla.c && LEN=10 IDX=4 ./a.out
&vla[0]: 140737151458112
before: 140737151458160
after: 140737151458112
diff: 48

Notice that the address stored in the stack pointer after declaring the VLA is the same as the address of the first element of the VLA, as described above.

alloca

Instead of declaring a VLA array, we can create a pointer to memory allocated by alloca to produce the same effect:

{% include vla/inspect-stack-alloca.c %}
$ gcc inspect-stack-alloca.c && LEN=10 IDX=4 ./a.out
&vla[0]: 140728646054592
before: 140728646054640
after: 140728646054592
diff: 48

Compare the GCC docs for alloca with that of variable length arrays and notice the similarities:

GCC Documentation

The function alloca supports a kind of half-dynamic allocation in which blocks are allocated dynamically but freed automatically.

Allocating a block with alloca is an explicit action; you can allocate as many blocks as you wish, and compute the size at run time. But all the blocks are freed when you exit the function that alloca was called from, just as if they were automatic variables declared in that function. There is no way to free the space explicitly.

Why Might This Be a Bad Idea?

The dynamic nature of VLAs means that the offsets, within the function’s stack frame, of stack variables declared after the VLA are also dynamic - which means the function will need extra instructions to calculate the addresses of these variables whenever they are referenced in the body of the function.

This may be a worthwhile tradeoff, but know that use of VLAs means your code may need a few extra instructions every time you use stack variables.

  1. GCC VLA docs
  2. GCC alloca docs
  3. LLVM IR docs for alloca instruction
  4. LLVM source for alloca instruction
  5. cppreference docs on VLA
  6. Buffer overflow and stack frame visualization

I solve a Leetcode problem in BQN and rant about the joy of programming.

Leetcode

The Leetcode problem is "Set Matrix Zeroes" where we're tasked with setting rows and columns of a matrix that contain zero to be all zeroes.

BQN Solution

   i←⟨
     3‿4⥊⟨0,1,3,0,3,4,5,2,1,3,1,5⟩
     3‿3⥊⟨1,1,1,1,0,1,1,1,1⟩
   ⟩

   Z ← {
     bm←0=𝕩
     a←∨` ∨`⌾⌽ bm
     b←(∨`˘) ((∨`˘)⌾(⌽˘)) bm
     𝕩×a¬∘∨b
   }
   
   ⟨"#1","#2"⟩∾i≍Z¨i
┌─                       
╵ "#1"        "#2"       
  ┌─          ┌─         
  ╵ 0 1 3 0   ╵ 1 1 1    
    3 4 5 2     1 0 1    
    1 3 1 5     1 1 1    
            ┘         ┘  
  ┌─          ┌─         
  ╵ 0 0 0 0   ╵ 1 0 1    
    0 4 5 0     0 0 0    
    0 3 1 0     1 0 1    
            ┘         ┘  
                        ┘

Some other solutions from the BQN Matrix chat room:

   ⊢×0≠∧˝˘∧⌜∧˝           # Marshall & Dzaima (tacit!)
   (≠⥊∧´)˘{𝕩×(𝔽⌾⍉∧𝔽)0≠𝕩} # Dzaima & Rampoina
   {𝕩×(∧˝˘∧≢⥊∧˝)0≠𝕩}     # Dzaima

On the Joy of Programming

It’s been a few months since I’ve written BQN or APL, so I feel like I’m looking at the language family with fresh eyes.

I was struck by the resemblance between solving this leetcode problem and creating art:

  1. I know I’m not the best at either, and many, many people can write more elegant BQN and more elegant poetry than I can (for example)
  2. I can thoroughly enjoy both when detached from any performance metrics - the process is far more valuable to me than the end-product
  3. both can be deeply social actions - sharing your painting with someone and discussing my solution with the BQN chat room are both social and exciting. Even if someone comes back with a more wonderful painting or more terse solution, I enjoy the social interaction just as much.

I stumbled upon this thread on twitter describing how Kurt Vonnegut responded to a letter from a high school English student asking for life advice. In short, his response was to do art and enjoy the process of becoming who you are.

Creating art seems to be central to the importance of life as far as I can tell.

The most recent episode of ArrayCast with Stevan Apter dipped into this as well when the panelists discussed the aesthetic of writing APL. In some ways they were a little reserved about saying they enjoy APL at least in part due to the aesthetic of the language. I don't think this is something to shy away from - if we can't appreciate the beauty of what we do, why are we doing it at all?

I loved working through this rather simple problem.

I loved the process of visualizing the inputs, of thinking through possible solutions while going about my day.

I loved taking my solution to the BQN forum for more gifted and practiced BQN-ers to find far simpler and more elegant solutions than mine.

The whole process felt like writing a poem, and at the end I’m rewarded by sharing this poem with others, seeing what they come up with, and comparing their thoughts with mine.

There is a unique joy and beauty I find in BQN (and APL more broadly), and that’s what keeps me coming back.

As Kurt Vonnegut pointed out, what else could be a more worthwhile way to spend my time?

Please, give it a try, and fall in love with the community while you’re at it.

I’ve found NixOS to provide a wonderful development environment after learning a bit of the Nix language - but building and hacking on LLVM on NixOS gave me some trouble. Hopefully reading this post will save you the trouble!

Build Environment

If you’d just like to see my config, go to the end of the post where I’ve pasted the whole thing

I’m no Nix language expert by any stretch of the imagination, nor am I an expert in dynamic linking or managing an operating system.

I started with the nix-shell example provided in the nix documentation here and made additions as I found the need.

I had to pass the library directories of both GCC and GLIBC using both the -B and -L flags, because some required object files (like crt1.o) must be found at link time, and clang/gcc don’t search LD_LIBRARY_PATH for these files. -B will tell the compilers to look in the provided paths for these files.

  libcFlags = [
    "-L ${stdenv.cc.libc}/lib"
    "-B ${stdenv.cc.libc}/lib"
  ];

  # The string version of just the gcc flags for NIX_LDFLAGS
  nixLd = lib.concatStringsSep " " [
    "-L ${gccForLibs}/lib"
    "-L ${gccForLibs}/lib/gcc/${targetPlatform.config}/${gccForLibs.version}"
  ];

  gccFlags = [
    "-B ${gccForLibs}/lib/gcc/${targetPlatform.config}/${gccForLibs.version}"
    "${nixLd}"
  ];

The official documentation uses LLVM_ENABLE_PROJECTS to enable runtimes, which is deprecated, so I first removed that in favor of a manual two-stage build for libc++ and libc++abi.

  # For building clang itself, we're just using the compiler wrapper and we
  # don't need to inject any flags of our own.
  cmakeFlags = lib.concatStringsSep " " [
    "-DGCC_INSTALL_PREFIX=${gcc}"
    "-DC_INCLUDE_DIRS=${stdenv.cc.libc.dev}/include"
    "-DCMAKE_BUILD_TYPE=Release"
    "-DCMAKE_INSTALL_PREFIX=${installdir}"
    "-DLLVM_INSTALL_TOOLCHAIN_ONLY=ON"
    "-DLLVM_ENABLE_PROJECTS=clang"
    "-DLLVM_TARGETS_TO_BUILD=X86"
  ];
  cmakeCmd = lib.concatStringsSep " " [
    "export CC=${stdenv.cc}/bin/gcc; export CXX=${stdenv.cc}/bin/g++;"
    "${cmakeCurses}/bin/cmake -B ${builddir} -S llvm"
    "${cmakeFlags}"
  ];

To build clang itself, I activate the nix shell and build only clang:

$ cd llvm-project
$ nix-shell
$ eval "$cmakeCmd"
$ make -C build -j `nproc`

I didn’t use LLVM_ENABLE_RUNTIMES since I had trouble passing the CMake arguments to the runtime builds through the top-level build. The purpose of LLVM_ENABLE_RUNTIMES is to build an LLVM project using the just-built clang/LLVM; however, compile and link arguments were not passed down to the runtime builds through the default CMAKE_CXX_FLAGS as I expected (or at least I was unable to get this to work).

Instead, I configured a separate set of CMake arguments for the runtimes, and manually passed the just-built clang compiler as CXX to the runtime builds like so:

  cmakeRuntimeFlags = lib.concatStringsSep " " [
    "-DCMAKE_CXX_FLAGS=\"${flags}\""
    "-DLIBCXX_TEST_COMPILER_FLAGS=\"${flags}\""
    "-DLIBCXX_TEST_LINKER_FLAGS=\"${flags}\""
    "-DLLVM_ENABLE_RUNTIMES='libcxx;libcxxabi'"
  ];
  cmakeRtCmd = lib.concatStringsSep " " [
    "export CC=${builddir}/bin/clang; export CXX=${builddir}/bin/clang++;"
    "${cmakeCurses}/bin/cmake -B ${builddir}-rt -S runtimes"
    "${cmakeRuntimeFlags}"
  ];
$ cd llvm-project
$ eval "$cmakeRtCmd"
$ make -C build-rt -j `nproc`

Testing, Linking, Running

A huge issue with testing arose from the way NixOS handles its dynamic loader.

When you use software in NixOS, it’s usually found somewhere in the /run/current-system/sw/ prefix, while most software expects to be run from /usr or /usr/local, or it expects to be able to find key libraries under those prefixes (eg libc.so).

Instead, each software component has its own prefix under /nix/store, for example:

$ which perl
/run/current-system/sw/bin/perl
$ file $(which perl)
/run/current-system/sw/bin/perl: symbolic link to 
    /nix/store/kpzx6f97583zbjyyd7b17rbv057l4vn2-perl-5.34.0/bin/perl

Each compiler must then know the correct locations of the standard libraries and software components, such as the dynamic loader and the standard C library. To accomplish this, Nix-provided compilers ship with wrappers that inject the required flags.

If I inspect my GNU compilers, we see that I’m not using g++ directly:

$ which g++
/nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0/bin/g++
$ file $(which g++)
/nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0/bin/g++:
  a /nix/store/v1d8l3zqnia3hccqd0701szhlx22g54z-bash-5.1-p8/bin/bash
  script, ASCII text executable

The actual compiler is found in a separate prefix:

$ grep 'g++' $(which g++) | head -n1
[[ "/nix/store/mrqrvina0lfgrvdzfyri7sw9vxy6pyms-gcc-10.3.0/bin/g++" = *++ ]] && isCxx=1 || isCxx=0

On my current system, the true compiler is found under /nix/store/mrqrvina0lfgrvdzfyri7sw9vxy6pyms-gcc-10.3.0.

This becomes an issue with testing LLVM because the tests run executables built with the just-built clang++, not with the wrapped compiler! That’s why we have to inject so many flags in our shell.nix.

When I initially ran make check-cxx to run the tests for libc++, I found errors like this:

FileNotFoundError: [Errno 2]
    No such file or directory:
    '/home/asher/workspace/llvm-project/brt/libcxx/test/std/containers/sequences/deque/deque.modifiers/Output/erase_iter_iter.pass.cpp.dir/t.tmp.exe'

It looks like the tests rely on an executable that’s not there. However, if I check that location:

$ file /home/asher/workspace/llvm-project/brt/libcxx/test/std/containers/sequences/deque/deque.modifiers/Output/erase_iter_iter.pass.cpp.dir/t.tmp.exe
/home/asher/workspace/llvm-project/brt/libcxx/test/std/containers/sequences/deque/deque.modifiers/Output/erase_iter_iter.pass.cpp.dir/t.tmp.exe:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, with debug_info, not stripped

the executable was built.

If I try to run it directly, it still appears to not exist:

$ /home/asher/workspace/llvm-project/brt/libcxx/test/std/containers/sequences/deque/deque.modifiers/Output/erase_iter_iter.pass.cpp.dir/t.tmp.exe
bash: /home/asher/workspace/llvm-project/brt/libcxx/test/std/containers/sequences/deque/deque.modifiers/Output/erase_iter_iter.pass.cpp.dir/t.tmp.exe:
  No such file or directory

What’s going on here?

Dynamic Linker

If we return to the output of running file on our mysterious executable, we see it thinks its dynamic linker is /lib64/ld-linux-x86-64.so.2:

ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, with debug_info, not stripped

However, if we look for that file, it doesn’t exist:

$ file /lib64/ld-linux-x86-64.so.2
/lib64/ld-linux-x86-64.so.2:
  cannot open `/lib64/ld-linux-x86-64.so.2'
  (No such file or directory)

The patchelf utility from the NixOS developers will tell us where the dynamic linker is for a given executable:

$ patchelf --print-interpreter /path/to/llvm-test.exe
/lib64/ld-linux-x86-64.so.2

Again, our executable wasn’t built by a Nix compiler wrapper, it was built by the clang we just compiled ourselves. What dynamic linker do other programs on this system use?

$ patchelf --print-interpreter $(which bash)
/nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2

That’s right, everything on NixOS is built in its own prefix. If we had used a compiler wrapper, the flags to tell the executable which linker to use would have been injected.

I found that the linker flag --dynamic-linker will set the dynamic linker path for a given executable, and it’s used here in GCC’s compiler wrapper:

$ grep 'dynamic-link' $(which g++)
        extraBefore+=("-Wl,-dynamic-linker=$NIX_DYNAMIC_LINKER_x86_64_unknown_linux_gnu")

I can’t quite figure out how NIX_DYNAMIC_LINKER_x86_64_unknown_linux_gnu is set other than that it’s set by the script in $gcc_wrapper_prefix/nix-support/add-flags.sh, but I did find the file dynamic-linker under that same prefix:

$ file $(which g++)
/run/current-system/sw/bin/g++: symbolic link to
  /nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0/bin/g++
$ ls /nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0
bin  nix-support
$ ls /nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0/nix-support
add-flags.sh      cc-ldflags      libc-crt1-cflags  libcxx-ldflags  orig-libc-dev            utils.bash
add-hardening.sh  dynamic-linker  libc-ldflags      orig-cc         propagated-build-inputs
cc-cflags         libc-cflags     libcxx-cxxflags   orig-libc       setup-hook

So the file $gcc_wrapper_prefix/nix-support/dynamic-linker contains the path to the dynamic linker the compiler is using:

$ cat /nix/store/gkzmfpb04ddb7phzj8g9sl6saxzprssg-gcc-wrapper-10.3.0/nix-support/dynamic-linker
/nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2

I’ll then use this in my shell.nix to get the path to the dynamic linker, and then pass that to clang for building the LLVM runtimes so the correct dynamic linker is used for executables built by clang:

  dynLinker = lib.fileContents "${stdenv.cc}/nix-support/dynamic-linker";
  flags = lib.concatStringsSep " " ([
    "-Wno-unused-command-line-argument"
    "-Wl,--dynamic-linker=${dynLinker}"
                          ↑↑↑↑↑↑↑↑↑↑↑↑↑
  ] ++ gccFlags ++ libcFlags);

I’ve also added -Wno-unused-command-line-argument to the compile flags so we’re not spammed with warnings every time a link or search directory I pass to the compiler invocation is ignored.

We can now finally run our tests. Sometimes required binaries are still not found by lit, so I use the cxx-test-depends target to build test dependencies and then I run lit manually:

$ make cxx-test-depends -C build-rt
$ ./build/bin/llvm-lit ./libcxx

Full Config

I was able to find lots of information about nix from playing around in the nix repl and using tab completion, like this:

$ nix repl
Welcome to Nix 2.4. Type :? for help.

nix-repl> pkgs = import <nixpkgs> {}

nix-repl> pkgs.lib.concatStringsSep " " ["one" "two" "three"]
"one two three"

nix-repl> pkgs.stdenv.cc.<TAB><TAB>
pkgs.stdenv.cc.__ignoreNulls                pkgs.stdenv.cc.libc_dev
pkgs.stdenv.cc.all                          pkgs.stdenv.cc.libc_lib
pkgs.stdenv.cc.args                         pkgs.stdenv.cc.man
pkgs.stdenv.cc.bintools                     pkgs.stdenv.cc.meta
...

nix-repl> "${pkgs.stdenv.cc.cc}"
"/nix/store/mrqrvina0lfgrvdzfyri7sw9vxy6pyms-gcc-10.3.0"

nix-repl> pkgs.lib.fileContents "${pkgs.stdenv.cc}/nix-support/dynamic-linker"
"/nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2"

Here’s the full config:

with import <nixpkgs> {};

let
  builddir = "build";
  installdir = "install";
  gccForLibs = stdenv.cc.cc;
  dynLinker = lib.fileContents "${stdenv.cc}/nix-support/dynamic-linker";
  libcFlags = [
    "-L ${stdenv.cc.libc}/lib"
    "-B ${stdenv.cc.libc}/lib"
    ];

  # The string version of just the gcc flags for NIX_LDFLAGS
  nixLd = lib.concatStringsSep " " [
    "-L ${gccForLibs}/lib"
    "-L ${gccForLibs}/lib/gcc/${targetPlatform.config}/${gccForLibs.version}"
  ];

  gccFlags = [
    "-B ${gccForLibs}/lib/gcc/${targetPlatform.config}/${gccForLibs.version}"
    "${nixLd}"
    ];

  flags = lib.concatStringsSep " " ([
      "-Wno-unused-command-line-argument"
      "-Wl,--dynamic-linker=${dynLinker}"
    ] ++ gccFlags ++ libcFlags);

  # For building clang itself, we're just using the compiler wrapper and we
  # don't need to inject any flags of our own.
  cmakeFlags = lib.concatStringsSep " " [
    "-DGCC_INSTALL_PREFIX=${gcc}"
    "-DC_INCLUDE_DIRS=${stdenv.cc.libc.dev}/include"
    "-DCMAKE_BUILD_TYPE=Release"
    "-DCMAKE_INSTALL_PREFIX=${installdir}"
    "-DLLVM_INSTALL_TOOLCHAIN_ONLY=ON"
    "-DLLVM_ENABLE_PROJECTS=clang"
    "-DLLVM_TARGETS_TO_BUILD=X86"
  ];

  # For configuring a build of LLVM runtimes however, we do need to inject the
  # extra flags.
  cmakeRuntimeFlags = lib.concatStringsSep " " [
    "-DCMAKE_CXX_FLAGS=\"${flags}\""
    "-DLIBCXX_TEST_COMPILER_FLAGS=\"${flags}\""
    "-DLIBCXX_TEST_LINKER_FLAGS=\"${flags}\""
    "-DLLVM_ENABLE_RUNTIMES='libcxx;libcxxabi'"
  ];

  cmakeCmd = lib.concatStringsSep " " [
    "export CC=${stdenv.cc}/bin/gcc; export CXX=${stdenv.cc}/bin/g++;"
    "${cmakeCurses}/bin/cmake -B ${builddir} -S llvm"
    "${cmakeFlags}"
  ];

  cmakeRtCmd = lib.concatStringsSep " " [
    "export CC=${builddir}/bin/clang; export CXX=${builddir}/bin/clang++;"
    "${cmakeCurses}/bin/cmake -B ${builddir}-rt -S runtimes"
    "${cmakeRuntimeFlags}"
  ];

in stdenv.mkDerivation {

  name = "llvm-dev-env";

  buildInputs = [
    bashInteractive
    cmakeCurses
    llvmPackages_latest.llvm
  ];

  # where to find libgcc
  NIX_LDFLAGS = "${nixLd}";

  # teach clang about C startup file locations
  CFLAGS = "${flags}";
  CXXFLAGS = "${flags}";

  cmakeRuntimeFlags="${cmakeRuntimeFlags}";
  cmakeFlags="${cmakeFlags}";

  cmake="${cmakeCmd}";
  cmakeRt="${cmakeRtCmd}";
}


The foundation of GPU programming is linear algebra. This post takes you from a basic linear algebra example problem to a GPU-accelerated example that calculates a matrix-vector product.

Key Takeaways

  1. Correctness precedes parallelism and performance
  2. Identify and understand the underlying algorithms at play
  3. Speed is not the same as efficiency
  4. Use sane defaults, only optimize after profiling and testing
  5. Use the most specialized tool for the job

NOTE: This post is geared towards those without significant experience in linear algebra, high performance computing, or GPU programming.

Outline

  1. Mathematical Understanding of Algorithm
  2. Algorithm Analysis
  3. C on the Host
  4. CUDA C
    1. Core Concepts
      1. Host and Device
      2. Using the CUDA Runtime
      3. Shared Memory
      4. CUDA Mat-Vec Multiply
  5. CUDA C++
    1. Naive Approach
    2. More Sophisticated Approach
    3. The Best Tool for the Job
  6. Conclusion
  7. BQN Example
  8. Links, References, Additional Reading

Mathematical Understanding of Algorithm

We’ll be performing a matrix-vector dot product several ways in this post.

The operation is depicted below:

Matvec dot product, credit this post: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.2-Multiplying-Matrices-and-Vectors/

Let p be the result of the dot product of matrix Mat and vector v. The dot product is calculated like so:

$$ \\ p \gets Mat \cdot v \\
= v_0 \cdot \left[ {\begin{array}{c} Mat_{0,0} \\ Mat_{1,0} \\ Mat_{2,0} \\ \end{array} } \right]
+ v_1 \cdot \left[ {\begin{array}{c} Mat_{0,1} \\ Mat_{1,1} \\ Mat_{2,1} \\ \end{array} } \right]
+ v_2 \cdot \left[ {\begin{array}{c} Mat_{0,2} \\ Mat_{1,2} \\ Mat_{2,2} \\ \end{array} } \right] \\
= \left[ {\begin{array}{c}
(Mat_{0,0} \cdot v_0) + (Mat_{0,1} \cdot v_1) + (Mat_{0,2} \cdot v_2) \\
(Mat_{1,0} \cdot v_0) + (Mat_{1,1} \cdot v_1) + (Mat_{1,2} \cdot v_2) \\
(Mat_{2,0} \cdot v_0) + (Mat_{2,1} \cdot v_1) + (Mat_{2,2} \cdot v_2) \\
\end{array} } \right] \\ $$

Notice how values of v are broadcast to match the shape of Mat:

$$ \\ \left[ {\begin{array}{cccc} v_{0} & v_{1} & \cdots & v_{n}\\ v_{0} & v_{1} & \cdots & v_{n}\\ \vdots & \vdots & \ddots & \vdots\\ v_{0} & v_{1} & \cdots & v_{n}\\ \end{array} } \right] \\ $$

We can broadcast values of v into columns of a matrix with the same shape as the matrix Mat, and then pair the Mat and v element-wise, creating a matrix of tuples (or a 3d matrix if you prefer):

$$ \\ tuplespace \gets \left[ {\begin{array}{ccc} (Mat_{0,0}, v_0) & (Mat_{0,1}, v_1) & (Mat_{0,2}, v_2) \\ (Mat_{1,0}, v_0) & (Mat_{1,1}, v_1) & (Mat_{1,2}, v_2) \\ (Mat_{2,0}, v_0) & (Mat_{2,1}, v_1) & (Mat_{2,2}, v_2) \\ \end{array} } \right] \\ $$

This is sometimes called a tuple space, or the domain of our algorithm. The book How to Write Parallel Programs: A First Course covers tuple spaces in great detail.

Now that we have constructed our tuple space, we might group our computations into self-contained units of work along each row.

Let tuplespace be the 2 dimensional matrix tuple space given above. We then may form a vector with units of work yielding indices of the output vector:

$$ \\ \left[ {\begin{array}{cccc} w(0) \gets \sum_{i \gets 0}^{N} tuplespace_{0, i, 0} \cdot tuplespace_{0, i, 1} \\ w(1) \gets \sum_{i \gets 0}^{N} tuplespace_{1, i, 0} \cdot tuplespace_{1, i, 1} \\ \vdots \\ w(M) \gets \sum_{i \gets 0}^{N} tuplespace_{M, i, 0} \cdot tuplespace_{M, i, 1} \\ \end{array} } \right] \\ $$

Equivalently:

$$ \\ \left[ {\begin{array}{cccc} w(0) \gets \sum_{i \gets 0}^{N} Mat_{0,i} \cdot v_{i} \\ w(1) \gets \sum_{i \gets 0}^{N} Mat_{1,i} \cdot v_{i} \\ \vdots \\ w(M) \gets \sum_{i \gets 0}^{N} Mat_{M,i} \cdot v_{i} \\ \end{array} } \right] \\ $$

Our units of work may independently operate on subsets (rows) of our tuple space.

Algorithm Analysis

The first question we must ask ourselves when parallelizing code is this: are any iterations of the algorithm dependent on values calculated in other iterations? Is iteration N dependent on calculations in iteration N-1? In other words, are the loop bodies entirely independent of each other?

If so, our algorithm is loop independent and trivially parallelizable. This slide deck from a UT Austin lecture is helpful additional reading on this topic.

The fundamental algorithm at play here is a reduction or a fold. If you see these terms elsewhere in literature, documentation, or algorithms in libraries or programming languages, they almost certainly mean the same thing. Some collection of values are reduced or folded into a single value.

You might be thinking to yourself, we are starting with a collection of values (a matrix) and yet we end up with a collection of values (a vector). How is this a reduction/fold?

This is a good question: the reduction is not performed over the entire matrix, but only the rows of the matrix. Each row of the matrix is reduced into a single value.

The algorithm each unit of work performs is called transform-reduce (or sometimes map-reduce).

Although transform-reduce might seem like two algorithms (it kind of is!), it is such a universal operation that it is often considered its own algorithm (or at least it’s packaged as its own algorithm in libraries). For example, the Thrust abstraction library that ships with NVIDIA’s CUDA Toolkit has a transform-reduce algorithm built in.

In this case, we would like to transform our input tuples by multiplying two elements together, and then reduce our input using the sum operator.
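C++17 packages this exact pairing as std::transform_reduce in <numeric>, which by default transforms pairs with multiplication and reduces with addition. A sketch of one unit of work over a single row, with illustrative names:

#include <numeric>

// One unit of work: multiply each (mat, vec) pair, then sum the products.
int row_dot(const int *mat_row, const int *vec, int n) {
    return std::transform_reduce(mat_row, mat_row + n, vec, 0);
}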

In Python, a given unit of work might look like this:

from functools import reduce
tuplespace_row0 = [
    (0, 2),
    (1, 2),
    (2, 2),
    ]

def work(tupl):
    return reduce(
            lambda a, b: a + b,        # use + to reduce
            map(lambda x: x[0] * x[1], # use * to transform
                tupl                   # input tuple
                )
            )

# Input to map is tuplespace_row0
# Input to reduce is [0, 2, 4]
# Final value is 6
print(work(tuplespace_row0)) # yields 6

The following formula is a more formal definition of a single unit of work in our example:

$$ \\ r \gets \text{current rank} \\ W_{r} \gets \sum_{i \gets 0}^{N} M_{r,i} \cdot v_{i} \\ $$

In the above case, the summation is the reduce operation, and the multiplication of the matrix elements and vector elements is the transform operation, transforming each tuple into a scalar before the reduction.

The key insight about this reduction is that no unit of work depends on another unit of work. The domains of each unit of work are non-overlapping. In other words, this algorithm is loop independent and can be parallelized along the rows of our tuplespace, again given by:

$$ \\ \left[ {\begin{array}{ccc} (Mat_{0,0}, v_0) & (Mat_{0,1}, v_1) & (Mat_{0,2}, v_2) \\ \hline \\ (Mat_{1,0}, v_0) & (Mat_{1,1}, v_1) & (Mat_{1,2}, v_2) \\ \hline \\ (Mat_{2,0}, v_0) & (Mat_{2,1}, v_1) & (Mat_{2,2}, v_2) \\ \end{array} } \right] \\ $$

It was by identifying and understanding the underlying algorithms (broadcast and transform-reduce) of our higher-level algorithm that we are able to determine if and how it is parallelizable and loop independent.

Identify and understand the underlying algorithms

NOTE: Even if your operation seems to be loop dependent, there are sometimes clever tricks you can use to parallelize your code. Perhaps you just haven’t been exposed to the correct algorithm yet!

We now hopefully understand that a matrix-vector product is formally a broadcasted multiply followed by a series of sum-reductions and that we can parallelize our algorithm by breaking it up into self-contained units of work. We can now move on to implementing and parallelizing the algorithm.

C on the Host

The code for such a calculation might look like this in C:

void matvecmul(int* mat, int* vec, int* out, int m, int n) {
    for (int i=0; i < m; i++)
        for (int j=0; j < n; j++)
            out[i] += vec[j] * mat[j+(i*n)];
}

Here’s some example data fed into our matrix vector product:

int main() {
    int M = 3;
    int N = 4;

    int mat[M*N];
    for (int i=0; i < M*N; i++) mat[i] = i;

    int vec[N];
    for (int i=0; i < N; i++) vec[i] = i;

    int out[M];

    memset(out, 0, sizeof(int[M]));
    matvecmul(mat, vec, out, M, N);

    return 0;
}

The output of this program (with some printing code added in):

Matrix:
  0   1   2   3 
  4   5   6   7 
  8   9  10  11 
Vector:
  0   1   2   3 
Output:
 14  38  62 

Feel free to verify these results and play around with other values using online software like this CASIO calculator website, or a scripting language. Here’s an example of the above problem using BQN, one of my favorite languages to use when understanding an algorithm.

Demonstrating that we have a correct algorithm with tests is a precondition for optimizing and parallelizing an algorithm:

Testing for correctness precedes parallelism and performance

We know that a given index in our output vector can be computed independently of any other indices in the output vector from the respective row in our tuple space. We can then pull out a function that performs a single unit of work as identified above.

int unit_of_work(int* mat, int* vec, int row, int n) {
    int sum = 0;
    mat += row * n;
    for (int i=0; i < n; i++)
        sum += mat[i] * vec[i];
    return sum;
}

Compare this now with the single unit of work we described above:

$$ \\ r \gets \text{current rank} \\ W_{r} \gets \sum_{i \gets 0}^{N} M_{r,i} \cdot v_{i} \\ $$

Our new matvecmul function can now just iterate over all the rows and dispatch the actual work to the unit_of_work function. We can even use OpenMP to parallelize our loop:

void matvecmul_on_tuplespace(int* mat, int* vec, int* out, int m, int n) {
    // dispatch calculations to unit_of_work for each row of mat
    #pragma omp parallel for
    for (int row=0; row < m; row++)
        out[row] = unit_of_work(mat, vec, row, n);
}

You might have noticed that our new implementation has more code than our original implementation, and might be slightly more complex. This is okay, and it gets at an important point:

Speed is not the same as efficiency

This excellent podcast episode from the lead HPC architect at NVIDIA explains this point in detail.

If our code performs more work overall it is less efficient. If that additional work means we can perform calculations on multiple threads or additional devices resulting in lower runtime, it is faster and we’ve increased its speed. The key difference between speed and efficiency is this: speed is a factor of time and efficiency is a factor of work. Sometimes optimizing code means improving speed, other times efficiency. Most of the time, to run code on a GPU, you do have to perform more work to set up the calculation, so strictly speaking our code will be faster and less efficient.

CUDA C

CUDA C is the basis of the CUDA runtime, and forms the foundation for all other CUDA-related abstractions. We’ll take a look at some basic concepts before jumping into the code. This CUDA C introductory slide deck is helpful in understanding the basics.

Core Concepts

Host and Device

When working with a GPU, it’s important to keep in mind the difference between the host and the device.

GPU-CPU interaction

Just like your CPU, your GPU has access to its own memory. Programming a GPU entails managing your CPU’s memory along with your GPU’s memory. If you would like your GPU to have access to some memory you’re using on the CPU, you’ll have to allocate memory on the GPU and copy it over.

If you don't tell the GPU to perform any work, then your CUDA C code is really just C code:

#include <cstdio>

int main() {
  puts("Hello!");
  return 0;
}

You can then invoke NVIDIA’s compiler, NVCC, to compile the program:

$ cat hello.cu
#include <cstdio>
int main() {
  puts("Hello!");
  return 0;
}
$ nvcc hello.cu -o hello && ./hello
Hello!

If you invoke nvcc with the -v flag for extra verbosity, you can see that nvcc actually uses a host compiler to build the parts of your program that don’t involve running code or manipulating memory on the GPU. nvcc uses multiple passes, where it compiles the CUDA code and generates host-only source for the host compiler to compile. See this Compiler Explorer link and look at the compilation output window in the bottom right pane to see all the output. Notice that GCC is invoked, along with the program ptxas. PTX is an assembly target, so your CUDA programs emit PTX code which can be run on your GPU’s special-purpose processing units. The command line flags for code generation can be complicated; refer to the official CUDA programming guide when needed.

Just as you can use asm volatile("" : : : ""); in C and C++ to write inline assembly, you can also write inline PTX assembly in your programs. And just as with C and C++, it is almost certainly more effective to write your code in a higher-level language like CUDA C++, and to write PTX only after profiling and testing, when you are sure you need it.
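As a small illustration of inline PTX (a sketch along the lines of the example in NVIDIA’s inline-PTX documentation), a device function can read the lane ID special register directly:

__device__ unsigned lane_id() {
  unsigned ret;
  // mov.u32 copies the %laneid special register into ret
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
  return ret;
}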

If you look carefully, you might also have noticed that GCC was passed the command line argument -x c++, even though we’re working in plain CUDA C. This is because CUDA code is built on the host as C++ by default. If you use the oldest CUDA compiler available on Compiler Explorer, you’ll see that it still defaults to building the host code as C++14.

The full NVCC compilation pipeline is depicted below:

The full NVCC compilation pipeline

Using the CUDA Runtime

In this example, we introduce three aspects of the CUDA programming model:
  • The special keyword __global__
  • Device memory management
  • Kernel launches
// square.cu
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void square(int *ar, int n) {
  int tid = threadIdx.x;
  if (tid < n)
    ar[tid] = ar[tid] * ar[tid];
}

int main() {
  #define N 10

  // Allocate static memory on host
  int ar[N];
  for (int i=0; i < N; i++) ar[i] = i;

  // Allocate memory on device, copy from host to device
  int* d_ar;
  cudaMalloc(&d_ar, sizeof(int[N]));
  cudaMemcpy(d_ar, ar, sizeof(int[N]), cudaMemcpyHostToDevice);

  // Launch kernel to run on the device
  square<<<1, 15>>>(d_ar, N);

  // Copy memory back from device
  cudaMemcpy(ar, d_ar, sizeof(int[N]), cudaMemcpyDeviceToHost);

  // Display values after kernel
  for (int i=0; i < N; i++)
    printf("%d ", ar[i]);
  puts("");

  // Deallocate memory
  cudaFree(d_ar);
  return 0;
}
$ nvcc square.cu -o square
$ ./square
0 1 4 9 16 25 36 49 64 81

__global__ indicates that the code runs on the device and is called from the host. Keep in mind that we have two memory spaces and two execution spaces.

The following table enumerates common operations in C along with their CUDA counterpart:

  C operation            CUDA counterpart
  malloc(size)           cudaMalloc(&ptr, size)
  memcpy(dst, src, n)    cudaMemcpy(dst, src, n, direction)
  free(ptr)              cudaFree(ptr)

The angle brackets surrounding our _kernel launch parameters_ determine how the kernel will be executed by the GPU. The possible kernel launch parameters are enumerated at this link.

The kernel launch parameters determine how much work is launched on the GPU and how it is divided among the GPU’s streaming multiprocessors (SMs). The first two parameters are objects of type dim3, and each can be a vector of up to three dimensions. The first kernel launch parameter is the grid size, and the second is the block size.

Grids consist of blocks.

Blocks consist of threads.

Therefore, the total number of threads launched by your kernel will be:

$$ \text{totalthreads} = \text{gridsize}.x \times \text{gridsize}.y \times \text{gridsize}.z \\ \times \text{blocksize}.x \times \text{blocksize}.y \times \text{blocksize}.z $$

CUDA kernels may be launched with a 1-3 dimensional grid and a 1-3 dimensional block. The grid depicted in the image below might have been launched with these kernel launch parameters:

  dim3 grid_size(3, 3, 1);
  dim3 block_size(3, 3, 1);
  myfunc<<<grid_size, block_size>>>();
CUDA Grid and Block Depiction

You might also notice that we guard our operation with this if statement.

  if (tid < n)
    ar[tid] = ar[tid] * ar[tid];

For performance reasons, it’s usually best to launch your kernels with a thread count that is a multiple of your GPU’s scheduling granularity (for example, the warp size), so you may end up launching more GPU threads than you have elements to process; the guard keeps those extra threads from indexing out of bounds.
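
As a sketch of how that usually looks in practice (this kernel and helper are hypothetical, not from the post), the grid size is rounded up with a ceiling division so every element is covered, and the guard discards the overshoot:

__global__ void scale_by_two(int *ar, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (tid < n)                                      // extra threads in the last block do nothing
    ar[tid] *= 2;
}

void launch_scale_by_two(int *d_ar, int n) {
  int block = 128;                      // a multiple of the 32-thread warp size
  int grid  = (n + block - 1) / block;  // ceiling division: enough blocks to cover n
  scale_by_two<<<grid, block>>>(d_ar, n);
}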

For additional reading on the hardware implementation of the CUDA programming model, please refer to chapter 4 of the NVIDIA CUDA Programming Guide.

Shared Memory

Although each thread launches with its own stack memory, threads can share memory, just like OS threads can. The third kernel launch parameter determines how many bytes of shared memory will be allocated for each block that is launched.

In the following example, we make use of CUDA shared memory, as indicated by the `__shared__` keyword annotating the array in our kernel, as well as our use of the third kernel launch parameter:

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void mulsum(int* a, int* b, int* out, int n) {
  int tid = threadIdx.x;

  // External and shared: allocated by the CUDA runtime when the kernel is launched
  extern __shared__ int tmp[];

  if (tid >= n)
    return;

  tmp[tid] = a[tid] * b[tid];

  __syncthreads();

  if (tid == 0) {
    int sum = 0;
    for (int i=0; i < n; i++)
      sum += tmp[i];
    *out = sum;
  }
}

int main() {
  #define N 10

  int a[N];
  for (int i=0; i < N; i++) a[i] = i;

  int b[N];
  for (int i=0; i < N; i++) b[i] = i;

  int* d_a;
  cudaMalloc(&d_a, sizeof(int[N]));
  cudaMemcpy(d_a, a, sizeof(int[N]), cudaMemcpyHostToDevice);

  int* d_b;
  cudaMalloc(&d_b, sizeof(int[N]));
  cudaMemcpy(d_b, b, sizeof(int[N]), cudaMemcpyHostToDevice);

  int* d_out;
  cudaMalloc(&d_out, sizeof(int));

  // Third launch parameter: size of shared memory to allocate for this kernel launch
  mulsum<<<1, N, sizeof(int[N])>>>(d_a, d_b, d_out, N);

  int out;
  cudaMemcpy(&out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d\n", out);

  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_out);
  return 0;
}
$ nvcc mul-sum-reduce.cu && ./a.out
285

Notice how our shared memory is declared:

  extern __shared__ int tmp[];

It is external because we are not allocating the memory in our kernel; it’s allocated by the CUDA runtime when we pass the third kernel launch parameter:

mulsum<<<1, N, sizeof(int)*N>>>(d_a, d_b, d_out, N);
               ▲
               │
 ┌─────────────┴───────────────────────┐
 │Size of shared memory to be allocated│
 │         for kernel launch           │
 └─────────────────────────────────────┘

There can only be one segment of shared memory in a kernel launch, so the shared memory segment will be interpreted as whatever type we declare our shared memory with; in this case, an array of ints. Although there is strictly one segment of shared memory in a kernel launch, you can still declare multiple variables as __shared__, so long as they all fit in the allocated shared memory.
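
As a minimal sketch (hypothetical names, not from the post) of carving several arrays out of that single dynamic segment:

__global__ void two_shared_arrays(int n) {
  extern __shared__ unsigned char smem[];
  int*   itmp = reinterpret_cast<int*>(smem);                      // first n ints
  float* ftmp = reinterpret_cast<float*>(smem + n * sizeof(int));  // then n floats
  int tid = threadIdx.x;
  if (tid < n) {
    itmp[tid] = tid;
    ftmp[tid] = 0.5f * tid;
  }
}
// The launch must request room for both:
// two_shared_arrays<<<1, n, n * sizeof(int) + n * sizeof(float)>>>(n);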

We also introduced another CUDA extension to the host language: __syncthreads(). __syncthreads() is a fence, or barrier: a point which no thread in the block can cross until all threads in that block have reached it. There are many other CUDA primitives for atomic and synchronization operations, such as atomicAdd.
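
For example, here is a sketch (not from the post) of the same multiply-and-sum written with atomicAdd instead of the single-thread loop; it assumes *out is zeroed on the device before the launch:

__global__ void mulsum_atomic(int* a, int* b, int* out, int n) {
  int tid = threadIdx.x;
  if (tid < n)
    atomicAdd(out, a[tid] * b[tid]);  // every thread adds its product atomically
}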

CUDA Mat-Vec Multiply

We again return to our matvecmul example, this time armed with some knowledge about the CUDA runtime and some software and hardware abstractions.

#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void matvecmul(int* mat, int* vec, int* outv,
                          int m, int n) {

  int rowidx = blockIdx.x;
  int colidx = threadIdx.x;

  extern __shared__ int tmp[];

  if (colidx < n && rowidx < m) {
    tmp[colidx] = mat[colidx + (rowidx * n)] * vec[colidx];

    __syncthreads();

    if (colidx == 0) {
      int sum = 0;
      for (int i=0; i < n; i++)
        sum += tmp[i];
      outv[rowidx] = sum;
    }
  }
}

int main() {
  #define M 10
  #define N 15

  int a[M*N];
  for (int i=0; i < M*N; i++) a[i] = i;

  int b[N];
  for (int i=0; i < N; i++) b[i] = i;

  int* d_a;
  cudaMalloc(&d_a, sizeof(int[M*N]));
  cudaMemcpy(d_a, a, sizeof(int[M*N]), cudaMemcpyHostToDevice);

  int* d_b;
  cudaMalloc(&d_b, sizeof(int[N]));
  cudaMemcpy(d_b, b, sizeof(int[N]), cudaMemcpyHostToDevice);

  int* d_c;
  cudaMalloc(&d_c, sizeof(int[M]));

  matvecmul<<<M, N, sizeof(int[N])>>>(d_a, d_b, d_c, M, N);

  int c[M];
  cudaMemcpy(c, d_c, sizeof(int[M]), cudaMemcpyDeviceToHost);

  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  return 0;
}

After adding some printing code to our example above, we get the following:

$ nvcc matvecmul.cu && ./a.out
Matrix:
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
  15   16   17   18   19   20   21   22   23   24   25   26   27   28   29
  30   31   32   33   34   35   36   37   38   39   40   41   42   43   44
  45   46   47   48   49   50   51   52   53   54   55   56   57   58   59
  60   61   62   63   64   65   66   67   68   69   70   71   72   73   74
  75   76   77   78   79   80   81   82   83   84   85   86   87   88   89
  90   91   92   93   94   95   96   97   98   99  100  101  102  103  104
 105  106  107  108  109  110  111  112  113  114  115  116  117  118  119
 120  121  122  123  124  125  126  127  128  129  130  131  132  133  134
 135  136  137  138  139  140  141  142  143  144  145  146  147  148  149

Vector:
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14

Output:
1015 2590 4165 5740 7315 8890 10465 12040 13615 15190
This BQN example verifies the output from our CUDA program:

   Mul ← +˝∘×⎉1‿∞
   m←10‿15⥊↕200
   v←↕15
   m Mul v
⟨ 1015 2590 4165 5740 7315 8890 10465 12040 13615 15190 ⟩

In our CUDA C example, we launch a block for each row of our matrix. This way, we can share memory between each thread operating on a given row of the matrix. A single thread per row can then perform the sum reduction and assign the value to the index in the output vector.

CUDA C++

Naive Approach

In our previous CUDA C example, we weren’t really using CUDA C per se, but CUDA C++. If you read the introductory chapter of the official NVIDIA CUDA Programming Guide, you’ll see that CUDA C is really just CUDA mixed with the common language subset between C and C++ on the host. We were using CUDA C++ the whole time; we just restricted ourselves to the C subset of C++ for simplicity. Here is the same matrix-vector multiply again, this time using Thrust’s device containers and pointer types:

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <cuda.h>
#include <cuda_runtime.h>

/* ┌───────────────────────────┐
   │Using thrust's pointer type│
   │     instead of int*       │
   └───────────────────────────┘ */
__global__ void matvecmul(
    thrust::device_ptr<int> mat,
    thrust::device_ptr<int> vec,
    thrust::device_ptr<int> outv,
    int m, int n) {

  int rowidx = blockIdx.x;
  int colidx = threadIdx.x;

  extern __shared__ int tmp[];

  if (colidx < n && rowidx < m)
  {
    tmp[colidx] = mat[colidx + (rowidx * n)] * vec[colidx];

    __syncthreads();

    if (colidx == 0) {
      int sum = 0;
      for (int i=0; i < n; i++)
        sum += tmp[i];
      outv[rowidx] = sum;
    }
  }
}

int main() {
  #define M 10
  #define N 15

  /* ┌────────────────────────────────┐
     │ Using thrust's host and device │
     │    vectors over raw arrays.    │
     │No longer need to use cudaMemcpy│
     │        or cudaMalloc!          │
     └────────────────────────────────┘ */
  thrust::device_vector<int> a(M*N);
  thrust::sequence(a.begin(), a.end(), 0);

  thrust::device_vector<int> b(N);
  thrust::sequence(b.begin(), b.end(), 0);

  thrust::device_vector<int> c(M, 0);

  matvecmul<<<M, N, sizeof(int[N])>>>(a.data(), b.data(), c.data(), M, N);

  /* ┌────────────────────────────┐
     │The assignment operator will│
     │   perform the cudaMemcpy   │
     └────────────────────────────┘ */
  thrust::host_vector<int> out = c;

  puts("Output:");
  for (int i=0; i < M; i++)
    printf("%d ", out[i]);
  puts("");
  return 0;
}
$ nvcc thrust-ex.cu && ./a.out
Output:
1015 2590 4165 5740 7315 8890 10465 12040 13615 15190

As you can see, the code looks quite similar, except for the lack of memory management. This is hiding a few extra details as well. In our original CUDA example, we first allocate and assign to memory on the host before copying it to the device. In this example, we allocate memory on the device first, and perform assignments on the device.

In this line, we allocate device memory for our matrix, and assign values to it on the device.

  thrust::device_vector<int> a(M*N);
  thrust::sequence(a.begin(), a.end(), 0);

thrust::sequence is almost identical to std::iota or for (int i=0; i < N; i++) vec[i] = i;, except that it may execute on the device. In this new example, we launch three kernels instead of one: one for each call to thrust::sequence, and one for our manual kernel launch. You can look at the details of the ptx assembly in Compiler Explorer here.

More Sophisticated Approach

Remember all that fuss about fundamental algorithms in the earlier sections? How our fundamental algorithm here is a transform-reduce?

Well, in our first-pass CUDA implementation, we don’t really use this to our advantage. Our kernel contains the following lines:

    if (colidx == 0) {
      int sum = 0;
      for (int i=0; i < n; i++)
        sum += tmp[i];
      outv[rowidx] = sum;
    }
Thrust's `transform_reduce` uses a rather complicated multi-pass, tiled approach to reducing a collection of values to a single value, but we only use a single thread in a block to actually reduce a given index in our output vector.

While we used a raw loop at least once per block, an optimized reduction will perform something like the following:

depiction of a multi-pass sum reduction

Extremely performant reductions are actually quite hard to get right - it’s easy to get some parallelism in a reduction, but it takes significant effort to truly maximize the speed you can get from a GPU. This slidedeck from the NVIDIA developer blog details various approaches to optimizing a reduction operation on a GPU. Thrust’s reduce implementation will even select different strategies and launch parameters based on the sizes of the data it operates on.
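
As a small illustration of handing that work to the library (a sketch with hypothetical names, not from the original post), a plain sum reduction is a single call, and Thrust chooses the kernel configuration for us:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

int sum_to_999() {
  thrust::device_vector<int> v(1000);
  thrust::sequence(v.begin(), v.end(), 0);       // 0, 1, ..., 999
  return thrust::reduce(v.begin(), v.end(), 0);  // reduction runs on the device
}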

The point of this Thrust discussion is not to dissuade you from writing raw CUDA kernels - it’s to dissuade you from doing it too early. In the majority of cases, it’s likely that using a library that wraps raw CUDA kernels will result in faster code and less development time. Once you have already written your code using known algorithms, once you have tested your code to demonstrate its correctness, and once you have profiled your code to demonstrate where the performance bottlenecks are on the target architectures you care about, then it makes sense to write raw CUDA kernels.

Use sane defaults; only optimize after profiling and testing

So let’s try again, using Thrust’s parallel algorithms to compute the reductions for each row of the matrix-vector multiplication (godbolt link here):

#include <cstdio>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>

__global__
void broadcast_to_matrix(thrust::device_ptr<int> mat,
    thrust::device_ptr<int> vec,
    int m, int n) {
  const auto col = blockIdx.x;
  const auto row = threadIdx.x;

  if (row < m and col < n)
    mat[col+(row*n)] = vec[col];
}

int main() {
  #define M 10
  #define N 15

  thrust::device_vector<int> a(M*N);
  thrust::sequence(a.begin(), a.end(), 0);

  thrust::device_vector<int> b(N);
  thrust::sequence(b.begin(), b.end(), 0);

  thrust::device_vector<int> broadcasted_b(M*N);
  broadcast_to_matrix<<<N, M>>>(broadcasted_b.data(), b.data(), M, N);

  thrust::host_vector<int> c(M);

  thrust::zip_iterator iter(thrust::make_tuple(a.begin(), broadcasted_b.begin()));

  for (int i=0; i < M; i++)
    c[i] = thrust::transform_reduce(iter+(i*N), iter+(i*N)+N,
        [] __device__ (auto tup) -> int {
          return thrust::get<0>(tup) * thrust::get<1>(tup);
        },
        0,
        thrust::plus<int>()
        );

  puts("Output:");
  for (int i=0; i < M; i++)
    printf("%d ", c[i]);
  puts("");
  return 0;
}

Since we’re using some more fancy C++ features, we have to pass some extra flags to the compiler:

$ nvcc -std=c++17 --extended-lambda fancy.cu && ./a.out
Output:
1015 2590 4165 5740 7315 8890 10465 12040 13615 15190

The first kernel is launched manually, since we’re just broadcasting values from the vector to another vector to match the shape of our matrix. We perform this step so we can create a zip iterator from the matrix and the broadcasted vector:

  thrust::zip_iterator iter(thrust::make_tuple(a.begin(), broadcasted_b.begin()));

This means we can feed the single iterator into our transform_reduce operation. Elements obtained by the zip iterator are passed into our lambda function, and we simply multiply the two values together, before using the plus functor to reduce the vector of intermediate values for a given row into a scalar:

  for (int i=0; i < M; i++)
    c[i] = thrust::transform_reduce(iter+(i*N), iter+(i*N)+N,
        [] __device__ (auto tup) -> int {
          return thrust::get<0>(tup) * thrust::get<1>(tup);
        },
        0,
        thrust::plus<int>()
        );

If we want to use threading on the host as well, we can even use an OpenMP directive:

#pragma omp parallel for
  for (int i=0; i < M; i++)
    c[i] = thrust::transform_reduce(iter+(i*N), iter+(i*N)+N,
        [] __device__ (auto tup) -> int {
          return thrust::get<0>(tup) * thrust::get<1>(tup);
        },
        0,
        thrust::plus<int>()
        );

We’ll have to tell nvcc to pass the -fopenmp flag to the host compiler:

$ nvcc -std=c++17 --extended-lambda -Xcompiler -fopenmp fancy.cu

The Best Tool for the Job

We have by now hopefully learned that we should use the most specialized tool for the job, and that we should write kernels by hand only when we’re sure we can do better than our libraries of choice. We can take this principle one step further with a little extra knowledge of our problem.

A matrix-vector product is a very common linear algebra operation, and a member of the Basic Linear Algebra Subprograms (BLAS) interface, for which CUDA provides a library: cuBLAS. Because this is such a common operation, NVIDIA provides an extremely fast implementation - far more optimized than anything we would write by hand.

This knowledge of our problem leads us to using the most appropriate library, and likely to the fastest solution.

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  #define M 10
  #define N 15

  cublasHandle_t ch;
  cublasCreate(&ch);

  thrust::device_vector<double> a(M*N);
  thrust::sequence(a.begin(), a.end(), 0);

  const double alpha = 1.0;
  const double beta = 0.0;

  thrust::device_vector<double> b(N);
  thrust::sequence(b.begin(), b.end(), 0);

  thrust::device_vector<double> c(M);

  #define PTR(x) thrust::raw_pointer_cast(x.data())
  cublasDgemv(
      ch,
      CUBLAS_OP_T,
      N, M,
      &alpha,
      PTR(a), N,
      PTR(b), 1,
      &beta,
      PTR(c), 1
      );
  #undef PTR

  thrust::host_vector<double> hc = c;

  puts("Output:");
  for (int i=0; i < M; i++)
    printf("%.1f ", hc[i]);
  puts("");

  cublasDestroy(ch);

  return 0;
}
$ nvcc cublas.cu -lcublas && ./a.out
Output:
1015.0 2590.0 4165.0 5740.0 7315.0 8890.0 10465.0 12040.0 13615.0 15190.0

Conclusion

BQN Example

Personally, I use BQN to prototype solutions to problems and to better understand the fundamental algorithms at play; you don’t have to know an APL in order to understand this, but it might be helpful. Feel free to skip this section; it is not critical to understanding the concepts.

Here’s a permalink to the BQN snippet.

   # Same matrix as in our C example
   mat ← 3‿3⥊↕10
┌─       
╵ 0 1 2  
  3 4 5  
  6 7 8  
        ┘
   # Same vector as in our C example
   vec ← 3⥊2
⟨ 2 2 2 ⟩

   +˝⎉1 mat×vec
⟨ 6 24 42 ⟩

The core algorithm is seen in the final expression:

+˝⎉1 mat×vec
▲    ▲
│    │     ┌───────────────────────────┐
│    └─────┤Multiply rows of mat by vec│
│          │        element-wise       │
│          └───────────────────────────┘
│     ┌─────────────────────────┐
│     │Sum-reduce rows of matrix│
└─────┤ resulting from mat×vec  │
      └─────────────────────────┘

Alternatively:

Try BQN explanation of matvecmul

Links, References, Additional Reading

Overview of how I debug performance regressions when developing a compiler. I don’t claim this is the best way to do it, email me or tweet at me if you’ve got better ideas😉

Starting Point

Compilers are very complicated and the results can be surprising. Sometimes performance issues only show up in large scale real-world applications. How do you go about debugging such an issue?

As you might expect, narrowing down the issue to be minimal and reproducible is the first task. Ideally, we narrow the performance regression down to a single translation unit, though sometimes this isn’t enough. For this post, we’ll assume that the bulk of the performance regression you see in your application is coming from one translation unit, and that you know which patch is causing the regression (if you don’t know which patch is causing the regression… well you can bisect the recent patches too😁).

Bisecting the Object Files

Assume we have two compilers: compiler A which doesn’t have the “bad” changes (the “good” compiler), and compiler B which does (the “bad” compiler). We’ll start by building the application with both compilers, building half of the object files with compiler A and half with compiler B. Say we have 100 object files that are linked into the application; we’d build the first 50 with compiler A and the second 50 with compiler B.

If the perf regression isn’t observed after you re-link all the object files into the application, then we know the problematic translation units are among the object files that were just built with compiler A (the regression only shows up when those files are built with compiler B). We can then rebuild all of the second 50 object files with compiler A and build object files 1-25 or 26-50 with compiler B. In this way, we bisect all the translation units until we find the single TU with the largest impact on performance.

This can be really tedious and manual, but it’s not too hard to script😉.

Bisecting the Source File

Now that we’ve narrowed our regression down to a single TU, our work gets a little more complicated. We can use the same bisection process as before, but this time we’ll have to do it on a single file. To accomplish this, we’ll have to figure out which parts of the source file depend on each other so we can break it into two new source files, one to be built with compiler A and one to be built with compiler B (all other TUs being built with the “good” compiler).

Depending on the situation, you may create two source files, each with half of the content of the original, or you may use the same source file with macro guards so that each compiler only builds half of the source, e.g.:

/* includes, declarations, and global defs up here */

#ifdef COMPILERA
// stuff built with the good compiler...
#else /* COMPILERB */
// stuff built with the bad compiler...
#endif

You may then add -DCOMPILERA to the invocation of compiler A so each compiler only builds half of the TU in question. Again, if we don’t see the perf regression, we swap the macro guards and try again. We then have compiler B build a quarter of the original TU and have compiler A build the other three quarters, and see if we observe the regression, and so on. Ideally, at the end of this process we know exactly which function(s) are causing the regression.

What Next?

After we’ve narrowed the regression down to a function or two (🤞), things can get tricky; what to do next depends very much on the nature of the changes that caused the regression.

At this point I think it’s best to ask some questions:

  • Was the patch in question related to a specific pass?
    • Can the effects of that pass be seen in the function(s) we found to be causing the regression?
    • Is the regression observed when the pass is disabled?
  • Do you notice any obvious differences between the IR the compilers generate for the identified functions?
    • Can you use those differences to work backwards to the code that generated that IR?
  • If you enable lots of debugging output (like dumping all the opt pass remarks) and build with compilers A and B and then diff the output, are there any glaring differences? Maybe an earlier change allowed another pass (uninvolved in the patch) to perform some transformations it otherwise would not, or maybe vice-versa.

Why Might This Not Work?

Sometimes the effects only occur in a short function that is always inlined, in which case you might not find a specific TU or set of functions at the root of the regression; for this reason, you might want to crank the inlining pass down as low as it goes to help you narrow down the issue. It’s often best to use the fewest optimizations possible when debugging this sort of thing (so long as you still observe the behavior).

std::expected and some spectacular extensions are hopefully coming to C++23.

Existing Practice

I’m used to seeing code that looks like this, not just in C but in C++ codebases too:

int main() {
  int ierr;
  double * mat = (double*)malloc(sizeof(double[100]));

  ierr = dowork(mat, 10, 10);
  check_error(ierr);

  ierr = domorework(mat, 10, 10);
  check_error(ierr);

  free(mat);
}

Integers represent error conditions, and these usually map to an error message like so:

const char* error_messages[] = {
    "success",
    "got some failure",
    "another kind of failure",
};
enum error_types {
    success,
    some_failure,
    some_other_failure,
};

void check_error(int ierr) {
    if (ierr) {
        printf("got error '%s'\n", error_messages[ierr]);
        exit(ierr);
    }
}

This way when you encounter an error condition in some function, you just return the corresponding error code like so:

int dowork(double* matrix, int M, int N) {
  // do some work here

  if (/*some error condition*/)
      return some_failure;

  // success state
  return success;
}

And the error handler reports the failures:

  Program returned: 1
got error 'got some failure'

On one hand, there’s a sense of security to writing code like this. You always check your error conditions, your code never throws exceptions, and the range of possible failures is very clear. I don’t mind this pattern in C, but in C++ we have some goodies that are far nicer to use in my opinion (especially in C++23).

Why std::expected is Awesome

I’ll be using mdspan from the Kokkos implementation and expected from Sy Brand’s implementation for this section.

In the last year, 3 papers have come through the ISO C++ mailing lists for std::expected, and the most recent paper from Jeff Garland proposes some of Sy Brand’s wonderful extensions to std::expected which allow you to chain together monadic operations on expected values:

  1. and_then
  2. or_else
  3. transform

I think these extensions are extremely elegant, and I think some folks that are more used to the C-style error handling could be won over. Using std::expected means your errors have value semantics, which is something I like about the C-style error handling. Chaining together these operations makes the programmer’s intent so much clearer.
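
As a small sketch of the third operation (a hypothetical example written against Sy Brand’s tl::expected, where the proposed transform is spelled map): the callback only runs on success, and an error value passes through untouched:

#include <string>
#include <tl/expected.hpp>

tl::expected<int, std::string> parse_positive(int x) {
  if (x < 0)
    return tl::make_unexpected(std::string{"negative input"});
  return x;
}

int demo() {
  return parse_positive(21)
      .map([](int v) { return v * 2; })  // runs only if parse_positive succeeded
      .value_or(-1);                     // 42 here; -1 if we had carried an error
}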

Let’s look at another example using matrices, but this time using expected and mdspan.

expected Example

I’ll get some using statements out of the way:

namespace stdex = std::experimental;
using mat_t = stdex::mdspan<double, 
    stdex::extents<
      stdex::dynamic_extent,
      stdex::dynamic_extent
    >
  >;
using expect_mat = tl::expected<mat_t, std::string>;

I can’t help but marvel at how nice and readable this looks: The intent of the programmer is very clear in my opinion, even without seeing the rest of the code.

int main() {
  auto raw = std::make_unique<double[]>(25);
  auto mat = mat_t(raw.get(), 5, 5);
  setup(mat)                  // zero initialize
    .and_then(set_diag)       // set the diagonal of the matrix
    .and_then(print)          // print out some values
    .or_else(report_errors);  // if unexpected, print the error and quit
}

/*
 * Program returned: 0
 * 1.0 0.0 0.0 0.0 0.0 
 * 0.0 1.0 0.0 0.0 0.0 
 * 0.0 0.0 1.0 0.0 0.0 
 * 0.0 0.0 0.0 1.0 0.0 
 * 0.0 0.0 0.0 0.0 1.0 
 */

This leads to some nice and/or interesting patterns. Say we want to check that the matrix passed to set_diag is square; we could perform our check and return an error message if the check fails, much like we could throw an exception:

expect_mat set_diag(mat_t mat) {
  if (mat.extent(0) != mat.extent(1))
    return make_unexpected("expected square matrix!");

  for (int i=0; i < mat.extent(0); i++)
    mat(i, i) = 1.0;

  return expect_mat(mat);
}

I also like using an immediately invoked lambda for this, but I’m not sure how readable/maintainable this is long-term:

auto set_diag(mat_t mat) {
  return mat.extent(0) == mat.extent(1)
    ? [=] {
      for (int i=0; i < mat.extent(0); i++)
        mat(i, i) = 1.0;
      return expect_mat(mat);
    }()
    : make_unexpected("expected square matrix!");
} 

Either way, the error is handled as expected (no pun intended):

auto report_errors(std::string_view err) {
  fmt::print("got error: '{}'\n", err);
  std::exit(EXIT_FAILURE);
}
int main() {
  auto raw = std::make_unique<double[]>(25);
  // It's not square!
  auto mat = mat_t(raw.get(), 25, 1);
  setup(mat)
    .and_then(set_diag)
    .and_then(print)
    .or_else(report_errors);
}
/*
 * Program returned: 1
 * got error: 'expected square matrix!'
 */

3 Ways to use the Monadic Functions on an Expected Value

1. Functor

We can use functors in the expected chain like so:

struct SetRow {
  std::size_t row;
  double value;
  expect_mat operator()(mat_t mat) {
    for (int i=0; i<mat.extent(1); i++)
      mat(row, i) = value;
    return mat;
  }
};
int main() {
  auto raw = std::make_unique<double[]>(25);
  auto mat = mat_t(raw.get(), 5, 5);
  setup(mat)
    .and_then(set_diag)
    .and_then(SetRow{1, 3.5})
    .and_then(print)
    .or_else(report_errors);
}
/*
 * Program returned: 0
 * 1.0 0.0 0.0 0.0 0.0 
 * 3.5 3.5 3.5 3.5 3.5 
 * 0.0 0.0 1.0 0.0 0.0 
 * 0.0 0.0 0.0 1.0 0.0 
 * 0.0 0.0 0.0 0.0 1.0 
 */

2. Binding Args to a Function that Takes Multiple Arguments

Using std::bind on a function taking more arguments would also accomplish this:

auto set_row(mat_t mat, std::size_t row, double value) {
    for (int i=0; i<mat.extent(1); i++)
        mat(row, i) = value;
    return expect_mat(mat);
}
int main() {
  auto raw = std::make_unique<double[]>(25);
  auto mat = mat_t(raw.get(), 5, 5);
  setup(mat)
    .and_then(set_diag)
    .and_then(std::bind(set_row, /*mat=*/_1, /*row=*/1, /*value=*/3.5))
    .and_then(print)
    .or_else(report_errors);
}
/*
 * Program returned: 0
 * 1.0 0.0 0.0 0.0 0.0 
 * 3.5 3.5 3.5 3.5 3.5 
 * 0.0 0.0 1.0 0.0 0.0 
 * 0.0 0.0 0.0 1.0 0.0 
 * 0.0 0.0 0.0 0.0 1.0 
 */

3. Lambdas

And of course, lambdas:

int main() {
  auto raw = std::make_unique<double[]>(25);
  auto mat = mat_t(raw.get(), 5, 5);
  setup(mat)
    .and_then(set_diag)
    .and_then([] (auto mat) {
      for (int i=0; i < mat.extent(1); i++)
        mat(3, i) = 2.0;
      return expect_mat(mat);
    })
    .and_then(print)
    .or_else(report_errors);
}
/*
 * Program returned: 0
 * 1.0 0.0 0.0 0.0 0.0 
 * 0.0 1.0 0.0 0.0 0.0 
 * 0.0 0.0 1.0 0.0 0.0 
 * 2.0 2.0 2.0 2.0 2.0 
 * 0.0 0.0 0.0 0.0 1.0
 */

Conclusion

Hopefully you’ve been won over by the elegance of expected and mdspan. Godbolt links can be found for these examples in the links below:

  1. YouTube version of this content
  2. C-style example
  3. Full expected+mdspan example
  4. Jeff Garland’s 12/2022 expected paper
  5. Sy Brand’s tl::expected
  6. Kokkos mdspan impl

GTest exposes clean interfaces for parameterizing your tests by value and by type - but what if you want both?

Getting Started

The need for this arose when converting some of Flang’s unit tests to use GTest to conform with the rest of LLVM’s testing framework (Flang is LLVM’s Fortran 2018 compiler). Linked here is the revision where we needed to parameterize tests both by value and by type.

Let’s start with an interface to test:

// addone.hpp
#pragma once

template<typename T>
auto addOne(T t) {
  return t + 1;
}

Our first test might look like this:

#include <gtest/gtest.h>
#include "addone.hpp"

TEST(AddOneTests, doAddTo5) {
  ASSERT_EQ(addOne(5), 6) << "addOne(5) != 6!";
}

What if we want to test multiple values without repeating code? Instead of something like this:

TEST(AddOneTests, doAddTo5) {
  ASSERT_EQ(addOne(5), 6) << "addOne(5) != 6!";
  ASSERT_EQ(addOne(6), 7) << "addOne(6) != 7!";
  ...
}

Value Parameterized Tests

We can parameterize our test by value with a fixture:

struct AddOneTestsFixture
    : public testing::TestWithParam<std::tuple<int, int>> {};

TEST_P(AddOneTestsFixture, doAdd) {
  int input = std::get<0>(GetParam());
  int expect = std::get<1>(GetParam());
  ASSERT_EQ(addOne(input), expect)
      << "addOne(" << input << ") != " << expect << "!";
}

INSTANTIATE_TEST_SUITE_P(
    AddOneTests,
    AddOneTestsFixture,
    testing::Values(
      std::make_tuple(1, 2),
      std::make_tuple(3, 4),
      std::make_tuple(9, 10)));

This way, our tests run over all values we pass in at the end:

$ ./tests
Running main() from /tmp/googletest-20201214-81667-fx54ix/googletest-release-1.10.0/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from AddOneTests/AddOneTestsFixture
[ RUN      ] AddOneTests/AddOneTestsFixture.doAdd/0
[       OK ] AddOneTests/AddOneTestsFixture.doAdd/0 (0 ms)
[ RUN      ] AddOneTests/AddOneTestsFixture.doAdd/1
[       OK ] AddOneTests/AddOneTestsFixture.doAdd/1 (0 ms)
[ RUN      ] AddOneTests/AddOneTestsFixture.doAdd/2
[       OK ] AddOneTests/AddOneTestsFixture.doAdd/2 (0 ms)
[----------] 3 tests from AddOneTests/AddOneTestsFixture (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (0 ms total)
[  PASSED  ] 3 tests.

Type Parameterized Tests

Our interface addOne takes a template parameter - what if we want to test this on multiple types?

First, we’ll want our fixture to take template parameters and we’ll have to declare the fixture as templated in GTest:

template<typename T>
struct AddOneTestsFixture : public ::testing::Test {};
TYPED_TEST_SUITE_P(AddOneTestsFixture);

Then we keep the first iteration of our test, this time using the TypeParam type exposed by the GTest TYPED_TEST_SUITE API:

TYPED_TEST_P(AddOneTestsFixture, doAddOne) {
  ASSERT_EQ(addOne<TypeParam>(5), 6) << "addOne(5) != 6!";
}

We’ll also have to register each test with our typed test suite:

REGISTER_TYPED_TEST_SUITE_P(AddOneTestsFixture, doAddOne);

If we had more tests, you would register them in the same statement as above:

REGISTER_TYPED_TEST_SUITE_P(AddOneTestsFixture, doAddOne, doAddTwo, ...);

We are then able to instantiate our templated test suite with all the types we intend to use with our test suite:

using Types = testing::Types<int, long long, std::size_t>;
INSTANTIATE_TYPED_TEST_SUITE_P(TestPrefix, AddOneTestsFixture, Types);

And our type-parameterized tests are working!

$ ./tests
Running main() from /tmp/googletest-20201214-81667-fx54ix/googletest-release-1.10.0/googletest/src/gtest_main.cc
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestPrefix/AddOneTestsFixture/0, where TypeParam = int
[ RUN      ] TestPrefix/AddOneTestsFixture/0.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/0.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/0 (0 ms total)

[----------] 1 test from TestPrefix/AddOneTestsFixture/2, where TypeParam = long long
[ RUN      ] TestPrefix/AddOneTestsFixture/2.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/2.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/2 (0 ms total)

[----------] 1 test from TestPrefix/AddOneTestsFixture/3, where TypeParam = unsigned long
[ RUN      ] TestPrefix/AddOneTestsFixture/3.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/3.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/3 (0 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (0 ms total)
[  PASSED  ] 4 tests.

Type and Value Parameterized Tests

Now comes the tricky part - GTest doesn’t expose an API for parameterizing tests over both values and types, so we have to do some work ourselves.

First, let’s define the types and input data we’ll be parameterizing our tests over:

template <typename T>
using ParamT = std::vector<std::tuple<T, T>>;

static std::tuple<ParamT<int>, ParamT<long long>, ParamT<std::size_t>> allParams{
  { // Test cases for int
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
  { // Test cases for long long
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
  { // Test cases for size_t
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
};

This structure assumes you may want to add test inputs later on that may be different depending on the type. If this is not the case and you know your make_tuple calls are static_cast-able into your parameter type, you may do the following to reduce code duplication:

#define ADDONE_TESTPARAMS                                                      \
  { std::make_tuple(1, 2), std::make_tuple(5, 6), std::make_tuple(9, 10), }
static std::tuple<ParamT<int>, ParamT<long long>, ParamT<std::size_t>> allParams{
        ADDONE_TESTPARAMS, ADDONE_TESTPARAMS, ADDONE_TESTPARAMS, };

Now, let’s refactor our fixture to take the types and values we just defined:

template <typename T>
struct AddOneTestsFixture : public testing::Test {
  AddOneTestsFixture() : params{std::get<ParamT<T>>(allParams)} {}
  ParamT<T> params;
};

You may notice we set params to std::get< ParamT<T> >(allParams) - this is how we accomplish type and value parameterized tests. We use the infrastructure of a type parameterized test, and leverage std::tuple to do the value parameterization.

For the actual test code, we again reuse most of the test from our first type- parameterized test, this time using the params field of our test fixture:

TYPED_TEST_P(AddOneTestsFixture, doAddOne) {

  // Iterate over the parameters configured by our fixture
  for(auto const& [input, expect] : this->params) {

    // The assertions stay the same as in our original type-parameterized test
    ASSERT_EQ(addOne(input), expect)
      << "addOne(" << input << ") != " << expect << "!";
  }
}

And voilà! Our tests are parameterized over both values and types:

$ ./tests
Running main() from /tmp/googletest-20201214-81667-fx54ix/googletest-release-1.10.0/googletest/src/gtest_main.cc
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TestPrefix/AddOneTestsFixture/0, where TypeParam = int
[ RUN      ] TestPrefix/AddOneTestsFixture/0.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/0.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/0 (0 ms total)

[----------] 1 test from TestPrefix/AddOneTestsFixture/1, where TypeParam = long long
[ RUN      ] TestPrefix/AddOneTestsFixture/1.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/1.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/1 (0 ms total)

[----------] 1 test from TestPrefix/AddOneTestsFixture/2, where TypeParam = unsigned long
[ RUN      ] TestPrefix/AddOneTestsFixture/2.doAddOne
[       OK ] TestPrefix/AddOneTestsFixture/2.doAddOne (0 ms)
[----------] 1 test from TestPrefix/AddOneTestsFixture/2 (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (0 ms total)
[  PASSED  ] 3 tests.

Full Code Listing

// addone.hpp
#pragma once
template<typename T>
auto addOne(T t) {
  return t + 1;
}
// addone_test.cpp
#include "addone.hpp"
#include <gtest/gtest.h>
#include <tuple>
#include <vector>

template <typename T>
using ParamT = std::vector<std::tuple<T, T>>;

static std::tuple<ParamT<int>, ParamT<long long>, ParamT<std::size_t>> allParams{
  {
    // Test cases for int
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
  {
    // Test cases for long long
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
  {
    // Test cases for size_t
    std::make_tuple(1, 2),
    std::make_tuple(5, 6),
    std::make_tuple(9, 10),
  },
};

template <typename T>
struct AddOneTestsFixture : public testing::Test {
  AddOneTestsFixture() : params{std::get<ParamT<T>>(allParams)} {}
  ParamT<T> params;
};

TYPED_TEST_SUITE_P(AddOneTestsFixture);

TYPED_TEST_P(AddOneTestsFixture, doAddOne) {
  for(auto const& [input, expect] : this->params) {
    ASSERT_EQ(addOne(input), expect)
      << "addOne(" << input << ") != " << expect << "!";
  }
}

REGISTER_TYPED_TEST_SUITE_P(AddOneTestsFixture, doAddOne);

using Types = testing::Types<int, long long, std::size_t>;
INSTANTIATE_TYPED_TEST_SUITE_P(TestPrefix, AddOneTestsFixture, Types);

Makefile used for this post:

CFLAGS := -I/usr/local/Cellar/googletest/1.10.0/include
CFLAGS += -L/usr/local/Cellar/googletest/1.10.0/lib -lgtest -lgtest_main
CFLAGS += -lpthread -std=c++17
CXX    =  clang++

all:
	$(CXX) addone_test.cpp $(CFLAGS) -o tests


References

Third in this series, this post focuses on leveraging environments for debugging and reproducing errors.

In the previous post about package development with Spack, we discussed environment management with Spack, particularly integration with a private repository. What are some of the benefits of this, other than onboarding new developers?

As we’ve developed our private spack repository, we’ve also added some spack environments along the way:


floodsimulationrepo
├── environments
│   ├── jupiter
│   │   └── env.yaml
│   ├── saturn
│   │   └── env.yaml
│   └── osx
│       └── env.yaml
├── packages
│   ├── ipopt
│   │   └── package.py
│   └── floodsimulation
│       └── package.py
└── repo.yaml

In this post, we’re going to look at using this configuration to reproduce errors and coordinate development and debugging within a large team.

Reproducing Finicky Errors

Bugs relating to interfaces between libraries are sometimes the most difficult to track down. To use an example from my experience, one of my teams was developing a library that builds on the optimization solver HiOp, which leverages CUDA, MAGMA, and many BLAS/LAPACK routines. After developing some functionality tests to ensure our library was performing as expected, we noticed that on some platforms with certain CUDA devices and CUDA versions, our library was failing to converge within our expected tolerance. For weeks we stepped through debuggers and discussed possible issues with our codebase and the various libraries we depend on, to no avail.

We eventually enlisted the help of collaborators from another laboratory to build and test our codebase under similar conditions on their platforms to ensure they were able to reproduce the bug. In order to ensure our collaborators were able to accurately reproduce the environments in which we found the bug, we created and distributed spack environments specific to that development snapshot.

Continuing with our FloodSimulation example, let us imagine we found a bug when running with CUDA versions 11.0.104 through 11.1 and HiOp v0.3.0 on our imaginary cluster Jupiter, and we would like other team members to reproduce the bug on different platforms using the same stack. We might create an environment file like so:


# reproduce-error.yaml
spack:
  specs:
  - floodsimulation ^floodsimulationrepo.petsc ^floodsimulationrepo.hiop
    ^cuda@11.0.104 ^hiop@0.3.0
  view: true
  packages:
    cuda:
      versions: [11.0.104, 11.1.0]
    hiop:
      versions: [0.3.0]

You can see we’ve established the exact matrix of libraries in which our bug manifests. We then asked our collaborators to install all versions of this spack environment and attempt to reproduce our bug, with all certainty that they will accurately reproduce the environment.

We also tend to track these environments in our repository like so:


floodsimulationrepo
├── environments
│   ├── jupiter
│   │   ├── env.yaml
│   │   └── reproduce-error.yaml <---
│   ├── saturn
│   │   └── env.yaml
│   └── osx
│       └── env.yaml
├── packages
│   ├── ipopt
│   │   └── package.py
│   └── floodsimulation
│       └── package.py
└── repo.yaml

so that in future threads we are able to refer back to the exact configurations which caused bugs in the past.

With this strategy, we are able to maintain a reproducible and consistent software stack with robust coordination between teams. Another potential use-case of this strategy is to coordinate profiling efforts - each month or so a spack environment which instantiates a development snapshot of the development stack may be distributed to profiling teams. This way, the profiling team may work on a known working configuration of the software stack to identify performance bottlenecks while the core development team continues developing.



Compilers are extremely proficient at catching (and even suggesting fixes for) errors in your code. What about cases that are not formally errors, but should not exist in your codebase? This post explores using Clang tools to address this case.

Example Use Case

When using portability libraries such as RAJA and Kokkos, the capture clauses of lambda statements are extremely important. When developing the open source optimization engine HiOp to use the portability library RAJA in its linear algebra library, we ran into an issue where a RAJA forall statement would implicitly capture the this pointer in an instance method which would cause memory access errors when running on a GPU device.

For example, let’s say we have the following Vector class:

#include <RAJA/RAJA.hpp>

using namespace RAJA;

struct Vector {
  Vector(std::size_t sz, int* data/*=pointer to data on device*/)
    : sz(sz), data(data) {}

  void times_constant(int factor) {

    forall<cuda_exec<128>>(RangeSegment(0, sz),
      [=] (Index_type i) {

        // Here, `data` is captured implicitly via `this` pointer
        // even though `this` does not reside on the GPU
        data[i] *= factor;
      });

  }
private:
  std::size_t sz;
  int* data;
};

As described in the comments above, the data lives on the device, but is accessed via the this pointer which does not. The solution to this memory access error is to create a local copy of the pointer outside the scope of the RAJA lambda:

  void times_constant(int factor) {
    auto* local_data = this->data;
    forall<cuda_exec<128>>(RangeSegment(0, sz),
      [=] (Index_type i) {
        // Here, `data` is no longer captured implicitly via `this` pointer
        local_data[i] *= factor;
      });
  }

Of course, this is not an error that will be captured by nvcc, hipcc or another host compiler (that I know of). At first we just examined each kernel in our codebase to ensure we did not use the this pointer implicitly. Without too much effort however, we were able to develop a small tool to search our codebase for this exact case.

Clang Tools

The first step in creating this tool was to set up a CMake project to link against LLVM libraries. The directory structure looked like this:


lambda-checker
├── CMakeLists.txt
└── src
    ├── CMakeLists.txt
    ├── driver.cpp
    └── actions.hpp

Quite simple, no? The Clang documentation for frontend actions walks through a similar task.

src/driver.cpp contains all of the code to instantiate an AST Action with a clang compiler instance (and any options you would like to give your driver), while src/actions.hpp contains the actual code to traverse the AST.

CMake

In the top-level CMakeLists.txt and after the usual CMake project preamble, we include relevant clang libraries:

# top-level CMakeLists.txt

find_package(Clang REQUIRED)

set(CMAKE_MODULE_PATH
  ${CMAKE_MODULE_PATH}
  "${LLVM_CMAKE_DIR}"
  )

include(AddLLVM)

include_directories(${LLVM_INCLUDE_DIRS})
include_directories(${CLANG_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})
add_definitions(${CLANG_DEFINITIONS})

Then, we added our plugin as a target:

# top-level CMakeLists.txt
add_executable(LambdaChecker src/driver.cpp)
target_link_libraries(LambdaChecker
  PRIVATE
  clangAST
  clangBasic
  clangFrontend
  clangSerialization
  clangTooling
  clangIndex
  clangRewrite
  clangTooling
  )

Actions

Before we start with src/actions.hpp, let us first discuss strategy. Finding a potentially dangerous lambda capture requires two predicates for each lambda function found in the AST:

  1. Does the lambda capture this?
  2. Does the lambda dereference this to access a pointer or array-like member?

I broke each of these steps into its own frontend action. The first simply searches the AST for a lambda expression and checks whether it captures this:

Find Lambda that Captures this

// src/actions.hpp
class FindLambdaCaptureThis
  : public RecursiveASTVisitor<FindLambdaCaptureThis> {
public:
  explicit FindLambdaCaptureThis(ASTContext *Context)
    : Context(Context), MemberVisitor(Context) {}

  bool VisitLambdaExpr(LambdaExpr *Expr) {
    bool FoundThis = false;
    for (auto it = Expr->capture_begin(); it != Expr->capture_end(); it++) {
      if (it->capturesThis()) {
        FoundThis = true;
        break;
      }
    }

    /* If `this` is not captured, we don't care about it. */
    if (!FoundThis)
      return true;

    const CompoundStmt* LambdaBody = Expr->getBody();
    if (LambdaBody->body_empty())
      return true;

    for(auto Stmt : LambdaBody->body()) {
      MemberVisitor.Parent = Expr;
      MemberVisitor.TraverseStmt(Stmt);
    }

    return true;
  }

private:
  ASTContext *Context;
  FindLambdaCapturedFields MemberVisitor; // we'll come back to this
};

You may notice that we define the function VisitLambdaExpr - because this is a special name recognized by Clang’s RecursiveASTVisitor, this function will be run on every AST node that matches it: every lambda expression.

Walking through the class above, we first check if the lambda expression captures this:

    bool FoundThis = false;
    for (auto it = Expr->capture_begin(); it != Expr->capture_end(); it++) {
      if (it->capturesThis()) {
        FoundThis = true;
        break;
      }
    }

If the lambda does not capture this, we can continue traversing the AST:

    if (!FoundThis)
      return true;

Then we make another check to ensure the lambda body is not empty:

    const CompoundStmt* LambdaBody = Expr->getBody();
    if (LambdaBody->body_empty())
      return true;

If all the above conditions are met, we traverse the body of the lambda to find any pointer- or array-like member variables accessed in the lambda:

    for(auto Stmt : LambdaBody->body()) {
      MemberVisitor.Parent = Expr;
      MemberVisitor.TraverseStmt(Stmt);
    }

Now that we have a higher-level AST traversal class to find lambdas that capture this, we can look at our next AST traversal class which checks for problematic uses of member variables. The member visitor will accept all forms of expressions, so we only run that visitor on the statements in the body of the lambda. You may also notice that we set the Parent field of our MemberVisitor - this is to improve the quality of the diagnostics we are able to emit. We’ll expand on this later.

Member Visitor

This AST visitor class ensures no pointer- or array-like member variables are accessed in the lambda:

struct FindLambdaCapturedFields
  : public RecursiveASTVisitor<FindLambdaCapturedFields> {
public:
  explicit FindLambdaCapturedFields(ASTContext *Context)
    : Context(Context) {}

  bool VisitMemberExpr(MemberExpr *Expr) {
    auto MemberType = Expr->getType();

    /* Problematic use of member variable! Time to generate diagnostic
     * information. */
    if (MemberType->isArrayType() || MemberType->isPointerType()) {

      /* Report diagnostic information */
      clang::DiagnosticsEngine &DE = Context->getDiagnostics();

      /* Error message describing the issue */
      auto ID = DE.getCustomDiagID(
          clang::DiagnosticsEngine::Error,
          "Found lambda capturing pointer-like member variable here.");
      DE.Report(Expr->getBeginLoc(), ID);

      /* Remark indicating which member variable triggered the error */
      ID = DE.getCustomDiagID(clang::DiagnosticsEngine::Note,
          "Member variable declared here:");
      DE.Report(Expr->getMemberDecl()->getBeginLoc(), ID);

      /* Remark with suggested change to mitigate the issue */
      ID = DE.getCustomDiagID(clang::DiagnosticsEngine::Remark,
          "Consider creating a local copy of the member variable in local scope"
          " just outside the lambda capture.");
      DE.Report(Parent->getBeginLoc(), ID);
    }
    return true;
  }

  ASTContext *Context;
  LambdaExpr *Parent=nullptr;
};

First, we check the type of the expression inside the lambda:

    auto MemberType = Expr->getType();
    /* Problematic use of member variable! Time to generate diagnostic
     * information. */
    if (MemberType->isArrayType() || MemberType->isPointerType()) {

If we enter this conditional, we’ve found a potential problem! Now what to do?

Diagnostics

Clang diagnostics are again a very rich library which won’t be fully fleshed out here - please consult the documentation for the Clang Diagnostics Engine.

The first order of business in emitting diagnostics is to get a handle to a diagnostics engine capable of printing helpful messages to the user of our tool.

      clang::DiagnosticsEngine &DE = Context->getDiagnostics();

Let’s think for a moment about the sort of diagnostic we would like to emit. I think we should report three things to the user if a lambda expression meets our criteria for an error:

  1. Location in the lambda where the member variable is used via this pointer
  2. Location of that member’s declaration
  3. Suggestion for fixing the issue

Let’s address these one-by-one: first, report the location where the member variable is potentially erroneously used.

      auto ID = DE.getCustomDiagID(
          clang::DiagnosticsEngine::Error,
          "Found lambda capturing pointer-like member variable here.");
      DE.Report(Expr->getBeginLoc(), ID);

Then, where the member variable was declared:

      /* Remark indicating which member variable triggered the error */
      ID = DE.getCustomDiagID(clang::DiagnosticsEngine::Note,
          "Member variable declared here:");
      DE.Report(Expr->getMemberDecl()->getBeginLoc(), ID);

Finally, a suggestion for fixing the error:

      /* Remark with suggested change to mitigate the issue */
      ID = DE.getCustomDiagID(clang::DiagnosticsEngine::Remark,
          "Consider creating a local copy of the member variable in local scope"
          " just outside the lambda capture.");
      DE.Report(Parent->getBeginLoc(), ID);

At this point, we’re essentially done - all we need is a bit of boilerplate code to connect our AST consumer classes up to a compiler instance:

class LambdaCaptureCheckerConsumer : public clang::ASTConsumer {
public:
  explicit LambdaCaptureCheckerConsumer(ASTContext *Context)
    : Visitor(Context) {}
  explicit LambdaCaptureCheckerConsumer(CompilerInstance& CI)
    : Visitor(&CI.getASTContext()) {}

  virtual void HandleTranslationUnit(clang::ASTContext &Context) {
    Visitor.TraverseDecl(Context.getTranslationUnitDecl());
  }
private:
  FindLambdaCaptureThis Visitor;
};

Now we’re done with the file src/actions.hpp.

Driver

In src/driver.cpp we create an AST frontend action to create and use the compiler action we defined in src/actions.hpp:

// src/driver.cpp
class LambdaCaptureCheckerAction : public clang::ASTFrontendAction {
public:
  virtual std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
    clang::CompilerInstance &Compiler, llvm::StringRef InFile) {
    return std::unique_ptr<clang::ASTConsumer>(
        new LambdaCaptureCheckerConsumer(&Compiler.getASTContext()));
  }
};

Here I omit any command line options. The documentation on this topic is rich, so if you would like to add command line options you shouldn’t have too much trouble.

// src/driver.cpp
static cl::OptionCategory LambdaCaptureCheckerCategory("LambdaChecker options");

int main(int argc, const char **argv) {
  CommonOptionsParser Op(argc, argv, LambdaCaptureCheckerCategory);

  /* Create a new Clang Tool instance (a LibTooling environment). */
  ClangTool Tool(Op.getCompilations(), Op.getSourcePathList());

  return Tool.run(newFrontendActionFactory<LambdaCaptureCheckerAction>().get());
}

Running

At this point, you could also generate a Clang plugin library from our AST actions and load it via a compiler invocation; however, I opted to stick with a standalone executable.

In order to fully test our AST action, I also created a subdirectory for examples, giving us the following directory structure:


lambda-checker
├── CMakeLists.txt
├── src
│   ├── CMakeLists.txt
│   ├── driver.cpp
│   └── actions.hpp
└── test
    └── capture-test.cpp

Where capture-test.cpp contains:

// capture-test.cpp
struct CaptureTest {

  /* Should not capture */
  int *i;

  /* Should not capture */
  int j[1];

  /* OK to capture */
  int k=0;

  /* Method which implicitly captures `this` pointer and modifies member
   * variable `i`. This is problematic when using portability libraries, as
   * member variables may not reside on the host. */
  void methodUsingBadCapturePointer() {
    auto throwaway = [=] () {
      *i = 1;
    };
  }

  /* Raw arrays should not be used either. */
  void methodUsingBadCaptureArray() {
    auto throwaway = [=] () {
      j[0] = 1;
    };
  }

  /* The preferred method to mitigate the issue outlined above is to create a
   * local copy of the pointer and modify the underlying data through the copy.
   */
  void methodUsingGoodCapture() {
    int* localCopy = i;
    auto throwaway = [=] () {
      *localCopy += 1;
    };
  }

  /* Methods which capture `this` variables which are not pointers should not
   * cause an issue. */
  void methodNotCapturingPointer() {
    auto throwaway = [=] () {
      k++;
    };
  }
};

int main() { return 0; }

I added this as a CMake target such that the compile commands database would be generated for our test case (additional documentation for compile-commands database). To do this, add the following to the top-level CMakeLists.txt:

# top-level CMakeLists.txt
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_executable(dummy-target test/capture-test.cpp)

This way, we are able to run our plugin driver directly on our test case.


$ cd lambda-capture
$ mkdir build
$ cd build
$ cmake .. && make

$ # At this point, the file `compile_commands.json` should exist in your CWD
$ # and you should be able to run the driver on our test case:
$ ./src/LambdaChecker ../test/capture-test.cpp
/path/to/lambda-capture/test/capture-test.cpp:17:8: error: Found lambda capturing pointer-like member variable here.

      *i = 1;
       ^
/path/to/lambda-capture/test/capture-test.cpp:4:3: note: Member variable declared here:
  int *i;
  ^
/path/to/lambda-capture/test/capture-test.cpp:16:22: remark: Consider creating a local copy of the member variable in local scope
just outside the lambda capture.
    auto throwaway = [=] () {

As you can see, our tool seems to be correctly identifying our domain-specific error! After developing this tool and running it over all of our codebases which make heavy use of portability libraries such as RAJA and Kokkos, we are confident that we have purged this class of error from our code.

Hopefully this demonstration helps your team weed out nasty errors like these from your codebase as well.

The full code listings can be found in the repository linked here. The code snippets used here for example purposes will not map perfectly to the current repository, but should serve as a concrete starting point.

{% include footer.html %}

References

  1. Lambda Capture tool
  2. RAJA
  3. Kokkos
  4. Clang compile commands database spec
  5. Clang compile commands tutorial
  6. HiOp
  7. Clang AST Visitor documentation
  8. Peter Goldsborough’s C++Now talk on Clang/LLVM tools

Second in this series, this post focuses on getting started with environments.

In the previous post about package development with Spack, we discussed the following points:

  • Creating a private spack repository
  • Maintaining private packages
  • Maintaining forks of upstream packages

In this and the following post(s), we'll discuss these topics in greater detail:

  • Managing environments for reproducibility
  • Using environments for development
  • Using environments for continuous integration
  • Managing configurations for distributed and multi-platform development

Managing Environments

One of the most useful features of Spack is its support for environments. An environment is a file describing a package or set of packages you wish to install, along with all of the Spack configuration options you wish to use (see the Spack documentation for more details). In the previous post, we used an example package FloodSimulation and a corresponding repository - we'll continue with this example here.

When managing a complex and mutable set of dependencies, reproducibility is extremely important. Spack environments allow the development team to document every single option they used to install a particular package or set of packages. For example, let’s say FloodSimulation has a few more dependencies: HiOp, MAGMA, CUDA > 10.1, and PETSc. Each of these packages has many configuration and installation options, and you may even want to use versions of these packages installed by your system administrator.

Our FloodSimulation package may have a package.py file that looks like this:


from spack import *
import os

class FloodSimulation(CMakePackage):
    homepage = "https://github.com/your-username/flood-simulation"
    git = "https://github.com/your-username/flood-simulation.git"
    version('1.0.0', tag='v1.0.0')

    # Declare the 'cuda' variant before using it in the conditional dependencies below
    variant('cuda', default=False, description='Build with CUDA support')

    depends_on('petsc')
    depends_on('hiop')
    depends_on('cuda@10.1:', when='+cuda')
    depends_on('magma', when='+cuda')

    def cmake_args(self):
        args = []
        if '+cuda' in self.spec:
            args.append('-DFLOODSIMULATION_ENABLE_CUDA=ON')
        return args

Let us also assume you need to install with your fork of the PETSc and HiOp spack packages, and that you’d like to use the MAGMA and CUDA installations provided by your system administrator or package manager. In this case, you might have an environment file like this:


# example-env.yaml
spack:
  specs:
  - floodsimulation ^floodsimulationrepo.petsc ^floodsimulationrepo.hiop
  view: true
  packages:
    magma:
      externals:
      - spec: magma@2.5.4
        prefix: /path/to/magma
      buildable: false
    cuda:
      externals:
      - spec: cuda@10.2.89
        prefix: /path/to/cuda
      buildable: false

Constructing your spack environment from this file is as easy as:


$ spack env create my-environment ./example-env.yaml
$ spack env activate my-environment
$ spack install

$ # To locate your new installation:
$ spack location -i floodsimulation

This way, if another developer needs to reproduce your development environment, you may distribute the environment file to perfectly recreate your installation. I recommend tracking your environments in version control along with the rest of your private spack repository with the following example directory layout:


floodsimulationrepo
├── environments
│   └── example-env.yaml
├── packages
│   ├── ipopt
│   │   └── package.py
│   └── floodsimulation
│       └── package.py
└── repo.yaml

Using Spack Environments

In my opinion, the three most significant use cases for spack environments are:

  1. Reproducible environments for development
  2. Reproducing finicky errors
  3. Continuous integration/testing

Reproducible Environments for Development

With a complex codebase, onboarding often requires significant resources for new developers. Getting started with a new codebase can be challenging, especially when building the software stack in the first place can take up to several days. I have found distributing a spack environment which is known to instantiate your software stack on a particular development machine to be a mostly frictionless way of getting new developers started on a codebase. With the example environment file above, we specify instructions to instantiate the FloodSimulation software stack on a particular machine with a couple of pre-installed packages. If you are developing on many platforms and you need developers up and running on all platforms with a short turnaround time, distributing spack environments will likely be a suitable solution.

Extending the example above, the following directory structure is a suitable way to maintain spack environments that instantiate the FloodSimulation stack on multiple platforms. Let's assume you want developers up and running on two clusters, Jupiter and Saturn, as well as OSX for local development:


floodsimulationrepo
├── environments
│   ├── jupiter
│   │   └── env.yaml
│   ├── saturn
│   │   └── env.yaml
│   └── osx
│       └── env.yaml
├── packages
│   ├── ipopt
│   │   └── package.py
│   └── floodsimulation
│       └── package.py
└── repo.yaml

In this way, you may simply instruct a new developer to run the following to get started with local development on a Mac:


$ # in the directory `floodsimulationrepo`:
$ spack env create local-development ./environments/osx/env.yaml
$ spack env activate local-development
$ spack install

$ # Any time the developer logs in to develop locally:
$ spack env activate local-development

And when they need to get started developing on the institutional cluster Jupiter:


$ spack env create remote-development ./environments/jupiter/env.yaml
$ spack env activate remote-development
$ spack install

An Aside on Spack Views

If you didn’t notice earlier, our environment file contains the line:

  view: true

This means that when you activate your spack environment (spack env activate <environment name>), spack will use a “view” by default. A view is a single installation prefix into which all packages installed via the environment are symbolically linked.

By default, spack views are installed to:

$SPACK_ROOT/var/spack/environments/<environment name>/.spack-env/

and the environment file is installed to

$SPACK_ROOT/var/spack/environments/<environment name>/spack.yaml

If our new developer activates her spack environment like so:


$ spack env activate local-development
$ ls $SPACK_ROOT/var/spack/environments/local-development/.spack-env/lib
libmagma.so libhiop.so ...

she will have access to all of her newly installed libraries in a single installation prefix, automatically added to her LD_LIBRARY_PATH and DYLD_LIBRARY_PATH.

The Spack command reference on views contains further information on this topic.

Back to Using Environments

Now that we have a feel for creating, using, and distributing spack environments, how do we develop with them?

As we saw in the aside on views above, activating a spack environment with the view option enabled (which is the default) adds the view to the user’s path, library path, etc. Assuming FloodSimulation is a CMake project (as we specified in the example spack package above) and we have written clean enough CMake to find external libraries, the workflow for building FloodSimulation should be relatively straightforward:


$ git clone https://github.com/your-username/flood-simulation.git
$ mkdir build install
$ cd build

$ # At this step, cmake should be able to find all of your libraries in the
$ # spack view:
$ cmake ../flood-simulation -DCMAKE_INSTALL_PREFIX=$PWD/../install

$ # If some libraries are not found, simply run `spack location -i package`
$ # to find the prefix for the package, and add it manually with ccmake:

$ # Assuming magma is the package cmake failed to find, run this command and
$ # copy the path:
$ spack location -i magma

$ # Manually add the path to the magma installation to cmake:
$ ccmake .

$ make install

From here, developers may continue the edit-build-test cycle for FloodSimulation, knowing they are using the correct set of dependencies.

Read on for discussion on additional uses for environments you should consider incorporating into your workflow.

Previous in this Series

Next in this Series

{% include footer.html %}

Spack is typically used for package deployment; this post, however, is about package development with Spack. First in this series, this post focuses on motivation and introduction.

Intro

Most upstream Spack packages are quite stable. In my experience, the majority of Spack packages are based on packages that have already existed for a long time, and a member of the community created a Spack package to install them (e.g. OpenMPI). In this case, the options and versions for the package are probably set in stone. There are likely very few PRs submitted for these packages, and users can rely on the dependencies and installation process staying mostly the same.

For a package under extremely heavy development however, this is not the case. To use a package manager and iterate rapidly on a package, I think there are roughly 3 criteria for that package manager:

  1. Adding a new dependency should be easy and fast
  2. Adding new versions should be easy and fast
  3. Adding new options should be easy and fast

In my opinion, using the typical Spack workflow of submitting a pull request to the upstream Spack repository for every meaningful change meets none of these criteria.

Configuring Spack Repositories

An alternative strategy is to use Spack’s support for external repositories. A repository is simply a directory which contains a repo.yaml file and a packages/ directory under which Spack packages reside.

For example, creating the file

# repo.yaml
repo:
  namespace: examplerepo

in a directory with a packages subdirectory like so:

examplerepo
├── packages
│   └── examplepackage
│       └── package.py
└── repo.yaml

is a valid repository.

Running spack repo add . in this directory will add the path to that repository to your Spack configuration:

# ~/.spack/repos.yaml
repos:
  - $spack/var/spack/repos/builtin
  - /path/to/examplerepo

After configuring your example repository, you are able to install packages from it directly! If your package conflicts with a builtin package, you may install it using the namespace set in examplerepo/repo.yaml:


$ # If examplepackage is unique:
$ spack install examplepackage

$ # If examplepackage conflicts with a builtin package:
$ spack install examplerepo.examplepackage

3rd Party Packages in Your Spack Repo

The most common case that I have found of a package conflicting with a builtin is when one of your packages relies on a fork of an upstream package, so you maintain a modified version of the upstream package (in examplerepo/packages/forked-package/package.py, for example). This allows developers to iterate quickly and modify dependencies without attempting to maintain a fork of the entire spack repository.

For example, let’s say you’re developing a package FloodSimulation which relies on OpenMPI and Ipopt. As you develop your software, you realize the Ipopt Spack package doesn’t expose all of Ipopt’s configuration options, and you need to make rather significant edits to the Ipopt Spack package. You could go through the pull request process upstream, however if you have many similar edits to many other packages, you may want to maintain an Ipopt fork in your spack repository:

floodsimulationrepo
├── packages
│   ├── ipopt
│   │   └── package.py
│   └── floodsimulation
│       └── package.py
└── repo.yaml

You may then install FloodSimulation with your fork of Ipopt like so:


$ spack install floodsimulation ^floodsimulationrepo.ipopt

If you track your private spack repository with source control, it is quite easy to maintain your small package repository while your key packages are under heavy development. Each release of your package may serve as a time to submit all of your modifications to forked packages as well as the spack package descriptions to upstream spack such that end users are able to fully take advantage of your configuration.

This strategy alone has the potential to save a significant amount of developer time when heavily developing a package. The next post will go further into managing environments and multi-platform configurations.

Next in this Series

{% include footer.html %}

New library feature coming to C++23

Tensors

A YouTuber by the name of Context Free posted a few videos about tensors in various programming languages, including C++ (first video, second video). I loved these videos, but I noticed Context Free failed to mention std::mdspan, a new library feature coming to C++23.

std::mdspan

std::mdspan (or std::experimental::mdspan until the proposal is accepted; I'll be using stdex::mdspan, as stdex is a common alias for std::experimental) is an alias for a more complex type, but at its core it is a pointer plus some number of extents. Extents are either sizes of a given dimension of an mdspan, or the sentinel std::dynamic_extent, which indicates the extent is dynamic and doesn't have to be supplied at compile time. Static extents and the dynamic_extent sentinel can be mixed and matched. This powerful capability allows users to work with data as if it had a complex matrix structure while the underlying data remain linear.

For example, given raw data, an mdspan may be constructed and passed to some library that expects a matrix with rank 3:

// https://godbolt.org/z/eWKev9nas
template<typename T>
using mat_3d = stdex::mdspan<
                 T,
                 stdex::extents<
                     stdex::dynamic_extent
                   , stdex::dynamic_extent
                   , stdex::dynamic_extent
                 >
               >;

template<typename T> void work(mat_3d<T> mat);

int main() {
  int raw[500] = {};

  work(stdex::mdspan(raw, 100, 5, 1));
  work(stdex::mdspan(raw, 4, 5, 5));
  work(stdex::mdspan(raw, 10, 1, 50));
}

Note that we will have to be more careful if we mix-and-match compile-time and run-time extents.

// https://godbolt.org/z/Phr1vh9zs
template<typename T> using mat_3d = stdex::mdspan<
  T,
  stdex::extents<
    stdex::dynamic_extent,
    3,
    stdex::dynamic_extent
  >
>;

template<typename T> void work(mat_3d<T> mat) {}

int main() {
  int raw[27] = {};
  work(mat_3d<int>(raw, 3, 3, 3));
  work(mat_3d<int>(raw, 9, 3, 0));
  // work(mat_3d<int>(raw, 3, 0, 9)) will fail because the extents don't match!
  return 0;
}

After supplying a dummy implementation of work to print the shape, the first example gives us

mat has shape [100, 5, 1]
mat has shape [4, 5, 5]
mat has shape [10, 1, 50]
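
For completeness, here's a sketch of what that dummy work might look like (my own guess, not necessarily the code behind the output above); it just prints each extent of the rank-3 mdspan:

#include <iostream>

// Prints the three extents of the mdspan, e.g. "mat has shape [100, 5, 1]"
template<typename T> void work(mat_3d<T> mat) {
  std::cout << "mat has shape [" << mat.extent(0) << ", " << mat.extent(1)
            << ", " << mat.extent(2) << "]\n";
}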

In either case, the underlying data is the same, though it’s viewed differently in each of the invocations of work. It’s no coincidence that view seems like a natural name for mdspan - mdspan was developed by the authors of the portable execution library Kokkos and inspired by the Kokkos::View type.

Subspans

Just like std::span, mdspan has support for taking subspans of a given span. This is more complicated with mdspan however, due to mdspan’s variadic extents.

There are three ways to take a slice of an mdspan:

  1. An integer index into the respective dimension
  2. A std::tuple or std::pair begin/end pair of indices
  3. The special value std::full_extent indicating all elements of the dimension should be selected in the subspan

For example:

// https://godbolt.org/z/Wrr4dhEs8
int main() {
  int raw[27] = {};
  std::iota(raw, raw+27, 0);

  // full will have shape [3,3,3]
  auto full = stdex::mdspan(raw, 3, 3, 3);

  // sub will have shape [1,3] and values [18,19,20]
  auto sub = stdex::submdspan(full, 2, std::pair{0, 1}, stdex::full_extent);
}

Note that passing a single integer index to submdspan removes that dimension entirely, while passing a pair or tuple keeps the dimension around, even if its extent is 1. In other words, pair/tuple slices do not affect the rank of the result, while integer indices reduce it.
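
As a quick sketch (reusing full and the stdex alias from the example above), compare how the two slice styles change the rank of the result:

// An integer index drops the second dimension; a pair keeps it with extent 1.
auto dropped = stdex::submdspan(full, 2, 2, stdex::full_extent);
auto kept    = stdex::submdspan(full, 2, std::pair{2, 3}, stdex::full_extent);
// dropped.rank() == 1 with shape [3]; kept.rank() == 2 with shape [1, 3].
// Both view the same three values: 24, 25, 26.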

I hope this article was enough to get you interested in mdspan and the future of C++! Make sure to check out Daisy Hollman’s appearance on CppCast, Context Free’s YouTube channel, and the reference implementation of C++23’s std::mdspan.

{% include footer.html %}

We use the 6 most popular programming languages of the 1960s to solve a leetcode problem!

Most of these languages have changed a lot since the 1960s, so the way I'm using these languages won't be quite the same as they were used back then. For example, I couldn't find a way to compile and/or run an ALGOL 60 program, so I'll have to use Algol 68, a later standard of the language. Similarly, the first APLs were intended for use on a blackboard, and the first actual implementations were all proprietary. Many of the languages were originally written on punchcards and physically inserted into a punchcard reader, and I don't have access to any of that. For the most part, I made some attempt to use an older version of each language to get a better feel for what it would be like to use the language back in the day.

I’ll be looking at the languages in ascending order based on their popularity in 1965.

Along with my solution for each language, I’ll give a little bit of history and a quote from Edsger Dijkstra (whether I agree with it or not :smile:). His scathing remarks about almost every language on this list were too good to leave out.

All these solutions and the build system needed to compile the examples can be found in this repository. ***NOTE: The repo linked above has been temporarily made private due to intellectual property questions and will be restored as soon as possible.*** See the youtube video version of this content linked here.

Problem

Find the element that is greater than both neighbors, aka the peak element.

Example 1:

Input: nums = [1,2,3,1]

Output: 2

Explanation: 3 is a peak element and your function should return the index number 2.

Example 2:

Input: nums = [1,2,1,3,5,6,4]

Output: 1 or 5

Explanation: Your function can return either index number 1 where the peak element is 2, or index number 5 where the peak element is 6.

Constraints:

  • 1 <= nums.length <= 1000
  • -2^31 <= nums[i] <= 2^31 - 1
  • nums[i] != nums[i + 1] for all valid i.

Content

Here’s how the various languages stack up. We’ll start at the bottom with APL and work our way up to Fortran.

  1. Fortran
  2. COBOL
  3. ALGOL
  4. BASIC
  5. Lisp
  6. APL

APL

APL is a mistake, carried through to perfection. It is the language of the future for the programming techniques of the past: it creates a new generation of coding bums.

Edsger Dijkstra

APL was originally designed by Ken Iverson in 1957 as a mathematical notation to be used on blackboards[ref]. Iverson was hired by IBM in 1960 to further develop the notation, at that point still just a mathematical notation and not a programming language. Iverson's paper A Programming Language was published in 1962, and would be the basis for naming the language APL. Finally, in 1966 IBM released APL\360, written in a bit under 40,000 lines of System/360 Basic Assembly Language.

Just before leaving IBM, in 1979 Iverson gave his famous ACM Turing Award lecture, Notation as a Tool of Thought, in which he builds up algorithmic intuition in the reader using the APL language[ref]. In 1980, Iverson left IBM for I.P. Sharp Associates, where he developed SHARP APL[ref]. It was just after this, in 1981, that Dyalog APL was born, potentially the most popular APL implementation to this day and a significant force in the APL community[ref]. Ken Iverson moved on from I.P. Sharp in 1990 to JSoftware to write the J programming language along with Roger Hui, a colleague from I.P. Sharp, who sadly passed away earlier this month (October 2021).

This solution was given to me by Adám Brudzewsky on the APL Farm discord server (more information on the discord here and also here). This runs on APL\360 thanks to an IBM 5110 emulator (how cool is this!!!). I used this IBM APL\360 User's Manual to play around with Adám's solution in the emulator.

NOTE: These solutions use base 1 indices

     ∇ Z←F L
[1]    Z←((L>¯1↓0,L)∧(L>1↓L,0))⍳1
     ∇
     F 1 2 3 1
3
     F 2 1 2 3
1

Here’s a snippet from the user’s manual linked earlier:

Here's a snippet from the user's manual linked earlier

And two more solutions from Adám:

Second Solution

     ∇ Z←F L;A;B
[1]    A←L>1↓L,0
[2]    B←L>¯1↓0,L
[3]    Z←(A∧B)⍳1
     ∇

Third Solution

     ∇ Z←F L
[1]    Z←(L>⌈⌿¯1 1⌽(2,⍴L)⍴L)⍳1
     ∇

I also solved it in BQN just for fun:

   i0 ← 1‿2‿3‿1
   i1 ← 1‿2‿1‿3‿5‿6‿4
   i2 ← 2‿1‿2‿3‿1
   F ← ((¯∞⊸«˜<⊢)∧(⊢>¯∞⊸»))⊐(1˙)
   F ¨ i0‿i1‿i2
┌─
· ┌·    ┌·    ┌·
  · 2   · 1   · 0
      ┘     ┘     ┘
                    ┘

And here’s the image explanation of the solution. These diagrams are meant to be read from top to bottom as the BQN program executes. You can generate diagrams like these on your own by clicking the Explain button before running your code on the Try BQN page linked here.

Here's an explanation of each part of this solution

Lisp

LISP has been jokingly described as “the most intelligent way to misuse a computer”. I think that description a great compliment because it transmits the full flavor of liberation: it has assisted a number of our most gifted fellow humans in thinking previously impossible thoughts.

Edsger Dijkstra

The 5th most popular programming language in 1965 was Lisp.

Lisp was invented by John McCarthy in 1958 at MIT with his paper Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I, paralleling Ken Iverson’s paper A Programming Language.[ref].

I used MIT Scheme for my Lisp since it seems like the oldest lisp implementation that I can still install.

Although Scheme is such an old language, it felt very futuristic and clean. I’ve used other lisps before, but I’m nowhere near an expert. Scheme felt like a wonderful and comprehensible tool. I really loved using it and I think I’ll be spending some more quality time with Scheme in these videos.

If you have a better scheme solution, please let me know, I’d love to see it.

(define shl
  (lambda (v l)
    (reverse (cons v (reverse (cdr l))))))

(define shr
  (lambda (v l)
    (cons v (reverse (cdr (reverse l))))))

(define solve
  (lambda (input)
    (reduce max 0
            (map
              (lambda (a b)
                (if a b -1))
              (map >
                   input
                   (map max
                        (shl -99999 input)
                        (shr -99999 input)))
              (iota (length input))))))

(for-each
  (lambda (l)
    (newline)
    (display (solve l))
    (newline))
  (list
    '(1 2 3 1)
    '(1 2 1 3 5 6 4)
    '(2 1 2 3 2 1)))

I did find that the builtin functions for Scheme were a bit lacking. For example I had to write my own functions to shift a value into a list.

(define shl
  (lambda (v l)
    (reverse (cons v (reverse (cdr l))))))

(define shr
  (lambda (v l)
    (cons v (reverse (cdr (reverse l))))))

Here’s what it looks like to use them. Although I had to write my own, it did come easily.

1 ]=> (shl 0 '(1 2 3))

;Value: (2 3 0)

1 ]=> (shr 0 '(1 2 3))

;Value: (0 1 2)

Let’s walk through this solution inside-out.

(define solve
  (lambda (input)
    (reduce max 0
            (map
              (lambda (a b)
                (if a b -1))
              (map >
                   input
                   (map max
                        (shl -99999 input)
                        (shr -99999 input)))
              (iota (length input))))))

For an input (1 2 3 1), we’ll find the max of shifting left and right. If a number is greater than the max of the left and right, we know it’s greater than both the left and the right value.

1 ]=> (define input '(1 2 3 1))

;Value: a

1 ]=> (map max
        (shl -99999 input)
        (shr -99999 input))

;Value: (2 3 2 3)

Now we just have to find the indices in the input where the input is greater than the last return value, the greater of either shift.

      (map >
           input
           (map max
                (shl -99999 input)
                (shr -99999 input)))

;Value: (#f #f #t #f)

Then we can just take the index if the previous map gave us a #t true value, or -1 otherwise. We then take the max of these values to find the peak element.

1 ]=> (map
        (lambda (a b)
          (if a b -1))
        (map >
             input
             (map max
                  (shl -99999 input)
                  (shr -99999 input)))
        (iota (length input)))

;Value: (-1 -1 2 -1)

Here’s the same code as before, we’ve just wrapped it in a max reduce to get our final answer.

1 ]=> (reduce max 0
              (map
                (lambda (a b)
                  (if a b -1))
                (map >
                     input
                     (map max
                          (shl -99999 input)
                          (shr -99999 input)))
                (iota (length input))))
;Value: 2

Of course now we can just wrap all that code in a function:

1 ]=> (define solve
        (lambda (input)
          (reduce max 0
                  (map
                    (lambda (a b)
                      (if a b -1))
                    (map >
                         input
                         (map max
                              (shl -99999 input)
                              (shr -99999 input)))
                    (iota (length input))))))
      
1 ]=> (solve '(1 2 3 1))

;Value: 2

And we can run it on a few inputs to verify our solution:

1 ]=> (for-each
        (lambda (l)
          (newline)
          (display (solve l))
          (newline))
        (list
          '(1 2 3 1)
          '(1 2 1 3 5 6 4)
          '(2 1 2 3 2 1)))

2
5
3

BASIC

It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.

Edsger Dijkstra

BASIC stands for Beginner's All-Purpose Symbolic Instruction Code[ref]. BASIC was designed by two math professors at Dartmouth College in 1964. John Kemeny, one of the co-creators of BASIC, attended lectures from John von Neumann and worked as Albert Einstein's mathematical assistant for a time[ref], and Tom Kurtz, the other co-creator, first proposed the concept of time-sharing[ref]. These guys were clearly pretty bright. BASIC was probably the first beginner-oriented language, with the goal of getting students started writing programs as quickly as possible.

We needed a language that could be ‘taught’ to virtually all students (and faculty) without their having to take a course.

Thomas Kurtz, co-inventor of BASIC

Visual Basic, a descendant of BASIC used in Excel and other Microsoft products, was actually one of the first languages I ever learned, writing Excel macros for the finance department of the company I worked for.

Although it was a programming on-ramp for me, I still have to side with Dijkstra on BASIC (although maybe not so harshly). BASIC was the product of some brilliant folks and it had a huge impact on the history of programming, but I can’t say I recommend it today.

This solution is pretty much the same solution I used for the rest of the programming languages: create a wrapper array so I can pretend that out of bounds in either direction is -∞. I then check all the values to see if any element is greater than the elements to its left and right.

I used FreeBASIC to run this example:

         dim i0(1 to 4) as integer
         i0(1) = 1
         i0(2) = 2
         i0(3) = 3
         i0(4) = 1

         dim i1(1 to 7) as integer
         i1(1) = 1
         i1(2) = 2
         i1(3) = 1
         i1(4) = 3
         i1(5) = 5
         i1(6) = 6
         i1(7) = 4

         function solve(prob() as integer) as integer
             ' wrapper array with "-infinity" sentinels just outside the input
             dim vals(0 to ubound(prob)+1) as integer
             vals(0) = -9999999
             vals(ubound(prob)+1) = -9999999
             for i as integer = 1 to ubound(prob)
                 vals(i) = prob(i)
             next
             for i as integer = 1 to ubound(prob)
                 if (vals(i)>vals(i+1) and vals(i)>vals(i-1)) then solve=i-1
             next
         end function

         print solve(i0())
         print solve(i1())
$ ./src/freebasic/lc-peak-element-freebasic
 2
 5

ALGOL

You may notice that Algol is the only language that does not have a scathing quote from Dijkstra. This is probably in part because Dijkstra was a significant contributor to Algol![ref]

In 1958-1959, Dijkstra was involved in a number of meetings that culminated in the publication of the report defining the ALGOL 60 language. Ironically, Dijkstra’s name does not appear in the list of 13 authors of the final report: it seems he left the committee prematurely because he could not agree with the majority opinions.[ref]

Algol/Fortran family tree:

Algol/Fortran Family Tree

Here is a language so far ahead of its time that it was not only an improvement on its predecessors but also on nearly all its successors.

Tony Hoare[ref]

I’m using the Algol68 Genie compiler-interpreter for this code. I honestly found Algol pretty usable! I saw some code that used function pointers, and it looked pretty clean. It seems like Algol has some pretty modern first-class-function capabilities. I can see why it was the language in which computer algorithms were published for many years[ref]. ALGOL was actually designed by an international committee of the ACM during 1958–60 for this purpose.

PROC solve = ([]INT elements)INT: (
  INT found := -1;
  # wrapper array with -infinity sentinels on both ends #
  [1:(UPB elements)+2]INT vs;
  vs[1] := - 999 999 999;
  vs[(UPB elements)+2] := - 999 999 999;
  FOR i FROM LWB elements TO UPB elements DO vs[1+i] := elements[i] OD;
  FOR i FROM 2 TO (UPB elements)+1 DO
    IF vs[i] > vs[i+1] AND vs[i] > vs[i-1] THEN found := i-1 FI
  OD;
  found
);

main:(
  []INT i0 = (1,2,3,1);
  []INT i1 = (1,2,1,3,5,6,4);

  print(("Input #0: ", solve(i0), new line));
  print(("Input #1: ", solve(i1), new line))
)
$ a68g ../src/algol68/lc-peak-element.al
Input #0:          +3
Input #1:          +6

I honestly wouldn’t mind writing more Algol down the line.

COBOL

The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offense.

Edsger Dijkstra

The history behind COBOL is extremely inspiring and exciting; COBOL itself, however, was very painful to use. And I only learned the most shallow bit of COBOL: in order to read more like plain English, COBOL has over 300 keywords. I can only imagine what it feels like to maintain a 500k line COBOL codebase.

I used the GnuCOBOL compiler for this example. You'll notice that everything is indented - COBOL, like several of the other languages I'm covering here, was originally written on punchcards, as explained in this article from opensource.com. Each punchcard represented a single line of code; the first six and final eight columns of each card were reserved for sequence numbers and identifiers, and column seven is an indicator column, which you'll see used here as an asterisk * for comments and a dash - for line continuation.

       ID DIVISION.
       PROGRAM-ID. ARRAYTEST.
       ENVIRONMENT DIVISION.
       DATA DIVISION. 
       WORKING-STORAGE SECTION.
      * Store both problems and their sizes and answers in one
      * structure
       01 SIZES PIC 9(3) OCCURS 2 TIMES VALUE 0.
       01 OFFSETS PIC 9(3) OCCURS 2 TIMES VALUE 0.
       01 PROBLEMS PIC 9(3) OCCURS 12 TIMES VALUE 0.
       01 CURRENT-PROBLEM PIC 9(3) VALUE 0.
       01 TMP PIC 9(3) VALUE 0.
       01 IDX PIC 9(3) VALUE 1.
       01 NPROBLEMS PIC 9(3) VALUE 2.
       01 ANSWERS PIC S9(3) OCCURS 2 TIMES VALUE -1.
       01 VALS PIC S9(5) OCCURS 15 TIMES.
       PROCEDURE DIVISION.

       100-MAIN.

      *    Set up problem [1,2,3,1]
           MOVE 4 TO SIZES(1).
           MOVE 0 TO OFFSETS(1).

           MOVE 1 TO PROBLEMS(1).
           MOVE 2 TO PROBLEMS(2).
           MOVE 3 TO PROBLEMS(3).
           MOVE 1 TO PROBLEMS(4).

      *    Set up problem [1,2,1,3,5,6,4]
           MOVE 7 TO SIZES(2).
           MOVE 4 TO OFFSETS(2).

           MOVE 1 TO PROBLEMS(5).
           MOVE 2 TO PROBLEMS(6).
           MOVE 1 TO PROBLEMS(7).
           MOVE 3 TO PROBLEMS(8).
           MOVE 5 TO PROBLEMS(9).
           MOVE 6 TO PROBLEMS(10).
           MOVE 4 TO PROBLEMS(11).

      *    Run solve procedure on both problems
           PERFORM VARYING CURRENT-PROBLEM FROM 1 BY 1 UNTIL CURRENT-PRO
      -BLEM > NPROBLEMS
             MOVE OFFSETS(CURRENT-PROBLEM) TO IDX
             PERFORM SOLVE
             DISPLAY ANSWERS(CURRENT-PROBLEM) END-DISPLAY
           END-PERFORM.

           STOP RUN.

       SOLVE.
           MOVE -99999 TO VALS(1).
           MOVE -99999 TO VALS(SIZES(CURRENT-PROBLEM)).
           PERFORM VARYING IDX FROM 1 BY 1 UNTIL IDX>SIZES(CURRENT-PROBL
      -EM)
             COMPUTE TMP = IDX + OFFSETS(CURRENT-PROBLEM) END-COMPUTE
             MOVE PROBLEMS(TMP) TO VALS(1+IDX)
           END-PERFORM.

           PERFORM VARYING IDX FROM 2 BY 1 UNTIL IDX>SIZES(CURRENT-PROBL
      -EM)
             
             COMPUTE TMP = IDX + OFFSETS(CURRENT-PROBLEM) END-COMPUTE
             IF PROBLEMS(TMP) > PROBLEMS(TMP - 1)
      -AND PROBLEMS(TMP) > PROBLEMS(TMP + 1)
               MOVE IDX TO ANSWERS(CURRENT-PROBLEM)
             END-IF
           END-PERFORM.

       PRINT-AR.
           DISPLAY "IDX=" IDX " VALUE=" PROBLEMS(IDX) END-DISPLAY.

Running gives:

$ ./src/cobol/lc-peak-element-cobol
+003
+006

I certainly felt the weight of this when I tried to write this function: Every time I had to change a condition that wrapped a line, I would join the lines together and figure out where the new line break should be, and make sure to get the - character in the 7th column. I’m sure there are some more modern conventions around COBOL considering 5 billion lines of new COBOL code are written every year[ref], but I’m pretty content not to write any COBOL for a while.

       SOLVE.
           MOVE -99999 TO VALS(1).
           MOVE -99999 TO VALS(SIZES(CURRENT-PROBLEM)).
           PERFORM VARYING IDX FROM 1 BY 1 UNTIL IDX>SIZES(CURRENT-PROBL
      -EM)
             COMPUTE TMP = IDX + OFFSETS(CURRENT-PROBLEM) END-COMPUTE
             MOVE PROBLEMS(TMP) TO VALS(1+IDX)
           END-PERFORM.

           PERFORM VARYING IDX FROM 2 BY 1 UNTIL IDX>SIZES(CURRENT-PROBL
      -EM)
             
             COMPUTE TMP = IDX + OFFSETS(CURRENT-PROBLEM) END-COMPUTE
             IF PROBLEMS(TMP) > PROBLEMS(TMP - 1)
      -AND PROBLEMS(TMP) > PROBLEMS(TMP + 1)
               MOVE IDX TO ANSWERS(CURRENT-PROBLEM)
             END-IF
           END-PERFORM.

Now on to the inspiring stuff. TL;DR: COBOL was designed by a consensus-driven committee with a huge focus on portability, and the effort was led by women.

In 1959 Grace Hopper, a Navy reserve officer, organized a meeting of users and manufacturers to conceive of a programming language in response to Mary Hawes's call for a portable programming language: one that could be compiled and run on computers from multiple manufacturers. You can read more about the history of COBOL in this Twitter thread from Bryce Lelbach, which you should definitely check out.

I think we have a lot to learn from COBOL, and a lot to be thankful for. The ISO didn’t come along until 1988, long after Grace Hopper initiated the committee for the development of COBOL. I’m sure we owe a huge debt to COBOL and the women-led consensus-driven community that gave us COBOL. I want to learn all I can from that history.

I don’t want to write any more COBOL than I have to though :smile:.

Fortran

FORTRAN, ‘the infantile disorder’, by now nearly 20 years old, is hopelessly inadequate for whatever computer application you have in mind today: it is now too clumsy, too risky, and too expensive to use.

Edsger Dijkstra

Fortran, the grand finale, #1 on our list. I completely disagree with Dijkstra on this one - I love Fortran’s history and I occasionally write it professionally.

In 1953, John Backus submitted a proposal to his bosses at IBM to develop a more practical alternative to assembly language[ref]. The first compiler didn't come around until 1957. The name "Fortran" is derived from FORmula TRANslation, and the language very quickly became the lingua franca for scientific computing.

Fortran still is one of the most used programming languages in high performance computing (HPC), and it’s not going away any time soon. The Flang project, part of the LLVM project, is a modern Fortran compiler targeting the 2018 standard with support for OpenACC, OpenMP and other cool optimization stuff. It’s written in wonderfully modern and well-maintained C++, and I use the C++ style guide from Flang for my technical teams at my day job. Flang is definitely worth keeping an eye on, I think it will become a significant force in HPC in the coming years.

I tried using the g77 compiler from GCC 3.4.4 for this example to get a better feel for historical Fortran, but after removing some syntax I didn't realize came from Fortran 90, I realized the 32-bit build of GCC would probably not be able to target my modern machine.

$ g77 ./src/fortran/lc-peak-element.f
/tmp/ccYc9ESc.s: Assembler messages:
/tmp/ccYc9ESc.s:39: Error: invalid instruction suffix for `push'
/tmp/ccYc9ESc.s:91: Error: invalid instruction suffix for `push'

This at least let me know my code was valid Fortran 77! After clearing up the compilation errors g77 found, I just built the example with gfortran from GCC 11.2.

c     Comments require a 'c' in the first column, just like the
c     punchcards!
      program main
        integer :: ret
        integer :: i0(4)
        integer :: i1(7)

c       First Problem
        i0(1) = 1
        i0(2) = 2
        i0(3) = 3
        i0(4) = 1

c       Second Problem
        i1(1) = 1
        i1(2) = 2
        i1(3) = 1
        i1(4) = 2
        i1(5) = 4
        i1(6) = 2
        i1(7) = 1

        ret = -1

        call solve(i0, 4, ret)
        print*,ret

        call solve(i1, 7, ret)
        print*,ret

      end program

      subroutine solve(input, len, ret)
        integer :: input(len)
        integer :: len
        integer :: ret

        integer :: vals(len+2)
        integer :: i

c       wrapper array with -infinity sentinels on both ends
        vals(1) = -99999
        vals(len+2) = -99999

        do i = 1, len
          vals(i+1) = input(i)
        end do

        do i = 2, len+1
          if (vals(i) > vals(i+1) .and. vals(i) > vals(i-1)) then
            ret = i-2
            return
          endif
        enddo

      end subroutine

For Fortran 90 I can do array assignments like this, which I really like:

vals(2:len+1) = input

instead of this:

        do i = 1, len
          vals(i+1) = input(i)
        end do

and I don’t get to use the intent keyword, both of which were big drawbacks, but this really wasn’t too bad.

Looking to the future of Fortran, GCC’s GFortran is very actively maintained, the newest NVIDIA HPC SDK has fantastic Fortran support, the new US Dept. of Energy Exascale supercomputer Frontier will use AMD GPUs which have hipfort, a Fortran interface to AMD GPU libraries, and Intel’s GPU platform and Fortran compiler are widely used as well. Fortran has a wonderfully rich history, and it’s certainly a part of our future.

Much of my work has come from being lazy. I didn’t like writing programs, and so, when I was working on the IBM 701, writing programs for computing missile trajectories, I started work on a programming system to make it easier to write programs.

John Backus

Conclusion

I hope you all enjoyed this foray into the history of programming languages (and computing in general)!

{% include footer.html %}


Solving a leetcode problem in four programming languages using two acceleration paradigms!

NOTE: This post is a transcript of the youtube video linked here.

In addition to the four languages, we'll be using various combinations of two programming paradigms common in distributed computing: using a GPU to perform some of the calculations, and using MPI to distribute our calculation among multiple processes, potentially on multiple machines.

We'll be looking at this leetcode problem, which is to determine whether a 9 x 9 Sudoku board is valid, but not necessarily solvable. Each row, column, and 3 x 3 sub-box of the grid must contain the digits 1-9 without repetition.

Let’s jump right in to our BQN solution.

Content

  1. BQN
  2. Approach
  3. Python
  4. Python And MPI
  5. C++
  6. C++ And MPI
  7. C++ And CUDA
  8. C++ And CUDA And MPI
  9. Fortran
  10. Fortran And MPI
  11. Conclusion
  12. YouTube Video Description

BQN

This is what our sudoku boards will look like:

# two 8s in the first block
bad ← ⟨8, 3, 0, 0, 7, 0, 0, 0, 0
       6, 0, 0, 1, 9, 5, 0, 0, 0
       0, 9, 8, 0, 0, 0, 0, 6, 0
       8, 0, 0, 0, 6, 0, 0, 0, 3
       4, 0, 0, 8, 0, 3, 0, 0, 1
       7, 0, 0, 0, 2, 0, 0, 0, 6
       0, 6, 0, 0, 0, 0, 2, 8, 0
       0, 0, 0, 4, 1, 9, 0, 0, 5
       0, 0, 0, 0, 8, 0, 0, 7, 9⟩

# valid sudoku
good ← ⟨5, 3, 0, 0, 7, 0, 0, 0, 0
        6, 0, 0, 1, 9, 5, 0, 0, 0
        0, 9, 8, 0, 0, 0, 0, 6, 0
        8, 0, 0, 0, 6, 0, 0, 0, 3
        4, 0, 0, 8, 0, 3, 0, 0, 1
        7, 0, 0, 0, 2, 0, 0, 0, 6
        0, 6, 0, 0, 0, 0, 2, 8, 0
        0, 0, 0, 4, 1, 9, 0, 0, 5
        0, 0, 0, 0, 8, 0, 0, 7, 9⟩

And here is our full solution. This solution will be the basis for all of our later solutions.

F ← {𝕊𝕩:
  Fl ← 0⊸≠⊸/                       # Filter 0s out
  Dup ← (∨´∾´)¬∘∊¨                 # Are there any duplicates?

  rs ← Dup Fl¨(9/↕9)⊔𝕩             # Check rows
  cs ← Dup Fl¨(81⥊↕9)⊔𝕩            # Check columns

  bi ← 27⥊3/↕3
  bs ← Dup Fl¨(bi∾(3+bi)∾(6+bi))⊔𝕩 # Check blocks

  (bs ∨ rs ∨ cs)⊑"true"‿"false"
}

This first line is a function to filter out any 0s:

   Fl ← 0⊸≠⊸/
   Fl ⟨5, 3, 0, 0, 7, 0, 0, 0, 0⟩
⟨ 5 3 7 ⟩

Here we have another utility function to return an integer indicating whether any duplicates were found in any sublists:

   Dup ← (∨´∾´)¬∘∊¨
   Dup ⟨⟨5, 3, 7⟩, ⟨1, 2, 3⟩⟩
0
   Dup ⟨⟨5, 3, 7⟩, ⟨1, 2, 2⟩⟩
1

Next we check for duplicates in all the filtered rows and columns:

   rs ← Dup Fl¨(9/↕9)⊔𝕩
   cs ← Dup Fl¨(81⥊↕9)⊔𝕩

These ranges are used to create indices for grouping the values in 𝕩. I'll show a trimmed down version of their output here to give you an idea:

   3‿3⥊(3/↕3) # For the rows
┌─       
╵ 0 0 0  
  1 1 1  
  2 2 2  
        ┘
   3‿3⥊(9⥊↕3) # For the columns
┌─       
╵ 0 1 2  
  0 1 2  
  0 1 2  
        ┘

Next I do something similar to get the indices for the boxes.

   bi ← 27⥊3/↕3
   3‿9⥊bi
┌─                   
╵ 0 0 0 1 1 1 2 2 2  
  0 0 0 1 1 1 2 2 2  
  0 0 0 1 1 1 2 2 2  
                    ┘

This creates indices for the first three boxes, and you can probably imagine how to extend this to get the indices for all the boxes. I just add three to the previous indices, and then add six, and then append them all together. Here's the second layer:

   3‿9⥊bi+3
┌─                   
╵ 3 3 3 4 4 4 5 5 5  
  3 3 3 4 4 4 5 5 5  
  3 3 3 4 4 4 5 5 5  
                    ┘

And the final layer:

   3‿9⥊bi+6
┌─                   
╵ 6 6 6 7 7 7 8 8 8  
  6 6 6 7 7 7 8 8 8  
  6 6 6 7 7 7 8 8 8  
                    ┘

And all three layers of indices stacked on top of each other:

   9‿9⥊(bi∾(3+bi)∾(6+bi))
┌─                   
╵ 0 0 0 1 1 1 2 2 2  
  0 0 0 1 1 1 2 2 2  
  0 0 0 1 1 1 2 2 2  
  3 3 3 4 4 4 5 5 5  
  3 3 3 4 4 4 5 5 5  
  3 3 3 4 4 4 5 5 5  
  6 6 6 7 7 7 8 8 8  
  6 6 6 7 7 7 8 8 8  
  6 6 6 7 7 7 8 8 8  
                    ┘

Using these indices, I group all the elements of the input, and then check all of them for duplicates:

   bs ← Dup Fl¨(bi∾(3+bi)∾(6+bi))⊔𝕩 # Check blocks

And in the end I check if there were duplicates in the blocks, in the rows, or in the columns, and use that to index into our strings that indicate whether our sudoku board is valid or not.

   (bs ∨ rs ∨ cs)⊑"true"‿"false"

Approach

Before we move on to the Python solution, I’d like to talk about our approach to this solution in the rest of the languages, because they will all be pretty similar.

Just like in the BQN solution, we have three collections which represent the validity of the rows, another for the columns, and a third for the blocks.

Here I have a subset of a sudoku board on the bottom.

Initial Row, Column, and Block Matrices

In our procedural languages, we’ll create an array thrice the size of the grid to hold these values.

Note that this is not as space (or time) efficient as many of the solutions that you can find on the discussion page for the leetcode problem, but it is much easier to parallelize and that’s really the point of this video.

Let’s now walk through a few steps of our algorithm starting at the second row and first column of our sudoku board, which relates to the second row of our “row matrix.”

Because we’re looking at our row matrix, we’ll take the row index in our sudoku board as the row for our row matrix, and we’ll take the value in the cell, in this case 6, as the column in our row matrix. We’ll increment the value at this location in our row matrix, or in the first layer of our 3-d sum matrix that we’ll use to get our final answer.

Checking Rows

Let’s move on to check the first row and second column of our sudoku board for our column matrix. Because we’re looking at our column matrix, or the second layer of our final sum array, we’ll use the column index as the row index in our column matrix, and the value in that cell for the column index in our column matrix.

We’ll increment the value at this location in our column matrix, or in the second layer of our 3-d sum matrix that we’ll use to get our final answer.

Checking Columns

Finally, let’s look at the first block in our sudoku board, which corresponds to the first row in our block matrix, and let’s look at the first cell in that block. The value in the first cell in the first block is 8, so we’ll increment the first row and eighth column in our block matrix.

Checking Blocks

If we then perform these three operations for every cell in the sudoku board, we’ll have a final matrix that indicates all the row-column-block-value combinations that we have, and if any cell in that final matrix has a value greater than one, then our board is invalid.

If we were then to check the final cell in the first block of our board, we would find that the eighth element of the first row of our block matrix would be incremented again, which would mean we have an invalid board!

Checking Last Element of Block

If any value in our final array is greater than one, then we know we have at least one duplicate in at least one row, column, or block.

What’s neat about this solution is that no single operation depends on any other operation as long as we perform our operations atomically. This way, our work can be performed on multiple machines or different devices, and as long as we synchronize at the end, our solution will be the same.

Now that we’ve talked strategy, let’s see what this looks like in our Python solution.

Python

Here’s our simple Python solution:

shape = 9
blksz = 3
def solve(board):
    ar = [[[0 for j in range(3)] for i in range(shape)] for k in range(shape)]
    for r in range(shape):
        for c in range(shape):
            v = board[r][c]
            if 0 == v:
                continue
            ar[r][v - 1][0] += 1
            ar[c][v - 1][1] += 1

            bi = (r // blksz) * blksz + (c // blksz)
            ar[bi][v - 1][2] += 1
    return max(max(i) for j in ar for i in j) < 2

You can see here that we increment the value in the first layer of our full 3D matrix according to the row and the value in the cell currently being examined:

            ar[r][v - 1][0] += 1

We do the same for our column matrix:

            ar[c][v - 1][1] += 1

And finally for our block matrix, it just takes a little bit of math to figure out what our block index is.

            bx = r // blksz
            by = c // blksz
            bi = bx * blksz + by
            ar[bi][v - 1][2] += 1

I use this main function to run the python solution:

if __name__ == "__main__":
    for b in sudoku_9x9.boards():
        print(solve(b))

We run our example with two valid boards and two invalid boards and get the answers we expect:

$ python ./src/python/lc-valid-sudoku.py
True
True
False
False

Python And MPI

Now we’ll look at another python example, but this time one that uses MPI to distribute the calculations.

MPI provides a lot of infrastructure for distributed computing: using the mpirun command spawns N processes, each of which knows how many processes were spawned, what its unique process ID is, and some other relevant information. These processes may be spawned on multiple machines even, and MPI gives us the tools to communicate between these processes. We’ll take advantage of this infrastructure to perform our calculations on multiple processes.

import itertools

import numpy as np
from mpi4py import MPI
shape = 9
blksz = 3
comm = MPI.COMM_WORLD
def solve(board, comm):
    ar = np.zeros((9, 9, 3), dtype=np.int64)
    chunk = (81 + comm.size - 1) // comm.size
    subscripts = (*itertools.product(range(9), range(9)),)
    for i in range(comm.rank * chunk, (comm.rank * chunk) + chunk):
        if i >= 81:
            break
        r, c = subscripts[i]
        v = board[r][c]
        if 0 == v:
            continue
        ar[r][v - 1][0] += 1
        ar[c][v - 1][1] += 1
        bi = (r // blksz) * blksz + (c // blksz)
        ar[bi][v - 1][2] += 1
    gar = np.zeros((9 * 9 * 3,), dtype=np.int64)
    comm.Reduce([ar.flatten(), MPI.INT64_T], [gar, MPI.INT64_T], op=MPI.SUM, root=0)
    comm.Barrier()
    return max(gar.flatten()) < 2 if 0 == comm.rank else False

This is what the setup looks like to get an MPI program running.

if __name__ == "__main__":
    if 0 == comm.rank:
        print("Running with size {0}".format(comm.size))

    for b in sudoku_9x9.boards():
        comm.Barrier()
        if 0 == comm.rank:
            ret = solve(b, comm)
            print(ret)
        else:
            solve(b, comm)

Here we chunk our work up based on how many processes we have:

    chunk = ((9 * 9) + comm.size - 1) // comm.size

Say we're given 5 processes and we have 81 cells to check (because that's the size of our sudoku board). chunk is then the smallest amount of work per process such that all of the work gets done; this is a common calculation in parallel computing. Note that the final process may exit early if the total work is not evenly divisible by the chunk size. The calculation looks something like this:

>>> work = 81
>>> size = 5
>>> chunk = (work + size - 1) // size
>>> chunk
17
>>> chunk * size
85

We then generate all the possible combinations of rows and columns, and iterate over only the elements that fall within the chunk of work that belongs to our current MPI process.

    subscripts = (*itertools.product(range(9), range(9)),)
    for i in range(comm.rank * chunk, (comm.rank * chunk) + chunk):
        if i >= work:
            break
        r, c = subscripts[i]

The rest of this code is exactly the same as our serial implementation:

        v = board[r][c]
        if 0 == v:
            continue
        ar[r][v - 1][0] += 1
        ar[c][v - 1][1] += 1
        bi = (r // blksz) * blksz + (c // blksz)
        ar[bi][v - 1][2] += 1

This next bit is more interesting. We create a global array with the size we need to hold our final sum matrix, and we use the MPI function Reduce. This function will perform the operation op, in this case MPI.SUM, to join the arrays ar and gar together on rank 0 specified by the root argument. This means that our final summed matrix for all components of the solution is on the MPI process with rank 0. We can then check if we have any cells with values greater than one, and return that value if we’re on rank 0. Otherwise, we can just return false since no other rank has the final array.

    gar = np.zeros((9 * 9 * 3,), dtype=np.int64)
    comm.Reduce([ar.flatten(), MPI.INT64_T], [gar, MPI.INT64_T], op=MPI.SUM, root=0)
    comm.Barrier()
    return max(gar.flatten()) < 2 if 0 == comm.rank else False

Here I run the example on 5 processes, and we see we get the same solution as with our serial example.

$ mpirun -n 5 python ./src/python/lc-valid-sudoku-mpi.py
Running with size 5
True
True
False
False

Now let’s move on to all our C++ solutions.

C++

All of our C++ solutions will use a board like this:

const auto board = std::array<int, 81>{
  5, 3, 0,  0, 7, 0,  0, 0, 0,
  6, 0, 0,  1, 9, 5,  0, 0, 0,
  0, 9, 8,  0, 0, 0,  0, 6, 0,
  
  8, 0, 0,  0, 6, 0,  0, 0, 3,
  4, 0, 0,  8, 0, 3,  0, 0, 1,
  7, 0, 0,  0, 2, 0,  0, 0, 6,
  
  0, 6, 0,  0, 0, 0,  2, 8, 0,
  0, 0, 0,  4, 1, 9,  0, 0, 5,
  0, 0, 0,  0, 8, 0,  0, 7, 9
};

Here’s our serial C++ solution.

auto isgood(const std::array<int, 81> &board) -> bool {
  auto ar = std::vector<int>(9 * 9 * 3, 0);
  for (int r = 0; r < 9; r++)
    for (int c = 0; c < 9; c++) {
      const auto v = board[idx2(r, c)];
      if (0 == v)
        continue;
      ar[idx3(r, v - 1, 0)] += 1;
      ar[idx3(c, v - 1, 1)] += 1;
      const auto bi = (r / blksz) * blksz + (c / blksz);
      ar[idx3(bi, v - 1, 2)] += 1;
    }
  const auto m =
    std::accumulate(ar.begin(), ar.end(), -1, max);
  return m < 2;
}

You can see pretty much everything about our solution is the same so far. You may notice the idx2 and idx3 functions - these just calculate the linear index from subscripts so we can almost use 2d and 3d subscripts while keeping our arrays totally linear.

Here's our idx2 function for example. I'll be using these functions for the rest of my solutions since they make the code much more readable.

int idx2(int r, int c) {
  return (r * 9) + c;
};
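
I won't quote idx3 from the repository here, but it has to agree with the 9 x 9 x 3 layout used when indexing ar above; a plausible sketch (my assumption, not necessarily the repository's exact definition) is:

// Linear index into a 9 x 9 x 3 array: i selects the row/column/block,
// j selects the value (0-8), and k selects the plane (row/col/block counts).
int idx3(int i, int j, int k) {
  return ((i * 9) + j) * 3 + k;
}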

Here at the end I find the max value again and check to make sure it’s less than two:

  const auto m = 
    std::accumulate(ar.begin(), ar.end(), -1, max);
  return m < 2;

Running our executable gives us the same answers as our previous implementations:

$ ./src/cpp/lc-valid-sudoku
true
true
false
false

C++ And MPI

Here we have our MPI distributed C++ solution, let’s walk through it in a few steps.

auto isgood(const std::array<int, 81> &board, int rank, int size) -> bool {
  const auto chunk = (81 + size - 1) / size;
  auto ar = std::vector<int>(81 * 3, 0);
  auto indices = std::vector<std::pair<int, int>>(81 * size, std::make_pair(-1, -1));
  for (int r = 0; r < 9; r++)
    for (int c = 0; c < 9; c++)
      indices[idx2(r, c)] = std::make_pair(r, c);
  for (std::size_t i = chunk * rank; i < chunk + (chunk * rank); i++) {
    const auto &[r, c] = indices[i];
    const auto v = board[idx2(r, c)];
    if (r < 0 or 0 == v) continue;
    ar[idx3(r, v - 1, 0)] += 1;
    ar[idx3(c, v - 1, 1)] += 1;
    const auto bi = (r / blksz) * blksz + (c / blksz);
    ar[idx3(bi, v - 1, 2)] += 1;
  }
  std::vector<int> gar(9 * 9 * 3, 0);
  MPI_Reduce(ar.data(), gar.data(), gar.size(), MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  return 0 == rank ? std::accumulate(gar.begin(), gar.end(), -1, std::max) < 2
                   : false;
}

All the setup is the same between the last several solutions.

Astute viewers may recognize this as a cartesian product, but I couldn’t find a nice way to do this with the STL algorithms. If any viewers know of a nicer way to generate the cartesian product of two containers, please let me know.

  for (int r = 0; r < 9; r++)
    for (int c = 0; c < 9; c++)
      indices[idx2(r, c)] = std::make_pair(r, c);

The core loop is much the same as our other solutions, aside from unpacking the row and column as a tuple.

  for (std::size_t i = chunk * rank; i < chunk + (chunk * rank); i++) {
    const auto &[r, c] = indices[i];
    if (r < 0)
      continue;
    const auto v = board[idx2(r, c)];
    if (0 == v)
      continue;
    ar[idx3(r, v - 1, 0)] += 1;
    ar[idx3(c, v - 1, 1)] += 1;
    const auto bi = (r / blksz) * blksz + (c / blksz);
    ar[idx3(bi, v - 1, 2)] += 1;
  }

This section is exactly equivalent to the Python version below. This should give you an idea of what it’s like to use the raw C and Fortran interfaces to MPI.

  std::vector<int> gar(9 * 9 * 3, 0);
  MPI_Reduce(ar.data(), gar.data(), gar.size(), MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  return 0 == rank ? std::accumulate(gar.begin(), gar.end(), -1, max) < 2
                   : false;

Python version:

    gar = np.zeros((9 * 9 * 3,), dtype=np.int64)
    comm.Reduce([ar.flatten(), MPI.INT], [gar, MPI.INT], op=MPI.SUM, root=0)
    comm.Barrier()
    return max(gar.flatten()) < 2 if 0 == comm.rank else False

In my main function I iterate over the same boards and use some extra logic so we only see the results that rank 0 gave back:

int main(int argc, char **argv) {
  int size, rank;
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm_size(comm, &size);
  MPI_Comm_rank(comm, &rank);
  for (const auto &board : all_boards) {
    bool ret;
    if (0 == rank) {
      ret = isgood(board, rank, size);
      std::cout << bool2str(ret) << "\n";
    } else isgood(board, rank, size);
    MPI_Barrier(comm);
  }
  MPI_Finalize();
  return 0;
}

Running this works just as all our previous solutions did:

$ mpirun -n 5 ./src/cpp/lc-valid-sudoku-mpi
true
true
false
false

We’ll now take a look at our CUDA-enabled solution.

C++ And CUDA

Here’s our single-process CUDA implementation. For the most part I’m using raw CUDA, but I also use a few helpers from Thrust, such as the type-safe device malloc and free and some pointer-casting functions. For those who are unfamiliar, the funny-looking calls with the triple angle brackets are how you launch a raw CUDA kernel. They let you pass launch parameters to the CUDA runtime - here <<<1, dim3(9, 9)>>> asks for a single block of 9×9 threads, one thread per board cell.

auto isgood(const Board &board) -> bool {
  auto d_ar = device_malloc<int>(81 * 3);
  thrust::fill(d_ar, d_ar + (81 * 3), 0);  // device_malloc does not zero the counters
  setar<<<1, dim3(9, 9)>>>(
      raw_pointer_cast(d_ar),
      raw_pointer_cast((thrust::device_vector<int>(board.begin(), board.end())).data()));
  cudaDeviceSynchronize();
  const auto m = thrust::reduce(d_ar, d_ar + (81 * 3), -1, thrust::maximum<int>());
  device_free(d_ar);
  return m < 2;
}

I have the following using statements to make the code a little more readable hopefully.

using thrust::device_malloc;
using thrust::device_free;
using thrust::device_vector;
using thrust::host_vector;
using thrust::raw_pointer_cast;

Along with the previous code that should look pretty familiar at this point, I define two other CUDA kernels. The first is this short setrc kernel, which sets rows and columns based on the kernel launch parameters I pass. This is a shortcut for a cartesian product of the rows and columns that runs on the GPU.

__global__ void setrc(int *rows, int *cols) {
  const int r = threadIdx.x, c = threadIdx.y;
  rows[idx2(r, c)] = r;
  cols[idx2(r, c)] = c;
}

The other kernel is this setar function, which is the same core kernel that’s been at the heart of all of our solutions so far.

__global__ void setar(int *ar, const int *board) {
  const auto row = threadIdx.x, col = threadIdx.y;
  const int value = board[idx2(row, col)];
  if (0 == value) return;
  atomicAdd(&ar[idx3(row, value - 1, 0)], 1);
  atomicAdd(&ar[idx3(col, value - 1, 1)], 1);
  const int bi = (row / blksz) * blksz + (col / blksz);
  atomicAdd(&ar[idx3(bi, value - 1, 2)], 1);
}

Outside of those two kernels, the solution should look pretty familiar at this point. We allocate and zero our final array and pass it to our CUDA kernel, along with the sudoku board after copying it to the GPU.

  auto d_ar = device_malloc<int>(81 * 3);
  thrust::fill(d_ar, d_ar + (81 * 3), 0);
  setar<<<1, dim3(9, 9)>>>(
    raw_pointer_cast(d_ar),
    raw_pointer_cast(
      (device_vector<int>(board.begin(), board.end())).data()
    )
  );

We then synchronize with our GPU to make sure the kernel finishes before reducing to find the maximum value with thrust::reduce, freeing our device memory, and returning whether all values fell below two.

  cudaDeviceSynchronize();
  const auto m = thrust::reduce(d_ar, d_ar+(81*3), -1, thrust::maximum<int>());
  device_free(d_ar);
  return m < 2;

Let’s move on to our most complex example, the C++ CUDA-enabled, MPI-distributed implementation.

C++ And CUDA And MPI

Now that we’re using two extra paradigms, CUDA GPU device offloading and MPI distributed computing, our code is looking noisier. It’s still pretty much the same solution as our non-distributed CUDA solution though.

auto isgood(const Board &board, int rank, int size) -> bool {
  const auto chunk = (81 + size - 1) / size;
  const auto rows = device_malloc<int>(chunk * size),
             cols = device_malloc<int>(chunk * size);
  thrust::fill(rows, rows + (chunk * size), -1);
  setrc<<<1, dim3(9, 9)>>>(raw_pointer_cast(rows), raw_pointer_cast(cols));
  auto d_ar = device_malloc<int>(81 * 3);
  thrust::fill(d_ar, d_ar + (81 * 3), 0);
  setar<<<1, chunk>>>(
      raw_pointer_cast(d_ar),
      raw_pointer_cast((device_vector<int>(board.begin(), board.end())).data()),
      raw_pointer_cast(rows), raw_pointer_cast(cols), chunk * rank);
  cudaDeviceSynchronize();
  auto h_ar = host_vector<int>(d_ar, d_ar + (81 * 3));
  auto gar = host_vector<int>(81 * 3, 0);
  MPI_Reduce(h_ar.data(), gar.data(), gar.size(), MPI_INT, MPI_SUM, 0,
             MPI_COMM_WORLD);
  device_free(rows); device_free(cols); device_free(d_ar);
  if (rank > 0)
    return false;
  const auto m = thrust::reduce(thrust::host, gar.begin(), gar.end(), -1,
                                thrust::maximum<int>());
  return m < 2;
}

The setar kernel is a bit different from our non-distributed CUDA solution since we’re only operating on a subset of our sudoku board. We set the values in our final sum matrix for the row, column and block submatrices just like before. This time, however, we’re given an offset parameter. This is because we’re not just running CUDA kernels, we’re running CUDA kernels on multiple processes and potentially multiple machines, so we’re only performing a subset of the full set of operations. This offset parameter tells us where we should start relative to the entire set of operations. We’re also not using the builtin threadIdx.y value since we’re launching our kernel in a 1D grid with precalculated row and column indices instead of a 2D grid.

__global__ void setar(int *ar, const int *board, const int *rows,
                      const int *cols, const int offset) {
  const auto i = offset + threadIdx.x;
  const int r = rows[i], c = cols[i];
  if (r < 0)  // padded entry past the end of the board
    return;
  const int value = board[idx2(r, c)];
  if (0 == value)
    return;
  atomicAdd(&ar[idx3(r, value - 1, 0)], 1);
  atomicAdd(&ar[idx3(c, value - 1, 1)], 1);
  const int bi = (r / blksz) * blksz + (c / blksz);
  atomicAdd(&ar[idx3(bi, value - 1, 2)], 1);
}

If we return to the start of our top-level function, you’ll see that we calculate the work that should be performed on this MPI process. We also set up our row and column indices using our cartesian product kernel.

  const auto chunk = (81 + size - 1) / size;
  const auto rows = device_malloc<int>(chunk * size),
             cols = device_malloc<int>(chunk * size);
  thrust::fill(rows, rows + (chunk * size), -1);
  setrc<<<1, dim3(9, 9)>>>(raw_pointer_cast(rows), raw_pointer_cast(cols));
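
To make the chunk arithmetic concrete, here’s a tiny standalone sketch of how the ceiling division splits the 81 cells across the 5 ranks used in the earlier mpirun examples:

#include <cstdio>

int main() {
  const int cells = 81, size = 5;              // 5 ranks, as in the mpirun runs above
  const int chunk = (cells + size - 1) / size; // ceiling division: 17 cells per rank
  for (int rank = 0; rank < size; rank++)
    std::printf("rank %d handles indices [%d, %d)\n", rank, chunk * rank,
                chunk * (rank + 1));
  // The last rank's range runs past index 80; those entries were filled with -1
  // and get skipped by the r < 0 check in the kernel.
  return 0;
}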

We then set up our final sum matrix on the device:

  auto d_ar = device_malloc<int>(81 * 3);
  thrust::fill(d_ar, d_ar + (81 * 3), 0);

And then launch our core kernel to perform the operations assigned to the current rank:

  setar<<<1, chunk>>>(
      raw_pointer_cast(d_ar),
      raw_pointer_cast((device_vector<int>(board.begin(), board.end())).data()),
      raw_pointer_cast(rows), raw_pointer_cast(cols), chunk * rank);

We synchronize with our GPU device and copy the data to a host vector before reducing the final sum array across all of our ranks using MPI. Note that if we used a GPU-enabled MPI provider we could send the data on the device directly to another system’s GPU without copying the memory to the host (a sketch of that follows the snippet below), but this has other complications so I kept it simple for this example.

  cudaDeviceSynchronize();
  auto h_ar = host_vector<int>(d_ar, d_ar + (81 * 3));
  auto gar = host_vector<int>(81 * 3, 0);
  MPI_Reduce(h_ar.data(), gar.data(), gar.size(), MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
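
For reference, this is roughly what the GPU-enabled path could look like - a sketch only, assuming a CUDA-aware MPI build that accepts device pointers directly (not what the code above does):

// Sketch only: assumes a CUDA-aware MPI implementation.
auto d_gar = device_malloc<int>(81 * 3);
cudaDeviceSynchronize();  // make sure setar has finished writing d_ar
MPI_Reduce(raw_pointer_cast(d_ar), raw_pointer_cast(d_gar), 81 * 3, MPI_INT,
           MPI_SUM, 0, MPI_COMM_WORLD);
// rank 0 could now reduce [d_gar, d_gar + 81 * 3) with thrust::reduce on the
// device, with no host staging copy in either direction.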

And then we perform our final reduction on our root rank to see if we have any cells with values greater than 1. We could perform this reduction on the device, but it’s probably not worth it to copy the data back to the device for just one operation.

  if (rank > 0)
    return false;
  const auto m = thrust::reduce(thrust::host, gar.begin(), gar.end(), -1,
                                thrust::maximum<int>());
  return m < 2;

And there we have it, our sudoku validator is running on multiple processes and using GPUs.

$ mpirun -n 7 ./src/thrust/lc-valid-sudoku-mpi-thrust
true
true
false
false

Now let’s move on to Fortran.

Fortran

You’re likely not surprised that this looks a lot like our previous solutions.

subroutine isgood(board, ret)
  implicit none
  integer, dimension(0:(shape*shape)-1), intent(in) :: board
  logical, intent(out) :: ret
  integer, dimension(0:(shape * shape * 3)-1) :: ar
  integer :: v, row, col, i, bx, by
  ar = 0
  do row = 0, shape-1
    do col = 0, shape-1
      v = board(idx2(row, col))
      if (v .eq. 0) cycle
      ar(idx3(row, v-1, 0)) = ar(idx3(row, v-1, 0)) + 1      
      ar(idx3(col, v-1, 1)) = ar(idx3(col, v-1, 1)) + 1
      ar(idx3(bi(row, col), v-1, 2)) = ar(idx3(bi(row, col), v-1, 2)) + 1
    end do
  end do
  v = maxval(ar) - 1
  ret = (v .lt. 1)
end subroutine isgood

If I clear away the declarations and initializations, this looks fairly readable. You may notice that I have to repeat myself a few times because there’s not a really nice way to increment a value in Fortran.

subroutine isgood(board, ret)
  do row = 0, shape-1
    do col = 0, shape-1
      v = board(idx2(row, col))
      if (v .eq. 0) cycle
      ar(idx3(row, v-1, 0)) = ar(idx3(row, v-1, 0)) + 1      
      ar(idx3(col, v-1, 1)) = ar(idx3(col, v-1, 1)) + 1
      ar(idx3(bi(row, col), v-1, 2)) = ar(idx3(bi(row, col), v-1, 2)) + 1
    end do
  end do
  v = maxval(ar) - 1
  ret = (v .lt. 1)
end subroutine isgood

Now we move on to the MPI-distributed Fortran implementation. This solution is pretty long so I’ll break the function into a few slides like a sliding window.

Fortran And MPI

Here is the full solution.

subroutine isgood(board, ret)
  use mpi
  implicit none
  integer, dimension(shape*shape), intent(in) :: board
  logical, intent(out) :: ret
  integer, dimension(shape * shape * 3) :: ar, gar
  integer, dimension(shape * shape) :: rows, cols
  integer :: v, row, col, i, chunk, rank, size, ierr

  ar = 0
  gar = 0

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  do row = 0, shape-1
    do col = 0, shape-1
      rows(1+idx2(row, col)) = row
      cols(1+idx2(row, col)) = col
    end do
  end do

  chunk = ((shape*shape) + size - 1) / size

  do i = 1+(rank*chunk), (rank*chunk)+chunk
    if (i .gt. (shape*shape)) exit
    row = rows(i)
    col = cols(i)
    v = board(1+idx2(row, col))
    if (v .eq. 0) cycle
    ar(idx3(row, v-1, 0)+1) = ar(idx3(row, v-1, 0)+1) + 1      
    ar(idx3(col, v-1, 1)+1) = ar(idx3(col, v-1, 1)+1) + 1
    ar(idx3(bi(row, col), v-1, 2)+1) = ar(idx3(bi(row, col), v-1, 2)+1) + 1
  end do
  
  call MPI_Reduce(ar, gar, 3*shape*shape, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  if (0 .eq. rank) then
    v = maxval(gar) - 1
    ret = (v .lt. 1)
  else
    ret = .false.
  end if

end subroutine isgood

Let’s trim away the declarations and initializations again:

subroutine isgood(board, ret)
  do row = 0, 8
    do col = 0, 8
      rows(1+idx2(row, col)) = row
      cols(1+idx2(row, col)) = col
    end do
  end do
  chunk = (81 + size - 1) / size
  do i = 1+(rank*chunk), (rank*chunk)+chunk
    if (i .gt. 81) exit
    row = rows(i)
    col = cols(i)
    v = board(1+idx2(row, col))
    if (v .eq. 0) cycle
    ar(idx3(row, v-1, 0)+1) = ar(idx3(row, v-1, 0)+1) + 1      
    ar(idx3(col, v-1, 1)+1) = ar(idx3(col, v-1, 1)+1) + 1
    ar(idx3(bi(row, col), v-1, 2)+1) = ar(idx3(bi(row, col), v-1, 2)+1) + 1
  end do
  call MPI_Reduce(ar, gar, 3*81, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  if (0 .eq. rank) then
    v = maxval(gar) - 1
    ret = (v .lt. 1)
  else
    ret = .false.
  end if
end subroutine isgood

You’ll notice that I create row and column arrays again because this makes distributing the work across ranks much simpler.

  do row = 0, 8
    do col = 0, 8
      rows(1+idx2(row, col)) = row
      cols(1+idx2(row, col)) = col
    end do
  end do

The core loop is the same as the other distributed solutions. I work only on the rows and columns assigned to the current rank.

  do i = 1+(rank*chunk), (rank*chunk)+chunk
    if (i .gt. 81) exit
    row = rows(i)
    col = cols(i)
    v = board(1+idx2(row, col))
    if (v .eq. 0) cycle
    ar(idx3(row, v-1, 0)+1) = ar(idx3(row, v-1, 0)+1) + 1      
    ar(idx3(col, v-1, 1)+1) = ar(idx3(col, v-1, 1)+1) + 1
    ar(idx3(bi(row, col), v-1, 2)+1) = ar(idx3(bi(row, col), v-1, 2)+1) + 1
  end do

We reduce the solution across all of our ranks to get the full array on rank 0. We then perform our max reduce to get our answer and we return!

  call MPI_Reduce(ar, gar, 3*81, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  if (0 .eq. rank) then
    v = maxval(gar)
    ret = (v .lt. 2)
  else
    ret = .false.
  end if

Running this gives us the answers we expect.

$ mpirun -n 7 ./src/fortran/lc-valid-sudoku-ftn-mpi
 Running with world size of           7
 T
 T
 F
 F

Conclusion

I hope you’ve all enjoyed this video and the foray into distributed computing in a few different programming languages.


YouTube Description

We solve a Leetcode problem in four languages using various combinations of MPI and CUDA!

  • 0:00 Problem Introduction

  • 0:36 BQN Solution

  • 2:07 Solution Strategy

  • 4:54 Python Solution

  • 5:42 Python & MPI Solution

  • 8:01 C++ Solution

  • 8:55 C++ & MPI Solution

  • 9:58 C++ & CUDA Solution

  • 11:24 C++ & MPI & CUDA Solution

  • 13:31 Fortran Solution

  • 13:55 Fortran & MPI Solution

  • 14:38 Conclusion

  • Written version: http://www.ashermancinelli.com/leetcode-distributed-computing

  • LinkedIn: https://www.linkedin.com/in/asher-mancinelli-bb4a56144/

  • GitHub Repo for Examples: https://github.com/ashermancinelli/algorithm-testbed

Solving a hard leetcode problem in the BQN APL dialect and CUDA C++!

NOTE: This post is a transcript of the youtube video linked here.

Problem

Hello everyone, today I’d like to go through two solutions to a LeetCode problem. We’ll first look at the solution with the BQN array language, and then we’ll look at a GPU-capable solution in CUDA that uses the Thrust template library.

Link here.

Given a string containing just the characters ‘(’ and ‘)’, find the length of the longest valid parentheses substring.

For example, for the string ")()())" the expected answer is 4, and for the string "())" the expected answer is 2. Of course, for an empty string the answer is 0.

We’ll be looking at the solution in BQN first.

BQN/APL Solution

Here is the full solution:

   F ← {0⌈1+⌈´⌈´¨∾¨1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩}
   F ")()())"
4
   F "(()"
2
   F ""
0

I index into the string ")(" to convert the characters to integers, and then I take all the prefixes of that array.

   {↓")("⊸⊐ 𝕩} "(()"
⟨ ⟨ 1 1 0 ⟩ ⟨ 1 0 ⟩ ⟨ 0 ⟩ ⟨⟩ ⟩

I then multiply by two and subtract one so each index represents the change in the level of nesting at that index.

   {1-˜¨2ר↓")("⊸⊐ 𝕩} "(()"
⟨ ⟨ 1 1 ¯1 ⟩ ⟨ 1 ¯1 ⟩ ⟨ ¯1 ⟩ ⟨⟩ ⟩

I then plus-scan to find the cumulative level of nesting up to that index for each prefix of the array:

   {+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} "(()"
⟨ ⟨ 1 2 1 ⟩ ⟨ 1 0 ⟩ ⟨ ¯1 ⟩ ⟨⟩ ⟩

Find the zeros in each prefix, since these are the locations where the substring is balanced:

   {0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} "(()"
⟨ ⟨ 0 0 0 ⟩ ⟨ 0 1 ⟩ ⟨ 0 ⟩ ⟨⟩ ⟩

We can then group the results to find the indices which are nonbalanced and balanced for each prefix:

   {⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} "(()"
┌─                                            
· ⟨ ⟨ 0 1 2 ⟩ ⟩ ⟨ ⟨ 0 ⟩ ⟨ 1 ⟩ ⟩ ⟨ ⟨ 0 ⟩ ⟩ ⟨⟩  
                                             ┘
   {⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} ")()())" # Longer problem
┌─                                                                                                              
· ⟨ ⟨ 0 2 4 5 ⟩ ⟨ 1 3 ⟩ ⟩ ⟨ ⟨ 0 2 4 ⟩ ⟨ 1 3 ⟩ ⟩ ⟨ ⟨ 0 2 3 ⟩ ⟨ 1 ⟩ ⟩ ⟨ ⟨ 0 2 ⟩ ⟨ 1 ⟩ ⟩ ⟨ ⟨ 0 1 ⟩ ⟩ ⟨ ⟨ 0 ⟩ ⟩ ⟨⟩  
                                                                                                               ┘  

We can then of course drop the first list in each prefix so we only have the balanced indices. I’ll switch to the longer problem here so it’s a little easier to see what’s happening:

   {1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} ")()())"
┌─                                                      
· ⟨ ⟨ 1 3 ⟩ ⟩ ⟨ ⟨ 1 3 ⟩ ⟩ ⟨ ⟨ 1 ⟩ ⟩ ⟨ ⟨ 1 ⟩ ⟩ ⟨⟩ ⟨⟩ ⟨⟩  
                                                       ┘

We can then flatten the sublists together and find the largest element, which represents the index in a given prefix with the longest valid substring:

   {⌈´⌈´¨∾¨1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} ")()())"
3

Because we are using 0-based indices as God intended, we’ll have to add one to the result. We’ll also take the maximum of our result and 0 in case no balanced substrings were found, which would otherwise give us ¯∞:

   {0⌈1+⌈´⌈´¨∾¨1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩} ")()())"
4

Finally, let’s look at all the test cases:

   F ← {0⌈1+⌈´⌈´¨∾¨1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩}
   F ")()())"
4
   F "(()"
2
   F ""
0

Now that we’ve gone through the BQN solution, let’s take a look at the CUDA and Thrust solution.

CUDA/Thrust Solution

Here is the full solution, minus some includes and using statements:

auto solve(const string& problem) -> int {
  const int N = problem.size();
  if (0 == N)
    return 0;

  host_vector<int> mapping;
  mapping.reserve(N);
  std::transform(problem.begin(), problem.end(), std::back_inserter(mapping),
                 [=](const char &c) { return c == '(' ? 1 : -1; });
  device_vector<int> d_mapping = mapping;

  vector<int> starts(N - 1);
  std::iota(starts.begin(), starts.end(), 0);

  int max_len = std::accumulate(
      starts.begin(), starts.end(), 0,
      [&d_mapping, N](int max_so_far, int i) {
        device_vector<int> prefix(N-i);
        thrust::inclusive_scan(d_mapping.begin()+i, d_mapping.end(), prefix.begin());

        device_vector<int> indices(N - i);
        thrust::sequence(indices.begin(), indices.end(), 0);

        auto zip_start = thrust::make_zip_iterator(
            thrust::make_tuple(prefix.begin(), indices.begin()));
        auto zip_end = thrust::make_zip_iterator(
            thrust::make_tuple(prefix.end(), indices.end()));

        int max_for_prefix = thrust::transform_reduce(
            zip_start, zip_end,
            [=] __device__(const auto &tup) -> int {
              return thrust::get<0>(tup) == 0 ? 1 + thrust::get<1>(tup) : 0;
            },
            0, thrust::maximum<int>());

        return std::max(max_so_far, max_for_prefix);
      });

  return max_len;
}

int main() {
  for (const string &problem : { ")()())", "(()", "" })
    std::cout << solve(problem) << "\n";
  return 0;
}

This is quite a lot to take in, so let’s break it down.

First I grab the problem size so I don’t have to keep repeating myself, and I check to make sure our problem size is greater than zero. I then transform the string into integers and copy the data to the GPU device. This step is just like the BQN solution up until this point.

  const int N = problem.size();
  if (0 == N)
    return 0;

  host_vector<int> mapping;
  mapping.reserve(N);
  std::transform(problem.begin(), problem.end(), std::back_inserter(mapping),
                 [=](const char &c) { return c == '(' ? 1 : -1; });
  device_vector<int> d_mapping = mapping;

In BQN:

{1-˜¨2ר↓")("⊸⊐ 𝕩}

I then create an STL vector to hold the starting positions for each prefix. I’m using the STL here instead of Thrust because I’d otherwise have to nest my CUDA calls, and not all of the Thrust API is callable on the GPU device. Ideally, we’d fit as much of our algorithm onto the GPU device as possible to minimize data transfer between memory spaces, but I still ended up using a mixture of the STL and Thrust.

  vector<int> starts(N - 1);
  std::iota(starts.begin(), starts.end(), 0);

Because of how the STL algorithms are used, we now have to jump to the end of our BQN solution. This call to accumulate corresponds to the outer reduction in our BQN solution here:

  // BQN) F ← {0⌈1+⌈´⌈´¨∾¨1↓¨⊔¨0=+`¨1-˜¨2ר↓")("⊸⊐ 𝕩}
  //               ^
  //              here

  int max_len = std::accumulate(
      starts.begin(), starts.end(), 0,
      [&d_mapping, N](int max_so_far, int i) {
        // ...
      });

We’re reducing over the maximum balanced substring for each prefix of the input string.

Next I create a device vector for the given prefix, and take the prefix sum of the current prefix.

  // BQN) +`
  int max_len = std::accumulate(...
        device_vector<int> prefix(N-i);
        thrust::inclusive_scan(d_mapping.begin()+i, d_mapping.end(), prefix.begin());

I then create an iota to zip with our prefix-summed substring (a range in BQN parlance, or a sequence in Thrust parlance - can’t we all just agree on a term here?):

        device_vector<int> indices(N - i);
        thrust::sequence(indices.begin(), indices.end(), 0);

        auto zip_start = thrust::make_zip_iterator(
            thrust::make_tuple(prefix.begin(), indices.begin()));
        auto zip_end = thrust::make_zip_iterator(
            thrust::make_tuple(prefix.end(), indices.end()));        

This corresponds to the couple dyad in BQN or the zip function in Python and lots of functional languages.

I then perform two algorithms in this one step. If the given position in the prefix-summed substring is zero, that means it’s balanced and I want to keep the index. Otherwise, I can just throw it out. After performing this transform or map algorithm, I take the max reduction of the substring to find the greatest index at which the substring is balanced. If there are multiple points in the substring where the parens are balanced, this will find the greatest one.

        int max_for_prefix = thrust::transform_reduce(
            zip_start, zip_end,
            [=] __device__(const auto &tup) -> int {
              return thrust::get<0>(tup) == 0 ? 1 + thrust::get<1>(tup) : 0;
            },
            0, thrust::maximum<int>());

I then return the maximum balanced substring for the current prefix, which is then folded in the outer std::accumulate to find the greatest balanced substring for all prefixes in the original string.

  int max_len = std::accumulate(...
        [...](int max_so_far, int i) {
            int max_for_prefix = ...
            return std::max(max_so_far, max_for_prefix);
        });

I then return the maximum length I found, and we have our answer!

auto solve(const string& problem) -> int {
  ...
  int max_len = std::accumulate(...);
  return max_len;
}

I ran this with the same test cases like so:

int main() {
  for (const string &problem : { ")()())", "(()", "" })
    std::cout << solve(problem) << "\n";
  return 0;
}

And running gave me:

$ ./src/thrust/lc-longest-valid-parens
4
2
0

Just like we expected!

YouTube Video Description

We solve a hard leetcode problem in both BQN and CUDA C++ with the Thrust library.

Conclusion

Thanks for tuning in and I hope you enjoyed this example program. You can find all the GPU examples I used in the links below. Connor Hoekstra, if you’re reading or watching this, I hope to see you out-do my BQN and CUDA solutions in another video :).


Finding the best espresso in one of the homes of 3rd-wave coffee.

The Plan

In the spring and summer of 2023 I started a list of coffee shops to evaluate. By the end of the summer I hope to have established a rough ranking of the best shops, along with some honorable mentions for great shops to work from.

Here are some of the early contenders:

The second shop to get a 10/10 on my arbitrary rating system blew me away. I’m not usually a huge fan of fruity espresso, but the double-fermented Las Frutas roasted in-house was the best espresso I’ve had to date. The Colombian beans, sealed in water bags for 72 hours before another 30-hour wet-tank soak, were incredible. The roaster and barista recommended Sterling Coffee Roasters for my next espresso, so I’m hopeful that Sterling will rate pretty highly as well.

Ovation Coffee and Tea was the first shop to be awarded a perfect 10/10. The location on PSU’s campus right off the North-South streetcar line has plenty of space to work and REALLY cute porcelain cups.

Behind the Museum

The Behind the Museum cafe’s beautiful floor-to-ceiling windows and Japanese pottery along the walls make for a particularly nice work environment, especially when it’s raining - the rain rolling off those massive windows immediately puts you at ease.

6/21/23 update:

In my second visit, I had another Ethiopian roast, the Sidama Setamo, an Ethiopian washed-process SO Heirloom roast with bergamot and wild cherries as flavor comps. The barista compared the roast to Fruit Loops, which felt far closer to my experience than the listed comps; I also got notes of green apple.

I’m determined to try every roast this place puts on espresso; if I ever move out of Portland I’m going to kick myself for not having as much of this espresso as I can. This may be consistently the best espresso I’ll ever have.

I got to sit with Eric (the owner) for a minute today and ask a few questions about Sterling. He started the operation in 2005 (under a different name I believe) and now has a team working at the roaster while he spends most of his time at the cafe interacting with customers and tasting roasts.

He spoke really highly of Keeper cafe (which I had never heard of) as well as Barista (which I tried later that day) and Courier, which is currently somewhere in my top 5 cafes & roasters in Portland. He also referred to Vivace in Seattle as the Mecca of the 90s and 2000s coffee culture, but lamented the lack of innovation he’s seen in Seattle’s coffee scene in the years since; he used to take new employees up to their roaster to learn the trade. Same with Slate roasters, who have since closed their doors.

I hope to set up a more formal interview with Eric sometime to tell Sterling’s story.

–> 6/14/23 (original review):

Prior to trying Sterling’s espresso, the #1 on my list was from Courier Coffee. When I told the barista at Courier how much I enjoyed their espresso, he made sure to recommend Sterling for my next stop.

Oh, you like espresso? Yeah, you should definitely check out Sterling

He backed that comment up with a few more remarks about how special their espresso is, and it went to the top of my list of places to try next. It did not disappoint.

–>

The barista was excited to welcome me in and tell me all about the roast they had ready for espresso, which happened to be the Nigusse Lemma, a natural-process small-batch Ethiopian roast with Ethiopian Heirloom varietals. I got the same feelings I get trying a new whiskey at the Scotch Lounge on the East side of town - she talked through the roasting process at their East-side roastery and the origins of the beans.

After pulling the shot, she put together a board with a small glass of sparkling water as a palate cleanser (as is usual in espresso-focused cafes).

Adding to the Scotch Lounge feelings, the espresso was served in a glencairn-style whiskey serving glass along with a third glass with a drip brew so as to try the roast from another perspective.

Both coffees were smooth and creamy - I jotted down notes of strawberries and cream before I even saw “Raspberry & Vanilla Cream” listed as flavor comps on the bag of coffee they used for my espresso.

–>

This espresso struck me as perfectly balanced - without hesitation, I wanted to share it with every one of my friends still not won over by espresso. In my experience, when people try espresso and decide they hate it, 95% of the time it’s due to the intensity of the flavors, especially with sour and bitter notes. When I try an intense espresso, something that punches you in the mouth with sourness or fruit or bitterness or chocolate or nuttiness, I can enjoy the intensity (if it’s done well!) - but that seems to be the attribute of espresso that puts people off of it.

This espresso was not intense in the usual way, but not too mild either. It was full of flavor with delicious notes of citrus, cream, a balance of bitterness to round out the flavor profile, yet not too intense to be someone’s first espresso.

This espresso, along with their drip brew, the knowledgeable baristas, and a cozy cafe seating area nestled into NW Portland with indoor trees and fast wifi, brings this to the top of my list. The barista who served me happened to recommend Barista for one of my next espresso destinations, which I haven’t had the opportunity to try yet - I look forward to posting a review given that the current champion of my list thinks I should give it a try.

As I work out an espresso tour of Portland, I have to pencil Sterling in as the final destination. I would want any espresso aficionados visiting the city to finish their time here with a Sterling espresso.

–>

Sterling Coffee homepage

“Coffee should be dope”

–>

Nestled in central Chinatown, this cafe front for a roastery has its own streetwear line and feels like a trap-house recording studio. Home of the self-termed Portland’s Hype Barista, the shop has Nike Jordan 1s hung up everywhere and trap playing over the speakers all day.

The BREEZY Don Enrique Colombian single-origin medium roast was seriously good. As soon as I finish making this espresso tour, I plan on coming back here to buy a bag or two of the BREEZY - it was one of the best-smelling roasts I’ve come across in the last year.

I haven’t seen DEADSTOCK mentioned on many blogs, which feels like an injustice. Definitely worth stopping by.

–>

Deadstock Coffee homepage

Solid option in NW Portland with a variety of local roasts, not just their own.

–>

I had the spirit lifter blend from Puff roasters, which was unfortunately out of stock in the consumer-sized bags, so I was unable to see the actual roast level, flavor comps, etc. It tasted like a medium roast with some very mild sweetness, almost like a vanilla tea; very clear, very light body.

They had two other roasts on espresso; this is a thoughtful cafe with interesting roasts. I plan on returning to rate all three roasts they have available, and I’ll keep my eyes on what new roasts they bring into the shop.

Although they roast their own coffee, they also bring in beans from other local roasters. Sourcing and serving the best local beans is their mission statement, as listed on their website.

–>

Barista homepage

Not my favorite, but it might be yours.

The Never Coffee cafe on SW 12th and Alder is one of the nicest cafe environments I’ve found so far, with a modern feel and pastel color palette. They got a low score from me simply because their espresso was not my favorite - it felt a little bland, not overly fruity or dark and rich, somewhere in the middle with less flavor than I was hoping for.

I had their Bangarang roast, a blend of washed Kenyan, washed Peruvian, and honey-process Honduran beans, which the barista was more than happy to tell me about - they clearly care about coffee, roast in-house, and take creative steps with their drinks, but their espresso was not for me and that’s my measuring stick.

Other reviews of Never Coffee call out their lattes as very above-average, which seems to be the case; almost every single person I saw during the morning I spent in the cafe chose one of their 5 unique latte flavors, and I was the only one with a doppio.

If you stop by, maybe try one of their lattes.

Never Coffee homepage

Unique energy, passionate about espresso.

–>

Named for its location relative to the Ladd’s Addition neighborhood in SE Portland, this roaster and cafe front has several roasts available for espresso at any given time (and one or two more specifically for drip and pour-over).

The flavor comps for the roast I tried were absolutely spot-on: I got strong notes of sweetness and white chocolate from the 90h Tabi. These Tabi beans from Tolima, Colombia underwent a 90-hour submerged fermentation process. I hadn’t heard of the Tabi variety before: it’s a hybrid of the bourbon, typica, and timor varietals. Upper Left is clearly passionate about espresso, but has something for everyone: more than half of the patrons were sporting iced lattes on the Friday afternoon I stopped in.

The venue is very unique as well: the roasting machine is separated from the cafe by only a bar seating area; you can sit right next to the roasting machine if you like. The white walls, light wooden furnishing, and ample indoor and outdoor seating made for a wonderful environment.

Speaking of the outdoor seating, there was a DJ playing lowkey club mixes over industrial-sized speakers in the outdoor seating area which happened to be almost entirely full; the neighborhood clearly takes full advantage of having this cafe so accessible.

Upper Left is a must-try.

–>

Upper Left Roasters homepage

Enjoyable, but not my first recommendation.

–>

The Ethiopian and Indonesian in-house medium roast (the Collective Blend) was subtle and tasty - I wasn’t smacked across the face with fruit or punched in the nose with dark-chocolatey bitterness. The notes on the package mentioned dark cherry and pear; I can see where they got pear, but grapefruit is the predominant flavor that came to mind.

It was enjoyable! I would have it again, but I’m not rushing back or sending out recommendations to all my friends.

The sea-salt iced matcha latte the person in front of me ordered looked beautiful, and I heard several people mention how much they liked it - if you’re a matcha latte kinda person, this might be a fun stop.

–>

Abba Coffee Roasters homepage

Do you like blonde roasts or need a beautiful space to get some work done?

–>

The first thing to draw me in to a coffee shop is usually the smells - pastries, coffee, and maybe the nutty wooden smell of roasting if I’m lucky enough to be in a roastery.

Walking into Rose City Coffee, however, I was struck by the dark cherry-wood and soft-touch-textured metal furnishing, the brick walls, and the wide-open cafe space - there are no walls or furniture to obstruct your view of one side of the cafe from the other, combining with the floor-to-ceiling windows to make the space feel larger than it is.

–>

I had the Floribunda, a medium blend with chocolate and floral flavor comps… and that’s about all I could find about the roast between asking in the shop and looking online. This is obviously not an espresso specialty shop, but my doppio was still alright.

I got quite a bit of sourness up front, without much darkness or bitterness to balance it out. If you’re a fiend for blonde roasts, this would be a pretty fun place to stop - but the menu seems to emphasize lattes, so I don’t know that I would recommend this spot to a huge fan of blonde espresso.

Personally, I would have loved a darker roast and a LOT more information about the beans, but espresso isn’t every shop’s emphasis, and that’s alright too. Their food options looked (and smelled) amazing as well.

Rose City will live in my head as the place to go if I need to stay put somewhere all day - great food, decent coffee, and a free-feeling space that I could spend a whole workday in.

–>

On top of the food, coffee, and environment, the friend that recommended this place to me remarked that they have live music on Saturdays! The community events add to Rose City’s vibe, which leaves it feeling remarkably like the Flying M coffee shop in the Boise, Idaho area, my all-around favorite shop from my time living there.

The Flying M will always have a special place in my heart, so I can’t help but feel sentimental cracking open my laptop to get some work done in the refreshing Rose City seating area. It’s worth stopping by if you’re in the SE part of the city (especially on a Saturday 😊).

–>

Rose City Coffee homepage
The Flying M homepage

There’s too much to say. You should really just go try their espresso.

–>

Superjoy attempts to bring a blend of artful appreciation of coffee and Chinese culture to the Portland coffee scene.

–>

I tried not to eavesdrop, but while I drank my coffee and enjoyed the ambiance of the shop, I couldn’t help but overhear Christopher (who seems to be the primary roaster) training up a new employee. I overheard him gently preaching the value of consistency in a barista; how to consistently and diligently clean your workspace in between every shot, how to measure all the observable input variables, how to deliver the same shot over and over again.

He clearly cares about his coffee and the presentation of his roasts, which he seems to put 100% effort into.

–>

My espresso tasted perfectly balanced with a mild finish; on the darker side of fruity with notes of cherry and pomegranate and a bit of nuttiness. I had their medium roast blend of natural-process Ethiopian and washed-process Guatemalan beans.

Christopher was excited to tell me about every aspect of his shop, from the new design and branding language he recently brought to the bags to the procurement of the different beans he roasts in-house.

While I listened to Chris passionately give me every detail about his shop, I noticed some very interesting specialty drinks on the seasonal menu: a Chinese pepper mocha, a rose honey latte, and a maple oat latte. This heavily reminded me of Never Coffee’s creative seasonal specialty lattes.

–>

Chris also mentioned that they usually have some Chinese beans in house, but happened to be out that day; I had never (to my knowledge) tried Chinese-grown beans before, so I plan to visit next week when they (hopefully) have them back in stock.

–>

Superjoy homepage

Espresso should be fun! Don’t be too intimidated by all the lingo.

–>

These are my espresso cliff notes, things I keep referring back to when I get confused about some aspect of espresso.

NOTE: This is not about how to make espresso - just getting a feel for the lingo and culture, hopefully increasing your enjoyment of the drink and specialty shops that serve it.

What is Espresso?

https://home.lamarzoccousa.com/the-beginners-guide-to-espresso-drinks-2/

Espresso is a coffee brewing method. It’s a way of making coffee where a small amount of near boiling water is forced through finely ground coffee, under pressure.

The space above the flat espresso puck fills with water and the espresso machine applies 9 bars of pressure to force water through the coffee.

The (General) Recipe: To pull a double shot, dose 18 grams of finely ground coffee into your portafilter. Aim for about 28-36 grams of espresso in about 25-30 seconds.

Processes

Perfectdailygrind article

Washed coffees focus solely on the bean. They let you taste what’s on the inside, not the outside.

Washed coffees depend almost 100% on the bean having absorbed enough natural sugars and nutrients during its growing cycle. This means the varietal, soil, weather, ripeness, fermentation, washing, and drying are key.

This means that the washed process highlights the true character of a single origin bean like no other process. It’s why so many specialty coffees are washed.

–>

The natural process, also known as the dry process, is a back-to-basics approach that stems from Ethiopia. The fruit remains on the bean, and dries undisturbed. … This inconsistency is often the result of unripe fruit drying and turning brown alongside ripe fruits. However, there are many who believe this process actually has the potential to create the most flavourful coffees.

–>

When done right, honey processed coffee can literally taste like someone has put honey and brown sugar in your cup of coffee – although the name actually comes from how sticky the beans get during processing. The honey process is strongly associated with Costa Rica and, in recent years, subcategories have developed: yellow, red, golden, black, and white honey.

Coffee Affection article

Honey processing leaves a sticky inner coating called mucilage on the coffee bean. This “honey” ferments, giving the bean a sweet, fruity flavor.

Varietals

Within each species of coffee bean (Arabica, Robusta, Liberica, and Excelsa), there are many, many varieties.

Species: Arabica

Home Coffee Expert article

Arabica coffee is the most popular coffee bean in the world, accounting for around 64% of global coffee production.

As the Arabica coffee plant originated in the highland areas of Ethiopia it requires high altitudes and is quite sensitive to high temperatures. It can also only be grown in what is known as the “Coffee Belt” – the equatorial regions between the Tropic of Cancer and the Tropic of Capricorn.

High-quality Arabica coffee beans should be sweet with chocolate, caramel, fruit, nut, and floral notes.

The varieties of arabica can be listed under four main groups, or collections of varieties:

  • Ethiopian landrace
    • associated with incredibly high-quality beans and low yields.
    • huge number of “heirloom” varieties grown in Ethiopia
  • Bourbon and Typica group
    • a huge amount of the Arabica coffee grown today is descended from a small number of plants taken from Yemen in the late 17th century. The path taken determines their designation as either Bourbon or Typica.
  • Introgressed
  • F1 hybrid
    • Hybrids are created when two genetically different “parent” plants are bred to create a new cultivar. The “F1” refers to the fact these are the first generation offspring.
    • broken down into two categories: Introgressed and Not Introgressed. Introgressed F1 Hybrids have characteristics from two distinct species like Robusta and Arabica, whereas Not Introgressed F1 Hybrids are made by crossing only varieties of Arabica coffee beans.

References

  1. World Coffee Research
  2. Brewology
  3. Acquired Coffee’s beginner’s guide
  4. PerfectDailyGrind.com’s beginner’s guide to espresso tasting
  5. YES Coffee Roasters beginner’s guide
  6. Espresso-works coffee tasting guide
  7. The Coffee Roaster’s Complete Guide to Coffee Varieties and Cultivars
  8. TYPES OF COFFEE BEANS: LIST OF VARIETALS

All the shops and roasters I tried while visiting my friend in Seattle!

–>

Let’s Get the Ratings Out of the Way

  1. Cafe Allegro: 8.5/10
  2. Cafe Vita: 8/10
  3. Cafe Ladro: 8.5/10
  4. Zeitgeist: 6.5/10
  5. Good Weather Bike & Coffee: 7.5/10
  6. Lighthouse Roasters: 7/10
  7. Monorail Espresso: 3/10
  8. Fonté: 2/10

Overall Impressions

I was struck by how much more prevalent darker roasts were in Seattle - it was my impression that most specialty espresso was in the lighter range of roasts to bring out more of the differences between varietals. Almost every shop I visited served a pretty dark and heavy-bodied roast for their espresso, and few places changed their roasts out with the seasons.

I spoke with Collin from Cafe Ladro about the coffee culture in Seattle:

We were really ahead of our time in the 90s, but most places haven’t been updated since then. You would probably find fewer shops serving lighter (and better) roasts in a newer place like Portland. Things are even worse in San Fran - they were killing the game 30 years ago, but if you go now it’s mostly the same stuff.

That being said, there were still some great spots roasting beans in-house. Cafe Ladro (where I met Collin) and Cafe Vita were wonderful. Cafe Allegro was a hidden gem I would absolutely swing by again, especially given how close it is to the 1 metro line; I had two coffees with an old friend there before hopping on the 1 to catch my Amtrak back to Portland, but I could have stayed much longer. The Good Weather Bike & Coffee cafe-and-bike-repair-shop combo served a really nice espresso blend from Herkimer, who roast their beans a few neighborhoods away.

Collin mentioned Superjoy Roasters in Portland, and though they are on my list I have not stopped by there yet.

Cafe Allegro: 8.5/10

Hidden in an alleyway in between a used bookstore and a pizza joint, Cafe Allegro had the most enjoyable espresso of any of the cafes I visited in Seattle. The barista had a t-shirt reading Where the hell is Allegro? paying homage to the cafe’s obscure location.

I felt a sense of relief when I took my first sip of Allegro’s espresso and felt familiar waves of fruit wash over me. For most of my coffee-drinking life I considered myself more of a dark-roast kinda guy, but since getting more into espresso and going back to darker roasts, I feel like my taste has developed to like well-done lighter roasts far more.

I got notes of berries (strawberries?) and dried fruit. The listed flavor comps were stone fruit (?) and chocolate. I didn’t get much of either of those, but I still loved it (and went back for a second).

The barista mentioned a cafe in northern San Francisco named after the Italian city Trieste, after which Cafe Allegro was designed; maybe if my espresso journey takes me that far I’ll give it a try! Definitely stop by if you have some time near the 1 line in the University District.

Cafe Vita: 8/10

I had the Cafe del Sol, a house-roasted medium blend. It was pretty good - the first >=8 score of this Seattle trip. I experienced medium body, balanced flavor (not nearly as dark as my previous shots) on the nutty side of sweet. The flavor comps on the bag read milk chocolate, caramel, dark cherry; I mostly got the caramel.

Nothing mind-blowing, but I would have it again.

Cafe Ladro: 8.5/10

This was the first specialty espresso place I got to visit. I had the Ladro house medium blend with milk chocolate, dark cherry and caramel flavor comps, but the strongest flavors I got were nuttiness/hazelnut and vanilla. I found a really mild finish, I wasn’t punched in the nose by bitterness or dryness like many of the other shops.

This was a really nice espresso! If you’re in the West Queen Anne area you might want to stop by.

Zeitgeist: 6.5/10

This smaller German coffeehouse with aged brick walls had a nice environment and okay-ish coffee, but wasn’t overly impressive to me. Unfortunately, they didn’t seem all that busy either - I could see myself coming back if I needed a place to work or if I knew there would be a busier or more upbeat energy, but this was not my favorite place. I had the Jackson Street Blues roast from Fulcrum Coffee Roasters.

Good Weather Bike & Coffee: 7.5/10

Good Weather was interesting - the cafe environment was serene, tucked into an alleyway in Capitol Hill near a secondhand sportswear shop. I was excited to see them serving Herkimer’s espresso blend; I wanted to try these beans from a Herkimer cafe front, but I didn’t get the chance.

I got a lighter body, almost like a black tea, and a much lighter roast than some of the other (read: worse) shots I tried this weekend. I tasted some chamomile and dried fruits, maybe raisins/craisins.

Lighthouse Roasters: 7/10

This neighborhood roaster was filled with dogs, and the cafe bar shelved probably a dozen kinds of dog treats. The roast was too dark for specialty espresso, but I would absolutely have an americano or latte of their house roast. I get the sense the neighborhood patrons stop by for a coffee and grab a bag of their house roast for their drip machine at home. An extremely small and cozy shop with maybe 4 small tables right next to the roasters and at least half a dozen bags of green coffee beans ready to be processed.

If I lived next to this cafe, I’d probably be drinking far more americanos, given how enjoyable their cafe environment was.

Monorail Espresso: 3/10

The cute window-only cafe front was unfortunately not very good… Very dry, very dark, HEAVY body. The smells from the cafe and from my espresso were wonderful (I still love a dark-roast) but my espresso was not very good. My friend’s vanilla latte was great though!

Fonté: 2/10

Fonté was unfortunately the first espresso in this entire journey that I could not even finish. Extremely dark and heavy-bodied, my mouth puckered up with bitterness with both of the sips I could get down. Very unfortunate.

–>

Thanks to my friend Emma who stuck with me and my annoying coffee flavor charts and endless cafes :)

Pleasant, mild, unique, imported roasts from Sweden - try some beans you’ll likely not find anywhere else!

–>

The first thing I look for when I walk into a new cafe in the Portland area is the coffee bags the shop is selling:

Are they using Coava or Proud Mary beans? Do they roast their own? Am I trying something new today, or maybe an old favorite?

Walking into Adapt Cafe in SE, I was surprised to see bags I’d not seen before - light and modern packaging around small, 250 gram bags from Morgon Coffee Roasters from Gothenburg, Sweden.

The Swedish beans bled out into the design language of the shop - a smallish seating area with huge windows and modern-feeling all-black furnishing against white walls. The barista was happy to tell me about the imported roasts from the very young 5-year-old roasting company they buy from, and the rapid pace at which they have been winning awards:

Our combined resume contains… competing in most things coffee. Some of those competitions resulted in a bronze in roasting, two silvers in brewing and two gold medals in coffee tasting.

In 2020 we had our first prize spotlight for Morgon when we were finalists in the 12th annual international Sprudgie’s award as “Most notable roaster in 2020”.

  • Morgon Coffee Roasters’ “About Us” page, see links below

–>

I found the espresso to be very pleasant, on the mild side - the primary flavor comp I experienced was plum; a touch of sweetness and sourness, but not really any citrus per se, and certainly not much bitterness. After the first sip, I couldn’t tell if it was more sour or bitter. It wasn’t until the second and third sip that I made up my mind; that’s how subtle it was. The other flavor comps were structured, chocolate, and biscuit, but I could only really pick up notes of plum.

I had their honey process Costa Rican beans of the Catuai variety of Arabica from the Montero family - at first I thought Montero might have been referring to a coffee variety I’d not yet heard of, but no: these beans were grown by the Montero family, and Morgon’s website has a whole page about their relationship with the Montero family. Carlos Montero even visited Morgon in Gothenburg to see their roasting operation.

For being a dark-roast kinda guy, I really enjoyed their espresso. It’s obvious they care about coffee, their baristas are knowledgeable, and they get their beans from a serious roasting company. I would love to see more Morgon beans around town, maybe even in a second Adapt Cafe location. I think they would get a lot of business in a more central location, though I do enjoy how many local roasters get the spotlight in cafes downtown.

–>

Adapt Cafe insta
Morgon Coffee Roasters homepage

They say you never forget your first love…

–>

…and Coava was mine.

They are the first Portland-local roaster and cafe I fell in love with.

When I first moved to Portland, there was a Coava cafe in the same building as my apartment, on the lobby floor. Nearly all five workdays a week, nearly every week since I moved here, I’ve worked from that cafe in the upstairs seating area surrounded by Monstera Deliciosas, PSU students, and speakers playing indie tracks I’d never heard before.

I’ve consumed more Coava espresso than any other form of espresso in my life by a long shot. It’s no surprise that they earn a perfect 10/10; fruity but not too blond, bitter but not too dark.

I’ll forever mourn their west-side location, but their headquarters shop on SE Main and Grand right off the 6 bus line is a much nicer cafe anyways. The seating area is home to beautiful woodworking projects from Bamboo Revolution with whom Coava shares a space, along with antique roasting equipment and open-air garage doors and fans (at least in the summer time).

For this review I had a doppio of the Java Papatong, a washed-process from Indonesia, but I’ve loved many of their other roasts:

  • the San Marcos,
  • the SO Blend from the Americas and East Africa,
  • the Darfusu washed process,
  • the Fazenda Serra do Boné from Brazil,

and plenty of others. Their roasts are all seasonal, so try whatever they have.

–>

Coava Homepage

All the Portland espresso recommendations I could find online, and the shops that come up again and again.

–>

My Takeaways

The most common names in these blogs are probably:

  1. Coava
  2. Proud Mary
  3. Case Study
  4. Heart
  5. Stumptown

If you visit any given cafe in Portland, chances are high their beans come from Heart, Coava, or Proud Mary, and for good reason. All three of them are local and put out some solid roasts… but I still recommend seeking out the smaller shops that roast in-house.

In my opinion, the most underrated cafes and roasters are:

  1. Sterling
  2. Deadstock
  3. Courier

and the most overrated are:

  1. Stumptown
  2. Case Study
  3. Never Coffee (unless you really like lattes)
  4. PUSH X PULL

Lists From Other Blogs

As of June 15, 2023

PDX Monthly

Perhaps the most solid collection of Portland roasters and cafes: I found several shops I’d never heard of, and it doesn’t miss the usual suspects.

  1. Abba
  2. Albina Press
  3. Carnelian Coffee
  4. Case Study
  5. Cathedral Coffee
  6. Courier Coffee
  7. Deadstock
  8. Either/Or
  9. Electrica
  10. The Fresh Pot
  11. Futura Coffee Roasters
  12. Good Coffee
  13. Guilder
  14. Heart
  15. In J/Super Joy
  16. J Vein Caffe
  17. Kalesa
  18. Keeper Coffee
  19. Less and More
  20. Never Coffee
  21. Portland Cà Phê
  22. Prince Coffee
  23. PUSH X PULL
  24. Roseline
  25. Soro Soro
  26. Sterling Coffee
  27. Tōv Coffee
  28. Upper Left Roasters

Oregon Obsessed

I’ve tried (and liked) almost all of Oregon Obsessed’s recommendations, though a few are overrated in my opinion:

  1. Coava
  2. Good Coffee
  3. Stumptown
  4. Case Study
  5. Nossa Familia
  6. Proud Mary
  7. Deadstock
  8. Never Coffee
  9. Ovation
  10. Portland Coffee Roasters

Daily Hive

Good to see Coava consistently topping these lists - I hadn’t heard of Pájaro before.

  1. Coava
  2. Pájaro
  3. PUSH X PULL
  4. Heart Coffee
  5. Stumptown
  6. Upper Left
  7. Deadstock
  8. Good Coffee

Coffee Affection

Pretty much just hits the usual suspects. Not much is new here.

  1. Coava
  2. Good Coffee
  3. Stumptown
  4. Case Study
  5. Nossa Familia
  6. Proud Mary
  7. Deadstock
  8. Never Coffee
  9. Ovation
  10. Portland Coffee Roasters

FourSquare

  1. Barista
  2. Coava
  3. Stumptown
  4. Case Study
  5. Heart
  6. Courier
  7. Spella
  8. Sterling
  9. Water Avenue
  10. Good Coffee
  11. Common Grounds
  12. Nossa Familia

HopCulture

I hadn’t seen Kiosko before:

  1. Proud Mary
  2. Kiosko
  3. Heart
  4. Stumptown
  5. Never Coffee Lab
  6. Good Coffee
  7. Coava

Boam

Fairlane is the only name on this list new to me. Pretty solid list.

  1. Good Coffee
  2. Nossa Familia
  3. PUSH X PULL
  4. Sunny Day Coffee
  5. Ole Latte Coffee
  6. Ovation Coffee & Tea
  7. Fairlane Coffee
  8. Sisters Coffee Company in The Pearl
  9. Coava
  10. Grendel’s Coffee House

R/Espresso Subreddit

As of the time of writing, many subreddits are protesting a recent change to Reddit’s third-party API access rules, so I can’t view the list right now.

Bean Box

  1. Roseline Coffee
  2. Water Avenue Coffee Company
  3. Coava
  4. Good Coffee
  5. Ovation
  6. Proud Mary
  7. Prince
  8. Either/Or
  9. Heart Coffee Roasters
  10. Nossa Familia Coffee
  11. Stumptown

Trip Advisor

This list is focused on cafes rather than coffee… there were some names on it I hadn’t seen before, so I’ll keep it around, but I wouldn’t put much weight on these names (which is why it comes last on my research page).

  1. Cheryl’s on 12th
  2. St Honore
  3. Tin Shed Garden Cafe
  4. Jam on Hawthorne
  5. Island Cafe
  6. Ken’s Artisan Bakery
  7. Stumptown
  8. The Waffle Window
  9. Case Study
  10. Cadillac Cafe
  11. Milo’s City Cafe
  12. Broder
  13. Lovejoy
  14. Gravy
  15. Dragonfly Coffee House
  16. Zell’s
  17. Bleu Door Bakery
  18. Petunia’s Pies & Pastries
  19. Cameo Cafe
  20. Public Domain Coffee
  21. Stepping Stone Cafe
  22. The Daily Feast PDX
  23. Barista
  24. Dulin’s Cafe
  25. Heart Coffee
  26. Babica Hen Cafe
  27. Prasad

Delicious, well-balanced, predominantly Brazilian roasts and espresso.


I’ve had espresso from Nossa before at their northern location in the Pearl, where I had one of their in-house roasts, but when I stopped by for this review they were serving a roast from Keia & Martyn’s Coffee Roasters (also based in Portland). This Black-owned roaster was featured at the SE Nossa cafe I visited for the month of June (I’ll just have to revisit when Nossa’s own roasts are back on espresso 😉).

The Colombian Yeny Capiz roast from Keia & Martyn’s radical line had flavor comps of molasses, sweet red tea, citrus, and candied hazelnut, though I didn’t get much of the citrus. My coffee tasted balanced, nutty, and a bit sweet; the molasses and candied hazelnut came through pretty clearly to me.

Augusto Carneiro started Nossa to bring his family’s coffee from Brazil to Portland, and the shop continues to pride itself on the farmer-roaster relationships it has with all of its growers, who now also come from Guatemala, Nicaragua, Peru, and Ethiopia.

My coffee was solid: well-rounded, tasty, not overpowering. It didn’t show up Courier or Sterling, but it was still clearly in the top echelon of coffees I’ve had in the city.


Nossa Familia Coffee homepage
Keia & Martyn's Coffee Roasters homepage