**Kanru Hua**@khuasw

Thinking about opening a thread to explain to non-experts how we optimized neural network inference in the recent Synthesizer V update. Basically a detailed version of yesterday's stream. Anyone interested?

2021-02-20 01:21:40**Kanru Hua**@khuasw

To sum up what we did in Synthesizer V Studio 1.2 in one phrase, it would be JIT-compiled quantized sparse matrix-vector multiplication kernels. Hold on, I know this is a roller coaster of jargon. In this thread I will explain by breaking this down part by part. twitter.com/khuasw/status/…

2021-02-20 19:12:56**Kanru Hua**@khuasw

Since this might turn into a long thread, I'll focus on one topic a day. Are you ready?

2021-02-20 19:12:57**Kanru Hua**@khuasw

(1) Matrix-vector multiplication (MVM). An artificial neural network boils down to a bunch of really simple arithmetic operations, e.g. a + b * x1 + c * x2 + ... But when you (purposefully) compose millions of these together, they can be turned into really complicated machines.
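In plain Python, one of those simple equations might look like this (a toy sketch, not the actual engine code):

```python
# A toy version of the weighted sum above: a + b * x1 + c * x2 + ...
# Here `bias` plays the role of a, and `weights` are the b, c, ... parameters.
def neuron(bias, weights, inputs):
    total = bias
    for w, x in zip(weights, inputs):
        total += w * x
    return total

print(neuron(1.0, [2.0, 3.0], [10.0, 20.0]))  # 1 + 2*10 + 3*20 = 81.0
```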

2021-02-20 19:30:49**Kanru Hua**@khuasw

Essentially what we do to build a voice is to pick some “magical” a, b, c, ... that best represent the voice, and plug them into the equations, except that there are millions of them. These a, b, c are called parameters.

2021-02-20 19:30:49**Kanru Hua**@khuasw

Writing these equations down one at a time is quite unwieldy, so linear algebra comes to the rescue. In an extremely simplified view, matrices & vectors are notations for doing simple math in large batches.
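In that simplified view, a matrix-vector multiplication is just many of those weighted sums done in one batch, one per row of the matrix (again a toy sketch):

```python
# Multiply a matrix (list of rows) by a vector: each output element
# is one weighted sum, all computed in a single batch.
def mvm(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

print(mvm([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0]))  # [50.0, 110.0]
```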

2021-02-20 19:30:50**Kanru Hua**@khuasw

Some neural network models are mainly composed of matrix-matrix multiplication. In our case, the bottleneck is matrix-vector multiplication, mainly used in a network that generates waveform samples (the “neural vocoder”).

2021-02-20 19:30:50**Kanru Hua**@khuasw

The challenge here is not only that we have a large network, but also the fact that this network needs to run tens of thousands of times per second to synthesize high quality audio in real time. That would break down to billions of additions and multiplications per second.

2021-02-20 19:30:50**Kanru Hua**@khuasw

Modern CPUs run at a few billion cycles per second ("gigahertz"). This is on a similar order of magnitude as the number of operations per second above, but the margin is very tight. In fact, not all CPU cycles can do useful work (we will visit this point later). Very challenging!

2021-02-20 19:30:51**Kanru Hua**@khuasw

That is our problem setup. Our goal is to make this MVM operation as fast as possible on a modern CPU. Let's take a break here as our brains need to rest. I'll be back tomorrow. Feel free to leave questions!

2021-02-20 19:30:51**Kanru Hua**@khuasw

Day 2 - Sparse Matrix-Vector Multiplication (SpMVM). Today we’re taking the first step towards accelerating our neural network. One lucky fact is that out of the millions of parameters (those a, b, c…) in that big matrix, a lot of them are just redundant. twitter.com/khuasw/status/…

2021-02-21 19:13:27**Kanru Hua**@khuasw

In fact you can throw away a lot of these parameters without hurting sound quality; this results in what we call a sparse matrix (and the opposite is, you guessed it, a dense matrix).

2021-02-21 19:13:27**Kanru Hua**@khuasw

Of course there are still parameters that are truly important and can’t be thrown away. If you remove too many of the parameters, eventually the quality will drop. The synthesized voice will sound more and more like it’s coming from a walkie-talkie, until it completely degrades into noise.

2021-02-21 19:13:27**Kanru Hua**@khuasw

So, the art is to remove the less contributing parameters carefully & remove as many as possible without hurting the quality. There are many tricks to do that (which go beyond the scope of this talk). If done properly, we can often get rid of 3/4 of the parameters!
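The thread doesn’t say which tricks are used in practice, but one of the simplest is magnitude pruning: drop the parameters closest to zero. A toy sketch (the function name and the 1/4 keep ratio are just for illustration):

```python
def prune(matrix, keep_ratio=0.25):
    # Zero out the smallest-magnitude parameters, keeping only
    # `keep_ratio` of them (here 1/4, i.e. removing 3/4).
    flat = sorted(abs(w) for row in matrix for w in row)
    cutoff = flat[int(len(flat) * (1 - keep_ratio))]
    return [[w if abs(w) >= cutoff else 0.0 for w in row] for row in matrix]

print(prune([[1.0, -2.0], [3.0, 4.0]]))  # [[0.0, 0.0], [0.0, 4.0]]
```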

2021-02-21 19:13:28**Kanru Hua**@khuasw

However, reducing the number of parameters by 4x won’t necessarily mean that we get a 4x speed boost for free. When executing the sparse neural network, the program needs to skip the parameters that were removed. This skipping process adds a sometimes expensive overhead.

2021-02-21 19:13:28**Kanru Hua**@khuasw

There exist many smarter ways to arrange how parameters are stored in computer memory to minimize this overhead. However, even in the most optimistic case, it’s hard to get close to a perfect 1:1 speed-up (i.e. 4x fewer parameters giving a 4x speed boost). We will go back to this point on day 4.
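One common such arrangement (not necessarily the exact one used here) is compressed sparse row (CSR) storage: keep only the non-zero parameters plus their column positions. A minimal Python sketch:

```python
def to_csr(matrix):
    # Store only the non-zero parameters, plus each one's column index
    # and where each row starts (a simplified CSR layout).
    values, cols, row_starts = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                cols.append(j)
        row_starts.append(len(values))
    return values, cols, row_starts

def sparse_mvm(csr, vector):
    values, cols, row_starts = csr
    out = []
    for r in range(len(row_starts) - 1):
        s = 0.0
        # The "skipping" overhead: an extra indirection through `cols`
        # instead of reading the vector sequentially.
        for k in range(row_starts[r], row_starts[r + 1]):
            s += values[k] * vector[cols[k]]
        out.append(s)
    return out

csr = to_csr([[0.0, 2.0], [3.0, 0.0]])
print(sparse_mvm(csr, [1.0, 10.0]))  # [20.0, 3.0]
```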

2021-02-21 19:13:28**Kanru Hua**@khuasw

To sum up today’s topic. Going sparse is an effective way to compress a neural network. If done right, it can still speed up execution by a few times, although this would require highly optimized code for SpMVM. Tomorrow we will review another way to accelerate MVM: quantization.

2021-02-21 19:13:28**Kanru Hua**@khuasw

Day 3 in a series explaining accelerating neural network inference to a non-technical audience - Quantized Matrix-Vector Multiplication. twitter.com/khuasw/status/…

2021-02-22 21:30:32**Kanru Hua**@khuasw

Historically there’s been a trend of integer arithmetic running faster than its decimal counterpart. It’s not hard to see why: a decimal number has to be stored in two parts. For example, pi ≈ 3.1415926 splits into the digits 31415926 plus a 1 for the decimal point position (hence “floating point”).

2021-02-22 21:30:32**Kanru Hua**@khuasw

The support for floating point addition, multiplication, ... requires more circuitry, so early CPUs were weak at number crunching unless you converted all numbers into integers, did the computation, and scaled the results back to decimals. This is called fixed-point arithmetic.

2021-02-22 21:30:33**Kanru Hua**@khuasw

The performance disparity between floating point and fixed point arithmetic on Intel and AMD chips has largely become a thing of the past now that CPUs fit billions of transistors.

2021-02-22 21:30:33**Kanru Hua**@khuasw

However, there’s still one thing good about integers. If precision isn’t that important, you can use only 2 bytes to store an integer, as opposed to the 4 bytes required by a single-precision float (a standard format supported by a lot of chips).

2021-02-22 21:30:33**Kanru Hua**@khuasw

On some chips it’s not only a reduction in space consumption. If you can process one 4-byte block at a time, you can just as well process two 2-byte blocks at a time. If you can process four 4-byte blocks at a time, that scales to eight 2-byte blocks. This means 2x the bandwidth for free!

2021-02-22 21:30:34**Kanru Hua**@khuasw

This feature is called single-instruction-multiple-data (SIMD). On the x86 platform it was introduced with the early Pentiums in the 90s. AMD followed suit. By 2020 the latest CPUs can process 16 4-byte floats, or 32 2-byte integers, or 64 1-byte integers, a few billion times per second!

2021-02-22 21:30:34**Kanru Hua**@khuasw

So SIMD and fixed point arithmetic are awesome, but how do they relate to neural nets? It turns out neural nets don’t require a lot of precision to work. In most cases, you won’t even be able to tell the difference if we switch from 4-byte floats to 2-byte integers.
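To see why so little is lost, here is a sketch of one simple float-to-integer conversion (a symmetric scheme with one shared scale, invented for illustration; real quantizers are more elaborate):

```python
def quantize(weights, bits=16):
    # Map floats onto integers in [-(2**(bits-1)-1), 2**(bits-1)-1]
    # using one shared scale factor.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

weights = [0.5, -1.25, 0.031]
q, scale = quantize(weights)
restored = [v * scale for v in q]
# The round trip through 2-byte integers loses only a tiny amount
# of precision.
```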

2021-02-22 21:30:35**Kanru Hua**@khuasw

In some extreme cases you can even go down to 1-byte integers (which means 8-bit, a precision even a Famicom natively supports). Even in the cases when we see a quality loss, there are ways to make up for that by retraining the low-precision neural network.

2021-02-22 21:30:35**Kanru Hua**@khuasw

There is however one problem with fixed-point numbers: the range of representable numbers becomes very limited with so few bits available. For example, if you’re summing up one thousand ones, you’ll get 1000, which won’t fit in a 1-byte integer.

2021-02-22 21:30:36**Kanru Hua**@khuasw

Now looking back, I think this is already a lot of information for today, so let’s take a rest and chew on this. We’ll continue tomorrow to see how to solve the range problem and what implications it has for MVM performance.

2021-02-22 21:30:36**Kanru Hua**@khuasw

Day 4 in a series explaining accelerating neural network inference to a non-technical audience. Yesterday we talked about the awesome speed up thanks to integer representations of parameters (and if you don’t understand, click through to Day 1). We however had a range problem. twitter.com/khuasw/status/…

2021-02-23 19:46:13**Kanru Hua**@khuasw

The range for a signed 8-bit integer is from -128 to 127 (so there are 256 numbers in total) but the multiplication of two 8-bit integers can easily go beyond this range. For example, 40 * 40 = 1600, which is way larger than 127.

2021-02-23 19:46:13**Kanru Hua**@khuasw

Even addition and subtraction have this problem. Obviously if you add 1 to 127, it will go beyond the largest representable value, not to mention if you are adding a bunch of these together.

2021-02-23 19:46:13**Kanru Hua**@khuasw

What will happen when the result of addition/multiplication goes out of the range? It will be wrapped back to the lower end of the range. For example 127 + 1 wraps back to -128 and 127 + 2 wraps to -127… Yes, you’re literally getting a negative number.
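This wrap-around can be simulated in plain Python (real CPUs do it in hardware for 8-bit registers; the helper name here is made up):

```python
def wrap_int8(x):
    # Simulate how a signed 8-bit integer overflows: the result wraps
    # around into the -128..127 range.
    return (x + 128) % 256 - 128

print(wrap_int8(127 + 1))   # -128
print(wrap_int8(127 + 2))   # -127
print(wrap_int8(40 * 40))   # 1600 wraps to 64
```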

2021-02-23 19:46:14**Kanru Hua**@khuasw

When this wrapping/overflow happens, the synthesized voice will sound like it’s coming from a mistuned radio, or it’ll just be complete noise. There is one way to prevent this kind of overflow: carefully scale the values down before doing MVM to make sure the result will be in range.

2021-02-23 19:46:14**Kanru Hua**@khuasw

But even with scaling, overflow still happens occasionally, especially when the user throws in a rare input or some extreme values not seen during development. We don’t want a product that randomly malfunctions. Luckily we still have one trick up our sleeve: saturated arithmetic.

2021-02-23 19:46:14**Kanru Hua**@khuasw

There are special instructions supported by some (not all) CPUs that change the behavior of arithmetic overflow. Back to the 127 + 1 example: normally it wraps to -128, but with saturated addition, the result will be capped at 127. Still not perfect, but much better than going negative.
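In code, saturation is just a clamp (sketched here in Python; on x86 the hardware does this in instructions such as PADDSB/PADDSW):

```python
def saturating_add_int8(a, b):
    # Clamp the result into the signed 8-bit range instead of wrapping.
    return max(-128, min(127, a + b))

print(saturating_add_int8(127, 1))    # 127 instead of wrapping to -128
print(saturating_add_int8(-128, -5))  # -128
```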

2021-02-23 19:46:15**Kanru Hua**@khuasw

What do we have so far? We know that some CPUs can do fixed-point arithmetic at a lower precision but with greater throughput. We know there are challenges with fixed-point arithmetic, but we also found solutions.

2021-02-23 19:46:15**Kanru Hua**@khuasw

Fixing the overflow problem requires extra code to scale the parameters, to do saturated arithmetic, etc. Just like the case with sparse MVM, the extra code adds an overhead. So, is it worth going fixed-point? Well, it’s complicated, as it depends on what the CPU has to offer.

2021-02-23 19:46:15**Kanru Hua**@khuasw

For example, on Intel’s side, Sandy Bridge and Ivy Bridge processors (i3/i5/i7 2xxx, 3xxx) can do 8 float operations in one cycle, but they can’t do 16 2-byte integer operations until the next generation (Haswell, i3/i5/i7 4xxx).

2021-02-23 19:46:16**Kanru Hua**@khuasw

AMD processors had the equivalent support on paper from the Excavator series onwards (e.g. Athlon X4 845), but their implementation wasn’t well-optimized until the much more recent Zen 2. Once optimized well enough, though, they seem to do fixed-point even faster than Intel.

2021-02-23 19:46:16**Kanru Hua**@khuasw

Taking some time to organize the words. Today's installment of the explaining-neural-network-acceleration-in-plain-words series will come a bit later, maybe tomorrow morning.

2021-02-25 00:00:01**Kanru Hua**@khuasw

Day 5 in a series explaining accelerating neural network inference to a non-technical audience. We have so far reviewed two approaches: (1) making matrices sparse and (2) quantizing them into integers (and if you don’t understand the jargon, click through to Day 1). twitter.com/khuasw/status/…

2021-02-25 12:08:49**Kanru Hua**@khuasw

What’s common between these two methods: both make (very) small quality sacrifices in exchange for huge speed boosts, and both add some computation overhead by deviating from the standard way of doing matrix multiplication.

2021-02-25 12:08:50**Kanru Hua**@khuasw

We are interested in applying both of them to achieve even greater acceleration. That’d be a sparse matrix with all parameters stored in small integers. However, whether this is really going to speed things up or not depends on the specific hardware the program runs on.

2021-02-25 12:08:50**Kanru Hua**@khuasw

As noted before, very old PCs perform better with quantization; your typical “old” PC by 2021 standards, however, may run a tad faster without it; and then again, the fresh-off-the-shelf ones perform better with it.

2021-02-25 12:08:50**Kanru Hua**@khuasw

The same happens for sparse MVM, but this time it’s beyond just a matter of hardware. For not-so-sparse matrices (e.g. when more than half the numbers are still left intact), it is probably just easier and faster to treat them as dense matrices.

2021-02-25 12:08:51**Kanru Hua**@khuasw

Even for those very sparse matrices, we still have a few choices on exactly how to store the data in memory, and in what order to load it into the CPU and process it. Each way has its strengths and weaknesses and suits certain types of sparse matrices.

2021-02-25 12:08:51**Kanru Hua**@khuasw

Neural networks used in Synthesizer V AI come in a lot of different sizes. Some can be made sparse, some cannot. Our software runs on every x86 CPU since Pentium 4 (2004). There are endless permutations of hardware versus matrix types and sizes.

2021-02-25 12:08:51**Kanru Hua**@khuasw

How should we program the neural network (or specifically, the MVM routine) to support all these cases? Are we seriously going to write 100 or 1000 versions targeting each case? No. Instead of writing the program itself, we let the program write another program to do the work!

2021-02-25 12:08:52**Kanru Hua**@khuasw

What’s more interesting: all of this happens on the fly. We are not shipping a prebuilt program with all these scenarios baked into it (which would be absurdly large). We are shipping a “program builder” that takes in a neural network and generates code to run it.
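As a toy illustration of the idea (the real builder emits machine code; this sketch generates Python source instead, and all the names are made up):

```python
def build_mvm_kernel(matrix):
    # "Program builder": emit straight-line source specialized to one
    # matrix, skipping zero parameters entirely, then compile and
    # return the resulting function.
    lines = ["def kernel(v):", "    out = [0.0] * %d" % len(matrix)]
    for i, row in enumerate(matrix):
        for j, w in enumerate(row):
            if w != 0.0:
                lines.append("    out[%d] += %r * v[%d]" % (i, w, j))
    lines.append("    return out")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["kernel"]

kernel = build_mvm_kernel([[0.0, 2.0], [3.0, 0.0]])
print(kernel([1.0, 10.0]))  # [20.0, 3.0]
```

The generated kernel has no loops and no zero-skipping logic left at run time; the structure of the matrix is baked directly into the code, which is the essence of the JIT approach.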

2021-02-25 12:08:52**Kanru Hua**@khuasw

What is the code here? If you’re into programming, you might know that C, C++ and Fortran are among the most performant programming languages. Even faster than that is assembly language, which directly translates to the 0s and 1s that a CPU can read.

2021-02-25 12:08:53