stringtranslate.com

Strength reduction

In compiler construction, strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations.[1] The classic example of strength reduction converts strong multiplications inside a loop into weaker additions – something that frequently occurs in array addressing.(Cooper, Simpson & Vick 1995, p. 1)

Examples of strength reduction include replacing a multiplication within a loop with an addition and replacing exponentiation within a loop with a multiplication.

Code analysis

Most of a program's execution time is typically spent in a small section of code (called a hot spot), and that code is often inside a loop that is executed over and over.

A compiler uses methods to identify loops and recognize the characteristics of register values within those loops. For strength reduction, the compiler is interested in:

Loop invariants are essentially constants within a loop, but their value may change outside of the loop. Induction variables are changing by known amounts. The terms are relative to a particular loop. When loops are nested, an induction variable in the outer loop can be a loop invariant in the inner loop.

Strength reduction looks for expressions involving a loop invariant and an induction variable. Some of those expressions can be simplified. For example, the multiplication of loop invariant c and induction variable i

c = 7;for (i = 0; i < N; i++){ y[i] = c * i;}

can be replaced with successive weaker additions

c = 7;k = 0;for (i = 0; i < N; i++){ y[i] = k; k = k + c;}

Strength reduction example

Below is an example that will strength-reduce all the loop multiplications that arose from array indexing address calculations.

Imagine a simple loop that sets an array to the identity matrix.

for (i = 0; i < n; i++){ for (j = 0; j < n; j++) { A[i,j] = 0.0; } A[i,i] = 1.0;}

Intermediate code

The compiler will view this code as

0010 ; for (i = 0, i < n; i++)0020 ; {0030 r1 = #0  ; i = 00040 G0000:0050 load r2, n ; i < n0060 cmp r1, r20070 bge G000100800090 ; for (j = 0; j < n; j++)0100 ; {0110 r3 = #0  ; j = 00120 G0002:0130 load r4, n ; j < n0140 cmp r3, r40150 bge G000301600170 ; A[i,j] = 0.0;0180 load r7, n0190 r8 = r1 * r7 ; calculate subscript i * n + j0200 r9 = r8 + r30210 r10 = r9 * #8  ; calculate byte address0220 fr3 = #0.00230 fstore fr3, A[r10]02400250 r3 = r3 + #1  ; j++0260 br G00020270 ; }0280 G0003:0290 ; A[i,i] = 1.0;0300 load r12, n ; calculate subscript i * n + i0310 r13 = r1 * r120320 r14 = r13 + r10330 r15 = r14 * #8  ; calculate byte address0340 fr4 = #1.00350 fstore fr4, A[r15]03600370 ; i++0380 r1 = r1 + #10390 br G00000400 ; }0410 G0001:

This expresses 2-dimensional array A as a 1-dimensional array of n*n size, so that whenever the high-level code expresses A[x, y] it will internally be A[(x*n)+y] for any given valid indices x and y.

Many optimizations

The compiler will start doing many optimizations – not just strength reduction. Expressions that are constant (invariant) within a loop will be hoisted out of the loop. Constants can be loaded outside of both loops—such as floating point registers fr3 and fr4. Recognition that some variables don't change allows registers to be merged; n is constant, so r2, r4, r7, r12 can be hoisted and collapsed. The common value i*n is computed in (the hoisted) r8 and r13, so they collapse. The innermost loop (0120-0260) has been reduced from 11 to 7 intermediate instructions. The only multiply that remains in the innermost loop is line 0210's multiply by 8.

0010 ; for (i = 0, i < n; i++)0020 {0030 r1 = #0  ; i = 00050 load r2, n0130 ; load r4, n killed; use r20180 ; load r7, n killed; use r20300 ; load r12, n killed; use r20220 fr3 = #0.00340 fr4 = #1.00040 G0000:0060 cmp r1, r2 ; i < n0070 bge G000100800190 r8 = r1 * r2 ; calculate subscript i * n0310 ; r13 = r1 * r2 killed; use r8  ; calculate subscript i * n0090 ; for (j = 0; j < n; j++)0100 {0110 r3 = #0  ; j = 00120 G0002:0140 cmp r3, r2 ; j < n0150 bge G000301600170 ; A[i,j] = 0.0;0200 r9 = r8 + r3 ; calculate subscript i * n + j0210 r10 = r9 * #8  ; calculate byte address0230 fstore fr3, A[r10]02400250 r3 = r3 + #1  ; j++0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0320 r14 = r8 + r1 ; calculate subscript i * n + i0330 r15 = r14 * #8  ; calculate byte address0350 fstore fr4, A[r15]03600370 ;i++0380 r1 = r1 + #10390 br G00000400 }0410 G0001:

There are more optimizations to do. Register r3 is the main variable in the innermost loop (0140-0260); it gets incremented by 1 each time through the loop. Register r8 (which is invariant in the innermost loop) is added to r3. Instead of using r3, the compiler can eliminate r3 and use r9. The loop, instead of being controlled by r3 = 0 to n-1, can be controlled by r9=r8+0 to r8+n-1. That adds four instructions and kills four instructions, but there's one fewer instruction inside the loop.

0110 ; r3 = #0 killed  ; j = 00115 r9 = r8 ; new assignment0117 r20 = r8 + r2 ; new limit0120 G0002:0140 ; cmp r3, r2 killed  ; j < n0145 cmp r9, r20 ; r8 + j < r8 + n = r200150 bge G000301600170 ; A[i,j] = 0.0;0200 ; r9 = r8 + r3 killed  ; calculate subscript i * n + j0210 r10 = r9 * #8  ; calculate byte address0230 fstore fr3, A[r10]02400250 ; r3 = r3 + #1 killed  ; j++0255 r9 = r9 + #1  ; new loop variable0260 br G0002

Now r9 is the loop variable, but it interacts with the multiply by 8. Here we get to do some strength reduction. The multiply by 8 can be reduced to some successive additions of 8. Now there are no multiplications inside the loop.

0115 r9 = r8 ; new assignment0117 r20 = r8 + r2 ; new limit0118 r10 = r8 * #8  ; initial value of r100120 G0002:0145 cmp r9, r20 ; r8 + j < r8 + n = r200150 bge G000301600170 ; A[i,j] = 0.0;0210 ; r10 = r9 * #8 killed  ; calculate byte address0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0255 r9 = r9 + #1  ; loop variable0260 br G0002

Registers r9 and r10 (= 8*r9) aren't both needed; r9 can be eliminated in the loop. The loop is now 5 instructions.

0115 ; r9 = r8 killed0117 r20 = r8 + r2 ; limit0118 r10 = r8 * #8  ; initial value of r100119 r22 = r20 * #8  ; new limit0120 G0002:0145 ; cmp r9, r20 killed  ; r8 + j < r8 + n = r200147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0255 ; r9 = r9 + #1 killed  ; loop variable0260 br G0002

Outer loop

Back to the whole picture:

0010 ; for (i = 0, i < n; i++)0020 {0030 r1 = #0  ; i = 00050 load r2, n0220 fr3 = #0.00340 fr4 = #1.00040 G0000:0060 cmp r1, r2 ; i < n0070 bge G000100800190 r8 = r1 * r2 ; calculate subscript i * n0117 r20 = r8 + r2 ; limit0118 r10 = r8 * #8  ; initial value of r100119 r22 = r20 * #8  ; new limit0090 ; for (j = 0; j < n; j++)0100 {0120 G0002:0147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0320 r14 = r8 + r1 ; calculate subscript i * n + i0330 r15 = r14 * #8  ; calculate byte address0350 fstore fr4, A[r15]03600370 ;i++0380 r1 = r1 + #10390 br G00000400 }0410 G0001:

There are now four multiplications within the outer loop that increments r1. Register r8 = r1*r2 at 0190 can be strength reduced by setting it before entering the loop (0055) and incrementing it by r2 at the bottom of the loop (0385).

The value r8*8 (at 0118) can be strength reduced by initializing it (0056) and adding 8*r2 to it when r8 gets incremented (0386).

Register r20 is being incremented by the invariant/constant r2 each time through the loop at 0117. After being incremented, it is multiplied by 8 to create r22 at 0119. That multiplication can be strength reduced by adding 8*r2 each time through the loop.

0010 ; for (i = 0, i < n; i++)0020 {0030 r1 = #0  ; i = 00050 load r2, n0220 fr3 = #0.00340 fr4 = #1.00055 r8 = r1 * r2 ; set initial value for r80056 r40 = r8 * #8  ; initial value for r8 * 80057 r30 = r2 * #8  ; increment for r400058 r20 = r8 + r2 ; copied from 01170058 r22 = r20 * #8  ; initial value of r220040 G0000:0060 cmp r1, r2 ; i < n0070 bge G000100800190 ; r8 = r1 * r2 killed  ; calculate subscript i * n0117 ; r20 = r8 + r2 killed - dead code0118 r10 = r40 ; strength reduced expression to r400119 ; r22 = r20 * #8 killed  ; strength reduced0090 ; for (j = 0; j < n; j++)0100 {0120 G0002:0147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0320 r14 = r8 + r1 ; calculate subscript i * n + i0330 r15 = r14 * #8  ; calculate byte address0350 fstore fr4, A[r15]03600370 ;i++0380 r1 = r1 + #10385 r8 = r8 + r2 ; strength reduce r8 = r1 * r20386 r40 = r40 + r30 ; strength reduce expression r8 * 80388 r22 = r22 + r30 ; strength reduce r22 = r20 * 80390 br G00000400 }0410 G0001:

The last multiply

That leaves the two loops with only one multiplication operation (at 0330) within the outer loop and no multiplications within the inner loop.

0010 ; for (i = 0, i < n; i++)0020 {0030 r1 = #0  ; i = 00050 load r2, n0220 fr3 = #0.00340 fr4 = #1.00055 r8 = r1 * r2 ; set initial value for r80056 r40 = r8 * #8  ; initial value for r8 * 80057 r30 = r2 * #8  ; increment for r400058 r20 = r8 + r2 ; copied from 01170058 r22 = r20 * #8  ; initial value of r220040 G0000:0060 cmp r1, r2 ; i < n0070 bge G000100800118 r10 = r40 ; strength reduced expression to r400090 ; for (j = 0; j < n; j++)0100 {0120 G0002:0147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0320 r14 = r8 + r1 ; calculate subscript i * n + i0330 r15 = r14 * #8  ; calculate byte address0350 fstore fr4, A[r15]03600370 ;i++0380 r1 = r1 + #10385 r8 = r8 + r2 ; strength reduce r8 = r1 * r20386 r40 = r40 + r30 ; strength reduce expression r8 * 80388 r22 = r22 + r30 ; strength reduce r22 = r20 * 80390 br G00000400 }0410 G0001:

At line 0320, r14 is the sum of r8 and r1, and r8 and r1 are being incremented in the loop. Register r8 is being bumped by r2 (=n) and r1 is being bumped by 1. Consequently, r14 is being bumped by n+1 each time through the loop. The last loop multiply at 0330 can be strength reduced by adding (r2+1)*8 each time through the loop.

0010 ; for (i = 0, i < n; i++)0020 {0030 r1 = #0  ; i = 00050 load r2, n0220 fr3 = #0.00340 fr4 = #1.00055 r8 = r1 * r2 ; set initial value for r80056 r40 = r8 * #8  ; initial value for r8 * 80057 r30 = r2 * #8  ; increment for r400058 r20 = r8 + r2 ; copied from 01170058 r22 = r20 * #8  ; initial value of r22005A r14 = r8 + r1 ; copied from 0320005B r15 = r14 * #8  ; initial value of r15 (0330)005C r49 = r2 + #1005D r50 = r49 * #8  ; strength reduced increment0040 G0000:0060 cmp r1, r2 ; i < n0070 bge G000100800118 r10 = r40 ; strength reduced expression to r400090 ; for (j = 0; j < n; j++)0100 {0120 G0002:0147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0320 ; r14 = r8 + r1 killed  ; dead code0330 ; r15 = r14 * #8 killed  ; strength reduced0350 fstore fr4, A[r15]03600370 ;i++0380 r1 = r1 + #10385 r8 = r8 + r2 ; strength reduce r8 = r1 * r20386 r40 = r40 + r30 ; strength reduce expression r8 * 80388 r22 = r22 + r30 ; strength reduce r22 = r20 * 80389 r15 = r15 + r50 ; strength reduce r15 = r14 * 80390 br G00000400 }0410 G0001:

There's still more to go. Constant folding will recognize that r1=0 in the preamble, so several instructions will clean up. Register r8 isn't used in the loop, so it can disappear.

Furthermore, r1 is only being used to control the loop, so r1 can be replaced by a different induction variable such as r40. Where i went 0 <= i < n, register r40 goes 0 <= r40 < 8 * n * n.

0010 ; for (i = 0, i < n; i++)0020 {0030 ; r1 = #0  ; i = 0, becomes dead code0050 load r2, n0220 fr3 = #0.00340 fr4 = #1.00055 ; r8 = #0 killed  ; r8 no longer used0056 r40 = #0  ; initial value for r8 * 80057 r30 = r2 * #8  ; increment for r400058 ; r20 = r2 killed  ; r8 = 0, becomes dead code0058 r22 = r2 * #8  ; r20 = r2005A ; r14 = #0 killed  ; r8 = 0, becomes dead code005B r15 = #0  ; r14 = 0005C r49 = r2 + #1005D r50 = r49 * #8  ; strength reduced increment005D r60 = r2 * r30 ; new limit for r400040 G0000:0060 ; cmp r1, r2 killed  ; i < n; induction variable replaced0065 cmp r40, r60 ; i * 8 * n < 8 * n * n0070 bge G000100800118 r10 = r40 ; strength reduced expression to r400090 ; for (j = 0; j < n; j++)0100 {0120 G0002:0147 cmp r10, r22 ; r10 = 8*(r8 + j) < 8*(r8 + n) = r220150 bge G000301600170 ; A[i,j] = 0.0;0230 fstore fr3, A[r10]02400245 r10 = r10 + #8  ; strength reduced multiply0260 br G00020270 }0280 G0003:0290 ; A[i,i] = 1.0;0350 fstore fr4, A[r15]03600370 ;i++0380 ; r1 = r1 + #1 killed  ; dead code (r40 controls loop)0385 ; r8 = r8 + r2 killed  ; dead code0386 r40 = r40 + r30 ; strength reduce expression r8 * 80388 r22 = r22 + r30 ; strength reduce r22 = r20 * 80389 r15 = r15 + r50 ; strength reduce r15 = r14 * 80390 br G00000400 }0410 G0001:

Other strength reduction operations

Operator strength reduction uses mathematical identities to replace slow math operations with faster operations. The benefits depend on the target CPU and sometimes on the surrounding code (which can affect the availability of other functional units within the CPU).

Induction variable (orphan)

Induction variable or recursive strength reduction replaces a function of some systematically changing variable with a simpler calculation using previous values of the function. In a procedural programming language this would apply to an expression involving a loop variable and in a declarative language it would apply to the argument of a recursive function. For example,

 f x = ... (3 ** x) ... (f (x + 1)) ...

becomes

 f x = f' x 1 where f' x z = ... z ... (f' (x + 1) (3 * z)) ...

Here modified recursive function f takes a second parameter z = 3 ** x, allowing the expensive computation (3 ** x) to be replaced by the cheaper (3 * z).

See also

Notes

  1. ^ Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. ISBN 978-1-55860-320-2. Strength reduction.
  2. ^ In languages such as C and Java, integer division has round-towards-zero semantics, whereas a bit-shift always rounds down, requiring special treatment for negative numbers. For example, in Java, -3 / 2 evaluates to -1, whereas -3 >> 1 evaluates to -2. So in this case, the compiler cannot optimize division by two by replacing it by a bit shift.
  3. ^ Granlund, Torbjörn; Peter L. Montgomery. "Division by Invariant Integers Using Multiplication" (PDF).
  4. ^ Jones, Nigel. "Division of integers by constants". Archived from the original on 26 March 2024.

References