Introduction to Code Optimization
Source: Katrin Becker, 1999
-
-
- Reference: High Performance Computing, 2nd Ed., by Kevin Dowd & Charles Severance, O'Reilly, 1998.
-
- All along we have been saying how important it is to write clear, easy to understand code and that in most circumstances, easy-to-understand should take precedence over efficient code as measured by both execution time and use of machine resources. This remains true, but there are times when efficiency is also important. A good programmer has the ability to write code that is efficient as well as being clear and maintainable.
-
- A good program is a work of art. To qualify, it must achieve a balance appropriate to its intended use. We've spent most of the year talking about clear, maintainable code. This set we will spend a bit of time looking at the other end of the scale: efficiency and code optimization.
-
- What an optimizing compiler can do: ( you can do some of this yourself &/or make it easier for the compiler )
-
- Copy Propagation: (untangling calculations)
- X = Y;
- Z = 1.0 + X;
-
- : we need the result of the first statement before we can do the second; this is a flow dependence
- : we can make the second statement independent of the first by propagating the copied value (Y) into the use of X
-
- X = Y;
- Z = 1.0 + Y;
- : now they are independent; so why keep X = Y;?
- : we don't know yet where we need X if at all
- : if we find out X is not used elsewhere, we can get rid of it with dead code removal
-
- Constant Folding:
- when several constants are combined in an expression, the arithmetic can be done once at compile time instead of at run time
- int i = 100;
- int j,k;
- k = 200;
- j = i + k;
-
- : i and k both end up being constant (i.e. never changed), so j can be computed at compile time
-
- Dead Code Removal:
- : have no effect on the answers
- : change nothing if removed
- : 2 kinds: unreachable (e.g. code after a return statement)
- : produce results never used
-
- Strength Reduction:
- - replacing more expensive calculations with cheaper ones
- Y = X**2;
- J = K*2;
-
- 1) what would probably happen: a call to a math library routine (convert to a logarithm, multiply, convert back) -OR- the compiler notices the small power and does X*X instead
-
- 2) multiply by 2 could be recognized as shift, or just K+K
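- A minimal sketch of both reductions done by hand (the function names are illustrative, not from the text):

```cpp
#include <cmath>

// Strength reduction by hand.
// pow(x, 2.0) goes through the math library; x * x is a single multiply.
double square_slow(double x) { return std::pow(x, 2.0); }
double square_fast(double x) { return x * x; }

// k * 2 can become a shift (k << 1) or just an add (k + k).
int double_slow(int k) { return k * 2; }
int double_fast(int k) { return k << 1; }
```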
-
- Variable Renaming:
- x = y * z;
- q = r + x + x;
- x = a + b;
-
- notice x is being recycled: could do
- x0 = y * z;
- q = r + x0 + x0;
- x = a + b;
-
- This makes the first two statements independent of the third, so they can be done in parallel
-
- Common Sub-Expression Elimination:
- D = C * ( A + B );
- E = ( A + B ) / 2;
-
- t = (A + B); // (A + B) need only be done once
- D = C * t;
- E = t / 2;
-
- Address Calculations:
- A[i][j] may actually be done this way:
- address(A) + ( i * column_dimension(A) + j ) * sizeof_datatype(A)
- (row-major order with 0-based subscripts, which is how C and C++ store arrays)
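- A small sketch of that address arithmetic for a row-major C++ array (the 4x5 shape and helper name are just for illustration):

```cpp
#include <cstddef>

double A[4][5];  // row-major: each row of 5 doubles is contiguous

// Compute the address of A[i][j] by hand:
// base + (i * num_columns + j) * sizeof(element)
double* element_address(int i, int j) {
    char* base = reinterpret_cast<char*>(&A[0][0]);
    std::size_t offset = (i * 5 + j) * sizeof(double);
    return reinterpret_cast<double*>(base + offset);
}
```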
-
- Loop Invariant Code Motion:
- - moving calculations whose results don't change from iteration to iteration out of the loop
- for( i=1; i<=N;i++ )
- {
- A[i] = B[i] + C * D;
- E = G[k];
- }
-
- t = C * D;
- for( i=1; i<=N;i++ )
- A[i] = B[i] + t;
- E = G[k];
-
- Induction Variable Simplification:
- - induction variables are those that change as a linear function of the loop iteration count
-
- for( i = 1; i <= N; i++)
- k = i * 4 + m;
-
- k = m;
- for( i = 1; i <= N; i++)
- k = k + 4;
-
- NOTE: these are not strictly equivalent (if N is 0, the first version never sets k); this doesn't look terribly useful - till you realize that this is exactly what happens in array address calculations.
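- A sketch showing the two forms produce the same sequence of k values (the helper names are illustrative):

```cpp
#include <vector>

// Original form: a multiply every iteration.
std::vector<int> with_multiply(int n, int m) {
    std::vector<int> ks;
    for (int i = 1; i <= n; i++)
        ks.push_back(i * 4 + m);
    return ks;
}

// Simplified form: the multiply is replaced by a running add.
std::vector<int> with_add(int n, int m) {
    std::vector<int> ks;
    int k = m;
    for (int i = 1; i <= n; i++) {
        k = k + 4;
        ks.push_back(k);
    }
    return ks;
}
```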
-
- Eliminating Clutter:
- Things that contribute to overhead:
- - subroutine calls
- - indirect memory references
- - tests within loops
- - wordy tests
- - type conversions
- - variables preserved unnecessarily
- Things that restrict compiler flexibility
- - subroutine calls
- - indirect memory reference
- - tests within loops
- - ambiguous pointers
-
- Subroutine Calls:
- What's involved in a subroutine call?
- example: a big corporation: Dept.A has a pile of paperwork to be completed by Dept.B
- - Dept.A finishes its part
- - packages it up: data; forms; charge#'s, forms, etc.
- - transfer to Dept.B
- - unpack
- - do it
- - repack
- - return to Dept.A
- A subroutine call is very much like that.
-
- How to make things more efficient:
- - use local variables whenever possible (referencing reference parameters or globals is often more expensive)
- - use macros?
- - use inline functions?
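- A hedged sketch of those tips: a local accumulator plus a small inline function the compiler can expand in place (all names are illustrative):

```cpp
// A small function declared inline: the compiler can substitute the body
// at the call site, avoiding the package-up/unpack cost of a real call.
inline double scale(double x, double c) { return x * c; }

double sum_scaled(const double* a, int n, double c) {
    double total = 0.0;          // local accumulator, not a global
    for (int i = 0; i < n; i++)
        total += scale(a[i], c); // likely inlined: no call overhead
    return total;
}
```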
-
- Branches Within Loops:
-
- small = 1.0e-20;
- for( i = 1; i <= n; i++)
- if (fabs(A[i]) >= small)
- B[i] += A[i]*C;
-
- - trying to avoid computations:
- - the idea is that if A[i] is sufficiently small, it won't change B[i] enough to matter so we can avoid some unnecessary computation without changing the outcome
- - on newer machines this is probably not worth it: the computation is likely faster than the branch
- - on older architectures it's the other way around
-
- Loop Invariant Conditionals:
- - conditional branches whose outcome is not affected by the loop
- for(i=1; i<=n; i++)
- if (n == 0)
- A[i] += B[i] * C;
- else
- A[i] = 0;
-
- Since n doesn't change, neither will the outcome of the test:
- if (n == 0)
- for(i=1; i<=n; i++)
- A[i] += B[i] * C;
- else
- for(i=1; i<=n; i++)
- A[i] = 0;
-
- Loop Index Dependent Conditionals:
- - the outcome of this conditional does depend on the loop variables, but in a predictable way
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- if (j < i)
- A[i][j] += B[i][j]*C;
- else
- A[i][j] = 0.0;
- for(i=1; i<=n; i++)
- {
- for(j=1; j<=i-1; j++)
- A[i][j] += B[i][j]*C;
- for(j=i; j<=n; j++)
- A[i][j] = 0.0;
- }
-
- Independent Loop Conditionals:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- if (B[i][j] > 1.0)
- A[i][j] += B[i][j] * C;
- Not much to be done with this one, but since every iteration is independent, the loop can be unrolled or can be performed in parallel.
-
- Dependent Loop Conditionals:
- for(i=1; i<=n; i++)
- if ( X < A[i])
- X += B[i]*2;
-
- In a case like this there is no choice but to execute exactly as written.
-
- Reductions:
- for( i = 0; i < n; i++)
- z = a[i] > z ? a[i] : z;
-
- This one, on first inspection, looks like the same kind of loop as the previous one. BUT we are looking for the biggest value in the array - and that value will be (essentially) the same no matter how we go about looking for it. So, we can restructure the loop to check several elements at a time (initializing the partial results first):
- z0 = z1 = z;
- for( i = 0; i < n; i+=2)
- {
- z0 = a[i] > z0 ? a[i] : z0;
- z1 = a[i+1] > z1 ? a[i+1] : z1;
- }
- z = z0 > z1 ? z0 : z1;
-
- This way cuts down on the number of iterations, and allows for the possibility of parallel processing.
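- A sketch of the unrolled reduction as a complete function, with the leftover element handled when n is odd (0-based indexing; the name is illustrative):

```cpp
// Two-way unrolled max reduction. The two comparison streams z0 and z1
// are independent, so they could proceed in parallel.
double max_unrolled(const double* a, int n, double z) {
    double z0 = z, z1 = z;
    int i = 0;
    for (; i + 1 < n; i += 2) {   // main unrolled body
        z0 = a[i]   > z0 ? a[i]   : z0;
        z1 = a[i+1] > z1 ? a[i+1] : z1;
    }
    if (i < n)                    // leftover element when n is odd
        z0 = a[i] > z0 ? a[i] : z0;
    return z0 > z1 ? z0 : z1;     // combine the partial results
}
```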
- Conditionals That Transfer Control:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- {
- if (B[i][j] == 0)
- {
- cout << i << j << endl;
- exit(1);
- }
- A[i][j] /= B[i][j] ;
- }
- This forces iterations to be done precisely in order. A standard trap (like an exception) allows the programmer to achieve efficient code yet still detect the error.
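- A sketch of the trap idea using a C++ exception instead of printing and exiting inside the loop (names are illustrative):

```cpp
#include <stdexcept>

// Throwing from inside the loop moves the error path out of line:
// the normal case stays tight, and the caller decides what to do.
void divide_all(double* a, const double* b, int n) {
    for (int i = 0; i < n; i++) {
        if (b[i] == 0.0)
            throw std::runtime_error("zero divisor"); // the "trap"
        a[i] /= b[i];
    }
}
```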
-
- Other Clutter:
- Data Type Conversions:
- - in newer machines, double-precision calculations may actually be faster than single-precision
- - watch the mixed mode arithmetic, and implicit type conversions
-
- Handling Array Elements in Loops:
- - if a particular array element is to be referenced a number of times, it is probably more efficient to store the value in a temporary location and use it rather than having to de-reference an array each time. The same holds true for pointer de-referencing.
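- A small sketch: the element factors[j] is fetched into a local once instead of being re-indexed every iteration (names are illustrative, assuming the compiler cannot prove the load invariant on its own):

```cpp
// factors[j] is used on every iteration; holding it in the local f
// avoids de-referencing the array each time through the loop.
double scale_row(double* row, int n, const double* factors, int j) {
    double f = factors[j];        // fetch once, not n times
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        row[i] *= f;
        sum += row[i];
    }
    return sum;
}
```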
-
- Loop Optimizations:
- When we make changes to code in the name of performance, we must verify that the changes are actually helping. This can really only be done by testing both with and without the modifications. A simple change to a different compiler, or a move to a different architecture, can undo anything you've done. You could even inadvertently end up hindering the optimization capabilities of the compiler you are using.
- Operation Counting:
- - done to quantify the loop
- - count things like loads, stores, floating-point operations, library calls, compares, etc.
-
- Remember things like repeated array addressing (normally expensive) can be made quite efficient by the compiler: If, for example we are stepping through all the elements of an array (this amounts to single-stepping through memory - which is about as efficient as it gets). In fact, in a case like this we may end up making the compiler's job harder if we try and help by holding array values in intermediate variables. In the end, the only way to really tell if you've made things better or worse is to actually measure it by timing execution or creating a profile of your program while it's running.
-
- Basic Loop Unrolling:
- - legacy idea: most compilers do it automatically if it looks like there's a benefit
- - used to be done explicitly and you will still see it in old code
- - old code may now 'confuse' new compilers and make things worse rather than better
- - this is not something you will often have to do yourself, but you should recognize it when you see it (in old code or generated machine code)
- for (i=1; i<=n; i++)
- A[i] += B[i] * C;
- for (i=1; i<=n; i+=4)
- {
- A[i] += B[i] * C;
- A[i+1] += B[i+1] * C;
- A[i+2] += B[i+2] * C;
- A[i+3] += B[i+3] * C;
- }
- Unrolling the loop gives us the same operations in fewer iterations with less loop overhead.
- It also opens the way for doing all 4 operations in parallel (not possible with the original setup).
- Notice the second form of the loop is not exactly the same as the first. If n is not divisible by 4 we need to do some pre-processing to deal with the "leftovers".
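- A sketch of the complete unrolled loop with the leftovers pre-processed, so it works for any n (0-based indexing here; the function name is illustrative):

```cpp
// Four-way unrolled A[i] += B[i] * C with remainder handling.
void axpy_unrolled(double* a, const double* b, double c, int n) {
    int leftover = n % 4;
    for (int i = 0; i < leftover; i++)      // pre-process the "leftovers"
        a[i] += b[i] * c;
    for (int i = leftover; i < n; i += 4) { // main unrolled body
        a[i]   += b[i]   * c;
        a[i+1] += b[i+1] * c;
        a[i+2] += b[i+2] * c;
        a[i+3] += b[i+3] * c;
    }
}
```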
- Loops with low trip counts:
- - may be more efficient to repeat a series of calculations a small number of times than to put them in a loop.
- Fat Loops:
- :loops with many calculations per iteration are fat
- : unrolling a fat loop probably won't help much - may even have the opposite effect: the loop is physically bigger, so takes up more memory - system 'travels' farther from beginning to end of each loop (may cost in memory management)
- Loops containing Procedure Calls:
- :function calls are expensive
- : if the function you are calling is 'fat', unrolling the loop won't help
- Loops with Branches in them:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- {
- if (B[i][j] >= 0)
- A[i][j] += B[i][j] * C;
- }
-
- Each iteration is independent of every other - unrolling is not a problem. Could leave the outer loop alone and just unroll the inner one.
-
- Nested Loops:
- Outer Loop Unrolling:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- for(k=1; k<=n;k++)
- {
- A[i][j][k] += B[i][j][k] * C;
- }
-
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j+=2)
- for(k=1; k<=n;k++)
- {
- A[i][j][k] += B[i][j][k] * C;
- A[i][j+1][k] += B[i][j+1][k] * C;
- }
-
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j+=2)
- for(k=1; k<=n; k+=2)
- {
- A[i][j][k] += B[i][j][k] * C;
- A[i][j+1][k] += B[i][j+1][k] * C;
- A[i][j][k+1] += B[i][j][k+1] * C;
- A[i][j+1][k+1] += B[i][j+1][k+1] * C;
- }
-
- Loop Interchange:
- - rearranging a loop nest so all the right stuff is at the centre
-
- Memory Access Patterns:
- C++ arrays are stored in row-major order: the right-most subscript has the smallest step size (stride) when stepping through an array in memory. In FORTRAN, it's the other way around (column-major). Because of the way arrays are stored in memory, having the subscript with the smallest stride change the fastest usually has the effect of keeping the work in a more concentrated area of memory.
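- A tiny sketch making the strides concrete for a row-major C++ array (the 3x4 shape is just for illustration):

```cpp
#include <cstddef>

double M[3][4];  // row-major: M[i][j] and M[i][j+1] are adjacent

// Stepping j moves one element; stepping i moves a whole row (4 elements).
std::ptrdiff_t column_stride() { return &M[0][1] - &M[0][0]; } // step in j
std::ptrdiff_t row_stride()    { return &M[1][0] - &M[0][0]; } // step in i
```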
- Loop Interchange to Ease Memory Access Patterns:
-
- Matrix Multiplications:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- {
- sum = 0.0;
- for(k=1; k<=n;k++)
- sum += A[i][k] * B[k][j];
- C[i][j] = sum;
- }
-
- Problem here (for row-major C++ arrays) is that B[k][j] has a non-unit stride: k changes in the inner loop, so each access to B jumps a whole row. Each iteration consists of 2 loads, a multiplication, and an addition. We can juggle this a bit so all the inner-loop memory accesses are unit stride:
-
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- C[i][j] = 0.0;
-
- for(i=1; i<=n; i++)
- for(k=1; k<=n; k++)
- {
- scale = A[i][k];
- for(j=1; j<=n; j++)
- C[i][j] += scale * B[k][j];
- }
-
- When Interchange Won't Work:
- We want to be able to rearrange the loop so that it works on data in 'little neighbourhoods'.
- Blocking to Ease Memory Access Patterns:
- for(i=1; i<=n; i++)
- for(j=1; j<=n; j++)
- A[i][j] += B[j][i];
- Any way we look at this, one of the arrays is referenced with a non-unit stride.
- for(i=1; i<=n; i+=2)
- for(j=1; j<=n; j+=2)
- {
- A[i][j] += B[j][i];
- A[i+1][j] += B[j][i+1];
- A[i][j+1] += B[j+1][i];
- A[i+1][j+1] += B[j+1][i+1];
- }
-
- Programs that Require more Memory than You Have:
- Fall into 2 main categories:
- 1. Software-managed, out-of-core solutions
- 2. Virtual memory-managed out-of-core solutions
- Software-Managed, Out-of-Core Solutions:
- - usually involves telling the program how much memory you have and it figures out the rest.
- - trick is to group memory references together so they are localized