Parallel Processing
We have mentioned that many numerical methods can be sped up by the proper use of parallel processing or distributed systems.
- Vector/Matrix Operations:
- For inner products, a very elementary case, we assume we have two vectors, $x$ and $y$, of length $n$, and an equal number of parallel processors, $P_1, \ldots, P_n$, where each $P_i$ contains the components $x_i$ and $y_i$.
- Then all of the multiplications $x_i y_i$ can be done in parallel in one time unit.
- If the processors are connected suitably, we can actually do the addition part in $\lceil \log_2 n \rceil$ time units.
- This assumes a high degree of connectivity between the processors. The different designs are referred to as the topologies of the systems.
- Suppose our processors were only connected as a linear array in which the communication send/receive is just between two adjacent processors.
- Then our addition of the $n$ elements would take about $n/2$ time steps, because we could do an addition at each end in parallel and proceed toward the middle (both schedules are simulated in the sketch below).
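To make the two timings concrete, here is a small Python sketch (ours, not part of the original notes) that simulates both reduction schedules and counts the parallel time steps; the function names are illustrative.

```python
def tree_sum_steps(values):
    """Tree reduction: each parallel step combines adjacent pairs,
    so n values are summed in ceil(log2(n)) steps."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

def linear_array_sum_steps(values):
    """Linear-array reduction: the two ends add toward the middle in
    parallel, so n values are summed in about n/2 steps."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        left = vals[0] + vals[1]            # addition at the left end
        if len(vals) > 3:
            right = vals[-2] + vals[-1]     # addition at the right end
            vals = [left] + vals[2:-2] + [right]
        else:
            vals = [left] + vals[2:]
        steps += 1
    return vals[0], steps

x = list(range(1, 17))                      # n = 16
print(tree_sum_steps(x))                    # (136, 4): log2(16) = 4 steps
print(linear_array_sum_steps(x))            # (136, 8): 16/2 = 8 steps
```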
- The $d$-dimensional hypercube is a graph with $2^d$ vertices in which each vertex has $d$ edges (is connected to $d$ other vertices). This graph can be defined recursively because there is a simple algorithm to determine the order in which the vertices are connected. See Fig. 1.
Figure 1: Connectivity between the processors (topologies).
- Two vertices are adjacent if and only if their binary indices differ in exactly one bit.
- We can get the $(d+1)$-dimensional hypercube by making two copies of the $d$-dimensional hypercube, prepending a 0 as the leftmost bit of each index in the first copy and a 1 in the second, and then connecting each vertex in the first copy to its twin in the second (both constructions are sketched in the code below).
- There are many other designs for connecting the processors, with names like star, ring, torus, and mesh.
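Both the bit-flip adjacency rule and the recursive two-copies construction are easy to express in code. The following Python sketch is for illustration only; the function names are our own.

```python
def hypercube_edges(d):
    """Label the 2**d vertices 0 .. 2**d - 1; two vertices are adjacent
    iff their binary labels differ in exactly one bit."""
    n = 1 << d
    return [(u, u ^ (1 << b)) for u in range(n) for b in range(d)
            if u < u ^ (1 << b)]            # keep each edge once

def hypercube_edges_recursive(d):
    """Recursive construction: take two copies of the (d-1)-cube, prepend
    a 0 bit to the labels in the first copy and a 1 bit in the second,
    then join each vertex to its twin in the other copy."""
    if d == 0:
        return []
    prev = hypercube_edges_recursive(d - 1)
    half = 1 << (d - 1)                     # value of the new leftmost bit
    copy0 = prev                            # labels 0xxx...
    copy1 = [(u + half, v + half) for u, v in prev]   # labels 1xxx...
    cross = [(u, u + half) for u in range(half)]      # twin-to-twin edges
    return copy0 + copy1 + cross

assert sorted(hypercube_edges(3)) == sorted(hypercube_edges_recursive(3))
print(hypercube_edges(2))   # [(0, 1), (0, 2), (1, 3), (2, 3)]
```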
- For the matrix/vector product, $Ax$, we assume that processor $P_i$ contains the $i$th row of $A$ as well as the vector $x$.
- Because each processor performs the dot product of one row of $A$ with the vector $x$ ($n$ multiplications and $n-1$ additions), the whole operation will only take $O(n)$ units of time (see the sketch below).
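As a rough illustration (our own sketch, with a process pool standing in for the processors $P_i$), the row-wise independence looks like this in Python:

```python
from multiprocessing import Pool

def row_dot(args):
    """The work of one logical processor P_i: the dot product of row i
    of A with the full vector x (n multiplications, n - 1 additions)."""
    row, x = args
    return sum(a * xj for a, xj in zip(row, x))

def parallel_matvec(A, x):
    """Each row's dot product is independent of the others, so with n
    processors the product Ax takes O(n) time instead of O(n^2)."""
    with Pool() as pool:
        return pool.map(row_dot, [(row, x) for row in A])

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    x = [5, 6]
    print(parallel_matvec(A, x))   # [17, 39]
```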
- For the matrix/matrix product, $AB$, for two $n \times n$ matrices, we suppose that we have $n^2$ processors. Before we start our computations, each processor $P_{ij}$ will have received the values for row $i$ of $A$ and column $j$ of $B$. Based on our previous discussion, we can expect the time to be $O(n)$ (a sketch follows).
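The same partitioning idea, written out sequentially for clarity (each $(i, j)$ task below stands in for an independent processor $P_{ij}$):

```python
def parallel_matmul(A, B):
    """One logical processor P_ij per output entry: P_ij holds row i of A
    and column j of B and computes their dot product. With n^2 processors
    all entries run concurrently, so the elapsed time is that of a single
    length-n dot product, i.e. O(n)."""
    n = len(A)
    cols = list(zip(*B))                    # column j of B as a tuple
    return [[sum(a * b for a, b in zip(A[i], cols[j])) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))   # [[19, 22], [43, 50]]
```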
- Gaussian Elimination:
- Problems in Using Parallel Processors:
- The algorithm described here does not pivot (a non-pivoting sketch follows this list). Thus, our solution may not be as numerically stable as one obtained via a sequential algorithm with partial pivoting.
- The coefficient matrix $A$ is assumed to be nonsingular. It is easy to check for singularity at each stage of the row reduction, but such error-handling will more than double the running time of the algorithm.
- We have ignored the communication and overhead costs involved in parallelization. Because of these costs, it is probably more efficient to solve small systems of equations using a sequential algorithm.
- Other, perhaps faster, parallel algorithms exist for solving systems of linear equations.
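The elimination algorithm itself is not reproduced in this excerpt; the sketch below (ours, using NumPy) shows a minimal non-pivoting version, with comments marking where the row-level parallelism would apply.

```python
import numpy as np

def gauss_no_pivot(A, b):
    """Row reduction without pivoting: it fails if any pivot a_kk is zero
    and can be unstable when one is small (the caveats above)."""
    A = A.astype(float)
    b = b.astype(float)
    n = len(b)
    for k in range(n - 1):
        # The updates of rows k+1 .. n-1 are mutually independent, so
        # with n processors each stage takes O(n) time and the whole
        # elimination O(n^2) rather than the sequential O(n^3).
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]           # no pivot search
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back substitution (harder to parallelize: x[i] needs later x's).
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([9.0, 13.0])
print(gauss_no_pivot(A, b))    # [1.4 3.4]
```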
- Iterative Solutions - The Jacobi Method
- Recall that at each iteration of the algorithm a new solution vector $x^{(k+1)}$ is computed using only the elements of the solution vector from the previous iteration, $x^{(k)}$, so all $n$ components can be updated in parallel (see the sketch below).
- Because Gauss-Seidel iteration requires that the new iterate for each variable be used as soon as it has been obtained, this method cannot be sped up by parallel processing in the same way.
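A minimal Jacobi sketch (ours, in NumPy) makes the contrast visible: every component of the new iterate is computed from the old iterate only, so the $n$ updates could be distributed across $n$ processors, whereas Gauss-Seidel would overwrite $x$ in place and force the updates into sequence.

```python
import numpy as np

def jacobi(A, b, iters=50):
    """Jacobi iteration: x_new depends only on the previous iterate x,
    so all n component updates can run in parallel."""
    D = np.diag(A)                  # diagonal entries a_ii
    R = A - np.diagflat(D)          # off-diagonal part of A
    x = np.zeros(len(b))
    for _ in range(iters):
        x = (b - R @ x) / D         # every component uses old x only
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])  # diagonally dominant: converges
b = np.array([9.0, 13.0])
print(jacobi(A, b))                     # approx [1.4, 3.4]
```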