Description

1 Overview
As discussed in class, OpenMP can be used to parallelize existing applications. In this MP, we will apply this property of OpenMP to parallelize a program from Numerical Recipes, a collection of algorithms spanning a very broad range of domains. We will compare the performance of the program with and without OpenMP and test its parallel scaling. In addition, four-credit-hour students will compare the parallel performance of two implementations of sparse matrix-vector multiplication from the textbook.

2 Getting Started
2.1 Skeleton Programs
Pull the skeleton for MP5 from the release repository: https://github-dev.cs.illinois.edu/cs420fa19/_release

3 Part A, Parallelizing a “Numerical Recipe”
3.1 Your Task
A Padé approximant is the “best” approximation of a function by a rational function, developed by French mathematician Henri Padé in 1890. We have provided you with a program that finds Padé approximants (pade), sourced from the book Numerical Recipes in C, which also uses the “recipes” LU decomposition (ludcmp), LU back-substitution (lubksb), and iterative improvement (mprove). Your task is to parallelize these functions using OpenMP pragmas, aiming for a 2x speedup with 4 threads at data size 2048. Hint: you can use VTune’s hotspots analysis to identify loops that can benefit from parallelization.
Once you have successfully parallelized the program and verified its correctness, you may use the small Python-based benchmarking program we have included to test its parallel performance. See MP3 for reference on how to run it and what to expect from its output.
3.1.1 OpenMP Notes
Adding the “-qopenmp” flag to your CFLAGS during compilation will compile your program with OpenMP. There are two targets in your Makefile: one that compiles the program with this flag and one without it. To change the number of threads used by OpenMP, set the OMP_NUM_THREADS environment variable.
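A build-and-run session might look like the following. The source file names and output name here are illustrative assumptions; use the targets already defined in the provided Makefile. Note that -qopenmp is the Intel compiler's flag (GCC uses -fopenmp).

```shell
# Build with OpenMP enabled (Intel compiler flag; gcc would use -fopenmp).
# File and target names are illustrative; prefer the provided Makefile targets.
icc -qopenmp -O2 -o pade_omp pade.c

# Run with 4 OpenMP threads on data size 2048.
OMP_NUM_THREADS=4 ./pade_omp 2048
```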
3.1.2 Skeleton Notes
The program optionally accepts two arguments: the size (an integer, default 2048) and --verbose, which prints 21 points of the Padé approximant and the function it is approximating. The program produces “correct” results up to around size 256, after which the output starts to include not-a-number (NaN) values (pade is not meant to scale; we are using it in an unusual way in this regard). Therefore, test your program’s correctness using smaller data sizes and do not worry about its results for the larger ones.
How does parallelizing this program with OpenMP compare to using pthreads?
Were there any loops in the program that you could not parallelize (e.g. due to size, dependencies, etc.)? If so, pick one and explain why not.
Were you able to achieve the desired speedup? If not, why not? If so, how could you achieve a higher speedup?
Briefly comment on the parallel scaling of your implementation for data size 2048, including a copy of the output of the benchmarking program in your report.

4 Part B, Sparse Matrix-Vector Multiplication (4 Credits Only)
4.1 Your Task
The textbook provides two versions of sparse matrix-vector multiplication (sMVM), known as CRS and JDS, in Section 3.6. It discusses how to parallelize them and their expected parallel performance in Section 7.3. Your task is to implement each version in C and parallelize them using OpenMP, comparing their performance across a variety of thread counts and data sizes to test the assumptions made in the textbook.
We have provided a skeleton program for this assignment, which loads matrices in both CRS and JDS representations and, additionally, measures the execution time and checks the correctness of your sMVM implementations (against a dense, reference version of MVM), failing when your implementations are incorrect. You can use the --verbose flag to print the results from the various implementations of MVM for debugging purposes. We have also provided an updated version of randMatrix that generates large, sparse matrices and vectors for this assignment.
For 4-credit students: did the assumptions made in the textbook about the parallel performance of each version withstand testing? Comment on why or why not.

5 Submission Guidelines
5.1 Expectations For This Assignment
The graded portions of this assignment are your parallelized program from Part A (and Part B for 4-credit students) and your answers to the short-answer questions.
5.1.1 Points Breakdown
4 x 10pts:  Each Short-Answer Question
40pts:  Correctness of Parallelized Numerical Recipe Program
20pts:  Demonstrated Speedup for Parallelized Numerical Recipe Program
50pts:  Parallelized MatVec Program and Answer to Question (4 cr. only)
5.2 Pushing Your Submission
Follow the git commands to commit and push your changes to the remote repository. Ensure that your files are visible on the GitHub web interface; your work will not be graded otherwise. Only the most recent commit before the deadline will be graded.
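A typical commit-and-push sequence is sketched below; the commit message is up to you, and the branch name is an assumption (use whatever branch your course repository expects).

```shell
git add -A                        # stage all changes, including new files
git commit -m "MP5 submission"    # message is illustrative
git push origin master            # branch name may differ in your repo
```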