I am trying to multiply two matrices.
#include <stdio.h>
#include <omp.h>
#include <time.h>
#include <stdlib.h>

#define N 2048
#define FactorIntToDouble 1.1;
#define THREAD_NUM 4

double firstMatrix [N] [N] = {0.0};
double secondMatrix [N] [N] = {0.0};
double matrixMultiResult [N] [N] = {0.0};

// Sync
void matrixMulti() {
    for(int row = 0 ; row < N ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }
            matrixMultiResult [row] [col] = resultValue;
        }
    }
}

void matrixInit() {
    for(int row = 0 ; row < N ; row++ ) {
        for(int col = 0 ; col < N ; col++){
            srand(row+col);
            firstMatrix [row] [col] = ( rand() % 10 ) * FactorIntToDouble;
            secondMatrix [row] [col] = ( rand() % 10 ) * FactorIntToDouble;
        }
    }
}

// Parallel
void matrixMulti2(int start, int end) {
    printf("Op: %d - %d\n", start, end);
    for(int row = start ; row < end ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }
            matrixMultiResult [row] [col] = resultValue;
        }
    }
}

void process1(){
    clock_t t1 = clock();
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int thread_multi = N / 4;
        int start = (thread) * thread_multi;
        int end = 0;
        if(thread == (THREAD_NUM - 1)){
            end = (start + thread_multi);
        }else{
            end = (start + thread_multi) - 1;
        }
        matrixMulti2(start, end);
    }
    clock_t t2 = clock();
    printf("time 2: %ld\n", t2-t1);
}

int main(){
    matrixInit();
    clock_t t1 = clock();
    matrixMulti();
    clock_t t2 = clock();
    printf("time: %ld", t2-t1);
    process1();
    return 0;
}
I have both a parallel and a sync version, but the parallel version takes longer than the sync version.
Currently the sync version takes around 90 seconds and the parallel version over 100, which makes no sense to me.
My logic was to split the matrix into 4 parts along the outer row loop, one chunk of rows per thread, which I believe is reasonable; a sketch of what I mean is below.
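For clarity, a minimal sketch of the row partition I have in mind (end index exclusive, assuming exactly THREAD_NUM threads and that N is divisible by THREAD_NUM):

    int thread = omp_get_thread_num();       // 0 .. THREAD_NUM-1
    int rows_per_thread = N / THREAD_NUM;    // 2048 / 4 = 512 rows per thread
    int start = thread * rows_per_thread;    // first row of this thread's chunk
    int end   = start + rows_per_thread;     // one past the last row of the chunk
    matrixMulti2(start, end);                // computes rows [start, end)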
After I finish this part, I would like to figure out how to speed up the parallel version even more, possibly using Strassen's matrix multiplication. I just don't know where to start or how to get to this point.
I’ve already spent around 5 hours trying to figure this out.
Answer
Here it is:
// Parallel version of matrixMulti()
void matrixMulti() {
    #pragma omp parallel for collapse(2)
    for(int row = 0 ; row < N ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }
            matrixMultiResult [row] [col] = resultValue;
        }
    }
}
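By default OpenMP typically uses one thread per logical core. If you want to keep your four-thread setup, a minimal sketch of one way to request it (you can also set the OMP_NUM_THREADS environment variable before running):

    #include <omp.h>

    // Call at the start of main(), before the first parallel region.
    // Asks the runtime for THREAD_NUM threads; it may still grant fewer.
    omp_set_num_threads(THREAD_NUM);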
Update: Here is what I got on an 8-core system using gcc 10.3 with the -O3 -fopenmp flags (I show the program's output and the result of the Linux time command):
main() was changed to measure the time with omp_get_wtime(), because on Linux clock() measures processor time:
double t1 = omp_get_wtime();
matrixMulti();
double t2 = omp_get_wtime();
printf("time: %f", t2-t1);
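For reference, the build and timing commands look roughly like this (the source file name prog.c is just a placeholder):

    gcc -O3 -fopenmp prog.c -o prog
    time ./prog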
Serial program:
time: 25.895234

real    0m33.296s
user    0m33.139s
sys     0m0.152s
using: #pragma omp parallel for
time: 3.573521

real    0m11.120s
user    0m32.205s
sys     0m0.136s
using: #pragma omp parallel for collapse(2)
time: 5.466674

real    0m12.786s
user    0m49.978s
sys     0m0.248s
The results suggest that initialization of the matrices takes ca. 8 s, so it may also be worth parallelizing. Without collapse(2) the program runs faster, so do not use the collapse(2) clause.
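A minimal sketch of one way to parallelize the initialization, assuming the exact random sequence does not have to be reproduced; it reuses the globals and macros from your code, and uses rand_r() (POSIX) because rand()/srand() are not thread-safe:

    // Parallel init sketch: per-row seeds with rand_r(), which is thread-safe.
    void matrixInit2() {
        #pragma omp parallel for
        for(int row = 0 ; row < N ; row++) {
            // Hypothetical per-row seed; does not reproduce the original srand(row+col) values.
            unsigned int seed = row;
            for(int col = 0 ; col < N ; col++) {
                firstMatrix [row] [col] = ( rand_r(&seed) % 10 ) * FactorIntToDouble;
                secondMatrix [row] [col] = ( rand_r(&seed) % 10 ) * FactorIntToDouble;
            }
        }
    }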
Note that on your system you may get a different speed improvement, or even a decrease, depending on your hardware. The speed of matrix multiplication strongly depends on the speed of memory reads and writes. Shared-memory multicore systems (i.e. most PCs and laptops) may not show any speed increase upon parallelization of this program, but distributed-memory multicore systems (i.e. high-end servers) definitely show a performance increase. For more details please read e.g. this.
Update2: On a Ryzen 7 5800X I got 41.6 s vs 1.68 s, which is a bigger speedup than the number of cores. This is because more cache memory is available when all the cores are used.