## Fast and Scalable Architectures and Algorithms for the Computation of the Forward and Inverse Discrete Periodic Radon Transform with Applications to 2D Convolutions and Cross-Correlations

Please use this identifier to cite or link to this item: http://hdl.handle.net/1928/32282

Title

Fast and Scalable Architectures and Algorithms for the Computation of the Forward and Inverse Discrete Periodic Radon Transform with Applications to 2D Convolutions and Cross-Correlations

Author(s)

Carranza, Cesar

Advisor(s)

Pattichis, Marios

Committee Member(s)

Calhoun, Vince

Jordan, Ramiro

Llamocca, Daniel

Department

University of New Mexico. Dept. of Electrical and Computer Engineering

Degree Level

Doctoral

Abstract

The Discrete Radon Transform (DRT) is an essential component of a wide range of image processing applications, e.g. image denoising, image restoration, texture analysis, line detection, encryption, compressive sensing, and the reconstruction of objects from projections in computed tomography and magnetic resonance imaging. A popular method to obtain the DRT, or its inverse, uses the Fast Fourier Transform, with its inherent approximation/rounding errors and the increased hardware complexity due to the need for floating-point arithmetic. An alternative implementation of the DRT uses the Discrete Periodic Radon Transform (DPRT). The DPRT also exhibits discrete versions of properties of the continuous-space Radon Transform, including the Fourier Slice Theorem and the convolution property. Unfortunately, the use of the DPRT has been limited by the need to compute O(N^3) additions and to perform a large number of memory accesses.
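For reference, the direct computation behind the O(N^3) figure can be sketched in a few lines. The following is a minimal Python sketch, assuming the commonly used prime-N definition R(m, d) = sum over i of f(i, (d + m*i) mod N) for m = 0..N-1, plus one extra projection holding the row sums. The function names `dprt` and `idprt` and the index convention are illustrative assumptions and may differ from the notation used in the dissertation.

```python
def dprt(f):
    """Forward DPRT of an N x N integer image (N prime), by direct summation.

    Returns N + 1 projections of N samples each, so about N^3 additions.
    """
    n = len(f)
    r = [[0] * n for _ in range(n + 1)]
    for m in range(n):              # projection (direction) index
        for d in range(n):          # offset within the projection
            r[m][d] = sum(f[i][(d + m * i) % n] for i in range(n))
    for d in range(n):              # extra projection: sum of row d
        r[n][d] = sum(f[d])
    return r


def idprt(r):
    """Inverse DPRT: recovers the image exactly using integer arithmetic."""
    n = len(r) - 1
    total = sum(r[0])               # every projection sums to the image total
    f = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = sum(r[m][(j - m * i) % n] for m in range(n))
            # acc = N*f(i,j) + total - rowsum(i), so the division is exact
            f[i][j] = (acc - total + r[n][i]) // n
    return f
```

Because only integer additions and one exact division by N are involved, the round trip `idprt(dprt(f))` reproduces `f` without rounding error, which is the property that motivates fixed-point hardware.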
This PhD dissertation introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on: (i) a parallel array of fixed-point adder trees, (ii) circular shift registers that remove the need to access external memory components when selecting the input data for the adder trees, and (iii) an image block-based approach to DPRT computation that can fit the proposed architecture to the available resources. As a result, for an NxN image (N prime), the proposed approach can compute up to N^2 additions per clock cycle. Compared to previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For the fastest case, I introduce optimized architectures that can compute the DPRT and its inverse in just 2N + ceil(log2 N) + 1 and 2N + 3ceil(log2 N) + B + 2 clock cycles, respectively, where B is the number of bits used to represent each input pixel. In comparison, the prior state-of-the-art method required N^2 + N + 1 clock cycles to compute the forward DPRT. For systems with limited resources, the resource usage can be reduced to O(N), with a running time of ceil(N/2)(N + 9) + N + 2 clock cycles for the forward DPRT and ceil(N/2)(N + 2) + 3ceil(log2 N) + B + 4 clock cycles for the inverse.
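To give a feel for the cycle counts quoted above, the formulas can be evaluated for a concrete image size. This is a small illustrative sketch: the function names are hypothetical, and the ceiling on every log2 term is assumed so that each count is an integer.

```python
def ceil_log2(n):
    """ceil(log2(n)) for n >= 1, using integer arithmetic only."""
    return (n - 1).bit_length()


def fast_forward_cycles(n):
    """Fastest-case forward DPRT: 2N + ceil(log2 N) + 1."""
    return 2 * n + ceil_log2(n) + 1


def fast_inverse_cycles(n, b):
    """Fastest-case inverse DPRT: 2N + 3*ceil(log2 N) + B + 2."""
    return 2 * n + 3 * ceil_log2(n) + b + 2


def prior_forward_cycles(n):
    """Prior method quoted in the abstract: N^2 + N + 1."""
    return n * n + n + 1


def limited_forward_cycles(n):
    """O(N)-resource forward DPRT: ceil(N/2)(N + 9) + N + 2."""
    return ((n + 1) // 2) * (n + 9) + n + 2


def limited_inverse_cycles(n, b):
    """O(N)-resource inverse DPRT: ceil(N/2)(N + 2) + 3*ceil(log2 N) + B + 4."""
    return ((n + 1) // 2) * (n + 2) + 3 * ceil_log2(n) + b + 4


if __name__ == "__main__":
    n, b = 251, 8   # example values chosen for illustration: prime N, 8-bit pixels
    print(fast_forward_cycles(n), prior_forward_cycles(n))
```

For N = 251 and B = 8, these expressions give 511 cycles for the fastest forward case against 63253 for the prior method, a difference of roughly two orders of magnitude.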
The results also have important applications in the computation of fast convolutions and cross-correlations for large and non-separable kernels. For this purpose, I introduce fast algorithms and scalable architectures to compute 2-D linear convolutions/cross-correlations, using the convolution property of the DPRT and fixed-point arithmetic to reduce the 2-D problem to a set of 1-D problems. An alternative system based on the LU decomposition is also proposed for non-separable kernels with low rank. As a result, for implementations with enough resources, for an image and a convolution kernel of size PxP, linear convolutions/cross-correlations can be computed in just 6N + 4 log2 N + 17 clock cycles for N = 2P-1.
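The convolution property mentioned above states that each projection of a 2-D circular convolution is the 1-D circular convolution of the corresponding projections of the two inputs. The sketch below illustrates this in software under stated assumptions: it uses the common prime-N DPRT definition, zero-pads both inputs to a prime size N of at least 2P-1 so that the circular result equals the linear one, and uses hypothetical helper names. It is not the dissertation's architecture.

```python
def dprt(f):
    """Forward DPRT (N prime): N directional projections plus row sums."""
    n = len(f)
    r = [[sum(f[i][(d + m * i) % n] for i in range(n)) for d in range(n)]
         for m in range(n)]
    r.append([sum(row) for row in f])
    return r


def idprt(r):
    """Inverse DPRT using exact integer arithmetic."""
    n = len(r) - 1
    total = sum(r[0])
    return [[(sum(r[m][(j - m * i) % n] for m in range(n)) - total + r[n][i]) // n
             for j in range(n)] for i in range(n)]


def circ_conv_1d(a, b):
    """1-D circular convolution of two equal-length integer sequences."""
    n = len(a)
    return [sum(a[k] * b[(d - k) % n] for k in range(n)) for d in range(n)]


def zero_pad(x, n):
    """Place x in the top-left corner of an n x n array of zeros."""
    return [[x[i][j] if i < len(x) and j < len(x[0]) else 0
             for j in range(n)] for i in range(n)]


def conv2d_via_dprt(f, g, n):
    """2-D linear convolution via N + 1 independent 1-D circular convolutions.

    n must be prime and at least len(f) + len(g) - 1.
    """
    rf, rg = dprt(zero_pad(f, n)), dprt(zero_pad(g, n))
    rh = [circ_conv_1d(rf[m], rg[m]) for m in range(n + 1)]
    return idprt(rh)


def conv2d_direct(f, g):
    """Reference 2-D linear convolution by direct summation."""
    p, q = len(f), len(g)
    out = [[0] * (p + q - 1) for _ in range(p + q - 1)]
    for a in range(p):
        for b in range(p):
            for c in range(q):
                for e in range(q):
                    out[a + c][b + e] += f[a][b] * g[c][e]
    return out
```

For example, with P = 4 the padded size N = 2P-1 = 7 is already prime, so `conv2d_via_dprt(f, g, 7)` should match `conv2d_direct(f, g)` for any 4x4 integer inputs. The N + 1 one-dimensional convolutions are independent of each other, which is what makes the approach attractive for parallel hardware.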
Finally, I also propose parallel algorithms to compute the forward and inverse DPRT using Graphics Processing Units (GPUs) and CPUs with multiple cores. The proposed algorithms are implemented on an Nvidia Maxwell GM204 GPU with 2048 cores@1367MHz, 348KB L1 cache (24KB per multiprocessor), 2048KB L2 cache (512KB per memory controller), and 4GB device memory. They are compared against a serial implementation on an Intel Xeon E5-2630 CPU with 8 physical cores (16 logical processors via hyper-threading)@3.2GHz, 512KB L1 cache (32KB instruction cache and 32KB data cache per core), 2MB L2 cache (256KB per core), 20MB L3 cache (shared among all cores), and 32GB of system memory. For the CPU, there is a tenfold speedup using 16 logical cores versus a single-core serial implementation. For the GPU, there is a 715-fold speedup compared to the serial implementation. For real-time applications, for a 1021x1021 image, the forward DPRT takes 11.5 ms and the inverse takes 11.4 ms.

Date

May 2016

Subject(s)

Radon Transform

Scalable Architecture

Parallel Architecture

2D Convolution

2D Crosscorrelation
