Theory of Resource Allocation for Robust Distributed Computing

Please use this identifier to cite or link to this item: http://hdl.handle.net/1928/12103

Title: Theory of Resource Allocation for Robust Distributed Computing
Author: Pezoa, Jorge E.
Advisor(s): Hayat, Majeed M.
Committee Member(s): Mostofi, Yasamin
Bridges, Patrick
Ghani, Nasir
Santhanam, Balu
Department: University of New Mexico. Dept. of Electrical and Computer Engineering
Subject(s): Resource allocation
Distributed computing
Load balancing
Computer networks
Correlated failures
Stochastic regeneration
Degree Level: Doctoral
Abstract: Lately, distributed computing (DC) has emerged in several application scenarios such as grid computing, high-performance and reconfigurable computing, wireless sensor networks, battle management systems, peer-to-peer networks, and donation grids. When DC is performed in these scenarios, the distributed computing system (DCS) supporting the applications not only exhibits heterogeneous computing resources and a significant communication latency, but also becomes highly dynamic because the communication network and the computing servers are affected by a wide class of anomalies that change the topology of the system in a random fashion. These anomalies exhibit spatial and/or temporal correlation when they result, for instance, from wide-area power or network outages. These correlated failures may not only inflict a large amount of damage on the system, but may also induce further failures in other servers as a result of the lack of reliable communication between the components of the DCS. In order to provide a robust DC environment in the presence of component failures, it is key to develop a general framework for accurately modeling the complex dynamics of a DCS. In this dissertation, a novel approach has been undertaken for modeling a general class of DCSs and for analytically characterizing the performance and reliability of parallel applications executed on such systems. A general probabilistic model has been constructed by assuming that the random times governing the dynamics of the DCS follow arbitrary probability distributions with heterogeneous parameters. Auxiliary age variables have been introduced in the modeling of a DCS, and a hybrid continuous and discrete state-space model of the system has been constructed.
This hybrid model has enabled the development of an age-dependent stochastic regeneration theory, which, in turn, has been employed to analytically characterize the average execution time, the quality-of-service, and the reliability in serving an application. These are three metrics of performance and reliability of practical interest in DC. Analytical approximations as well as mathematical lower and upper bounds for these metrics have also been derived in an attempt to reduce the amount of computational resources demanded by the exact characterizations. In order to systematically assess the reliability of DCSs in the presence of correlated component failures, a novel probabilistic model for spatially correlated failures has been developed. The model, based on graph theory and Markov random fields, captures both geographical and logical correlations induced by the arbitrary topology of the communication network of a DCS. The modeling framework, in conjunction with a general class of dynamic task reallocation (DTR) control policies, has been used to optimize the performance and reliability of applications in the presence of independent as well as spatially correlated anomalies. Theoretical predictions, Monte Carlo simulations, and experimental results have shown that optimizing these metrics can significantly impact the performance of a DCS. Moreover, the general setting developed here has provided insights into: (i) the effect of different stochastic models on the accuracy of the performance and reliability metrics, (ii) the dependence of the DTR policies on system parameters such as failure rates and task-processing rates, (iii) the severe impact of correlated failures on the reliability of DCSs, (iv) the dependence of the DTR policies on the degree of correlation in the failures, and (v) the fundamental trade-off between minimizing the execution time of an application and maximizing its reliability.
Graduation Date: December 2010
URI: http://hdl.handle.net/1928/12103

Files in this item

Files Size Format View
DissertationManuscriptJEP.pdf 4.730Mb PDF View/Open
