Detecting element failures is an important issue in distributed systems: a fault-tolerant system must detect a failure and recover from it promptly. Traditional approaches to fault tolerance are rarely free from errors during the failure detection phase, so a good failure detector is an essential component for minimizing them. In this paper we present a failure detector that monitors both asynchronous and synchronous elements of a distributed system by exchanging messages with them. To assess the health of monitored elements, the detector relies on a simple query/ACK mechanism, which requires a reliable timeout estimate in order to set the monitoring interval properly. To this end, the detector uses the history of past estimates to compute new values for both the timeout and the monitoring interval. The proposed model introduces a new label for tagging monitored elements, in addition to those used in traditional failure detectors. To evaluate this work, we compared it with two other algorithms by computing performance metrics such as specificity and sensitivity, by counting the required control packets, and by measuring detection time.
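The abstract does not give the estimation formula, so the following is only a minimal sketch of the general idea, not the method used in FDAE. It assumes the timeout is derived from a sliding window of past query/ACK round-trip times (mean plus a multiple of the standard deviation) and that a "positive" outcome means "element reported as failed" when computing sensitivity and specificity. The names `TimeoutEstimator`, `record_ack`, `window`, `margin`, and `sensitivity_specificity` are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class TimeoutEstimator:
    """Hypothetical helper: derives a timeout from past query/ACK round-trip times.

    Assumption: timeout = mean + margin * standard deviation over a sliding
    window. This is a common heuristic, not the formula from the paper.
    """

    def __init__(self, window=10, margin=2.0, initial_timeout=1.0):
        self.samples = deque(maxlen=window)  # history of round-trip times (seconds)
        self.margin = margin                 # safety factor applied to the spread
        self.timeout = initial_timeout       # used until the first ACK arrives

    def record_ack(self, round_trip_time):
        """Store a new round-trip time and recompute the timeout."""
        self.samples.append(round_trip_time)
        self.timeout = mean(self.samples) + self.margin * pstdev(self.samples)
        return self.timeout


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Assumption: a positive is "element reported as failed", so
    tp = real failures detected, fn = real failures missed,
    tn = healthy elements left alone, fp = healthy elements wrongly suspected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one real failure and one healthy case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    estimator = TimeoutEstimator(window=3)
    for rtt in (1.0, 3.0, 2.0):
        print(f"rtt={rtt:.1f}s -> timeout={estimator.record_ack(rtt):.3f}s")
    print(sensitivity_specificity(tp=8, fn=2, tn=45, fp=5))
```

A larger `margin` reduces the chance of wrongly suspecting a slow but healthy element (higher specificity) at the cost of a longer detection time, which is the trade-off the evaluation metrics in the abstract are meant to capture.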

Farruggia, A., Ortolani, M., Lo Re, G. (2010). FDAE: A f̲ailure d̲etector for a̲synchronous e̲vents. In F. Ko, Y. Na (Eds.), Proceedings of the Sixth International Conference on Networked Computing and Advanced Information Management (NCM), 2010 (pp. 197-202).

FDAE: A f̲ailure d̲etector for a̲synchronous e̲vents

FARRUGGIA, Alfonso; ORTOLANI, Marco; LO RE, Giuseppe
2010-01-01

Year: 2010
ISBN: 978-1-4244-7671-8

Use this identifier to cite or link to this document: https://hdl.handle.net/10447/53363