
Optimizing Hadoop for MapReduce

The book “Optimizing Hadoop for MapReduce”, which I have been working on for the past several months, has finally been published and is now available in its final version.

 

You can buy the book directly from the Packt Publishing website.


 


Hadoop and MapReduce: HDFS

In the previous article I briefly introduced the concept of HDFS (Hadoop Distributed Filesystem). In this article we look at it in a little more detail.

In this second article of the series, we return to the namenode and datanode concepts introduced earlier. We then introduce an important component, the secondary namenode.


Hadoop and MapReduce: Introduction

This article is the first in a series presenting how to implement this system and what it can do with large volumes of data (Big Data). In this introduction I explain the principles of Hadoop and what it is useful for. The following articles focus on the practical side of the framework by building an example whose goal is to process a large volume of data. They will also cover its configuration and how to set it up.


Parallel Apriori algorithm for frequent pattern mining

Abstract

Apriori is a frequent pattern mining algorithm for discovering association rules. Along with the FP-Growth algorithm, it is one of the best-known algorithms for discovering frequent patterns. However, with current advances in the storage of very large databases and the tremendous growth in the number of transactions, sequential Apriori becomes a bottleneck because of its long running time. In this paper, our goal is to develop a new parallel version of Apriori that reduces the overall running time of the algorithm. Although Apriori is not known to be highly parallelizable, several attempts have been made to parallelize it, either by using parallel I/O hardware to optimize database scans or by distributing the workload across multiple processors. However, many of these parallel approaches suffer from noticeable latency when synchronizing the results collected from each processor after a parallel iteration terminates. Our approach tries to maximize the workload executed in parallel and to minimize delays at synchronization points by adopting a parallel pre-computing scheme during generation of the superset. With our new approach, the running time of the algorithm is reduced by an order of magnitude compared to other parallel implementations of the same algorithm.
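For readers unfamiliar with the baseline, here is a minimal sketch of textbook sequential Apriori in Python. It is not the parallel scheme described in the abstract; the function name `apriori`, its parameters, and the sample transactions are illustrative assumptions. It shows the two phases that make each level expensive: candidate generation (join and prune) and a full scan of the transactions to count support.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Hypothetical helper for illustration: min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # Support counting: one full scan of the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    sample = [
        ["bread", "milk"],
        ["bread", "diapers", "beer"],
        ["milk", "diapers", "beer"],
        ["bread", "milk", "diapers"],
        ["bread", "milk", "beer"],
    ]
    for itemset, support in apriori(sample, min_support=3).items():
        print(sorted(itemset), support)
```

The support-counting loop is the natural place to split work across processors, since each transaction can be scanned independently. The per-level merge of counts is the kind of synchronization point the abstract refers to.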
