Scalable Data Mining

Scalable Data Mining is a VRE (Virtual Research Environment) designed to apply Data Mining techniques to biological data. The algorithms are executed in a distributed fashion on the e-Infrastructure nodes or on local multi-core machines. Scalability here refers both to distributed data processing and to services that are provided dynamically to users: the system scales with the number of users and with the size of the data to process. Statistical data processing can be applied to Niche Modelling or Ecological Modelling experiments. Other applications can use general-purpose techniques such as Bayesian models. Time series of observations can also be managed, in order to classify trends, detect anomalous patterns and perform simulations. The idea behind distributing the computation of data mining techniques is to overcome common limitations of statistical algorithms:

the time required by the training and projection procedures; the linear or non-linear growth in processing time as the amount of data increases; the multiple runs needed to reduce overfitting or local minima problems; the multiple model topologies that must be evaluated to find the optimal model configuration.
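The last two limitations lend themselves to distribution because each run is independent of the others. The following is a minimal sketch of that idea on a local multi-core machine, using only Python's standard library; it is not the VRE's API. The toy objective, the function names (`train_once`, `best_per_config`) and the use of a step size as a stand-in for a model configuration are all assumptions made for illustration.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def loss(x):
    # Toy non-convex objective (an assumption for illustration):
    # global minimum near x = -1.30, local minimum near x = 1.13.
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the toy objective above.
    return 4 * x**3 - 6 * x + 1


def train_once(job):
    """One independent run: gradient descent from a seeded random start."""
    step_size, seed = job
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)
    for _ in range(500):
        x -= step_size * grad(x)
    return step_size, seed, loss(x)


def best_per_config(step_sizes, seeds, executor):
    """Run every (configuration, restart) pair; keep the best loss per configuration."""
    jobs = list(product(step_sizes, seeds))
    best = {}
    for step_size, _seed, value in executor.map(train_once, jobs):
        if step_size not in best or value < best[step_size]:
            best[step_size] = value
    return best


if __name__ == "__main__":
    # Each job runs in a separate worker process, one per available core by default.
    with ProcessPoolExecutor() as pool:
        results = best_per_config([0.01, 0.05], range(8), pool)
    for step_size, value in sorted(results.items()):
        print(f"step_size={step_size}: best loss {value:.4f}")
```

Because `best_per_config` only relies on the executor's `map` method, the same function can be driven by a thread pool for quick checks or, in principle, by an executor that dispatches jobs to remote nodes. A single run may stop in the local minimum, which is why several restarts per configuration are submitted.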

All of the above issues strongly limit the time a scientist can dedicate to evaluating the results and to combining and comparing the outcomes of different experiments. On the other hand, the Statistical Data Mining VRE offers the advantages of a distributed e-Infrastructure endowed with many data sources. Some of these are:

efficiency and time savings in computations; availability of a set of data sources containing environmental or species features; reliability of the quality of the features; certification of compliance between e-Infrastructure data sources and algorithm inputs/outputs; import of users' own files; sharing of results and users' files.