Downloads

Arboretum

The GBDI Arboretum is a portable C++ library that implements various metric access methods (MAMs). With this library, an application can perform similarity queries (queries by content) with minimal effort. It also provides a robust and uniform platform for MAM developers, allowing fair comparisons among all the methods the library implements. Click here to visit the Arboretum web site.

Siren

SIREN - (SI)milarity (R)etrieval (EN)gine - is a command language interpreter that adds similarity query capabilities to SQL. Web-SIREN is a web-based interface that exposes SIREN resources for Internet access.

FastMapDB

The FastMapDB tool creates multidimensional mappings from data stored in Relational Database Management Systems (RDBMS) and supports interactive visualization of the data in 3D representations that try to preserve the distances among the objects. Some related documents about the tool are also available.

Dicomviewer

The Dicomlib is a C++ library to manipulate DICOM image files.

DBGen - DataBase Generator

DBGen is a tool that synthesizes multidimensional data in spaces of any dimensionality, following geometric, statistical or fractal distributions. It aims to give analysts a rich palette of options for exercising their algorithms and analysis techniques.

iDFQL

The iDFQL (Interactive Data Flow Query Language) is a query-based tool to support the teaching/learning of relational algebra by using the flow diagram approach. Please check the available publications.

MamView

MamView is a visual tool for exploring and understanding metric access methods. A description of the tool is also available.

IDEA

To understand the IDEA method, please see the paper. You can download the IDEA prototype, which works with a sample database (the ROI dataset described in the paper):

The working buttons are “Load Medical Image” and “Suggest a Diagnosis”, along with the option “Load Diagnosis from the File”, which allows you to compare the diagnosis suggested by the method with the report given by the specialists.

MetricSPlat project

The MetricSPlat project combines visualization techniques and content-based data retrieval methodologies. Its goal is to provide a framework where the constituents that define the concept of a metric space can be instantiated and tested. A metric space, in the context of content-based data retrieval, is understood as the integration of the following components:

  • extracted features
  • a metric access structure
  • a distance function

In MetricSPlat these constituents can be integrated with minimal effort, allowing quick development and testing of techniques for content-based data retrieval.
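To make the three components concrete, here is a minimal Python sketch of how they fit together. It is not MetricSPlat code: `extract_features`, `euclidean` and `LinearScanIndex` are hypothetical names chosen for illustration, and the linear scan merely stands in for a real metric access structure.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def extract_features(raw: Sequence[float]) -> Tuple[float, float]:
    """Hypothetical feature extractor: summarize raw values as (mean, range)."""
    return (sum(raw) / len(raw), max(raw) - min(raw))


def euclidean(a: Vector, b: Vector) -> float:
    """Distance function: plain Euclidean distance between feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class LinearScanIndex:
    """Stand-in for a metric access structure (assumption: no pruning at all)."""

    def __init__(self, distance: Callable[[Vector, Vector], float]) -> None:
        self.distance = distance
        self.items: List[Tuple[str, Vector]] = []

    def add(self, key: str, features: Vector) -> None:
        self.items.append((key, features))

    def range_query(self, center: Vector, radius: float) -> List[str]:
        """Return keys of all objects within `radius` of `center`."""
        return [k for k, f in self.items if self.distance(center, f) <= radius]

    def knn_query(self, center: Vector, k: int) -> List[str]:
        """Return keys of the k objects closest to `center`."""
        ranked = sorted(self.items, key=lambda kf: self.distance(center, kf[1]))
        return [key for key, _ in ranked[:k]]


# Toy data: each object is a short raw sequence.
raw_objects = {"a": [0, 0, 0], "b": [1, 3, 5], "c": [2, 6, 10]}

index = LinearScanIndex(euclidean)
for key, raw in raw_objects.items():
    index.add(key, extract_features(raw))

query = extract_features([0, 0, 0])
in_range = index.range_query(query, radius=5.0)
nearest = index.knn_query(query, k=2)
```

Because the index only calls the distance function it was given, swapping in another metric or another feature extractor does not require touching the query code, which is the kind of plug-and-test workflow the description suggests.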

The Method Halite for Correlation Clustering

Halite is a fast and scalable density-based clustering algorithm for moderate-to-high-dimensionality data, able to analyze large collections of complex data elements. It lays a multi-dimensional grid over the whole data space and counts the number of points lying in each hyper-cubic cell of the grid. A hyper-quad-tree-like structure, called the Counting-tree, is used to store the counts. The tree is then submitted to a filtering process that identifies regions that are, in a statistical sense, denser than their neighboring regions in at least one dimension, which leads to the final clustering result. The algorithm is fast and has linear or quasi-linear time and space complexity with respect to both the data size and the dimensionality.
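The counting step described above can be sketched as follows. This is only an illustration under stated assumptions, not the Halite implementation: the points are assumed to be normalized to the unit hypercube [0, 1)^d, the function names are hypothetical, and a list of per-level counters stands in for the Counting-tree. The statistical filtering step is not shown.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Cell = Tuple[int, ...]


def count_grid_cells(points: Iterable[Sequence[float]],
                     cells_per_axis: int) -> Counter:
    """Count how many points fall in each hyper-cubic cell of a regular grid.

    Assumes every coordinate lies in [0, 1); values equal to 1.0 are
    clamped into the last cell.
    """
    counts: Counter = Counter()
    for point in points:
        cell: Cell = tuple(
            min(int(x * cells_per_axis), cells_per_axis - 1) for x in point
        )
        counts[cell] += 1
    return counts


def multi_resolution_counts(points: List[Sequence[float]],
                            levels: int) -> List[Counter]:
    """Counts at resolutions 2, 4, 8, ... cells per axis (one per level)."""
    return [count_grid_cells(points, 2 ** h) for h in range(1, levels + 1)]


points = [(0.10, 0.20), (0.15, 0.22), (0.12, 0.24), (0.90, 0.80)]
per_level = multi_resolution_counts(points, levels=2)
coarse, fine = per_level
```

Only non-empty cells are stored, so memory grows with the number of occupied cells rather than with the full grid size, which matters as the dimensionality increases.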

Remark: A first implementation of Halite was initially named MrCC (after Multi-resolution Correlation Clustering) in an earlier Conference Publication of this work. Later, it was renamed Halite for clarity, since several improvements on the basic implementation were included in a Journal Publication.

The Method BoW for Clustering Terabyte-scale Datasets

The BoW method addresses the problem of finding clusters in Terabytes of moderate-to-high dimensionality data, such as features extracted from billions of complex data elements. In these cases, a serial processing strategy is usually impractical: just reading a single Terabyte of data (at 5GB/min on a single modern eSATA disk) takes more than 3 hours. BoW exploits parallelism and can use almost any serial clustering method as a plug-in. The major research challenges addressed are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may become the bottleneck. Our method automatically spots the bottleneck and chooses a good strategy; one of the strategies uses a novel sampling-and-ignore idea to reduce the network traffic. Specifically, BoW (a) accepts potentially any serial algorithm as a plug-in and (b) makes the plug-in run efficiently in parallel by adaptively balancing the cost of disk accesses and network accesses, which allows BoW to achieve a very good tradeoff between these two possible bottlenecks.
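The read-time figure quoted above is easy to check. The short Python sketch below assumes 1 Terabyte = 1000 GB and uses a hypothetical helper, `sequential_read_hours`; the multi-disk case assumes ideal, perfectly even parallel reads and ignores the network cost that BoW also has to balance.

```python
def sequential_read_hours(size_gb: float,
                          rate_gb_per_min: float,
                          n_disks: int = 1) -> float:
    """Hours needed to read `size_gb`, split evenly over `n_disks` disks.

    Idealized: assumes perfect parallel scaling and no network overhead.
    """
    minutes = size_gb / (rate_gb_per_min * n_disks)
    return minutes / 60.0


one_disk = sequential_read_hours(1000.0, 5.0)            # 200 minutes
ten_disks = sequential_read_hours(1000.0, 5.0, n_disks=10)

print(f"1 disk:   {one_disk:.2f} hours")   # 3.33 hours
print(f"10 disks: {ten_disks:.2f} hours")  # 0.33 hours
```

Even under these optimistic assumptions, disk I/O alone takes hours on a single disk, which is why the data partitioning across disks matters as much as the clustering algorithm itself.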

The Method QMAS for Labeling and Summarization

The algorithm QMAS focuses on two distinct data mining tasks: labeling and summarizing large sets of moderate-to-high dimensionality data, such as features extracted from Gigabytes of complex data elements. Specifically, QMAS is a fast and scalable solution to two problems: (a) low-labor labeling, i.e., given a large collection of data objects, very few of which are labeled with keywords, find the most suitable labels for the remaining ones; and (b) mining and attention routing, i.e., in the same setting, find clusters, the top-NO outlier objects, and the top-NR representative objects. The algorithm is fast, scales linearly with the data size, and works even with tiny initial label sets.