In the era of Big Data, decision-making processes are becoming increasingly data-driven and data-intensive. The Data Lake approach refers to assembling large amounts of diverse data from a multitude of data sources, retaining their original model and format, and allowing users to query and analyze them in situ. It thus promises to enable ad hoc, self-service analytics and to shorten the time from data to insights.

SmartDataLake aims at designing, developing and evaluating novel approaches, techniques and tools for extreme-scale analytics over Big Data Lakes. It tackles the challenges of reducing costs and extracting value from Big Data Lakes by providing solutions for virtualized and adaptive data access; automated and adaptive data storage tiering; smart data discovery, exploration and mining; monitoring and assessing the impact of changes; and empowering the data scientist in the loop through scalable and interactive data visualizations.

The results of the project will be evaluated in real-world use cases from the Business Intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making.

In a nutshell

Adaptive Data Virtualization and Storage Tiering

SmartDataLake designs and develops an adaptive, scalable and elastic Data Lake management system that offers:

data virtualization for abstracting and optimizing data access and queries over heterogeneous data
data synopses for approximate query answering and analytics to enable interactive response times
automated placement of data in different storage tiers based on data characteristics and access patterns to reduce costs.
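
To illustrate how a data synopsis can trade exactness for interactive response times, the following is a minimal sketch that assumes the synopsis is a fixed-size uniform random sample built with reservoir sampling. This is one common kind of synopsis, not necessarily the one used in SmartDataLake, and the function names `build_sample_synopsis` and `approx_sum` are hypothetical.

```python
import random


def build_sample_synopsis(values, k, seed=0):
    """Keep a uniform random sample of at most k values from a stream.

    Uses reservoir sampling (Algorithm R), so the data is read once and
    only k values are held in memory. Returns the sample and the number
    of values seen.
    """
    rng = random.Random(seed)
    sample = []
    n = 0
    for v in values:
        n += 1
        if len(sample) < k:
            sample.append(v)
        else:
            # Replace a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = v
    return sample, n


def approx_sum(sample, n):
    """Estimate the total of the full data by scaling the sample mean by n."""
    if not sample:
        return 0.0
    return n * (sum(sample) / len(sample))


if __name__ == "__main__":
    synopsis, seen = build_sample_synopsis(range(1, 100001), k=1000)
    print("approximate sum:", approx_sum(synopsis, seen))
    print("exact sum:      ", sum(range(1, 100001)))
```

The aggregate is answered from 1,000 stored values instead of the 100,000 original ones; the estimate is approximate, and its error shrinks as the sample grows.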

Heterogeneous Information Network Mining

SmartDataLake models the Data Lake contents as a Heterogeneous Information Network and provides:

similarity search and exploration for discovering relevant information
entity resolution and ranking for identifying and selecting entities across sources
link prediction and clustering for unveiling hidden associations and patterns among entities
change detection and incremental update of results to enable faster analysis of new data.
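
As a rough illustration of similarity search over a Heterogeneous Information Network, the sketch below computes PathSim, a well-known metapath-based similarity measure, over a symmetric two-step metapath. It is a swapped-in example technique, not a description of the project's own algorithms. The node types (companies and funds), the sample edges and the function name `pathsim` are all assumptions made for illustration.

```python
from collections import defaultdict


def pathsim(edges, x, y):
    """PathSim between x and y over the symmetric metapath A-B-A.

    `edges` is an iterable of (a, b) pairs linking a node of type A to a
    node of type B, with each pair counted once. The number of A-B-A
    paths between two A nodes is the number of B nodes they share.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)

    def path_count(u, v):
        return len(neighbors[u] & neighbors[v])

    denom = path_count(x, x) + path_count(y, y)
    if denom == 0:
        return 0.0
    return 2 * path_count(x, y) / denom


if __name__ == "__main__":
    # Hypothetical company-fund links.
    edges = [
        ("acme", "fund1"), ("acme", "fund2"),
        ("globex", "fund1"), ("globex", "fund2"),
        ("initech", "fund2"),
    ]
    print(pathsim(edges, "acme", "globex"))
    print(pathsim(edges, "acme", "initech"))
```

In this toy network, "acme" and "globex" share both of their funds and score 1.0, while "acme" and "initech" share one fund and score 2/3. Normalizing by each node's own path count keeps highly connected nodes from dominating the ranking.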

Visual Analytics over Spatial, Temporal and Graph Data

SmartDataLake offers interactive and scalable visual analytics to include and empower the data scientist in the knowledge extraction loop, including:

a visual analytics model that guides and orchestrates the interaction with the data scientist
visual exploration and tuning of the space of features, models and parameters
large-scale visualizations for spatial, temporal and network data.