Implementing Data Lakes successfully


Data Lakes are used, among other things, to break down data silos and to enable Big Data Analytics or Industrial Analytics. What needs to be considered during implementation so that the advantages of the approach actually take effect?


In a Data Lake, a single storage repository holds all structured and, above all, unstructured data for the purpose of analysis or reporting.

Thanks to growing computing power, cloud storage capacity and usage, and network connectivity, the flood of data building up in a company can quickly turn into a barely manageable tsunami. This tidal wave arrives in all formats and from a wide variety of sources, such as Internet-of-Things devices, social media sites, sales systems, and internal network systems.

Driving Big Data Analytics forward

An agile approach to Data Lakes can help companies not only achieve more efficient data management in the long term, but also get to Big Data Analytics much faster. After all, the concept of the Data Lake stands for the use of data analysis and the innovations that come with it.

A Data Lake is a huge repository that serves as a store for structured and unstructured data from a wide range of sources. Unstructured data is considered the fastest-growing form of data and is expected to account for around 90 percent of all data.

Important: the structure of the data, and the requirements placed on it, are determined only when the data is fed into the respective application (schema-on-read). This has the advantage that, by decoupling storage from computation and analysis, both areas can be scaled independently. At the same time, Data Lakes dissolve the often inefficient data silos that are typically operated per department or partly in isolation from one another.
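The schema-on-read principle described above can be illustrated with a minimal Python sketch. All names here (the lake directory, the event fields) are illustrative assumptions, and a temporary directory stands in for real object storage:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw zone of the lake; a temp directory stands in for object storage.
LAKE_DIR = Path(tempfile.mkdtemp()) / "raw" / "events"
LAKE_DIR.mkdir(parents=True)

def write_raw(event: dict, name: str) -> None:
    """Ingestion: store the event exactly as delivered; no schema is enforced."""
    (LAKE_DIR / name).write_text(json.dumps(event))

def read_with_schema(fields: list[str]) -> list[dict]:
    """Consumption: the reading application decides which fields matter."""
    records = []
    for path in sorted(LAKE_DIR.glob("*.json")):
        raw = json.loads(path.read_text())
        # Project onto the schema this analysis needs; absent keys become None.
        records.append({f: raw.get(f) for f in fields})
    return records

write_raw({"device": "sensor-1", "temp": 21.5, "note": "kept but unused"}, "e1.json")
write_raw({"device": "sensor-2", "temp": 19.0}, "e2.json")
print(read_with_schema(["device", "temp"]))
# → [{'device': 'sensor-1', 'temp': 21.5}, {'device': 'sensor-2', 'temp': 19.0}]
```

Because the schema lives in `read_with_schema` rather than in the storage layer, storage and analysis can evolve and scale independently, exactly as the decoupling argument above suggests.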

Detecting pitfalls early

Given the ever-increasing number of technologies and tools for collecting, maintaining, and evaluating business-critical information, many companies are still unsure how to handle this data in a targeted way. Not infrequently, more or less large "data graveyards" arise that ultimately cannot be evaluated in any useful way.

According to experts, companies need suitably qualified IT teams and service providers in order to develop applications on the basis of Data Lakes and truly achieve the full flexibility. After all, a Data Lake contains all the data that is "tipped into" it, so the relevance, completeness, and integrity of that data cannot be taken for granted.

Data privacy poses another problem: regulations prevent arbitrary kinds of data from being stored in a Data Lake. In addition, the context and semantics of the data can get lost along the way. For analysis it is often far from negligible what a particular data record actually refers to, which in turn requires a minimum of structuring. This shows how a "Data Lake" can turn into a "Data Swamp" that can no longer be evaluated in any useful way.
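One common way to preserve context and support privacy rules is to wrap every record with a small amount of metadata at ingestion time. The following sketch is illustrative only: the field names and the retention rule are assumptions, not a prescription of any particular regulation:

```python
import time

def ingest_with_metadata(payload: dict, source: str, contains_pii: bool) -> dict:
    """Wrap a raw record with minimal metadata so context and semantics survive.

    The retention values are placeholders; real values depend on the
    applicable regulations and company policy.
    """
    return {
        "metadata": {
            "ingested_at": time.time(),    # when the record entered the lake
            "source": source,              # where it came from
            "contains_pii": contains_pii,  # flag for privacy handling
            "retention_days": 30 if contains_pii else 3650,
        },
        "payload": payload,
    }

record = ingest_with_metadata({"order_id": 42, "email": "a@example.com"},
                              source="webshop", contains_pii=True)
print(record["metadata"]["retention_days"])  # → 30
```

Even this minimum of structuring keeps track of what a data record refers to and when it must be deleted, which is exactly the safeguard against the "Data Swamp" scenario described above.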

On the other hand, valuable conclusions can also be drawn from highly structured data, or from the metadata of encrypted data, as part of a Data Lake. This does not mean that a Data Lake is a good replacement for existing IT structures or a classic Data Warehouse. It can, however, be a good fit for businesses that need to process particularly large volumes of unstructured data in order to develop new business models.

Implementation of Data Lakes

IT teams should always follow an agile approach to the design and implementation of Data Lakes. This means testing different technologies and management approaches in advance and gradually refining them, before moving on, in a further step, to the appropriate processes for data storage.

In the development of Data Lakes, companies typically pass through the following four stages: supply of the raw data, the experimental phase, detoxification of the Data Warehouses, and replacement of other data stores.

Supply of the raw data

In the first phase, the Data Lake is set up separately from the core IT systems and serves as a cost-effective, scalable "pure capture" environment.

The Data Lake acts as a data management layer within the company's technology stack, in which raw data is stored indefinitely before being prepared for use in computing environments. If companies want to avoid a "data swamp", strict governance as well as identification and classification of the data must be enforced already in this early phase.
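Identification and classification at ingestion time can be as simple as routing each incoming object by its data class. The rules and path layout below are hypothetical examples, not a standard:

```python
from pathlib import PurePosixPath

# Hypothetical classification rules mapping a file suffix to a data class.
CLASSES = {".csv": "structured", ".json": "semi-structured",
           ".jpg": "unstructured", ".log": "unstructured"}

def classify(path: str) -> dict:
    """Assign a landing-zone prefix and class to each incoming object."""
    p = PurePosixPath(path)
    data_class = CLASSES.get(p.suffix, "unknown")
    return {
        "source_path": path,
        "class": data_class,
        # Objects the rules cannot classify go to quarantine for manual review
        # instead of silently piling up in the raw zone.
        "target": (f"raw/{data_class}/{p.name}" if data_class != "unknown"
                   else f"quarantine/{p.name}"),
    }

print(classify("exports/sales_2023.csv")["target"])  # → raw/structured/sales_2023.csv
print(classify("dump/blob.bin")["target"])           # → quarantine/blob.bin
```

The quarantine branch is the governance hook: anything that cannot be identified is kept out of the analytical zones until a human has looked at it.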

Experimental Phase

Now companies can begin to actively use the Data Lake as a platform for experiments. Data scientists enjoy simple, fast access to all the data and can focus more strongly on experiments and on analyzing the data.

In such an isolated environment, also called a sandbox, prototypes for analysis programs can be built against the same data. It is also advisable to provide a set of open-source and commercial tools.
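The key property of such a sandbox is isolation: experiments run against a copy, so the raw data stays intact. A minimal sketch, with temporary directories standing in for real lake storage:

```python
import shutil
import tempfile
from pathlib import Path

def create_sandbox(lake_dir: Path) -> Path:
    """Clone lake data into an isolated scratch area so that experiments
    cannot corrupt the original raw data."""
    sandbox = Path(tempfile.mkdtemp(prefix="sandbox_"))
    shutil.copytree(lake_dir, sandbox / "data")
    return sandbox

# Stand-in lake directory with one raw file.
lake = Path(tempfile.mkdtemp()) / "raw"
lake.mkdir()
(lake / "events.json").write_text('{"device": "sensor-1"}')

sandbox = create_sandbox(lake)
# An experiment modifies only the copy; the original file is untouched.
(sandbox / "data" / "events.json").write_text('{"device": "changed"}')
print((lake / "events.json").read_text())  # → {"device": "sensor-1"}
```

In practice the copy would usually be virtual (snapshots, zero-copy clones) rather than a physical duplicate, but the isolation guarantee is the same.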

Detoxification of Data Warehouses

To benefit from the lower storage costs of the Data Lake, rarely used or inactive data can increasingly be moved from the Data Warehouse into the Data Lake in this third phase.

In the meantime, the IT team can start extracting relational data from the Data Warehouse in order to process it at high intensity. Accordingly, extraction and transformation tasks migrate into the Data Lake.
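Offloading cold warehouse data can be sketched as follows. This is a toy illustration only: SQLite stands in for the warehouse, the `sales` table and the cutoff year are assumptions, and one JSON file per row replaces the batched columnar files (for example Parquet) a real lake would use:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical archive zone of the lake; a temp directory stands in for it.
LAKE = Path(tempfile.mkdtemp()) / "archive"

def offload_cold_rows(conn: sqlite3.Connection, cutoff_year: int) -> int:
    """Move rows older than the cutoff from the warehouse table into the lake."""
    LAKE.mkdir(parents=True, exist_ok=True)
    rows = conn.execute(
        "SELECT id, year, amount FROM sales WHERE year < ?", (cutoff_year,)
    ).fetchall()
    for rid, year, amount in rows:
        # Write the archived row to cheap lake storage ...
        (LAKE / f"sales_{rid}.json").write_text(
            json.dumps({"id": rid, "year": year, "amount": amount}))
    # ... then free the expensive warehouse storage.
    conn.execute("DELETE FROM sales WHERE year < ?", (cutoff_year,))
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 2015, 100.0), (2, 2023, 250.0)])
moved = offload_cold_rows(conn, 2020)
print(moved)  # → 1
```

The same pattern, run on a schedule, gradually shifts inactive data and the associated extraction and transformation work out of the warehouse and into the lake.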

Replacement of other data storage

In this phase, it becomes possible for the majority of the information flowing through the company to be routed via the Data Lake. The Data Lake is now an essential part of the data infrastructure: it replaces existing Data Marts or other internal data stores and enables the provision of data services.

Companies can now take full advantage of Data Lake technology and its ability to handle compute-intensive tasks, as required for running innovative analyses or providing machine learning (ML) programs.

Some IT teams may decide to build data-intensive applications, such as a dashboard for performance management, in the Data Lake. Others implement APIs to connect information from Data Lake resources seamlessly with insights from other applications.