Arjun Kumar, January 22, 2022

What is Azure Data Lake?

Microsoft Azure Data Lake is a highly scalable public cloud service that allows developers, scientists, business professionals and other Microsoft customers to gain insight from large, complex data sets. As with most data lake offerings, the service is composed of two parts: data storage and data analytics.

According to Microsoft, customers can provision Azure Data Lakes to store an unlimited amount of structured, semi-structured or unstructured data from a variety of sources. The service does not impose limits on account sizes, file sizes, or the amount of data that can be stored in a data lake.

On the analytics side, Azure Data Lake customers can write their own code to perform specific operational or transactional data transformation and analysis tasks. They can also use existing tools, such as Microsoft’s Analytics Platform System or Azure Data Lake Analytics, to query data sets.

Azure Data Lake includes the following features:

True HDFS Compatibility

Azure Data Lake is a true Hadoop-compatible file system. It works with the major Hadoop distributions, such as Hortonworks Data Platform and Cloudera Enterprise Data Hub, and with Hadoop ecosystem projects.

Unlimited Data Size

There is no fixed size limit on files or accounts in Azure Data Lake. You can store single or multiple files at terabyte and petabyte scale and process them with Hadoop. This is an ideal situation because Hadoop is at its best when working with very large files.

Fault-Tolerant and Available

In a typical Hadoop implementation, data is replicated three times in the cluster. Azure Data Lake adopts this practice in order to ensure the data is available in the event of hardware failure or interruption.
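The cost and benefit of three-way replication can be worked out with simple arithmetic. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, the factor of 3 comes from the paragraph above, and nothing here calls an Azure or Hadoop API.

```python
def raw_storage_needed(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw capacity consumed when every block is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_tb * replication_factor


def copy_losses_tolerated(replication_factor: int = 3) -> int:
    """Number of copies of a block that can be lost while one copy survives."""
    return replication_factor - 1


if __name__ == "__main__":
    print(raw_storage_needed(10))   # 10 TB of data occupies 30 TB of raw disk
    print(copy_losses_tolerated())  # 2
```

With three copies, a block stays readable even if two of the copies are lost, at the price of three times the raw storage.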

Designed for Parallel Processing

One of the key benefits of the Hadoop file system is that it keeps the data close to the compute. In an on-premises implementation, this is done by having dedicated disks in each physical node and moving data between nodes as little as possible. Azure Data Lake is designed to support throughput for parallel processing scenarios in order to provide performance similar to that of an on-premises Hadoop implementation.
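The general idea behind parallel processing of a large file is to divide it into independent byte ranges, work on each range separately, and combine the partial results. The following is a minimal sketch of that pattern in plain Python, assuming an in-memory byte string stands in for a stored file. The names `split_ranges`, `count_newlines` and `parallel_line_count` are hypothetical, and this is not how Azure Data Lake or Hadoop schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover total_size bytes without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset, min(chunk_size, total_size - offset))
        for offset in range(0, total_size, chunk_size)
    ]


def count_newlines(data: bytes, byte_range: Tuple[int, int]) -> int:
    """Stand-in for per-split work: count newline bytes inside one range."""
    offset, length = byte_range
    return data[offset:offset + length].count(b"\n")


def parallel_line_count(data: bytes, chunk_size: int, workers: int = 4) -> int:
    """Process each split independently, then combine the partial results."""
    ranges = split_ranges(len(data), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda r: count_newlines(data, r), ranges)
        return sum(partials)
```

Because each range is processed without reference to the others, the number of workers can grow with the size of the file, which is why throughput matters more than single-request latency in this kind of workload.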

Optimized for High-Speed Throughput

Data streaming off devices such as phones, sensors and equipment typically consists of very small transactions at very high volume. Ingesting this data into a data management system has traditionally been very difficult, from the standpoint of both application development and hardware and software support. Azure Data Lake supports extremely high-speed data ingestion at huge scale.
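One common way applications cope with many tiny records is to group them into larger batches before writing them out. The sketch below illustrates that idea only. The function `batch_events` and its `max_batch_bytes` parameter are hypothetical, and the code does not use any Azure Data Lake API.

```python
from typing import Iterable, Iterator, List


def batch_events(events: Iterable[bytes], max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Group small events into batches of at most max_batch_bytes each.

    An event larger than the limit is emitted as a batch of its own.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    for event in events:
        # Close the current batch if adding this event would exceed the limit.
        if batch and batch_bytes + len(event) > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += len(event)
    if batch:
        yield batch  # flush whatever is left at the end of the stream
```

For example, five 2-byte readings with a 4-byte limit would be written as three batches instead of five separate writes.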
