
 Data Lake

What is a data lake?

As they say on aws.amazon.com:

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.


In other words, you can throw in almost anything. That said, you do not want a complete mess. Data lakes have built-in systems for structuring, cataloguing and indexing the content. To help these systems do a good job, you should add as much information as possible to them. This way, searches in the lake get more accurate and faster.


This can be done using the file attributes built into the files, as well as through the choice of file format. The information in file attributes follows the file wherever you put it, so you are not dependent on file path hierarchies.
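As a minimal sketch of the idea, assuming the lake lives on Amazon S3 and you use boto3 (the bucket name, key and attribute names here are hypothetical), custom attributes can be attached as object metadata so they travel with the file:

import boto3

s3 = boto3.client("s3")

# Upload a file with custom attributes stored as object metadata,
# so the information travels with the file instead of living in
# a folder hierarchy.
with open("customer.parquet", "rb") as body:
    s3.put_object(
        Bucket="my-data-lake",           # hypothetical bucket
        Key="raw/crm/customer.parquet",  # hypothetical key
        Body=body,
        Metadata={
            "source-system": "crm",
            "schema-version": "1",
        },
    )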


Another thing to think about is not to destroy good structure while ingesting the data by making the wrong choice of file format.

An example of this is reading from a database and storing the result as a delimited file like CSV. That way you lose datatypes, lengths and the possibility to validate. Instead you should use a semi-structured format like XML or Parquet.

Yes, you are going to get some overhead this way, but I promise you it is going to be faster in the end.
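As a small sketch of the difference, assuming pandas with a SQLAlchemy engine (the connection string and table name are hypothetical): Parquet keeps the column types inside the file, while CSV flattens everything to text.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name.
engine = create_engine("postgresql://user:password@host/dw")
df = pd.read_sql("SELECT * FROM dim_customer", engine)

# Parquet stores the datatype of every column inside the file.
df.to_parquet("customer.parquet")

# CSV turns everything into text, so integers, dates and booleans
# have to be guessed all over again by the next reader.
df.to_csv("customer.csv", index=False)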

Another thing to consider is whether you need compression, and even whether to split the data into many files. Some file formats handle splitting better than others: they can store information inside the file that makes it possible to read many files together. Choosing a file format with high compression can also save you a lot of money in the transformation process.
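As a sketch of both, assuming pyarrow (the sample rows and paths are mine, with columns borrowed from the Customer example further down), a compressed, partitioned Parquet dataset is split into many small files that still read back as one table:

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# A couple of sample rows.
df = pd.DataFrame({
    "CustomerKey": [11000, 11001],
    "GeographyKey": [26, 37],
    "FirstName": ["Jon", "Anna"],
})

# Snappy-compressed Parquet, split into one file per GeographyKey.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="lake/customer",
    partition_cols=["GeographyKey"],
    compression="snappy",
)

# The whole directory of files reads back as a single table.
customer = ds.dataset("lake/customer", format="parquet").to_table()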


There are many good file formats, which I would like to divide into two types: serial, row-based formats and parallel, column-based formats.

Examples of serial files

  • XML

  • JSON

  • AVRO

  • CSV

Examples of parallel

  • Parquet

  • ORC

XML

In my experience, XML is a good choice if you want to validate data and work with continuous loads. XML also has a header where you can store encryption keys etc. XML is a serial format and stores the information about the data on each row.

The disadvantages are large overhead and no compression.

 

All elements are surrounded with tags.

<FirstName>Jon</FirstName>

XML can be created using a schema that validates that each element meets its requirements. In this example it should be a string element.

<xs:element name="FirstName" type="xs:string"/>

The schema is powerful for keeping a good standard and can also ease the job of writing programs that consume XML data.


<Customer>
  <row>
    <CustomerKey>11000</CustomerKey>
    <GeographyKey>26</GeographyKey>
    <CustomerAlternateKey>AW00011000</CustomerAlternateKey>
    <FirstName>Jon</FirstName>
    <MiddleName>V</MiddleName>
    <LastName>Yang</LastName>
    <NameStyle>0</NameStyle>
    <BirthDate>1971-10-06</BirthDate>
  </row>
</Customer>
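As a minimal sketch, assuming Python with the lxml library and that the schema and document above are saved as customer.xsd and customer.xml (hypothetical file names), validation looks like this:

from lxml import etree

# Load the XSD schema and the XML document (file names are hypothetical).
schema = etree.XMLSchema(etree.parse("customer.xsd"))
doc = etree.parse("customer.xml")

# validate() returns True or False; error_log explains any failure,
# e.g. an element that does not meet its declared type.
if schema.validate(doc):
    print("customer.xml is valid")
else:
    print(schema.error_log)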

JSON

JSON (JavaScript Object Notation) has many similarities with XML. It has a little less overhead, but it does not have a header where you could put extra information like encryption keys etc. Like XML, it is mostly used in continuous integrations because it is serial (it has metadata on each row).


JSON and XML can be read using a standard text editor, but then there is no internal compression. My experience is that most text editors are terribly slow with JSON or XML once there is a bit more data in the file. So, my recommendation is to use these formats for relatively small files.


{
    "Customer": [{
            "CustomerKey": 11000,
            "GeographyKey": 26,
            "CustomerAlternateKey": "AW00011000",
            "FirstName": "Jon",
            "MiddleName": "V",
            "LastName": "Yang",
            "NameStyle": false,
            "BirthDate": "1971-10-06"
        }
    ]
}
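As a small sketch, assuming Python's standard json module and that the sample above is saved as customer.json (a hypothetical file name), a consuming program gets each field back with its JSON type:

import json

# Parse the sample document (file name is hypothetical).
with open("customer.json") as f:
    data = json.load(f)

customer = data["Customer"][0]
print(customer["FirstName"])    # 'Jon' (str)
print(customer["CustomerKey"])  # 11000 (int)
print(customer["NameStyle"])    # False (bool)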


AVRO

Avro is also a serial, row-based format, but unlike XML and JSON it is binary. The schema that describes the data is stored as JSON in the file header, so the file carries its own metadata, and Avro supports internal compression and splitting. That makes it a better fit than XML or JSON when the files grow bigger.

An Avro schema for the same Customer record could look like this:

{
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustomerKey", "type": "int"},
        {"name": "GeographyKey", "type": "int"},
        {"name": "CustomerAlternateKey", "type": "string"},
        {"name": "FirstName", "type": "string"},
        {"name": "MiddleName", "type": "string"},
        {"name": "LastName", "type": "string"},
        {"name": "NameStyle", "type": "boolean"},
        {"name": "BirthDate", "type": "string"}
    ]
}
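As a minimal sketch, assuming the fastavro library (the file name and sample record are mine), writing and reading a Customer record looks like this:

from fastavro import parse_schema, reader, writer

# A trimmed version of the schema above.
schema = parse_schema({
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustomerKey", "type": "int"},
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"},
    ],
})

records = [{"CustomerKey": 11000, "FirstName": "Jon", "LastName": "Yang"}]

# Write a compressed binary Avro file; the schema travels in the header.
with open("customer.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Read it back; the schema is recovered from the file itself.
with open("customer.avro", "rb") as fo:
    for record in reader(fo):
        print(record)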

