Delta Lake formatted data.
In a previous assignment I realized that knowledge about Delta Lake formatted data was low, so I decided to write a blog post about it.
A Delta Lake is based on the Parquet format with a log structure on top, but the similarity with Parquet ends there. A Delta Lake is more of a database format where you can insert, update, delete and historize data. Also, some loading management is included, so I promise you it's worth studying this deeper. A good place to find more knowledge is "https://delta.io/".
Create a Delta Lake function.
I usually prefer to keep code I reuse in a function, to make it easier to maintain. This function takes two parameters: "Destination", the path where the Delta table is going to be stored, and "DataFrame", the Spark DataFrame with the data I want to save. A sketch of the function is shown below.
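A minimal sketch of such a function in PySpark could look like this. The function name, parameter names and the partition columns (Filename, Year, Month, Day) are my assumptions for illustration, and the cluster needs Delta Lake configured (built in on Databricks):

```python
from pyspark.sql import DataFrame


def write_delta(Destination: str, df: DataFrame) -> None:
    # Write the DataFrame as a Delta Lake table at the Destination path.
    (df.write
        .format("delta")                                  # this is where it becomes Delta Lake format
        .partitionBy("Filename", "Year", "Month", "Day")  # folder structure: Filename/Year/Month/Day
        .mode("overwrite")                                # replace existing data; "append" would add to it
        .save(Destination))
```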
- [.format] is set to "delta"; this is where it becomes the Delta Lake format.
- [.partitionBy] is how you want the partitioning to look. In this case the data is stored in a path structure of Filename, Year, Month and Day.
- [.mode] is set to overwrite in this example, but it could be set to append. The most important difference is that if you have done some deletes in the Delta Lake, they might be lost if you do a rollback, because the deletes are physically removed from the Parquet structure. A usage example follows after this list.
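With those options in place, a call to the function could look like this. The SparkSession setup, the sample columns and the storage path are placeholders, not from the original post:

```python
from pyspark.sql import SparkSession

# Hypothetical usage on a cluster where Delta Lake is available.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("weblog.json", 2024, 5, 17, "GET /index")],
    ["Filename", "Year", "Month", "Day", "Request"],
)
write_delta("/mnt/lake/weblogs", df)  # replace with your own storage location
```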
To read the data back, simply replace write with read in the function, as in the sketch below.
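A possible read counterpart, again with assumed names and paths for illustration:

```python
from pyspark.sql import DataFrame, SparkSession


def read_delta(Source: str) -> DataFrame:
    # Read a Delta Lake table from the Source path; mirrors the write function above.
    spark = SparkSession.builder.getOrCreate()
    return spark.read.format("delta").load(Source)


# Example: df = read_delta("/mnt/lake/weblogs")
```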