Delta Lake Historization
Work with the log file that contains the history of loads and other operations. This is very handy for building maintainable loads, like the possibility to undo something you regret using a rollback.
Understand the history log and get some ideas on how to work with it.
Read history
To begin with, we read the log history. For that we need
from delta.tables import *
We also need
from pyspark.sql.functions import *
to write the PySpark SQL expressions used later.
We need a Delta Lake table from which we can read the history. (Later on we will create some history by doing deletes and updates.) Set up the connection to your delta lake; usually it is the top-level path where the history is stored, the one where you see the “_delta_log” folder.
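A minimal sketch of reading the history, assuming an active SparkSession named spark; the path /mnt/datalake/users is a placeholder for your own lake:

from delta.tables import DeltaTable

# Point this at the top-level folder that contains the _delta_log folder
deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/users")

# history() returns an ordinary Spark DataFrame, newest version first
deltaTable.history().show(truncate=False)

# Pass a limit to only fetch the most recent operations, here the last one
deltaTable.history(1).show()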
The log columns you get are:
- version
- timestamp
- userId
- userName
- operation
- operationParameters
- job
- notebook
- clusterId
- readVersion
- isolationLevel
- isBlindAppend
- operationMetrics
- userMetadata
- engineInfo
Delete row in Delta Lake
To show how this works, and a nice benefit of using Delta Lakes, we delete a row in the lake.
I use PySpark SQL to do the deletion, as sketched below.
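A minimal sketch of the delete, assuming the deltaTable handle from the read-history example and a hypothetical column "name" that holds the user name:

from pyspark.sql.functions import col

# Delete every row where the (assumed) column "name" equals _erikberglund
deltaTable.delete(col("name") == "_erikberglund")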
Rollback
If, like we did in the delete example, you decided to delete all users with the name _erikberglund and you regret your delete, there is a cool command, restoreToVersion, on deltaTable, if you know which version you want to roll back to.
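A minimal sketch of the rollback, assuming the version right before the delete was version 5; look up the real number in deltaTable.history() first:

# Restore the table to an earlier version; 5 is a placeholder version number
deltaTable.restoreToVersion(5)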
Another deltaTable command worth knowing is restoreToTimestamp, which rolls back to a point in time instead of a version number.
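A minimal sketch with a placeholder timestamp:

# Restore the table to how it looked at a given point in time
deltaTable.restoreToTimestamp("2023-05-01 00:00:00")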
Update row in Delta Lake
Updates are very much like how you do them in SQL: add a condition, like a WHERE clause in SQL, and a set that describes what you want the values to be.
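A minimal sketch of an update, assuming the same deltaTable handle and hypothetical "name" column as in the delete example; the new value is a placeholder:

from pyspark.sql.functions import col, lit

deltaTable.update(
    condition = col("name") == "_erikberglund",  # like WHERE in SQL
    set = {"name": lit("erik.berglund")}         # like SET in SQL
)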