Delta Lake Historization
Work with the log file that contains the history of loads and other operations. This is very handy for building maintainable loads, like the possibility to undo something you regret using a rollback.
Understand the history log and get some ideas on how to work with it.
Read history
To begin with, we read the log history. For that we need
from delta.tables import *
We also need
from pyspark.sql.functions import *
to write the PySpark SQL expressions used later.
We need a Delta Lake table from which we can read the history. (Later on we will create some history by doing deletes and updates.) Set up the connection to your delta lake; usually it is the top-level path where the history is stored, the one where you see the “_delta_log” folder.
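A minimal sketch of reading the history, assuming an active SparkSession named spark; the path /mnt/datalake/users is a placeholder for your own lake:

from delta.tables import DeltaTable

# Point this at the top-level folder that contains the _delta_log folder
deltaTable = DeltaTable.forPath(spark, "/mnt/datalake/users")

# history() returns an ordinary Spark DataFrame, newest version first
deltaTable.history().show(truncate=False)

# Pass a limit to only fetch the most recent operations, here the last one
deltaTable.history(1).show()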
The log columns you get are:
- version
- timestamp
- userId
- userName
- operation
- operationParameters
- job
- notebook
- clusterId
- readVersion
- isolationLevel
- isBlindAppend
- operationMetrics
- userMetadata
- engineInfo
Delete row in Delta Lake
To show how this works, and a nice benefit of using Delta Lakes, we delete a row in the lake.
I use PySpark SQL to do the deletion, as sketched below.
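A minimal sketch of the delete, assuming the deltaTable handle from the read-history example and a hypothetical column "name" that holds the user name:

from pyspark.sql.functions import col

# Delete every row where the (assumed) column "name" equals _erikberglund
deltaTable.delete(col("name") == "_erikberglund")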
Rollback
If, like we did in the delete example, you decided to delete all users with the name _erikberglund and you regret your delete, there is a cool command, restoreToVersion, on deltaTable, if you know which version you want to roll back to.
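A minimal sketch of the rollback, assuming the version right before the delete was version 5; look up the real number in deltaTable.history() first:

# Restore the table to an earlier version; 5 is a placeholder version number
deltaTable.restoreToVersion(5)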
Another deltaTable command worth knowing is restoreToTimestamp, which rolls back to a point in time instead of a version number.
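A minimal sketch with a placeholder timestamp:

# Restore the table to how it looked at a given point in time
deltaTable.restoreToTimestamp("2023-05-01 00:00:00")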
Update row in Delta Lake
Updates are very much like how you do them in SQL: add a condition, like a WHERE clause in SQL, and a set that describes what you want the values to be.
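A minimal sketch of an update, assuming the same deltaTable handle and hypothetical "name" column as in the delete example; the new value is a placeholder:

from pyspark.sql.functions import col, lit

deltaTable.update(
    condition = col("name") == "_erikberglund",  # like WHERE in SQL
    set = {"name": lit("erik.berglund")}         # like SET in SQL
)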