Hive Metastore
As I have said on other pages, I prefer to use PySpark SQL in my transactions ETL etc. The reason is that its easier to create fast and maintainable code using Pyspark.
What is Hive metastore
(Answer from from ChatGPT)
Hive Metastore is a central repository that stores metadata information about the data stored in the Hive data warehouse. It is a database that contains metadata information such as table and column names, their data types, location of data, partition information, and other relevant information required to access and query the data stored in Hive.
What does this give us.
You can create a delta lake file format, or just use whatever data you could add to a data frame into to a Hive Table. The data in itself remains i the lake, but is now accessible using SQL(HiveQL). Cool isn’t it.
In the code below a Hive table is created named twitter in the TwitterLabb database or to be more precise the metadata for that table is created. And the data remains in the Delta Lake I created in another session.
* Of course you need to create that database before running that code
And now you could access this table using this cod:
This is a nice way opportunity for business users to consume the data using reporting tools to consume the data.
But
If you use a CSV as your source to your Hive table its going to be slow. So it does not give us any short cuts building our Lakehouse strategy. Its just a SQL layer on top of the data making it more accessible for business users. Another disadvantage could be that the datatypes is created by the source data.
Create Hive Database
If you don’t create your hive database Databricks could add a table to its Default database. I prefer to create databases that clearly declare its purpose so let’s create a Hive Metastore Database.
Drop stuff in Hive.
One cool thing with working with meta store databases is that if you drop a table the data will remain. You only drop the meta structure. The commands looks as you would guess the same as in SQL.
But if you try to drop a database that contains a lot of objects it doesn’t work either you have to empty your database and, then drop it or you could use a the command Cascade to drop all meta objects. Still, it does not delete the data.
Update rows using Hive.
When running an update using Hive the Delta table is updated. And all rollback is possible to use. This could be done because Delta Tables supports A.C.I.D (Atomicity, Consistency, Isolation, Durability). Most databases and a lot of other formats support ACID. But formats like CSV, Parquet does not support ACID and can therefore not do updates etc.
*ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability, which are the four key properties of a transaction in a database management system.
-
Atomicity: Atomicity refers to the property of a transaction that ensures that it is treated as a single, indivisible unit of work. A transaction is either completed in its entirety or not at all. If any part of a transaction fails, the entire transaction is rolled back, and the database is returned to its previous state. This ensures that the database remains consistent even in the event of system failures or errors.
-
Consistency: Consistency refers to the property of a transaction that ensures that it brings the database from one consistent state to another. This means that a transaction must obey all the integrity constraints and business rules defined for the database. If a transaction violates any of these rules, it is rolled back, and the database remains in its previous consistent state.
-
Isolation: Isolation refers to the property of a transaction that ensures that it is executed in isolation from other transactions. This means that a transaction should not be affected by the concurrent execution of other transactions. The isolation level determines the degree to which a transaction is isolated from other transactions.
-
Durability: Durability refers to the property of a transaction that ensures that once it is committed, its effects are permanent and will survive any subsequent system failures, such as power outages, crashes, or other catastrophic events. This is usually achieved by writing the transaction's changes to stable storage, such as a hard disk or non-volatile memory.
Together, these properties ensure that database transactions are reliable, consistent, and durable, even in the face of failures or errors. ACID compliance is essential in any application that requires robust and reliable data storage and retrieval, such as financial applications, e-commerce systems, and online booking systems.