
Stream Load data 

Loading Raw data using Databricks Autoloader

If your ingestion data arrives in one of these formats:

  • CSV (Comma Separated Values)

  • JSON (JavaScript Object Notation)

  • AVRO (Apache Avro)

  • Parquet

  • ORC (Optimized Row Columnar)

  • Delta Lake

You can use Databricks Autoloader to ingest data into your Bronze (raw data) layer. It is an easy way to set up a stream load or near real-time load of your files. Autoloader can detect the schema of the data being ingested and adjust the target Delta table accordingly, and it supports several options for handling errors and schema evolution, making it a powerful tool for diverse and constantly changing data sources.
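To make this concrete, here is a minimal sketch of an Autoloader read stream with schema inference and evolution enabled. The paths are hypothetical placeholders, and I am assuming a CSV source and a Databricks notebook where `spark` is already available:

```python
# Minimal Auto Loader read sketch; the "cloudFiles" source format is Autoloader.
# All paths below are hypothetical placeholders.
raw_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")                                # also json, avro, parquet, orc
        .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/wine")  # where the inferred schema is tracked
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")         # evolve when new columns show up
        .option("header", "true")
        .load("/mnt/landing/wine_reviews/")
)
```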

In the example I have loaded Wine Reviews from Kaggle and, in this case, defined a schema up front. This is not necessary if you prefer to let the schema be inferred and evolve as the data lands in raw.
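For reference, defining the schema explicitly could look roughly like this. The column names are my assumption of what the Kaggle Wine Reviews CSV contains; adjust them to the actual file headers:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical subset of the Wine Reviews columns; adjust names and types to the real CSV.
wine_schema = StructType([
    StructField("country", StringType(), True),
    StructField("description", StringType(), True),
    StructField("points", IntegerType(), True),
    StructField("price", DoubleType(), True),
    StructField("variety", StringType(), True),
    StructField("winery", StringType(), True),
])

wine_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .schema(wine_schema)                    # explicit schema, so no inference is needed
        .load("/mnt/landing/wine_reviews/")     # hypothetical landing folder
)
```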

The result of using Databricks Autoloader is a Delta Live Table (DLT), as you will see if you run the code. The difference from a plain Delta Lake table is that we have added a checkpoint folder containing the metadata that controls the loading process.
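A sketch of the write side, assuming the wine_stream from above; the checkpointLocation option points at the checkpoint folder described here (path and table name are hypothetical):

```python
# Write the stream to a Bronze Delta table. The checkpoint folder is where
# Structured Streaming / Auto Loader records which files have already been processed.
query = (
    wine_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/wine_reviews")  # hypothetical path
        .outputMode("append")
        .toTable("bronze.wine_reviews")                                          # hypothetical table name
)
```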

What happens if you have a glitch in the middle of a loading process? No worries: the checkpoint folder keeps track of your files, so the load continues from where it left off, even if that is in the middle of a file. Autoloader can sit and listen to a folder and load new data as it comes in, or it can be triggered, which does not cost as much compute (continuous listening can be expensive).
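The difference between listening and a triggered run comes down to the trigger setting, roughly like this (a sketch; availableNow needs a fairly recent runtime, and trigger(once=True) is the older equivalent):

```python
# Continuous listening: leave the trigger out and the cluster stays up,
# picking up new files as they arrive.
# Triggered run: process everything that has landed since the last checkpoint,
# then stop, so you only pay for compute while the job actually runs.
query = (
    wine_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/wine_reviews")
        .trigger(availableNow=True)   # or .trigger(once=True) on older runtimes
        .toTable("bronze.wine_reviews")
)
query.awaitTermination()              # returns once the backlog is processed
```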

 

One idea could be to use a Kafka event to start triggered loads. That way the job does not sit listening until something happens, and the whole thing is still streamed without wasting compute. Or simply trigger it from Data Factory.

dltresult.jpg