SQL to Parquet using Python 3.7.6
First read SQL server using Python3 and PyODBC.
Start out downloading PyODBC via pip from pypi.org. “pip install pyodbc”. I created a file called getsql.py. I used AdventureWorksDW2017 downloaded from Microsoft in this example. This function will return a list of customers that will be used in creating anParquet file.
import pandas.io.sql
import pyodbc
import pandas as pd
conn = pyodbc.connect('Driver={SQL Server};'
'Server=name of the sql server;'
'Database=AdventureWorksDW2017;'
'Trusted_Connection=yes;')
query= """SELECT
GeographyKey,
case when Gender = 'F' then 0 else 1 end Gender,
cast((DATEDIFF(DAY, BirthDate, GetDate()) / 365) as int) Age,
cast(YearlyIncome as int) YearlyIncome
FROM DimCustomer
"""
To create the Parquet file I used Pandas.
Run pip install pandas. Then its easy to just read the query into to the file compressed to gzip (small and fast).
df= pandas.io.sql.read_sql(query, conn)
df.to_parquet('TrainingData.gzip', compression='gzip')