
Tweets From Raw to Cleansed


 

Download the Twitter CSV from Kaggle

I usually use Apache Parquet Viewer to inspect the output Parquet files.

In the script below you must set the Infile and Outfile paths; this is probably the only thing you need to change if you run it in a cloud-based environment. I decided to define an explicit schema to keep control over the data types. Pandas reads the CSV, PyArrow writes the Parquet file, and the output is compressed with GZIP.

 

import pandas as pd
import pyarrow as pa

# Paths to the source CSV and the Parquet output; adjust these for your environment
Infile = 'C:\\VsCodeRepo\\tweets\\tweets.csv'
Outfile = 'C:\\VsCodeRepo\\tweets\\tweets.parquet'

tweet_schema = pa.schema([
    ('Datetime', pa.string()),
    ('Tweet Id', pa.int64()),
    ('Text', pa.string()),
    ('Username', pa.string()),
    ('Permalink', pa.string()),
    ('User', pa.string()),
    ('Outlinks', pa.string()),
    ('CountLinks', pa.string()),
    ('ReplyCount', pa.int32()),
    ('RetweetCount', pa.int32()),
    ('LikeCount', pa.int32()),
    ('QuoteCount', pa.int32()),
    ('ConversationId', pa.int64()),
    ('Language', pa.string()),
    ('Source', pa.string()),
    ('Media', pa.string()),
    ('QuotedTweet', pa.string()),
    ('MentionedUsers', pa.string()),
    ('hashtag', pa.string()),
    ('hastag_counts', pa.int32())
])

def write_parquet_file():
    # Read the raw CSV into a pandas DataFrame
    df = pd.read_csv(Infile)
    # Keep only the first 98,303 rows
    dftweet = df.head(98303)
    # print(dftweet)

    # Write the Parquet file with the explicit schema, no dictionary encoding, and GZIP compression
    dftweet.to_parquet(Outfile, schema=tweet_schema, use_dictionary=False, compression='gzip')

write_parquet_file()
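If you want to verify the result in code instead of opening Apache Parquet Viewer, you can read the file back and look at the schema and a few rows. This is a minimal sketch, assuming the same Outfile path as in the script above:

import pandas as pd
import pyarrow.parquet as pq

Outfile = 'C:\\VsCodeRepo\\tweets\\tweets.parquet'

# Print the schema that was actually written to the Parquet file (column names and types)
print(pq.read_schema(Outfile))

# Load the file back into pandas and show the first rows to confirm the round trip
df_check = pd.read_parquet(Outfile)
print(df_check.head())
print(len(df_check), 'rows')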

Back to main "Create a data lake"
