Tweets From Raw to Cleansed
Download the Twitter CSV from Kaggle.
I usually use Apache Parquet Viewer to inspect the output of Parquet files.
In the script below you must set an Infile and an Outfile; these are probably the only things you need to change when running in a cloud-based environment. I decided to use a schema to keep control over the data types. The script uses Pandas to read the CSV and PyArrow to write the Parquet file, and the output file is compressed with GZIP.
import pandas as pd
import pyarrow as pa
# Input CSV and output Parquet file -- adjust these paths for your environment
Infile = 'C:\\VsCodeRepo\\tweets\\tweets.csv'
Outfile = 'C:\\VsCodeRepo\\tweets\\tweets.parquet'
# Explicit Arrow schema so the data types in the Parquet file are under our control
tweet_schema = pa.schema([
    ('Datetime', pa.string()),
    ('Tweet Id', pa.int64()),
    ('Text', pa.string()),
    ('Username', pa.string()),
    ('Permalink', pa.string()),
    ('User', pa.string()),
    ('Outlinks', pa.string()),
    ('CountLinks', pa.string()),
    ('ReplyCount', pa.int32()),
    ('RetweetCount', pa.int32()),
    ('LikeCount', pa.int32()),
    ('QuoteCount', pa.int32()),
    ('ConversationId', pa.int64()),
    ('Language', pa.string()),
    ('Source', pa.string()),
    ('Media', pa.string()),
    ('QuotedTweet', pa.string()),
    ('MentionedUsers', pa.string()),
    ('hashtag', pa.string()),
    ('hastag_counts', pa.int32())
])
def write_parquet_file():
    df = pd.read_csv(Infile)
    # Keep the first 98303 rows of the CSV
    dftweet = df.head(98303)
    # print(dftweet)
    dftweet.to_parquet(Outfile, schema=tweet_schema, use_dictionary=False, compression='gzip')

write_parquet_file()
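
If you don't have Apache Parquet Viewer at hand, you can also sanity-check the result directly with PyArrow. The snippet below is a minimal sketch that reads the Parquet file back and prints its schema and row count; it assumes the same output path as the script above.

import pyarrow.parquet as pq

# Read the Parquet file back and inspect what was actually written
table = pq.read_table('C:\\VsCodeRepo\\tweets\\tweets.parquet')
print(table.schema)              # column names and Arrow data types
print(table.num_rows)            # should match the number of rows written
print(table.to_pandas().head())  # peek at the first rows as a DataFrame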