
Tweets From Raw to Cleansed


 

Download the Twitter CSV from Kaggle

I usually use Apache Parquet Viewer to inspect the output Parquet files.

In the script below you must set the Infile and Outfile paths; this is probably the only thing you need to change if you run it in a cloud-based environment. I decided to define an explicit schema to keep control over the data types. Pandas reads the CSV, PyArrow writes the Parquet file, and the output is compressed with GZIP.

 

import pandas as pd
import pyarrow as pa

# Paths to the source CSV and the Parquet output; adjust these for your environment
Infile = 'C:\\VsCodeRepo\\tweets\\tweets.csv'
Outfile = 'C:\\VsCodeRepo\\tweets\\tweets.parquet'

tweet_schema = pa.schema([
    ('Datetime', pa.string()),
    ('Tweet Id', pa.int64()),
    ('Text', pa.string()),
    ('Username', pa.string()),
    ('Permalink', pa.string()),
    ('User', pa.string()),
    ('Outlinks', pa.string()),
    ('CountLinks', pa.string()),
    ('ReplyCount', pa.int32()),
    ('RetweetCount', pa.int32()),
    ('LikeCount', pa.int32()),
    ('QuoteCount', pa.int32()),
    ('ConversationId', pa.int64()),
    ('Language', pa.string()),
    ('Source', pa.string()),
    ('Media', pa.string()),
    ('QuotedTweet', pa.string()),
    ('MentionedUsers', pa.string()),
    ('hashtag', pa.string()),
    ('hastag_counts', pa.int32())
])

def write_parquet_file():
    # Read the raw CSV into a pandas DataFrame
    df = pd.read_csv(Infile)
    # Keep only the first 98,303 rows
    dftweet = df.head(98303)
    # print(dftweet)

    # Write the Parquet file with the explicit schema, no dictionary encoding, and GZIP compression
    dftweet.to_parquet(Outfile, schema=tweet_schema, use_dictionary=False, compression='gzip')

write_parquet_file()
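If you want to verify the result in code instead of opening Apache Parquet Viewer, you can read the file back and look at the schema and a few rows. This is a minimal sketch, assuming the same Outfile path as in the script above:

import pandas as pd
import pyarrow.parquet as pq

Outfile = 'C:\\VsCodeRepo\\tweets\\tweets.parquet'

# Print the schema that was actually written to the Parquet file (column names and types)
print(pq.read_schema(Outfile))

# Load the file back into pandas and show the first rows to confirm the round trip
df_check = pd.read_parquet(Outfile)
print(df_check.head())
print(len(df_check), 'rows')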

Back to main "Create a data lake"
