Transformation
When I do my transformations, I usually try to do them while the data is in memory or while reading it in.
This is done with CASE statements and regular expressions. Both are very fast but can look a bit complex at first. Don't worry: they are not that complicated, and they are really powerful.
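As a minimal sketch of the CASE idea, here is what such a statement might look like inside a spark.sql() call. The events table and status column are hypothetical, and the plain-Python function below simply mirrors the same branching logic for illustration:

```python
# Hypothetical Spark SQL query: a CASE statement that cleans a status code
# into a readable label while the data is being selected.
case_sql = """
SELECT id,
       CASE WHEN status = 'A' THEN 'active'
            WHEN status = 'I' THEN 'inactive'
            ELSE 'unknown'
       END AS status_label
FROM events
"""

# The same branching logic in plain Python, for illustration only:
def status_label(status):
    if status == 'A':
        return 'active'
    elif status == 'I':
        return 'inactive'
    return 'unknown'

print(status_label('A'))  # active
print(status_label('Z'))  # unknown
```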
Regular expressions in PySpark use Java's regex syntax. Let's start with a list of the common tokens:
By combining these expressions, you can find, clean, and manipulate data rapidly in PySpark SQL statements and in other places as well.
For example, to test whether a date follows the pattern yyyy-mm-dd:
Explained: ^[12][09][0-9][0-9]-[01][0-9]-[0-3][0-9]
(Inside square brackets you list the allowed characters directly; a | between them is not needed and would match a literal pipe character.)
^ anchors the match to the start of the string
[12] the first digit must be 1 or 2
[09] the second digit must be 0 or 9 (covering years 19xx and 20xx)
[0-9] the third digit can be any digit from 0 to 9
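You can try the pattern out with Python's re module, whose syntax agrees with Java's for these tokens (in PySpark itself you would typically apply it with the rlike column method). Note that the alternatives inside each character class are listed directly, without pipes, since a | inside brackets matches a literal pipe:

```python
import re

# yyyy-mm-dd pattern from the text; $ is added so trailing junk also fails.
DATE_RE = re.compile(r'^[12][09][0-9][0-9]-[01][0-9]-[0-3][0-9]$')

print(bool(DATE_RE.match('2021-07-15')))  # True
print(bool(DATE_RE.match('15-07-2021')))  # False: wrong field order
print(bool(DATE_RE.match('2021-7-15')))   # False: month needs two digits
```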