Pandas, CuDF, Modin, Arrow, Spark and a Billion Taxi Rides
We are continuing our saga of CPU vs. GPU articles comparing the most common data-processing toolkits, and this time the topic is tabular data. Specifically, we will compare frameworks with Pandas-like Python interfaces on a dataset often used to benchmark SQL databases.
TLDR: Use Arrow to parse large datasets, split them into batches, and process those batches via CuDF 15x faster!
If you are looking for a broader overview of the topic, my recent PyData talk on “Accelerated Data-Science Libraries” was just published on YouTube.