DataFrames (a software engineer's perspective)

What is a DataFrame?

A DataFrame is a tabular data structure used primarily in data science and machine learning. It stores row and column data together with metadata about each row and column, such as the column names and a row index.

In comparison, a traditional 2-dimensional array does not contain column or row specific metadata.
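The difference can be seen in a minimal sketch using pandas and NumPy (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A plain 2-D array stores only the values, addressed by integer position.
arr = np.array([[1, 170.0], [2, 180.5]])

# A DataFrame wraps the same values with metadata: column names and a row index.
df = pd.DataFrame(arr, columns=["id", "height_cm"])

# The metadata lets us address data by name rather than by position.
print(arr[0, 1])               # position only
print(df.loc[0, "height_cm"])  # row label + column name
```

Both expressions return the same value; the DataFrame simply adds a layer of labels on top of the underlying array.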

DataFrames are quite helpful in this regard, but the additional metadata also means more memory overhead. For very large datasets (i.e., 100GB+), indexing and searching a DataFrame can become very slow.
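The overhead is easy to observe in pandas: the DataFrame's reported memory footprint includes its index and labels on top of the raw values. A small sketch:

```python
import numpy as np
import pandas as pd

values = np.zeros((1000, 3))
df = pd.DataFrame(values, columns=["a", "b", "c"])

# The raw array stores only the values: 1000 * 3 * 8 bytes of float64 data.
print(values.nbytes)

# The DataFrame additionally accounts for its index and per-column storage,
# so its total footprint exceeds the raw values alone.
print(df.memory_usage(index=True).sum())
```

For small frames the difference is modest, but per-row and per-column bookkeeping compounds at scale.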

Libraries

There are several different implementations of DataFrames.

  • Pandas DataFrames
    • The most popular, widely used DataFrame.
    • Designed to run in a single process, so most operations use only one CPU core.
  • Modin DataFrames
    • An extension of Pandas DataFrames that allows DataFrames to be partitioned across multiple CPU cores.
    • To do this, it delegates to a distributed compute engine (Ray or Dask), which is selected when Modin is initialized.
    • Modin is still pretty young and early in development.
  • Spark DataFrames
    • Spark itself is JVM-based (Java / Scala), so passing datasets between Python and the JVM adds serialization overhead.
    • Spark DataFrames are distributed across a cluster of machines, not just CPU cores.
  • Dask DataFrames
    • Scales Pandas DataFrames across a cluster of machines, similar to Spark.
    • However, unlike Spark DataFrames, it runs natively in Python with no JVM involved.
    • Materialization is lazy: operations build a task graph, and nothing executes until a result is explicitly requested.