NumPy
NumPy is a Python library for numerical computing that offers multi-dimensional arrays and indices as data structures and additional high-level math utilities.
ndarray
The unique offering of NumPy is the ndarray
data structure, which stands for n-dimensional array.
The key difference between a traditional Python list
and ndarray
is Python lists can allow any type of object inside the list. On the other hand, ndarray
is homogenously typed, meaning that each value inside the array must be of the same type. For that reason a single-dimension ndarray
is similar to a C array, where the type and the array sizes are explicitly defined beforehand, i.e. int[10]
, char[10]
.
>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> x
array([1, 2, 3])
>>> y = np.arange(10) # like Python's list(range(10)), but returns an array
>>> y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Pandas
Pandas is a Python library for data manipulation and analysis, and also offers data structures and utilities. Most notably, Pandas and the Pandas DataFrame data structure is used frequently in the ML space.
DataFrame
A DataFrame is a special data structure used primarily by data scientists and machine learning algorithms. It contains row and column data in tabular fashion by storing metadata about each column and row. For instance, one type of metadata would be the column name.
In comparison, a traditional 2-dimensional array does not contain column or row specific metadata.
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
DataFrames in one sense can be viewed as an annotated table data structure. The annotations are quite helpful in analysis, but the additional metadata also means more memory overhead. DataFrames for very large datasets (i.e., 100GB+) can be very slow to index and search through.
For more details, see here.
NumPy and Pandas
For experienced data scientists, you may realize that these two libraries are often used in conjunction, rather than in isolation. One major reason for this is because Pandas uses NumPy under the hood.
Pandas is built on top of NumPy
Pandas has a dependency on NumPy, and uses NumPy to create it's own tabular data structures such as DataFrames. As a result, many Pandas APIs have support for NumPy data types.
Computations and Tables
Working with NumPy gives you tools for many mathematical formulas that are used in Linear Algebra. The n-dimensional data structures are built for fast and optimal data access and storage. The downside however, is that NumPy code can be often hard to read, debug, and re-use. Since NumPy data structures are built for computational performance, it usually involves programming with a statically typed mindset - for example, knowing your data types beforehand, or the size of your fixed array. For a single dimensional array, it is trivial, but for multi-dimensional matrices it can get quite convoluted fast.
Pandas DataFrames on the other hand, provides additional metadata for table-like datasets, which can be thought of as annotated two-dimensional matrices. While this isn't an apples-to-apples comparison (NumPy easily allows for higher dimension collections), DataFrames are purposely and intentionally easier to use for programmers and data scientists alike.
Dask
Dask is a Python library for running distributed Python applications. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud.
Dask comes with high-level and low-level data collections. Among the notable few:
- Dask DataFrames
- Extension of Pandas DataFrames, but for parallel computing
- Dask Arrays
- Extension of numpy arrays, but for parallel computing
Dask also comes with utilities to schedule tasks and autoscale clusters using cloud providers such as AWS, GCP, and Azure.
Ray
Ray is a framework for running distributed Python applications, similar to Dask. Ray also comes with autoscaler utilities to spin up clusters on the cloud using cloud providers such as AWS.
One notable difference with Ray and Dask is that Ray was initially designed to be low level enough for scaling any type of Python applications, where as Dask offered custom extensions of Pandas DataFrames and NumPy arrays out-of-the-box, built for parallel computing. Today, Ray offers Ray Datasets, which allows developers to read/write data from various content types (i.e. CSV, Parquet) and Spark/Pandas DataFrames or NumPy arrays.
Modin
Modin is a library that extends Pandas so that Pandas DataFrames can utilize multiple CPU cores. To accomplish this, Modin utilizes a distributed compute engine, like Dask or Ray, to execute Python code in parallel and bypass the GIL limitation for a single Python process. Modin focuses on improving the performance of reading/writing Pandas DataFrames and can be used on a single computer with significants performance benefits.