Big Data Encodings
These encodings are often used with HDFS or another distributed file system. Because the data can reach terabytes or petabytes, it is crucial that files are encoded in a space-efficient way and can be read or written efficiently. For this reason, these encodings are generally not human readable (human-readable encodings are covered in the next section).
File Encoding | Storage Format | Compression | Intended for | Schema Evolution | Other Notes |
---|---|---|---|---|---|
Avro | Row | Weak | Writes | Great | Works well with Kafka. Similar to Google Protobuf and Apache Thrift; supports schema versioning with backward and forward compatibility. |
Parquet | Columnar | Good | Reads | Limited | Works well with Spark. Can store highly nested data. |
ORC | Columnar | Great | Reads | Good | Works well with Hive, where it backs ACID transactional tables. |
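
The row versus columnar distinction in the table can be illustrated with a small sketch. It uses only the Python standard library and does not use Avro, Parquet, or ORC themselves; the records, the field names, and the helpers `row_layout` and `column_layout` are hypothetical and chosen for illustration. The idea is that a columnar layout places similar values next to each other and avoids repeating field names per record.

```python
import json
import zlib

# Hypothetical records: a sequential id, a low-cardinality column, a constant column.
records = [
    {"id": i, "country": ["US", "DE", "JP"][i % 3], "status": "active"}
    for i in range(1000)
]


def row_layout(rows):
    """Serialize record by record, the way a row-oriented format lays data out."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def column_layout(rows):
    """Serialize field by field, so values of the same column sit together."""
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return json.dumps(columns).encode("utf-8")


row_bytes = row_layout(records)
col_bytes = column_layout(records)

# Compress both layouts with the same general-purpose compressor.
row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))

print("row layout:     ", len(row_bytes), "bytes raw,", row_compressed, "compressed")
print("columnar layout:", len(col_bytes), "bytes raw,", col_compressed, "compressed")

# Reading a single column from the columnar layout does not require
# touching the other fields of every record.
countries = json.loads(col_bytes)["country"]
```

Real columnar formats go further than this sketch, for example by applying per-column encodings such as dictionary and run-length encoding before general-purpose compression. A row layout, in contrast, lets a writer append one complete record at a time, which fits the write-oriented use noted for Avro.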
Human Readable Encodings
These encodings are often used to structure data in a way that is more convenient for developers and end users, with less emphasis on data size (in memory or on disk). For this reason, these formats are typically human readable.
For example, CSV and TSV are very popular output formats for data analysts who use programs like Microsoft Excel. JSON is also very popular for passing data through REST APIs, and it is convenient for web browsers because it works natively with JavaScript.
File Encoding | Characteristics | Caveats | Other Notes |
---|---|---|---|
CSV, TSV, PSV | Comma-, tab-, and pipe-separated values | Cannot distinguish between string types and number types; parsing of delimiters and newlines is ambiguous across implementations; column headers might or might not be present; some variants do not escape values, which complicates parsing when fields contain characters such as double quotes | Lightweight, simple, widely used |
JSON | Can represent deeply nested data; key-value dictionary is easy to read; works well for representing data on the front end (JavaScript) | Loses precision for large numbers in many parsers (e.g. integers beyond 2^53, the exact-integer limit of a 64-bit float); can be bloated in size; cannot embed raw binary data, a common workaround being to encode it with Base64 | Simple, widely used with REST/GraphQL |
Binary JSON (e.g. BSON) | Represents JSON as a binary sequence to save space | End result is not human readable; space savings compared to the original JSON are limited | |
XML | Schema-friendly; verbose; can represent nested data | Like JSON, has issues with numeric precision in parsers that read numbers as 64-bit floats (e.g. integers beyond 2^53) and with embedding raw binary data; like CSV, cannot distinguish between string types and number types without a schema; bloated in size, and the markup can be too verbose for a human to read easily | Widely used for configuration files and SOAP architectures |
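
Three of the caveats in the table can be reproduced with the Python standard library. The following is a minimal sketch with made-up sample values (the `zip_code` field, the integer `big`, and the byte string `raw` are illustrative assumptions). Python's own `json` module keeps large integers exact, so the precision loss is simulated by converting to a 64-bit float, which is how JavaScript and many other parsers store JSON numbers.

```python
import base64
import csv
import io
import json

# 1. CSV carries no type information: every field is read back as a string.
csv_text = "id,zip_code,price\n1,02134,19.99\n"
row = next(csv.DictReader(io.StringIO(csv_text)))
# The reader cannot know that zip_code should stay a string
# (to keep its leading zero) while price should become a number.
print(row)

# 2. A 64-bit float represents integers exactly only up to 2**53.
big = 2**53 + 1
lossy = int(float(big))  # what a double-based JSON parser would end up with
print(big, "->", lossy)

# A common workaround is to transmit large integers as strings.
safe_payload = json.dumps({"id": str(big)})
restored = int(json.loads(safe_payload)["id"])

# 3. JSON has no binary type; Base64 turns raw bytes into ASCII text.
raw = bytes([0x00, 0xFF, 0x10, 0x80])
encoded = base64.b64encode(raw).decode("ascii")
payload = json.dumps({"blob": encoded})
decoded = base64.b64decode(json.loads(payload)["blob"])
print(len(raw), "raw bytes became", len(encoded), "Base64 characters")
```

The Base64 step also shows part of the size cost mentioned in the table: every 3 bytes of input become 4 characters of output, before any JSON quoting is added.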