Draft:Compact Numerical Encoding Systems
| This is a draft article. It is a work in progress open to editing by anyone. Please ensure core content policies are met before publishing it as a live Wikipedia article. Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL Last edited by Fermiboson (talk | contribs) 2 months ago. (Update)
Finished drafting? |
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
|
Compact Numerical Encoding refers to a class of data serialization techniques designed to represent numerical values using the minimum amount of storage space or transmission bandwidth possible. Unlike fixed-width encodings (such as standard 32-bit or 64-bit integers), compact encodings adapt the storage size based on the magnitude or the statistical distribution of the data.
These systems are fundamental to modern computing, particularly in distributed systems, low-latency networking, database compression, and embedded systems where resource efficiency is a priority.
Classification
Compact encoding systems are generally classified based on how they handle the bit-stream and whether they are optimized for specific data patterns.
Variable-length quantity (VLQ)
VLQ systems utilize a variable number of bytes to represent an integer. The most common implementation uses the Most Significant Bit (MSB) of each byte as a "continuation bit" to indicate whether the following byte is part of the same number.
- LEB128 (Little-Endian Base 128): Widely used in the DWARF debug format and WebAssembly.
- Varints: A cornerstone of the Protocol Buffers (Protobuf) format developed by Google.
Bit-level universal codes
Universal codes do not align with byte boundaries; instead, they represent integers as bit-sequences where the prefix allows the decoder to determine the length.
- Elias coding: Includes Gamma, Delta, and Omega coding. Gamma coding is particularly effective for small integers.
- Golomb coding: Optimized for data following a geometric distribution, often used in lossless image and audio compression (e.g., FLAC).
Packing and Frame-of-Reference (FOR)
These systems are used to encode blocks of numbers simultaneously, frequently seen in columnar databases and time series databases (TSDB).
- Simple8b / Simple16: These algorithms pack multiple integers into a single 64-bit word based on the maximum value in the set.
- PFOR (Patched Frame-of-Reference): This technique stores a "base" value for a block and encodes only the offsets. Outliers (exceptions) are stored separately to maintain high compression ratios.
Handling signed integers
Representing negative numbers in compact formats presents a challenge, as traditional two's complement results in a large number of leading ones. To solve this, ZigZag encoding is employed. It maps signed integers to unsigned integers by alternating between positive and negative numbers:
| Signed Integer | Unsigned Mapping |
|---|---|
| 0 | 0 |
| -1 | 1 |
| 1 | 2 |
| -2 | 3 |
| 2 | 4 |
This ensures that values with small absolute magnitudes result in small encoded varints.
Decimal and financial encodings
For applications requiring high precision without binary rounding errors, specialized decimal encodings are used.
- Binary-coded decimal (BCD): Represents each decimal digit with four bits.
- Densely packed decimal (DPD): A method used in the IEEE 754-2008 standard, which compresses three decimal digits into 10 bits.
Applications
- Serialization Frameworks: Apache Thrift, Apache Avro, and Protocol Buffers use compact encodings to reduce payload size in microservices.
- Blockchain: Bitcoin and Ethereum utilize various forms of Varints to minimize the size of transactions stored on the ledger.
- Search Engines: Inverted indexes use Delta-encoding to compress document ID lists.
See also
References
Further reading
- Lemire, D. and Boytsov, L. (2015). "Decoding Billions of Integers per Second through Vectorization." Software: Practice and Experience.
- Scholer, F., Williams, H.E., Yiannis, J. and Zobel, J. (2002). "Compression of Inverted Indexes."
Content Disclaimer
Informasi ini disarikan dari Wikipedia dan disajikan kembali untuk tujuan edukasi. Konten tersedia di bawah lisensi CC BY-SA 3.0. Kami tidak bertanggung jawab atas ketidakakuratan data yang bersumber dari kontribusi publik tersebut.
- The information displayed on this website is sourced in part or in whole from Wikipedia and has been adapted for the purpose of restating it. We strive to provide accurate and relevant information, however:
- There is no guarantee of absolute accuracy. Wikipedia is an open, collaborative project that can be edited by anyone, so information is subject to change.
- It is not intended to constitute professional advice. The content displayed is for informational and educational purposes only. For important decisions (e.g., medical, legal, or financial), please consult a professional.
- Content copyright. Wikipedia is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA). This means that content may be reused with appropriate attribution and shared under a similar license.
- Responsible use. Any risk arising from the use of information from this website is entirely the responsibility of the user.