GGUF
| GGUF | |
|---|---|
| Filename extension | .gguf |
| Magic number | 0x47 0x47 0x55 0x46 |
| Developed by | Georgi Gerganov and community |
| Initial release | August 22, 2023[1] |
| Latest release | v3[2] |
| Type of format | Machine-learning tensors |
The GGUF (GGML Universal File)[3] file format is a binary file format that stores both tensors and metadata in a single file, and is designed for fast saving and loading of model data.[4] It was introduced in August 2023 by the llama.cpp project to better maintain backwards compatibility as support was added for other model architectures.[5][6] It superseded previous formats used by the project such as GGML, and is typically produced by converting models developed with a different machine learning library such as PyTorch.[4]
GGUF has become the standard format for distributing quantized large language models for local inference, and is natively supported by tools including llama.cpp, Ollama, LM Studio, GPT4All, Jan, and koboldcpp.[4] As of 2026, tens of thousands of GGUF checkpoints are hosted on Hugging Face, which provides first-class integration including a metadata viewer, an inference endpoint service, and a JavaScript parser library.[4]
History
The model file formats used by the llama.cpp project evolved through four named stages: GGML, GGMF, GGJT, and GGUF.[7] The original GGML format was a thin tensor container that hard-coded model hyperparameters and tokenizer information inside the loader; adding support for a new model architecture or quantization scheme typically required code changes that broke compatibility with existing files.[6]
As llama.cpp grew during 2023 to support additional architectures beyond Llama — including Mistral, Falcon, and others — these limitations became increasingly difficult to manage. GGUF was introduced on 21 August 2023 as a backwards-incompatible successor, with the specification formalized through pull request #302 in the ggerganov/ggml repository.[1] The format inherits GGJT's overall layout but replaces its flat hyperparameter list with a structured key–value metadata system, allowing new fields — architecture details, tokenizer vocabularies, training parameters — to be added without modifying the loader or breaking older models.[7]
The format itself has gone through three internal versions. Version 1 established the basic structure; version 2 widened most countable values (such as lengths) from 32-bit to 64-bit integers to accommodate larger models; and version 3, the current version, added optional big-endian support.[2]
Design
GGUF is designed around quantization, the practice of storing model weights at reduced numerical precision. Quantization lowers memory usage and can speed up inference, at the cost of some model accuracy.[8][6]
The format is designed to be:
- Self-contained — a single file holds the tensors, the tokenizer and all metadata needed to load and run the model, eliminating the need for accompanying configuration files.[4]
- Memory-mappable — tensor data is aligned (by default to a 32-byte boundary) so that weights can be accessed directly through pointers without loading the entire file into RAM, allowing models larger than available memory to be served through operating-system paging.[2]
- Extensible — the key–value metadata block lets new fields be added without breaking compatibility with older readers.[7]
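The memory-mapping property can be illustrated with a small Python sketch. It does not read a real GGUF file: the 64-byte stand-in file, its 32-byte "header" and the offsets used are assumptions chosen only to show that a mapped region can be sliced without first reading the whole file into memory.

```python
import mmap
import os
import tempfile

# Stand-in file (hypothetical layout): 32 bytes of "header" followed by
# 32 bytes of "tensor data" that start on a 32-byte boundary.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"H" * 32 + bytes(range(32)))
    path = tmp.name

with open(path, "rb") as f:
    # Map the whole file read-only; pages are loaded by the OS on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # memoryview slices refer to the mapping itself rather than a copy.
        with memoryview(mapped) as view, view[32:64] as tensor_bytes:
            first_values = [tensor_bytes[i] for i in range(4)]

os.remove(path)
print(first_values)  # [0, 1, 2, 3]
```

Real loaders apply the same idea to multi-gigabyte files, which is why the alignment of the tensor data region matters.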
GGUF supports 2-bit to 8-bit quantized integer types,[9] common floating-point data formats such as float32, float16 and bfloat16, and 1.58-bit quantization.[10] Several "K-quant" variants (such as Q4_K, Q5_K and Q6_K) group weights into super-blocks that are subdivided into smaller blocks, each with its own quantized scale and, for some types, minimum value; this generally gives better quality at a given bit-width than the simpler legacy quantizations such as Q4_0 and Q8_0.[2]
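The trade-off between size and accuracy can be made concrete with a simplified sketch of block quantization in the style of the legacy Q8_0 type. It assumes blocks of 32 weights that share one scale, with each weight stored as a signed 8-bit integer. The function names are hypothetical, the scale is kept as a Python float rather than a float16, and the code is an illustration rather than the llama.cpp implementation.

```python
BLOCK_SIZE = 32  # weights per block, as assumed for a Q8_0-style layout

def quantize_block(weights):
    """Return (scale, quants): one shared scale and one int8 value per weight."""
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero block.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [round(w / scale) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Reconstruct approximate float weights from a quantized block."""
    return [scale * q for q in quants]

weights = [(i - 16) / 16.0 for i in range(BLOCK_SIZE)]  # -1.0 up to 0.9375
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 32 one-byte values plus one 2-byte float16 scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(bits_per_weight)  # 8.5
```

The rounding error per weight is bounded by half the scale, so blocks containing large outliers lose more precision. The K-quant types address this with finer-grained scales.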
GGUF contains the information necessary for running a GPT-like language model such as the tokenizer vocabulary, context length, tensor info and other attributes.[11]
File structure
A GGUF file consists of four sequential sections: a fixed-size header, a key–value metadata block, a tensor information block, and the tensor data itself.[2]
Byte-level structure (little-endian)
| Bytes | Description[7] |
|---|---|
| 4 | GGUF magic number, currently set to 0x47 0x47 0x55 0x46 |
| 4 | GGUF version, currently set to 3 |
| 8 | UINT64 tensor_count: number of tensors |
| 8 | UINT64 metadata_kv_count: number of metadata key-value pairs |
| Variable | Metadata block, containing metadata_kv_count key-value pairs |
| Variable | Tensors info block, containing tensor_count values |
| Variable | uint8_t tensor_data[], weight bits block |
Prior to version 3, files were implicitly little-endian; version 3 permits big-endian storage but does not include a flag indicating endianness, so it must be inferred from context.[2]
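As a minimal sketch, the 24-byte fixed header from the table above can be decoded with Python's struct module. The function name parse_gguf_header and the sample counts are hypothetical, and the code assumes little-endian storage with the version field read as an unsigned 32-bit integer.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the bytes 0x47 0x47 0x55 0x46

def parse_gguf_header(buf: bytes) -> dict:
    """Decode the fixed-size header, assuming little-endian storage."""
    if len(buf) < 24:
        raise ValueError("header needs at least 24 bytes")
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic number")
    # "<" = little-endian, I = uint32 version, Q = uint64 counts
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata pairs.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

For a file on disk, the same function could be applied to the first 24 bytes read in binary mode. A big-endian file would need ">" in place of "<" in the format string.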
Metadata block
The metadata block is a sequence of typed key–value pairs. Keys are namespaced strings (for example general.*, tokenizer.*, or an architecture-specific prefix such as llama.*), and values may be scalars, strings, or arrays — including multi-dimensional arrays.[2]
// example metadata
general.architecture: 'llama',
general.name: 'LLaMA v2',
llama.context_length: 4096,
... ,
general.file_type: 10, // (typically indicates quantization level, here "MOSTLY_Q2_K")
tokenizer.ggml.model: 'llama',
tokenizer.ggml.tokens: [
'<unk>', '<s>', '</s>', '<0x00>', '<0x01>', '<0x02>',
'<0x03>', '<0x04>', '<0x05>', '<0x06>', '<0x07>', '<0x08>',
...
],
...
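A reader for such key–value pairs might look like the following sketch. The numeric value-type IDs and the string encoding (a UINT64 length followed by UTF-8 bytes) are not stated in the text above; they are assumptions based on the GGUF specification and should be checked against it. All function names are hypothetical, and little-endian storage is assumed.

```python
import struct

# Value-type IDs as assumed from the GGUF specification (not given in the text).
SCALAR_FORMATS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
TYPE_STRING, TYPE_ARRAY = 8, 9

def read_string(buf, pos):
    """Read a length-prefixed UTF-8 string; return (text, next position)."""
    (length,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    return buf[pos:pos + length].decode("utf-8"), pos + length

def read_value(buf, pos, vtype):
    """Read one value of the given type; arrays recurse per element."""
    if vtype == TYPE_STRING:
        return read_string(buf, pos)
    if vtype == TYPE_ARRAY:
        elem_type, count = struct.unpack_from("<IQ", buf, pos)
        pos += 12
        items = []
        for _ in range(count):
            item, pos = read_value(buf, pos, elem_type)
            items.append(item)
        return items, pos
    fmt = SCALAR_FORMATS[vtype]
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)

def read_kv(buf, pos):
    """Read one pair: key string, UINT32 value type, then the value."""
    key, pos = read_string(buf, pos)
    (vtype,) = struct.unpack_from("<I", buf, pos)
    value, pos = read_value(buf, pos + 4, vtype)
    return key, value, pos

def pack_string(text):
    """Helper for building synthetic test data."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw

# Two synthetic pairs mirroring the example above.
blob = (
    pack_string("general.architecture") + struct.pack("<I", 8) + pack_string("llama")
    + pack_string("llama.context_length") + struct.pack("<II", 4, 4096)
)
pos, metadata = 0, {}
for _ in range(2):
    key, value, pos = read_kv(blob, pos)
    metadata[key] = value
print(metadata)
```

Because every value carries its own type tag, a reader can skip keys it does not recognize, which is what makes the metadata block extensible.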
Tensors info block
For each tensor, the info block stores its name, number of dimensions, shape, data type and byte offset within the subsequent tensor_data[] region. The naming scheme is standardized across architectures (for example, blk.0.ffn_gate.weight), so that loaders can locate weights regardless of the source framework.[7]
// n-th tensor
name: GGUF string, // ex: "blk.0.ffn_gate.weight"
n_dimensions: UINT32, // ex: 2
dimensions: UINT64[], // ex: [ 4096, 32000 ]
type: UINT32, // ex: 10 (typically indicates quantization level, here "GGML_TYPE_Q2_K")
offset: UINT64 // starting position within the tensor_data block, relative to the start of the block
// (n+1)-th tensor
...
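The record layout above translates fairly directly into a reader. The sketch below assumes little-endian storage and that a GGUF string is a UINT64 length followed by UTF-8 bytes (an assumption based on the GGUF specification, not stated in the text). The names TensorInfo and read_tensor_info are hypothetical.

```python
import struct
from typing import NamedTuple

class TensorInfo(NamedTuple):
    name: str
    dimensions: tuple
    type_id: int
    offset: int  # relative to the start of the tensor data block

def read_tensor_info(buf, pos):
    """Read one tensor info record; return (TensorInfo, next position)."""
    (name_len,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    dims = struct.unpack_from(f"<{n_dims}Q", buf, pos)  # one UINT64 per dimension
    pos += 8 * n_dims
    type_id, offset = struct.unpack_from("<IQ", buf, pos)
    return TensorInfo(name, dims, type_id, offset), pos + 12

# Synthetic record matching the example values above.
raw_name = b"blk.0.ffn_gate.weight"
record = (
    struct.pack("<Q", len(raw_name)) + raw_name
    + struct.pack("<I", 2)
    + struct.pack("<2Q", 4096, 32000)
    + struct.pack("<IQ", 10, 0)
)
info, end = read_tensor_info(record, 0)
print(info)
```

Calling the function tensor_count times, each time starting at the returned position, would walk the whole info block.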
Tensor data
Tensor data follows the info block and begins at the next alignment boundary. The alignment value is itself stored in the metadata under the key general.alignment; if absent, it defaults to 32 bytes.[2] Aligning the data this way is what allows the file to be memory-mapped directly: tensor weights can be read through pointers without an additional copy, which is important for SIMD operations, GPU DMA transfers and CPU cache efficiency.[7]
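The padding rule can be written as a short calculation. The helper names below are hypothetical, and metadata is assumed to be a plain dictionary of already-parsed key–value pairs.

```python
DEFAULT_ALIGNMENT = 32  # used when general.alignment is absent

def align_offset(offset, alignment=DEFAULT_ALIGNMENT):
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment

def tensor_data_start(end_of_tensor_info, metadata):
    """Position where the tensor data block begins in the file."""
    alignment = metadata.get("general.alignment", DEFAULT_ALIGNMENT)
    return align_offset(end_of_tensor_info, alignment)

print(align_offset(1000))                                  # 1024
print(align_offset(1024))                                  # 1024
print(tensor_data_start(1000, {"general.alignment": 64}))  # 1024
```

The absolute file position of a tensor is then the start of the data block plus that tensor's offset from the info block.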
Tooling
Models stored in other frameworks are typically converted to GGUF using the convert_hf_to_gguf.py script bundled with llama.cpp, which reads Hugging Face checkpoints (commonly in safetensors form) and emits a GGUF file in a chosen base precision such as f16 or bf16.[12] The resulting file can then be re-quantized to one of the GGUF integer formats with the llama-quantize utility, and very large models can be sharded across multiple files with llama-gguf-split.[12]
The llama.cpp project also exposes a C/C++ API (declared in ggml/include/gguf.h) and a Python package, gguf-py, for reading and writing GGUF files programmatically.[13]
Adoption
GGUF is the native model format of llama.cpp and of Ollama, which uses llama.cpp as its inference backend; Ollama models pulled from its registry are GGUF files internally, and arbitrary GGUF files from Hugging Face can be loaded by referencing them as hf.co/{user}/{repo}.[4] Other inference applications that consume GGUF directly include LM Studio, GPT4All, Jan and koboldcpp.[4]
Hugging Face supports the format as a first-class citizen on its model hub, providing a GGUF metadata viewer, filtering by the gguf tag, an inference endpoints integration, and a large collection of community-uploaded GGUF checkpoints.[4]
References
- [1] "gguf : universal format for ggml models (#302)". GitHub. 21 August 2023. Retrieved 18 May 2026.
- [2] "GGUF". GitHub – ggml-org/ggml. Retrieved 18 May 2026.
- [3] "gguf · PyPI". Retrieved 18 May 2026.
- [4] "GGUF". Hugging Face Documentation. Retrieved 18 May 2026.
- [5] Rajput, Shubham. "An introduction to GGUF format". Medium. Retrieved 18 May 2026.
- [6] Mucci, Tim. "GGUF versus GGML". IBM. Retrieved 18 May 2026.
- [7] "GGUF specification". GitHub – ggml-org/ggml. Retrieved 18 May 2026.
- [8] "Quantize Llama models with GGUF and llama.cpp". Towards Data Science. Retrieved 18 May 2026.
- [9] Cabezas, Darío (2024). "Quantization of Large Language Models with an Overdetermined Basis". arXiv:2404.09737 [cs.LG].
- [10] "Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it". The Register. 14 July 2024. Retrieved 18 May 2026.
- [11] "Accelerating GGUF Models with Transformers". Hugging Face Blog. Retrieved 18 May 2026.
- [12] "llama.cpp". GitHub – ggml-org/llama.cpp. Retrieved 18 May 2026.
- [13] "gguf.h". GitHub – ggml-org/llama.cpp. Retrieved 18 May 2026.
External links
- Official GGUF specification on GitHub
- GGUF documentation on Hugging Face