Welcome to auto_adpq’s documentation!¶
Package overview¶
Auto-AdpQ.
- class auto_adpq.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)[source]¶
Bases: object
Adaptive Post-Training Quantization driver.
This class implements the end-to-end AdpQ flow: outlier detection, separate quantization of non-outlier and outlier values, and packaging of the quantized representation into AdpQQuantizedWeights.
- linear_target_layers = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj')¶
- quantize(sub_vector: list[float] | ndarray | Tensor) tuple[ndarray, float, float][source]¶
Quantize a 1-D sub-vector.
The function supports symmetric and asymmetric quantization. For symmetric quantization, the zeropoint is not used and is set to np.nan before conversion to np.float16.
- Parameters:
sub_vector (Union[list[float], np.ndarray, torch.Tensor]) – 1-D numeric array-like containing values to quantize.
- Returns:
(quantized, scale, zeropoint), where quantized is an np.int8 array and scale/zeropoint are returned as np.float16 values.
- Return type:
Tuple[np.ndarray, float, float]
- Raises:
ValueError – If input vector leads to invalid arithmetic (e.g. division by zero for a zero vector in symmetric mode).
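The symmetric and asymmetric modes described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the name quantize_sketch, the min/max scaling rule, and the convention of storing the minimum value as the zeropoint are all assumptions made for the example.

```python
import numpy as np


def quantize_sketch(sub_vector, q_bit=4, symmetric=True):
    """Quantize a 1-D vector; returns (quantized, scale, zeropoint).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(sub_vector, dtype=np.float64)
    if symmetric:
        qmax = 2 ** (q_bit - 1) - 1  # 7 for 4 bits
        max_abs = float(np.max(np.abs(x)))
        if max_abs == 0.0:
            # A zero vector would give scale == 0 and a division by zero.
            raise ValueError("cannot derive a scale from an all-zero vector")
        scale = max_abs / qmax
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        # The zeropoint is unused in symmetric mode, so NaN is returned.
        return q, np.float16(scale), np.float16(np.nan)

    # Asymmetric (assumed convention): map [min, max] onto [0, 2**q_bit - 1].
    levels = 2 ** q_bit - 1
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        raise ValueError("cannot derive a scale from a constant vector")
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.int8)
    return q, np.float16(scale), np.float16(lo)
```

Under these assumptions, dequantization is q * scale in symmetric mode and q * scale + zeropoint in asymmetric mode.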
- pack_bits(quantized_weights: ndarray) ndarray[source]¶
Pack quantized weights vector.
- Parameters:
quantized_weights (np.ndarray) – The quantized weights, given as a 2-D array of shape (M, N).
- Returns:
the bit-packed quantized weights.
- Return type:
np.ndarray
- unpack_bits(packed_weights: ndarray) ndarray[source]¶
Unpack bit-packed quantized weights.
- Parameters:
packed_weights (np.ndarray) – the bit-packed quantized weights.
- Returns:
the unpacked quantized weights.
- Return type:
np.ndarray
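To illustrate what bit packing and unpacking involve, here is a sketch that stores eight signed 4-bit values in each 32-bit word. The function names pack_4bit and unpack_4bit, the little-endian nibble order, and the two's-complement encoding are assumptions for the example; the layout actually used by pack_bits may differ.

```python
import numpy as np


def pack_4bit(q):
    """Pack an (M, N) int8 array of 4-bit values into (M, N // 8) uint32 words."""
    q = np.asarray(q, dtype=np.int8)
    m, n = q.shape
    if n % 8 != 0:
        raise ValueError("N must be a multiple of 8 for 4-bit packing")
    # Keep the low nibble of each value (two's complement for negatives).
    nibbles = (q.astype(np.uint8) & 0x0F).astype(np.uint32).reshape(m, n // 8, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    # Value i of each block of eight lands in bits [4*i, 4*i + 3].
    return np.bitwise_or.reduce(nibbles << shifts, axis=-1)


def unpack_4bit(packed):
    """Invert pack_4bit, returning an (M, N) int8 array."""
    packed = np.asarray(packed, dtype=np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = ((packed[..., None] >> shifts) & 0x0F).astype(np.int8)
    # Sign-extend: nibbles 8..15 represent -8..-1.
    signed = np.where(nibbles > 7, nibbles - 16, nibbles).astype(np.int8)
    return signed.reshape(packed.shape[0], -1)
```

With this layout the packed array uses one eighth as many elements as the int8 input, and unpack_4bit(pack_4bit(q)) returns q for any values in the range -8 to 7.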
- reconstruct_weights(adpq_quantized_weights: AdpQQuantizedWeights) ndarray[source]¶
Reconstruct the full matrix from an AdpQQuantizedWeights object.
- Parameters:
adpq_quantized_weights (AdpQQuantizedWeights) – Container produced by AdpQ_quantize() that includes scales, zeropoints, quantized vectors and outlier indices.
- Returns:
Reconstructed matrix with dtype np.float16 and shape matching original_shape from the provided object.
- Return type:
np.ndarray
- save_weights(adpq_quantized_weights: AdpQQuantizedWeights, filepath: str, weight_name: str = 'weights')[source]¶
Save the AdpQQuantizedWeights to a file.
- Parameters:
adpq_quantized_weights (AdpQQuantizedWeights) – The quantized weights to save.
weight_name (str) – The name of the weight matrix.
filepath (str) – The path to the file where the weights will be saved.
TODO: Data packing fails at the moment!
- load_weights(filepath: str) AdpQQuantizedWeights[source]¶
Load the AdpQQuantizedWeights from a file.
- Parameters:
filepath (str) – The path to the file where the weights are saved.
- Returns:
The loaded quantized weights.
- Return type:
AdpQQuantizedWeights
- lasso_outlier_detection(matrix: list[float] | ndarray | Tensor) tuple[ndarray, float][source]¶
Detect outliers using an adaptive LASSO-inspired method.
The method searches for a regularization parameter that produces a target fraction of outliers (alpha) using a Brent-like root finding procedure. The selection criterion follows:
hat_w_i = sign(w_i) * ReLU(|w_i| - lambda' / |w_i|)
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – 2-D array shaped (num_groups, group_size) containing values to analyze.
- Returns:
(outlier_indices, outlier_ratio), where outlier_indices is an integer array listing per-group outlier positions and outlier_ratio is the fraction of entries detected as outliers.
- Return type:
Tuple[np.ndarray, float]
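The search described above can be sketched as follows. An entry survives the shrinkage (and is treated as an outlier) exactly when |w_i| - lambda' / |w_i| > 0, so the outlier fraction decreases as lambda' grows and a one-dimensional search can find the value that hits alpha. This sketch uses plain bisection in place of the Brent-like procedure and returns a boolean mask instead of an index array; the names shrink and detect_outliers are hypothetical.

```python
import numpy as np


def shrink(w, lam):
    """hat_w = sign(w) * ReLU(|w| - lam / |w|), defined as 0 where w == 0."""
    a = np.abs(w)
    safe = np.where(a > 0, a, 1.0)  # avoid dividing by zero
    return np.where(a > 0, np.sign(w) * np.maximum(a - lam / safe, 0.0), 0.0)


def detect_outliers(matrix, alpha=0.06, n_iters=100):
    """Return (mask, ratio): mask marks outliers, ratio is their fraction."""
    w = np.asarray(matrix, dtype=np.float64)
    # lam = 0 keeps every non-zero entry; lam = max(w**2) keeps none.
    lo, hi = 0.0, float(np.max(w ** 2))
    for _ in range(n_iters):
        lam = 0.5 * (lo + hi)
        ratio = np.count_nonzero(shrink(w, lam)) / w.size
        if ratio > alpha:
            lo = lam  # too many outliers: shrink harder
        else:
            hi = lam  # at or below the target: try a smaller penalty
    mask = shrink(w, hi) != 0
    return mask, float(mask.mean())
```

Because the outlier fraction is a step function of lambda', the target alpha is generally matched only approximately. The sketch returns the largest outlier set whose fraction does not exceed alpha.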
- AdpQ_quantize(matrix: list[float] | ndarray | Tensor) AdpQQuantizedWeights[source]¶
Quantize a matrix using the AdpQ (LASSO-based) flow.
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.
- Returns:
Container with quantized values, scales, optional zeropoints and outlier indices.
- Return type:
AdpQQuantizedWeights
- quantize_reconstruct(matrix: list[float] | ndarray | Tensor) ndarray[source]¶
Quantize and reconstruct a matrix using AdpQ.
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.
- Returns:
Reconstructed matrix after quantization.
- Return type:
np.ndarray
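The reshape to (-1, group_size) and the per-group processing can be illustrated with a simplified round trip. This sketch applies plain symmetric quantization to each group and deliberately omits the outlier handling that AdpQ adds; the function name groupwise_roundtrip and the max-abs scaling rule are assumptions for the example.

```python
import numpy as np


def groupwise_roundtrip(matrix, group_size=128, q_bit=4):
    """Quantize each group symmetrically, then dequantize to float16."""
    w = np.asarray(matrix, dtype=np.float32)
    if w.size % group_size != 0:
        raise ValueError("matrix size must be divisible by group_size")
    groups = w.reshape(-1, group_size)  # one row per group
    qmax = 2 ** (q_bit - 1) - 1
    # One scale per group, derived from that group's largest magnitude.
    max_abs = np.max(np.abs(groups), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scale), -qmax, qmax).astype(np.int8)
    recon = (q.astype(np.float32) * scale).astype(np.float16)
    return recon.reshape(w.shape)  # restore the original shape
```

Because each group has its own scale, a single large weight only coarsens the grid of its own group rather than that of the whole matrix.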
- quantize_model_multithreaded(model: Module, max_workers: int = 4)[source]¶
Quantize valid linear layers using a thread pool.
- Parameters:
model – The PyTorch model.
max_workers – Maximum number of worker threads. Keep it low to avoid running out of memory (OOM): 4-8 is suggested for a desktop, higher for servers.
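The thread-pool pattern can be sketched with the standard library alone. To keep the example self-contained it works on a plain dictionary of NumPy arrays rather than a PyTorch model. The helper names fake_quantize and quantize_layers, and the rule of matching target layers by the last component of a dotted name, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

TARGET_LAYERS = ("q_proj", "k_proj", "v_proj", "o_proj",
                 "up_proj", "down_proj", "gate_proj")


def fake_quantize(weight, q_bit=4):
    """Stand-in for per-layer work: snap to a symmetric grid and back."""
    qmax = 2 ** (q_bit - 1) - 1
    max_abs = float(np.max(np.abs(weight)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(weight / scale), -qmax, qmax) * scale


def quantize_layers(named_weights, max_workers=4):
    """Process every targeted layer in a bounded thread pool."""
    targets = {name: w for name, w in named_weights.items()
               if name.split(".")[-1] in TARGET_LAYERS}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # At most max_workers layers are in flight, which bounds peak memory.
        futures = {name: pool.submit(fake_quantize, w)
                   for name, w in targets.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Each in-flight task holds its own temporary arrays, so peak memory grows roughly with max_workers. That is the reason for keeping the limit small on machines with little RAM.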
- save_pretrained(save_directory: str)[source]¶
Save the quantized model in Hugging Face format.
- Parameters:
save_directory (str) – The directory where the model will be saved.
- fuse_model_from_pretrained(model: Module, load_directory: str)[source]¶
Load the quantized model from Hugging Face format.
- Parameters:
model (torch.nn.Module) – The PyTorch model to load the weights into.
load_directory (str) – The directory where the model in ADPQ format is saved.
- classmethod apply_quantization(model: Module, config: AutoAdpQConfig, multi_threaded: int = 1)[source]¶
Apply quantization to a model given a configuration.
- Parameters:
model (torch.nn.Module) – The model to be quantized.
config (AutoAdpQConfig) – Configuration for quantization.
multi_threaded (int) – Number of threads to use for quantization. Defaults to 1 (single-threaded); set a larger value to quantize with that many threads.
- class auto_adpq.AutoAdpQConfig(*, group_size: int = 128, n_iters: int = 100, alpha: float = 0.08, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, target_layers: Tuple[str, ...] | None = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'))[source]¶
Bases: BaseModel
Configuration for Auto_AdpQ.
- group_size¶
Number of elements in a group for group-wise quantization. Must be between 1 and 65535 (inclusive).
- Type:
int
- n_iters¶
Maximum number of iterations for outlier detection.
- Type:
int
- alpha¶
Target fraction (0..1) of entries considered outliers.
- Type:
float
- device¶
Device string (e.g. “cpu” or “cuda”). Informational.
- Type:
str
- q_bit¶
Quantization bitwidth (e.g. 4 for 4-bit quantization).
- Type:
int
- data_packing¶
If True, multiple quantized values are packed into 32-bit integers; otherwise plain int8 arrays are used.
- Type:
bool
- symmetrical_quantization¶
If True, use symmetric quantization (no zeropoint). If False, use asymmetric quantization with zeropoints.
- Type:
bool
- target_layers¶
Tuple of layer names to quantize. If None, all linear layers are quantized.
- Type:
Optional[Tuple[str, …]]
- Raises:
ValueError – If group_size or n_iters are out of valid ranges.
- group_size: int¶
- n_iters: int¶
- alpha: float¶
- device: str¶
- q_bit: int¶
- data_packing: bool¶
- symmetrical_quantization: bool¶
- target_layers: Tuple[str, ...] | None¶
- model_config = {}¶
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- class auto_adpq.AdpQQuantizedWeights(group_num: int, scale: list[float] | ndarray, quantized_vector: list[list[int]] | ndarray, outlier_indices: list[list[int]] | ndarray, original_shape: Tuple[int, ...] | None = None, zeropoint: list[float] | ndarray | None = None)[source]¶
Bases: object
Container for AdpQ quantization outputs.
- original_shape¶
Original shape of the matrix passed to AdpQ_quantize. Used to reshape reconstructed output back to original shape.
- Type:
Optional[tuple[int, …]]
- group_num¶
Number of groups after reshaping to (-1, group_size).
- Type:
int
- scale¶
Per-group scale values. In practice an array of shape (group_num, 2) where second column is for outliers.
- Type:
Union[list[float], np.ndarray]
- zeropoint¶
Per-group zeropoints (None when symmetric quantization is used).
- Type:
Optional[Union[list[float], np.ndarray]]
- quantized_vector¶
Quantized integer vectors for each group (group_num x group_size).
- Type:
Union[list[list[int]], np.ndarray]
- outlier_indices¶
Per-group list of outlier indices or sentinel values.
- Type:
Union[list[list[int]], np.ndarray]
- Raises:
ValueError – If lengths of lists do not match group_num.
TODO: Creating a new object currently incurs significant overhead from field validation. Since the class is used internally only, the Pydantic module could be dropped, provided proper dump and load functions are ensured.
- group_num: int¶
- scale: list[float] | ndarray¶
- quantized_vector: list[list[int]] | ndarray¶
- outlier_indices: list[list[int]] | ndarray¶
- original_shape: Tuple[int, ...] | None = None¶
- zeropoint: list[float] | ndarray | None = None¶
Module reference¶
Auto ADPQ module.
- class auto_adpq.module.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)[source]¶
Bases: object
Adaptive Post-Training Quantization driver.
This class implements the end-to-end AdpQ flow: outlier detection, separate quantization of non-outlier and outlier values, and packaging of the quantized representation into AdpQQuantizedWeights.
- linear_target_layers = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj')¶
- quantize(sub_vector: list[float] | ndarray | Tensor) tuple[ndarray, float, float][source]¶
Quantize a 1-D sub-vector.
The function supports symmetric and asymmetric quantization. For symmetric quantization, the zeropoint is not used and is set to np.nan before conversion to np.float16.
- Parameters:
sub_vector (Union[list[float], np.ndarray, torch.Tensor]) – 1-D numeric array-like containing values to quantize.
- Returns:
(quantized, scale, zeropoint), where quantized is an np.int8 array and scale/zeropoint are returned as np.float16 values.
- Return type:
Tuple[np.ndarray, float, float]
- Raises:
ValueError – If input vector leads to invalid arithmetic (e.g. division by zero for a zero vector in symmetric mode).
- pack_bits(quantized_weights: ndarray) ndarray[source]¶
Pack quantized weights vector.
- Parameters:
quantized_weights (np.ndarray) – The quantized weights, given as a 2-D array of shape (M, N).
- Returns:
the bit-packed quantized weights.
- Return type:
np.ndarray
- unpack_bits(packed_weights: ndarray) ndarray[source]¶
Unpack bit-packed quantized weights.
- Parameters:
packed_weights (np.ndarray) – the bit-packed quantized weights.
- Returns:
the unpacked quantized weights.
- Return type:
np.ndarray
- reconstruct_weights(adpq_quantized_weights: AdpQQuantizedWeights) ndarray[source]¶
Reconstruct the full matrix from an AdpQQuantizedWeights object.
- Parameters:
adpq_quantized_weights (AdpQQuantizedWeights) – Container produced by AdpQ_quantize() that includes scales, zeropoints, quantized vectors and outlier indices.
- Returns:
Reconstructed matrix with dtype np.float16 and shape matching original_shape from the provided object.
- Return type:
np.ndarray
- save_weights(adpq_quantized_weights: AdpQQuantizedWeights, filepath: str, weight_name: str = 'weights')[source]¶
Save the AdpQQuantizedWeights to a file.
- Parameters:
adpq_quantized_weights (AdpQQuantizedWeights) – The quantized weights to save.
weight_name (str) – The name of the weight matrix.
filepath (str) – The path to the file where the weights will be saved.
TODO: Data packing fails at the moment!
- load_weights(filepath: str) AdpQQuantizedWeights[source]¶
Load the AdpQQuantizedWeights from a file.
- Parameters:
filepath (str) – The path to the file where the weights are saved.
- Returns:
The loaded quantized weights.
- Return type:
AdpQQuantizedWeights
- lasso_outlier_detection(matrix: list[float] | ndarray | Tensor) tuple[ndarray, float][source]¶
Detect outliers using an adaptive LASSO-inspired method.
The method searches for a regularization parameter that produces a target fraction of outliers (alpha) using a Brent-like root finding procedure. The selection criterion follows:
hat_w_i = sign(w_i) * ReLU(|w_i| - lambda' / |w_i|)
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – 2-D array shaped (num_groups, group_size) containing values to analyze.
- Returns:
(outlier_indices, outlier_ratio), where outlier_indices is an integer array listing per-group outlier positions and outlier_ratio is the fraction of entries detected as outliers.
- Return type:
Tuple[np.ndarray, float]
- AdpQ_quantize(matrix: list[float] | ndarray | Tensor) AdpQQuantizedWeights[source]¶
Quantize a matrix using the AdpQ (LASSO-based) flow.
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.
- Returns:
Container with quantized values, scales, optional zeropoints and outlier indices.
- Return type:
AdpQQuantizedWeights
- quantize_reconstruct(matrix: list[float] | ndarray | Tensor) ndarray[source]¶
Quantize and reconstruct a matrix using AdpQ.
- Parameters:
matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.
- Returns:
Reconstructed matrix after quantization.
- Return type:
np.ndarray
- quantize_model_multithreaded(model: Module, max_workers: int = 4)[source]¶
Quantize valid linear layers using a thread pool.
- Parameters:
model – The PyTorch model.
max_workers – Maximum number of worker threads. Keep it low to avoid running out of memory (OOM): 4-8 is suggested for a desktop, higher for servers.
- save_pretrained(save_directory: str)[source]¶
Save the quantized model in Hugging Face format.
- Parameters:
save_directory (str) – The directory where the model will be saved.
- fuse_model_from_pretrained(model: Module, load_directory: str)[source]¶
Load the quantized model from Hugging Face format.
- Parameters:
model (torch.nn.Module) – The PyTorch model to load the weights into.
load_directory (str) – The directory where the model in ADPQ format is saved.
- classmethod apply_quantization(model: Module, config: AutoAdpQConfig, multi_threaded: int = 1)[source]¶
Apply quantization to a model given a configuration.
- Parameters:
model (torch.nn.Module) – The model to be quantized.
config (AutoAdpQConfig) – Configuration for quantization.
multi_threaded (int) – Number of threads to use for quantization. Defaults to 1 (single-threaded); set a larger value to quantize with that many threads.