Welcome to auto_adpq’s documentation!

Package overview

Auto-AdpQ.

class auto_adpq.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)

Bases: object

Adaptive Post-Training Quantization driver.

This class implements the end-to-end AdpQ flow: outlier detection, separate quantization of non-outlier and outlier values, and packaging of the quantized representation into AdpQQuantizedWeights.

linear_target_layers = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj')

quantize(sub_vector: list[float] | ndarray | Tensor) → tuple[ndarray, float, float]

Quantize a 1-D sub-vector.

The function supports symmetric and asymmetric quantization. For symmetric quantization, zeropoint is not used and will be set to np.nan before conversion to np.float16.

Parameters:

sub_vector (Union[list[float], np.ndarray, torch.Tensor]) – 1-D numeric array-like containing values to quantize.

Returns:

(quantized, scale, zeropoint), where quantized is an np.int8 array and scale and zeropoint are returned as np.float16 values.

Return type:

Tuple[np.ndarray, float, float]

Raises:

ValueError – If input vector leads to invalid arithmetic (e.g. division by zero for a zero vector in symmetric mode).
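The behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name quantize_sketch, the max-abs scale rule and the min/max asymmetric rule are assumptions chosen for illustration. Only the return types, the np.nan zeropoint in symmetric mode and the ValueError for a zero vector come from the description above.

```python
import numpy as np


def quantize_sketch(sub_vector, q_bit=4, symmetric=True):
    """Quantize a 1-D vector to q_bit-wide integers stored as int8 (illustrative)."""
    x = np.asarray(sub_vector, dtype=np.float32).ravel()
    if symmetric:
        qmax = 2 ** (q_bit - 1) - 1           # 7 for 4 bits
        max_abs = float(np.max(np.abs(x)))
        if max_abs == 0.0:
            # A zero vector would give scale 0 and a division by zero below.
            raise ValueError("cannot derive a scale from an all-zero vector")
        scale = max_abs / qmax                # assumed max-abs scaling rule
        zeropoint = np.nan                    # unused in symmetric mode
        q = np.clip(np.round(x / scale), -qmax, qmax)
    else:
        levels = 2 ** q_bit - 1               # 15 steps for 4 bits
        lo, hi = float(x.min()), float(x.max())
        if hi == lo:
            raise ValueError("cannot derive a scale from a constant vector")
        scale = (hi - lo) / levels            # assumed min/max scaling rule
        zeropoint = lo
        q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q.astype(np.int8), np.float16(scale), np.float16(zeropoint)


q, scale, zeropoint = quantize_sketch([-1.0, 0.0, 0.25, 1.0])
print(q, scale, zeropoint)
```

In this sketch the symmetric path maps the largest magnitude to the largest integer level, so a vector of zeros has no usable scale and is rejected up front.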

pack_bits(quantized_weights: ndarray) → ndarray

Pack a matrix of quantized weights into a bit-packed array.

Parameters:

quantized_weights (np.ndarray) – Quantized weights as a 2-D array of shape (M, N).

Returns:

The bit-packed quantized weights.

Return type:

np.ndarray

unpack_bits(packed_weights: ndarray) → ndarray

Unpack bit-packed quantized weights.

Parameters:

packed_weights (np.ndarray) – The bit-packed quantized weights.

Returns:

The unpacked quantized weights.

Return type:

np.ndarray
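As a rough illustration of what packing and unpacking involve, the sketch below stores eight signed 4-bit values in each 32-bit word and recovers them again. The function names, the little-endian field order within a word and the requirement that N be a multiple of 32 // q_bit are assumptions made for this example; the layout actually used by pack_bits may differ.

```python
import numpy as np


def pack_bits_sketch(quantized, q_bit=4):
    """Pack an (M, N) int8 matrix of q_bit-wide values into uint32 words."""
    per_word = 32 // q_bit
    mask = (1 << q_bit) - 1
    m, n = quantized.shape
    if n % per_word:
        raise ValueError("N must be a multiple of 32 // q_bit in this sketch")
    # Keep only the low q_bit bits (two's complement) of every value.
    fields = quantized.astype(np.int64) & mask
    fields = fields.reshape(m, n // per_word, per_word)
    shifts = np.arange(per_word, dtype=np.int64) * q_bit
    # The bit fields are disjoint, so summing equals OR-ing them together.
    return (fields << shifts).sum(axis=-1).astype(np.uint32)


def unpack_bits_sketch(packed, q_bit=4):
    """Inverse of pack_bits_sketch: recover the signed int8 values."""
    per_word = 32 // q_bit
    mask = (1 << q_bit) - 1
    sign_bit = 1 << (q_bit - 1)
    shifts = np.arange(per_word, dtype=np.int64) * q_bit
    fields = (packed.astype(np.int64)[..., None] >> shifts) & mask
    # Sign-extend: a field with its top bit set encodes a negative value.
    signed = np.where((fields & sign_bit) != 0, fields - (1 << q_bit), fields)
    return signed.reshape(packed.shape[0], -1).astype(np.int8)


weights = np.array([[-8, -1, 0, 7, 3, -4, 1, 2]], dtype=np.int8)
packed = pack_bits_sketch(weights)
assert np.array_equal(unpack_bits_sketch(packed), weights)
```

With q_bit = 4 the packed array has one eighth as many elements as the input, which is where the storage saving comes from.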

reconstruct_weights(adpq_quantized_weights: AdpQQuantizedWeights) → ndarray

Reconstruct the full matrix from an AdpQQuantizedWeights object.

Parameters:

adpq_quantized_weights (AdpQQuantizedWeights) – Container produced by AdpQ_quantize() that includes scales, zeropoints, quantized vectors and outlier indices.

Returns:

Reconstructed matrix with dtype np.float16 and shape matching original_shape from the provided object.

Return type:

np.ndarray
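A minimal sketch of the reconstruction step follows, under several assumptions that are not confirmed by this reference: symmetric quantization (no zeropoints), a scale array of shape (group_num, 2) whose second column applies to outliers, and -1 used as the padding sentinel in outlier_indices. The function name reconstruct_sketch is hypothetical.

```python
import numpy as np


def reconstruct_sketch(quantized, scale, outlier_indices, original_shape):
    """Dequantize (group_num, group_size) integers with two scales per group."""
    q = np.asarray(quantized, dtype=np.float32)
    scale = np.asarray(scale, dtype=np.float32)
    # Start by applying the non-outlier scale (column 0) to every entry.
    out = q * scale[:, :1]
    for g, idx in enumerate(outlier_indices):
        idx = np.asarray(idx, dtype=np.int64)
        idx = idx[idx >= 0]                # drop the assumed -1 sentinel
        # Outlier positions are rescaled with the outlier scale (column 1).
        out[g, idx] = q[g, idx] * scale[g, 1]
    return out.reshape(original_shape).astype(np.float16)


restored = reconstruct_sketch(
    quantized=[[1, 7, -2, 0], [3, -7, 0, 1]],
    scale=[[0.5, 2.0], [0.25, 1.0]],
    outlier_indices=[[1, -1], [1, -1]],
    original_shape=(2, 4),
)
print(restored)
```

Keeping a separate scale for outliers lets the few large-magnitude entries use a coarse step without widening the step for the remaining values in the group.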

save_weights(adpq_quantized_weights: AdpQQuantizedWeights, filepath: str, weight_name: str = 'weights')

Save the AdpQQuantizedWeights to a file.

Parameters:

adpq_quantized_weights (AdpQQuantizedWeights) – The quantized weights to save.
weight_name (str) – The name of the weight matrix.
filepath (str) – The path to the file where the weights will be saved.

TODO: Data packing fails at the moment!

load_weights(filepath: str) → AdpQQuantizedWeights

Load the AdpQQuantizedWeights from a file.

Parameters:

filepath (str) – The path to the file where the weights are saved.

Returns:

The loaded quantized weights.

Return type:

AdpQQuantizedWeights

lasso_outlier_detection(matrix: list[float] | ndarray | Tensor) → tuple[ndarray, float]

Detect outliers using an adaptive LASSO-inspired method.

The method searches for a regularization parameter that produces a target fraction of outliers (alpha) using a Brent-like root finding procedure. The selection criterion follows:

hat_w_i = sign(w_i) * ReLU(|w_i| - lambda' / |w_i|)

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – 2-D array shaped (num_groups, group_size) containing values to analyze.

Returns:

(outlier_indices, outlier_ratio): where outlier_indices is an integer array listing per-group outlier positions and outlier_ratio is the fraction of entries detected as outliers.

Return type:

Tuple[np.ndarray, float]
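The selection criterion above keeps an entry non-zero exactly when w_i^2 > lambda', so the search reduces to finding the lambda' that leaves roughly a fraction alpha of entries non-zero. The sketch below illustrates that idea with plain bisection standing in for the Brent-like search. The function names are hypothetical, and it returns a boolean mask rather than the per-group index array the method is documented to return.

```python
import numpy as np


def soft_threshold(w, lam):
    """sign(w) * ReLU(|w| - lam / |w|), with zero entries mapped to zero."""
    w = np.asarray(w, dtype=np.float64)
    mag = np.abs(w)
    safe = np.where(mag > 0, mag, 1.0)         # avoid dividing by zero
    shrunk = np.maximum(mag - lam / safe, 0.0)
    return np.sign(w) * np.where(mag > 0, shrunk, 0.0)


def find_outliers_sketch(matrix, alpha=0.06, n_iters=100):
    """Bisect on lam until at most a fraction alpha of entries stays non-zero."""
    w = np.asarray(matrix, dtype=np.float64)
    lo, hi = 0.0, float(np.max(w ** 2))        # lam = hi zeroes every entry
    for _ in range(n_iters):
        mid = 0.5 * (lo + hi)
        ratio = np.count_nonzero(soft_threshold(w, mid)) / w.size
        if ratio > alpha:
            lo = mid                           # too many outliers: shrink harder
        else:
            hi = mid                           # within budget: try a smaller lam
    outlier_mask = soft_threshold(w, hi) != 0
    return outlier_mask, float(outlier_mask.mean())


values = np.arange(1, 101, dtype=float).reshape(4, 25)
mask, ratio = find_outliers_sketch(values, alpha=0.06)
print(int(mask.sum()), ratio)
```

Because the final mask is evaluated at the upper bracket, the reported ratio never exceeds alpha in this sketch.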

AdpQ_quantize(matrix: list[float] | ndarray | Tensor) → AdpQQuantizedWeights

Quantize a matrix using the AdpQ (LASSO-based) flow.

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.

Returns:

Container with quantized values, scales, optional zeropoints and outlier indices.

Return type:

AdpQQuantizedWeights
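The reshape to (-1, group_size) and the per-group processing can be pictured with the following sketch. It covers only the grouping and a symmetric round trip and leaves out outlier handling. The function name, the max-abs scale and the treatment of all-zero groups are assumptions for illustration, not the library's behaviour.

```python
import numpy as np


def groupwise_roundtrip(matrix, group_size=128, q_bit=4):
    """Quantize each group of group_size values independently, then restore."""
    w = np.asarray(matrix, dtype=np.float32)
    if w.size % group_size:
        raise ValueError("matrix size must be divisible by group_size")
    groups = w.reshape(-1, group_size)             # (group_num, group_size)
    qmax = 2 ** (q_bit - 1) - 1
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    # One scale per group; all-zero groups get scale 1 to avoid dividing by 0.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scale), -qmax, qmax).astype(np.int8)
    restored = (q * scale).reshape(w.shape)        # back to the original shape
    return q, scale.squeeze(1), restored


matrix = np.arange(-8, 8, dtype=np.float32).reshape(2, 8)
q, scale, restored = groupwise_roundtrip(matrix, group_size=4)
print(q.shape, scale.shape, restored.shape)
```

Since every group has its own scale, the rounding error of an entry is bounded by half of its own group's step, independent of the magnitudes in other groups.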

quantize_reconstruct(matrix: list[float] | ndarray | Tensor) → ndarray

Quantize and reconstruct a matrix using AdpQ.

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.

Returns:

Reconstructed matrix after quantization.

Return type:

np.ndarray

quantize_model_multithreaded(model: Module, max_workers: int = 4)

Quantize valid linear layers using a thread pool.

Parameters:

model – The PyTorch model.
max_workers – Limit threads to avoid OOM (Out of Memory). Set to 4-8 for desktop, higher for servers.
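The thread-pool pattern described here can be sketched with the standard library alone. The example uses a plain dictionary of named weight lists in place of a PyTorch model, and it assumes that a layer qualifies when the last dotted component of its name is one of the target names. Both choices, and the function names, are illustrative assumptions rather than the library's selection logic.

```python
from concurrent.futures import ThreadPoolExecutor

TARGET_LAYERS = ("q_proj", "k_proj", "v_proj", "o_proj",
                 "up_proj", "down_proj", "gate_proj")


def is_target(layer_name, targets=TARGET_LAYERS):
    """True when the last dotted component of the name is a target layer."""
    return layer_name.rsplit(".", 1)[-1] in targets


def quantize_layers_threaded(layers, quantize_fn, max_workers=4):
    """Apply quantize_fn to every target layer using a bounded thread pool."""
    selected = {name: w for name, w in layers.items() if is_target(name)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(quantize_fn, w)
                   for name, w in selected.items()}
        # result() re-raises any exception raised inside a worker thread.
        return {name: fut.result() for name, fut in futures.items()}


layers = {
    "model.layers.0.self_attn.q_proj": [1.2, 2.7],
    "model.layers.0.mlp.up_proj": [3.4],
    "lm_head": [4.0],                      # not a target layer, so skipped
}
result = quantize_layers_threaded(layers, lambda w: [round(x) for x in w], 2)
print(sorted(result))
```

Capping max_workers bounds how many layers are processed at once, which is the lever the parameter description refers to for limiting peak memory.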

save_pretrained(save_directory: str)

Save the quantized model in Hugging Face format.

Parameters:

save_directory (str) – The directory where the model will be saved.

fuse_model_from_pretrained(model: Module, load_directory: str)

Load the quantized model from Hugging Face format.

Parameters:

model (torch.nn.Module) – The PyTorch model to load the weights into.
load_directory (str) – The directory where the model in ADPQ format is saved.

classmethod apply_quantization(model: Module, config: AutoAdpQConfig, multi_threaded: int = 1)

Apply quantization to a model given a configuration.

Parameters:

model (torch.nn.Module) – The model to be quantized.
config (AutoAdpQConfig) – Configuration for quantization.
multi_threaded (int) – Number of threads to use. With the default of 1, quantization runs single-threaded; a larger value enables multi-threaded quantization with that many threads.

class auto_adpq.AutoAdpQConfig(*, group_size: int = 128, n_iters: int = 100, alpha: float = 0.08, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, target_layers: Tuple[str, ...] | None = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'))

Bases: BaseModel

Configuration for Auto_AdpQ.

group_size

Number of elements in a group for group-wise quantization. Must be between 1 and 65535 (inclusive).

Type: int

n_iters

Maximum number of iterations for outlier detection.

Type: int

alpha

Target fraction (0..1) of entries considered outliers.

Type: float

device

Device string (e.g. “cpu” or “cuda”). Informational.

Type: str

q_bit

Quantization bitwidth (e.g. 4 for 4-bit quantization).

Type: int

data_packing

If True, multiple quantized values are packed into 32-bit integers; otherwise plain int8 arrays are used.

Type: bool

symmetrical_quantization

If True, use symmetric quantization (no zeropoint). If False, use asymmetric quantization with zeropoints.

Type: bool

target_layers

Tuple of layer names to quantize. If None, all linear layers are quantized.

Type: Optional[Tuple[str, ...]]

Raises: ValueError – If group_size or n_iters is outside its valid range.
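The range checks can be pictured with a small stand-in built on a standard-library dataclass instead of the Pydantic model. The class name ConfigSketch is hypothetical. The group_size bounds come from the attribute description above, while the lower bound of 1 on n_iters is an assumption, since this reference does not state the valid range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in for a validated configuration (not the library's class)."""

    group_size: int = 128
    n_iters: int = 100
    alpha: float = 0.08
    q_bit: int = 4

    def __post_init__(self):
        # Documented range: 1 to 65535 inclusive.
        if not 1 <= self.group_size <= 65535:
            raise ValueError("group_size must be between 1 and 65535")
        # Assumed lower bound; the reference does not give the n_iters range.
        if self.n_iters < 1:
            raise ValueError("n_iters must be at least 1")


config = ConfigSketch(group_size=64)
print(config)
```

Validating at construction time means an invalid value fails where the configuration is created rather than later, in the middle of a quantization run.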

group_size: int

n_iters: int

alpha: float

device: str

q_bit: int

data_packing: bool

symmetrical_quantization: bool

target_layers: Tuple[str, ...] | None

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class AdpQQuantizedWeights

Bases: object

Container for AdpQ quantization outputs.

original_shape

Original shape of the matrix passed to AdpQ_quantize. Used to reshape reconstructed output back to original shape.

Type: Optional[tuple[int, ...]]

group_num

Number of groups after reshaping to (-1, group_size).

Type: int

scale

Per-group scale values. In practice an array of shape (group_num, 2) where the second column is for outliers.

Type: Union[list[float], np.ndarray]

zeropoint

Per-group zeropoints (None when symmetric quantization is used).

Type: Optional[Union[list[float], np.ndarray]]

quantized_vector

Quantized integer vectors for each group (group_num x group_size).

Type: Union[list[list[int]], np.ndarray]

outlier_indices

Per-group list of outlier indices or sentinel values.

Type: Union[list[list[int]], np.ndarray]

Raises: ValueError – If lengths of lists do not match group_num.

TODO: Creating a new object currently incurs a major overhead because every field is validated. Since the class is used internally only, the Pydantic dependency could be dropped, provided that proper dump and load functions are kept.

group_num: int

scale: list[float] | ndarray

quantized_vector: list[list[int]] | ndarray

outlier_indices: list[list[int]] | ndarray

original_shape: Tuple[int, ...] | None = None

zeropoint: list[float] | ndarray | None = None

Module reference

Auto ADPQ module.

class auto_adpq.module.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)

Bases: object

Adaptive Post-Training Quantization driver.

The class description, attributes and methods are identical to those listed for auto_adpq.Auto_AdpQ in the package overview above.