Welcome to auto_adpq’s documentation!

Package overview

Auto-AdpQ.

class auto_adpq.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)

Bases: object

Adaptive Post-Training Quantization driver.

This class implements the end-to-end AdpQ flow: outlier detection, separate quantization of non-outlier and outlier values, and packaging of the quantized representation into AdpQQuantizedWeights.

linear_target_layers = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj')

quantize(sub_vector: list[float] | ndarray | Tensor) → tuple[ndarray, float, float]

Quantize a 1-D sub-vector.

The function supports symmetric and asymmetric quantization. For symmetric quantization, zeropoint is not used and will be set to np.nan before conversion to np.float16.

Parameters:

sub_vector (Union[list[float], np.ndarray, torch.Tensor]) – 1-D numeric array-like containing values to quantize.

Returns:

(quantized, scale, zeropoint)

where quantized is an np.int8 array and scale and zeropoint are returned as np.float16 values.

Return type:

Tuple[np.ndarray, float, float]

Raises:

ValueError – If input vector leads to invalid arithmetic (e.g. division by zero for a zero vector in symmetric mode).
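
As a rough illustration of the two modes described above, the following pure-Python sketch quantizes a 1-D list. It is not the library's implementation: the name quantize_sketch, the rounding and clamping rules, and the way the asymmetric zeropoint is derived are all assumptions chosen for clarity, and it returns plain Python values rather than np.int8 and np.float16.

```python
import math


def quantize_sketch(values, q_bit=4, symmetric=True):
    """Illustrative uniform quantizer (hypothetical helper, not auto_adpq code)."""
    if symmetric:
        q_max = 2 ** (q_bit - 1) - 1            # 7 for 4 bits
        max_abs = max(abs(v) for v in values)
        if max_abs == 0:
            # Mirrors the documented failure mode: no scale for a zero vector.
            raise ValueError("cannot derive a scale from an all-zero vector")
        scale = max_abs / q_max
        quantized = [max(-q_max, min(q_max, round(v / scale))) for v in values]
        return quantized, scale, math.nan       # zeropoint is unused
    levels = 2 ** q_bit - 1                     # 15 for 4 bits
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot derive a scale from a constant vector")
    scale = (hi - lo) / levels
    zeropoint = round(-lo / scale)              # integer offset (assumed convention)
    quantized = [max(0, min(levels, round(v / scale) + zeropoint)) for v in values]
    return quantized, scale, zeropoint
```

In the symmetric case a value is recovered as q * scale; in the asymmetric case as (q - zeropoint) * scale.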

pack_bits(quantized_weights: ndarray) → ndarray

Pack quantized weights vector.

Parameters:

quantized_weights (np.ndarray) – Quantized weights as a 2-D array of shape (M, N).

Returns:

the bit-packed quantized weights.

Return type:

np.ndarray

unpack_bits(packed_weights: ndarray) → ndarray

Unpack bit-packed quantized weights.

Parameters:

packed_weights (np.ndarray) – the bit-packed quantized weights.

Returns:

the unpacked quantized weights.

Return type:

np.ndarray
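
To make the idea of bit packing concrete, here is a minimal sketch that stores eight unsigned 4-bit values in each 32-bit word and reverses the operation. The function names, the little-endian slot order and the restriction to unsigned values in 0..15 are assumptions for illustration; the layout actually used by pack_bits and unpack_bits may differ.

```python
def pack_4bit(values):
    """Pack unsigned 4-bit integers (0..15), eight per 32-bit word."""
    if len(values) % 8:
        raise ValueError("length must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for slot, v in enumerate(values[start:start + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * slot)             # slot 0 occupies the lowest bits
        words.append(word)
    return words


def unpack_4bit(words):
    """Inverse of pack_4bit: extract eight 4-bit values from each word."""
    return [(w >> (4 * slot)) & 0xF for w in words for slot in range(8)]
```

Signed values, such as those produced by symmetric quantization, would first need an offset or a two's-complement mask before being packed this way.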

reconstruct_weights(adpq_quantized_weights: AdpQQuantizedWeights) → ndarray

Reconstruct the full matrix from an AdpQQuantizedWeights object.

Parameters:

adpq_quantized_weights (AdpQQuantizedWeights) – Container produced by AdpQ_quantize() that includes scales, zeropoints, quantized vectors and outlier indices.

Returns:

Reconstructed matrix with dtype np.float16 and shape matching original_shape from the provided object.

Return type:

np.ndarray
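
The reconstruction step can be pictured with the sketch below, which covers only the symmetric case. It assumes that each group carries two scales (one for regular values, one for outliers) and that unused slots in the outlier index list are marked with a sentinel of -1; both the sentinel value and the helper name are hypothetical, and the final reshape to original_shape is omitted.

```python
def reconstruct_sketch(quantized, scales, outlier_indices, sentinel=-1):
    """Dequantize group by group, using the outlier scale where flagged."""
    rows = []
    for q_row, (scale_regular, scale_outlier), idx_row in zip(
        quantized, scales, outlier_indices
    ):
        outliers = {i for i in idx_row if i != sentinel}
        rows.append([
            q * (scale_outlier if i in outliers else scale_regular)
            for i, q in enumerate(q_row)
        ])
    return rows
```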

save_weights(adpq_quantized_weights: AdpQQuantizedWeights, filepath: str, weight_name: str = 'weights')

Save the AdpQQuantizedWeights to a file.

Parameters:
  • adpq_quantized_weights (AdpQQuantizedWeights) – The quantized weights to save.

  • weight_name (str) – The name of the weight matrix.

  • filepath (str) – The path to the file where the weights will be saved.

TODO: Data packing fails at the moment!

load_weights(filepath: str) → AdpQQuantizedWeights

Load the AdpQQuantizedWeights from a file.

Parameters:
  • weight_name (str) – The name of the weight matrix.

  • filepath (str) – The path to the file where the weights are saved.

Returns:

The loaded quantized weights.

Return type:

AdpQQuantizedWeights

lasso_outlier_detection(matrix: list[float] | ndarray | Tensor) → tuple[ndarray, float]

Detect outliers using an adaptive LASSO-inspired method.

The method searches for a regularization parameter that produces a target fraction of outliers (alpha) using a Brent-like root-finding procedure. The selection criterion is:

hat_w_i = sign(w_i) * ReLU(|w_i| - lambda' / |w_i|)

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – 2-D array shaped (num_groups, group_size) containing values to analyze.

Returns:

(outlier_indices, outlier_ratio)

where outlier_indices is an integer array listing per-group outlier positions and outlier_ratio is the fraction of entries detected as outliers.

Return type:

Tuple[np.ndarray, float]
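
Under the criterion above, hat_w_i is non-zero exactly when |w_i| > lambda' / |w_i|, that is, when w_i^2 > lambda'. The sketch below uses that observation to search for lambda' on a flat list of weights. It makes two labeled assumptions: entries with non-zero hat_w_i are the ones treated as outliers, and plain bisection stands in for the Brent-like procedure. The function names are hypothetical and the per-group index layout of the real method is not reproduced.

```python
def outlier_fraction(weights, lam):
    """Fraction of entries whose shrunken value stays non-zero (w^2 > lam)."""
    return sum(1 for w in weights if w * w > lam) / len(weights)


def detect_outliers_sketch(weights, alpha=0.06, n_iters=100):
    """Bisect on lam so that at most a fraction alpha of entries are outliers."""
    lo, hi = 0.0, max(w * w for w in weights)   # at hi, no entry is an outlier
    for _ in range(n_iters):
        mid = (lo + hi) / 2
        if outlier_fraction(weights, mid) > alpha:
            lo = mid                            # too many outliers: raise lam
        else:
            hi = mid                            # within budget: try a smaller lam
    indices = [i for i, w in enumerate(weights) if w * w > hi]
    return indices, len(indices) / len(weights)
```

Because the outlier fraction is a step function of lambda', the search settles on the smallest threshold whose fraction does not exceed alpha rather than on an exact root.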

AdpQ_quantize(matrix: list[float] | ndarray | Tensor) → AdpQQuantizedWeights

Quantize a matrix using the AdpQ (LASSO-based) flow.

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.

Returns:

Container with quantized values, scales, optional zeropoints and outlier indices.

Return type:

AdpQQuantizedWeights
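
The grouping step, a reshape to (-1, group_size), can be sketched without NumPy as follows. The helper names are hypothetical. The sketch rejects inputs whose element count is not a multiple of group_size, which is also what NumPy's reshape does; how AdpQ_quantize itself handles that case is not stated here.

```python
def to_groups(matrix, group_size):
    """Flatten a 2-D list row by row and split it into fixed-size groups."""
    flat = [v for row in matrix for v in row]
    if len(flat) % group_size:
        raise ValueError("element count must be a multiple of group_size")
    return [flat[i:i + group_size] for i in range(0, len(flat), group_size)]


def from_groups(groups, n_cols):
    """Undo to_groups: rebuild rows of n_cols elements each."""
    flat = [v for group in groups for v in group]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
```

Each returned group would then be processed independently, and the number of groups corresponds to group_num in the result container.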

quantize_reconstruct(matrix: list[float] | ndarray | Tensor) → ndarray

Quantize and reconstruct a matrix using AdpQ.

Parameters:

matrix (Union[list, np.ndarray, torch.Tensor]) – Input weight matrix. The method reshapes the input to (-1, group_size) and processes each group independently.

Returns:

Reconstructed matrix after quantization.

Return type:

np.ndarray

quantize_model_multithreaded(model: Module, max_workers: int = 4)

Quantize valid linear layers using a thread pool.

Parameters:
  • model (torch.nn.Module) – The PyTorch model to quantize.

  • max_workers (int) – Maximum number of worker threads. Keep it low to avoid running out of memory (OOM): 4-8 on a desktop, higher on servers.
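
The general pattern of quantizing layers through a bounded thread pool might look like the sketch below. It works on a plain dictionary of named weights instead of a torch.nn.Module, and both the substring matching of layer names and the helper name are assumptions, not the library's behavior.

```python
from concurrent.futures import ThreadPoolExecutor


def quantize_layers_sketch(layers, quantize_fn, target_names, max_workers=4):
    """Run quantize_fn on every targeted layer using a bounded thread pool."""
    selected = {
        name: weights
        for name, weights in layers.items()
        if any(target in name for target in target_names)  # assumed matching rule
    }
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(quantize_fn, weights)
            for name, weights in selected.items()
        }
        # result() re-raises any exception thrown inside a worker.
        return {name: future.result() for name, future in futures.items()}
```

Capping max_workers bounds how many layers are in flight at once, which is what limits peak memory use.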

save_pretrained(save_directory: str)

Save the quantized model in Hugging Face format.

Parameters:

save_directory (str) – The directory where the model will be saved.

fuse_model_from_pretrained(model: Module, load_directory: str)

Load the quantized model from Hugging Face format.

Parameters:
  • model (torch.nn.Module) – The PyTorch model to load the weights into.

  • load_directory (str) – The directory where the model in AdpQ format is saved.

classmethod apply_quantization(model: Module, config: AutoAdpQConfig, multi_threaded: int = 1)

Apply quantization to a model given a configuration.

Parameters:
  • model (torch.nn.Module) – The model to be quantized.

  • config (AutoAdpQConfig) – Configuration for quantization.

  • multi_threaded (int) – Number of threads to use. With the default of 1, quantization runs single-threaded; pass a larger value to quantize with that many threads.

class auto_adpq.AutoAdpQConfig(*, group_size: int = 128, n_iters: int = 100, alpha: float = 0.08, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, target_layers: Tuple[str, ...] | None = ('q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'))

Bases: BaseModel

Configuration for Auto_AdpQ.

group_size

Number of elements in a group for group-wise quantization. Must be between 1 and 65535 (inclusive).

Type:

int

n_iters

Maximum number of iterations for outlier detection.

Type:

int

alpha

Target fraction (0..1) of entries considered outliers.

Type:

float

device

Device string (e.g. “cpu” or “cuda”). Informational.

Type:

str

q_bit

Quantization bitwidth (e.g. 4 for 4-bit quantization).

Type:

int

data_packing

If True, multiple quantized values are packed into 32-bit integers; otherwise plain int8 arrays are used.

Type:

bool

symmetrical_quantization

If True, use symmetric quantization (no zeropoint). If False, use asymmetric quantization with zeropoints.

Type:

bool

target_layers

Tuple of layer names to quantize. If None, all linear layers are quantized.

Type:

Optional[Tuple[str, …]]

Raises:

ValueError – If group_size or n_iters is outside its valid range.

group_size: int
n_iters: int
alpha: float
device: str
q_bit: int
data_packing: bool
symmetrical_quantization: bool
target_layers: Tuple[str, ...] | None
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class auto_adpq.AdpQQuantizedWeights(group_num: int, scale: list[float] | ndarray, quantized_vector: list[list[int]] | ndarray, outlier_indices: list[list[int]] | ndarray, original_shape: Tuple[int, ...] | None = None, zeropoint: list[float] | ndarray | None = None)

Bases: object

Container for AdpQ quantization outputs.

original_shape

Original shape of the matrix passed to AdpQ_quantize. Used to reshape reconstructed output back to original shape.

Type:

Optional[tuple[int, …]]

group_num

Number of groups after reshaping to (-1, group_size).

Type:

int

scale

Per-group scale values. In practice, an array of shape (group_num, 2) whose second column holds the scales for outliers.

Type:

Union[list[float], np.ndarray]

zeropoint

Per-group zeropoints (None when symmetric quantization is used).

Type:

Optional[Union[list[float], np.ndarray]]

quantized_vector

Quantized integer vectors for each group (group_num x group_size).

Type:

Union[list[list[int]], np.ndarray]

outlier_indices

Per-group list of outlier indices or sentinel values.

Type:

Union[list[list[int]], np.ndarray]

Raises:

ValueError – If lengths of lists do not match group_num.

TODO: Validating the fields adds significant overhead each time an object is created. Because this class is only used internally, the Pydantic dependency could be dropped, provided proper dump and load functions are kept.

group_num: int
scale: list[float] | ndarray
quantized_vector: list[list[int]] | ndarray
outlier_indices: list[list[int]] | ndarray
original_shape: Tuple[int, ...] | None = None
zeropoint: list[float] | ndarray | None = None
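
The length check described under Raises can be illustrated with a small standard-library stand-in. The class below is hypothetical and much simpler than AdpQQuantizedWeights: it only verifies that each per-group field has group_num entries.

```python
from dataclasses import dataclass


@dataclass
class QuantizedGroupsSketch:
    """Hypothetical stand-in that checks per-group lengths on creation."""
    group_num: int
    scale: list
    quantized_vector: list
    outlier_indices: list

    def __post_init__(self):
        for name in ("scale", "quantized_vector", "outlier_indices"):
            actual = len(getattr(self, name))
            if actual != self.group_num:
                raise ValueError(
                    f"{name} has {actual} entries, expected {self.group_num}"
                )
```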

Module reference

Auto ADPQ module.

class auto_adpq.module.Auto_AdpQ(group_size: int = 128, alpha: float = 0.06, n_iters: int = 100, device: str = 'cpu', q_bit: int = 4, data_packing: bool = True, symmetrical_quantization: bool = True, config: AutoAdpQConfig | None = None)

Bases: object

Adaptive Post-Training Quantization driver.

The attributes and methods of this class are identical to those documented for auto_adpq.Auto_AdpQ under Package overview.