EfficientLLM:
Efficiency in Large Language Models

Our Framework

"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency."

— Bill Gates

[Figure: Overview of the EfficientLLM evaluation framework]


EfficientLLM establishes a comprehensive benchmark for evaluating and comparing efficiency techniques across the lifecycle of large language models, from architecture pretraining to fine-tuning and inference, providing actionable insight into the trade-offs between task performance and resource consumption.

Ranking

Scores are per-metric ranks within each comparison group (lower is better: 1 = best). Pretraining methods are ranked 1-4, tuning methods 1-7, and quantization precisions 1-3. Each sub-table reports the stage-appropriate throughput metric (TT for pretraining and tuning, IT for quantized inference); metric abbreviations are defined below.

Architecture Pretraining Efficiency (ranks 1-4)

| Group | Method | Performance | AMU | PCU | AL | TT | AEC | MCR |
|---|---|---|---|---|---|---|---|---|
| Attention Mechanism | MQA | 4 | 1 | 1 | 1 | 2 | 2 | 3 |
| Attention Mechanism | GQA | 3 | 3 | 2 | 2 | 2 | 2 | 4 |
| Attention Mechanism | MLA | 1 | 4 | 3 | 3 | 2 | 2 | 2 |
| Attention Mechanism | NSA | 2 | 2 | 4 | 4 | 2 | 2 | 1 |
| Efficient Positional Encoding | RoPE | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| Efficient Positional Encoding | Absolute | 4 | 4 | 4 | 4 | 2 | 2 | 4 |
| Efficient Positional Encoding | Learnable Absolute | 2 | 3 | 3 | 3 | 2 | 2 | 3 |
| Efficient Positional Encoding | Relative | 3 | 1 | 1 | 1 | 2 | 2 | 1 |
| MoE Mechanism | Dense Model 1.5B | 4 | 2 | 2 | 3 | 2 | 2 | 2 |
| MoE Mechanism | Dense Model 3B | 3 | 1 | 1 | 4 | 2 | 2 | 1 |
| MoE Mechanism | MoE Model 1.5Bx8 | 1 | 4 | 4 | 1 | 2 | 2 | 4 |
| MoE Mechanism | MoE Model 0.5Bx8 | 2 | 3 | 3 | 2 | 2 | 2 | 3 |
| Attention-free Mechanism | Transformer | 1 | 4 | 4 | 1 | 2 | 2 | 4 |
| Attention-free Mechanism | Mamba | 2 | 1 | 1 | 4 | 2 | 2 | 1 |
| Attention-free Mechanism | Pythia | 4 | 3 | 3 | 3 | 2 | 2 | 3 |
| Attention-free Mechanism | RWKV | 3 | 2 | 2 | 2 | 2 | 2 | 2 |

Training & Tuning Efficiency (ranks 1-7)

| Model Size | Method | Performance | AMU | PCU | AL | TT | AEC |
|---|---|---|---|---|---|---|---|
| 1B-3B | LoRA | 7 | 2 | 2 | 3 | 2 | 3 |
| 1B-3B | LoRA-Plus | 4 | 5 | 6 | 4 | 3 | 2 |
| 1B-3B | RSLoRA | 5 | 4 | 5 | 5 | 4 | 5 |
| 1B-3B | DoRA | 6 | 7 | 3 | 7 | 6 | 6 |
| 1B-3B | PiSSA | 3 | 6 | 4 | 6 | 5 | 4 |
| 1B-3B | Freeze | 1 | 3 | 7 | 1 | 7 | 1 |
| 1B-3B | Full* | 2 | 1 | 1 | 2 | 1 | 7 |
| 7B-8B | LoRA | 2 | 1 | 5 | 5 | 4 | 3 |
| 7B-8B | LoRA-Plus | 5 | 2 | 7 | 6 | 5 | 2 |
| 7B-8B | RSLoRA | 3 | 4 | 6 | 3 | 3 | 5 |
| 7B-8B | DoRA | 4 | 6 | 3 | 7 | 7 | 6 |
| 7B-8B | PiSSA | 6 | 3 | 4 | 4 | 5 | 4 |
| 7B-8B | Freeze | 1 | 7 | 2 | 1 | 1 | 1 |
| 7B-8B | Full* | 7 | 5 | 1 | 2 | 2 | 7 |
| 14B-24B | LoRA | 3 | 1 | 6 | 2 | 3 | 6 |
| 14B-24B | LoRA-Plus | 4 | 3 | 7 | 6 | 7 | 1 |
| 14B-24B | RSLoRA | 1 | 2 | 5 | 4 | 4 | 5 |
| 14B-24B | DoRA | 6 | 7 | 3 | 7 | 2 | 4 |
| 14B-24B | PiSSA | 2 | 4 | 4 | 3 | 4 | 3 |
| 14B-24B | Freeze | 5 | 6 | 2 | 1 | 1 | 2 |
| 14B-24B | Full* | 7 | 5 | 1 | 5 | 6 | 7 |

Full* denotes full-parameter fine-tuning.

Bit-Width Quantization Efficiency (ranks 1-3)

| Model Size | Precision | Performance | AMU | PCU | AL | IT | AEC | MCR |
|---|---|---|---|---|---|---|---|---|
| 1.5B-3.8B | bfloat16 | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| 1.5B-3.8B | float16 | 1 | 3 | 2 | 1 | 3 | 3 | 2 |
| 1.5B-3.8B | int4 | 3 | 1 | 2 | 3 | 1 | 1 | 1 |
| 7B-8B | bfloat16 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| 7B-8B | float16 | 2 | 3 | 2 | 1 | 3 | 3 | 3 |
| 7B-8B | int4 | 3 | 1 | 2 | 3 | 1 | 1 | 1 |
| 14B-34B | bfloat16 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| 14B-34B | float16 | 2 | 3 | 2 | 1 | 3 | 1 | 3 |
| 14B-34B | int4 | 3 | 1 | 2 | 3 | 1 | 3 | 1 |

Legend (color tiers in the original table, by rank):
High tier: Architecture 1, Training 1-2, Quantization 1 | Medium tier: Architecture 2-3, Training 3-5, Quantization 2 | Low tier: Architecture 4, Training 6-7, Quantization 3

Metrics:
AMU - Accelerator Memory Utilization | PCU - Processor Compute Utilization | AL - Average Latency | TT - Training Throughput | ST - Serving Throughput | IT - Inference Throughput | AEC - Average Energy Consumption | MCR - Model Compression Ratio
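
As an illustration of how such measurements can be turned into metrics, the sketch below gives simple, assumed formulas for each quantity; the exact definitions used by EfficientLLM may differ, and all field and function names here are illustrative.

```python
# Illustrative (assumed) metric definitions; the paper's exact formulas may differ.
from dataclasses import dataclass

@dataclass
class RunStats:
    peak_mem_bytes: float    # peak accelerator memory used during the run
    total_mem_bytes: float   # total memory available on the accelerator
    achieved_flops: float    # sustained FLOP/s measured during the run
    peak_flops: float        # accelerator's theoretical peak FLOP/s
    tokens_processed: int    # tokens consumed (training) or generated (inference)
    wall_time_s: float       # end-to-end wall-clock time
    energy_joules: float     # energy drawn by the accelerator over the run
    orig_param_bytes: float  # model size before compression
    comp_param_bytes: float  # model size after compression (e.g., quantization)

def amu(s: RunStats) -> float:
    """Accelerator Memory Utilization: fraction of device memory in use."""
    return s.peak_mem_bytes / s.total_mem_bytes

def pcu(s: RunStats) -> float:
    """Processor Compute Utilization: achieved vs. peak FLOP/s (MFU-style)."""
    return s.achieved_flops / s.peak_flops

def throughput(s: RunStats) -> float:
    """TT / ST / IT: tokens per second for the relevant lifecycle stage."""
    return s.tokens_processed / s.wall_time_s

def aec(s: RunStats) -> float:
    """Average Energy Consumption: joules per processed token."""
    return s.energy_joules / s.tokens_processed

def mcr(s: RunStats) -> float:
    """Model Compression Ratio: original size over compressed size."""
    return s.orig_param_bytes / s.comp_param_bytes
```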

Architecture Pretraining Efficiency

All values presented in our figures are min-max normalized within each metric across all models. For consistency, metrics where lower raw values are better (e.g., PPL, FID) are inverted first, so that higher normalized values always indicate better performance or efficiency.
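
A minimal sketch of this transform, assuming lower-is-better metrics are handled by a sign flip before min-max scaling (the paper's exact treatment may differ):

```python
import numpy as np

def normalize_metric(values: np.ndarray, higher_is_better: bool) -> np.ndarray:
    """Min-max normalize one metric across all models to [0, 1].
    Lower-is-better metrics (e.g., PPL, FID, latency) are negated first
    so that 1.0 always means best."""
    v = values if higher_is_better else -values
    vmin, vmax = v.min(), v.max()
    return (v - vmin) / (vmax - vmin)

# Example: perplexities for four models (lower is better).
ppl = np.array([12.3, 9.8, 15.1, 10.4])
print(normalize_metric(ppl, higher_is_better=False))  # best model -> 1.0
```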

[Figure: Architecture pretraining efficiency results (min-max normalized)]
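
For background on the attention-mechanism comparison above, here is a generic sketch of grouped-query attention (GQA); MQA is the special case with a single key/value head. This is illustrative code, not the benchmark implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq divisible by Hkv.
    Hkv == 1 recovers MQA; Hkv == Hq recovers standard multi-head attention.
    Fewer KV heads shrink the KV cache, which drives the memory and latency
    advantages seen in the ranking table."""
    n_rep = q.shape[1] // k.shape[1]
    # Expand each shared KV head across its group of query heads.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# Example: 8 query heads sharing 2 KV heads (group size 4).
B, T, D = 1, 16, 64
q = torch.randn(B, 8, T, D)
k = torch.randn(B, 2, T, D)
v = torch.randn(B, 2, T, D)
out = grouped_query_attention(q, k, v)  # shape: (1, 8, 16, 64)
```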

Training and Tuning Efficiency

[Figure: Training and tuning efficiency results (min-max normalized)]
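
For reference, the sketch below shows how the compared tuning methods can be expressed with the Hugging Face PEFT library. The model name, rank, and target modules are placeholder assumptions, not the benchmark's settings, and LoRA-Plus additionally requires a per-parameter-group learning-rate ratio in the optimizer, which is omitted here.

```python
# Hypothetical illustration of the compared tuning methods via Hugging Face PEFT;
# hyperparameters below are placeholders, not the benchmark's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder

common = dict(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

configs = {
    "LoRA":   LoraConfig(**common),
    "RSLoRA": LoraConfig(**common, use_rslora=True),            # rank-stabilized scaling
    "DoRA":   LoraConfig(**common, use_dora=True),              # weight-decomposed LoRA
    "PiSSA":  LoraConfig(**common, init_lora_weights="pissa"),  # principal-singular-value init
}

model = get_peft_model(base, configs["LoRA"])
model.print_trainable_parameters()

# "Freeze" tunes only a subset of layers, e.g. the last two transformer blocks:
#   for p in base.parameters(): p.requires_grad = False
#   for p in base.model.layers[-2:].parameters(): p.requires_grad = True
# "Full*" updates all parameters (standard full-parameter fine-tuning).
# LoRA-Plus reuses the plain LoRA config but trains the B matrices with a
# larger learning rate via custom optimizer parameter groups.
```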

Bit-Width Quantization Efficiency

[Figure: Bit-width quantization efficiency results (min-max normalized)]
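
To make the three precision settings concrete, here is a minimal sketch that loads a model at each bit-width with Hugging Face Transformers and bitsandbytes. The model name is a placeholder, and the benchmark's exact quantization pipeline may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

name = "meta-llama/Llama-3.1-8B"  # placeholder model

# bfloat16 / float16: 16-bit weights, no quantization.
model_bf16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model_fp16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# int4: 4-bit weight quantization via bitsandbytes (NF4 shown here), trading a
# small accuracy hit for roughly 4x weight compression (see the MCR column above).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)
model_int4 = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb)
```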