SIGMA features an innovative architecture that includes the Differential Query-Key-Value (DiffQKV) attention mechanism and benefits from extensive pre-training on system-specific data. DiffQKV optimizes inference efficiency by adopting tailored strategies for the Query (Q), Key (K), and Value (V) components of the attention mechanism. Unlike traditional approaches, which compress these components uniformly, DiffQKV applies selective compression. This involves aggressive compression of Key components while sparing Value components to maintain performance. The model also employs augmented Q dimensions, enhancing its representational capacity without significantly impacting inference speed.
SIGMA’s pre-training incorporates 6 trillion tokens, including 19.5 billion tokens from system-domain-specific sources and 1 trillion synthesized and rewritten tokens. This focused training ensures that SIGMA performs on par with state-of-the-art models in general domains while excelling in system-specific tasks. To evaluate its capabilities, Microsoft introduced AIMICIUS, a benchmark specifically designed for system-related tasks. SIGMA’s performance on AIMICIUS demonstrates substantial improvements, outperforming GPT-4 with an absolute improvement of up to 52.5%......