YPOLOGIST

Optimized Computational Architecture

Integrating MAP, CONTROLLER, DISTRIBUTE, and SCAN/REDUCE

Heterogeneous Computing System

  • Complex computation runs on the HOST: a mono- or multi-core computation structure (ARM, RISC-V, ...)
  • Intense computation runs on the ACCELERATOR: a many-core computation structure
  • The ACCELERATOR is seen by the HOST as a hardware library of functions (referred to as a parallel RISC system, pRISC, or as an accelerator serving as a general-purpose processing unit, aXPU); a host-side sketch of this usage model follows below
Heterogeneous System Diagram
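
A minimal host-side sketch of this usage model, in C. The gpa_matmul name is a hypothetical placeholder rather than the actual GPA SDK, and the call is stubbed with a CPU reference implementation so the sketch compiles and runs anywhere:

```c
/* Host-side sketch: the ACCELERATOR used as a hardware library of functions.
 * gpa_matmul is an illustrative placeholder, not the actual GPA SDK; it is
 * stubbed with a CPU reference so the sketch runs without the hardware. */
#include <stdio.h>

#define N 8

/* Stand-in for "call the pRISC/aXPU": on the real system this call would
 * issue the whole operation to the ACCELERATOR instead of computing on the HOST. */
static int gpa_matmul(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    return 0;                                   /* 0 = success */
}

int main(void)
{
    float a[N * N], b[N * N], c[N * N];

    /* Complex, irregular work (setup, I/O, control) stays on the HOST ...   */
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* ... while the intense, data-parallel work is one "library" call.      */
    if (gpa_matmul(N, a, b, c) == 0)
        printf("c[0][0] = %.1f (expected %.1f)\n", c[0], 2.0f * N);

    return 0;
}
```

The call boundary is the point of the model: the HOST sees only a function, while the body of that function is the accelerator.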

GENERAL PURPOSE ACCELERATOR (pRISC/aXPU)

  • MAP: a linear array of p execution cells, each with a large register file
  • CONTROLLER: a custom micro-computer that issues commands to the MAP section, one per clock cycle
  • DISTRIBUTE: a pipelined, log-depth distribution network
  • SCAN/REDUCE: a pipelined, log-depth circuit performing reduce functions (add, min, max, …) and scan functions (prefix sums, permutations, …); a functional sketch of these four sections follows the diagram below
General Purpose Accelerator Diagram
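
A minimal functional model of the four sections, in C, assuming a simple integer data path. It mirrors only the observable behavior (broadcast, per-cell map, reduce, prefix scan); in the real hardware DISTRIBUTE and SCAN/REDUCE are pipelined log-depth networks rather than the sequential loops shown here:

```c
/* Functional model of the accelerator's four sections over P execution cells.
 * Behavioral only: the real DISTRIBUTE and SCAN/REDUCE are pipelined
 * log-depth networks, and the CONTROLLER issues one command per clock cycle. */
#include <stdio.h>

#define P 8                        /* number of execution cells (MAP section) */

static int cell[P];                /* one value per cell, standing in for the register files */

/* DISTRIBUTE: broadcast a scalar from the CONTROLLER to every cell. */
static void distribute(int value)
{
    for (int i = 0; i < P; i++) cell[i] = value;
}

/* MAP: one command, executed by all P cells on their local data. */
static void map_add(const int *operand)
{
    for (int i = 0; i < P; i++) cell[i] += operand[i];
}

/* REDUCE: combine the P cell values into one scalar (here: add). */
static int reduce_add(void)
{
    int sum = 0;
    for (int i = 0; i < P; i++) sum += cell[i];
    return sum;
}

/* SCAN: exclusive prefix sum across the cells (a prefix-add function). */
static void scan_add(int prefix[P])
{
    int running = 0;
    for (int i = 0; i < P; i++) { prefix[i] = running; running += cell[i]; }
}

int main(void)
{
    int operand[P] = {1, 2, 3, 4, 5, 6, 7, 8};
    int prefix[P];

    distribute(10);                /* every cell now holds 10                 */
    map_add(operand);              /* cell[i] = 10 + operand[i]               */
    scan_add(prefix);              /* prefix[i] = cell[0] + ... + cell[i-1]   */
    printf("reduce = %d, last prefix = %d\n", reduce_add(), prefix[P - 1]);
    return 0;
}
```

In the hardware, reduce and scan complete in O(log p) pipeline stages rather than the O(p) loops of this model.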

SOFTWARE ARCHITECTURE COMPONENT OVERVIEW

Software Architecture Component Overview

SOFTWARE ARCHITECTURE FLOW VIEW

Software Architecture Flow View

Architectural supralinear acceleration for matrix multiplication: 6.28 x p

  • Test configuration for NxN matrix multiplication: HOST: an ARM mono-core; ACCELERATOR: our MapScanReduce accelerator with p = N cells (one plausible cell mapping is sketched below)
  • Architectural acceleration (A): the speed-up over an x86 mono-core engine when HOST and ACCELERATOR run at the same frequency as the x86 core. Validated by measurements on the GPA simulator's clock counter, with the simulator running on an x86 mono-core
  • Result: A = 6.28 x p
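
One plausible way an NxN multiplication could be laid out on p = N cells, sketched here as an assumption for illustration rather than the measured GPA kernel: cell j keeps column j of B and accumulates column j of C, while the elements of A are broadcast one per step, as the DISTRIBUTE network would do:

```c
/* Assumed mapping of C = A * B onto p = N cells, for illustration only:
 * cell j holds column j of B and accumulates column j of C, while the
 * elements of A are broadcast one per step (the DISTRIBUTE network's job).
 * This is not the measured GPA kernel, just one plausible layout. */
#include <stdio.h>

#define N 4                                    /* p = N cells */

int main(void)
{
    float A[N][N], B[N][N], C[N][N] = {{0.0f}};

    for (int i = 0; i < N; i++)                /* small deterministic inputs  */
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 1);
            B[i][j] = (float)(j + 1);
        }

    for (int i = 0; i < N; i++)                /* for each result row i       */
        for (int k = 0; k < N; k++) {
            float a_ik = A[i][k];              /* broadcast A[i][k] ...       */
            for (int j = 0; j < N; j++)        /* ... all N cells in parallel */
                C[i][j] += a_ik * B[k][j];     /* cell j: multiply-accumulate */
        }

    /* Row 0 of A is all 1s and column 0 of B is all 1s, so C[0][0] == N.    */
    printf("C[0][0] = %.0f\n", C[0][0]);
    return 0;
}
```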

Estimation for 1024x1024 Matrix Multiplication in ML: 11x less energy & 3x less area compared with Nvidia

  • On Nvidia's GA100 GPU: execution time 0.4 ms, on 846 mm², with 6912 cells, in 7 nm, at 1.275 GHz, 400 W, memory bus: 5120 bits
  • On our GPA: execution time 2.9 ms, on 40 mm², with 1024 cells, in 7 nm, at 1.275 GHz, 5.12 W, memory bus: 128 bits
  • #cells(GPU) / #cells(GPA) = 6912 / 1024 = 6.75 ≈ 7 ≈ time(GPA) / time(GPU) = 2.9 / 0.4 = 7.25
  • Power(GPU) / Power(GPA) = 400 / 5.12 ≈ 78 → ~11x more computation for the same energy
  • Area(GPU) / Area(GPA) = 846 / 40 ≈ 21 → ~3x more computation for the same area (the arithmetic is worked out below)
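
The ~11x energy and ~3x area figures follow from the numbers above; a worked restatement of the arithmetic:

```latex
% Same-task energy: the GPU spends ~11x more energy per multiplication,
% so for equal energy the GPA delivers ~11x more computation.
\[
\frac{E_{\text{GPU}}}{E_{\text{GPA}}}
  = \frac{P_{\text{GPU}}\, t_{\text{GPU}}}{P_{\text{GPA}}\, t_{\text{GPA}}}
  = \frac{400\ \text{W} \times 0.4\ \text{ms}}{5.12\ \text{W} \times 2.9\ \text{ms}}
  \approx \frac{78}{7.25} \approx 10.8 \approx 11
\]
% Same-area throughput: the GPU uses ~3x more area-time per multiplication,
% so for equal area the GPA delivers ~3x more computation.
\[
\frac{A_{\text{GPU}}\, t_{\text{GPU}}}{A_{\text{GPA}}\, t_{\text{GPA}}}
  = \frac{846\ \text{mm}^2 \times 0.4\ \text{ms}}{40\ \text{mm}^2 \times 2.9\ \text{ms}}
  \approx \frac{21}{7.25} \approx 2.9 \approx 3
\]
```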

THE PROJECT

  • Stage 0: the ACCELERATOR in FPGA and its assembly language
  • Stage 1: the GPA SDK, the frame for API integration, with Kernels partially developed up to the level at which system performance can be demonstrated (ONNX); a layering sketch follows the project diagram below
  • Stage 2: fully developed Kernels
Project Diagram
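
A hedged sketch, in C, of how the layers named in the stages might fit together from the application's point of view: an ONNX-level operation is lowered by an SDK kernel into an assembler-level command stream for the CONTROLLER. Every function name and command mnemonic here is a hypothetical placeholder, not the actual GPA SDK or instruction set:

```c
/* Layering sketch for the software stack outlined by the project stages:
 * application -> ONNX-level operation -> GPA SDK kernel -> assembler-level
 * command stream for the CONTROLLER. All names and mnemonics below are
 * hypothetical placeholders, not the actual GPA SDK or instruction set. */
#include <stdio.h>

/* Stage 0 level: the CONTROLLER consumes commands, one per clock cycle.     */
static void controller_issue(const char *command)
{
    printf("issue: %s\n", command);
}

/* Stage 1/2 level: an SDK kernel lowers one ONNX-level operation (here a
 * matrix multiplication) into a command stream for the accelerator.         */
static void gpa_kernel_matmul(int n)
{
    char buf[64];
    for (int k = 0; k < n; k++) {
        snprintf(buf, sizeof buf, "DISTRIBUTE a_row_element[%d]", k);
        controller_issue(buf);                  /* broadcast one operand         */
        controller_issue("MAP mac");            /* all cells multiply-accumulate */
    }
    controller_issue("STORE result columns");   /* write results back            */
}

/* Application level: call the operation through the SDK frame.              */
int main(void)
{
    gpa_kernel_matmul(4);
    return 0;
}
```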

CURRENT STAGE

  • Three silicon versions of a previous/primitive version of this technology were produced in Silicon Valley. Read more...
  • Working prototype, on PYNQ-Z2 development board, for p = 128
  • The accelerator is programmed in assembly
  • Performance was investigated for a large number of application domains (dense & sparse linear algebra, FFT, molecular dynamics, automotive, ...)
Current Stage Accelerator Image