
YPOLOGIST Optimized Computational Architecture

Integrating MAP, CONTROLLER, DISTRIBUTE, and SCAN/REDUCE

Heterogeneous Computing System

  • Complex computation runs on the HOST: a mono- or multi-core computation structure (ARM, RISC-V, …)
  • Intense computation runs on the ACCELERATOR: a many-core computation structure
  • The ACCELERATOR is seen by the HOST as a hardware library of functions (called the parallel RISC system, pRISC, or the accelerator as a General-Purpose Processing Unit, aXPU), as sketched below
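A minimal sketch of this HOST/ACCELERATOR split, in Python, assuming a hypothetical GpaLibrary wrapper and a dot function; it only illustrates how the ACCELERATOR appears to the HOST as an ordinary function library, not the actual pRISC/aXPU API.

```python
# Hypothetical illustration of the "hardware library of functions" idea.
# GpaLibrary and dot are assumed names, not the real pRISC/aXPU interface.

from typing import List


class GpaLibrary:
    """Stand-in for the ACCELERATOR, called by the HOST like an ordinary library."""

    def dot(self, a: List[float], b: List[float]) -> float:
        # Intense, data-parallel work: element-wise multiply (MAP section)
        # followed by a sum (REDUCE section), done in hardware on the real device.
        return sum(x * y for x, y in zip(a, b))


def host_program() -> None:
    gpa = GpaLibrary()                  # complex control flow stays on the HOST
    a = [1.0, 2.0, 3.0, 4.0]
    b = [4.0, 3.0, 2.0, 1.0]
    print(gpa.dot(a, b))                # 20.0: the offload looks like a library call


if __name__ == "__main__":
    host_program()
```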

GENERAL PURPOSE ACCELERATOR (pRISC/aXPU)

  • MAP: linear array of p execution cells with large register files
  • CONTROLLER: custom micro-computer that issues commands to the MAP section, one per clock cycle
  • DISTRIBUTE: pipelined log-depth distribution network
  • SCAN/REDUCE: pipelined log-depth circuit performing reduce functions (add, min, max, …) and scan functions (prefix add, permute, …); a behavioural sketch follows this list
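The sketch below models the four sections in Python under a simple functional view: distribute broadcasts an operand to all p cells, map_cells applies one operation per cell, and reduce_cells/scan_cells combine the results. The function names and the sequential loops are illustrative assumptions, not the pipelined log-depth hardware.

```python
# Behavioural model of MAP, CONTROLLER-issued operations, DISTRIBUTE and SCAN/REDUCE.
# All names and the sequential implementation are assumptions for illustration only.

from typing import Callable, List


def distribute(value: float, p: int) -> List[float]:
    # Log-depth distribution network modelled as a simple broadcast to p cells.
    return [value] * p


def map_cells(cells: List[float], op: Callable[[float], float]) -> List[float]:
    # The CONTROLLER issues the same operation to every cell, one command per cycle.
    return [op(x) for x in cells]


def reduce_cells(cells: List[float], op=lambda a, b: a + b) -> float:
    # Reduce functions (add, min, max, ...), shown here sequentially.
    acc = cells[0]
    for x in cells[1:]:
        acc = op(acc, x)
    return acc


def scan_cells(cells: List[float], op=lambda a, b: a + b) -> List[float]:
    # Scan (prefix) version of the same operation.
    out, acc = [], None
    for x in cells:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out


p = 8
cells = distribute(1.0, p)                  # [1.0, 1.0, ..., 1.0]
cells = map_cells(cells, lambda x: 2 * x)   # [2.0, 2.0, ..., 2.0]
print(reduce_cells(cells))                  # 16.0
print(scan_cells(cells))                    # [2.0, 4.0, ..., 16.0]
```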

    SOFTWARE ARCHITECTURE COMPONENT OVERVIEW


    SOFTWARE ARCHITECTURE FLOW VIEW


    Architectural supralinear acceleration for matrix multiplication: 6.28 × p

  • Test configuration for N×N matrix multiplication: HOST: ARM mono-core; ACCELERATOR: our MapScanReduce accelerator with p = N cells (mapping sketched below)
  • Architectural acceleration (A): the acceleration obtained when HOST and ACCELERATOR run at the same frequency as an x86 mono-core reference engine. Validated by measurement of the GPA simulator's clock counter, with the simulator running on an x86 mono-core
    A = 6.28 × p
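The sketch below shows one plausible way to spread an N×N matrix multiplication over p = N cells, with each cell owning one column of the result while rows of the left operand are broadcast through DISTRIBUTE. The mapping and the function matmul_on_p_cells are assumptions for illustration; the 6.28 × p figure is the measured result quoted above, not an output of this model.

```python
# Assumed mapping: cell j holds column j of B and accumulates column j of C,
# while each row of A is broadcast to all cells. Illustration only.

import numpy as np


def matmul_on_p_cells(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    N = A.shape[0]                      # p = N cells, one per result column
    C = np.zeros((N, N))
    for i in range(N):                  # row A[i, :] broadcast to every cell
        for j in range(N):              # cell j computes element C[i, j]
            C[i, j] = float(np.dot(A[i, :], B[:, j]))
    return C


A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(matmul_on_p_cells(A, B), A @ B)
```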

    Estimation for 1024×1024 Matrix Multiplication in ML: 11× less energy & 3× less area compared with Nvidia

  • On Nvidia's GA100 GPU: execution time 0.4 ms, on 846 mm², with 6912 cells, in 7 nm, at 1.275 GHz, 400 W, memory bus: 5120 bits
  • On our GPA: execution time 2.9 ms, on 40 mm², with 1024 cells, in 7 nm, at 1.275 GHz, 5.12 W, memory bus: 128 bits
  • #cells(GPU)/#cells(GPA) = 6.75 ≈ 7 ≈ time(GPA)/time(GPU) = 7.25
  • Power(GPU)/Power(GPA) = 78 → ~11× more computation for the same energy
  • Area(GPU)/Area(GPA) = 21 → ~3× more computation for the same area (see the arithmetic check below)
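The short check below reproduces the arithmetic behind these ratios from the figures quoted above for the GA100 and the GPA.

```python
# Reproduces the energy and area comparison from the quoted figures
# (1024x1024 matrix multiplication on GA100 vs. GPA).

gpu = {"time_ms": 0.4, "area_mm2": 846, "cells": 6912, "power_w": 400.0}
gpa = {"time_ms": 2.9, "area_mm2": 40,  "cells": 1024, "power_w": 5.12}

cell_ratio  = gpu["cells"] / gpa["cells"]          # ~6.75
time_ratio  = gpa["time_ms"] / gpu["time_ms"]      # ~7.25
power_ratio = gpu["power_w"] / gpa["power_w"]      # ~78
area_ratio  = gpu["area_mm2"] / gpa["area_mm2"]    # ~21

# Same energy: the GPU draws ~78x the power but is only ~7.25x faster.
energy_advantage = power_ratio / time_ratio        # ~11x
# Same area: the GPU uses ~21x the area but is only ~7.25x faster.
area_advantage = area_ratio / time_ratio           # ~3x

print(f"cells {cell_ratio:.2f}, time {time_ratio:.2f}, "
      f"energy {energy_advantage:.1f}x, area {area_advantage:.1f}x")
```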

    THE PROJECT

  • Stage 0: ACCELERATOR in FPGA and assembly language
  • Stage 1: GPA SDK, the framework for API integration, with kernels partially developed up to the level at which system performance can be demonstrated (ONNX)
  • Stage 2: fully developed kernels

    CURRENT STAGE

  • Three silicon versions of the accelerator, based on a previous, more primitive version of this technology, were produced in Silicon Valley
  • Working prototype on a PYNQ-Z2 development board, for p = 128
  • The accelerator is programmed in assembly
  • Performance was investigated for a large number of application domains (dense & sparse linear algebra, FFT, molecular dynamics, automotive, …)