YPOLOGIST

Optimized Computational Architecture

Integrating MAP, CONTROLLER, DISTRIBUTE, and SCAN/REDUCE

Heterogeneous Computing System

  • Complex computation runs on the HOST: a mono- or multi-core computation structure (ARM, RISC-V, ...)
  • Intense computation runs on the ACCELERATOR: a many-core computation structure
  • The ACCELERATOR is seen by the HOST as a hardware library of functions (referred to as a parallel RISC system, pRISC, or as an accelerator serving as a general-purpose processing unit, aXPU); a host-side sketch of this usage model follows below
Heterogeneous System Diagram
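
A minimal host-side sketch of this usage model, in C. The gpa_matmul name is a hypothetical placeholder rather than the actual GPA SDK, and the call is stubbed with a CPU reference implementation so the sketch compiles and runs anywhere:

```c
/* Host-side sketch: the ACCELERATOR used as a hardware library of functions.
 * gpa_matmul is an illustrative placeholder, not the actual GPA SDK; it is
 * stubbed with a CPU reference so the sketch runs without the hardware. */
#include <stdio.h>

#define N 8

/* Stand-in for "call the pRISC/aXPU": on the real system this call would
 * issue the whole operation to the ACCELERATOR instead of computing on the HOST. */
static int gpa_matmul(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    return 0;                                   /* 0 = success */
}

int main(void)
{
    float a[N * N], b[N * N], c[N * N];

    /* Complex, irregular work (setup, I/O, control) stays on the HOST ...   */
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* ... while the intense, data-parallel work is one "library" call.      */
    if (gpa_matmul(N, a, b, c) == 0)
        printf("c[0][0] = %.1f (expected %.1f)\n", c[0], 2.0f * N);

    return 0;
}
```

The call boundary is the point of the model: the HOST sees only a function, while the body of that function is the accelerator.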

GENERAL PURPOSE ACCELERATOR (pRISC/aXPU)

  • MAP: a linear array of p execution cells, each with a large register file
  • CONTROLLER: a custom micro-computer that issues commands to the MAP section, one per clock cycle
  • DISTRIBUTE: a pipelined, log-depth distribution network
  • SCAN/REDUCE: a pipelined, log-depth circuit performing reduce functions (add, min, max, …) and scan functions (prefix sums, permutations, …); a functional sketch of these four sections follows the diagram below
General Purpose Accelerator Diagram
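
A minimal functional model of the four sections, in C, assuming a simple integer data path. It mirrors only the observable behavior (broadcast, per-cell map, reduce, prefix scan); in the real hardware DISTRIBUTE and SCAN/REDUCE are pipelined log-depth networks rather than the sequential loops shown here:

```c
/* Functional model of the accelerator's four sections over P execution cells.
 * Behavioral only: the real DISTRIBUTE and SCAN/REDUCE are pipelined
 * log-depth networks, and the CONTROLLER issues one command per clock cycle. */
#include <stdio.h>

#define P 8                        /* number of execution cells (MAP section) */

static int cell[P];                /* one value per cell, standing in for the register files */

/* DISTRIBUTE: broadcast a scalar from the CONTROLLER to every cell. */
static void distribute(int value)
{
    for (int i = 0; i < P; i++) cell[i] = value;
}

/* MAP: one command, executed by all P cells on their local data. */
static void map_add(const int *operand)
{
    for (int i = 0; i < P; i++) cell[i] += operand[i];
}

/* REDUCE: combine the P cell values into one scalar (here: add). */
static int reduce_add(void)
{
    int sum = 0;
    for (int i = 0; i < P; i++) sum += cell[i];
    return sum;
}

/* SCAN: exclusive prefix sum across the cells (a prefix-add function). */
static void scan_add(int prefix[P])
{
    int running = 0;
    for (int i = 0; i < P; i++) { prefix[i] = running; running += cell[i]; }
}

int main(void)
{
    int operand[P] = {1, 2, 3, 4, 5, 6, 7, 8};
    int prefix[P];

    distribute(10);                /* every cell now holds 10                 */
    map_add(operand);              /* cell[i] = 10 + operand[i]               */
    scan_add(prefix);              /* prefix[i] = cell[0] + ... + cell[i-1]   */
    printf("reduce = %d, last prefix = %d\n", reduce_add(), prefix[P - 1]);
    return 0;
}
```

In the hardware, reduce and scan complete in O(log p) pipeline stages rather than the O(p) loops of this model.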

SOFTWARE ARCHITECTURE COMPONENT OVERVIEW

Software Architecture Component Overview

SOFTWARE ARCHITECTURE FLOW VIEW

Software Architecture Flow View

Architectural supralinear acceleration for matrix multiplication: 6.28 x p

  • Test configuration for NxN matrix multiplication: HOST: an ARM mono-core; ACCELERATOR: our MapScanReduce accelerator with p = N cells (one plausible cell mapping is sketched below)
  • Architectural acceleration (A): the speed-up over an x86 mono-core engine when HOST and ACCELERATOR run at the same frequency as the x86 core. Validated by measurements on the GPA simulator's clock counter, with the simulator running on an x86 mono-core
  • Result: A = 6.28 x p
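
One plausible way an NxN multiplication could be laid out on p = N cells, sketched here as an assumption for illustration rather than the measured GPA kernel: cell j keeps column j of B and accumulates column j of C, while the elements of A are broadcast one per step, as the DISTRIBUTE network would do:

```c
/* Assumed mapping of C = A * B onto p = N cells, for illustration only:
 * cell j holds column j of B and accumulates column j of C, while the
 * elements of A are broadcast one per step (the DISTRIBUTE network's job).
 * This is not the measured GPA kernel, just one plausible layout. */
#include <stdio.h>

#define N 4                                    /* p = N cells */

int main(void)
{
    float A[N][N], B[N][N], C[N][N] = {{0.0f}};

    for (int i = 0; i < N; i++)                /* small deterministic inputs  */
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 1);
            B[i][j] = (float)(j + 1);
        }

    for (int i = 0; i < N; i++)                /* for each result row i       */
        for (int k = 0; k < N; k++) {
            float a_ik = A[i][k];              /* broadcast A[i][k] ...       */
            for (int j = 0; j < N; j++)        /* ... all N cells in parallel */
                C[i][j] += a_ik * B[k][j];     /* cell j: multiply-accumulate */
        }

    /* Row 0 of A is all 1s and column 0 of B is all 1s, so C[0][0] == N.    */
    printf("C[0][0] = %.0f\n", C[0][0]);
    return 0;
}
```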

Estimation for 1024x1024 Matrix Multiplication in ML: 11x less energy & 3x less area compared with Nvidia

  • On Nvidia's GA100 GPU: execution time 0.4 ms, on 846 mm², with 6912 cells, in 7 nm, at 1.275 GHz, 400 W, memory bus: 5120 bits
  • On our GPA: execution time 2.9 ms, on 40 mm², with 1024 cells, in 7 nm, at 1.275 GHz, 5.12 W, memory bus: 128 bits
  • #cells(GPU) / #cells(GPA) = 6912 / 1024 = 6.75 ≈ 7 ≈ time(GPA) / time(GPU) = 2.9 / 0.4 = 7.25
  • Power(GPU) / Power(GPA) = 400 / 5.12 ≈ 78 → ~11x more computation for the same energy
  • Area(GPU) / Area(GPA) = 846 / 40 ≈ 21 → ~3x more computation for the same area (the arithmetic is worked out below)
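
The ~11x energy and ~3x area figures follow from the numbers above; a worked restatement of the arithmetic:

```latex
% Same-task energy: the GPU spends ~11x more energy per multiplication,
% so for equal energy the GPA delivers ~11x more computation.
\[
\frac{E_{\text{GPU}}}{E_{\text{GPA}}}
  = \frac{P_{\text{GPU}}\, t_{\text{GPU}}}{P_{\text{GPA}}\, t_{\text{GPA}}}
  = \frac{400\ \text{W} \times 0.4\ \text{ms}}{5.12\ \text{W} \times 2.9\ \text{ms}}
  \approx \frac{78}{7.25} \approx 10.8 \approx 11
\]
% Same-area throughput: the GPU uses ~3x more area-time per multiplication,
% so for equal area the GPA delivers ~3x more computation.
\[
\frac{A_{\text{GPU}}\, t_{\text{GPU}}}{A_{\text{GPA}}\, t_{\text{GPA}}}
  = \frac{846\ \text{mm}^2 \times 0.4\ \text{ms}}{40\ \text{mm}^2 \times 2.9\ \text{ms}}
  \approx \frac{21}{7.25} \approx 2.9 \approx 3
\]
```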

THE PROJECT

  • Stage 0: the ACCELERATOR in FPGA and its assembly language
  • Stage 1: the GPA SDK, the frame for API integration, with Kernels partially developed up to the level at which system performance can be demonstrated (ONNX); a layering sketch follows the project diagram below
  • Stage 2: fully developed Kernels
Project Diagram
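
A hedged sketch, in C, of how the layers named in the stages might fit together from the application's point of view: an ONNX-level operation is lowered by an SDK kernel into an assembler-level command stream for the CONTROLLER. Every function name and command mnemonic here is a hypothetical placeholder, not the actual GPA SDK or instruction set:

```c
/* Layering sketch for the software stack outlined by the project stages:
 * application -> ONNX-level operation -> GPA SDK kernel -> assembler-level
 * command stream for the CONTROLLER. All names and mnemonics below are
 * hypothetical placeholders, not the actual GPA SDK or instruction set. */
#include <stdio.h>

/* Stage 0 level: the CONTROLLER consumes commands, one per clock cycle.     */
static void controller_issue(const char *command)
{
    printf("issue: %s\n", command);
}

/* Stage 1/2 level: an SDK kernel lowers one ONNX-level operation (here a
 * matrix multiplication) into a command stream for the accelerator.         */
static void gpa_kernel_matmul(int n)
{
    char buf[64];
    for (int k = 0; k < n; k++) {
        snprintf(buf, sizeof buf, "DISTRIBUTE a_row_element[%d]", k);
        controller_issue(buf);                  /* broadcast one operand         */
        controller_issue("MAP mac");            /* all cells multiply-accumulate */
    }
    controller_issue("STORE result columns");   /* write results back            */
}

/* Application level: call the operation through the SDK frame.              */
int main(void)
{
    gpa_kernel_matmul(4);
    return 0;
}
```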

CURRENT STAGE

  • Three silicon versions of a previous/primitive version of this technology were produced in Silicon Valley. Read more...
  • Working prototype, on PYNQ-Z2 development board, for p = 128
  • The accelerator is programmed in assembly
  • Performance was investigated for a large number of application domains (dense & sparse linear algebra, FFT, molecular dynamics, automotive, ...)
Current Stage Accelerator Image