
YPOLOGIST Optimized Computational Architecture

Integrating MAP, CONTROLLER, DISTRIBUTE, and SCAN/REDUCE

Heterogeneous Computing System

  • Complex computation runs on the HOST: a mono- or multi-core computation structure (ARM, RISC-V, …)
  • Intense computation runs on the ACCELERATOR: a many-core computation structure
  • The ACCELERATOR is seen by the HOST as a hardware library of functions (called the parallel RISC system, pRISC, or the accelerator as a General-Purpose Processing Unit, aXPU), as sketched below
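A minimal sketch of this HOST/ACCELERATOR split, in Python, assuming a hypothetical GpaLibrary wrapper and a dot function; it only illustrates how the ACCELERATOR appears to the HOST as an ordinary function library, not the actual pRISC/aXPU API.

```python
# Hypothetical illustration of the "hardware library of functions" idea.
# GpaLibrary and dot are assumed names, not the real pRISC/aXPU interface.

from typing import List


class GpaLibrary:
    """Stand-in for the ACCELERATOR, called by the HOST like an ordinary library."""

    def dot(self, a: List[float], b: List[float]) -> float:
        # Intense, data-parallel work: element-wise multiply (MAP section)
        # followed by a sum (REDUCE section), done in hardware on the real device.
        return sum(x * y for x, y in zip(a, b))


def host_program() -> None:
    gpa = GpaLibrary()                  # complex control flow stays on the HOST
    a = [1.0, 2.0, 3.0, 4.0]
    b = [4.0, 3.0, 2.0, 1.0]
    print(gpa.dot(a, b))                # 20.0: the offload looks like a library call


if __name__ == "__main__":
    host_program()
```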

GENERAL PURPOSE ACCELERATOR (pRISC/aXPU)

  • MAP: linear array of p execution cells with large register files
  • CONTROLLER: custom micro-computer that issues commands to the MAP section, one per clock cycle
  • DISTRIBUTE: pipelined log-depth distribution network
  • SCAN/REDUCE: pipelined log-depth circuit performing reduce functions (add, min, max, …) and scan functions (prefix add, permute, …); a behavioural sketch follows this list
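The sketch below models the four sections in Python under a simple functional view: distribute broadcasts an operand to all p cells, map_cells applies one operation per cell, and reduce_cells/scan_cells combine the results. The function names and the sequential loops are illustrative assumptions, not the pipelined log-depth hardware.

```python
# Behavioural model of MAP, CONTROLLER-issued operations, DISTRIBUTE and SCAN/REDUCE.
# All names and the sequential implementation are assumptions for illustration only.

from typing import Callable, List


def distribute(value: float, p: int) -> List[float]:
    # Log-depth distribution network modelled as a simple broadcast to p cells.
    return [value] * p


def map_cells(cells: List[float], op: Callable[[float], float]) -> List[float]:
    # The CONTROLLER issues the same operation to every cell, one command per cycle.
    return [op(x) for x in cells]


def reduce_cells(cells: List[float], op=lambda a, b: a + b) -> float:
    # Reduce functions (add, min, max, ...), shown here sequentially.
    acc = cells[0]
    for x in cells[1:]:
        acc = op(acc, x)
    return acc


def scan_cells(cells: List[float], op=lambda a, b: a + b) -> List[float]:
    # Scan (prefix) version of the same operation.
    out, acc = [], None
    for x in cells:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out


p = 8
cells = distribute(1.0, p)                  # [1.0, 1.0, ..., 1.0]
cells = map_cells(cells, lambda x: 2 * x)   # [2.0, 2.0, ..., 2.0]
print(reduce_cells(cells))                  # 16.0
print(scan_cells(cells))                    # [2.0, 4.0, ..., 16.0]
```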

    SOFTWARE ARCHITECTURE COMPONENT OVERVIEW


    SOFTWARE ARCHITECTURE FLOW VIEW


    Architectural supralinear acceleration for matrix multiplication: 6.28 × p

  • Test configuration for N×N matrix multiplication: HOST: ARM mono-core; ACCELERATOR: our MapScanReduce accelerator with p = N cells (mapping sketched below)
  • Architectural acceleration (A): the acceleration obtained when HOST and ACCELERATOR run at the same frequency as an x86 mono-core reference engine. Validated by measurement of the GPA simulator's clock counter, with the simulator running on an x86 mono-core
    A = 6.28 × p
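The sketch below shows one plausible way to spread an N×N matrix multiplication over p = N cells, with each cell owning one column of the result while rows of the left operand are broadcast through DISTRIBUTE. The mapping and the function matmul_on_p_cells are assumptions for illustration; the 6.28 × p figure is the measured result quoted above, not an output of this model.

```python
# Assumed mapping: cell j holds column j of B and accumulates column j of C,
# while each row of A is broadcast to all cells. Illustration only.

import numpy as np


def matmul_on_p_cells(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    N = A.shape[0]                      # p = N cells, one per result column
    C = np.zeros((N, N))
    for i in range(N):                  # row A[i, :] broadcast to every cell
        for j in range(N):              # cell j computes element C[i, j]
            C[i, j] = float(np.dot(A[i, :], B[:, j]))
    return C


A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(matmul_on_p_cells(A, B), A @ B)
```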

    Estimation for 1024×1024 Matrix Multiplication in ML: 11× less energy & 3× less area compared with Nvidia

  • On Nvidia's GA100 GPU: execution time 0.4 ms, on 846 mm², with 6912 cells, in 7 nm, at 1.275 GHz, 400 W, memory bus: 5120 bits
  • On our GPA: execution time 2.9 ms, on 40 mm², with 1024 cells, in 7 nm, at 1.275 GHz, 5.12 W, memory bus: 128 bits
  • #cells(GPU)/#cells(GPA) = 6.75 ≈ 7 ≈ time(GPA)/time(GPU) = 7.25
  • Power(GPU)/Power(GPA) = 78 → ~11× more computation for the same energy
  • Area(GPU)/Area(GPA) = 21 → ~3× more computation for the same area (see the arithmetic check below)
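The short check below reproduces the arithmetic behind these ratios from the figures quoted above for the GA100 and the GPA.

```python
# Reproduces the energy and area comparison from the quoted figures
# (1024x1024 matrix multiplication on GA100 vs. GPA).

gpu = {"time_ms": 0.4, "area_mm2": 846, "cells": 6912, "power_w": 400.0}
gpa = {"time_ms": 2.9, "area_mm2": 40,  "cells": 1024, "power_w": 5.12}

cell_ratio  = gpu["cells"] / gpa["cells"]          # ~6.75
time_ratio  = gpa["time_ms"] / gpu["time_ms"]      # ~7.25
power_ratio = gpu["power_w"] / gpa["power_w"]      # ~78
area_ratio  = gpu["area_mm2"] / gpa["area_mm2"]    # ~21

# Same energy: the GPU draws ~78x the power but is only ~7.25x faster.
energy_advantage = power_ratio / time_ratio        # ~11x
# Same area: the GPU uses ~21x the area but is only ~7.25x faster.
area_advantage = area_ratio / time_ratio           # ~3x

print(f"cells {cell_ratio:.2f}, time {time_ratio:.2f}, "
      f"energy {energy_advantage:.1f}x, area {area_advantage:.1f}x")
```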

    THE PROJECT

  • Stage 0: ACCELERATOR in FPGA and assembly language
  • Stage 1: GPA SDK, the framework for API integration, with kernels partially developed up to the level at which system performance can be demonstrated (ONNX)
  • Stage 2: fully developed kernels

    CURRENT STAGE

  • Three silicon versions of the accelerator, based on a previous, more primitive version of this technology, were produced in Silicon Valley
  • Working prototype on a PYNQ-Z2 development board, for p = 128
  • The accelerator is programmed in assembly
  • Performance was investigated for a large number of application domains (dense & sparse linear algebra, FFT, molecular dynamics, automotive, …)