Blackfin Overview

Srinivas K
A. Patil
A. Awasthy
Shailendra Miglani
Agenda

◆ Day 1
  ● Introduction
  ● VisualDSP++ features
  ● Coding guidelines for achieving Optimal C Performance on Blackfin
  ● Architecture and Pipeline
  ● Memory
  ● Assembly level optimization

◆ Day 2
  ● Introduction to LDF
  ● DMA
  ● VDK and uClinux
  ● Q & A session
Blackfin DSP Technology

A Signal Processing Architecture for the Internet Era

—Analog Devices Confidential Information—
Blackfin : Introduction

- Blackfin DSP is the architectural base for a whole new family of DSPs from ADI. It is built upon the *Micro Signal Architecture (MSA)* core developed through the Joint Development with Intel Corporation.
- Blackfin DSPs incorporate the industry’s highest performance 16-bit DSP architecture. Their Dynamic Power Management capabilities deliver the lowest power consumption.
- Blackfin DSPs are optimized for processing data, communications and video streams for penetration into new market spaces.
Blackfin: Features and Benefits

- High Performance for real-time video signal processing
- Easily programmed to support complex, new standards.
- Handles the DSP and Control code with equal efficiency.
- Maximizes work and minimizes energy per cycle

- High Performance
  - Blackfin offers 600M MACs today with a roadmap for 2G MACs
- Low Power Consumption
  - Blackfin DSP enables significant power savings by dynamically varying both voltage and frequency.
- Ease of use
  - Blackfin DSP combines attributes of both high performance DSP and microcontrollers into a single RISC ISA.
BLACKfin Processors Embed MCU Features

- Arbitrary bit and bit-field manipulation, insertion and extraction
- Integer operations on 8/16/32-bit data-types
- Memory protection and separate user and supervisor stack pointers
- Scratch SRAM for context switching
- Population and leading digit counting
- Byte addressing DAGs
- Compact Code Density
Integrated Blackfin Features Typically Found in a Microcontroller

**A RISC Instruction Set**

**Data Movement**
- LD, ST, 8, 16, 32 bits
- Unsigned, Sign-extend
- Register moves, P-D-DAG, Push, Pop, Push/Pop mult CC2 dreg, etc.

**Addressing Modes**
- Auto incr, Auto decr,
- Pre-decr store on SP,
- Indirect
- Indexed w/immed offset
- Post-incr w/ nonunity stride
- Byte addressable

**Program Control**
- BRCC, UJUMP,
- Call, Rets, Loop Setup

**Arithmetic**
- +, -, *, /, >>>, Negate
- 2 and 3 operand instructs

**Logical**
- AND, OR, XOR, NOT
- BIT tst, set, tgl, clr, CC ops
- <<, >>

**Video**
- SAA, Byteops: Residual calc,
- Spatial Interpolation, Spatial Filter

**Cache Control**
- Prefetch, Flush

**Memory management**

**Event control**

**Supervisor/user modes**

**Wide range of peripherals**

---

There is not a separate Micro-Controller mode!

---

Traditional MCU Compiler generates
• Dense control code, BUT
• Much larger and slower DSP code

Traditional DSP Compiler generates
• Good DSP algorithm code, BUT
• Much larger control code

Architecture and Compiler Work Together to Deliver Dense Control Code and Fast DSP Code

Enhanced Dynamic Power Management Increases Battery Life

![Graph showing power consumption at different frequencies and voltages for video and audio processing.]

- **Variable Frequency**
  - Programmable PLL (1x to 63x) combined with CCLK and SCLK dividers enable low latency changes in system performance and power consumption profile

- **Variable Voltage**
  - On-Chip Voltage Regulator generates core voltage from an externally supplied 2.25 – 3.6V input
  - Core voltage programmable from 0.7V to 1.2V (50 mV increments)

- **System Cost Reduction**
Blackfin: Target Applications

- PDA
  - Internet audio
- Digital Still Camera
- Video camera
  - Video conferencing
  - MPEG2
  - DVD
- Digital Printing
- Audio
  - MP3 Audio
  - Digital Car Radios

- Modems
  - ADSL
  - VoIP Phone Solutions
  - Cable Modems
  - RAS Modems
  - Wireless modems
- Mobile Phones
  - GSM Mobile phones
  - 3G data terminals
- Internet Appliances
ADI Blackfin: Performance Leadership

Benchmark: BDTImark2000™ / BDTIsimMark2000™

Blackfin Competitive Performance Advantage

Blackfin has Higher Clock Rate .... And > 2x Signal Processing Performance

The BDTImark2000/BDTIsimMark2000 provide a summary measure of DSP speed. For more info and scores see www.BDTI.com. Scores © 2002/2003 BDTI.

*BDTImark2000 **BDTIsimMark2000 (simulated only, not verified on hardware)
Price/Performance Comparison

Price ($/10 kU)

Signal Processing Performance

- ADSP-BF533
- ADSP-BF532
- ADSP-BF531
- '5501
- '5502
- '5510
- SH3-DSP
- PXA250
- '6416
- '6411
- '5509
- '5404
- '6401

ADI BLACKfin
TI C55xx
TI C64xx
TI C54xx
Intel XSCALE
Hitachi SH3-DSP

Blackfin Products at a Glance
ADSP-BF535 Blackfin DSP – Available Now

System Control Blocks
- Emulator & Test Control
- Event Controllers
- Watchdog Timers
- Memory DMA
- Real Time Clock
- PLL

High Speed I/O
- 32-bit External Bus Interface
- SDRAM Ctrl

Peripheral Blocks
- USB v 1.1
- GPIO
  - SPORT0
  - SPORT1
  - SPI0
  - SPI1
  - UART0
  - IrDA
  - UART1

System Interface Unit
- Blackfin Core
  - To 350 MHz
- 16KB Inst. SRAM / Cache
- 32KB Data SRAM / Cache
- 256 KB SRAM

L1

L2

Blackfin : ADSP-BF533 – Available Now

- Processor Core To 750MHz
- Emulator & Test Control
- Voltage Regulation
- Event Controller
- Clock (PLL)
- Memory DMA
- Watchdog Timer
- Real Time Clock

- System Interface Unit
  - 80KB Instruction SRAM/Cache
  - 32KB Instruction ROM
  - 64KB Data SRAM/Cache
  - 4KB Scratchpad RAM

- Peripheral Blocks
  - UART
  - SPI
  - SPORT0
  - SPORT1
  - Timers 0/1/2

- High Speed I/O
- Parallel Peripheral Interface/GPIO

- External Memory Interface SDRAM Ctrl

Blackfin : ADSP-BF532 – Available Now

Processor Core

To 400MHz

System Control Blocks

Emulator & Test Control
Voltage Regulation
Event Controller
Clock (PLL)
Memory DMA
Watchdog Timer
Real Time Clock

System Interface Unit

48KB Instruction SRAM/Cache
32KB Instruction ROM
32KB Data SRAM/Cache
4KB Scratchpad RAM

Peripheral Blocks

UART
SPI
SPORT0
SPORT1
Timers 0/1/2

Parallel Peripheral Interface/GPIO

External Memory Interface SDRAM Ctrl

High Speed I/O

Blackfin: ADSP-BF531 – Available Now
ADSP-BF561 Dual-Core Blackfin – Available Now

System Control Blocks
- Emulator & Test Control
- Voltage Regulator
- Event Controllers
- Watchdog Timers
- Memory DMA
- PLL

Blackfin Core
- Up to 750 MHz

L1
- 32KB Inst.
- 64KB Data
- SRAM / Cache

L2
- 128 KB SRAM

System Interface Unit

High Speed I/O
- 32-bit External Bus Interface
- SDRAM Ctrl
- Emulator & Test Control
- Voltage Regulator
- Event Controllers
- Watchdog Timers
- Memory DMA
- PLL

Peripheral Blocks
- SPORT0
- SPORT1
- GPIO
- TIMERS (12)
- SPI0
- UART
- IrDA
- PPI0 / GPIO
- PPI1 / GPIO
Blackfin – ADSP-BF534 – Available Now

System Control Blocks
- Test Control
- Emulation Control
- Event Controller
- Watchdog Timer
- Memory DMA
- RTC
- PLL

Processor Core
- To 500MHz

System Interface Unit
- Up to 64KB Inst.
  - SRAM 32KB
  - SRAM/Cache 32KB
- Up to 64KB Data
  - SRAM 32KB
  - SRAM/Cache 32KB
- Scratch Pad 4KB

Peripheral Blocks
- SPORT1, UART0-1, SPI0, Timer0-7, PPI*
  - 32 GPIO
- SPORT0 / I2C / CAN*

Blackfin – ADSP-BF536 – Available Now

System Control Blocks
- Test Control
- Emulation Control
- Event Controller
- Watchdog Timer
- Memory DMA
- RTC
- PLL

Processor Core
- To 400MHz

System Interface Unit (L1 memory)
- Up to 64KB Inst.
  - SRAM 32KB
  - SRAM/Cache 32KB
- Up to 32KB Data
  - SRAM 16KB
  - SRAM/Cache 16KB
- Scratch Pad 4KB

Peripheral Blocks
- SPORT1, UART0-1, SPI0, Timer0-7, PPI*
- SPORT0 / I²C / CAN*
- 16-bit External Memory
- 10/100 Ethernet MAC / 16 GPIO

Blackfin – ADSP-BF537 – Available Now

- Processor Core: To 500MHz
- System Control Blocks:
  - Test Control
  - Emulation Control
  - Event Controller
  - Watchdog Timer
  - Memory DMA
  - RTC
  - PLL
- System Interface Unit (L1 memory):
  - Up to 64KB Inst.
    - SRAM 32KB
    - SRAM/Cache 32KB
  - Up to 64KB Data
    - SRAM 32KB
    - SRAM/Cache 32KB
  - Scratch Pad 4KB
- Peripheral Blocks:
  - 16-bit External Memory
  - 10/100 Ethernet MAC / 16 GPIO
  - SPORT1, UART0-1, SPI0, Timer0-7, PPI*
  - SPORT0 / I²C / CAN*
Blackfin Operating System Support

- Basic Needs
- Limited Budget
- FREE with VisualDSP++™
- Media / Web centered
- Embedded XML
- #1 TCP/IP Stack in World
- OSEK Compliant
- Safety Critical
- Performance Driven
- Minimal Code Size
- De facto Std in Academic World
- Broad User Community
- Free Connotation
- Comprehensive Product Portfolio beyond Kernel
- Comprehensive CPU coverage for easy switch
- Broad Coverage and Highly Integrated
- Consumer Media
- Audio/Video
- Network Connected
- Automotive
- Telematics
- Consumer
- Media / STB
- PC & Peripheral
- Traditional MCU
- From Desktop to Embedded Devices
- Consumer
- Telecomm
- Industrial
- Networking
Operating Systems

Real Time Operating Systems
- VDK from ADI
- Unicoi Fusion RTOS
- Nucleus PLUS
- ThreadX
- CMX
- Live Devices
- uTRON (API)

Operating Systems
- Embedded Linux (BF535); BF531/2/3 support in development

Networking Stacks
- Kadak Kwik-Net
- Unicoi Fusion Net
- Net-X

Section 2

Introduction to VisualDSP++
VisualDSP++ 4.0

- VisualDSP++ is an integrated development environment that enables efficient management of projects.

- **Key Features Include:**
  - Editing
  - Building
    - Compiler, assembler, linker
  - Debugging
    - Simulation, Emulation, EZ-KIT
    - Run, Step, Halt
    - Breakpoints, Watchpoints
    - Advanced plotting and profiling capabilities
    - Pipeline and cache viewers
VisualDSP++

- **What comes with VisualDSP++?**
  - Integrated Development and Debugger Environment (IDDE), C/C++ Compiler, Assembler, Linker, VDK, Emulation and Simulation Support, On-line help and documentation
  - Part #: VDSP-BLKFN-FULL
  - Floating License Part #: VDSP-BLKFN-PCFLOAT

- **VisualDSP++ is a common development environment for all ADI processor families**
  - **Blackfin**
    - ADSP-BF5xx
  - **TigerSHARC**
    - ADSP-TSxxx
  - **SHARC**
    - ADSP-21xxx

- Each processor family requires a separate license
Features of VisualDSP++ 4.0

- Integrated Development and Debugger Environment (IDDE)
  - Multiple workspaces, projects, project groups
- Project Wizard
  - Create/configure a DSP project
- High level language support including C and C++
- Expert Linker
  - Graphical support for managing linker description files
  - Code profiling support
- Easy to use Online Help
- BTC (Background Telemetry Channel) Support
  - Data Streaming and Logging
- Easy to test and verify applications with scripts (TCL, VB, Java)
- VisualDSP++ RTOS/Kernel/Scheduler (VDK)
- Integrated Source Code Control
- Device Drivers and System Services
Software Development Flow

Code Generation

- Generate Assembly Source (.ASM)
- Generate C/C++ Source (.C/CPP)

and/or

- Assembler .DOJ
- C/C++ Compiler .S

Linker .DXE

Software Verification

- VisualDSP++ Simulator

System Verification

- Working Code?
  - NO
  - YES

- Hardware Evaluation EZ-Kit Lite
- Target Verification ICE
- ROM Production LOADER .LDR
- PROM Burner
Invoking the Software Tools

- **Software tools may be configured and called by the IDDE**
  - Software tools are configured via property pages
  - The IDDE calls the software tools it needs to complete the build
    - GUI front end to a command line ‘make’ utility
- **Software tools can be invoked from a Command line**
  - C Compiler: `ccblkfn sourcefile -switch [-switch...]`
  - Assembler: `easmblkfn sourcefile -switch [-switch...]`
  - Linker: `linker object [object...] -switch [-switch...]`
  - Loader: `elfloader executable -switch [-switch...]`
- **For the complete list of switches see the appropriate tools manual**
Integrated Development and Debugger Environment (IDDE) Features

- IDDE allows one to manage the project build
- The user configures the project and the development tools via property pages
- **Project Property pages** configure the project
  - Project Property Page
  - General Property Page
  - Pre Build Property Page
  - Post Build Property Page
- **Development Tools Property Pages** are used to configure the development tools
  - Assembler Property Page
  - Compiler Property Page
  - Linker Property Page
  - Loader Property Page
Project Development

• Create a project
  – All development in VisualDSP++ occurs within a project.
  – The project file (.DPJ) stores your program’s build information: source files list and development tools option settings
  – A project group file (.DPG) contains a list of projects that make up an application (e.g. an ADSP-BF561 dual-core application)
Configure project options

- Define the target processor and set up your project options (or accept default settings) before adding files to the project.
- The Project Options dialog box provides access to project options, which enable the corresponding build tools to process the project’s files correctly.

Enable building for a specific revision of silicon
- No need to specify ‘-si-revision’ switch
- Automatic will attempt to determine revision of the attached target
- Or select a specific revision level (e.g. 0.3)
Property Pages

C/C++ Compiler Property Page

Assembler Property Page

Property Pages

Linker Property Page

Loader Property Page
Selecting VisualDSP++ Sessions

- **Sessions define Debug Environments**
- **Select Sessions pull down menu**
  - Choose Sessions List
  - Select Session to activate
- **Define New Session from Session List**
  - Select New Session
  - Configure session as required e.g.
    - Debug target: ADSP-BF53x Family Simulator
    - Platform: ADSP-BF53x Single Processor Simulator
    - Session name: ADSP-BF533 ADSP-BF53x Single Processor Simulator

- **Click OK**
  - Session name will appear in Session List
- **Click Activate**
  - IDDE session will open

Debug Features

- Single Step
- Run
- Halt
- Set Breakpoints
- Register Viewing
- Memory
  - Viewing
  - Plotting
  - Dump/Fill
- Code Optimization Utilities
  - Profiling
  - Pipeline Viewer
  - Cache Viewer
- Compiled Simulation
- High Level Language debug support
  - Mixed mode
Online Help

- Fully searchable and indexed online help
- Includes quick overviews on using VisualDSP++ and all of its features.
- Excellent supplement to the manual for things that are better represented visually such as what various plot windows should look like.
- Customizable by using the “Favorites” window
Online Help Example

VDK State History Window Operations

Status Bar
The status bar (bottom of plot) of the State History page of the VDK State History window shows the event's details and thread status when the data cursor is enabled. Event details include the event type, the tick when the event occurred, and an event value.

The value for a thread-switched event indicates the thread being switched in or out.
The status bar indicates thread status for the active location.

Data Cursor
What is VDK?

- VDK is a kernel not an operating system
- VDK comprises:
  - VDK libraries
  - VDK-specific LDF files
  - Include files
  - Template files
- Overheads
  - Memory overhead
  - Minimum memory requirement is platform dependent
  - Footprint is one of the most important metrics for a RT kernel
  - MIPS overhead
Coding Guidelines for Achieving Optimal C Performance on Blackfin
Strategic Objective: Make C as fast as assembler!

Advantages:
- C is much cheaper to develop.
- C is much cheaper to maintain.
- C is comparatively portable.

Disadvantages:
- ANSI C is not designed for DSP.
- DSP processor designs usually expect assembly in key areas.
- DSP applications continue to evolve.
Pillars of Effective Programming

- Understand Underlying Hardware Capabilities
- Discover What Compiler Can Provide
- Design Program Effectively
  - general choice of algorithm
  - choice of data representation
  - finer low-level programming decisions

- Performance tuning usually specialises the program for particular hardware. The program may grow larger or more complex, and it becomes less portable.
Analog C Compiler (VDSP++ 4.0)

- **State-of-the-art optimizer.**
  - Provides flexibility
  - Ease of adding architecture-specific optimizations

- **Exploitation of explicit parallelism in the architecture**
  - Vectorization – exploiting wide load capabilities
  - Recognizing SIMD opportunities
  - Software pipelining

- **Whole Program Analysis**
  - A wider view enables the optimizer to be more aggressive.
Optimizer improvements in VDSP++ 4.0

- **Intelligent Vectorization**
  - More flexible, heuristic based vectorization.

- **Unroll and Jam**
  - Unroll outer loop and combine resulting copies of inner loop.

- **Minimising Call Overhead**
  - Can supply list of registers altered by a function.
Other new features with VDSP 3.5

- long long support - 64-bit integer support
- Enhanced GNU compatibility features.
- compiler built-ins added for Blackfin video operations.
- ADSP-BF561 support
- multiple-heap support
- improved cache support
- C++ Exception Handling
- Profile-Guided Optimization
Understanding Underlying Hardware

Isn’t C supposed to be portable & machine independent?
- yes, but at a price!
- Uniform computational model, BUT….  
  - missing operations provided by software emulation (slow)
  - for example: C provides floating point arithmetic everywhere
- C is more machine-dependent than you might think
  - for example: is a “short” 16 or 32 bits? (more later)

Machine’s Characteristics will determine your success.

C programs can be ported with little difficulty.

But if you want high efficiency, you can’t ignore the underlying hardware
Evaluate Algorithm against Hardware.

- What’s the native arithmetic support?
  - Can we use floating point hardware?
  - How wide is the integer arithmetic?
    - Doing 64-bit arithmetic on a 32-bit unit is slow
    - Doing 16-bit arithmetic on a 32-bit part is awkward
  - Can we use packed data operations?
    - 2x16 arithmetic might be ideal for your application
      (more computation per cycle, less memory usage)
    - Implications for data types, memory layout, algorithms

- What is the computational bandwidth and throughput?
  - What are the key operations required by your algorithm?
  - (Macs?, loads?, stores?....)
  - How fast can the computer perform them?
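To make the packed-data point concrete, the following is a minimal sketch in portable C of what a 2x16 add does: two 16-bit lanes share one 32-bit word and are added in a single operation. The function names `pack2x16` and `add2x16` are illustrative only; this emulates the idea in plain C and is not the Blackfin instruction or a VisualDSP++ built-in.

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hi in bits 31..16). */
uint32_t pack2x16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Lane-wise wrap-around add of two packed words. The top bit of each
 * lane is handled separately so that a carry never crosses into the
 * neighbouring lane. */
uint32_t add2x16(uint32_t a, uint32_t b)
{
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

On hardware with native packed arithmetic the masking disappears and the add is a single instruction, which is where the "more computation per cycle, less memory usage" benefit comes from.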
DSPs Present Some Unique Problems

◆ Special Aspects of Digital Signal Processors:
  ● Reduced memory
  ● Extended precision accumulators
  ● Specialized architectural features
    If not well modeled by C: lose portability and efficiency
    ✿ Example: Zero overhead loop – good
    ✿ Fractional arithmetic - problem.
  ● Mathematical focus (historically not C’s orientation)

◆ Features which compiler must exploit
  ● Efficient Load / Store Operations in Parallel
  ● Utilize multiple Data-paths; SISD, SIMD, MIMD operations
  ● Minimize memory utilization
C and the Compiler

- C provides common computational model
  - portability
  - higher level
- Compiler’s job: map this to a particular machine
  - tries for optimal use of instructions
  - supplement by instruction sequences or library calls
- Optimizer improves performance
  - do things less often, more cheaply
  - try to utilize resources fully
- Optimizing Compiler has Limited Scope
  - will not make global changes
  - will not substitute a different algorithm
  - will not significantly rearrange data or use different types
  - correctness as defined in the language is the priority
Overview of Compilation

Compiler:
(1) makes a straightforward translation
   • fully sequential
   • each individual step as written
(2) then improves it (optimization)
   • transforms it into an equivalent one
     • hopefully faster and smaller
     • must get same “answers”
   • Simple Guiding Principle:
     • Avoid Work
     • Reduce Generality
     • Do things in parallel

This form provides clearest debugging
Summary: How to go about increasing performance.

1. Work at high level first
   most effective -- maintains portability
   - improve algorithm
   - make sure it’s suited to hardware architecture
   - check on generality and aliasing problems

2. Look at machine capabilities
   - may have specialized instructions (library/portable)
   - check handling of DSP-specific demands

3. Non-portable changes last
   - in C?
   - in assembly language?
   - always make sure simple C models exist for verification.
   - Compiler will improve with each release
Choose!
Optimized C or Out of the Box C?

- OTB or “out of the box” C is portable code.
  But most platforms allow some “elaboration” of the source.

- #pragmas. - (Compiler specific assertions.)
- __builtin functions.
- Memory qualifiers – const, restrict, volatile, bank.

- These can specify alignment, cycle iteration count, SIMD, memory type. Or access specific machine instructions one to one.
- Optimized C can go very much faster than “out of the box C”.
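As a small illustration of "elaborating" the source, the sketch below uses only the standard C99 `restrict` and `const` qualifiers. The function name `add_offset` is hypothetical. Compiler-specific pragmas and built-ins are deliberately left out because their exact spelling should be taken from the compiler manual.

```c
/* C99 'restrict' promises the compiler that dst and src never overlap,
 * so loads from src may be scheduled ahead of earlier stores to dst.
 * 'const' documents that the input array is only read. */
void add_offset(short *restrict dst, const short *restrict src,
                short offset, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = (short)(src[i] + offset);
}
```

The promise is the programmer's responsibility: calling this with overlapping arrays is undefined behaviour, which is the price of giving the optimizer more freedom.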

OTB C compilers are improving rapidly. EDN: Improvement in the last 2 years.
Use the Optimizer!

- There is a massive effect from optimization on a DSP platform. (Much more than on RISC chips)
- Non-optimised code is up to 20 times slower.
- Sliding scale from control code to DSP inner loop.
- Non-optimized code is only for debugging the algorithm.
  - (You can also perform limited debugging of optimized code with `-O -g`, which gives access to global variables, function names and line numbers.)
Un-Optimized Code for Blackfin

Unoptimized assembly:

```assembly
    [FP+ -8] = R7;          // store loop counter i
._P1L1:
    R3=[FP+ -8];            // loop control: reload i,
    R2 = 150 (X);
    CC = R3 < R2;           //   test i < 150,
    IF !CC JUMP ._P1L3 ;    //   exit when done
    R3 <<= 1;               // load a[i]
    P2 = R3 ;
    P0=[FP+ 8];
    P0 = P0 + P2;
    R1=W[P0+ 0] (X);
    R0=[FP+ -8];            // load b[i]
    R0 <<= 1;
    P1 = R0 ;
    P2=[FP+ 12];
    P2 = P2 + P1;
    R7=W[P2+ 0] (X);
    R7 *= R1;               // sum += a[i] * b[i]
    R1=[FP+ -4];
    R0 = R1 + R7;
    [FP+ -4] = R0;
    R3=[FP+ -8];            // load b[i]
    R3 <<= 1;
    P0 = R3 ;
    P1=[FP+ 12];
    P1 = P1 + P0;
    R1=W[P1+ 0] (X);
    R7=[FP+ -8];            // load b[i] again
    R7 <<= 1;
    P2 = R7 ;
    P1=[FP+ 12];
    P1 = P1 + P2;
    R3=W[P1+ 0] (X);
    R3 *= R1 ;              // sqr += b[i] * b[i]
    R1=[FP+ 16];
    R7 = R1 + R3;
    [FP+ 16] = R7;
    R3=[FP+ -8];            // increment i
    R3 += 1;
    [FP+ -8] = R3;
    JUMP ._P1L1;            // repeat loop
```


The source code:

```c
for (i = 0; i < 150; i++) {
    dotp += b[i] * a[i];
    sqr  += b[i] * b[i];
}
```

The Optimised assembly

- easier to understand!

```assembly
LSETUP (.P1L2 , .P1L3-8) LC0=P1;

.P1L2:
    A1+= R0.H*R0.H, A0+= R0.L*R0.H (IS)
    || R0.L = W[I1++]
    || R0.H = W[I0++];

.P1L3:
```
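For reference, here is a self-contained C version of the loop that both listings implement. It is a sketch under assumptions: the slide does not show the declarations, so the 16-bit element type, the 32-bit accumulators and the function name `dot_and_square` are guesses based on the `W[...]` loads and the A0/A1 accumulators in the optimized listing.

```c
#include <stdint.h>

#define N 150  /* iteration count taken from the loop on the slide */

/* Accumulate the dot product of a and b and the sum of squares of b
 * in one pass. Inputs are assumed small enough that the 32-bit
 * accumulators do not overflow. */
void dot_and_square(const int16_t *a, const int16_t *b,
                    int32_t *dotp, int32_t *sqr)
{
    int32_t d = 0;   /* local accumulators can stay in registers */
    int32_t s = 0;
    int i;

    for (i = 0; i < N; i++) {
        d += (int32_t)b[i] * a[i];
        s += (int32_t)b[i] * b[i];
    }
    *dotp = d;
    *sqr  = s;
}
```

Keeping a plain C model like this alongside any tuned version gives a simple reference to verify results against.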

General Principles of Optimizer

The Optimizer Looks at Each Operation:

- **Try not to do it at all**
  - perhaps not actually needed
  - calculate at compile-time
  - re-use previous calculation

- **Do it more cheaply**
  - avoiding storing in memory

- **Do it more efficiently**
  - use special resources
  - do more than one thing at a time

- **Loops get special attention**
  - *Biggest Savings of All*

---

The compiler is your partner
You can count on certain optimizations being done

## Compiler command line options

<table>
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-O</td>
<td>Optimize</td>
</tr>
<tr>
<td>-Oa</td>
<td>Optimize with auto-inlining</td>
</tr>
<tr>
<td>-Os</td>
<td>Optimize space sensitively</td>
</tr>
<tr>
<td>-Ov</td>
<td>Optimize with user control of balance between size and speed</td>
</tr>
<tr>
<td>-ipa</td>
<td>Whole program analysis</td>
</tr>
<tr>
<td>-save-temps</td>
<td>Preserves compiler output (.s)</td>
</tr>
</tbody>
</table>
Leave the low level concerns to the compiler.
Leave basic operations to the compiler.

(1) \( a = b \times c; \)

(2) \( d = a + f; \) The value of ‘a’ can be used directly from the register; eliminate the load from memory.

(3) \( a = b - g; \) A new value is assigned to ‘a’, so the value stored at (1) is never used; eliminate the store to memory.

Straightforward code

\[
\begin{align*}
R2 &= [b]; \\
R3 &= [c]; \\
R1 &= R2 \times R3; \\
[a] &= R1; \\
R1 &= [a]; \\
R6 &= [f]; \\
R4 &= R1 + R6; \\
[d] &= R4; \\
R2 &= [b]; \\
R7 &= [g]; \\
R1 &= R2 - R7; \\
[a] &= R1;
\end{align*}
\]

12 cycles

Optimized code

\[
\begin{align*}
R2 &= [b]; \\
R3 &= [c]; \\
R1 &= R2 \times R3; \\
R6 &= [f]; \\
R4 &= R1 + R6; \\
[d] &= R4; \\
R7 &= [g]; \\
R1 &= R2 - R7; \\
[a] &= R1;
\end{align*}
\]

9 cycles
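The same three statements can be written as compilable C to see when these eliminations are allowed. This is an illustrative sketch: the variable and function names are made up, and whether a given compiler performs each elimination depends on its optimization settings.

```c
/* Ordinary globals: the optimizer may keep 'a' in a register between
 * statements (1) and (2) and drop the store from (1), because only
 * the final value of 'a' is observable after the function returns. */
int a, b, c, d, f, g;

void three_statements(void)
{
    a = b * c;   /* (1) */
    d = a + f;   /* (2) reuses the value just computed */
    a = b - g;   /* (3) overwrites 'a'; the store from (1) is dead */
}

/* Declaring the variable volatile requires every load and store to be
 * performed as written, so none of the eliminations above are legal. */
volatile int va;
int vd;

void three_statements_volatile(void)
{
    va = b * c;
    vd = va + f;
    va = b - g;
}
```

Both functions compute the same values; they differ only in how much memory traffic the compiler is obliged to keep.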

Leave scheduling to the compiler.

(1) \( a = b \times c; \)
(2) \( d = a + f; \)
(3) \( a = b - g; \)

Take advantage of hardware parallelism: consider dispatching multiple instructions in one cycle

Optimized code

\[
\begin{align*}
R2 &= [b]; \\
R3 &= [c]; \\
R1 &= R2 \times R3; \\
R6 &= [f]; \\
R4 &= R1 + R6; \\
[d] &= R4; \\
R7 &= [g]; \\
R1 &= R2 - R7; \\
[a] &= R1;
\end{align*}
\]

Scheduled code

\[
\begin{align*}
R2 &= [b]; \\
R3 &= [c]; \\
R1 &= R2 \times R3, R6 = [f]; \\
R4 &= R1 + R6, R7 = [g]; \\
R1 &= R2 - R7, [d] = R4; \\
[a] &= R1;
\end{align*}
\]
Compilers understand Loops

Simple counted loop:
Use zero-overhead loop mechanism

```
for (j=0; j<N; j++) {
}
```

C and D don’t change during loop:
Load them into registers outside

Combine reference with incrementing pointer
(Use post-modify addressing)

COMPILER DOES THE LOW-LEVEL WORK

Addressing Operations are Fully Efficient

```c
pA = &A[0];
pB = &B[0];
pP = &P[N-1];
pQ = &Q[N-1];

for (j=0; j<N; j++) {
    *pP++
    *pQ++
    *pA++
    *pB--
}
```

> zero-overhead loop
> C, D loop invariant, loaded once outside loop

You Can Count on the Optimizer to Do This Transformation
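A hedged illustration of this transformation: the two functions below compute the same result, once with array indexing and once with post-incremented pointers. The loop body on the slides is not recoverable, so the computation `c * in[j] + d` and the function names are assumptions chosen only to show loop-invariant values and pointer stepping.

```c
/* Index form: written for clarity. c and d never change in the loop,
 * so the optimizer can hold them in registers. */
void scale_offset_indexed(int *out, const int *in, int c, int d, int n)
{
    int j;
    for (j = 0; j < n; j++)
        out[j] = c * in[j] + d;
}

/* Pointer form: roughly what the optimizer derives from the index
 * form (each reference combined with a post-increment). */
void scale_offset_pointer(int *out, const int *in, int c, int d, int n)
{
    int j;
    for (j = 0; j < n; j++)
        *out++ = c * *in++ + d;
}
```

Since the optimizer is expected to make this change itself, the index form is usually the better one to write and maintain.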

How can we improve on the compiler's effort?
Getting Started  80:20

Find out where program spends its time.

- **80 – 20 rule**
- **Measure:** Intuition is notoriously bad here: instrument, use profiler and cycle accurate simulator.
- **Loops:** Are always a good place to look. Even a trivial operation can have a significant cost, if it is done often enough.
Use the Statistical Profiler

- Statistical profiling samples the program counter of the running application and builds up a picture of where it spends its time.
- Completely non-intrusive – no tracing code is added.
- Completely accurate – shows all effects, including stalls.
- Don’t assume you know where an application spends its time – profile it.
VDSP Statistical Profiler

- The profiler is very useful in C/C++ mode because it makes it easy to benchmark a system module-by-module (i.e. C/C++ function).
- Assembly or optimised code appears as individual instructions.

Linear Profiler is also available for the simulator.
Look closely at cycles in critical areas.

- **Cycle Accurate Simulator.**
  - Step through the code identified by the Statistical profiler. Watch the Cycle counter.

- **Pipeline Viewer.**
  - Close in on causes of stalls with the pipeline viewer.
**VDSP Pipeline Viewer**

- Accessed through View->Debug Windows->Pipeline Viewer in a simulator session (not available in emulator)

### Pipeline Viewer

<table>
<thead>
<tr>
<th>Cycle</th>
<th>IF1</th>
<th>IF2</th>
<th>DECODE</th>
<th>ADDRESS</th>
<th>COMMIT</th>
</tr>
</thead>
<tbody>
<tr>
<td>23</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td>LSET...</td>
<td>P0 =...</td>
<td></td>
</tr>
<tr>
<td>24</td>
<td>P0 =...</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td>LSET...</td>
<td></td>
</tr>
<tr>
<td>25</td>
<td>P1 =...</td>
<td>P0 =...</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td></td>
</tr>
<tr>
<td>26</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td>P0 =...</td>
<td>R1.L...</td>
<td></td>
</tr>
<tr>
<td>27</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td>P0 =...</td>
<td></td>
</tr>
<tr>
<td>28</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td></td>
</tr>
<tr>
<td>29</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td></td>
</tr>
<tr>
<td>30</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td></td>
</tr>
<tr>
<td>31</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
<td></td>
</tr>
<tr>
<td>32</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>R2.L...</td>
<td>P1 =...</td>
</tr>
<tr>
<td>33</td>
<td>P0 =...</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td>R3 =...</td>
<td>P1 =...</td>
</tr>
<tr>
<td>34</td>
<td>P1 =...</td>
<td>P0 =...</td>
<td>R1.L...</td>
<td>R0.L...</td>
<td>R3 =...</td>
</tr>
</tbody>
</table>

**Details for stage EX1 (cycle 31)**
- Address: Invalid
- Instruction: Invalid
- Event 0:
  - Type: Stall
  - Cause: Dagreg RAW hazard
  - Details: Stall in stage AC due to stage EX3

---

How about the “pipeline”?

- **Deep pipeline processors:**
  - pipelines do badly on conditionally branching code and on table lookup
  - sometimes branches can be avoided by using other techniques

- **Is there a latency associated with computations?**
  (results not ready on next cycle)
  - latency can be hidden within a loop
  - hiding latencies involves loop setup overhead -- a problem if iteration counts are low

- **C Compiler will do its best, but inherent hardware limitations will always influence the outcome**

- **Pipeline is FULLY interlocked and interruptible!**
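One common way to avoid a branch inside a loop is to turn the condition into a value selection. The sketch below shows both forms; the function names are illustrative, and whether the compiler emits a conditional move rather than a jump for the second form is an assumption that should be checked in the generated assembly.

```c
#include <stdint.h>

/* Branching form: one conditional jump per element. */
int32_t sum_positive_branch(const int16_t *x, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] > 0)
            sum += x[i];
    }
    return sum;
}

/* Select form: the condition chooses a value instead of a path, which
 * gives the compiler the option of using a conditional move. */
int32_t sum_positive_select(const int16_t *x, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        int32_t v = x[i];
        sum += (v > 0) ? v : 0;
    }
    return sum;
}
```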
Blackfin Pipeline Latencies

1. Multiply/Video Operation Latencies (One stall)

\[
R0 = R4; \\
\text{STALL} \\
R2.H = R1.L \times R0.H;
\]

2. Load to DAG Latencies (Three stalls)

\[
P3 = [\text{SP}++] ; \\
\text{STALL} \\
\text{STALL} \\
\text{STALL} \\
\text{STALL} \\
R0 = P3;
\]

3. Sub-bank access collision (One stall)

\[
\text{STALL} \\
R1 = R4.L \times R5.H (IS) \text{ || } R3 = [I0++] \text{ || } R4 = [I1++] ;
\]
Blackfin Pipeline Latencies (2)

- 4. Instruction flow dependencies
  Correctly predicted branch (4 stalls)
  Incorrectly predicted branch (8 stalls)

- 5. Store buffer load collision
  
  \[
  W[P0] = R0; \\
  \text{STALL} \\
  R1 = W[P0];
  \]

- 6. Hardware loop latencies
  
  (example is instructions between lsetup and loop top)

  
  \[
  \text{LSETUP(top, bottom) LC0 = P0;} \\
  (3 \text{ STALLS}) \\
  \text{P0 = R0;} \\
  \text{top:}
  \]
Latency -> affects programming style

- Take care with structure depth.
  - p->q->z is inefficient to access.
  - (And hard on pointer analysis. What data does this reference?)

- Take care with Table Lookup.
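The structure-depth advice can be sketched as follows. The types `struct gains` and `struct filter` and both function names are hypothetical; the point is only the difference between dereferencing `p->q->z` inside the loop and reading it once beforehand.

```c
struct gains  { int z; };
struct filter { struct gains *q; };

/* Re-reads p->q->z on every iteration: two dependent loads each time,
 * unless the compiler can prove that out[] never aliases p or p->q. */
void apply_deep(int *out, const int *in, int n, const struct filter *p)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * p->q->z;
}

/* Reads the value once into a local before the loop. */
void apply_hoisted(int *out, const int *in, int n, const struct filter *p)
{
    const int z = p->q->z;
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * z;
}
```

The hoisted form removes the dependent loads from the loop body and also spares the compiler the pointer analysis question of what `p->q->z` might alias.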
Data types
## Native C Data Types on Blackfin

<table>
<thead>
<tr>
<th>Data Type</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td>char</td>
<td>8-bit signed</td>
</tr>
<tr>
<td>unsigned char</td>
<td>8-bit unsigned</td>
</tr>
<tr>
<td>short</td>
<td>16-bit signed integer</td>
</tr>
<tr>
<td>unsigned short</td>
<td>16-bit unsigned integer</td>
</tr>
<tr>
<td>int</td>
<td>32-bit signed integer</td>
</tr>
<tr>
<td>unsigned int</td>
<td>32-bit unsigned integer</td>
</tr>
<tr>
<td>long</td>
<td>32-bit signed integer</td>
</tr>
<tr>
<td>unsigned long</td>
<td>32-bit unsigned integer</td>
</tr>
</tbody>
</table>

- float (32-bit), double (32-bit), long long (64-bit) and unsigned long long (64-bit) are not supported by the hardware.
An efficient floating Point Emulation.

<table>
<thead>
<tr>
<th>Measurement in cycles</th>
<th>TI 55xx</th>
<th>BF532</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multiply</td>
<td>330</td>
<td>95</td>
</tr>
<tr>
<td>Add</td>
<td>163</td>
<td>108</td>
</tr>
<tr>
<td>Subtract</td>
<td>195</td>
<td>145</td>
</tr>
<tr>
<td>Divide</td>
<td>655</td>
<td>246</td>
</tr>
<tr>
<td>Sine</td>
<td>5341</td>
<td>2164</td>
</tr>
<tr>
<td>Cos</td>
<td>5942</td>
<td>2029</td>
</tr>
<tr>
<td>Square Root</td>
<td>5836</td>
<td>316</td>
</tr>
</tbody>
</table>

And then add in the MHz advantage.

Note: Our Square root uses a better algorithm!

Smaller is better!
Wide support for Fractional processing.

- The Blackfin instruction set includes a number of operations which support fractional (or fract) data. The instructions include:
  - saturating MAC/ALU/SHIFT instructions
  - MAC shift correction for fractional inputs

- The compiler and libraries provide support for fractional types:
  - fractional builtins
  - fract types fract16 and fract32
  - ETSI
  - C++ fract class

- Fractional arithmetic can be around a hundred times faster than emulated floating point!
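
To make the idea of saturating fractional arithmetic concrete, here is a portable C sketch of 16-bit fractional (Q1.15) add and multiply. The names `q15_t`, `q15_add` and `q15_mul` are hypothetical; they are not the compiler's `fract16` built-ins or the ETSI functions, which map to single Blackfin instructions. The sketch only shows what those operations compute.

```c
#include <stdint.h>

/* Q1.15 fractional value: an int16_t interpreted as value / 32768. */
typedef int16_t q15_t;

/* Clamp a 32-bit intermediate to the 16-bit range (saturation). */
static q15_t q15_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Saturating add: overflow clamps instead of wrapping around. */
q15_t q15_add(q15_t a, q15_t b)
{
    return q15_saturate((int32_t)a + (int32_t)b);
}

/* Fractional multiply: the 32-bit product of two Q1.15 values is Q2.30,
   so shift right by 15 to return to Q1.15. Only -1.0 * -1.0 overflows,
   and it saturates. Assumes '>>' on a negative value is an arithmetic
   shift (implementation-defined in C, true of mainstream compilers). */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return q15_saturate(product >> 15);
}
```

For example, `q15_mul(16384, 16384)` multiplies 0.5 by 0.5 and yields 8192, which is 0.25 in Q1.15.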
ETSI Builtins – fully optimised Fractional arithmetic to a standard specification.

- European Telecommunications Standards Institute's fract functions carefully mapped onto the compiler built-ins.

<table>
<thead>
<tr>
<th>Function</th>
<th>Function</th>
<th>Function</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>add()</td>
<td>sub()</td>
<td>abs_s()</td>
<td>shl()</td>
</tr>
<tr>
<td>shr()</td>
<td>mult()</td>
<td>mult_r()</td>
<td>negate()</td>
</tr>
<tr>
<td>round()</td>
<td>L_add()</td>
<td>L_sub()</td>
<td>L_abs()</td>
</tr>
<tr>
<td>L_negate()</td>
<td>L_shl()</td>
<td>L_shr()</td>
<td>L_mult()</td>
</tr>
<tr>
<td>L_mac()</td>
<td>L_msu()</td>
<td>saturate()</td>
<td>extract_h()</td>
</tr>
<tr>
<td>extract_l()</td>
<td>L_deposit_l()</td>
<td>L_deposit_h()</td>
<td>div_s()</td>
</tr>
<tr>
<td>norm_s()</td>
<td>norm_l()</td>
<td>L_Extract()</td>
<td>L_Comp()</td>
</tr>
<tr>
<td>Mpy_32()</td>
<td>Mpy_32_16()</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- Immediate optimisation of ETSI standard codecs.
- Highly recommended!
Pointers or Arrays?

- Arrays are easier to analyse.
  ```c
  void va_ind(int a[], int b[], int out[], int n) {
    int i;
    for (i = 0; i < n; ++i)
      out[i] = a[i] + b[i];
  }
  ```
- Pointers are closer to the hardware.
  ```c
  void va_ptr(int a[], int b[], int out[], int n) {
    int i;
    for (i = 0; i < n; ++i)
      *out++ = *a++ + *b++;
  }
  ```
- Which produces the faster code?
Pointers or Arrays? 2

- Often no difference.
- Sometimes one version may do better for an algorithm.
- Not always the same style that wins.

- Start with array notation, as it is easier to understand.
- Array format can be better for alias analysis in helping to ensure no overlap.
- If performance is unsatisfactory try using pointers.
- Outside critical loops stay with array notation.
Tricks
( useful transformations )
Avoid Division.

- There are no divide instructions – just supporting instructions.
- Floating-point or integer division is very costly.
- Remember Modulus( % ) also implies division.

Get Division out of loops wherever possible.
Exception – Division by powers of 2.

- Division by power of 2 rendered as right shift – very efficient.
- Unsigned dividend – one cycle. (A division call costs 35 cycles.)
- Signed dividend – more expensive, because a plain arithmetic shift rounds negative values toward minus infinity. (Could cast to unsigned?)
  x / 2^n = ((x < 0) ? (x + 2^n - 1) : x) >> n    // Consider -1/4 = 0!

- Example: signed int / 16
  
  R3 = [I1];  // load dividend
  CC = R3 < 0;  // check if negative
  R1 = 15;  // 2^n - 1
  R2 = R3 + R1;  // dividend + (2^n - 1)
  IF CC R3 = R2 ;  // if dividend negative use addition result
  R3 >>>= 4;  // do the divide as an arithmetic shift

- Ensure compiler has visibility. Divisor must be unambiguous.
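
The shift rule above can be written out in C as follows. This is a sketch with hypothetical function names. It assumes `0 <= n < 31` and that `>>` on a negative signed value is an arithmetic shift, which C leaves implementation-defined but which matches Blackfin's `>>>=` and mainstream compilers.

```c
#include <stdint.h>

/* Signed division by 2^n using a shift, rounding toward zero like C's '/'. */
int32_t div_pow2(int32_t x, unsigned n)
{
    int32_t bias = (int32_t)((1u << n) - 1u);   /* 2^n - 1 */
    if (x < 0)
        x += bias;      /* bias negative dividends so the shift rounds toward zero */
    return x >> n;
}

/* Unsigned dividends need no correction: a single shift. */
uint32_t udiv_pow2(uint32_t x, unsigned n)
{
    return x >> n;
}
```

Without the bias, `-1 >> 2` would give -1 rather than the 0 that `-1 / 4` produces.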
Beware Hidden Division

- Division can be created by For loops.
- Sometimes the compiler will calculate number of iterations.

```c
for ( l = start; l < finish; l += step )
```

compiler plants code to calculate:

```c
iterations = (finish-start) / step
```
Division Trick 1 – Multiply by Reciprocal.

```c
float recip_NUM_SAMPS = 1.0/NUM_SAMPS;
for (i=0; i<NC; i++) {
    for (j=0; j<NC; j++) {
        float sum = 0.0;
        for (k=0; k<NUM_SAMPS; k++)
            sum += Input[i*NC + k] * Input[j*NC + k];
        Cover[i*NC + j] = sum / NUM_SAMPS;
        // better: Cover[i*NC + j] = sum * recip_NUM_SAMPS;
    }
}
```

- **Replace Division by Multiplication by Reciprocal**
  - helps when divisor is locally constant
  - answer may be slightly different - is this OK?
Use the laws of Algebra

Original customer benchmark compares ratios coded as:

\[
\text{if} \quad \left( \frac{X}{Y} > \frac{A}{B} \right)
\]

Recode as:

\[
\text{if} \quad \left( X \times B > A \times Y \right)
\]

Another way to lose divisions!

Problem: possible overflow in fixed point.

The compiler does not know anything about the real data precision. The programmer must decide. For instance two 12 bit precision inputs are quite safe. (24 bits max on multiplication.)
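
A sketch of the recoded comparison, with a hypothetical function name. Two assumptions are added here that the slide does not state: both denominators are positive (otherwise multiplying through by them can flip the inequality), and the inputs are 16-bit so that each 32-bit product cannot overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Tests x/y > a/b without dividing.
   Preconditions (assumptions of this sketch): y > 0 and b > 0, so that
   multiplying both sides by y*b keeps the direction of the inequality.
   16-bit inputs guarantee that each 32-bit product cannot overflow. */
bool ratio_greater(int16_t x, int16_t y, int16_t a, int16_t b)
{
    int32_t lhs = (int32_t)x * b;   /* x * b */
    int32_t rhs = (int32_t)a * y;   /* a * y */
    return lhs > rhs;
}
```

Note that this compares the exact ratios. A version written with truncating integer division can give a different answer, for example when comparing 7/2 with 3/1.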
**Replace Conditionals with Min, Max, Abs.**

**Simple bounded decrement**

    k = k-1;
    if (k < -1)
        k = -1;

**Programming “trick”**

    k = max (k-1, -1);

    R0 += -1;
    R1 = -1;
    R0 = MAX (R1, R0);

Avoiding jump instruction latencies and simplifying control flow both help optimisation.

The compiler will often do this automatically for you, but not always in 16 bit cases.

**BF ISA Note:** Min and Max are for signed values only.

Removing Conditionals 2

◆ Pipelined Architecture Problem:

```c
sum = 0;
for (I=0; I<NN; I++) {
    if ( KeyArray[val1][10-k+I] == '1' )
        sum = sum + buffer[I+10]*64;
    else
        sum = sum - buffer[I+10]*64;
}
```

◆ Better Solution removes conditional branch. Multiplication is fast: let KeyArray hold +64 or -64

```c
sum = 0;
for (I=0; I<NN; I++)
    sum += buffer[I+10] * KeyArray[val1][10-k+I];
```

◆ Compiler is not able to make this kind of global change
Removing conditionals 3

- Duplicate small loops rather than have a conditional in a small loop.

- Example

```c
for {
    if { ..... } else {.....}
}

=> if {
    for {.....}
} else {
    for {.....}
}
```
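
A concrete instance of the schematic above, as a sketch with hypothetical function names. The flag `negate` does not change inside the loop, so the test can be made once and each loop body becomes branch-free.

```c
#include <stddef.h>

/* Before: the loop-invariant flag is tested on every iteration. */
void copy_branchy(int *out, const int *in, size_t n, int negate)
{
    for (size_t i = 0; i < n; ++i) {
        if (negate)
            out[i] = -in[i];
        else
            out[i] = in[i];
    }
}

/* After: the test is made once and each small loop has no conditional. */
void copy_unswitched(int *out, const int *in, size_t n, int negate)
{
    if (negate) {
        for (size_t i = 0; i < n; ++i)
            out[i] = -in[i];
    } else {
        for (size_t i = 0; i < n; ++i)
            out[i] = in[i];
    }
}
```

The cost is duplicated loop code, which is why the advice applies to small loops.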
Removing Conditionals 4
Predicated Instruction Support

- The Blackfin predicated instruction support takes the form of:
  - IF (CC) reg = reg.

- Much faster than a conditional branch (1 cycle), but limited to register moves.
- Help the compiler to see the opportunity.

- For instance – consider speculative execution.
  - if (A) X = EXPR1 else X = EXPR2;
  - X = EXPR1; IF (!A) X = EXPR2;
  - Or X=EXPR1; Y=EXPR2; if (!A) X=Y;
Loops
The inner loop

- The optimizer focuses on the inner loop because this is where most programs spend most of their time.
- Considered a good trade off to slow down loop prologue and epilogue to speed up loop.
- Make sure your program spends most of its time in the inner loop.
Allow the optimizer to unroll loops

- The optimizer “works by unrolling loops”.
  - Vectorization
  - Software pipelining

- Do not unroll loops yourself.
- Avoid loop carried dependencies.
- Avoid aliases.
- Do not rotate loops yourself.
Software Pipelining

What is software pipelining?
- Technique used to schedule loops and functional units efficiently.
- Reorganizing the loop in such a way that each iteration of software-pipelined code is made from instructions of different iterations of the original loop.

Simple Dot Product:
load, multiply, accumulate

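<table>
<thead>
<tr>
<th>CYCLE</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6 ..... 100</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iteration 1</td>
<td>F1</td>
<td>M1</td>
<td>A1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Iteration 2</td>
<td></td>
<td>F2</td>
<td>M2</td>
<td>A2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Iteration 3</td>
<td></td>
<td></td>
<td>F3</td>
<td>M3</td>
<td>A3</td>
<td></td>
</tr>
<tr>
<td>Iteration 4</td>
<td></td>
<td></td>
<td></td>
<td>F4</td>
<td>M4</td>
<td>A4</td>
</tr>
</tbody>
</table>

F = load (fetch), M = multiply, A = accumulate.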

- Once the pipeline is full, instructions from several iterations execute in each cycle.
Effects of Vectorization and Software Pipelining on Blackfin

- **Simple code generation:** 1 iteration in 4 instructions
  
  ```
  LSETUP ...;
  R0.L = W[I1++];
  R1.L = W[I0++];
  A1+= R0.L*R1.L;
  ```

- **Vectorized and unrolled once:** 2 iterations in 2 instructions
  
  ```
  R0 = [I1++];
  R1 = [I0++];
  A1+= R0.H*R1.H, A0+= R0.L*R1.L (IS);
  ```

- **Software pipeline:** 2 iterations in 1 instruction
  
  ```
  R0.L = W[I1++] || R0.H= W[I0++];
  LSETUP (_P1L2 , _P1L3-8) LC0=P1;
  .align 8;
  _P1L2:
  A1+= R0.H*R0.H, A0+= R0.L*R0.H (IS) || R0.L = W[I1++] || R0.H= W[I0++];
  _P1L3:
  A1+= R0.H*R0.H, A0+= R0.L*R0.H (IS);
  ```
Do not unroll inner loops yourself

- **Good - compiler unrolls to use both compute blocks.**
  
  ```
  for (i = 0; i < n; ++i)
      c[i] = b[i] + a[i];
  ```

- **Bad - compiler leaves on a single compute block.**
  
  ```
  for (i = 0; i < n; i+=2) {
      xb = b[i]; yb = b[i+1];
      xa = a[i]; ya = a[i+1];
      xc = xa + xb; yc = ya + yb;
      c[i] = xc; c[i+1] = yc;
  }
  ```

- **OK to unroll outer loops.**
Avoid loop carried dependencies

- **Bad: Scalar dependency.**
  ```c
  for (i = 0; i < n; ++i)
      x = a[i] - x;
  ```
  Value used from previous iteration. So iterations cannot be overlapped.

- **Bad: Array dependency.**
  ```c
  for (i = 0; i < n; ++i)
      a[i] = b[i] * a[c[i]];
  ```
  Value may be from previous iteration. So iterations cannot be overlapped.
Resolvable dependencies

- **Good: A Reduction.**
  
  ```c
  for (i = 0; i < n; ++i)
      x = x + a[i];
  ```
  
  Operation is associative. Iterations can be reordered to calculate the same result.

- **Good: Induction variables.**
  
  ```c
  for (i = 0; i < n; ++i)
      a[i+4] = b[i] * a[i];
  ```
  
  Addresses vary by a fixed amount on each iteration. The compiler can see that the dependence distance is exactly four iterations, so neighbouring iterations can still be overlapped.
Avoid aliases

- Is there a loop carried dependence in this loop?

```c
void fn(int a[], int b[], int n) {
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i];
}
```

Yes, if `a` and `b` point into the same array.

- Write your code so they do not point at the same array.
- The `-ipa` switch may help the compiler establish that this is so.
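
Another way to state the no-overlap guarantee in the source itself is the standard C99 `restrict` qualifier. This is a sketch that assumes your compiler accepts `restrict`; the slides do not mention it, so check the toolchain documentation. The function name is hypothetical.

```c
/* 'restrict' is the programmer's promise that, inside this function,
   the memory written through 'a' is not also reached through 'b'.
   Calling it with overlapping arrays is undefined behaviour. */
void copy_noalias(int *restrict a, const int *restrict b, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i];
}
```

With that promise in place the compiler is free to load several `b[i]` values before storing any `a[i]`.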
Do not rotate loops yourself

- A common DSP idiom is to rotate loops so that loads can be executed at the same time as computation.
- Introduces loop carried dependencies.
- Makes code less easy to read.
- The compiler can do it for itself.
- Just don’t do it.
The original loop (good)

    float ss(float *a, float *b, int n) {
        float sum = 0.0f;
        int i;
        for (i = 0; i < n; i++) {
            sum += a[i] + b[i];
        }
        return sum;
    }

A rotated loop (bad)

    float ss(float *a, float *b, int n) {
        float ta, tb, sum = 0.0f;
        int i = 0;
        ta = a[i]; tb = b[i];
        for (i = 1; i < n; i++) {
            sum += ta + tb;
            ta = a[i]; tb = b[i];
        }
        sum += ta + tb;
        return sum;
    }
Experiment with Loop structure

- **Unify inner and outer Loops.**
  - May make loop too complex, but optimiser is better focused.

- **Loop Inversion.** - reverse nested loop order.

- **Unify sequential loops** –
  - reduce memory accesses – can be crucial when dealing with external memory.
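
A sketch of unifying two sequential loops, with hypothetical function names. In the separate version every element of `buf` is written and then read back; in the unified version each value is used while it is still in a register, which matters most when `buf` lives in external memory.

```c
#include <stddef.h>

/* Two sequential loops: 'buf' is written by the first and read back by the second. */
int scale_then_sum_separate(int *buf, const int *in, size_t n, int gain)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        buf[i] = in[i] * gain;
    for (size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Unified loop: 'buf' is written once and never read back. */
int scale_then_sum_fused(int *buf, const int *in, size_t n, int gain)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int v = in[i] * gain;
        buf[i] = v;
        sum += v;
    }
    return sum;
}
```

Whether the unified loop is actually faster depends on how complex the combined body becomes, so it is worth measuring both.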
Section 6

Blackfin ADSP-BF533 Memory
Blackfin Internal SRAM

ADSP-BF531
(84KB Total)

- 32KB Instruction ROM
- 16KB Instruction SRAM
- 16KB Instr SRAM/Cache
- 16KB Data SRAM/Cache
- 4KB Scratchpad

ADSP-BF532
(116KB Total)

- 32KB Instruction ROM
- 32KB Instruction SRAM
- 16KB Instr SRAM/Cache
- 32KB Data SRAM/Cache (16KB per bank)
- 4KB Scratchpad

ADSP-BF533
(148KB Total)

- 32KB Instruction SRAM
- 32KB Instruction SRAM
- 16KB Instr SRAM/Cache
- 32KB Data SRAM (16KB per bank)
- 32KB Data SRAM/Cache (16KB per bank)
- 4KB Scratchpad

ADSP-BF533 Memory Map

<table>
<thead>
<tr>
<th>Memory Address</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xFFE0 0000</td>
<td>Core MMR</td>
</tr>
<tr>
<td>0xFFC0 0000</td>
<td>System MMR</td>
</tr>
<tr>
<td>0xFFB0 1000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0xFFB0 0000</td>
<td>Scratchpad SRAM</td>
</tr>
<tr>
<td>0xFFA1 4000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0xFFA1 0000</td>
<td>Instruction SRAM/Cache</td>
</tr>
<tr>
<td>0xFFA0 C000</td>
<td>Instruction SRAM</td>
</tr>
<tr>
<td>0xFFA0 8000</td>
<td>Instruction SRAM</td>
</tr>
<tr>
<td>0xFFA0 0000</td>
<td>Instruction SRAM</td>
</tr>
<tr>
<td>0xFF90 8000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0xFF90 6000</td>
<td>Data Bank B SRAM/Cache</td>
</tr>
<tr>
<td>0xFF90 4000</td>
<td>Data Bank B SRAM/Cache</td>
</tr>
<tr>
<td>0xFF90 0000</td>
<td>Data Bank B SRAM</td>
</tr>
<tr>
<td>0xFF80 8000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0xFF80 6000</td>
<td>Data Bank A SRAM/Cache</td>
</tr>
<tr>
<td>0xFF80 4000</td>
<td>Data Bank A SRAM/Cache</td>
</tr>
<tr>
<td>0xFF80 0000</td>
<td>Data Bank A SRAM</td>
</tr>
<tr>
<td>0xEF00 0000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0x2040 0000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0x2030 0000</td>
<td>Async Bank 3</td>
</tr>
<tr>
<td>0x2020 0000</td>
<td>Async Bank 2</td>
</tr>
<tr>
<td>0x2010 0000</td>
<td>Async Bank 1</td>
</tr>
<tr>
<td>0x2000 0000</td>
<td>Async Bank 0</td>
</tr>
<tr>
<td>0x0800 0000</td>
<td>Reserved</td>
</tr>
<tr>
<td>0x0000 0000</td>
<td>SDRAM</td>
</tr>
</tbody>
</table>
ADSP-BF532 Memory Map

Each entry gives the start address of the region.

Internal Memory:
- 0xFFE0 0000: Core MMR
- 0xFFC0 0000: System MMR
- 0xFFB0 1000: Reserved
- 0xFFB0 0000: Scratchpad SRAM
- 0xFFA1 4000: Reserved
- 0xFFA1 0000: Instruction SRAM/Cache
- 0xFFA0 C000: Instruction SRAM
- 0xFFA0 8000: Instruction SRAM
- 0xFFA0 0000: Instruction ROM
- 0xFF90 8000: Reserved
- 0xFF90 4000: Data Bank B SRAM/Cache
- 0xFF90 0000: Reserved
- 0xFF80 8000: Reserved
- 0xFF80 4000: Data Bank A SRAM/Cache
- 0xFF80 0000: Reserved
- 0xEF00 0000: Reserved

External Memory:
- 0x2040 0000: Reserved
- 0x2030 0000: Async Bank 3
- 0x2020 0000: Async Bank 2
- 0x2010 0000: Async Bank 1
- 0x2000 0000: Async Bank 0
- 0x0800 0000: Reserved
- 0x0000 0000: SDRAM
ADSP-BF531 Memory Map

- 0xFFE0 0000: Core MMR
- 0xFFC0 0000: System MMR
- 0xFFB0 1000: Reserved
- 0xFFB0 0000: Scratchpad SRAM
- 0xFFA1 4000: Reserved
- 0xFFA1 0000: Instruction SRAM/Cache
- 0xFFA0 C000: Reserved
- 0xFFA0 8000: Instruction SRAM
- 0xFFA0 0000: Instruction ROM
- 0xFF90 8000: Reserved
- 0xFF90 4000: Reserved
- 0xFF90 0000: Reserved
- 0xFF80 8000: Reserved
- 0xFF80 4000: Data Bank A SRAM/Cache
- 0xFF80 0000: Reserved
- 0xEF00 0000: Reserved
- 0x2040 0000: Reserved
- 0x2030 0000: Async Bank 3
- 0x2020 0000: Async Bank 2
- 0x2010 0000: Async Bank 1
- 0x2000 0000: Async Bank 0
- 0x0800 0000: Reserved
- 0x0000 0000: SDRAM

Memory Hierarchy on the BF533

- As processor speeds increase (300 MHz – 1 GHz), it becomes increasingly difficult to have large memories running at full speed.
- The BF53x uses a *memory hierarchy* with a primary goal of achieving memory performance similar to that of the fastest memory (i.e. L1) with an overall cost close to that of the least expensive memory (i.e. L2).

```
+------------------+   +---------------------+   +---------------------+
| CORE (Registers) |   | L1 Memory           |   | L2 Memory           |
|                  |   | Internal            |   | External            |
|                  |   | Smallest capacity   |   | Largest capacity    |
|                  |   | Single cycle access |   | Highest latency     |
+------------------+   +---------------------+   +---------------------+
```
Internal Bus Structure of the ADSP-BF533
Configurable Memory

- The best system performance can be achieved when executing code or fetching data out of L1 memory.
- Two methods can be used to fill the L1 memory – Caching and Dynamic Downloading – Blackfin® Processor Supports Both.
  - Micro-controllers have typically used the caching method, as they have large programs often residing in external memory and determinism is not as important.
  - DSPs have typically used Dynamic Downloading as they need direct control over which code runs in the fastest memory.
- Blackfin® Processor allows the programmer to choose one or both methods to optimize system performance.
Why Do Blackfin® Processors Have Cache?

- To allow users to take advantage of single cycle memory without having to specifically move instructions and/or data “manually”
  - L2 memory can be used to hold large programs and data sets
  - The paths to and from L1 memory are optimized to perform with cache enabled
- Automatically optimizes code that reuses recently used or nearby data

<table>
<thead>
<tr>
<th>Internal L1 Memory:</th>
<th>External L2 Memory:</th>
</tr>
</thead>
<tbody>
<tr>
<td>Smallest capacity</td>
<td>Largest capacity</td>
</tr>
<tr>
<td>Single cycle access</td>
<td>Highest latency</td>
</tr>
</tbody>
</table>
## Configurable L1 Memory Selections

<table>
<thead>
<tr>
<th>L1 Instruction</th>
<th>L1 Data A</th>
<th>L1 Data B (BF533 and BF532 only)</th>
<th>L1 Data Scratchpad</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache</td>
<td>Cache</td>
<td>Cache</td>
<td>SRAM</td>
</tr>
<tr>
<td>Cache</td>
<td>Cache</td>
<td>SRAM</td>
<td>SRAM</td>
</tr>
<tr>
<td>Cache</td>
<td>SRAM</td>
<td>SRAM</td>
<td>SRAM</td>
</tr>
<tr>
<td>SRAM</td>
<td>Cache</td>
<td>Cache</td>
<td>SRAM</td>
</tr>
<tr>
<td>SRAM</td>
<td>Cache</td>
<td>SRAM</td>
<td>SRAM</td>
</tr>
<tr>
<td>SRAM</td>
<td>SRAM</td>
<td>SRAM</td>
<td>SRAM</td>
</tr>
</tbody>
</table>

Using instruction cache will improve performance for most applications.

Data Cache may or may not improve performance.

Max bandwidth into L1 memory is available with cache enabled.

Trade-offs must be made on code control and peak short-term performance.

Core MMR L1 Memory Registers

- **General Control**
  - IMEM_CONTROL (Instruction Memory)
  - DMEM_CONTROL (Data Memory)

- **Cache and Protection Properties (n=0 to 15)**
  - ICPLB_DATAn, ICPLB_ADDRn
  - DCPLB_DATAn, DCPLB_ADDRn

- **Test Functionality**
  - ITEST_COMMAND, ITEST_DATA
  - DTEST_COMMAND, DTEST_DATA
BF533 L1 Instruction Memory

Instruction Bank C
BF531, BF532, BF533: 16KB SRAM/CACHE

Instruction Bank B
BF531: 16KB SRAM
BF532: 32KB SRAM
BF533: 32KB SRAM

Instruction Bank A
BF531: 32KB ROM
BF532: 32KB ROM
BF533: 32KB SRAM
L1 Instruction Memory 16KB Configurable Bank

16 KB SRAM
- Four 4KB single-ported sub-banks
- Allows simultaneous core and DMA accesses to different sub-banks

16 KB cache
- 4-way set associative with arbitrary locking of ways and lines
- LRU replacement
- No DMA access

Features of L1 Instruction Memory Unit

- Instruction Alignment Unit: handles alignment of 16-, 32-, and 64-bit instructions that are to be sent to the execution unit.
- Cacheability and Protection Look-aside Buffer (CPLB): Provides cacheability control and protection during instruction memory accesses.
- 256-bit cache Line Fill Buffer: uses four 64-bit word burst transfers to copy cache lines from external memory.
- Memory test interface: Provides software with indirect access to tag and data memory arrays.
IMEM_CONTROL

Reset = 0x0000 0001

- ENICPLB (Instruction CPLB Enable)
  - 0 - CPLBs disabled, minimal address checking only
  - 1 - CPLBs Enabled

- IMC (L1 Instruction Memory Configuration)
  - 0 - Upper 16 KB of L1 instruction memory configured as SRAM
  - 1 - Upper 16 KB of L1 instruction memory configured as cache

- LRUPRIORST (LRU Priority Reset)
  - 0 - LRU priority functionality is enabled
  - 1 - All cached LRU priority bits (LRUPRIO) are cleared

- ILOC[3:0] (Cache Way Lock)
  - 0000 - All Ways not locked
  - 0001 - Way 0 locked, Way 1, Way 2, and Way 3 not locked
  - ...
  - 1111 - All Ways locked
BF533 L1 Data Memory

Victim Buffers:
Victimized Write-Back
Cached Data to external memory

Write Buffer:
Write-Through and
Non-cached Data to external memory

L1 Data Memory 16KB Configurable Bank

Block is Multi-ported when:
- Accessing different sub-banks
- OR
- Accessing one odd and one even location (address bit 2 different) within the same sub-bank.

- When Used as SRAM
  - Allows simultaneous dual DAG and DMA access

- When Used as Cache
  - Each bank is 2-way set-associative
  - No DMA access
  - Allows simultaneous dual DAG access

# BF533 L1 Data Memory

<table>
<thead>
<tr>
<th>Sub-Bank</th>
<th>Data Bank A</th>
<th>Data Bank B</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0xFF80</td>
<td>0xFF90 0000</td>
</tr>
<tr>
<td>2</td>
<td>0xFF80</td>
<td>0xFF90 1000</td>
</tr>
<tr>
<td>3</td>
<td>0xFF80</td>
<td>0xFF90 2000</td>
</tr>
<tr>
<td>4</td>
<td>0xFF80</td>
<td>0xFF90 3000</td>
</tr>
<tr>
<td>5</td>
<td>0xFF80</td>
<td>0xFF90 4000</td>
</tr>
<tr>
<td>6</td>
<td>0xFF80</td>
<td>0xFF90 5000</td>
</tr>
<tr>
<td>7</td>
<td>0xFF80</td>
<td>0xFF90 6000</td>
</tr>
<tr>
<td>8</td>
<td>0xFF80</td>
<td>0xFF90 7000</td>
</tr>
</tbody>
</table>

L1 configurable data memory can be:

- Both banks A & B as SRAM
- Bank A as cache, bank B as SRAM
- Both banks as cache

BF532 L1 Data Memory

<table>
<thead>
<tr>
<th>Sub-Bank</th>
<th>Data Bank A</th>
<th>Data Bank B</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0xFF80</td>
<td>0xFF90 0000</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>0xFF80</td>
<td>0xFF90 1000</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>0xFF80</td>
<td>0xFF90 2000</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>0xFF80</td>
<td>0xFF90 3000</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>0xFF80</td>
<td>0xFF90 4000</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>0xFF80</td>
<td>0xFF90 5000</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>0xFF80</td>
<td>0xFF90 6000</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>0xFF80</td>
<td>0xFF90 7000</td>
<td></td>
</tr>
</tbody>
</table>

**SRAM**

**CONFIGURABLE**

L1 configurable data memory can be:

- Both banks A & B as SRAM
- Bank A as cache, bank B as SRAM
- Both banks as cache
### BF531 L1 Data Memory

<table>
<thead>
<tr>
<th>Sub-Bank</th>
<th>Data Bank A</th>
<th>Data Bank B</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0xFF80</td>
<td>0xFF90 0000</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>0xFF80</td>
<td>0xFF90 1000</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>0xFF80</td>
<td>0xFF90 2000</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>0xFF80</td>
<td>0xFF90 3000</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>0xFF80</td>
<td>0xFF90 4000</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>0xFF80</td>
<td>0xFF90 5000</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>0xFF80</td>
<td>0xFF90 6000</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>0xFF80</td>
<td>0xFF90 7000</td>
<td></td>
</tr>
</tbody>
</table>

L1 configurable data memory can be:
- Bank A as SRAM
- Bank A as Cache

L1 Data Memory SRAM Addressing

- Both DAG units can access Data Banks A & B
- If an address conflict is detected Data Bank priority is as follows:
  1. System DMA (highest priority)
  2. DAG Unit 0
  3. DAG Unit 1 (lowest priority)
- Parallel DAG accesses can occur to the same Data Bank as long as the references are to different sub-banks OR they access 2 words of different 32-bit address polarity (Address bit 2 is different).
## Dual Access to Same Sub-Bank

A dual access to an odd and even (quad address) location can be performed in a single cycle.

A dual access to two odd or two even locations will result in an extra cycle (1 stall) of delay.

<table>
<thead>
<tr>
<th>A2 = 1 (odd)</th>
<th>A2 = 0 (even)</th>
</tr>
</thead>
<tbody>
<tr>
<td>07 06 05 04</td>
<td>03 02 01 00</td>
</tr>
<tr>
<td>0F 0E 0D 0C</td>
<td>0B 0A 09 08</td>
</tr>
<tr>
<td>17 16 15 14</td>
<td>13 12 11 10</td>
</tr>
<tr>
<td>1F 1E 1D 1C</td>
<td>1B 1A 19 18</td>
</tr>
<tr>
<td>27 26 25 24</td>
<td>23 22 21 20</td>
</tr>
<tr>
<td>2F 2E 2D 2C</td>
<td>2B 2A 29 28</td>
</tr>
</tbody>
</table>
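
The dual-access rule can be expressed as a small predicate. This is a sketch with a hypothetical function name; it assumes both addresses lie in L1 data SRAM and that sub-banks are 4 KB blocks aligned on 4 KB boundaries, as in the sub-bank tables earlier.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true when two simultaneous DAG accesses would collide,
   costing one stall. They collide only if both fall in the same
   4 KB sub-bank AND have the same value of address bit 2
   (both "odd" or both "even" 32-bit words). */
bool dual_access_stalls(uint32_t addr0, uint32_t addr1)
{
    bool same_sub_bank = (addr0 >> 12) == (addr1 >> 12);   /* 4 KB granularity */
    bool same_polarity = ((addr0 ^ addr1) & 0x4u) == 0;    /* bit 2 equal */
    return same_sub_bank && same_polarity;
}
```

A check like this can be used when laying out two buffers that are read in the same cycle: placing them in different sub-banks, or offsetting one by 4 bytes, avoids the stall.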
L1 Scratchpad Memory

- Dedicated 4KB Block of Data SRAM
- Operates at CCLK rate
- Can not be configured as Cache
- Can not be accessed by DMA
- Typical Use is for User and Supervisor stacks to do fast context switching during interrupt handling.
L1 Data Memory Control Register

DMEM_CONTROL

PORT_PREF1 (DAG1 Port Preference)
- 0 - DAG1 non-cacheable fetches use port A
- 1 - DAG1 non-cacheable fetches use port B

PORT_PREF0 (DAG0 Port Preference)
- 0 - DAG0 non-cacheable fetches use port A
- 1 - DAG0 non-cacheable fetches use port B

DCBS (L1 Data Cache Bank Select)
Valid only when DMC[1:0] = 11, for ADSP-BF532 and ADSP-BF533. Determines whether Address bit A[14] or A[23] is used to select the L1 data cache bank.

ENDCPLB (Data Cacheability Protection Lookaside Buffer Enable)
- 0 - CPLBs disabled. Minimal address checking only
- 1 - CPLBs enabled

DMC[1:0] (L1 Data Memory Configure)
For ADSP-BF533:
- 00 - Both data banks are SRAM
- 01 - Reserved
- 10 - Data Bank A is lower 16 KB SRAM, upper 16 KB cache; Data Bank B is SRAM
- 11 - Both data banks are lower 16 KB SRAM, upper 16 KB cache

Reset = 0x0000 0001
Cache Mode
What is Cache?

- In a hierarchical memory system, cache is the first level of memory reached once the address leaves the core (i.e. L1)
  - If the instruction/data word (8, 16, 32, or 64 bits) that corresponds to the address is in the cache, there is a cache hit and the word is forwarded to the core from the cache.
  - If the word that corresponds to the address is not in the cache, there is a cache miss. This causes a fetch of a fixed size block (which contains the requested word) from the main memory.
- The Blackfin allows the user to specify which regions (i.e. pages) of main memory are cacheable and which are not through the use of CPLBs (more on this later).
  - If a page is cacheable, the block (i.e. cache line containing 32 bytes) is stored in the cache after the requested word is forwarded to the core
  - If a page is non-cacheable, the requested word is simply forwarded to the core

ADSP-BF533 Instruction Cache

- **Cache Line:**
  - A 32 byte contiguous block of memory

- **Set:**
  - A group of cache lines in the cache
    - Selected by Line Address Index

- **Way:**
  - One of several places in a set that a cache line can be stored
    - 1 of 4 for Instructions
    - 1 of 2 for Data

- **Cache Tag:**
  - Upper address bits stored with cache line. Used to ID the specific address in main memory that the cached line represents
Instruction Cache Placement Based On Address

- Four 4KB sub-banks (16KB total)
- Each sub-bank has 4-ways (1KB for each way)
- Each way has 32 lines
- Each line is 32 bytes
Cache Hits and Misses

- A cache hit occurs when the address for an instruction fetch request from the core matches a valid entry in the cache.
- A cache hit is determined by comparing the upper 18 bits, and bits 11 and 10 of the instruction fetch address to the address tags of valid lines currently stored in a cache set.
- Only valid cache lines (i.e. cache lines with their valid bits set) are included in the address tag compare operation.
- When a cache hit occurs, the target 64-bit instruction word is sent to the instruction alignment unit where it is stored in one of two 64-bit instruction buffers.
- When a cache miss occurs, the instruction memory unit generates a cache line-fill access to retrieve the missing cache line from external memory to the core.
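
The address fields used by the instruction cache can be sketched in C as below. The byte, line and tag fields follow the figures above (32-byte lines, 32 lines per way, tag made of the upper 18 bits plus bits 11 and 10). Using bits 13:12 as the sub-bank select is an inference, since those are the only two bits left over for four sub-banks; it is not stated on the slides. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of an instruction fetch address as seen by the 16 KB instruction cache. */
struct icache_addr {
    uint32_t byte_in_line;  /* bits 4:0   - offset within the 32-byte line    */
    uint32_t line;          /* bits 9:5   - one of 32 lines per way           */
    uint32_t sub_bank;      /* bits 13:12 - assumed: the two bits left over   */
    uint32_t tag;           /* bits 31:14 and 11:10, packed into 20 bits      */
};

struct icache_addr icache_decode(uint32_t addr)
{
    struct icache_addr f;
    f.byte_in_line = addr & 0x1Fu;
    f.line         = (addr >> 5) & 0x1Fu;
    f.sub_bank     = (addr >> 12) & 0x3u;
    f.tag          = ((addr >> 14) << 2) | ((addr >> 10) & 0x3u);
    return f;
}
```

Two addresses that differ only in bits 11:10 select the same line in the same sub-bank but carry different tags, so they compete for the four ways of that set.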
Instruction Fill from L2 Memory

• Cache Off
  – 64 bits

  

• Cache On
  – Burst Cache Line fill (32-bytes)

| 64 bits | 64 bits | 64 bits | 64 bits |
Cache Line Fills

- A cache line fill consists of fetching 32 bytes of data from memory external to the core (i.e. L2 memory).
- A line read data transfer consists of a four 64-bit word read burst.
- The instruction memory unit requests the target instruction word first; once the target word has returned, the IMU requests the next three words in sequential address order, wrapping around if necessary.

<table>
<thead>
<tr>
<th>Target Word</th>
<th>Fetching Order (target word first)</th>
</tr>
</thead>
<tbody>
<tr>
<td>WD0</td>
<td>WD0, WD1, WD2, WD3</td>
</tr>
<tr>
<td>WD1</td>
<td>WD1, WD2, WD3, WD0</td>
</tr>
<tr>
<td>WD2</td>
<td>WD2, WD3, WD0, WD1</td>
</tr>
<tr>
<td>WD3</td>
<td>WD3, WD0, WD1, WD2</td>
</tr>
</tbody>
</table>
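
The table above follows a simple wrap-around rule, sketched here with hypothetical function names. A 32-byte line holds four 64-bit words, so the target word number is taken from address bits 4:3.

```c
#include <stdint.h>

/* Which of the four 64-bit words in a 32-byte line holds this address. */
unsigned target_word_of(uint32_t addr)
{
    return (addr >> 3) & 0x3u;
}

/* Fills order[0..3] with the word numbers in the sequence they are fetched:
   the target word first, then the following words in ascending address
   order, wrapping around at the end of the line. */
void line_fill_order(unsigned target_word, unsigned order[4])
{
    for (unsigned k = 0; k < 4; ++k)
        order[k] = (target_word + k) % 4u;
}
```

For a target of WD2 this gives WD2, WD3, WD0, WD1, matching the third row of the table.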
Cache Line-Fill Buffer

- The cache line-fill buffer allows the core to access the data from the new cache line as the line is being retrieved from external memory, rather than having to wait until the line has been completely written to the 4KB memory block.

- The line-fill buffer organization is shown below:

![Diagram of cache line-fill buffer]

- The line-fill buffer is also used to support non-cacheable accesses*. A non-cacheable access consists of a single 64-bit transfer on the instruction memory unit’s external read port.

* A non-cacheable access includes: external accesses when instruction memory is configured as SRAM, or accesses to non-cacheable pages.
Cache Line Replacement

- The cache line replacement unit first checks for invalid entries.
- If only a single invalid entry is found then that entry is selected for the new cache line. If multiple invalid entries are found the replacement entry for the new cache line is selected based on the following priority:
  - way 0 first
  - way 1 next
  - way 2 next
  - way 3 last

- When no invalid entries are found, the cache replacement logic uses a 6-bit LRU algorithm to select the entry for the new cache line.
- For instruction cache the LRUPRIO bit is also considered.
Instruction Cache “Locking By Line” (LRUPRIO)

- Prevents the Cached Line from being replaced
- CPLB_LRUPRIO bits in the ICPLB_DATAx register define the priority for that page.
- The Cache line importance level (LRUPRIO) is saved in the TAG and used by the replacement policy logic.
- Cache Line Replacement policy with LRUPRIO
  - No invalid entries:
    - A high-priority line replaces a low-priority line, or another high-priority line if all four ways contain high-priority lines.
    - LRU (least recently used) policy is used to determine which one of the lines that have the same priority will be replaced.
- Setting the IMEM_CONTROL: LRUPRIORST bit clears all LRUPRIO bits in the TAGs.
Instruction Cache Locking By Way

- Each 4KB way of the instruction cache can be locked individually to ensure placement of performance-critical code.
- Controlled by the ILOC<3:0> bits in the IMEM_CONTROL register.
Data Cache Mode
Data Cache Placement Based On Address

- Four 4KB sub-banks (16KB total)
- Each sub-bank has 2-ways (2KB for each way)
- Each way has 64 lines
- Each line is 32 bytes

If Both Data Bank A and B are set for Cache, bit 14 or 23 is used to determine which Data Bank.

[Figure: data address fields: 19-bit tag, sub-bank select, line select, byte select]
Data Cache Definitions

- **Write Through:**
  - A cache write policy where write data is written to the cache line and to the source memory.

- **Write Back:**
  - A cache write policy where write data is written only to the cache line. The modified cache line is written to source memory only when it is replaced.

- **Dirty/Clean (Applies to Write Back Mode only):**
  - State of cache line indicating whether the data in the cache has changed since it was copied from source memory

- Performance trade-off required between write through and write back to determine the best policy to use for an application.
Data Cache Victim Buffer

- The victim buffer is used to read a dirty cache line either being flushed or replaced by a cache line fill and then to initiate a burst write operation on the bus to perform the line copyback to the system.
- The processor can continue running without having to wait for the data to be written back to L2 memory.
- The victim buffer is a 4-deep FIFO, each entry 64 bits wide (similar to the fill buffer).
- There is no data forwarding support from the victim buffer.
Cacheability Protection Lookaside Buffers (CPLBs)
Memory Protection and Cache Properties

- **Memory Management Unit**
  - Cacheability and Protection Look-Aside Buffers (CPLBs)
  - Cache/protection properties determined on a per memory page basis (1K, 4K, 1M, 4M byte sizes)
  - 32 CPLBs total: 16 CPLBs for instruction memory, 16 CPLBs for data memory

- **User/Supervisor Access Protection**
- **Read/Write Access Protection**
- **Cacheable or Non-Cacheable**
Using CPLBs

◆ Cache enabled:
  • CPLB must be used to define cacheability properties

◆ Cache disabled:
  • CPLBs can be used to protect pages of memory

• When CPLBs are enabled, a valid CPLB must exist before an access to a specific memory location is attempted. Otherwise, an exception will be generated.

• User and Supervisor mode protection is available without using CPLBs.
Cacheability Protection Lookaside Buffers (CPLBs)

- Divide the entire Blackfin memory map into regions (i.e. pages) that have cacheability and protection properties.
- 16 Pages in Instruction Memory plus 16 Pages in Data memory
  - Page sizes: 1KB, 4KB, 1MB, 4MB
- Each CPLB has 2 associated registers:
  - 32bit Start Address: ICPLB_ADDRn, DCPLB_ADDRn
  - Cache/Protection Properties: ICPLB_DATAn, DCPLB_DATAn

Note: “n” equals 15:0
**ICPLB_DATA{n} Register**

[Register bit diagram: bits 31:16, all 0 at reset]

Reset = 0x0000 0000

**PAGE_SIZE[1:0]**
- 00 - 1 KB page size
- 01 - 4 KB page size
- 10 - 1 MB page size
- 11 - 4 MB page size

**CPLB_L1_CHBL**
- Clear this bit whenever L1 memory is configured as SRAM
- 0 - Non-cacheable in L1
- 1 - Cacheable in L1

**CPLB_LRUPRIO**
- 0 - Low importance
- 1 - High importance

**CPLB_VALID**
- 0 - Invalid (disabled) CPLB entry
- 1 - Valid (enabled) CPLB entry

**CPLB_LOCK**
- Can be used by software in CPLB replacement algorithms
- 0 - Unlocked, CPLB entry can be replaced
- 1 - Locked, CPLB entry should not be replaced

**CPLB_USER_RD**
- 0 - User mode read access generates protection violation exception
- 1 - User mode read access permitted

---

Note: “n” equals 15:0

---

DCPLB_DATA{n} Register

<table>
<thead>
<tr>
<th>Bit</th>
<th>Field</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>15</td>
<td>DCPLB_L1_AOW</td>
<td>Valid only if write-through cacheable (CPLB_VALID = 1, CPLB_WT = 1). 0 - Allocate cache lines on reads only. 1 - Allocate cache lines on reads and writes.</td>
</tr>
<tr>
<td>14</td>
<td>CPLB_WT</td>
<td>Operates only in cache mode. 0 - Write back. 1 - Write through.</td>
</tr>
<tr>
<td>12</td>
<td>CPLB_L1_CHBL</td>
<td>Clear this bit when L1 memory is configured as SRAM. 0 - Non-cacheable in L1. 1 - Cacheable in L1.</td>
</tr>
<tr>
<td>7</td>
<td>CPLB_DIRTY</td>
<td>Valid only if write-back cacheable (CPLB_VALID = 1, CPLB_WT = 0, and CPLB_L1_CHBL = 1). 0 - Clean. 1 - Dirty. A protection violation exception is generated on store accesses to this page when this bit is 0. The state of this bit is modified only by writes to this register. The exception service routine must set this bit.</td>
</tr>
<tr>
<td>4</td>
<td>CPLB_SUPV_WR</td>
<td>0 - Supervisor mode write access generates protection violation exception. 1 - Supervisor mode write access permitted.</td>
</tr>
<tr>
<td>3</td>
<td>CPLB_USER_WR</td>
<td>0 - User mode write access generates protection violation exception. 1 - User mode write access permitted.</td>
</tr>
<tr>
<td>2</td>
<td>CPLB_USER_RD</td>
<td>0 - User mode read access generates protection violation exception. 1 - User mode read access permitted.</td>
</tr>
<tr>
<td>1</td>
<td>CPLB_LOCK</td>
<td>Can be used by software in CPLB replacement algorithms. 0 - Unlocked, CPLB entry can be replaced. 1 - Locked, CPLB entry should not be replaced.</td>
</tr>
<tr>
<td>0</td>
<td>CPLB_VALID</td>
<td>0 - Invalid (disabled) CPLB entry. 1 - Valid (enabled) CPLB entry.</td>
</tr>
</tbody>
</table>

*Bits 17:16 Page Size[1:0] same as ICPLB Register

Note: "n" equals 15:0
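
As a rough illustration of how these fields combine into one register value, the sketch below builds a DCPLB_DATAn word in C. The helper `dcplb_data` is hypothetical, and the bit masks are assumed from the ADSP-BF533 hardware reference rather than taken from this slide, so they should be verified before use.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks assumed from the ADSP-BF533 hardware reference; verify before use. */
#define CPLB_VALID    0x00000001u
#define CPLB_USER_RD  0x00000004u
#define CPLB_USER_WR  0x00000008u
#define CPLB_SUPV_WR  0x00000010u
#define CPLB_L1_CHBL  0x00001000u
#define CPLB_WT       0x00004000u
#define PAGE_SIZE_1KB 0x00000000u   /* PAGE_SIZE[1:0] lives in bits 17:16 */
#define PAGE_SIZE_4KB 0x00010000u
#define PAGE_SIZE_1MB 0x00020000u
#define PAGE_SIZE_4MB 0x00030000u

/* Hypothetical helper: build a DCPLB_DATAn value for a valid page that the
 * supervisor may always write. */
static uint32_t dcplb_data(uint32_t page_size, int cacheable,
                           int write_through, int user_access)
{
    uint32_t v = CPLB_VALID | CPLB_SUPV_WR | page_size;

    assert((page_size & ~PAGE_SIZE_4MB) == 0u);  /* only bits 17:16 may be set */
    if (user_access)
        v |= CPLB_USER_RD | CPLB_USER_WR;
    if (cacheable) {
        v |= CPLB_L1_CHBL;
        if (write_through)       /* CPLB_WT only has meaning in cache mode */
            v |= CPLB_WT;
    }
    return v;
}
```

For example, `dcplb_data(PAGE_SIZE_4MB, 1, 0, 1)` describes a 4MB write-back cacheable page with user access, and passing `cacheable = 0` leaves CPLB_L1_CHBL clear, as required when L1 is configured as SRAM.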

Example Protection Operation

- Set up CPLBs to define regions and properties:
  - Default hardware CPLBs are present for MMRs and scratchpad memory.
  - CPLBs must be configured for L1 Data and L1 Instruction Memory as Non-Cacheable.
  - Disable all memory other than the desired memory space.
  - Execute Code.
- If code tries to access memory that has been ‘disabled’ or protected, then a ‘memory protection violation’ occurs as an exception.
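
The protection check described above can be modeled on a host machine. The following is a minimal sketch, assuming a hypothetical software mirror of the CPLB table (`cplb_entry`); it is not the hardware register layout, and the real lookup is done by the MMU, not by code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software mirror of one CPLB entry (not a hardware layout). */
typedef struct {
    uint32_t start;    /* page start address */
    uint32_t size;     /* page size in bytes: 1KB, 4KB, 1MB, or 4MB */
    int      valid;    /* CPLB_VALID */
    int      user_rd;  /* CPLB_USER_RD */
    int      user_wr;  /* CPLB_USER_WR */
    int      supv_wr;  /* CPLB_SUPV_WR */
} cplb_entry;

typedef enum { ACCESS_OK, ACCESS_NO_CPLB, ACCESS_VIOLATION } access_result;

/* Decide what an access to addr would do against a table of n entries. */
static access_result check_access(const cplb_entry *tbl, size_t n,
                                  uint32_t addr, int is_write, int is_user)
{
    size_t i;
    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; ++i) {
        int allowed;
        if (!tbl[i].valid)
            continue;
        /* Unsigned subtraction wraps when addr < start, so one compare suffices. */
        if (addr - tbl[i].start >= tbl[i].size)
            continue;
        if (is_user)
            allowed = is_write ? tbl[i].user_wr : tbl[i].user_rd;
        else
            allowed = is_write ? tbl[i].supv_wr : 1;  /* no supervisor-read bit is listed */
        return allowed ? ACCESS_OK : ACCESS_VIOLATION;
    }
    return ACCESS_NO_CPLB;  /* no valid CPLB covers the address: exception */
}
```

Both `ACCESS_NO_CPLB` and `ACCESS_VIOLATION` correspond to exceptions on the processor; the model keeps them apart only to show the two different causes.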
Example CPLB Setup

Instruction CPLB setup

- L1 Instruction: Non-cacheable 1MB page
- Async: Non-cacheable One 4MB page
- Async: Cacheable Two 4MB pages

Data CPLB setup

- L1 Data: Non-cacheable One 4MB page
- Async: Non-cacheable One 4MB page
- Async: Cacheable One 4MB page

The memory management code handles exceptions and redefines the external memory pages as required. Examples will be provided to customers.
Accessing the Cache Directly

- Once L1 memory is configured as cache, it can’t be accessed via DMA or from a core read.
- ITEST_COMMAND and ITEST_DATA memory mapped registers do allow direct access to Instruction Memory tags and lines.
- Analogous registers exist for Data Cache.
- Can be useful for invalidating cache lines directly.
Data Cache Control Instructions

- **Prefetch**: Causes data cache to prefetch line associated with address in P-register
  - Causes line to be fetched if it is not currently in the cache and the location is cacheable
  - Otherwise it behaves like a nop
    - `Prefetch [p2];`
    - `Prefetch [p2 ++]; // post increment by cache-line size`

- **FLUSH**: Causes data cache to synchronize specified cache line with higher levels of memory
  - If the line is dirty, it is written out and marked clean
    - `flush [p2];`
    - `flush [p2 ++]; // post increment by cache-line size`

- **FLUSHINV**: Causes data cache to invalidate a specific line in cache.
  - If the line is dirty, it is written out:
    - `flushinv [p2];`
    - `flushinv [p2 ++]; // post increment by cache-line size`
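
Because each of these instructions works on one cache line, covering a whole buffer means visiting every 32-byte line it touches. The sketch below shows only that address arithmetic in portable C. The function `for_each_cache_line` and its callback are hypothetical; on the processor the callback would issue `flush` or `flushinv` on the line address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u

/* Hypothetical per-line operation, e.g. a wrapper around flush or flushinv. */
typedef void (*line_op)(uint32_t line_addr);

/* Call op once for every cache line overlapped by [addr, addr + len).
 * Returns the number of lines visited. op may be NULL to only count. */
static unsigned for_each_cache_line(uint32_t addr, uint32_t len, line_op op)
{
    unsigned count = 0;
    uint32_t line, last;

    if (len == 0u)
        return 0;
    assert(addr + (len - 1u) >= addr);                 /* range must not wrap */
    line = addr & ~(CACHE_LINE_BYTES - 1u);            /* first line start */
    last = (addr + len - 1u) & ~(CACHE_LINE_BYTES - 1u); /* last line start */
    for (;;) {
        if (op != NULL)
            op(line);
        ++count;
        if (line == last)
            break;
        line += CACHE_LINE_BYTES;
    }
    return count;
}
```

Note that a buffer that is not aligned to a line boundary touches one more line than its length alone suggests.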
Instruction Cache Control Instructions

- **IFLUSH**: Causes instruction cache to invalidate a specific line in cache.
  - `iflush [p2];`
  - `iflush [p2 ++]; // post increment by cache-line size`
Coherency Considerations

- Care must be taken when memory that is defined as “cacheable” is modified by an outside source
  - DMA controller (data or descriptors)
- The cache is not aware of these changes, so some mechanism must be set up
  - Simple memory polling will not work
  - The cache must be invalidated before accessing the changed L2 memory.
Reference Material

Memory
Data Byte-Ordering

- The ADSP-BF533 architecture supports little-endian byte-ordering.
- For example, if the hex value 0x76543210 resides in register r0 and the pointer register p0 contains address 0x00ff0000, then the instruction “[p0] = r0;” would cause the data to be written to memory as shown below:

<table>
<thead>
<tr>
<th>Byte Address</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x00ff0000</td>
<td>0x10</td>
</tr>
<tr>
<td>0x00ff0001</td>
<td>0x32</td>
</tr>
<tr>
<td>0x00ff0002</td>
<td>0x54</td>
</tr>
<tr>
<td>0x00ff0003</td>
<td>0x76</td>
</tr>
</tbody>
</table>

- When loading a byte, half-word, or word from memory to a register, the LSB (bit 0) of the data word is always loaded into the LSB of the destination register.
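
The byte layout in the table can be reproduced on any host with explicit shifts. This is a small sketch with hypothetical helper names (`store32_le`, `load32_le`); it models the memory image and does not depend on the byte order of the machine it runs on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 32-bit value the way "[p0] = r0;" lays it out: LSB at the lowest address. */
static void store32_le(uint8_t *mem, uint32_t value)
{
    assert(mem != NULL);
    mem[0] = (uint8_t)(value & 0xFFu);
    mem[1] = (uint8_t)((value >> 8) & 0xFFu);
    mem[2] = (uint8_t)((value >> 16) & 0xFFu);
    mem[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Read the value back; byte 0 ends up in bits 7:0 of the result. */
static uint32_t load32_le(const uint8_t *mem)
{
    assert(mem != NULL);
    return (uint32_t)mem[0]
         | ((uint32_t)mem[1] << 8)
         | ((uint32_t)mem[2] << 16)
         | ((uint32_t)mem[3] << 24);
}
```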

Instruction Packing

- Instruction set tuned for compact code:
  - Multi-length instructions
    - 16, 32, 64-bit opcodes
    - Limited multi-issue instructions
  - No memory alignment restrictions for code:
    - Transparent alignment H/W.

---

Instruction Formats

No Memory Alignment Restrictions:
Maximum Code Density and Minimum System Memory Cost
Instruction Fetching

- 64-bit instruction line can fetch between 1 and 4 instructions

Example packings of one 64-bit line:

- One 64-bit instruction
- Two 32-bit instructions
- One 32-bit instruction and two 16-bit instructions
- Four 16-bit instructions

---

Linker Description File
Software Development Flow
Step 1 - Compiling & Assembling

Source Files (.C and .ASM)

Compiler & Assembler

Object Files (.DOJ)

Linker

Linker Description File (.LDF)

Executable (.DXE)

Loader / Splitter

Boot Image (.LDR)

Debugger
(In-Circuit Emulator, Simulator, or EZKIT)
Software Development Flow
Step 2 - Linking

Source Files (.C and .ASM) → Compiler & Assembler → Object Files (.DOJ) → Linker → Executable (.DXE) → Debugger (In-Circuit Emulator, Simulator, or EZKIT) → Boot Image (.LDR)

- Linker Description File (.LDF)
- Loader / Splitter
- Boot Code (.DXE)
Linker Description File
Step 2 - Linking

Object Files (.DOJ)

Executable (.DXE)

LINKER

LDF
Linker

- Generates a Complete Executable DSP Program (.dxe)
- Resolves All External References
- Assigns Addresses to re-locatable Code and Data Spaces
- Generates Optional Memory Map
- Output in ELF format
  - Used by downstream tools such as Loader, Simulator, and Emulator
- Controlled by linker commands contained in a linker description file (LDF)
  - An LDF is required for each project
  - Typically modify a default one to suit target application
Linker

Object File .DOJ

Library Files .DLB

Linker Description Files .LDF

Memory Image File .DXE (binary)

Memory Map File .MAP (.xml)
If chosen, a .map file will be created.

All symbol names will be removed, if chosen.
The Linker Description File (LDF)

- The link process is controlled by a linker command language.
- The LDF provides a complete specification of mapping between the linker's input files and its output.
- It controls:
  - input files
  - output file
  - target memory configuration
- Preprocessor Support
LDF consists of three primary parts

• **Global Commands**
  - Defines architecture or processor
  - Directory search paths
  - Libraries and object files to include

• **Memory Description**
  - Defines memory segments

• **Link Project Commands**
  - Mapping of *input sections* to memory *segments*
  - Output file name
  - Link against object file list
ARCHITECTURE (ADSP-BF533)
SEARCH_DIR ($ADI_DSP\Blackfin\lib)
$OBJECTS = $COMMAND_LINE_OBJECTS;

MEMORY
{
    seg_data_a { TYPE(RAM) START(0xFF800000) END(0xFF803FFF) WIDTH(8) }
    seg_data_b { TYPE(RAM) START(0xFF900000) END(0xFF903FFF) WIDTH(8) }
    seg_data_scr { TYPE(RAM) START(0xFFFF0000) END(0xFFFF0FFF) WIDTH(8) }
    seg_prog { TYPE(RAM) START(0xFFA00000) END(0xFFA03FFF) WIDTH(8) }
}
PROCESSOR p0
{
    OUTPUT( $COMMAND_LINE_OUTPUT_FILE )
    SECTIONS
    {
        sec_data_a
        { INPUT_SECTIONS( $OBJECTS(data_a) ) } > seg_data_a
        sec_data_b
        { INPUT_SECTIONS( $OBJECTS(data_b) ) } > seg_data_b
        sec_data_scr
        { INPUT_SECTIONS( $OBJECTS(data_scr) ) } > seg_data_scr
        sec_prog
        { INPUT_SECTIONS( $OBJECTS(prog) ) } > seg_prog
    }
}

Example LDF (cont’d)
Link Commands
Linker Description File for C/C++ Programming

- **Memory Description**
  - Define Memory Segments
  - Map Input Sections (Names Produced by Compiler) to Memory Segments

- **Run Time Stack Supported**
  - Stack Used for Branching, Local Variables, Arguments
  - LDF Defines Stack Size and Location

- **Run Time Heap Supported**
  - Used For Memory Management Protocols (malloc, free, etc)
  - LDF Defines Heap Size, Location, and Name (For Multiple Heap Support)
Compiler-Generated Memory Section Names

- Compiler uses default section names that are mapped appropriately by the linker (through the LDF)

- **program** - contains all program instructions
- **data1** - contains all global and “static” data
- **constdata** - contains all data declared as “const”
- **ctor** - C++ constructor initializations
- **cplb_code** – code CPLB config tables
- **cplb_data** – data CPLB config tables
Memory Descriptions

- Define Memory Segments In LDF For:
  - Code, Data, Stack*, Heap(s)
- Map Input Sections to Memory Segments
  (BF533 Default LDF Segment Names Used)

<table>
<thead>
<tr>
<th>Segment Name</th>
<th>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>MEM_L1_CODE</td>
<td>code storage</td>
</tr>
<tr>
<td>MEM_L1_CODE_CACHE</td>
<td>code storage, if not used as cache</td>
</tr>
<tr>
<td>MEM_L1_DATA_A</td>
<td>used for default compiler data sections</td>
</tr>
<tr>
<td>MEM_L1_DATA_A_CACHE</td>
<td>If not used as cache, it becomes heap space</td>
</tr>
<tr>
<td>MEM_L1_DATA_B</td>
<td>used for default compiler data sections</td>
</tr>
<tr>
<td>MEM_L1_DATA_B_CACHE</td>
<td>If not used as cache, it is used for data</td>
</tr>
<tr>
<td>MEM_L1_DATA_B_STACK</td>
<td>dedicated stack space</td>
</tr>
<tr>
<td>MEM_L1_SCRATCH</td>
<td>Dedicated 4 Kbyte Data Scratchpad</td>
</tr>
<tr>
<td>MEM_SDRAM0_HEAP</td>
<td>If L1 Data A used as cache, heap is external</td>
</tr>
<tr>
<td>MEM_SDRAM0</td>
<td>external SDRAM bank</td>
</tr>
<tr>
<td>MEM_ASYNCx (x=0,1,2,3)</td>
<td>1MB Async Banks</td>
</tr>
</tbody>
</table>
LDF and the Stack

- C/C++ Runtime Environment Depends Upon the Initialization of FP and SP

- Variables Initialized by Constants Defined in the LDF
  - ldf_stack_space
  - ldf_stack_end

- Variables Used to Initialize FP and SP are Declared and Initialized in the Assembly File basiccrt.s
LDF Stack Setup
(C/C++ Compiler Only)

- Linker Calculates LDF Stack-Initializing Constants from the Stack Memory Segment Description

```c
stack
{
    ldf_stack_space = .;
    ldf_stack_end = ldf_stack_space + MEMORY_SIZEOF(MEM_L1_DATA_B_STACK);
} >MEM_L1_DATA_B_STACK
```
LDF and the Heap

- Four Library Functions Can Be Used to Allocate or Free Memory to/from the Heap
  - malloc, calloc, realloc, free

- Other C Library Functions Implicitly Use these Four Functions and ALSO Require the Heap
  - memmove, memcpy, etc.

- Initialized by Constants Defined in the LDF
  - ldf_heap_space
  - ldf_heap_length
  - ldf_heap_end

- Multiple Heaps are Possible
  - Can be defined at Link Time or at Run Time (see compiler manual)
LDF Heap Setup
(C Compiler Only)

- Output Section ‘heap’ Calculates LDF Heap Initializers from Heap Memory Segment Description

```c
#ifdef USE_CACHE /* { */
    heap
    {
        // Allocate a heap for the application
        ldf_heap_space = .;
        ldf_heap_end = ldf_heap_space + MEMORY_SIZEOF(MEM_SDRAM0_HEAP) - 1;
        ldf_heap_length = ldf_heap_end - ldf_heap_space;
    } >MEM_SDRAM0_HEAP
#else
    heap
    {
        // Allocate a heap for the application
        ldf_heap_space = .;
        ldf_heap_end = ldf_heap_space + MEMORY_SIZEOF(MEM_L1_DATA_A_CACHE) - 1;
        ldf_heap_length = ldf_heap_end - ldf_heap_space;
    } >MEM_L1_DATA_A_CACHE
#endif /* USE_CACHE } */
```
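
To see what values the heap expressions above produce, the arithmetic can be mirrored in C. This is a sketch under stated assumptions: the types and functions are hypothetical, `MEMORY_SIZEOF` is modeled as `end - start + 1` for an inclusive segment, and `.` is taken to be the segment start (an otherwise empty segment).

```c
#include <assert.h>
#include <stdint.h>

/* A memory segment with an inclusive END, as in the MEMORY{} block of an LDF. */
typedef struct { uint32_t start; uint32_t end; } mem_segment;

/* The three heap constants computed by the 'heap' output section. */
typedef struct { uint32_t space; uint32_t end; uint32_t length; } heap_consts;

/* Model of MEMORY_SIZEOF(): number of bytes in the segment (assumed semantics). */
static uint32_t memory_sizeof(mem_segment s)
{
    assert(s.end >= s.start);
    return s.end - s.start + 1u;
}

/* Mirror of the LDF expressions for ldf_heap_space, ldf_heap_end, ldf_heap_length. */
static heap_consts ldf_heap(mem_segment s)
{
    heap_consts h;
    h.space  = s.start;                           /* "." at the segment start */
    h.end    = h.space + memory_sizeof(s) - 1u;   /* last byte of the heap */
    h.length = h.end - h.space;                   /* one less than the segment size */
    return h;
}
```

For a 16KB segment this gives a length of 0x3FFF, because the expressions subtract the address of the last byte from the address of the first.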

Expert Linker

Using the LDF Wizard
Expert Linker is a graphical tool that can:

- Use wizards to create LDF files
- Define a DSP’s target memory map
- Drag and drop object sections into the memory map
- Graphically highlight code elimination of unused objects
- Profile object sections in memory
Create LDF Wizard

This is a memory map view of the generated .ldf file. In this mode, each section’s start and end address are shown in a list format.
This is a graphical view of the memory map. Double click on the section to zoom in.
Unmapped sections can be ‘mapped’ simply by dragging to an appropriate memory segment.
How to create Library Functions
Section 11

Direct Memory Access (DMA)
BF533 DMA Overview

- The ADSP-BF533 DMA controller allows data transfer operations without processor intervention
  - Core sets up registers or descriptors
  - Core responds to interrupts when data is available
- Types of data transfers
  - Internal or External Memory <-> Internal or External Memory
  - Internal or External Memory <-> Serial Peripheral Interface (SPI)
  - Internal or External Memory <-> Serial Port
  - Internal or External Memory <-> UART Port
  - Internal or External Memory <-> Parallel Port Interface
The ADSP-BF533 system includes 6 DMA-capable peripherals, including the Memory DMA controller (MemDMA), with 12 DMA channels and bus masters that support these devices:

- SPORT0 RCV DMA Channel
- SPORT1 RCV DMA Channel
- SPORT0 XMT DMA Channel
- SPORT1 XMT DMA Channel
- SPI DMA Channel
- UART RCV Channel
- UART XMT Channel
- PPI DMA Channel
- 4 Memory DMA Channels (equating to 2 Memory DMA streams)
BF533 DMA Buses

- The DMA Access Bus (DAB) provides a means for DMA channels to be accessed by the peripherals.

- The DMA External Bus (DEB) provides a means for DMA channels to gain access to off-chip memory.
  - The core processor has priority over the DEB on the External Port Bus (EPB) for off-chip memory.

- The DMA Core Bus (DCB) provides a means for DMA channels to gain access to on-chip memory.
  - The DCB has priority over the core processor on arbitration into L1 memory configured as SRAM.
BF533 DMA Priority

The ADSP-BF533 processor uses the following priority arbitration policy on the DAB.

<table>
<thead>
<tr>
<th>DMA Channel</th>
<th>Default Peripheral Mapping</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 – highest</td>
<td>PPI</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>1</td>
<td>SPORT0 RX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>2</td>
<td>SPORT0 TX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>3</td>
<td>SPORT1 RX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>4</td>
<td>SPORT1 TX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>5</td>
<td>SPI</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>6</td>
<td>UART RX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>7</td>
<td>UART TX</td>
<td>Re-assignable</td>
</tr>
<tr>
<td>8</td>
<td>Memory DMA Stream 0 TX (destination)</td>
<td>Fixed</td>
</tr>
<tr>
<td>9</td>
<td>Memory DMA Stream 0 RX (source)</td>
<td>Fixed</td>
</tr>
<tr>
<td>10</td>
<td>Memory DMA Stream 1 TX (destination)</td>
<td>Fixed</td>
</tr>
<tr>
<td>11 – lowest</td>
<td>Memory DMA Stream 1 RX (source)</td>
<td>Fixed</td>
</tr>
</tbody>
</table>
Peripheral Map Register

The Peripheral Map Register allows the user to map a peripheral to a specific channel thus programming the priority of each peripheral.

DMA Initialization

To initiate a DMA transfer, certain parameters need to be defined before the DMA engine can start a DMA sequence. These parameters are:

- **Configuration**
  - Describes characteristics of the DMA transfer, such as data size and transfer direction.

- **Start Address**
  - Specifies the address where the DMA transfer will start from.

- **Count**
  - Specifies the number of elements the DMA Engine will transfer.

- **Modify**
  - Specifies the address increment after every element transfer.
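
The Start Address, Count, and Modify parameters together define which addresses a one-dimensional transfer touches. The sketch below models that in C. The struct `dma_1d` and the function `dma_1d_addr` are hypothetical names, and the modify value is treated as a signed byte increment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of a 1D DMA setup (not an MMR layout). */
typedef struct {
    uint32_t start_addr;  /* Start Address: first element */
    uint16_t x_count;     /* Count: number of elements */
    int16_t  x_modify;    /* Modify: byte increment after each element */
} dma_1d;

/* Address of the i-th element transferred, for 0 <= i < x_count. */
static uint32_t dma_1d_addr(const dma_1d *d, uint16_t i)
{
    assert(d != NULL && i < d->x_count);
    /* The signed product is converted to uint32_t so negative strides wrap correctly. */
    return d->start_addr + (uint32_t)((int32_t)d->x_modify * (int32_t)i);
}
```

With 16-bit elements and a modify of 2 the transfer walks a contiguous buffer; a larger modify skips elements, and a negative one walks backwards.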
DMA Schemes

Two Types of DMA transfers available on the ADSP-BF533/BF561

- **Descriptor-based DMA transfers**
  - Requires a set of parameters stored within memory to initiate a DMA sequence. These parameters are transferred to DMA control registers upon a start of a DMA transfer.
  - Supports chaining of multiple DMA transfers.

- **Register-based DMA transfers**
  - Allows the user to program the DMA control registers directly to define and initiate a DMA sequence.
  - Upon DMA completion, depending on certain bits within the Configuration Register:
    - Control registers are automatically updated with their original setup values (Autobuffer Mode).
    - Or the DMA Channel gracefully shuts off (Stop Mode).
Descriptor Blocks

Descriptor Array Mode

- **0x0**
  - Start_Addr[15:0]
- **0x2**
  - Start_Addr[31:16]
- **0x4**
  - DMA_Config
- **0x6**
  - X_Count
- **0x8**
  - X_Modify
- **0xA**
  - Y_Count
- **0xC**
  - Y_Modify
- **0xE**
  - Start_Addr[15:0]
- **0x10**
  - Start_Addr[31:16]
- **0x12**
  - DMA_Config
- **0x14**
  - X_Count
- **0x16**
  - X_Modify
- **0x18**
  - Y_Count
- **0x1A**
  - Y_Modify
- **0x1C**
  - Start_Addr[15:0]
- **0x1E**
  - Start_Addr[31:16]
- **0x20**
  - DMA_Config

Descriptor Block 1

Descriptor Block 2

Descriptor Block 3

Descriptor List (Small Model) Mode

- **Next_Desc_Ptr[15:0]**
- **Start_Addr[15:0]**
- **Start_Addr[31:16]**
- **DMA_Config**
- **X_Count**
- **X_Modify**
- **Y_Count**
- **Y_Modify**

Descriptor List (Large Model) Mode

- **Next_Desc_Ptr[15:0]**
- **Next_Desc_Ptr[31:16]**
- **Start_Addr[15:0]**
- **Start_Addr[31:16]**
- **DMA_Config**
- **X_Count**
- **X_Modify**
- **Y_Count**
- **Y_Modify**

Transfer Modes

The Transfer Mode is controlled by 3 bits called the FLOW[2:0] bits within the DMA Configuration Register.

- **Stop Mode (FLOW = 0x0).**
  - When the current DMA transfer completes, the DMA channel stops automatically, after signaling an interrupt if enabled.

- **Autobuffer Mode (FLOW = 0x1).**
  - DMA is performed in a continuous circular-buffer fashion based on user-programmed DMAx MMR settings. On completion of the DMA transfer, the Parameter registers are reloaded into the Current registers, and DMA resumes immediately with zero overhead. Autobuffer mode is stopped by a user write of 0 to the DMA enable bit in the DMAx_DMA_Config Register.

- **Descriptor Array Mode (FLOW = 0x4).**
  - In this mode, the Descriptor Block does not include the NEXT_DESC_PTR parameter. Descriptor Blocks are placed one after the other within memory like an array.

- **Descriptor List (Small Model) Mode (FLOW = 0x6).**
  - In this mode, the Descriptor Block does not include the upper 16 bits of the NEXT_DESC_PTR parameter. The upper 16 bits are taken from the upper 16 bits of the NEXT_DESC_PTR register, thus confining all descriptors to a specific 64K page in memory.

- **Descriptor List (Large Model) Mode (FLOW = 0x7).**
  - In this mode, Descriptor Block includes all 32 bits of the NEXT_DESC_PTR parameter, thus allowing maximum flexibility in locating descriptors in memory.
Descriptor Block Structures

Depending on the Descriptor Mode used, the following lists the order of the Descriptor Block Parameters stored within memory:

<table>
<thead>
<tr>
<th>Descriptor Offset</th>
<th>Descriptor Array Mode (FLOW = 0x4)</th>
<th>Small Descriptor List Mode (FLOW = 0x6)</th>
<th>Large Descriptor List Mode (FLOW = 0x7)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x0</td>
<td>START_ADDR[15:0]</td>
<td>NEXT_DESC_PTR[15:0]</td>
<td>NEXT_DESC_PTR[15:0]</td>
</tr>
<tr>
<td>0x2</td>
<td>START_ADDR[31:16]</td>
<td>START_ADDR[15:0]</td>
<td>NEXT_DESC_PTR[31:16]</td>
</tr>
<tr>
<td>0x4</td>
<td>DMA_CONFIG</td>
<td>START_ADDR[31:16]</td>
<td>START_ADDR[15:0]</td>
</tr>
<tr>
<td>0x6</td>
<td>X_COUNT</td>
<td>DMA_CONFIG</td>
<td>START_ADDR[31:16]</td>
</tr>
<tr>
<td>0x8</td>
<td>X_MODIFY</td>
<td>X_COUNT</td>
<td>DMA_CONFIG</td>
</tr>
<tr>
<td>0xA</td>
<td>Y_COUNT</td>
<td>X_MODIFY</td>
<td>X_COUNT</td>
</tr>
<tr>
<td>0xC</td>
<td>Y_MODIFY</td>
<td>Y_COUNT</td>
<td>X_MODIFY</td>
</tr>
<tr>
<td>0xE</td>
<td></td>
<td>Y_MODIFY</td>
<td>Y_COUNT</td>
</tr>
<tr>
<td>0x10</td>
<td></td>
<td></td>
<td>Y_MODIFY</td>
</tr>
</tbody>
</table>

NOTE: Not all of the Parameters need to be initialized within the Descriptor Block depending on the NDSIZE value within the DMA Configuration Register. The NDSIZE value is the number of Parameters that the DMA engine will fetch for the next Descriptor Block.
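
The three descriptor layouts map naturally onto C structs of 16-bit fields. The sketch below uses hypothetical type names and is meant only to make the parameter order and sizes concrete. Because every member is 16 bits wide, no padding is expected, and the sizes come out to 7, 8, and 9 half-words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor Array Mode (FLOW = 0x4): no NEXT_DESC_PTR. */
typedef struct {
    uint16_t start_addr_lo, start_addr_hi;
    uint16_t dma_config;
    uint16_t x_count;  int16_t x_modify;
    uint16_t y_count;  int16_t y_modify;
} dma_desc_array;

/* Descriptor List, Small Model (FLOW = 0x6): lower 16 bits of NEXT_DESC_PTR only. */
typedef struct {
    uint16_t next_desc_ptr_lo;
    uint16_t start_addr_lo, start_addr_hi;
    uint16_t dma_config;
    uint16_t x_count;  int16_t x_modify;
    uint16_t y_count;  int16_t y_modify;
} dma_desc_small;

/* Descriptor List, Large Model (FLOW = 0x7): full 32-bit NEXT_DESC_PTR. */
typedef struct {
    uint16_t next_desc_ptr_lo, next_desc_ptr_hi;
    uint16_t start_addr_lo, start_addr_hi;
    uint16_t dma_config;
    uint16_t x_count;  int16_t x_modify;
    uint16_t y_count;  int16_t y_modify;
} dma_desc_large;

/* Hypothetical helper: split two 32-bit addresses into the 16-bit halves. */
static void desc_large_set(dma_desc_large *d, uint32_t next, uint32_t start)
{
    assert(d != NULL);
    d->next_desc_ptr_lo = (uint16_t)(next & 0xFFFFu);
    d->next_desc_ptr_hi = (uint16_t)(next >> 16);
    d->start_addr_lo    = (uint16_t)(start & 0xFFFFu);
    d->start_addr_hi    = (uint16_t)(start >> 16);
}
```

If NDSIZE is smaller than the full block, only the leading members are fetched, so trailing fields can be left out of memory in that case.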
DMA Register Setup

To start DMA operation, some or all of the DMA Parameter Registers must first be initialized depending on the ‘Next Descriptor Size’(NDSIZE) and ‘FLOW’ bits in the DMA Configuration Register. After Initialization, DMA operation begins by writing a 1 to the DMA Enable bit in the DMA Configuration Register.

1) **FLOW = 0x0 (Stop Mode), NDSIZE = 0x0:**
   - Initialize all of the following:
     - START_ADDR
     - X_COUNT
     - X_MODIFY
     - Y_COUNT (if 2D DMA)
     - Y_MODIFY (if 2D DMA)
     - DMA_CONFIG

2) **FLOW = 0x1 (Autobuffer Mode), NDSIZE = 0x0:**
   - Initialize all of the following:
     - START_ADDR
     - X_COUNT
     - X_MODIFY
     - Y_COUNT (if 2D DMA)
     - Y_MODIFY (if 2D DMA)
     - DMA_CONFIG

3) **FLOW = 0x4 (Descriptor Array Mode), NDSIZE = 0x0 – 0x7:**
   - Initialize at least:
     - CURR_DESC_PTR[31:16]
     - CURR_DESC_PTR[15:0]

4) **FLOW = 0x6 (Small Descriptor List Mode), NDSIZE = 0x0 – 0x8:**
   - Initialize at least:
     - NEXT_DESC_PTR[31:16]
     - NEXT_DESC_PTR[15:0]

5) **FLOW = 0x7 (Large Descriptor List Mode), NDSIZE = 0x0 – 0x9:**
   - Initialize at least:
     - NEXT_DESC_PTR[31:16]
     - NEXT_DESC_PTR[15:0]
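
The FLOW and NDSIZE settings above end up in a single DMA Configuration word. The following sketch composes such a word in C. The helper names are hypothetical, and the bit positions (enable in bit 0, direction in bit 1, word size in bits 3:2, NDSIZE in bits 11:8, FLOW in bits 14:12) are assumed from the ADSP-BF533 hardware reference, so check them against it before relying on the values.

```c
#include <assert.h>
#include <stdint.h>

/* Bit assignments assumed from the ADSP-BF533 hardware reference. */
#define DMAEN        0x0001u   /* DMA Enable */
#define WNR          0x0002u   /* Transfer Direction: 1 = Memory Write */
#define WDSIZE_8     0x0000u
#define WDSIZE_16    0x0004u
#define WDSIZE_32    0x0008u
#define NDSIZE_SHIFT 8
#define FLOW_SHIFT   12

enum { FLOW_STOP = 0, FLOW_AUTO = 1, FLOW_ARRAY = 4, FLOW_SMALL = 6, FLOW_LARGE = 7 };

/* Largest NDSIZE allowed for each FLOW setting, per the list above. */
static unsigned max_ndsize(unsigned flow)
{
    switch (flow) {
    case FLOW_ARRAY: return 7u;
    case FLOW_SMALL: return 8u;
    case FLOW_LARGE: return 9u;
    default:         return 0u;   /* Stop and Autobuffer use NDSIZE = 0 */
    }
}

/* Hypothetical helper: compose a DMA configuration word. */
static uint16_t dma_config(unsigned flow, unsigned ndsize, uint16_t wdsize,
                           int mem_write, int enable)
{
    uint16_t v;
    assert(flow <= 7u && ndsize <= max_ndsize(flow));
    assert(wdsize == WDSIZE_8 || wdsize == WDSIZE_16 || wdsize == WDSIZE_32);
    v = (uint16_t)((flow << FLOW_SHIFT) | (ndsize << NDSIZE_SHIFT) | wdsize);
    if (mem_write)
        v |= WNR;
    if (enable)
        v |= DMAEN;
    return v;
}
```

In practice the word with the enable bit set is written last, after the other parameter registers, since writing a 1 to DMA Enable starts the operation.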
How to Stop DMA Transfers

» FLOW = 0x0 (Stop Mode):
  • DMA stops automatically after the DMA transfer is complete.

» FLOW = 0x1 (Autobuffer Mode):
  • Write a 0 to the DMA Enable bit in the DMA Configuration Register. A write of 0x0 to the entire register will always terminate DMA gracefully (without DMA Abort).

» FLOW = 0x4, 0x6, 0x7 (Array / List Mode):
  • Set the final DMA_CONFIG Register with FLOW = 0x0 setting to gracefully stop the DMA channel. If the DMA_CONFIG Parameter is not included within the Descriptor Block, use the FLOW = 0x1 method above to end the DMA.
Memory DMA (MemDMA)

- Allows memory-to-memory DMA transfers between the various ADSP-BF533 memory spaces
- A single MemDMA transfer requires a pair of DMA channels:
  - One to specify the Source block of memory
  - One to specify the Destination block of memory
- The ADSP-BF533 has four MemDMA channels, which allow 2 memory-to-memory DMA transfers to be set up at the same time
  - Two Source DMA Channels – used to read from memory
  - Two Destination DMA Channels – used to write to memory
- The Source and Destination DMA Channels share an 8-entry, 16-bit FIFO (32-bit FIFOs on the BF561)
  - Source DMA Channel fills the FIFO
  - Destination DMA Channel empties the FIFO
Memory DMA (MemDMA)

- Each DMA transfer sequence requires two sets of Descriptor Blocks within memory
  - One for the Source DMA Channel
  - One for the Destination DMA Channel
  - Both sets of Descriptor Blocks must be configured to have the same transfer count and data size but they can have different modify values.
  - The DMA Configuration Register for the source channel must be written before the DMA Configuration Register for the destination channel. When the destination DMA Configuration Register is written, MemDMA operation starts after a latency of 3 SCLK cycles.
- It is preferable to activate interrupts on only one channel
  - Eliminates ambiguity when trying to identify the channel (either source or destination) that requested the interrupt
Prioritization and Traffic Control

- Traffic can be independently controlled for each of the three buses (DAB, DCB, and DEB) with simple counters
  - alternation of transfers between MDMA streams can also be controlled
- Using the traffic control features, the DMA system preferentially grants data transfers on the DAB or memory buses (DCB and DEB), which are going in the same read/write direction as the previous transfer, until either the traffic control counter times out, or until traffic stops or changes direction on its own.
- When a count field in TC_CNT expires, it is automatically reloaded with the appropriate value programmed in TC_PER (i.e., period value).
- When a DAB, DEB, or DCB counter decrements from 1 to 0, the opposite-direction DAB, DCB, or DEB access is preferred.
  - This may result in a direction change.
- When the MDMA counter decrements from 1 to 0, the next available MDMA stream is selected.
  - If the MDMA period is set to 0, then MDMA is scheduled by fixed priority.
  - If the MDMA period is set to a value $p$ with $1 \leq p \leq 31$, the two MDMA streams are granted bus access in alternate bursts of up to $p$ data transfers.
Traffic Control (cont’d)

Important Register: Allows the definition of transfer sizes in a given direction on DMA busses

Max values usually yield best performance but it is application dependent

2 Reads and 2 writes are more efficient with traffic control

Arrows represent transfers in and out of SDRAM
Two-Dimensional DMA (2D DMA)

Supports arbitrary row and column sizes up to 64K x 64K elements. X_Count = row size and Y_Count = column size.

[Figure: 2D DMA array labeled with X_COUNT (must be 2 or greater), Y_COUNT, X_MODIFY, and Y_MODIFY]
Two-Dimensional DMA (2D DMA)

- **X_Modify**
  - is the byte-address increment applied after each transfer that decrements Curr_X_Count.
  - is not applied when the inner loop (row) count is ended by decrementing Curr_X_Count from 1 to 0.

- **Y_Modify**
  - is the byte-address increment applied after each decrement of Curr_Y_Count.
  - is not applied to the last element in the array on which the outer loop (column) count, Curr_Y_Count, also expires by decrementing from 1 to 0.

- **After the last transfer completes,**
  - Curr_Y_Count = 1
  - Curr_X_Count = 0
  - Curr_Addr is equal to the last item’s address plus X_Modify.

- In Autobuffer Mode, these registers are reloaded from X_Count, Y_Count, and Start_Addr upon the first data transfer.
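
The X_Modify and Y_Modify rules above can be checked with a small host-side model that lists the address of every transfer. The function `dma_2d_addresses` is a hypothetical name, and the model only reproduces the sequence of transfer addresses; it makes no claim about the final register contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the address of each 2D DMA transfer into out[] (up to max_out entries).
 * Returns the total number of transfers, x_count * y_count. */
static size_t dma_2d_addresses(uint32_t start,
                               uint16_t x_count, int16_t x_modify,
                               uint16_t y_count, int16_t y_modify,
                               uint32_t *out, size_t max_out)
{
    uint32_t addr = start;
    size_t n = 0;
    unsigned x, y;

    assert(out != NULL || max_out == 0);
    for (y = 0; y < y_count; ++y) {
        for (x = 0; x < x_count; ++x) {
            if (n < max_out)
                out[n] = addr;
            ++n;
            if (x + 1u < x_count)
                addr += (uint32_t)(int32_t)x_modify;   /* inside a row */
            else if (y + 1u < y_count)
                addr += (uint32_t)(int32_t)y_modify;   /* row end: Y_Modify replaces X_Modify */
        }
    }
    return n;
}
```

As a usage example, copying a 4 x 3 block of bytes out of an image that is 16 bytes wide uses X_Count = 4, X_Modify = 1, Y_Count = 3, and Y_Modify = 16 - 4 + 1 = 13, because Y_Modify is applied from the last element of one row to the first element of the next.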
BF533 MMRs for Peripheral DMA

<table>
<thead>
<tr>
<th>MMR Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>DMAx_NEXT_DESC_PTR</td>
<td>Link pointer to next descriptor</td>
</tr>
<tr>
<td>DMAx_START_ADDR</td>
<td>Start address of DMA buffer</td>
</tr>
<tr>
<td>DMAx_DMA_CONFIG</td>
<td>DMA configuration register</td>
</tr>
<tr>
<td>DMAx_X_COUNT</td>
<td>Inner loop count</td>
</tr>
<tr>
<td>DMAx_X_MODIFY</td>
<td>Inner loop address increment, in bytes</td>
</tr>
<tr>
<td>DMAx_Y_COUNT</td>
<td>Outer loop count (2D DMA only)</td>
</tr>
<tr>
<td>DMAx_Y_MODIFY</td>
<td>Outer loop address increment, in bytes</td>
</tr>
<tr>
<td>DMAx_CURR_DESC_PTR</td>
<td>Current Descriptor Pointer</td>
</tr>
<tr>
<td>DMAx_CURR_ADDR</td>
<td>Current DMA Address</td>
</tr>
<tr>
<td>DMAx_IRQ_STATUS</td>
<td>Interrupt Status Register contains completion and error interrupt status information</td>
</tr>
<tr>
<td>DMAx_PERIPHERAL_MAP</td>
<td>Priority mapping register</td>
</tr>
<tr>
<td>DMAx_CURR_X_COUNT</td>
<td>Current count (1D) or intra-row X count (2D)</td>
</tr>
<tr>
<td>DMAx_CURR_Y_COUNT</td>
<td>Current row count (2D DMA only)</td>
</tr>
</tbody>
</table>
BF533 MMRs for Memory DMA

<table>
<thead>
<tr>
<th>MMR Name (yy = S0, S1, D0, D1)</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>MDMA_yy_NEXT_DESC_PTR</td>
<td>Link pointer to next descriptor</td>
</tr>
<tr>
<td>MDMA_yy_START_ADDR</td>
<td>Start address of DMA buffer</td>
</tr>
<tr>
<td>MDMA_yy_DMA_CONFIG</td>
<td>DMA configuration register</td>
</tr>
<tr>
<td>MDMA_yy_X_COUNT</td>
<td>Inner loop count</td>
</tr>
<tr>
<td>MDMA_yy_X_MODIFY</td>
<td>Inner loop address increment, in bytes</td>
</tr>
<tr>
<td>MDMA_yy_Y_COUNT</td>
<td>Outer loop count (2D DMA only)</td>
</tr>
<tr>
<td>MDMA_yy_Y_MODIFY</td>
<td>Outer loop address increment, in bytes</td>
</tr>
<tr>
<td>MDMA_yy_CURR_DESC_PTR</td>
<td>Current Descriptor Pointer</td>
</tr>
<tr>
<td>MDMA_yy_CURR_ADDR</td>
<td>Current DMA Address</td>
</tr>
<tr>
<td>MDMA_yy_IRQ_STATUS</td>
<td>Interrupt Status Register contains completion and error interrupt status</td>
</tr>
<tr>
<td>MDMA_yy_PERIPHERAL_MAP</td>
<td>Priority mapping register (read only)</td>
</tr>
<tr>
<td>MDMA_yy_CURR_X_COUNT</td>
<td>Current count (1D) or intra-row X count (2D)</td>
</tr>
<tr>
<td>MDMA_yy_CURR_Y_COUNT</td>
<td>Current row count (2D DMA only)</td>
</tr>
</tbody>
</table>
Next Descriptor Pointer Register

- Specifies the location of the Next Descriptor Block when the current DMA transfer finishes. Used only in Small and Large Descriptor List Modes. Contents of this register are copied into the Curr_Desc_Ptr Register at the start of a descriptor block fetch. Disregarded in Stop, Autobuffer, and Descriptor Array Mode.

Reset = 0x0000 0000
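
A descriptor block in memory can be pictured as a packed run of 16-bit words whose first entries hold the link pointer. The sketch below is a host-side illustration only. The large-model word order (pointer low/high, start address low/high, configuration, then the count and modify words) is assumed from the ADSP-BF533 hardware reference, not from this slide, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed large-model descriptor layout: nine 16-bit words, which is
 * why NDSIZE would be programmed to 9 for this model.
 * Field names are illustrative, not vendor identifiers. */
struct dma_desc_large {
    uint16_t next_desc_lo;   /* Next_Desc_Ptr bits 15:0  */
    uint16_t next_desc_hi;   /* Next_Desc_Ptr bits 31:16 */
    uint16_t start_addr_lo;  /* Start_Addr bits 15:0     */
    uint16_t start_addr_hi;  /* Start_Addr bits 31:16    */
    uint16_t config;         /* DMA configuration word   */
    uint16_t x_count;
    int16_t  x_modify;
    uint16_t y_count;
    int16_t  y_modify;
};

/* Store a 32-bit link address as two 16-bit halves. */
static void desc_set_next(struct dma_desc_large *d, uint32_t next_addr)
{
    assert(d != 0);
    d->next_desc_lo = (uint16_t)(next_addr & 0xFFFFu);
    d->next_desc_hi = (uint16_t)(next_addr >> 16);
}

/* Reassemble the link address the DMA engine would follow. */
static uint32_t desc_get_next(const struct dma_desc_large *d)
{
    assert(d != 0);
    return ((uint32_t)d->next_desc_hi << 16) | d->next_desc_lo;
}
```

Linking the last descriptor back to the first one with `desc_set_next` would give a circular list. In the small model only the low half of the pointer is stored in the descriptor.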
**DMA Configuration Register**

**DMAx_CONFIG / MDMA_yy_CONFIG**

<table>
<thead>
<tr>
<th>Bit</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>DMA Buffer Clear</td>
</tr>
<tr>
<td>4</td>
<td>DMA Mode</td>
</tr>
<tr>
<td>3:2</td>
<td>Transfer Word Size</td>
</tr>
<tr>
<td>1</td>
<td>Transfer Direction (Bit 1 cannot be modified for some peripherals and MemDMA)</td>
</tr>
<tr>
<td>0</td>
<td>DMA Enable</td>
</tr>
</tbody>
</table>

- **DMA Enable**
  - 0 = Disabled
  - 1 = Enabled

- **Transfer Direction**
  - 0 = Memory Read
  - 1 = Memory Write

- **Transfer Word Size**
  - 00 = 8-bit transfers
  - 01 = 16-bit transfers
  - 10 = 32-bit transfers
  - 11 = reserved

- **DMA Buffer Clear**
  - 0 = Retain DMA FIFO data between DMA transfers
  - 1 = Discard DMA FIFO before beginning DMA transfer

- **DMA Mode**
  - 0 = Linear
  - 1 = 2D DMA

DMA Configuration Register (cont.)

DMAx_CONFIG / MDMA_yy_CONFIG

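<table>
<thead>
<tr>
<th>Bit</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>15</td>
<td>Reserved</td>
</tr>
<tr>
<td>14:12</td>
<td>FLOW (Next Operation)</td>
</tr>
<tr>
<td>11:8</td>
<td>NDSIZE (Next Descriptor Size)</td>
</tr>
<tr>
<td>7</td>
<td>Interrupt Enable</td>
</tr>
<tr>
<td>6</td>
<td>Interrupt Timing Select</td>
</tr>
</tbody>
</table>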

**FLOW (Next Operation)**
- 0x0 = Stop
- 0x1 = Autobuffer Mode
- 0x4 = Descriptor Array
- 0x6 = Descriptor List (small model)
- 0x7 = Descriptor List (large model)

**NDSIZE (Next Descriptor Size)**
- 0000 = Required if Stop or Autobuffer Mode
- 0001 – 1001 = Descriptor Size
- 1010 – 1111 = Reserved

**Interrupt Timing Select**
- 0 = Interrupt after completing whole buffer
- 1 = Interrupt after completing each row (inner loop), 2D only

**Interrupt Enable**
- 0 = Do not allow completion of DMA transfer to generate an interrupt
- 1 = Allow completion of DMA transfer to generate an interrupt
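
The field encodings above can be combined into a single 16-bit configuration word. The following is a minimal host-side sketch. The bit positions (DMA Enable in bit 0 up to FLOW in bits 14:12) are assumed from the ADSP-BF533 hardware reference, and the macro, enum, and function names are hypothetical rather than taken from any vendor header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions for DMAx_CONFIG / MDMA_yy_CONFIG. */
#define CFG_DMAEN        (1u << 0)  /* DMA Enable                        */
#define CFG_WNR          (1u << 1)  /* Transfer Direction: 1 = mem write */
#define CFG_WDSIZE_SHIFT 2          /* Transfer Word Size, 2 bits        */
#define CFG_DMA2D        (1u << 4)  /* DMA Mode: 1 = 2D DMA              */
#define CFG_RESTART      (1u << 5)  /* DMA Buffer Clear                  */
#define CFG_DI_SEL       (1u << 6)  /* Interrupt Timing Select           */
#define CFG_DI_EN        (1u << 7)  /* Interrupt Enable                  */
#define CFG_NDSIZE_SHIFT 8          /* Next Descriptor Size, 4 bits      */
#define CFG_FLOW_SHIFT   12         /* Next Operation, 3 bits            */

enum cfg_flow {
    CFG_FLOW_STOP       = 0x0,
    CFG_FLOW_AUTOBUFFER = 0x1,
    CFG_FLOW_ARRAY      = 0x4,
    CFG_FLOW_SMALL_LIST = 0x6,
    CFG_FLOW_LARGE_LIST = 0x7
};

enum cfg_wdsize { CFG_WDSIZE_8 = 0, CFG_WDSIZE_16 = 1, CFG_WDSIZE_32 = 2 };

/* Build a configuration word from its fields. 'flags' is an OR of the
 * single-bit CFG_* macros above. */
static uint16_t dma_config_word(enum cfg_flow flow, unsigned ndsize,
                                enum cfg_wdsize wdsize, unsigned flags)
{
    /* NDSIZE must be 0 in Stop and Autobuffer Mode, 1..9 otherwise. */
    if (flow == CFG_FLOW_STOP || flow == CFG_FLOW_AUTOBUFFER)
        assert(ndsize == 0);
    else
        assert(ndsize >= 1 && ndsize <= 9);

    return (uint16_t)(((unsigned)flow << CFG_FLOW_SHIFT) |
                      (ndsize << CFG_NDSIZE_SHIFT) |
                      ((unsigned)wdsize << CFG_WDSIZE_SHIFT) |
                      flags);
}
```

For example, an autobuffered 2D receive of 16-bit words with a completion interrupt would be `dma_config_word(CFG_FLOW_AUTOBUFFER, 0, CFG_WDSIZE_16, CFG_DMAEN | CFG_WNR | CFG_DMA2D | CFG_DI_EN)`. On a real target the result would be written to the channel's configuration MMR last, since setting DMA Enable starts the channel.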
Start Address Register

◆ Specifies the address of the data buffer currently targeted for DMA. The contents of the Start_Addr Register are copied into the Curr_Addr Register at the start of a DMA transfer.
X Count Register

For 2D DMA, the X_Count Register contains the inner loop count. For 1D DMA, it specifies the number of elements (8-, 16-, or 32-bit) to read in. A value of 0x0 in X_Count corresponds to 65,536 elements.
X Address Increment Register

This register contains a signed, two's-complement byte-address increment. In 1D DMA, this increment is the stride that is applied after transferring each element.

In 2D DMA, this increment is applied after transferring each element in the inner loop, up to but not including the last element in each inner loop. After the last element in each inner loop, Y_Modify is applied instead.
Outer Loop Count Register

For 2D DMA, the Y_Count Register contains the outer loop count, that is, the number of rows in the 2D DMA sequence.

It is not used in 1D DMA.
Outer Loop Address Increment Register

This register contains a signed, two's-complement byte-address increment. In 2D DMA, this increment is applied after each decrement of Curr_Y_Count, except after the last item in the 2D array, on which Curr_Y_Count also expires.

The value is the offset between the last word of one “row” and the first word of the next “row”.
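
The X_Modify and Y_Modify rules can be checked on a host by walking the address sequence in software. The sketch below is a model of the behaviour described on these slides, not driver code. The struct, the `dma2d_walk` function, and the `window_y_modify` helper (which computes Y_Modify for pulling a window out of a larger frame) are hypothetical names, and the model assumes Y_Count is at least 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dma2d {
    uint32_t start_addr;
    uint16_t x_count;   /* inner loop count; 0 stands for 65,536 */
    int16_t  x_modify;  /* signed byte increment within a row    */
    uint16_t y_count;   /* number of rows; assumed >= 1 here     */
    int16_t  y_modify;  /* signed byte increment between rows    */
};

/* Y_Modify for a window of width_elems elements inside a frame whose
 * rows are pitch_bytes apart: the offset from the last element of one
 * row to the first element of the next. */
static int16_t window_y_modify(uint32_t pitch_bytes, uint32_t width_elems,
                               uint32_t elem_bytes)
{
    assert(width_elems >= 1);
    int32_t m = (int32_t)pitch_bytes -
                (int32_t)((width_elems - 1u) * elem_bytes);
    assert(m >= INT16_MIN && m <= INT16_MAX);
    return (int16_t)m;
}

/* Record up to 'max' element addresses in 'out', store the total number
 * of elements in '*n_out', and return the final Curr_Addr. */
static uint32_t dma2d_walk(const struct dma2d *d, uint32_t *out,
                           size_t max, size_t *n_out)
{
    assert(d->y_count >= 1);
    uint32_t cols = d->x_count ? d->x_count : 65536u;
    uint32_t rows = d->y_count;
    uint32_t addr = d->start_addr;
    size_t n = 0;

    for (uint32_t r = 0; r < rows; r++) {
        for (uint32_t c = 0; c < cols; c++) {
            if (n < max)
                out[n] = addr;
            n++;
            int row_end  = (c == cols - 1u);
            int last_row = (r == rows - 1u);
            /* Y_Modify replaces X_Modify at the end of every row except
             * the last; after the very last element X_Modify is applied. */
            if (row_end && !last_row)
                addr += (uint32_t)(int32_t)d->y_modify;
            else
                addr += (uint32_t)(int32_t)d->x_modify;
        }
    }
    *n_out = n;
    return addr;
}
```

With a 16-byte row pitch and 2-byte elements, a 3 x 2 window starting at 0x1000 uses X_Modify = 2 and Y_Modify = 12, so the walk jumps from 0x1004 to 0x1010 at the row boundary.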
Current Descriptor Pointer Register

- Contains the memory address of the next descriptor element to be loaded. Curr_Desc_Ptr Register increments as each descriptor element is read in. For Descriptor Array Mode, the Curr_Desc_Ptr Register must be programmed, not the Next_Desc_Ptr Register, to initiate a DMA transfer.

Reset = 0x0000 0000

Current Address Register

Contains the current DMA transfer address. At the start of a DMA transfer, the Curr_Addr Register is loaded from the Start_Addr Register and it is incremented as each transfer occurs.
Current X Count Register

- This register is loaded from X_Count at the beginning of each DMA transfer.
- It is decremented each time an element is transferred.
- For 2D DMA, Curr_X_Count is reloaded from X_Count at the end of each row.
- In 1D DMA, expiration of the count in this register signifies that DMA is complete. In 2D DMA, this register is 0 only when the entire transfer is complete.
Current Outer Loop Count Register

DMAx_CURR_Y_COUNT / MDMA_yy_CURR_Y_COUNT

This register is loaded from Y_Count at the beginning of each 2D DMA transfer.

Not used for 1D DMA.

This register is decremented each time that the Curr_X_Count Register expires during 2D DMA (1 to X_Count or 1 to 0 transition), signifying completion of an entire row transfer.

After 2D DMA is complete, Curr_Y_Count = 1 and Curr_X_Count = 0.

Reset = 0x0000
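
The interplay of the two current-count registers can be traced with a small software model. This is a sketch under one stated assumption: on the final row the outer count is left at 1 rather than decremented, which is how the model reproduces the end state quoted above (Curr_Y_Count = 1, Curr_X_Count = 0). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct dma_counters {
    uint32_t x_count;  /* programmed inner loop count        */
    uint32_t curr_x;   /* model of Curr_X_Count              */
    uint32_t curr_y;   /* model of Curr_Y_Count              */
};

/* Load the current counts at the beginning of a 2D transfer. */
static void counters_start(struct dma_counters *c,
                           uint32_t x_count, uint32_t y_count)
{
    assert(x_count >= 1 && y_count >= 1);
    c->x_count = x_count;
    c->curr_x  = x_count;
    c->curr_y  = y_count;
}

/* Account for one transferred element.
 * Returns 1 when the whole 2D transfer is complete, otherwise 0. */
static int counters_step(struct dma_counters *c)
{
    assert(c->curr_x > 0);
    c->curr_x--;
    if (c->curr_x != 0)
        return 0;                 /* still inside the current row */
    if (c->curr_y > 1) {
        c->curr_y--;              /* a full row has been moved    */
        c->curr_x = c->x_count;   /* reload for the next row      */
        return 0;
    }
    return 1;                     /* last row: X stays 0, Y stays 1 */
}
```

Stepping a 3 x 2 transfer six times leaves the model at Curr_X_Count = 0 and Curr_Y_Count = 1. Only the sixth call reports completion.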
## Interrupt Status Register

**DMAx_IRQ_STATUS / MDMA_yy_IRQ_STATUS**

<table>
<thead>
<tr>
<th>Bit</th>
<th>Description</th>
<th>Reset value</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:4</td>
<td>Reserved</td>
<td>0</td>
</tr>
<tr>
<td>3</td>
<td><strong>DMA_RUN (DMA Channel Running) – RO</strong></td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td><strong>DFETCH (DMA Descriptor Fetch) – RO</strong></td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td><strong>DMA_ERR (DMA Error Interrupt Status) – W1C</strong></td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td><strong>DMA_DONE (DMA Completion Interrupt Status) – W1C</strong></td>
<td>0</td>
</tr>
</tbody>
</table>

Reset = 0x0000

- **DMA_RUN (DMA Channel Running) – RO**
  This bit is set to 1 automatically when the DMA_CONFIG register is written.
  - 0 = This DMA channel is disabled, or it is enabled but paused
  - 1 = This DMA channel is enabled and operating, either transferring data or fetching a DMA descriptor

- **DFETCH (DMA Descriptor Fetch) – RO**
  This bit is set to 1 automatically when the DMA_CONFIG register is written with FLOW = 0x4 – 0x7.
  - 0 = This DMA channel is disabled, or it is enabled but stopped
  - 1 = This DMA channel is enabled and presently fetching a DMA descriptor

- **DMA_DONE (DMA Completion Interrupt Status) – W1C**
  - 0 = No interrupt is being asserted for this channel
  - 1 = DMA transfer has completed, and this DMA channel's interrupt is being asserted

- **DMA_ERR (DMA Error Interrupt Status) – W1C**
  - 0 = No DMA error has occurred
  - 1 = A DMA error has occurred, and the global DMA error interrupt is being asserted
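
A handler typically reads this register, decides what state the channel is in, and writes back a 1 to each W1C bit it wants to clear. The sketch below shows that decode on a plain 16-bit value so it can be tried on a host. The bit positions (DMA_DONE in bit 0 through DMA_RUN in bit 3) are assumed from the ADSP-BF533 hardware reference, and the macro, enum, and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in DMAx_IRQ_STATUS / MDMA_yy_IRQ_STATUS. */
#define IRQ_DMA_DONE (1u << 0)  /* W1C */
#define IRQ_DMA_ERR  (1u << 1)  /* W1C */
#define IRQ_DFETCH   (1u << 2)  /* RO  */
#define IRQ_DMA_RUN  (1u << 3)  /* RO  */

enum dma_state { DMA_IDLE, DMA_FETCHING, DMA_TRANSFERRING, DMA_FAILED };

/* Classify a status value read from the register. */
static enum dma_state dma_classify(uint16_t status)
{
    if (status & IRQ_DMA_ERR)
        return DMA_FAILED;
    if (!(status & IRQ_DMA_RUN))
        return DMA_IDLE;          /* disabled, or enabled but paused */
    return (status & IRQ_DFETCH) ? DMA_FETCHING : DMA_TRANSFERRING;
}

/* Value to write back so that only the pending W1C bits are cleared. */
static uint16_t dma_w1c_ack(uint16_t status)
{
    uint16_t ack = (uint16_t)(status & (IRQ_DMA_DONE | IRQ_DMA_ERR));
    assert((ack & (IRQ_DFETCH | IRQ_DMA_RUN)) == 0);  /* RO bits masked */
    return ack;
}
```

Masking the read-only bits out of the acknowledge value keeps the write-back limited to the two interrupt status bits.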
DMA Traffic Control Counter Period Register

TC_PER

- **MDMA ROUND ROBIN PERIOD[4:0]**
  Max. length of MDMA round-robin bursts. If not zero, any MDMA stream which receives a grant is allowed up to that number of DMA transfers, to the exclusion of the other MDMA streams.

- **DCB TRAFFIC PERIOD[3:0]**
  0000 = No DCB bus transfer grouping performed
  Other = Preferred length of unidirectional bursts on the DCB bus between the DMA and internal L1 memory

- **DAB TRAFFIC PERIOD[2:0]**
  000 = No DAB bus transfer grouping performed
  Other = Preferred length of unidirectional bursts on the DAB bus between the DMA and the peripherals.

- **DEB TRAFFIC PERIOD[3:0]**
  0000 = No DEB bus transfer grouping performed
  Other = Preferred length of unidirectional bursts on the DEB bus between the DMA and external memory.
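
The four period fields fill one 16-bit register exactly (5 + 3 + 4 + 4 bits). A minimal sketch of packing them follows. The field order used here (MDMA round robin in bits 15:11, DAB in 10:8, DEB in 7:4, DCB in 3:0) is assumed from the ADSP-BF533 hardware reference, and `tc_per_pack` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the traffic control periods into a TC_PER value.
 * Assumed layout: MDMA_RR[15:11] | DAB[10:8] | DEB[7:4] | DCB[3:0]. */
static uint16_t tc_per_pack(unsigned mdma_rr, unsigned dab,
                            unsigned deb, unsigned dcb)
{
    /* Each argument must fit its field width. */
    assert(mdma_rr <= 31u && dab <= 7u && deb <= 15u && dcb <= 15u);
    return (uint16_t)((mdma_rr << 11) | (dab << 8) | (deb << 4) | dcb);
}
```

For instance, `tc_per_pack(4, 2, 3, 5)` yields 0x2235, and passing all zeros disables grouping on every bus.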
DMA Traffic Control Counter Register

**TC_CNT - RO**

- **MDMA_ROUND_ROBIN_COUNT[4:0]**
  Current cycle count remaining in the MDMA round robin period

- **DCB_TRAFFIC_COUNT[3:0]**
  Current cycle count remaining in the DCB traffic period

- **DAB_TRAFFIC_COUNT[2:0]**
  Current cycle count remaining in the DAB traffic period

- **DEB_TRAFFIC_COUNT[3:0]**
  Current cycle count remaining in the DEB traffic period
