Compare RP2040 Executable Memory Regions and Load Program From Flash into SRAM

Compare RP2040 executable memory regions (flash XIP space, striped SRAM and banked SRAM). Discuss memory strategies, data hazard and structural hazard. Load program from flash into SRAM then execute from SRAM.

Raspberry Pi Pico, RP2040, ARM, Cortex M0+, Assembly, Baremetal, Memory, Memory structural hazard, Linker script, Virtual memory address, Load memory address, MCU

--by Captdam @ Mar 12, 2026 ~~Mar 8, 2026~~


This article is intended for developers who are familiar with 8-bit MCUs and use Assembly and C language to develop bare metal applications, but new to 32-bit RP2040 and ARM Cortex-M0+.

Since we are creating bare metal applications, we will be directly writing to and reading from the MCU control registers. No library is used.

We will rely heavily on the official documents. They include all the information we need about the MCU control registers.

Because the RP2040 document gets updated, and links to the old document are redirected to the new one, I decided to keep a copy of the current version (2025-02-20) on my server. You may also obtain this document from the official link here (as of 2026-02-10).

This article is based on my previous article: Switch RP2040 Clock Source in Baremetal: ROSC, XOSC and PLL. You may want to read it first.

Difference Between Memory Regions

The ARM Cortex M0+ CPU (used in RP2040) is a Von Neumann architecture machine, meaning it has one universal memory space for program, data and device (control registers).

This table shows that different regions in the memory space are for different purposes. Among them, address 0x00000000 to 0x3FFFFFFF and address 0x60000000 to 0x9FFFFFFF can be used for the program.

RP2040 divides these executable memory regions into smaller segments:

0x00000000 - 0x00FFFFFF - Boot ROM. 16kiB size. Cannot be used to store user programs, but it is executed at boot-up and provides some utility functions.
0x10000000 - 0x1FFFFFFF - (Mapped) external flash via XIP. 16MiB max (24-bit address). Different cache strategies are allowed using different base addresses.
0x20000000 - 0x2FFFFFFF - SRAM. 256kiB + 8kiB in size. Different access patterns are allowed using different base addresses.

Boot ROM

The boot ROM is 16kiB in size and it is at the beginning of the RP2040 memory space. Boot ROM is executed at power-up, serving as the first stage bootloader.

Boot ROM is factory-burned and cannot be modified. Hence, we cannot place our program (user program) in the boot ROM. Writing to this region has no effect. However, the boot ROM provides some utility functions that can be used in user programs.

(Mapped) Flash Memory

The external flash memory is not directly connected to the RP2040 memory space. However, RP2040 provides XIP in SSI to cache the external flash memory into the XIP memory region. This makes the external flash memory logically a part of the RP2040 internal memory.

The maximum flash size is 16MiB, or 0x1000000 bytes, which translates to a 24-bit address. The flash is mirrored four times with different cache strategies:

0x10000000 - 0x10FFFFFF - cacheable, allocating. Reading will check cache first. If miss, cache will be updated.
0x11000000 - 0x11FFFFFF - cacheable, non-allocating. Reading will check cache first. If miss, read from flash but don't update cache.
0x12000000 - 0x12FFFFFF - non-cacheable, allocating. Reading will not check cache. Always read from flash and update cache.
0x13000000 - 0x13FFFFFF - non-cacheable, non-allocating. Bypass cache. Always read from flash, don't update cache.
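The four aliases above differ only in address bits 24 and 25, so the cache policy can be read directly from the address. The following is a minimal sketch of that decoding; the type and function names (`xip_alias`, `xip_decode`) are hypothetical and not part of any SDK, and only the address ranges come from the list above.

```c
#include <stdint.h>

/* Hypothetical helper type: describes how one address in the XIP window is treated. */
typedef struct {
    int in_xip_window;      /* 1 if addr is in 0x10000000 - 0x13FFFFFF */
    int cacheable;          /* 1: a read checks the cache first */
    int allocating;         /* 1: a read from flash updates the cache */
    uint32_t flash_offset;  /* 24-bit offset into the flash device */
} xip_alias;

/* Decode the alias from address bits 25 (cacheable) and 24 (allocating). */
static xip_alias xip_decode(uint32_t addr) {
    xip_alias a = {0, 0, 0, 0};
    if ((addr & 0xFC000000u) != 0x10000000u) {
        return a;                        /* not one of the four flash aliases */
    }
    a.in_xip_window = 1;
    a.cacheable     = ((addr >> 25) & 1u) == 0;
    a.allocating    = ((addr >> 24) & 1u) == 0;
    a.flash_offset  = addr & 0x00FFFFFFu;
    return a;
}
```

For example, `xip_decode(0x13000100)` reports a non-cacheable, non-allocating access to flash offset 0x100, while 0x10000100 reaches the same flash byte through the cache.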

The XIP cache is 2-way set-associative and 16 kiB in size. This suits two cores that each execute sequentially, or reads with good locality. A hit takes 1 cycle.

If the cache misses, the XIP must read the required data from flash. This stalls the CPU.

The Pico on-board flash is 2MiB in size.

SRAM

RP2040 provides 264 kiB of SRAM memory in 6 banks. SRAM banks 0 to 3 are 64 kiB each; banks 4 and 5 are 4 kiB each. A read or write takes 1 cycle.

The four 64 kiB memory banks can be grouped into a 256 kiB striped block. The striped memory is mapped at address 0x20000000 - 0x2003FFFF: 0x20000000 is mapped to word 0 of SRAM bank 0, 0x20000004 is mapped to word 0 of bank 1, 0x20000010 is mapped to word 1 of bank 0, and so on.
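The striping rule above means address bits 3:2 select the bank and the remaining offset bits select the word inside that bank. A minimal sketch of that arithmetic follows; the function names are hypothetical and chosen for illustration only.

```c
#include <stdint.h>

#define SRAM_STRIPED_BASE 0x20000000u
#define SRAM_STRIPED_SIZE 0x00040000u   /* 256 kiB */

/* Bank (0..3) serving a striped address, or -1 if outside the striped window. */
static int striped_bank(uint32_t addr) {
    if (addr - SRAM_STRIPED_BASE >= SRAM_STRIPED_SIZE) {
        return -1;                      /* unsigned wrap also rejects addr < base */
    }
    return (int)((addr >> 2) & 3u);     /* consecutive words rotate through banks */
}

/* Word index inside the selected bank (each bank contributes one word per 16 bytes). */
static uint32_t striped_word(uint32_t addr) {
    return (addr - SRAM_STRIPED_BASE) >> 4;
}
```

With these helpers, 0x20000004 resolves to bank 1 word 0 and 0x20000010 resolves to bank 0 word 1, matching the examples in the paragraph above.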

Striped SRAM gives the best average performance for random access. When one core is accessing a specific SRAM bank, there is only a 1/4 chance that the other core is accessing the same bank (assuming accesses are spread evenly). Hence, a 25% chance of structural hazard.

The 2 4k SRAM banks can not be striped. SRAM bank 4 is mapped at address 0x20040000 - 0x20040FFF, SRAM bank 5 is mapped at address 0x20041000 - 0x20041FFF.

We can treat the striped four 64k banks and the two 4k banks as one contiguous 264 kiB memory chunk.

If we do not want to use SRAM banks 0 to 3 in striped mode, we can access them through address 0x21000000 - 0x2103FFFF. Bank 0 is at 0x21000000 - 0x2100FFFF, bank 1 is at 0x21010000 - 0x2101FFFF, and so on.
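Because both windows reach the same four banks, any striped address has an equivalent address in the non-striped window. The sketch below derives it from the two mappings described in this section; `striped_to_unstriped` is a hypothetical helper name, and the caller is assumed to pass an address inside 0x20000000 - 0x2003FFFF.

```c
#include <stdint.h>

/* Translate a striped SRAM address into the non-striped alias of the same byte.
 * Assumes addr lies inside the 256 kiB striped window. */
static uint32_t striped_to_unstriped(uint32_t addr) {
    uint32_t off  = addr - 0x20000000u;
    uint32_t bank = (off >> 2) & 3u;    /* which 64 kiB bank holds this word */
    uint32_t word = off >> 4;           /* word index inside that bank */
    uint32_t byte = off & 3u;           /* byte lane inside the word */
    return 0x21000000u + bank * 0x10000u + word * 4u + byte;
}
```

For example, the striped address 0x20000004 (word 0 of bank 1) corresponds to 0x21010000, the first word of bank 1 in the non-striped window.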

Unstriped SRAM is best when we dedicate SRAM banks to a core. If we can make sure that a bank dedicated to one core will never be accessed by another core (or by DMA), there will be no structural hazard.

We cannot treat the two 4k banks and the unstriped four 64k banks as one contiguous memory chunk, because they are not contiguous: address 0x20042000 - 0x20FFFFFF is void. However, we can treat the unstriped four 64k banks as one contiguous 256 kiB memory chunk.

Other Memories

RP2040 comes with 2 buffers that can be used for data and program if the buffer is not used for its intended purpose:

XIP cache - 16kiB at 0x15000000
USB DPRAM - 4kiB at 0x50100000 (Yes, it is executable)

Stall

Cache Miss

RP2040 XIP — XIP caches the flash to memory space allowing direct CPU access ©RPI2040

The CPU runs faster than external devices. To prevent the external devices from slowing down the CPU, data from external devices must be cached in a small but fast memory (the same speed as the CPU, or just a little slower). When the CPU reads data from the external device, it does not read from and wait for the external device; instead, it reads the cache. If the required data is in the cache, the CPU reads it with zero delay (or a few cycles of delay). If the data is not cached, the CPU stalls until that data is loaded into the cache. A cache miss is harmless except for the delay.

This terminology is used in modern PC CPUs and memory. The CPU (running at some GHz) is tens or hundreds of times faster than the memory (running at some hundreds MHz, note the data transfer speed is different from the memory clock speed). Hence, the CPU comes with an internal data cache and an internal instruction cache. In some cases, there will be multiple stages of CPU caches (L1, L2, L3 caches).

The CPU used in RP2040 does not have cache. SRAM is clocked at the same speed as the CPU, accessing is 1 cycle. However, the external flash device which stores the program is cached (XIP cache). When we execute a program in the flash memory space, the CPU reads the instruction (or program data) at a specific address in the XIP memory space. The XIP hardware will first check if the required instruction (or data) is cached. If hit, the instruction (or data) is returned immediately. If not, the CPU stalls, until the XIP hardware reads the required instruction (or data) from the external flash device into the XIP cache.

The XIP cache is 2-way and 16 kiB in size.
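To make hit and miss behavior concrete, here is a small host-side model of a 2-way set-associative cache of 16 kiB. It is only a sketch: the 8-byte line size and the least-recently-used replacement are assumptions made for illustration and are not taken from this article, and all names (`cache_model`, `cache_read`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES (16u * 1024u)
#define WAYS        2u
#define LINE_BYTES  8u                                  /* assumption */
#define SETS        (CACHE_BYTES / (WAYS * LINE_BYTES)) /* 1024 sets */

typedef struct {
    uint32_t tag[SETS][WAYS];
    uint8_t  valid[SETS][WAYS];
    uint8_t  victim[SETS];      /* way to replace next (assumed LRU policy) */
} cache_model;

static void cache_reset(cache_model *c) {
    memset(c, 0, sizeof *c);
}

/* Returns 1 on hit, 0 on miss. A miss allocates the line (allocating alias). */
static int cache_read(cache_model *c, uint32_t flash_offset) {
    uint32_t line = flash_offset / LINE_BYTES;
    uint32_t set  = line % SETS;
    uint32_t tag  = line / SETS;
    for (uint32_t w = 0; w < WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->victim[set] = (uint8_t)(w ^ 1u);   /* the other way is now older */
            return 1;
        }
    }
    uint32_t w = c->victim[set];
    c->tag[set][w]   = tag;
    c->valid[set][w] = 1;
    c->victim[set]   = (uint8_t)(w ^ 1u);
    return 0;
}
```

In this model two code regions that are 8 kiB apart map to the same set and can coexist (one per way), but a third region another 8 kiB away evicts one of them. That is the kind of unpredictable delay a timing-critical program wants to avoid.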

Structural Hazard

RP2040 AHB-Lite crossbar has 4 upstream and 10 downstream ©RP2040

RP2040 SIO connection — The SIO is directly connected to both cores ©RP2040

Structural hazard occurs when a resource is accessed by multiple devices, but the resource can only respond to one device at a time. Hence, some of the devices must temporarily stall (waiting). Structural hazard is harmless except for delay.

The CPUs used in RP2040 have no cache. When executing an instruction at a specific address, the instruction must be read from that address at the time of execution. Similarly, when reading or writing data at a specific location (either general data or a control register value), the data must be read from or written to that address at the time of execution.

As the figure shows, the 4:10 AHB-Lite crossbar is used to connect the CPUs and DMA to the memory spaces. At any given time, each upstream (for example, CPU 0 - AHB-Lite crossbar) and downstream (for example, SRAM 0 - AHB-Lite crossbar) can carry at most one read or write operation.

Therefore, memory read / write instructions like ldr and str require 2 cycles to complete. The first cycle is used to read the instruction; the second cycle is used to read or write the data.

Furthermore, if two CPUs (or one CPU and the DMA) are trying to access the same downstream memory or device, one must stall until the other one finishes the access.

The access priority can be configured for each of the 4 upstreams (CPU 0, CPU 1, DMA write and DMA read). If two upstreams with the same priority access the same downstream memory or device at the same time, priority is given in a round-robin fashion.
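The round-robin rule for two equal-priority upstreams can be sketched as a tiny arbiter model. This is an illustration of the idea only, not a description of the actual crossbar logic; `rr_grant` and its parameters are hypothetical names.

```c
/* Model of one downstream port shared by two upstreams of equal priority.
 * req0/req1: 1 if that upstream requests the port this cycle.
 * last:      in/out, index of the upstream granted most recently.
 * Returns the upstream granted this cycle, or -1 if nobody requested. */
static int rr_grant(int req0, int req1, int *last) {
    if (req0 && req1) {
        int granted = (*last == 0) ? 1 : 0;   /* the other one goes first */
        *last = granted;
        return granted;                       /* the loser stalls this cycle */
    }
    if (req0) { *last = 0; return 0; }
    if (req1) { *last = 1; return 1; }
    return -1;
}
```

When both upstreams request the same port every cycle, the grants alternate, so each one is stalled about half of the time.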

On the other hand, reading from or writing to SIO only consumes 1 CPU cycle, because the SIO is directly connected to the CPU without routing it to the AHB-Lite crossbar. That means, reading the instruction from memory via AHB-Lite crossbar and accessing the SIO can occur in the same cycle.

If both CPUs write to the same SIO register (GPIO output only) at the same time, the write from CPU 0 is applied first and is then immediately overwritten by CPU 1. In other words, only the output from CPU 1 takes effect.

Memory Strategy

In this example, assume we are creating a timing-critical dual-core application in which we would like to avoid any chance of cache miss and structural hazard. That means:

To avoid cache miss: We will place the program in SRAM only.
To avoid structural hazard: We will dedicate each SRAM bank to one core.

The following layout is what I will use in this example:

SRAM bank 0 and 1: Dedicated data memory for CPU 0. (128kiB)
SRAM bank 2 and 3: Dedicated data memory for CPU 1. (128kiB)
SRAM bank 4: Vector, read-only data (constants), stack and program instruction for CPU 0. (4kiB)
SRAM bank 5: Vector, read-only data (constants), stack and program instruction for CPU 1. (4kiB)
Flash: Data or program not immediately in use. Can be loaded into SRAM later, shortly before it is required.
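The layout above can be written down as an ownership lookup, which is handy for checking that a pointer used by one core does not land in a bank reserved for the other. This is a minimal sketch using the non-striped addresses of banks 0 to 3; the enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum { OWNER_NONE = -1, OWNER_CORE0 = 0, OWNER_CORE1 = 1 } sram_owner_t;

/* Which core owns the SRAM bank behind this address, under the plan above. */
static sram_owner_t sram_owner(uint32_t addr) {
    if (addr >= 0x21000000u && addr <= 0x2101FFFFu) return OWNER_CORE0; /* banks 0, 1 */
    if (addr >= 0x21020000u && addr <= 0x2103FFFFu) return OWNER_CORE1; /* banks 2, 3 */
    if (addr >= 0x20040000u && addr <= 0x20040FFFu) return OWNER_CORE0; /* bank 4 */
    if (addr >= 0x20041000u && addr <= 0x20041FFFu) return OWNER_CORE1; /* bank 5 */
    return OWNER_NONE;   /* striped window, flash, peripherals, or unmapped */
}
```

The striped window is deliberately reported as `OWNER_NONE`, since a striped address can hit any of banks 0 to 3.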

Since reading from or writing to SRAM always takes 2 cycles, it is OK to place data of one core in the same SRAM bank as its program.

In most cases, the program is small but the data is large. Therefore, I decided to use the small SRAM banks for programs and large SRAM banks for data. Furthermore, I do not like recursion; hence, the stack will not be very large. It is absolutely fine to use the larger SRAM banks for the program or place the program and stack in different banks.

I will not use the striped memory. It is useful only when the programmer does not dedicate a memory bank to each upstream. It lowers the chance of contention but gives no timing guarantee.

Dual-core Application Start From USB

Project file for USB boot can be found here.

3rd stage Bootloader

First, we will create an assembly code program called boot3 (bootloader 3). This program will launch core 1 with core 1's program and then start core 0 with core 0's program. Save this file as boot3.s:

RP2040 SIO_FIFO_* — Inter-processor Mailbox FIFO Status, Write and Read @ 0xD0000050 - 0xD000005B ©RPI2040


.cpu cortex-m0plus
.thumb
.align 2
.thumb_func

.section .boot, "ax"

.global boot3
boot3:
	@ Start core 1 main program
	ldr	r7, =c1_vector
	ldr	r6, =0xd0000050		@ SIO_FIFO_ST
	mov	r0, #1
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = 1
	str	r7, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector
	ldr	r0, [r7, #0x00]
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector[0] = SP
	ldr	r0, [r7, #0x04]
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector[1] = c1_reset
	sev

	@ Enter core 0 main program
	ldr	r7, =c0_vector
	ldr	r0, [r7, #0x00]		@ c0_vector[0] = SP
	mov	sp, r0
	ldr	r0, [r7, #0x04]		@ c0_vector[1] = c0_reset
	bx	r0
	
.global boot3_clearInterprocessorMailboxRx
boot3_clearInterprocessorMailboxRx:
	push	{r0, r1}
	ldr	r1, =0xd0000050		@ SIO_FIFO_ST
1:	ldr	r0, [r1, #0x00]		@ SIO_FIFO_ST
	lsr	r0, #1			@ VLD
	bcc	2f
	ldr	r0, [r1, #0x08]		@ SIO_FIFO_RD
	b	1b
2:	mov	r0, #0b1100
	str	r0, [r1, #0x00]		@ SIO_FIFO_ST
	pop	{r0, r1}
	bx	lr

There is nothing new in the boot3 routine. We discussed how to launch core 1 in the previous article.

We do create a subroutine that helps empty the inter-processor mailbox, boot3_clearInterprocessorMailboxRx. This subroutine reads the mailbox FIFO until the FIFO is no longer valid (which means the FIFO is empty), then clears the sticky error flags in SIO_FIFO_ST. It should be executed at the very beginning of both core 0's and core 1's programs.
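The drain loop can also be expressed in C. Since the real registers only exist on the chip, the sketch below runs against a hypothetical host-side model of the mailbox (`mailbox_model` and the `mbox_*` helpers are invented for illustration); only the bit positions and the control flow mirror the assembly above.

```c
#include <stdint.h>

#define FIFO_ST_VLD (1u << 0)   /* RX FIFO holds at least one word */
#define FIFO_ST_WOF (1u << 2)   /* sticky flag: write on full */
#define FIFO_ST_ROE (1u << 3)   /* sticky flag: read on empty */

/* Hypothetical stand-in for SIO_FIFO_ST / SIO_FIFO_RD on a host machine. */
typedef struct {
    uint32_t rx[8];
    unsigned head, count;
    uint32_t sticky;
} mailbox_model;

static uint32_t mbox_read_st(const mailbox_model *m) {
    return (m->count ? FIFO_ST_VLD : 0u) | m->sticky;
}

static uint32_t mbox_read_rd(mailbox_model *m) {
    if (m->count == 0) { m->sticky |= FIFO_ST_ROE; return 0; }
    uint32_t v = m->rx[m->head];
    m->head = (m->head + 1u) % 8u;
    m->count--;
    return v;
}

static void mbox_write_st(mailbox_model *m, uint32_t v) {
    m->sticky &= ~(v & (FIFO_ST_WOF | FIFO_ST_ROE));   /* write 1 to clear */
}

/* Same steps as boot3_clearInterprocessorMailboxRx: pop while VLD is set,
 * then clear the sticky flags. Returns the number of words discarded. */
static unsigned mbox_drain(mailbox_model *m) {
    unsigned discarded = 0;
    while (mbox_read_st(m) & FIFO_ST_VLD) {
        (void)mbox_read_rd(m);
        discarded++;
    }
    mbox_write_st(m, FIFO_ST_WOF | FIFO_ST_ROE);        /* 0b1100 */
    return discarded;
}
```

On the real chip the three `mbox_*` accessors would be replaced by volatile reads and writes at 0xD0000050 (status) and 0xD0000058 (read), as in the assembly listing.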

Once both cores are running and their mailboxes are cleared, the bootloader is no longer required. Programs can use SRAM banks 0 to 3 and overwrite this bootloader.

Name this section .boot and make it allocatable and executable so the linker can allocate space for them. Also make the two routines global so we can use them in other files.

Program on Core 0

Next, we will create an assembly code program that runs on core 0. Save this file as main.s:

Data Dedicated to Core 0


.section .c0_data, "aw"

c0_static0:	.space	3
c0_static1:	.byte	0x15

Create a new section .c0_data for core 0's data:

c0_static0 - Reserve 3 bytes for this data. No initial value is specified (the assembler will fill this with 0).
c0_static1 - 1-byte long. Initial value is 0x15.

I am not gonna use these variables. I just want to show that data dedicated to core 0 is stored in the same SRAM bank as the program for core 0.

Vector Table for Core 0


.section .c0_vector, "aw"
.global c0_vector
c0_vector:
	.word	0x20041000
	.word	c0_reset + 1

Create a new section .c0_vector for core 0's vector table:

The first vector is the initial SP. Set it to the top of SRAM bank 4.
The second vector is core 0's entry point. Set it to c0_reset. Furthermore, Thumb requires the LSB of instruction address to be set; hence, add 1.
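The "add 1" rule can be checked with one line of arithmetic. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Entry value stored in a vector table for Thumb code located at code_addr.
 * code_addr is halfword aligned, so setting bit 0 is the same as adding 1. */
static uint32_t thumb_entry(uint32_t code_addr) {
    return code_addr | 1u;
}
```

For a routine located at 0x2004000C this gives 0x2004000D, which is the form of value expected in the second word of the vector table.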

Although the vector table should contain 48 32-bit vectors, we are not using any interrupts in this example, so we can omit the rest.

Program for Core 0


.section .c0_text, "awx"

.global c0_reset
c0_reset:
	bl	boot3_clearInterprocessorMailboxRx

	ldr	r7, =0x4000f000		@ RESETS_RESET  + 0x3000
	mov	r0, #(1<<5)		@ IO_BANK0
	str	r0, [r7, #0x00]
	ldr	r7, =0x400140cc		@ IO_BANK0_GPIO25_CTRL
	mov	r0, #5			@ SIO
	str	r0, [r7, #0x00]
	ldr	r7, =0xd0000020		@ SIO_GPIO_OE
	ldr	r0, =(1<<25)		@ GPIO25
	str	r0, [r7, #0x00]

	ldr	r0, =c1_static0
	mov	r8, r0
1:
	mov	r0, r8
@	ldm	r0, {r0, r1, r2, r3, r4, r5, r6, r7}	@ Structural hazard test 
	b	1b

Create a new section .c0_text for core 0's program:

There is only one routine called c0_reset. We will:

Call boot3_clearInterprocessorMailboxRx defined in boot3 to empty the mailbox.
Configure GPIO to allow output.
Structural hazard test. This optional section is used to cause structural hazard in core 1. We will comment this section out for now.

Program on Core 1

Then, create a C language program that runs on core 1. Save this file as main.c:

Program for Core 1


#include <stdint.h>

extern void boot3_clearInterprocessorMailboxRx();

void c1_reset() __attribute__((section (".c1_text"))) __attribute__((naked));
void c1_reset() {
	boot3_clearInterprocessorMailboxRx();
	for (;;) {
		*(uint32_t volatile * const)(0xd000001c) = (1<<25); // GPIO XOR
	}
}

Declare the external function boot3_clearInterprocessorMailboxRx. This tells the compiler that the function is defined elsewhere, so the compiler will not complain when it cannot find the definition at compile time. The reference is resolved at link time.

Define main function for core 1 c1_reset, using attributes:

section (".c1_text") - Place this function in section .c1_text instead of the default program (instruction text) section .text.
naked - This is the main function and it will never return; hence, there is no need to preserve the caller's registers. By default, the link register lr and other general purpose registers used in this function are pushed onto the stack to preserve their values, because the caller expects no change in these registers. Making the main function naked can save some stack space.

In this function, we will:

Call boot3_clearInterprocessorMailboxRx defined in boot3 to empty the mailbox.
In a dead loop, flip the GPIO output.

Vector Table for Core 1


uint32_t c1_vector[48] __attribute__ ((section(".c1_vector"))) = {
	0x20042000,
	(uint32_t)c1_reset
};

Define the vector table, its section name is .c1_vector. In this vector table:

The first vector is the initial SP. Set it to the top of SRAM bank 5.
The second vector is core 1's entry point. Set it to c1_reset. In C language, the compiler automatically sets the address LSB.

We are not using any interrupts in this example, so the remaining vectors are left unspecified (the compiler fills them with 0).

Data Dedicated to Core 1


__attribute__((section (".c1_data"))) volatile uint32_t c1_static0 = 5;

Create a new section .c1_data for core 1's data:

Define a variable c1_static0, that:

This variable is dedicated to core 1. So, let's put it in section .c1_data.
Its initial value is 5.
Make it volatile. This prevents the compiler from optimizing accesses to it away or keeping its value only in a register.

I am not gonna use this variable in core 1. I just want to show that data dedicated to core 1 is stored in the same SRAM bank as the program for core 1.

Link and Compile

Now, create a linker script main.ld:


MEMORY {
	SRAM(rwx) : ORIGIN = 0x20000000, LENGTH = 256k
	SRAM_4(rwx) : ORIGIN = 0x20040000, LENGTH = 4k
	SRAM_5(rwx) : ORIGIN = 0x20041000, LENGTH = 4k
	SRAM_0(rwx) : ORIGIN = 0x21000000, LENGTH = 64k
	SRAM_1(rwx) : ORIGIN = 0x21010000, LENGTH = 64k
	SRAM_2(rwx) : ORIGIN = 0x21020000, LENGTH = 64k
	SRAM_3(rwx) : ORIGIN = 0x21030000, LENGTH = 64k
}

ENTRY(boot3)

SECTIONS {
	.text : {
		*(boot3)
		*(boot3_clearInterprocessorMailboxRx)
	} > SRAM

	.core0 : {
		. = ALIGN (256);
		*(.c0_vector)
		*(.c0_data)
		*(.c0_text)
	} > SRAM_4

	.core1 : {
		. = ALIGN (256);
		*(.c1_vector)
		*(.c1_data)
		*(.c1_text)
	} > SRAM_5
}

First, provide the address of each memory bank.

Place our 3rd stage bootloader boot3 and boot3_clearInterprocessorMailboxRx in SRAM (striped memory banks). Set the entry point to boot3.

For USB boot, the entry point must be the beginning of striped memory at 0x20000000 (Thumb 0x20000001). Furthermore, the section name must be .text.

Place core 0's vector table .c0_vector, data .c0_data, program .c0_text in SRAM bank 4. Vector table first.

Place core 1's vector table .c1_vector, data .c1_data, program .c1_text in SRAM bank 5.

It is redundant to place . = ALIGN (256) before vector tables, because the beginning of each bank is 4k aligned.

Next, compile this project:


arm-none-eabi-as --warn --fatal-warnings -g boot3.s -o boot3.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs boot3.o > boot3.s.list

arm-none-eabi-as --warn --fatal-warnings -g main.s -o main.s.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs main.s.o > main.s.list

arm-none-eabi-gcc -mcpu=cortex-m0plus -c -O3 main.c -o main.c.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs main.c.o > main.c.list

arm-none-eabi-ld -nostdlib -nostartfiles -T main.ld boot3.o main.s.o main.c.o -o main.elf
arm-none-eabi-objdump --disassembler-options=force-thumb -dxs main.elf > main.list
pico-elf2uf2 main.elf main.uf2

Verify Output

Now, we can verify the output using the disassembled code main.list:

`boot`


Disassembly of section .boot:

20000000 <boot3>:
20000000:	4f0c      	ldr	r7, [pc, #48]	; (20000034 <boot3_clearInterprocessorMailboxRx+0x16>)
20000002:	4e0d      	ldr	r6, [pc, #52]	; (20000038 <boot3_clearInterprocessorMailboxRx+0x1a>)
20000004:	2001      	movs	r0, #1
20000006:	6070      	str	r0, [r6, #4]
20000008:	6077      	str	r7, [r6, #4]
2000000a:	6838      	ldr	r0, [r7, #0]
2000000c:	6070      	str	r0, [r6, #4]
2000000e:	6878      	ldr	r0, [r7, #4]
20000010:	6070      	str	r0, [r6, #4]
20000012:	bf40      	sev
20000014:	4f09      	ldr	r7, [pc, #36]	; (2000003c <boot3_clearInterprocessorMailboxRx+0x1e>)
20000016:	6838      	ldr	r0, [r7, #0]
20000018:	4685      	mov	sp, r0
2000001a:	6878      	ldr	r0, [r7, #4]
2000001c:	4700      	bx	r0

2000001e <boot3_clearInterprocessorMailboxRx>:
2000001e:	b403      	push	{r0, r1}
20000020:	4905      	ldr	r1, [pc, #20]	; (20000038 <boot3_clearInterprocessorMailboxRx+0x1a>)
20000022:	6808      	ldr	r0, [r1, #0]
20000024:	0840      	lsrs	r0, r0, #1
20000026:	d301      	bcc.n	2000002c <boot3_clearInterprocessorMailboxRx+0xe>
20000028:	6888      	ldr	r0, [r1, #8]
2000002a:	e7fa      	b.n	20000022 <boot3_clearInterprocessorMailboxRx+0x4>
2000002c:	200c      	movs	r0, #12
2000002e:	6008      	str	r0, [r1, #0]
20000030:	bc03      	pop	{r0, r1}
20000032:	4770      	bx	lr

20000034:	20041000 	.word	0x20041000
20000038:	d0000050 	.word	0xd0000050
2000003c:	20040000 	.word	0x20040000

Routine boot3 is placed at the beginning of the striped SRAM at address 0x20000000. This routine will be executed at boot.

Routine boot3_clearInterprocessorMailboxRx is placed immediately after boot3, at address 0x2000001e, in striped SRAM.

Followed by constants used in the assembly code, including:

.word 0x20041000 @ 20000034 - Vector table address for core 1.
.word 0x20040000 @ 2000003C - Vector table address for core 0.
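The addresses shown in parentheses in the listing (for example `ldr r7, [pc, #48]` at 0x20000000 pointing to 0x20000034) follow from how Thumb computes a PC-relative load: the base is the instruction address plus 4, rounded down to a multiple of 4. A minimal sketch of that calculation, with a hypothetical function name:

```c
#include <stdint.h>

/* Address read by a Thumb "ldr rX, [pc, #imm]" located at insn_addr. */
static uint32_t thumb_literal_addr(uint32_t insn_addr, uint32_t imm) {
    uint32_t base = (insn_addr + 4u) & ~3u;   /* PC reads as insn + 4, word aligned */
    return base + imm;
}
```

Applied to the listing, the loads at 0x20000000 (#48), 0x20000002 (#52) and 0x20000014 (#36) resolve to 0x20000034, 0x20000038 and 0x2000003C, which are exactly the three literal words printed after the code.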

`core0`


Disassembly of section .core0:

20040000 <c0_vector>:
20040000:	20041000 	.word	0x20041000
20040004:	2004000d 	.word	0x2004000d

20040008 <c0_static0>:
20040008:	0000      	.short	0x0000
	...

2004000b <c0_static1>:
2004000b:	15          	.byte	0x15

2004000c <c0_reset>:
2004000c:	f7c0 f807 	bl	2000001e <boot3_clearInterprocessorMailboxRx>
20040010:	4f06      	ldr	r7, [pc, #24]	; (2004002c <c0_reset+0x20>)
20040012:	2020      	movs	r0, #32
20040014:	6038      	str	r0, [r7, #0]
20040016:	4f06      	ldr	r7, [pc, #24]	; (20040030 <c0_reset+0x24>)
20040018:	2005      	movs	r0, #5
2004001a:	6038      	str	r0, [r7, #0]
2004001c:	4f05      	ldr	r7, [pc, #20]	; (20040034 <c0_reset+0x28>)
2004001e:	4806      	ldr	r0, [pc, #24]	; (20040038 <c0_reset+0x2c>)
20040020:	6038      	str	r0, [r7, #0]
20040022:	4806      	ldr	r0, [pc, #24]	; (2004003c <c0_reset+0x30>)
20040024:	4680      	mov	r8, r0
20040026:	4640      	mov	r0, r8
20040028:	e7fd      	b.n	20040026 <c0_reset+0x1a>
2004002a:	0000      	.short	0x0000

2004002c:	4000f000 	.word	0x4000f000
20040030:	400140cc 	.word	0x400140cc
20040034:	d0000020 	.word	0xd0000020
20040038:	02000000 	.word	0x02000000
2004003c:	200410c0 	.word	0x200410c0

Vector table for core 0 c0_vector is placed at the beginning of the SRAM bank 4 at address 0x20040000. It is 256-byte aligned.

Data dedicated to core 0 c0_static0 and c0_static1 is placed immediately after c0_vector, at address 0x20040008, in SRAM bank 4.

c0_static0 and c0_static1 are byte data. No alignment is required.

Program for core 0 c0_reset is placed immediately after c0_static1, at address 0x2004000c, in SRAM bank 4.
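The addresses in the core 0 listing can be cross-checked by describing the same layout as a C struct and asking the compiler for the offsets. This is only a host-side sanity check; the struct and helper names are hypothetical and are not used by the project.

```c
#include <stddef.h>
#include <stdint.h>

#define BANK4_BASE 0x20040000u

/* Mirror of what the linker script places in SRAM bank 4, in order. */
struct core0_image {
    uint32_t vector[2];       /* initial SP, reset entry */
    uint8_t  c0_static0[3];   /* .space 3 */
    uint8_t  c0_static1;      /* .byte 0x15 */
    uint16_t c0_text[1];      /* first halfword of c0_reset */
};

/* Run-time address of a member, given its offset inside the image. */
static uint32_t core0_addr(size_t offset) {
    return BANK4_BASE + (uint32_t)offset;
}
```

The 8-byte vector table plus 3 + 1 bytes of data ends on a 4-byte boundary, which is why c0_reset starts at 0x2004000C without any padding.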

Followed by constants used in the assembly code.

`core1`


Disassembly of section .core1:

20041000 <c1_vector>:
20041000:	20042000 200410c5 00000000 00000000     . . ... ........
	...

200410c0 <c1_static0>:
200410c0:	00000005                                ....

200410c4 <c1_reset>:
200410c4:	f7be ffab 	bl	2000001e <boot3_clearInterprocessorMailboxRx>
200410c8:	2380      	movs	r3, #128	; 0x80
200410ca:	4a02      	ldr	r2, [pc, #8]	; (200410d4 <c1_reset+0x10>)
200410cc:	049b      	lsls	r3, r3, #18
200410ce:	6013      	str	r3, [r2, #0]
200410d0:	6013      	str	r3, [r2, #0]
200410d2:	e7fc      	b.n	200410ce <c1_reset+0xa>
200410d4:	d000001c 	.word	0xd000001c

Vector table for core 1 c1_vector is placed at the beginning of the SRAM bank 5 at address 0x20041000. It is 256-byte aligned.

Data dedicated to core 1 c1_static0 is placed immediately after c1_vector, at address 0x200410c0, in SRAM bank 5.

Program for core 1 c1_reset is placed immediately after c1_static0, at address 0x200410c4, in SRAM bank 5.

For some reason, the compiler decided to unroll the loop and flip the output twice per iteration. We can see 2 lines of str r3, [r2, #0].

Followed by constants used in the assembly code.

Measure the Output

Download the uf2 file. Then, use an oscilloscope to measure the output waveform.

Output waveform using ROSC with no hazard

By default, ROSC is used as the system clock source, which runs at 5.8MHz on my chip.

Writing SIO consumes 1 cycle and the branch consumes 2 cycles. The loop (two writes and one branch) therefore takes 4 cycles, and we are seeing the output waveform at 5.8MHz / 4 = 1.45MHz with 1/4 duty.
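The arithmetic behind the measured waveform can be written out as follows. The cycle counts are the ones stated above; the function names are hypothetical.

```c
#include <stdint.h>

enum { STR_SIO_CYCLES = 1, BRANCH_CYCLES = 2 };

/* Core 1 loop body: str, str, b  ->  1 + 1 + 2 cycles per iteration. */
static unsigned c1_loop_cycles(void) {
    return 2u * STR_SIO_CYCLES + BRANCH_CYCLES;
}

/* One iteration toggles the pin twice, so one iteration is one full period. */
static uint32_t waveform_hz(uint32_t sys_hz) {
    return sys_hz / c1_loop_cycles();
}
```

With a 5.8MHz system clock this gives 1.45MHz. The two toggles are one cycle apart, so the pin spends 1 of every 4 cycles in one state and 3 in the other, which is the 1/4 duty seen on the scope.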

Structural Hazard in Effect

To verify that accessing the same memory bank from two cores at the same time could stall the process, we will add a dead loop in core 0's program. In this dead loop, we will try to read from the SRAM bank storing core 1's instruction. In this case, the SRAM bank has to serve both instruction read from core 1 and data read from core 0 at the same time.


	ldr	r0, =c1_static0
	mov	r8, r0
1:
	mov	r0, r8
	ldm	r0, {r0, r1, r2, r3, r4, r5, r6, r7}	@ Structural hazard test 
	b	1b

We will uncomment this line in c0_reset: ldm r0, {r0, r1, r2, r3, r4, r5, r6, r7}. In this loop, we will:

Load the address of c1_static0, which is saved in the same SRAM bank as core 1's program instructions.
Load r0 with that address. (1 cycle)
Read 8 words using r0 as a pointer, causing structural hazard on the SRAM bank used by core 1. (9 cycles, 8 reads)
Go to step 2 and repeat. (2 cycles)

That is, in every 12-cycle iteration on core 0, core 1 can be stalled for up to 8 cycles.
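The 12-cycle figure is the sum of the per-instruction costs listed above. A minimal sketch of that sum, with hypothetical function names:

```c
/* ldm of n registers: 1 cycle for the instruction plus 1 cycle per word read. */
static unsigned ldm_cycles(unsigned nregs) {
    return 1u + nregs;
}

/* Core 0 test loop: mov (1) + ldm (1 + n) + b (2). */
static unsigned c0_test_loop_cycles(unsigned nregs) {
    return 1u + ldm_cycles(nregs) + 2u;
}
```

For the 8-register ldm used here, `ldm_cycles(8)` is 9 and the whole loop is 12 cycles, of which 8 are data reads on the bank that core 1 fetches its instructions from.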

Compile, link, and download the uf2 file. Then, use an oscilloscope to measure the output waveform.

Output waveform using ROSC with structural hazard

As the scope shows, the frequency is lower, meaning core 1 stalls.

Since the AHB-Lite crossbar access is given in a round-robin fashion, sometimes core 1 will stall core 0 (shows same waveform length as the last figure), sometimes core 0 will stall core 1 (shows longer waveform length).

Dual-core Application Start from Flash

In the previous example, we successfully downloaded the dual-core program into SRAM banks 4 and 5, with each bank dedicated to one core, preventing any stall from structural hazard.

However, once power is lost, the program disappears. Is there a way that we can store the program in flash, then load the program from flash into SRAM banks at boot time?

On a PC, when we compile a program, the program is not only a bunch of CPU instructions, but also includes information about how to set up the run-time environment and prepare the data, making the program executable. When we execute a program:

The operating system (OS) starts.
The OS loads the program file from disk into memory.
The OS reads the program file's "information" section, sets up the run-time environment, and places its contents (program and data segments) at the desired addresses.
The OS loads the PC with the entry point and starts execution.

We will perform similar tasks on RP2040:

The on-chip bootloader starts the 2nd stage bootloader in striped SRAM. Then, the 2nd stage bootloader configures XIP in SSI. This is similar to starting the OS.
The 2nd stage bootloader starts our 3rd stage bootloader. The 3rd stage bootloader is saved and executed in flash (XIP) memory space.
Our 3rd stage bootloader copies the vector tables, data, and program instructions into SRAM banks. Furthermore, the 3rd stage bootloader performs other setup, such as clock source, memory protection, interrupt vector setup, etc.
Our 3rd stage bootloader (running on core 0) launches core 1 with core 1's main program, and branches to core 0's main program.

Project file for flash boot can be found here.

3rd stage Bootloader

First, we will create an assembly code program called boot3 (bootloader 3). Similar to the previous example, this program will launch core 1 with core 1's program and then start core 0 with core 0's program. Furthermore, this program will switch the system clock to the PLL (126MHz with the dividers used below, close to the 133MHz maximum) and copy the program from the flash memory space into the SRAM banks. Save this file as boot3.s:

`Boot section`


.cpu cortex-m0plus
.thumb
.align 2
.thumb_func

.section .boot3, "awx"

Place the entire file in .boot3 section and make it allocatable and executable so the linker can allocate space for them.

Vector Table


	.word	0x20041000
	.word	boot3 + 1

As required by the 2nd stage bootloader we used (the SDK 2nd stage bootloader), we have to place the vector table at the beginning. The vector table includes:

Initial SP used by this program. We will set it to the top of SRAM bank 4.
Entry point. We will set it to the routine boot3. Because it is a Thumb instruction routine, set the LSB of its address.

`boot3_copyText`


boot3_copyText:
	ldrh	r3, [r1, #0]		@ r1 = src, use halfword because instructions are 16-bit wide, code block may not be 4-byte aligned
	strh	r3, [r0, #0]		@ r0 = dest
	add	r1, #2
	add	r0, #2
	cmp	r1, r2			@ r2 = src_top (not included)
	blo	boot3_copyText		@ Continue while src is lower than src_top (unsigned)
	bx	lr

Create a subroutine boot3_copyText. This subroutine is used to copy the program text from flash into SRAM.

This subroutine accepts 3 parameters and clobbers one extra register:

r0 - Beginning address of the destination.
r1 - Beginning address of the source.
r2 - End (not included) address of the source.
r3 - Will be clobbered.

This operation can be illustrated by:


for (
	uint16_t* src = program_in_flash_begin, * dest = program_in_sram_begin;
	src < program_in_flash_end;
	/* src and dest advance in the loop body */
) {
	*(dest++) = *(src++);
}

Because Thumb instructions are 16 bits wide, the length of the content (programs, vector tables and data) is 2-byte aligned. Furthermore, the assembler and compiler will insert padding if the content is not 2-byte aligned. Therefore, we will use the half-word read / write instructions ldrh / strh.
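A host-testable version of the same halfword copy is sketched below. `copy_halfwords` is a hypothetical name; unlike the assembly routine, which always copies at least one halfword, this version checks the bound before the first copy.

```c
#include <stdint.h>

/* Copy halfwords from [src, src_end) to dest. src_end is not included. */
static void copy_halfwords(uint16_t *dest, const uint16_t *src,
                           const uint16_t *src_end) {
    while (src < src_end) {
        *dest++ = *src++;
    }
}
```

Calling it with a 3-halfword source writes exactly 3 halfwords and leaves whatever follows the destination range untouched.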

This subroutine is used in the bootloader only. Hence, no need to make it global.

There is no harm in placing subroutines or data before the main program of our 3rd stage bootloader. The 2nd stage bootloader reads the entry point from the vector table, so the entry point can be anywhere.

`boot3_clearInterprocessorMailboxRx`


.global boot3_clearInterprocessorMailboxRx
boot3_clearInterprocessorMailboxRx:
	push	{r0, r1}
	ldr	r1, =0xd0000050		@ SIO_FIFO_ST
1:	ldr	r0, [r1, #0x00]		@ SIO_FIFO_ST
	lsr	r0, #1			@ VLD
	bcc	2f
	ldr	r0, [r1, #0x08]		@ SIO_FIFO_RD
	b	1b
2:	mov	r0, #0b1100
	str	r0, [r1, #0x00]		@ SIO_FIFO_ST
	pop	{r0, r1}
	bx	lr

Create a subroutine that helps empty the inter-processor mailbox, the boot3_clearInterprocessorMailboxRx. Discussed in previous example.

This subroutine will be used in the main programs (in other files). Hence, make it global for linking.

`boot3`


.global boot3
boot3:
	@ Start XOSC
	ldr	r7, =0x40024000			@ XOSC_BASE
	ldr	r0, =0x00FABAA0			@ Enable 1-15MHz
	str	r0, [r7, #0x00]			@ XOSC_CTRL
1:	ldr	r0, [r7, #0x04]			@ XOSC_STATUS
	lsr	r0, #32				@ Stable
	bcc	1b

	@ Start PLL
	ldr	r7, =0x4000f000			@ RESETS_RESET  + 0x3000
	ldr	r0, =(1<<12)			@ PLL_SYS
	str	r0, [r7, #0x00]

	ldr	r7, =0x40028000			@ PLL_SYS_BASE
	ldr	r6, =0x40028000 + 0x3000	@ PLL_SYS_CLEAR
	mov	r0, #(1<<0)			@ RefDiv = 1
	str	r0, [r7, #0x00]			@ PLL_SYS_CS
	mov	r0, #63				@ 12MHz * 63 = 756MHz
	str	r0, [r7, #0x08]			@ PLL_SYS_FBDIV_INT
	mov	r0, #((1<<5) | (1<<0))		@ VCO and main powerdown
	str	r0, [r6, #0x04]			@ PLL_SYS_CLEAR_PWR
1:	ldr	r0, [r7, #0x00]			@ PLL_SYS_CS
	lsr	r0, r0, #32			@ Lock
	bcc	1b
	ldr	r0, =((6 << 16) | (1 << 12))	@ 756MHz / (6*1) = 126MHz
	str	r0, [r7, #0x0C]			@ PLL_SYS_PRIM
	mov	r0, #(1<<3)			@ PostDiv powerdown
	str	r0, [r6, #0x04]			@ PLL_SYS_CLEAR_PWR

	@ Switch to PLL
	ldr	r7, =0x40008000			@ CLOCKS_BASE
	mov	r0, #((0<<5) | (0<<0))		@ Set Aux src to CLKSRC_PLL_SYS (need wait) and switch Src to CLK_REF (on-the-fly), for safe
	str	r0, [r7, #0x3C]			@ CLOCKS_CLK_SYS_CTRL
	nop
	nop
	mov	r0, #((0<<5) | (1<<0))		@ Keep Aux src but switch Src to CLKSRC_CLK_SYS_AUX (on-the-fly)
	str	r0, [r7, #0x3C]

	@ Copy Code for core 0
	ldr	r0, =_core0_dest
	ldr	r1, =_core0_start
	ldr	r2, =_core0_end
	bl	boot3_copyText

	@ Copy code for core 1
	ldr	r0, =_core1_dest
	ldr	r1, =_core1_start
	ldr	r2, =_core1_end
	bl	boot3_copyText

	@ Start core 1 main program
	ldr	r7, =c1_vector
	ldr	r6, =0xd0000050		@ SIO_FIFO_ST
	mov	r0, #1
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = 1
	str	r7, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector
	ldr	r0, [r7, #0x00]
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector[0] = SP
	ldr	r0, [r7, #0x04]
	str	r0, [r6, #0x04]		@ SIO_FIFO_WR = c1_vector[1] = c1_reset
	@ Delay core 1 start until core 0 ready

	@ Enter core 0 main program
	ldr	r7, =c0_vector
	ldr	r0, [r7, #0x00]		@ c0_vector[0] = SP
	mov	sp, r0
	ldr	r0, [r7, #0x04]		@ c0_vector[1] = c0_reset
	sev				@ Core 1 start
	bx	r0

Here is what we will do in our 3rd stage bootloader. We discussed the details in the previous article, so I'm not gonna spend too much time explaining this. In summary, it performs the following:

Start XOSC.
Start PLL, which requires XOSC as reference clock.
Switch the system clock source to PLL, allowing the system to run at 126MHz, close to the rated maximum frequency (133MHz).
Copy the programs (including instructions, data and vector tables) from flash to SRAM. The address will be resolved at linking time.
Launch core 1 and start core 0.
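The PLL numbers in the code comments can be checked with a few lines of C. This is only a sketch of the arithmetic, assuming a 12MHz crystal as in the comments; the function names pll_output_hz and pll_prim_value are made up for illustration and are not part of the bootloader.

```c
#include <assert.h>
#include <stdint.h>

/* PLL output: (reference / RefDiv) * FBDIV / (PostDiv1 * PostDiv2). */
uint32_t pll_output_hz(uint32_t ref_hz, uint32_t refdiv, uint32_t fbdiv,
                       uint32_t postdiv1, uint32_t postdiv2)
{
    assert(refdiv != 0 && postdiv1 != 0 && postdiv2 != 0);
    uint32_t vco_hz = (ref_hz / refdiv) * fbdiv;   /* 12MHz * 63 = 756MHz */
    return vco_hz / (postdiv1 * postdiv2);         /* 756MHz / (6*1) = 126MHz */
}

/* Value for PLL_SYS_PRIM: PostDiv1 at bit 16, PostDiv2 at bit 12,
 * the same layout as ((6 << 16) | (1 << 12)) in the assembly. */
uint32_t pll_prim_value(uint32_t postdiv1, uint32_t postdiv2)
{
    return (postdiv1 << 16) | (postdiv2 << 12);
}
```

Calling pll_output_hz(12000000, 1, 63, 6, 1) gives 126000000, and pll_prim_value(6, 1) gives 0x00061000, the constant that shows up in the literal pool of the disassembly.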

Program on Core 0

Next, we will create an assembly program that runs on core 0. In this example, we will let core 0 flip the GPIO output. Save this file as main.s:

Data Dedicated to Core 0


.section .c0_data, "aw"

c0_static0:	.space	3
c0_static1:	.byte	0x15

Create a new section .c0_data for core 0's data:

c0_static0 - Reserve 3 bytes for this data. No initial value is specified (the assembler fills it with 0).
c0_static1 - 1-byte long. Initial value is 0x15.

I am not gonna use these variables. I just want to show that data dedicated to core 0 is stored in the same SRAM bank as the program for core 0.

Vector Table for Core 0


.section .c0_vector, "aw"
.global c0_vector
c0_vector:
	.word	0x20041000
	.word	c0_reset + 1

Create a new section .c0_vector for core 0's vector table:

The first vector is the initial SP. Set it to the top of SRAM bank 4.
The second vector is core 0's entry point. Set it to c0_reset. Furthermore, Thumb requires the LSB of instruction address to be set; hence, add 1.

Although the vector table should contain 48 32-bit vectors, we are not using any interrupts in this example, so we can omit the rest.

Program for Core 0


.section .c0_text, "awx"

.global c0_reset
c0_reset:
	ldr	r0, =(boot3_clearInterprocessorMailboxRx+1)
	blx	r0

	ldr	r1, =0xd000001c		@ GPIO XOR
	ldr	r0, =(1<<25)		@ GPIO25

	ldr	r2, =0x10000000		@ Flash pointer
	ldr	r3, =0x00000004		@ Flash step (must be 4x for word read)
	ldr	r4, =0x100000FF		@ Flash mask

1:	str	r0, [r1, #0]
	ldr	r5, [r2, #0]
	add	r2, r2, r3
	and	r2, r2, r4
	b	1b

Create a new section .c0_text for core 0's program:

There is only one routine called c0_reset. We will:

Call boot3_clearInterprocessorMailboxRx defined in boot3 to empty the mailbox.
Load the registers with addresses and values to write.
In a dead loop, flip the GPIO output and read dummy data from flash (cache miss test).

Note that in the example, when calling the subroutine boot3_clearInterprocessorMailboxRx, we cannot use the bl boot3_clearInterprocessorMailboxRx instruction, because the bl label instruction only supports a PC offset of +/- 16MiB. The subroutine is stored in flash space at 0x10000000 while the program runs in SRAM space at 0x20000000, so the offset is 0x10000000, or 256MiB, which is larger than the allowed offset. Therefore, we have to load the subroutine's address into a register (a function pointer), add 1 because it is the address of a Thumb routine, then use the blx rx instruction to call that subroutine.
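To make the range limit concrete, here is a small C sketch that checks whether a bl at one address can reach a target, and forms the address to use with blx rx. The helper names bl_can_reach and thumb_call_address are made up for this illustration; the range used is the Thumb bl range of -16MiB to +16MiB (minus 2 bytes), measured from the instruction address plus 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a Thumb `bl` located at `caller` can branch directly to `target`. */
bool bl_can_reach(uint32_t caller, uint32_t target)
{
    /* In Thumb state the PC reads as the instruction address + 4. */
    int64_t offset = (int64_t)target - ((int64_t)caller + 4);
    return offset >= -(int64_t)0x01000000 && offset <= (int64_t)0x00FFFFFE;
}

/* Address to load into a register before `blx rx`: bit 0 set selects Thumb. */
uint32_t thumb_call_address(uint32_t symbol)
{
    assert((symbol & 1u) == 0u);   /* Thumb instructions are half-word aligned */
    return symbol | 1u;
}
```

With the addresses from this example, a call from SRAM (0x2004000c) to the subroutine in flash (0x10000116) is out of range, while a call inside flash is fine. thumb_call_address(0x10000116) gives 0x10000117.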

We will discuss the details of the cache miss test in a later chapter in this article.

Program on Core 1

Then, create a C program that runs on core 1. In this example, we will let core 1 configure the GPIO output. Save this file as main.c:

Program for Core 1


#include <stdint.h>

__attribute__((long_call)) extern void boot3_clearInterprocessorMailboxRx();

void c1_reset() __attribute__((section (".c1_text"))) __attribute__((naked));
void c1_reset() {
	(boot3_clearInterprocessorMailboxRx + 1)();
	
	*(uint32_t volatile * const)(0x4000f000) = (1<<5);	// RESETS_RESET + 0x3000 <-- IO_BANK0
	*(uint32_t volatile * const)(0x400140cc) = (5);		// IO_BANK0_GPIO25_CTRL <-- SIO
	*(uint32_t volatile * const)(0xd0000020) = (1<<25);	// SIO_GPIO_OE <-- GPIO25

	for(;;);
}

Declare the external function boot3_clearInterprocessorMailboxRx. Furthermore, add the long_call attribute to it. This allows the compiler to use the blx rx instruction to call a subroutine that is more than 16 MiB away from the current address. However, this attribute doesn't tell the compiler whether this function uses Thumb or ARM, so we have to manually add 1 (for Thumb) to this symbol when we call it. The final address is resolved at link time.

Define main function for core 1 c1_reset, using attributes:

section (".c1_text") - Place this function in section .c1_text instead of the default program (instruction text) section .text.
naked - This is the main function and it never returns; hence, there is no need to preserve its calling stack. By default, the link register lr and other general purpose registers used in this function are pushed onto the stack to preserve their values, because the caller expects no change in these registers. Making the main function naked can save some stack space.

In this function, we will:

Call boot3_clearInterprocessorMailboxRx defined in boot3 to empty the mailbox.
Configure the GPIO output.

Vector Table for Core 1


uint32_t c1_vector[48] __attribute__ ((section(".c1_vector"))) = {
	0x20042000,
	(uint32_t)c1_reset
};

Define the vector table, its section name is .c1_vector. In this vector table:

The first vector is the initial SP. Set it to the top of SRAM bank 5.
The second vector is core 1's entry point. Set it to c1_reset. In C, the compiler automatically sets the address LSB.

We are not using any interrupts in this example, so we leave the remaining vectors unset (C zero-initializes them).

Data Dedicated to Core 1


__attribute__((section (".c1_data"))) volatile uint32_t c1_static0 = 5;

Create a new section .c1_data for core 1's data:

Define a variable c1_static0, that:

This variable is dedicated to core 1. So, let's put it in section .c1_data.
Its initial value is 5.
Make it volatile. This prevents the compiler from optimizing it away or keeping it only in a register.

I am not gonna use this variable in core 1. I just want to show that data dedicated to core 1 is stored in the same SRAM bank as the program for core 1.

Link and Compile

Now, create a linker script main.ld:


MEMORY {
	FLASH(rwx) : ORIGIN = 0x10000000, LENGTH = 2048k
	SRAM(rwx) : ORIGIN = 0x20000000, LENGTH = 256k
	SRAM_4(rwx) : ORIGIN = 0x20040000, LENGTH = 4k
	SRAM_5(rwx) : ORIGIN = 0x20041000, LENGTH = 4k
	SRAM_0(rwx) : ORIGIN = 0x21000000, LENGTH = 64k
	SRAM_1(rwx) : ORIGIN = 0x21010000, LENGTH = 64k
	SRAM_2(rwx) : ORIGIN = 0x21020000, LENGTH = 64k
	SRAM_3(rwx) : ORIGIN = 0x21030000, LENGTH = 64k
}

ENTRY(_boot_start)

SECTIONS {
	.boot : {
		*(.boot2)
		*(.boot3)
	} > FLASH
	_boot_start = ORIGIN(FLASH);
	_boot_end = _boot_start + SIZEOF(.boot);

	.core0 : {
		. = ALIGN (256);
		*(.c0_vector)
		*(.c0_data)
		*(.c0_text)
	} > SRAM_4 AT > FLASH
	_core0_dest = ORIGIN(SRAM_4);
	_core0_start = _boot_end;
	_core0_end = _core0_start + SIZEOF(.core0);

	.core1 : {
		. = ALIGN (256);
		*(.c1_vector)
		*(.c1_data)
		*(.c1_text)
	} > SRAM_5 AT > FLASH
	_core1_dest = ORIGIN(SRAM_5);
	_core1_start = _core0_end;
	_core1_end = _core1_start + SIZEOF(.core1);

	.unspecified : {
		*(.text)
		*(.data)
		*(.bss)
	}
	ASSERT(!(SIZEOF(.unspecified)), "Unspecified text, data, and/or bss section")
}

The linker script used in this example is a little bit complex.

To execute the program from a memory different from the one that stores it, we provide both a virtual memory address (VMA) and a load memory address (LMA) for the sections using the following syntax:


.section_name : {
	*(.subsection_name)
	*(.subsection_name)
	*(.subsection_name)
	*(symbol_name)
	*(symbol_name)
} > virtual_memory_addr AT > load_memory_addr

That is:

Use SRAM bank 4 and 5 for VMA. This address is used when resolving a symbol (variables and routines).
Use flash for LMA. This address is used to store the programs and data.

Furthermore, we need to define the following symbols to indicate the addresses of programs and data. They are used to copy the programs, vector tables and data from flash into SRAM banks in our 3rd stage bootloader:

_boot_start - Start of the flash memory.
_boot_end - End of the .boot section in flash.
_core0_start - Same as _boot_end. Section .core0 (program, vector table and data for core 0) is placed immediately after the .boot section.
_core0_end - End of the .core0 section in flash.
_core1_start - Same as _core0_end. Section .core1 (program, vector table and data for core 1) is placed immediately after the .core0 section.
_core1_end - End of the .core1 section in flash.
_core0_dest - Virtual address for .core0, which is SRAM bank 4 in our example.
_core1_dest - Virtual address for .core1, which is SRAM bank 5 in our example.
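To show how these symbols are meant to be used, here is a C sketch of the kind of copy the 3rd stage bootloader performs: take the half-words between a start and an end address (the LMA range in flash) and write them to a destination (the VMA in SRAM). It is a simplified illustration, not a transcription of boot3_copyText, and the name copy_section is made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy the half-words in [src, src_end) to dest, the way a bootloader
 * moves a section from its load address to its run address. */
void copy_section(uint16_t *dest, const uint16_t *src, const uint16_t *src_end)
{
    assert(src <= src_end);
    while (src < src_end) {
        *dest++ = *src++;
    }
}
```

On the target, the three arguments would come from the linker symbols, for example _core0_dest, _core0_start and _core0_end. How those symbols are declared on the C side (typically as extern arrays) is an assumption of this sketch, since our bootloader is written in assembly.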

In our design, the programs, vector tables and data must be assigned to the .boot, .core0 or .core1 section. We defined a special section named .unspecified to catch any data, vector table or program that is not assigned to one of the mentioned sections. If there is any content in the .unspecified section, the linking fails. This tells us we forgot something.

First, provide the address of each memory banks.

Place the SDK 2nd stage bootloader .boot2 and our 3rd stage bootloader .boot3 (preserve the order) in the flash.

Entry point is not required for flash boot.

Place core 0's vector table c0_vector, data c0_data, program c0_text in the flash but use SRAM bank 4 for virtual address. Vector table first.

Place core 1's vector table c1_vector, data c1_data, program c1_text in the flash but use SRAM bank 5 for virtual address.

It is redundant to place . = ALIGN (256) before vector tables, because the beginning of each bank is 4k aligned. Furthermore, the alignment only applies to the VMA.

Next, compile this project:


arm-none-eabi-as --warn --fatal-warnings -g boot3.s -o boot3.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs boot3.o > boot3.s.list

arm-none-eabi-as --warn --fatal-warnings -g main.s -o main.s.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs main.s.o > main.s.list

arm-none-eabi-gcc -mcpu=cortex-m0plus -c -O3 main.c -o main.c.o
arm-none-eabi-objdump --disassembler-options=force-thumb -Dxs main.c.o > main.c.list

arm-none-eabi-ld -nostdlib -nostartfiles -T main.ld boot3.o main.s.o main.c.o boot2.o -o main.elf
arm-none-eabi-objdump --disassembler-options=force-thumb -dxs main.elf > main.list
pico-elf2uf2 main.elf main.uf2

Verify Output - Virtual address

Now, we can verify the output using the disassembled code main.list. Note that the disassembled code shows virtual addresses:

`boot`


Disassembly of section .boot:

10000000 <_boot_start>:
10000000:	4b32b500 	.word	0x4b32b500
10000004:	60582021 	.word	0x60582021
10000008:	21026898 	.word	0x21026898
1000000c:	60984388 	.word	0x60984388
10000010:	611860d8 	.word	0x611860d8
10000014:	4b2e6158 	.word	0x4b2e6158
10000018:	60992100 	.word	0x60992100
1000001c:	61592102 	.word	0x61592102
10000020:	22f02101 	.word	0x22f02101
10000024:	492b5099 	.word	0x492b5099
10000028:	21016019 	.word	0x21016019
1000002c:	20356099 	.word	0x20356099
10000030:	f844f000 	.word	0xf844f000
10000034:	42902202 	.word	0x42902202
10000038:	2106d014 	.word	0x2106d014
1000003c:	f0006619 	.word	0xf0006619
10000040:	6e19f834 	.word	0x6e19f834
10000044:	66192101 	.word	0x66192101
10000048:	66182000 	.word	0x66182000
1000004c:	f000661a 	.word	0xf000661a
10000050:	6e19f82c 	.word	0x6e19f82c
10000054:	6e196e19 	.word	0x6e196e19
10000058:	f0002005 	.word	0xf0002005
1000005c:	2101f82f 	.word	0x2101f82f
10000060:	d1f94208 	.word	0xd1f94208
10000064:	60992100 	.word	0x60992100
10000068:	6019491b 	.word	0x6019491b
1000006c:	60592100 	.word	0x60592100
10000070:	481b491a 	.word	0x481b491a
10000074:	21016001 	.word	0x21016001
10000078:	21eb6099 	.word	0x21eb6099
1000007c:	21a06619 	.word	0x21a06619
10000080:	f0006619 	.word	0xf0006619
10000084:	2100f812 	.word	0x2100f812
10000088:	49166099 	.word	0x49166099
1000008c:	60014814 	.word	0x60014814
10000090:	60992101 	.word	0x60992101
10000094:	2800bc01 	.word	0x2800bc01
10000098:	4700d000 	.word	0x4700d000
1000009c:	49134812 	.word	0x49134812
100000a0:	c8036008 	.word	0xc8036008
100000a4:	8808f380 	.word	0x8808f380
100000a8:	b5034708 	.word	0xb5034708
100000ac:	20046a99 	.word	0x20046a99
100000b0:	d0fb4201 	.word	0xd0fb4201
100000b4:	42012001 	.word	0x42012001
100000b8:	bd03d1f8 	.word	0xbd03d1f8
100000bc:	6618b502 	.word	0x6618b502
100000c0:	f7ff6618 	.word	0xf7ff6618
100000c4:	6e18fff2 	.word	0x6e18fff2
100000c8:	bd026e18 	.word	0xbd026e18
100000cc:	40020000 	.word	0x40020000
100000d0:	18000000 	.word	0x18000000
100000d4:	00070000 	.word	0x00070000
100000d8:	005f0300 	.word	0x005f0300
100000dc:	00002221 	.word	0x00002221
100000e0:	180000f4 	.word	0x180000f4
100000e4:	a0002022 	.word	0xa0002022
100000e8:	10000100 	.word	0x10000100
100000ec:	e000ed08 	.word	0xe000ed08
	...
100000fc:	7a4eb274 	.word	0x7a4eb274
10000100:	20041000 	.word	0x20041000
10000104:	1000012d 	.word	0x1000012d

10000108 <boot3_copyText>:
10000108:	880b      	ldrh	r3, [r1, #0]
1000010a:	8003      	strh	r3, [r0, #0]
1000010c:	3102      	adds	r1, #2
1000010e:	3002      	adds	r0, #2
10000110:	4291      	cmp	r1, r2
10000112:	d9f9      	bls.n	10000108 <boot3_copyText>
10000114:	4770      	bx	lr

10000116 <boot3_clearInterprocessorMailboxRx>:
10000116:	b403      	push	{r0, r1}
10000118:	4920      	ldr	r1, [pc, #128]	; (1000019c <boot3+0x70>)
1000011a:	6808      	ldr	r0, [r1, #0]
1000011c:	0840      	lsrs	r0, r0, #1
1000011e:	d301      	bcc.n	10000124 <boot3_clearInterprocessorMailboxRx+0xe>
10000120:	6888      	ldr	r0, [r1, #8]
10000122:	e7fa      	b.n	1000011a <boot3_clearInterprocessorMailboxRx+0x4>
10000124:	200c      	movs	r0, #12
10000126:	6008      	str	r0, [r1, #0]
10000128:	bc03      	pop	{r0, r1}
1000012a:	4770      	bx	lr

1000012c <boot3>:
1000012c:	4f1c      	ldr	r7, [pc, #112]	; (100001a0 <boot3+0x74>)
1000012e:	481d      	ldr	r0, [pc, #116]	; (100001a4 <boot3+0x78>)
10000130:	6038      	str	r0, [r7, #0]
10000132:	6878      	ldr	r0, [r7, #4]
10000134:	0800      	lsrs	r0, r0, #32
10000136:	d3fc      	bcc.n	10000132 <boot3+0x6>
10000138:	4f1b      	ldr	r7, [pc, #108]	; (100001a8 <boot3+0x7c>)
1000013a:	481c      	ldr	r0, [pc, #112]	; (100001ac <boot3+0x80>)
1000013c:	6038      	str	r0, [r7, #0]
1000013e:	4f1c      	ldr	r7, [pc, #112]	; (100001b0 <boot3+0x84>)
10000140:	4e1c      	ldr	r6, [pc, #112]	; (100001b4 <boot3+0x88>)
10000142:	2001      	movs	r0, #1
10000144:	6038      	str	r0, [r7, #0]
10000146:	203f      	movs	r0, #63	; 0x3f
10000148:	60b8      	str	r0, [r7, #8]
1000014a:	2021      	movs	r0, #33	; 0x21
1000014c:	6070      	str	r0, [r6, #4]
1000014e:	6838      	ldr	r0, [r7, #0]
10000150:	0800      	lsrs	r0, r0, #32
10000152:	d3fc      	bcc.n	1000014e <boot3+0x22>
10000154:	4818      	ldr	r0, [pc, #96]	; (100001b8 <boot3+0x8c>)
10000156:	60f8      	str	r0, [r7, #12]
10000158:	2008      	movs	r0, #8
1000015a:	6070      	str	r0, [r6, #4]
1000015c:	4f17      	ldr	r7, [pc, #92]	; (100001bc <boot3+0x90>)
1000015e:	2000      	movs	r0, #0
10000160:	63f8      	str	r0, [r7, #60]	; 0x3c
10000162:	46c0      	nop			; (mov r8, r8)
10000164:	46c0      	nop			; (mov r8, r8)
10000166:	2001      	movs	r0, #1
10000168:	63f8      	str	r0, [r7, #60]	; 0x3c
1000016a:	4815      	ldr	r0, [pc, #84]	; (100001c0 <boot3+0x94>)
1000016c:	4915      	ldr	r1, [pc, #84]	; (100001c4 <boot3+0x98>)
1000016e:	4a16      	ldr	r2, [pc, #88]	; (100001c8 <boot3+0x9c>)
10000170:	f7ff ffca 	bl	10000108 <boot3_copyText>
10000174:	4815      	ldr	r0, [pc, #84]	; (100001cc <boot3+0xa0>)
10000176:	4916      	ldr	r1, [pc, #88]	; (100001d0 <boot3+0xa4>)
10000178:	4a16      	ldr	r2, [pc, #88]	; (100001d4 <boot3+0xa8>)
1000017a:	f7ff ffc5 	bl	10000108 <boot3_copyText>
1000017e:	4f16      	ldr	r7, [pc, #88]	; (100001d8 <boot3+0xac>)
10000180:	4e06      	ldr	r6, [pc, #24]	; (1000019c <boot3+0x70>)
10000182:	2001      	movs	r0, #1
10000184:	6070      	str	r0, [r6, #4]
10000186:	6077      	str	r7, [r6, #4]
10000188:	6838      	ldr	r0, [r7, #0]
1000018a:	6070      	str	r0, [r6, #4]
1000018c:	6878      	ldr	r0, [r7, #4]
1000018e:	6070      	str	r0, [r6, #4]
10000190:	4f12      	ldr	r7, [pc, #72]	; (100001dc <boot3+0xb0>)
10000192:	6838      	ldr	r0, [r7, #0]
10000194:	4685      	mov	sp, r0
10000196:	6878      	ldr	r0, [r7, #4]
10000198:	bf40      	sev
1000019a:	4700      	bx	r0

1000019c:	d0000050 	.word	0xd0000050
100001a0:	40024000 	.word	0x40024000
100001a4:	00fabaa0 	.word	0x00fabaa0
100001a8:	4000f000 	.word	0x4000f000
100001ac:	00001000 	.word	0x00001000
100001b0:	40028000 	.word	0x40028000
100001b4:	4002b000 	.word	0x4002b000
100001b8:	00061000 	.word	0x00061000
100001bc:	40008000 	.word	0x40008000
100001c0:	20040000 	.word	0x20040000
100001c4:	100001e0 	.word	0x100001e0
100001c8:	1000021c 	.word	0x1000021c
100001cc:	20041000 	.word	0x20041000
100001d0:	1000021c 	.word	0x1000021c
100001d4:	1000030c 	.word	0x1000030c
100001d8:	20041000 	.word	0x20041000
100001dc:	20040000 	.word	0x20040000

The SDK 2nd stage bootloader _boot_start is placed at the beginning of the flash at address 0x10000000. This routine will be executed at boot.

Vector table for our 3rd stage is placed immediately after the SDK 2nd stage bootloader, at address 0x10000100, in flash. Followed by boot3_copyText, boot3_clearInterprocessorMailboxRx and boot3, all of them are in flash.

At the end shows the constants used in the assembly code, including:

.word 0x20040000 @ 100001c0 - Destination for copying core 0's content.
.word 0x100001e0 @ 100001c4 - Beginning of the source for core 0's content. Note the last content of boot3 is at address 0x100001dc. Because the content for core 0 is placed immediately after boot3, its address will be 0x100001e0.
.word 0x1000021c @ 100001c8 - End of the source for core 0's content.
...

`core0`


Disassembly of section .core0:

20040000 <c0_vector>:
20040000:	20041000 	.word	0x20041000
20040004:	2004000d 	.word	0x2004000d

20040008 <c0_static0>:
20040008:	0000      	.short	0x0000
	...

2004000b <c0_static1>:
2004000b:	15          	.byte	0x15

2004000c <c0_reset>:
2004000c:	4805      	ldr	r0, [pc, #20]	; (20040024 <c0_reset+0x18>)
2004000e:	4780      	blx	r0
20040010:	4905      	ldr	r1, [pc, #20]	; (20040028 <c0_reset+0x1c>)
20040012:	4806      	ldr	r0, [pc, #24]	; (2004002c <c0_reset+0x20>)
20040014:	4a06      	ldr	r2, [pc, #24]	; (20040030 <c0_reset+0x24>)
20040016:	4b07      	ldr	r3, [pc, #28]	; (20040034 <c0_reset+0x28>)
20040018:	4c07      	ldr	r4, [pc, #28]	; (20040038 <c0_reset+0x2c>)
2004001a:	6008      	str	r0, [r1, #0]
2004001c:	6815      	ldr	r5, [r2, #0]
2004001e:	18d2      	adds	r2, r2, r3
20040020:	4022      	ands	r2, r4
20040022:	e7fa      	b.n	2004001a <c0_reset+0xe>

20040024:	10000117 	.word	0x10000117
20040028:	d000001c 	.word	0xd000001c
2004002c:	02000000 	.word	0x02000000
20040030:	10000000 	.word	0x10000000
20040034:	00000004 	.word	0x00000004
20040038:	100000ff 	.word	0x100000ff

Vector table for core 0 c0_vector is placed at the beginning of the SRAM bank 4 at address 0x20040000. It is 256-byte aligned.

Data dedicated to core 0 c0_static0 and c0_static1 is placed immediately after c0_vector, at address 0x20040008, in SRAM bank 4.

c0_static0 and c0_static1 are byte data. No alignment required.

Program for core 0 c0_reset is placed immediately after c0_static1, at address 0x2004000c, in SRAM bank 4.

Followed by constants used in the assembly code.

`core1`


Disassembly of section .core1:

20041000 <c1_vector>:
20041000:	20042000 200410c5 00000000 00000000     . . ... ........
	...

200410c0 <c1_static0>:
200410c0:	00000005                                ....

200410c4 <c1_reset>:
200410c4:	4b06      	ldr	r3, [pc, #24]	; (200410e0 <c1_reset+0x1c>)
200410c6:	4798      	blx	r3
200410c8:	2220      	movs	r2, #32
200410ca:	4b06      	ldr	r3, [pc, #24]	; (200410e4 <c1_reset+0x20>)
200410cc:	601a      	str	r2, [r3, #0]
200410ce:	4b06      	ldr	r3, [pc, #24]	; (200410e8 <c1_reset+0x24>)
200410d0:	3a1b      	subs	r2, #27
200410d2:	601a      	str	r2, [r3, #0]
200410d4:	2280      	movs	r2, #128	; 0x80
200410d6:	4b05      	ldr	r3, [pc, #20]	; (200410ec <c1_reset+0x28>)
200410d8:	0492      	lsls	r2, r2, #18
200410da:	601a      	str	r2, [r3, #0]
200410dc:	e7fe      	b.n	200410dc <c1_reset+0x18>
200410de:	46c0      	nop			; (mov r8, r8)

200410e0:	10000117 	.word	0x10000117
200410e4:	4000f000 	.word	0x4000f000
200410e8:	400140cc 	.word	0x400140cc
200410ec:	d0000020 	.word	0xd0000020

Vector table for core 1 c1_vector is placed at the beginning of the SRAM bank 5 at address 0x20041000. It is 256-byte aligned.

Data dedicated to core 1 c1_static0 is placed immediately after c1_vector, at address 0x200410c0, in SRAM bank 5.

Program for core 1 c1_reset is placed immediately after c1_static0, at address 0x200410c4, in SRAM bank 5.

Followed by constants (the literal pool) emitted by the compiler for this function.

Verify Output - Load memory address

We can verify the LMA by looking at the header of the disassembled code:


Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .boot         000001e0  10000000  10000000  00000094  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  1 .core0        0000003c  20040000  100001e0  00000274  2**2
                  CONTENTS, ALLOC, LOAD, CODE
  2 .core1        000000f0  20041000  1000021c  000002b0  2**2
                  CONTENTS, ALLOC, LOAD, CODE

where VMA means virtual memory address and LMA means load memory address.

As we can see, the VMAs of .core0 and .core1 are in the SRAM memory space, but the LMAs are in the flash memory space.
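The LMA column can also be cross-checked by hand: because the sections are packed one after another in flash, each LMA is the previous LMA plus the previous size. A tiny sketch of that arithmetic (next_lma is a name made up for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* LMA of the section placed immediately after one at `lma` of `size` bytes. */
uint32_t next_lma(uint32_t lma, uint32_t size)
{
    assert(lma + size >= lma);   /* no wrap-around of the 32-bit address */
    return lma + size;
}
```

Starting from .boot at 0x10000000 with size 0x1e0 gives 0x100001e0 for .core0. Adding 0x3c gives 0x1000021c for .core1, and adding 0xf0 gives 0x1000030c. These are the same values as the _core0_start, _core0_end and _core1_end constants in the boot3 literal pool.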

We can also verify the LMA by looking at address field of each block in the uf2 file (highlighted in the following screenshot):

Measure the Output

Download the uf2 file. Then, use an oscilloscope to measure the output waveform.

Output waveform using PLL with no cache miss

The scope shows a square wave at 9MHz. All green! Our program is running.
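As a rough cross-check of the 9MHz reading, we can estimate the output frequency from the length of the loop. The cycle counts here are assumptions, not measurements: 1 cycle for the str to SIO, 2 for the ldr on a cache hit, 1 each for add and and, and 2 for the taken branch, which is 7 cycles per iteration. The function name square_wave_hz is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Frequency of a square wave made by toggling a pin once per loop iteration. */
uint32_t square_wave_hz(uint32_t clk_sys_hz, uint32_t cycles_per_loop)
{
    assert(cycles_per_loop != 0);
    uint32_t toggles_per_second = clk_sys_hz / cycles_per_loop;
    return toggles_per_second / 2u;   /* two toggles make one full period */
}
```

With a 126MHz system clock and the assumed 7 cycles per iteration, square_wave_hz(126000000, 7) gives 9000000, which agrees with the scope.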

Cache Miss in Effect

To verify that a cache miss stalls the core, we adjust the flash read parameters in core 0's program. In a dead loop, we read a number of addresses from the flash memory space. On a read hit, we should see the output waveform at the maximum frequency; otherwise, we will see the output frequency lowered.


	ldr	r2, =0x10000000		@ Flash pointer
	ldr	r3, =0x00000004		@ Flash step (must be 4x for word read)
	ldr	r4, =0x100000FF		@ Flash mask

1:	str	r0, [r1, #0]
	ldr	r5, [r2, #0]
	add	r2, r2, r3
	and	r2, r2, r4
	b	1b

We have the following parameters:

r2 - Which flash cache strategy to use. 0x10000000 for cacheable and allocatable, 0x13000000 for non-cacheable and non-allocatable (always miss).
r3 - Step between each flash read.
r4 - Top of flash read address using mask. Wrap back once reached. The upper bits of the mask must match the base in r2, otherwise the and instruction clears them.
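The address sequence produced by these three parameters is easy to reproduce on a PC. The sketch below mirrors the add and and instructions of the loop and counts how many reads happen before the pointer returns to its starting value. The function names next_read_address and count_locations are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Next read address, mirroring `add r2, r2, r3` then `and r2, r2, r4`. */
uint32_t next_read_address(uint32_t addr, uint32_t step, uint32_t mask)
{
    return (addr + step) & mask;
}

/* Number of reads before the pointer comes back to `base`. */
uint32_t count_locations(uint32_t base, uint32_t step, uint32_t mask)
{
    assert(step != 0);
    uint32_t addr = base;
    uint32_t count = 0;
    do {
        addr = next_read_address(addr, step, mask);
        count++;
    } while (addr != base && count < 0xFFFFFFFFu);  /* guard against no return */
    return count;
}
```

For the four parameter sets used in the following tests, count_locations returns 64, 64, 64 and 2.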

All Hit - Fit in cache size

We have the following parameters:

r2 - 0x10000000
r3 - 0x00000004
r4 - 0x100000FF

which means, we read the following addresses (64 locations):

0x10000000
0x10000004
0x10000008
0x1000000C
0x10000010
0x10000014
...
0x100000F8
0x100000FC
Wrap back to 0x10000000

We read multiple locations in the flash memory space, and the window is 256 bytes. This is smaller than the XIP cache size; therefore, all these locations are cached.

Partial Hit - Frequent cache reload

Output waveform with large read window but small step

We have the following parameters:

r2 - 0x10000000
r3 - 0x00000400
r4 - 0x1000FFFF

which means, we read the following addresses (64 locations):

0x10000000
0x10000400
0x10000800
0x10000C00
0x10001000
0x10001400
...
0x1000F800
0x1000FC00
Wrap back to 0x10000000

We read multiple locations in the flash memory space. The step size is 1 kiB but the window size is 64 kiB. The step size is smaller than the XIP cache size but the window size is larger than the XIP cache size; therefore, reads sometimes hit.

All Miss - Step larger than cache size

Output waveform with large read window and large step

We have the following parameters:

r2 - 0x10000000
r3 - 0x00004000
r4 - 0x100FFFFF

which means, we read the following addresses (64 locations):

0x10000000
0x10004000
0x10008000
0x1000C000
0x10010000
0x10014000
...
0x100F8000
0x100FC000
Wrap back to 0x10000000

We read multiple locations in the flash memory space, and the step size is 16 kiB. This is larger than the XIP cache size; therefore, every read misses.

All Hit - 2 way cached

Output waveform with 2-location read window

We have the following parameters:

r2 - 0x10000000
r3 - 0x00080000
r4 - 0x100FFFFF

which means, we read the following addresses (2 locations):

0x10000000
0x10080000
Wrap back to 0x10000000

Although the step is large, only 2 locations are read. Since the XIP cache is 2-way, both locations stay cached and we will not miss.
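The clear-cut cases (all hit and all miss) can be reproduced with a toy cache model on a PC. This is only a sketch under stated assumptions: a 16 kiB, 2-way cache with 8-byte lines and least-recently-used replacement. The line size and the replacement policy are assumptions of this model, not something measured here, and the names cache_access and count_hits are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 8u      /* assumed line size */
#define NUM_WAYS   2u
#define NUM_SETS   1024u   /* 16 kiB / (8 bytes * 2 ways) */

typedef struct {
    bool     valid[NUM_SETS][NUM_WAYS];
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint8_t  victim[NUM_SETS];          /* way to replace next (assumed LRU) */
} cache_t;

/* Look up one address; return true on hit, fill the line on miss. */
bool cache_access(cache_t *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;
    for (uint32_t way = 0; way < NUM_WAYS; way++) {
        if (c->valid[set][way] && c->tag[set][way] == tag) {
            c->victim[set] = (uint8_t)(way ^ 1u);  /* other way is now older */
            return true;
        }
    }
    uint32_t way = c->victim[set];
    c->valid[set][way] = true;
    c->tag[set][way]   = tag;
    c->victim[set]     = (uint8_t)(way ^ 1u);
    return false;
}

/* Run `reads` warm-up reads, then count hits over the next `reads` reads. */
uint32_t count_hits(uint32_t base, uint32_t step, uint32_t mask, uint32_t reads)
{
    static cache_t cache;
    memset(&cache, 0, sizeof cache);
    uint32_t addr = base;
    uint32_t hits = 0;
    for (uint32_t i = 0; i < 2u * reads; i++) {
        bool hit = cache_access(&cache, addr);
        if (i >= reads && hit) {
            hits++;
        }
        addr = (addr + step) & mask;    /* same add/and as the assembly loop */
    }
    return hits;
}
```

In this model, the 256-byte window and the 2-location test give 64 hits out of 64 reads after warm-up, and the 16 kiB step gives 0 hits. The partial hit case depends on the real replacement policy, which this sketch does not attempt to model faithfully.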