At its heart, the Xilinx UltraScale+ SoC has a multi-core Cortex-A53 CPU. This is not the fastest ARM out there, but it’s still plenty capable. One interesting feature is its built-in Snoop Control Unit (SCU), which transparently keeps the L1 data caches of the individual cores coherent. There is one pitfall you might run into when writing bare-metal code, though: the distinction between secure and non-secure memory access.

If one of your cores runs at a secure exception level (such as EL3) and another runs at a non-secure exception level (such as EL1), by default they see different memory spaces. The AXI bus has a signal called AxPROT that distinguishes secure from non-secure accesses. Once your data reaches a “stupid” memory such as DDR SDRAM, this distinction vanishes; the data cache, however, does take it into account and effectively treats the two as separate address spaces. Thus, your precious shared memory buffer will not be coherent until it is flushed on one side and invalidated on the other.

Fortunately, there is a simple solution. The page tables that you set up to configure the MMU (a prerequisite for using the snooper) have a bit called NS (Non-secure). Setting this bit forces all accesses to be treated as Non-secure even when running in a secure EL. The converse (forcing secure access from EL1) is of course not possible, because it would completely break the security model.

NS is bit 5 of the descriptor (it sits in the lower attributes field), so if you’re using the Xilinx-provided template code (translation_table.S) and you don’t really care about this type of security, you might want to simply change the line that reads

.set Memory,	0x405 | (3 << 8) | (0x0)		/* normal writeback write allocate inner shared read write */

to

.set Memory,	0x425 | (3 << 8) | (0x0)		/* normal writeback write allocate inner shared read write (forced non-secure) */
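
To see where the 0x20 difference between the two values comes from, here is a small C sketch that assembles the same attribute word from named fields. The macro and function names are my own and do not appear in translation_table.S; the field positions follow the ARMv8-A block descriptor layout.

```c
#include <stdint.h>

/* Hypothetical names for the fields of an ARMv8-A block descriptor. */
#define DESC_BLOCK        (UINT64_C(1) << 0)      /* bits [1:0] = 01: valid block entry */
#define DESC_ATTRINDX(n)  ((uint64_t)(n) << 2)    /* bits [4:2]: index into MAIR */
#define DESC_NS           (UINT64_C(1) << 5)      /* bit 5: force non-secure access */
#define DESC_SH_INNER     (UINT64_C(3) << 8)      /* bits [9:8] = 11: inner shareable */
#define DESC_AF           (UINT64_C(1) << 10)     /* bit 10: access flag */

/* Attribute word for normal write-back memory, optionally forced non-secure.
 * AP (bits [7:6]) is left at 00, which matches the "read write" in the
 * original comment. */
static inline uint64_t memory_attr(int force_non_secure)
{
    uint64_t attr = DESC_BLOCK | DESC_ATTRINDX(1) | DESC_SH_INNER | DESC_AF;
    if (force_non_secure)
        attr |= DESC_NS;
    return attr;
}
```

memory_attr(0) yields the same value as the original line, and memory_attr(1) the same value as the modified one.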

Alternatively, this can be done at runtime by setting the same bit in the relevant descriptors and then invalidating the TLB.
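
A minimal sketch of the runtime variant, assuming you have a pointer to a level 1 or level 2 translation table made of 64-bit descriptors. The helper name force_non_secure_blocks is hypothetical. It only touches block entries, because table descriptors use a different layout. On real hardware you would additionally have to clean the modified entries from the data cache and invalidate the TLB (followed by DSB and ISB) before the change takes effect; those steps are not shown here. If I recall correctly, the Xilinx standalone BSP also offers Xil_SetTlbAttributes() for changing the attributes of a single block, but check your BSP version before relying on that.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_TYPE_MASK   UINT64_C(0x3)          /* bits [1:0]: descriptor type */
#define DESC_TYPE_BLOCK  UINT64_C(0x1)          /* 01 = block entry at level 1/2 */
#define DESC_NS          (UINT64_C(1) << 5)     /* bit 5: non-secure */

/* Set NS on every block descriptor in table[0..count-1].
 * Invalid entries and next-level table pointers are left untouched.
 * Returns the number of descriptors that were actually modified. */
static size_t force_non_secure_blocks(volatile uint64_t *table, size_t count)
{
    size_t changed = 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t desc = table[i];
        if ((desc & DESC_TYPE_MASK) != DESC_TYPE_BLOCK)
            continue;                           /* not a block entry */
        if (desc & DESC_NS)
            continue;                           /* already non-secure */
        table[i] = desc | DESC_NS;
        changed++;
    }
    /* On target: clean the touched entries to memory, then invalidate
     * the TLB and issue DSB + ISB. Omitted in this sketch. */
    return changed;
}
```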

Thanks to this thread on the ARM Support forums.