Virtual memory gains importance as more data stays on chip
Ever since the first general-purpose computers appeared, programmers have looked for ways to expand the amount of memory they could address without actually adding more memory.
Originally, programmers took care of the problem themselves, developing methods such as overlaying to achieve the illusion of a larger memory store. But programming overlays is a tedious process, as many 8bit and 16bit microcontroller developers have found. It did not take long for system architects to look for automated ways of expanding the apparent address space of a computer.
Early forms appeared in the University of Manchester's Atlas, and Burroughs introduced a form of segmentation that would become familiar to x86 programmers decades later, but it was the IBM System/360 Model 67 that put virtual memory as we know it into action.
Borrowing concepts developed at the Massachusetts Institute of Technology, the computer used a hardware unit to 'convince' programs that each could access up to 16Mbyte of memory when the installed capacity was a mere fraction of that. The hardware would translate addresses used by the program into real memory addresses.
By presenting the illusion of a single, virtual memory space to a program, the technique made it possible for independent tasks to run side-by-side without interfering with each other's data. They simply could not see the physical pages used by other tasks, unless the operating system mapped them into the task's virtual address space.
The Model 67's hardware evolved into the memory-management unit (MMU) found in any processor that can run operating systems such as Linux. It took a while to convince programmers that the trick would work better than overlays, but in 1969 a team led by IBM researcher David Sayre showed that automatic paging of data on and off disk outperformed management by skilled programmers, even though the OS waits until the last minute, halting the processor with a memory fault because the data is not in memory. It then loads the requested page from disk while selecting a victim page to be spooled back out from memory to disk.
Normally, telling a processor the memory address it requested does not exist will trigger a fatal bus error. In practice, the bus fault still occurs but, instead of halting the system, an exception handler fetches the missing page from disk, maps it into main memory and then restarts program execution at the point where everything went wrong. But this requires the processor to support restarting after a bus fault. For older hands, this was one of the main differences between Motorola's 68000 and 68010 processors – the slightly older and simpler 68000 could not resume execution after a bus fault.
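The mechanism is easier to see stripped of the hardware details. The toy C program below simulates demand paging with a handful of frames backing a larger set of virtual pages; the pool sizes, the round-robin victim policy and the function names are purely illustrative, not taken from any real OS.

```c
#include <stdio.h>

/* Toy simulation of demand paging: a fixed pool of frames backs a larger
 * virtual space. A "fault" loads the page and, when the pool is full, evicts
 * a victim chosen round-robin. All sizes and policies are illustrative. */
#define NUM_VPAGES 8
#define NUM_FRAMES 3

static int frame_of[NUM_VPAGES];        /* -1 if the page is "on disk"        */
static int resident[NUM_FRAMES];        /* which virtual page owns each frame */
static int next_victim;

static void touch(int vpage)
{
    if (frame_of[vpage] >= 0) {
        printf("page %d: hit in frame %d\n", vpage, frame_of[vpage]);
        return;
    }
    /* Page fault: pick a victim frame, evict its current page, load ours. */
    int frame = next_victim;
    next_victim = (next_victim + 1) % NUM_FRAMES;
    if (resident[frame] >= 0)
        frame_of[resident[frame]] = -1; /* victim spooled back out to disk    */
    resident[frame] = vpage;
    frame_of[vpage] = frame;
    printf("page %d: fault, loaded into frame %d\n", vpage, frame);
}

int main(void)
{
    for (int i = 0; i < NUM_VPAGES; i++) frame_of[i] = -1;
    for (int i = 0; i < NUM_FRAMES; i++) resident[i] = -1;

    int refs[] = { 0, 1, 2, 0, 3, 0, 4 }; /* 5 distinct pages, 3 frames */
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        touch(refs[i]);
    return 0;
}
```

Running it shows hits for resident pages and, once the three frames are full, faults that evict a victim before loading the requested page.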
How you manage the mapping between virtual and real addresses presents the OS designer with a challenge. Ideally, you would maintain a simple list of virtual memory page addresses and index into it to find the actual address of each page in memory – assuming it has not been spooled out to disk. Unfortunately, that approach chews up megabytes of memory per process, which is not acceptable in servers, let alone embedded systems. As most systems have far less physical memory in use than the total possible virtual allocation, it makes sense to map pages the other way round: keep a fixed-size table with one entry per physical page, recording which virtual page maps to each real memory address. There is only one snag: how do you find the right entry when all you have to hand is the virtual address?
In its simplest form, you perform a laborious search of the inverted page table to work out which physical page has the right mapping. As that would take thousands of accesses for any reasonably sized physical memory array, it is not practical. In the real world, most implementations use a combination of hashing and linking. A hash, derived from the virtual address, points to one of the physical-virtual pairs in the table. The page-table walking algorithm visits this entry and checks the virtual address. If it matches, its job is done. If not, it follows the link appended to the pair, which points to the next candidate. The page-table walker keeps moving through those links until it is successful or works out that the page it is looking for is not in memory after all, which will trigger a fetch from disk.
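A sketch of that walk, assuming a toy inverted page table with an invented hash function and arbitrary sizes, might look like this in C:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative inverted page table with hashed lookup and chaining.
 * Sizes and the hash function are arbitrary assumptions for the sketch. */
#define NUM_FRAMES   16            /* one entry per physical page frame */
#define PAGE_SHIFT   12
#define INVALID      0xFFFF

struct ipt_entry {
    uint32_t vpn;        /* virtual page number mapped to this frame */
    uint16_t asid;       /* owning task/address-space ID             */
    uint16_t next;       /* index of next candidate, or INVALID      */
};

static struct ipt_entry ipt[NUM_FRAMES];
static uint16_t hash_anchor[NUM_FRAMES];   /* hash bucket -> first frame */

static unsigned hash(uint32_t vpn, uint16_t asid)
{
    return (vpn ^ asid) % NUM_FRAMES;
}

/* Walk the chain for (asid, vpn); return frame number or -1 if not resident. */
static int ipt_lookup(uint16_t asid, uint32_t vpn)
{
    uint16_t frame = hash_anchor[hash(vpn, asid)];
    while (frame != INVALID) {
        if (ipt[frame].vpn == vpn && ipt[frame].asid == asid)
            return frame;                  /* hit: physical frame found  */
        frame = ipt[frame].next;           /* follow the collision chain */
    }
    return -1;                             /* not in memory: page fault  */
}

int main(void)
{
    for (int i = 0; i < NUM_FRAMES; i++)
        hash_anchor[i] = INVALID;

    /* Map virtual page 0x123 of task 7 to physical frame 5. */
    unsigned h = hash(0x123, 7);
    ipt[5] = (struct ipt_entry){ .vpn = 0x123, .asid = 7, .next = hash_anchor[h] };
    hash_anchor[h] = 5;

    int frame = ipt_lookup(7, 0x123);
    if (frame >= 0)
        printf("physical address = 0x%x\n", (unsigned)((frame << PAGE_SHIFT) | 0x4a));
    return 0;
}
```

The chain hanging off each hash anchor is what makes lookup time variable: a hit may come on the first probe or only after following several links.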
The general-purpose approaches developed originally for memory-constrained servers do not work well for real-time, memory-constrained systems. Although hashing shortens the search, the variable length of the link chain still introduces a degree of non-determinism in how long it takes to find a page.
As embedded systems typically do not need to support large virtual-memory spaces, it is possible to use tables with virtual-address indices without demanding huge chunks of memory. However, the use of software architectures ported from the desktop environment means that, despite their drawbacks, standard memory-management techniques are widely used in embedded systems.
Having to do a table walk on every memory access is clearly inefficient, especially as successive accesses will often be to the same page. If you cache the most recently accessed page translations, you can provide a real address more than 90% of the time without involving the software page handler. Best of all, the translation lookaside buffer (TLB) does not usually need to be very big. However, as it is important for the buffer not to suffer from thrashing because of tag conflicts between pages with similar virtual addresses, the TLB is normally fully associative – and expensive to implement – compared with a conventional cache that might be only one-, two- or four-way set associative. With a TLB, it is possible to provide a real address as quickly as an access to the first-level cache, a factor that has had a strong influence on microprocessor architecture.
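In hardware, the comparison against every TLB entry happens in parallel; the small C model below simply scans an array to show the principle. The eight-entry size, the round-robin replacement policy and the field names are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy fully associative TLB: hardware checks every entry at once; this model
 * just scans a small array. Sizes and replacement policy are illustrative. */
#define TLB_ENTRIES 8
#define PAGE_SHIFT  12

struct tlb_entry { bool valid; uint32_t vpn; uint32_t pfn; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_slot;                 /* trivial round-robin replacement */

static bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return true;                   /* hit: no page-table walk needed   */
        }
    }
    return false;                          /* miss: walk the page tables       */
}

static void tlb_fill(uint32_t vpn, uint32_t pfn)
{
    tlb[next_slot] = (struct tlb_entry){ .valid = true, .vpn = vpn, .pfn = pfn };
    next_slot = (next_slot + 1) % TLB_ENTRIES;
}

int main(void)
{
    uint32_t paddr;
    tlb_fill(0x00400, 0x1F3);              /* cache one translation            */
    if (tlb_lookup(0x00400ABC, &paddr))
        printf("hit: 0x%08X\n", paddr);    /* prints 0x001F3ABC                */
    return 0;
}
```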
With MMU and TLB in place, there is another reason for embedded systems to use virtual memory, even if you have no intention of providing more storage than is achievable with the combination of flash and DRAM. Because virtual memory hides blocks of physical memory that are not mapped into a task's address space, it provides an effective way of preventing tasks from overwriting each other's program or data spaces.
If a process attempts to write outside its allocated address range, the most it can do is trigger a bus fault, because the pages will not have been mapped by the MMU. By adding descriptors to the page definitions, it is possible to trigger faults if code attempts to use memory in the wrong way. For example, version 6 of the ARM architecture added bits to prevent the processor from attempting to execute code in a page – in case a mistake or malicious code causes execution to jump to a page that is meant to contain only data. If the processor jumps to 'code' in one of these pages, it triggers a permissions fault.
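A simplified view of such permission checks, using invented flag names and bit positions rather than the real ARM descriptor encoding, might look like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative page-descriptor permission bits; names and positions are
 * invented for the sketch, not taken from any architecture manual. */
enum { PG_PRESENT = 1 << 0, PG_WRITE = 1 << 1, PG_EXEC = 1 << 2 };

enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

/* Return true if the access is allowed; a false result models a fault. */
static bool check_access(uint32_t descriptor, enum access acc)
{
    if (!(descriptor & PG_PRESENT))
        return false;                           /* page fault: not mapped     */
    if (acc == ACC_WRITE && !(descriptor & PG_WRITE))
        return false;                           /* write to a read-only page  */
    if (acc == ACC_EXEC && !(descriptor & PG_EXEC))
        return false;                           /* execute-never: data page   */
    return true;
}

int main(void)
{
    uint32_t data_page = PG_PRESENT | PG_WRITE; /* mapped, writable, no exec  */
    printf("write allowed: %d\n", check_access(data_page, ACC_WRITE));
    printf("exec allowed:  %d\n", check_access(data_page, ACC_EXEC));
    return 0;
}
```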
Other bits control how the caches and bus controllers work with data in those pages. For example, data in a page can be set to be 'strongly ordered' to ensure that reads and writes to that area of memory cannot be performed out of order. This is useful for implementing locks and semaphores that control access to resources that might be used by multiple processors in the system. Other bits control whether the data in a page can be stored in a writeback cache or whether any write by the processor has to pass through the cache to go straight back into main memory in systems that do not have full cache coherency.
There is a cost for memory protection performed using virtual memory. Any time a task wants to share data with another, it has to ask the OS to copy it into the other task's virtual-address space. So-called zero-copy techniques work by mapping the same physical address range into the virtual-address spaces of two or more tasks. This reduces the overhead of copying, but raises the probability of inadvertent corruption of data by an errant task.
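On a POSIX system the zero-copy pattern looks roughly like the sketch below: each cooperating task maps the same shared object, so the kernel backs both mappings with the same physical pages. The object name '/zc_demo', the size and the single-process framing are illustrative only.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Minimal POSIX shared-memory sketch: the same physical pages are mapped
 * into the virtual address space of every task that maps "/zc_demo". */
int main(void)
{
    const size_t len = 4096;
    int fd = shm_open("/zc_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, len) < 0)
        return 1;

    /* Each task performs the same mmap(); the kernel maps the same physical
     * pages at (possibly different) virtual addresses in each task. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    strcpy(buf, "visible to every task that maps /zc_demo");

    munmap(buf, len);
    close(fd);
    shm_unlink("/zc_demo");   /* remove the shared object when done */
    return 0;
}
```

A second process that opens and maps the same object sees the data immediately, without any copy – and also without any protection against the first process scribbling over it.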
Virtual memory support has a direct effect on cache design. In principle, you want to use virtually mapped caches because this avoids having to go through the MMU before the cache can deliver the next instruction or piece of data. In practice, many systems use physically mapped caches because they offer, on balance, better performance. The main problem with the virtually mapped cache is that different tasks will typically reuse the same blocks of virtual memory – which causes conflicts in the cache. Not only that, the system has no easy way of determining whether virtual memory addresses for one task will be valid for another. So, the safest thing to do is flush the cache on each context switch and reload the data. Naturally, this puts a severe dent in the system's responsiveness.
If you have a physically mapped cache, you have to wait for the access to the MMU, and potentially suffer a page-table walk, before you can pull data from the cache. This extra latency can only be hidden by deeper pipelining, which degrades the processor's performance in branch-heavy code. Many level-one caches therefore use virtual indexing but physical tags to confirm the correct cache line, which allows the MMU access to happen in parallel with the cache lookup. The data is fetched, but held until the MMU access confirms the cache has the correct line.
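The constraint that makes this trick work is that the cache index must come from address bits inside the page offset, which do not change under translation. A back-of-envelope check, using illustrative numbers, is shown below.

```c
#include <stdio.h>

/* Sanity check for a virtually indexed, physically tagged level-one cache:
 * the index must fit within the page offset so the lookup can start before
 * the translation arrives. All figures here are illustrative. */
int main(void)
{
    unsigned page_size  = 4096;       /* 4 Kbyte pages: 12 offset bits       */
    unsigned line_size  = 64;         /* bytes per cache line                */
    unsigned ways       = 8;          /* set associativity                   */
    unsigned cache_size = 32 * 1024;  /* 32 Kbyte level-one cache            */

    unsigned sets         = cache_size / (line_size * ways);
    unsigned indexed_span = sets * line_size;  /* bytes covered by index + offset */

    printf("bytes indexed per way: %u (page size %u)\n", indexed_span, page_size);
    printf("%s\n", indexed_span <= page_size
           ? "index fits in the page offset: lookup can start before the TLB answers"
           : "index uses translated bits: aliasing between pages must be handled");
    return 0;
}
```

This is one reason level-one caches tend to stay small and relatively associative: growing the cache without adding ways would push the index into translated address bits.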
There are ways around context-switch flushing with virtually mapped, virtually tagged caches. You can add more tag bits to encode the process ID. One approach used to save on tag memory in some older ARM-based systems, before the company switched to physically tagged caches, was to ensure processes do not share virtual addresses.
Until version 6 of the architecture, ARM recommended use of the fast context-switch extension, which limited each task to 32Mbyte of memory. The process ID was added to the virtual address, so each of up to 127 tasks received a chunk of virtual memory to which it had sole access in a 4Gbyte range. While far from being an efficient use of the theoretically available address space, it provided a reasonable tradeoff in terms of transistor count versus capability at a time when most OSs expected to run on the processors would use only a fraction of the available virtual-memory space.
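The address modification itself is simple, as the sketch below shows: low virtual addresses are shifted into the slot selected by the current process ID, while everything above the 32Mbyte boundary passes through unchanged. The helper name and the example process IDs are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the ARM fast context-switch extension (FCSE) address relocation:
 * virtual addresses below 32 Mbyte are moved into a per-task slot, so tasks
 * using the same low virtual addresses never collide in a virtually tagged
 * cache. */
#define SLOT_SIZE (32u << 20)          /* 32 Mbyte per task                     */

static uint32_t fcse_translate(uint32_t vaddr, uint32_t pid)
{
    if (vaddr < SLOT_SIZE)             /* only the low 32 Mbyte is relocated    */
        return vaddr + pid * SLOT_SIZE;
    return vaddr;                      /* addresses above 32 Mbyte pass through */
}

int main(void)
{
    /* Two tasks using the same virtual address end up with distinct
     * modified addresses. */
    printf("task 3:  0x%08X\n", fcse_translate(0x00010000, 3));
    printf("task 12: 0x%08X\n", fcse_translate(0x00010000, 12));
    return 0;
}
```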
The rise of multicore systems has reopened debate on the point at which address translation happens. Most vendors have, for the moment, decided to stick with translating addresses into their physical counterparts as quickly as possible. However, concerns over power consumption could see that change.
One of the biggest contributors to power consumption in future systems will be accesses to main memory, because of the capacitive and inductive load of the I/O channels. The focus in system architecture will be to keep as much data on-chip as possible, in caches and local memories. But, if data hardly ever moves off-chip into its DRAM 'home', do you really need physical addresses? Why suffer the cost of translation when, most of the time, you are using either a local copy or picking up a copy stored in the cache of a nearby processor, especially if many of those processors are working on the same virtual memory space for a parallelised task?
One theme in distributed multiprocessor research is the idea that most data is sitting in a portion of a distributed cache, rather than a centralised memory – because there can be no centralised memory. In this scenario, data simply flows around the system to the point where it was most recently used. As long as you have a tag for it, it does not really matter whether the address is virtual or physical. This is the concept behind the cache-only memory architecture (COMA), a form of distributed shared memory (DSM) or distributed virtual memory (DVM) architecture. The data is managed almost entirely by cache-coherency protocols.
However, most processor makers are avoiding the wholesale move to virtual memory at the system level. Although its architecture employs distributed caches to avoid congestion when accessing a centralised level-two or three cache, Tilera's many-core TILE-Gx processors have local TLBs so they request data using physical addresses, rather than virtual.
ARM has borrowed the term DVM for the SoC infrastructure it plans to build around multiprocessor Cortex-A15 systems and the AXI4 interconnect. Each block that needs to access memory, including I/O controllers, has its own MMU and the cache-coherency channels pass signals that will update or invalidate entries in the local TLBs. However, the A15 uses a physically tagged cache, so the addresses flying around the SoC will not be virtual from the perspective of an OS, such as Linux or Windows Phone 7.
Instead, ARM envisages these 'intermediate physical addresses' will help with virtualisation. Only the hypervisor running on the system will be able to see and map actual physical addresses – each OS running will only have access to this second, intermediate address space. Instead of a hypervisor trapping MMU accesses by a guest OS and then remapping pages, the technique co-opts virtual memory hardware directly.
In principle, a second level of indirection increases security as it makes it harder for an I/O unit to overwrite data belonging to the wrong virtual machine. It should also simplify the hypervisor, as it does not need to maintain its own shadow memory-mapping tables. However, its effect on performance overall is less clear. Hypervisor interventions in memory mapping are far rarer than actual accesses, which, if they go to main memory in the ARM architecture, have to go through two phases of translation. And, if a mapping is not in the TLB of an I/O unit, it will have to interrupt one of the processors to invoke software to perform a table walk and write the data into the local TLB.
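The two phases can be pictured as two lookups chained together, as in the toy C model below, where flat arrays stand in for the guest's stage-one tables and the hypervisor's stage-two tables; all sizes and mappings are invented for the sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of two-stage translation: the guest OS maps virtual addresses to
 * intermediate physical addresses (IPAs), and the hypervisor's second-stage
 * tables map IPAs to real physical addresses. Flat arrays stand in for page
 * tables; sizes and mappings are illustrative. */
#define PAGE_SHIFT  12
#define GUEST_PAGES 4
#define IPA_PAGES   4

static uint32_t stage1[GUEST_PAGES] = { 2, 3, 0, 1 };  /* guest VA page -> IPA page */
static uint32_t stage2[IPA_PAGES]   = { 7, 5, 9, 6 };  /* IPA page -> physical page */

static uint32_t translate(uint32_t guest_vaddr)
{
    uint32_t offset    = guest_vaddr & ((1u << PAGE_SHIFT) - 1);
    uint32_t ipa_page  = stage1[guest_vaddr >> PAGE_SHIFT];  /* stage 1: guest tables      */
    uint32_t phys_page = stage2[ipa_page];                   /* stage 2: hypervisor tables */
    return (phys_page << PAGE_SHIFT) | offset;
}

int main(void)
{
    printf("guest 0x%05X -> physical 0x%05X\n", 0x1ABCu, translate(0x1ABCu));
    return 0;
}
```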
Having cross-system virtual memory raises the prospect of large-scale cache flushing when each guest is scheduled to run. However, it is likely that ARM and others who adopt this approach will use VM or process IDs to prevent the entire contents of every cache in the hierarchy being dumped on a context switch. The question is whether hypervisor writers and systems companies will embrace the approach – and with it a new, second level of virtual memory – or whether the industry sticks with just the one. Either way, virtual memory is here to stay.