The focus this week was on performance.
I finished my skip-list structure and integrated it into the mapping handling. I discovered, during development, that keeping entries sorted by starting address by itself gave the biggest speedup. The old version scanned all entries every time a lookup was made, but with sorted entries you can stop looking when an entry is past the lookup address.
Some extra performance was gained by the skip mechanism but not much in my measurements. But I think this gain would be bigger in a large application with lots of memory blocks, or if several applications are run simultaneously.
I have also added a simple caching mechanism. It uses a global serial number that gets incremented every time a mapping is added or removed. A special cache handle keeps track of the entry last looked up. If a new lookup is made and it is within the same mapping and the serial number is not changed, we have a hit in the cache and we do not have to traverse the skip list at all.
I added an instruction cache handle and a data cache handle into the CPU structure in the processor emulation. Instruction fetches use the instruction handle and data accesses use the data handle. These caches madeĀ another 10% gain in performance.
I used a simple “sieve” program for some measurements.
Here are some numbers:
| Scenario |
sieve 3000 |
sieve 10 000 |
sieve 20 000 000 |
| No optimizations |
36 s |
not measured |
not measured |
| Skip-list |
~4 s |
31 s |
not measured |
| Skip-list + cache |
~4 s |
27 s |
not measured |
| Native AROS sieve |
~0 s |
<1 s |
23 s |
As you can see, my optimizations give quite a performance boost, almost a 10 time speedup! But yeah, I’m still behind the native application by a factor of two thousand. But this is only the beginning!
I will leave the performance work for now and work on the last remaining things needed to run “Clock” fully.