Finding Light in the SHADOW
Every published paper has a story. In research, we often sell our work as a neat, straight line. To readers, it might seem that the problem was obvious, the solution insightful, the implementation straightforward, and the results amazing—leading effortlessly to a great paper. My experience (and perhaps yours, too) couldn’t be farther from this idealized image.
The purpose of this blog post is to give a behind-the-scenes look at what it took to publish our MICRO’25 paper, SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads. The journey to publication was a challenging yet deeply rewarding experience.
Getting an idea
GhOST, my previous paper, took three more submission cycles to get published after the work itself was done. By then, I realized I'd saturated my curiosity about GPU microarchitecture. I had explored GPUs deeply, grasped their microarchitecture, programming model, and memory behavior, and could clearly visualize their execution flow. It was time for a new direction.
There was just one issue: I had no idea what problem to tackle next.
So, I took the only rational step: at ASPLOS’23 and ISCA’23, I cornered every faculty member I could find and picked their brains. They were incredibly generous with their time, patient with my wild ideas, and straightforward in their feedback. They suggested exploring sparsity on GPUs (but I wanted a GPU break), edge computing (no suitable simulator at the time, and I didn’t want to build one from scratch), or building accelerators (which didn’t excite me back then).
CPUs emerged as a natural next step. I'd always found parallelism fascinating, and GhOST had clearly shown us there was still untapped performance in existing architectures. So if going deep (with OoO) on a wide machine (many threads) paid off on the GPU, why not go wide (more SMT) on deep machines (OoO) on the CPU? Great! But how exactly?
That’s when we hit another wall: scaling CPU thread counts isn’t limited by usefulness but by area and power, because OoO execution is resource-intensive. What if we combined OoO and InO threads in the same core, creating a truly asymmetric SMT CPU? The idea of mixing thread types that run simultaneously, sharing front-end and back-end resources, seemed genuinely exciting (and mildly terrifying). Now the real question emerged: how on earth do I build this?
Implementing the idea
To put it bluntly, implementing SHADOW was painful. I chose gem5 as my simulator thanks to its execute-in-execute design, which allowed at least partial verification of massive architectural changes. There were two catches: I had never worked on gem5 before, and its SMT implementation was broken. So I had to fix SMT first, then integrate InO threads alongside the OoO ones, ensure they properly shared the pipeline, disable register renaming for the InO threads, add false-dependence checking for them, and verify synchronization.
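For readers who have never touched gem5, here is a rough sketch of what even the starting point looks like before any SHADOW changes: an SE-mode config with one out-of-order core carrying two SMT hardware threads. This is a hypothetical, minimal example in the style of gem5's learning tutorial, not our actual setup; class and port names (X86O3CPU, cpu_side_ports, int_requestor) follow recent gem5 releases and differ in older ones, and none of the asymmetric-thread machinery appears here, since that lives in gem5's C++ pipeline code.

```python
# smt_baseline.py -- hypothetical minimal gem5 SE-mode config: one O3 core, two SMT threads.
# Illustrative starting point only, not the SHADOW patches. Run with: build/X86/gem5.opt smt_baseline.py
import m5
from m5.objects import *


class L1Cache(Cache):
    # Small private L1 (the O3 model expects caches in front of it).
    size = "32kB"
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 4
    tgts_per_mshr = 20


system = System()
system.clk_domain = SrcClockDomain(clock="2GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("2GB")]

# One out-of-order core carrying two hardware threads (SMT).
system.cpu = X86O3CPU(numThreads=2)  # named DerivO3CPU in older gem5 releases

system.membus = SystemXBar()
system.cpu.icache = L1Cache()
system.cpu.dcache = L1Cache()
system.cpu.icache.cpu_side = system.cpu.icache_port
system.cpu.dcache.cpu_side = system.cpu.dcache_port
system.cpu.icache.mem_side = system.membus.cpu_side_ports
system.cpu.dcache.mem_side = system.membus.cpu_side_ports

# x86 creates one local APIC per hardware thread; wire each to the bus.
system.cpu.createInterruptController()
for i in range(2):
    system.cpu.interrupts[i].pio = system.membus.mem_side_ports
    system.cpu.interrupts[i].int_requestor = system.membus.cpu_side_ports
    system.cpu.interrupts[i].int_responder = system.membus.mem_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8(range=system.mem_ranges[0])
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

# SE mode: one Process per hardware thread (here the stock "hello" test binary, twice).
binary = "tests/test-progs/hello/bin/x86/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = [Process(pid=100 + i, cmd=[binary]) for i in range(2)]
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
print("Exited because:", m5.simulate().getCause())
```

Even a setup this small was not a given: as noted above, gem5's stock SMT support needed fixes before any of the asymmetric-thread work could begin.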
Debugging this infrastructure drained every ounce of energy and patience I had, taking a full year. Only after this was complete could my actual research begin. And just when I thought the hardest part was over, another big question hit: Which applications would actually benefit?
Finding the workloads
With the infrastructure built, the idea needed fleshing out. Should each thread run a single-threaded application? Should we prioritize OoO threads for foreground processes, relegating InO threads to background tasks? How would we ensure QoS? Or maybe all threads should run the same multi-threaded process? After significant deliberation, we decided the last approach was most practical for testing SHADOW. But then, another question appeared: which applications?
I strongly believe processors should benefit general-purpose workloads. CPUs need to perform well broadly, even if they specialize occasionally. Moreover, my system only supported pthreads, and frankly, I was too burned out to implement OpenMP support. I ran several pthread benchmarks, and…we saw zero performance gain. In fact, performance degraded.
I had invested more than a year and a half and had nothing positive to show. It felt awful. Now what?
Study the stalls
Digging deeper into the system I’d barely gotten working, I discovered optimization opportunities hidden in the interactions between OoO and InO threads. Basically, performance bugs. Fixing these bugs finally yielded measurable, significant performance improvements.
Understand the system
After nearly two years of intense building, testing, debugging, and hoping for the best, we could finally start understanding SHADOW’s dynamics. Now we had to answer the fun questions: What truly drives SHADOW’s performance gains? Which applications benefit most, and under what thread configurations? How should work be distributed across threads, statically or dynamically? Is SHADOW configurable per application, and if so, how? We answered all of these.
SHADOW was a passion project born from genuine curiosity. I sincerely wanted to understand the impact of asymmetric SMT. Ultimately, we uncovered intriguing insights about parallelism, thread behavior, and CPU architecture. It was rejected by ISCA’25, so it is a relief that SHADOW is finally out at MICRO’25. The paper will read as if everything was obvious, but now you know that was far from the truth.
Final reflections
If your research feels chaotic, uncertain, or exhausting, remember you’re not alone. Behind every polished paper is a messy journey filled with setbacks, frustrations, and doubts. Sharing this story openly, I hope it resonates with you, reminding you that genuine, worthwhile work rarely comes easy. And that’s perfectly okay.