Authored by Timothy Hao Chi Ho
Lodestar is ChainSafe's Eth2 client, built in TypeScript. Lodestar provides highly accessible tooling libraries that benefit the entire Eth2 ecosystem. Our libraries are written in idiomatic TypeScript, making them accessible to a broad swath of developers. Additionally, Lodestar will feature heavily in the Eth2 ecosystem as a deployable light client. Check out our last blog post, this talk and these slides for an introduction.
Out of the ashes of the Scaling Ethereum EthGlobal hackathon, a Lodestar light client prototype was born! Over the past month, the Lodestar team has been focusing almost exclusively on delivering the light client for the hackathon. You will get to learn more soon as we will be releasing an update specifically around our light client work. Stay tuned!
Outside of our light client work, progress has also been made on stabilizing the Lodestar beacon node. Let's dive right in.
Garbage collection day: fixing our OOM issues 🗑
In our last update, we pointed out that our node had some memory leaks triggering out-of-memory (OOM) crashes. This prevented syncing from genesis to the head of the chain unless the client was restarted or you were willing to bear a bit of occasional downtime. Thankfully, we have now resolved this and our node has been stable and running for several weeks on Mainnet, Pyrmont and Prater. The problem came down to a few cases of circular references, which outsmarted the garbage collector - responsible for automatic memory management - and stopped it from collecting.
The secret sauce that fixed the problem is WeakRefs, which were introduced in Node v14.6.0. Lodestar features structurally shared, tree-backed data structures to efficiently hash data and cache those same hashes. These structures are critical for our light client server to efficiently serve proofs, but they require subtrees to reference their parent tree. Now, the subtree references its parent with a WeakRef, so all those big, expensive trees inside full states can be garbage collected even if you want to keep referencing a subtree.
[A quick explainer for your understanding: it might be helpful to think of the top (root) of a tree as being the anchor, which we have to hold onto if we want to keep the tree. As it grows, WeakRefs allow us to create a new anchor lower down in the tree so that the stuff at the top can be discarded (garbage collected). For light clients we mostly care about the recent states, so there is no need to keep everything!]
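To make the idea concrete, here is a minimal sketch of a subtree holding a weak back-reference to its parent. The `TreeNode` class and its method names are hypothetical and are not Lodestar's actual tree implementation; the only real API used is the standard `WeakRef` (which needs the ES2021 library typings in TypeScript).

```typescript
// Hypothetical node type for illustration only, not Lodestar's tree code.
class TreeNode {
  // Weak back-reference: on its own it does not keep the parent alive.
  private parentRef?: WeakRef<TreeNode>;
  // Parents still hold their children strongly.
  readonly children: TreeNode[] = [];

  constructor(readonly label: string) {}

  addChild(child: TreeNode): void {
    child.parentRef = new WeakRef(this);
    this.children.push(child);
  }

  // Returns the parent while it is still in memory, or undefined
  // once the garbage collector has reclaimed it.
  getParent(): TreeNode | undefined {
    return this.parentRef?.deref();
  }
}

const root = new TreeNode("state");
const subtree = new TreeNode("validators");
root.addChild(subtree);
```

As long as something else holds `root`, `subtree.getParent()` returns it. Once every strong reference to `root` is dropped, holding `subtree` alone no longer pins the whole tree in memory, so callers have to handle an `undefined` parent.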
Validator profitability: Achievement unlocked ✅
With a stable beacon node capable of syncing to the head, the next step is to ensure Lodestar validators are properly incentivized with staking returns. We have refactored our validator client and its interactions with the beacon node to make full use of our caches for better efficiency. As a result, for the first time ever, a Lodestar beacon node and a Lodestar validator client have achieved profitability on a testnet!
With a configuration of 1 validator client with 1 validator key, the Lodestar validator can achieve 50–75% profitability compared to the network average, based on the rewards received per epoch on our Prater nodes and compared to the other four Eth2 clients. However, performance degrades significantly when attaching 16+ validators, so this will be our next area of focus after delivering Altair. [More on this in a bit.]
METRICS 👏 ARE 👏 EVERYTHING
A big lesson we have learned is how critical good metrics are to success, especially when load testing an implementation. Without them, you are blind to the real impact of your projected performance improvements. We've spent a significant amount of time quadrupling our total count of available metrics since our last update, and we are just starting to see an insane ROI.
It's one thing to have logs in a terminal tracking node activity; it's another to have a dashboard that plots the data through time so you can identify precisely where and when things are failing. Logs are simply no substitute for the holistic information that the dashboard provides. More than that, metrics help to quantify how badly things are failing compared to the steady state. We've certainly come to appreciate metrics tooling as a bird's-eye assessment of node health and performance. Used in conjunction with other tools (e.g. ChromeDevTools, debugger), metric-based troubleshooting can be a powerful debugging paradigm to operate in. A great analogy by Lodestar developer dapplion: using metrics is akin to a diagnostic/pulse indicator for a patient, while other tools, like a debugger, are like a scalpel. Only with a proper understanding of where the underlying problems might be would you open up a patient to perform surgery with a scalpel.
Case study 1: State regeneration queue
Take this example, where we noticed an issue with our state regeneration queue, with job wait times spiking to 20 minutes. The state regeneration module (regen for short) rebuilds a past state that's no longer in memory. This is often required to validate attestations and blocks which reference old states many slots back from the current head. States can only be regenerated one at a time, so regen jobs wait in a queue. To follow the chain properly, you would want to process as many attestations as possible, which requires fast validation, which itself requires fast regen.
The regen queue should always be close to empty, with jobs waiting at most a few seconds. However, thanks to the dashboard, we saw a big spike between midnight and 2 am on May 10th, 2021. Yes, scrolling through thousands of lines of logs might have helped identify the issue, but the dashboard showed exactly when things began breaking and how badly. In this case, we could guess that something was blocking the queue. So we developed our hypothesis for why the queue wait time had jumped by orders of magnitude, and then went to the scalpel (i.e. ChromeDevTools) to hook into our node and see what was actually inside the queue, filling it up and causing problems.
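The kind of measurement described above can be sketched as a one-at-a-time job queue that records how long each job waited before it started. This is an illustrative sketch only: `SerialJobQueue`, `waitTimesMs` and the job labels are hypothetical names, and Lodestar's real regen queue and metrics pipeline are not shown in this post.

```typescript
type Job<T> = () => Promise<T>;

// Hypothetical serial queue: runs one job at a time and records wait times.
class SerialJobQueue {
  private tail: Promise<unknown> = Promise.resolve();
  private pending = 0;
  // Milliseconds each job spent waiting before it started running.
  readonly waitTimesMs: number[] = [];

  // Number of jobs that have been queued but have not started yet.
  get length(): number {
    return this.pending;
  }

  push<T>(job: Job<T>): Promise<T> {
    const enqueuedAt = Date.now();
    this.pending++;
    const result = this.tail.then(() => {
      this.pending--;
      this.waitTimesMs.push(Date.now() - enqueuedAt);
      return job();
    });
    // Keep the chain going even if this job rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

const regenQueue = new SerialJobQueue();
for (const label of ["state A", "state B", "state C"]) {
  // Stand-in for an expensive state regeneration.
  void regenQueue.push(async () => label);
}
```

Exporting `length` and the recorded wait times to a dashboard is what turns a vague "the node feels slow" into a visible spike at a specific time.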
Case study 2: the prepareBeaconCommitteeSubnet process
We now weave our story back to the performance/load issues mentioned earlier when connecting 16+ validators to the beacon node. How did we use our metrics tooling to diagnose the load problems we were having with validators? With data visualization, we could easily identify the problem areas: something was causing REST API response times to sputter and was disrupting the validator flow.
To be profitable, an Eth2 validator has to complete a set of actions within relatively tight time windows to get rewarded by the protocol. Here, our metrics turned out to be supremely helpful in diagnosing the prepareBeaconCommitteeSubnet handler, which performed extremely poorly when stress-tested with 16+ validators. Thanks to the metrics, we could see that around May 1st, this particular process was taking ~16 seconds to run, which was a regression from the previous version. With that, we went to our scalpel (i.e. ChromeDevTools), looked at a single run of this process, and realized we had implemented a suboptimal function. We swapped it for a more efficient one, and the entire process stabilized at 30 ms. This is another example of how helpful visualizing the data can be!
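A per-handler duration metric like the one described above can be sketched with a small timing wrapper. Everything here is an assumption made for illustration: `DurationTracker`, `timed` and the stand-in handler are hypothetical, the wrapper only covers synchronous functions, and a real node would export the samples to its metrics system rather than keep them in memory.

```typescript
// Hypothetical in-memory store of duration samples, keyed by handler name.
class DurationTracker {
  private readonly samples = new Map<string, number[]>();

  observe(name: string, seconds: number): void {
    const list = this.samples.get(name) ?? [];
    list.push(seconds);
    this.samples.set(name, list);
  }

  count(name: string): number {
    return this.samples.get(name)?.length ?? 0;
  }

  // Slowest observed run in seconds, or 0 if nothing was recorded.
  max(name: string): number {
    return Math.max(0, ...(this.samples.get(name) ?? []));
  }
}

// Wraps a synchronous function so every call records its duration,
// including calls that throw.
function timed<A extends unknown[], R>(
  tracker: DurationTracker,
  name: string,
  fn: (...args: A) => R
): (...args: A) => R {
  return (...args: A): R => {
    const start = Date.now();
    try {
      return fn(...args);
    } finally {
      tracker.observe(name, (Date.now() - start) / 1000);
    }
  };
}

const tracker = new DurationTracker();
// Stand-in handler: the real prepareBeaconCommitteeSubnet logic is not shown here.
const handler = timed(tracker, "prepareBeaconCommitteeSubnet", (validatorCount: number) => validatorCount * 2);
const handlerResult = handler(16);
```

Plotting the maximum or a high percentile of these samples over time is what makes a regression between two versions stand out.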
Hopefully, with this discussion, you can see how metrics and web development tools can help troubleshoot worrisome node performance issues. They have been critical to the Lodestar team's methodical approach to optimizing the client so far.
Up next 🛣
We are still priming Lodestar to become Altair-ready, and we are about 90% of the way there. As a gentle reminder to the readers, the Altair hardfork actually adds the critical functionality that makes light clients possible in Eth2 - the notion of a "sync committee". Rest assured, all of this will be discussed in detail in our upcoming light client article!
Beyond that, we plan to contribute back to the Eth2 community by sharing our experience developing the light client with other spec developers and teams. Part of this initiative will be cross-pollinating with other clients, beginning with interop testing with Teku on a testnet that our light clients can connect to.
Finally, we will continue optimizing for node stability. We have a clear roadmap for the near future, with a few priorities aimed at improving the experience for end users:
Reduce resource consumption (CPU, memory, disk, bandwidth)
Increase profitability to 90–99% when compared to the rest of the network
Increase security, such as denial-of-service protection and covering all potential attack vectors - lots of research ahead!
If you are interested in getting involved and contributing to the project, check out our GitHub. If you would like to get in contact with one of the Lodestar team members, feel free to drop by the #lodestar channel on ChainSafe's Discord, or email email@example.com. We would love to know more about you, your team and your project!
For more details on Lodestar, please head to our documentation site.