Authored by Ben Adar
Last month, when the Medalla testnet launched, we learned a lot about where Lodestar was lacking and where we needed to focus. At the highest level, our takeaway is that we need a stable client before moving on to anything else, which means refactoring several problem areas in our codebase. See our Medalla update here. To have a spec-compliant, stable client, we need to focus on a few key issues:
Our network stack uses an old version of gossipsub and does not include all required validations.
Our chain submodules are overly coupled and have poor error handling, making them inflexible and difficult to safely modify.
Our block production was broken: Lodestar-proposed blocks could not drive Eth1 data consensus forward.
Since we identified these issues, the Lodestar team has made considerable progress towards solving them.
We are updating our network stack to js-libp2p 0.29.x, which includes our recently implemented gossipsub v1.1. Integration will require further TypeScript support in a js-libp2p dependency; a PR is in progress. We are also adding peer scoring to improve sync stability. Peer scoring ranks peers based on how they respond to a node's requests: slow or buggy peers are penalized, so a Lodestar node learns which peers are reliable and which are not. This improves sync speed and reliability by preventing bad peers from wasting a Lodestar node's time. We're also doing a close re-reading of the p2p spec. After reading the recent Nimbus update (thanks for the great update!), we realized we had the same bug they had, caused by not following the p2p spec closely enough. Further reading suggests there are a few places where we aren't performing all of the necessary validations.
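To illustrate the idea behind peer scoring, here is a minimal sketch of a score tracker. The class, method names, and thresholds are hypothetical stand-ins for illustration, not Lodestar's or gossipsub v1.1's actual implementation.

```typescript
type PeerId = string;

// Minimal illustrative peer score tracker: reward good responses,
// penalize slow/buggy ones, and rank peers by accumulated score.
class PeerScoreTracker {
  private scores = new Map<PeerId, number>();

  // Reward a peer that answered a request promptly and correctly
  onGoodResponse(peer: PeerId): void {
    this.scores.set(peer, (this.scores.get(peer) ?? 0) + 1);
  }

  // Penalize a slow, invalid, or missing response
  onBadResponse(peer: PeerId, penalty = 2): void {
    this.scores.set(peer, (this.scores.get(peer) ?? 0) - penalty);
  }

  // Only sync from peers above a minimum score
  isReliable(peer: PeerId, minScore = -5): boolean {
    return (this.scores.get(peer) ?? 0) >= minScore;
  }

  // Prefer the highest-scored peers when requesting blocks
  rankedPeers(): PeerId[] {
    return [...this.scores.keys()].sort(
      (a, b) => (this.scores.get(b) ?? 0) - (this.scores.get(a) ?? 0)
    );
  }
}
```

In practice, gossipsub v1.1 defines a richer scoring function with decaying counters and per-topic parameters, but the effect is the same: unreliable peers are deprioritized or disconnected.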
We are refactoring the chain module in several different ways. We've cleaned up many of the submodules to be simpler and to have more robust error handling. This includes the clock, state caches, block processor, and fork choice. We previously identified several significant issues with our fork choice, namely that attestations were being weighted by incorrect balances, so we took this time to rewrite our fork choice from scratch and pull it into its own package. It is now simpler, more performant, and follows the spec much more closely. We are also fixing the memory issues we faced in the last month, where Lodestar was unable to sync Medalla due to running out of memory.
Until now, we stored ALL unfinalized states in memory. Even with data-sharing between states, this has its limits: with such a naive strategy, we will eventually run out of memory, as we saw during the 1000-epoch non-finality period on Medalla. Ultimately, some unfinalized states must be pruned from memory, but doing so introduces its own issues.
A new incoming block may require a state that has already been pruned. So, like many other clients, we had to implement machinery to 'regenerate' states on demand: if the requested state isn't in memory, find the latest ancestor state that is, and rerun blocks to reach the requested state. Of course, rerunning blocks has a cost, especially across many epoch boundaries. Lodestar can process blocks fairly quickly, but each epoch transition takes roughly 40x as long as reprocessing a single block. Based on this, we decided on the following:
Break up block processing by checkpoint/epoch. Even when processing a single block, we pause after every epoch transition. This gives other tasks (e.g., network requests) a chance to run and keeps block processing from hogging the CPU, even when processing (or reprocessing) a block after many skip slots.
Cache checkpoint states. We maintain a second state cache for checkpoint states, i.e., states at the first slot of an epoch. With this, we can rely on states being stored at regular intervals. Fetching from this cache also ensures that we only ever process each epoch transition once (as long as the checkpoint isn't pruned).
Allow state regeneration to access both state caches. With the regularity of the checkpoint cache, we bound the amount of reprocessing a single state regeneration has to perform, using both caches to minimize the number of block and epoch transitions we must replay.
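The regeneration flow described above can be sketched as follows. This is a simplified illustration, not Lodestar's actual API: the types, cache shapes, and function names are stand-ins, and real block application is elided.

```typescript
const SLOTS_PER_EPOCH = 32;

interface BeaconState {
  slot: number;
}

// Advance a state by one slot; in a real client the expensive epoch
// transition runs whenever the new slot crosses an epoch boundary.
function processSlot(state: BeaconState): BeaconState {
  return { slot: state.slot + 1 };
}

function regenerateState(
  targetSlot: number,
  stateCache: Map<number, BeaconState>, // recent unfinalized states, keyed by slot
  checkpointCache: Map<number, BeaconState> // epoch-boundary states, keyed by epoch
): BeaconState {
  // Fast path: the requested state is still in memory
  const cached = stateCache.get(targetSlot);
  if (cached) return cached;

  // Otherwise start from the nearest checkpoint state at or before the target;
  // the checkpoint cache's regular spacing bounds how far back we must go
  const targetEpoch = Math.floor(targetSlot / SLOTS_PER_EPOCH);
  let base: BeaconState | undefined;
  for (let epoch = targetEpoch; epoch >= 0; epoch--) {
    base = checkpointCache.get(epoch);
    if (base) break;
  }
  if (!base) throw new Error("no base state available for regeneration");

  // Replay slot by slot; a real implementation would also apply the blocks
  // for these slots and yield to other tasks at each epoch boundary
  let state = base;
  while (state.slot < targetSlot) {
    state = processSlot(state);
  }
  return state;
}
```

The key property is that the replay distance is bounded by the checkpoint cache's spacing, so a single regeneration never has to reprocess an unbounded number of epoch transitions.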
One area where we could still improve is in our state cache pruning strategy. Right now, we're looking at a very simple strategy:
Keep some maximum number of states in the state cache, pruning the first-inserted states first.
Keep some maximum number of states in the checkpoint state cache, pruning the earliest-epoch states first (but always keeping the finalized and justified checkpoint states).
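A sketch of the checkpoint-cache half of this pruning strategy, assuming a `Map` keyed by epoch. The function name and parameters are illustrative, not Lodestar's actual code.

```typescript
// Prune the checkpoint state cache down to maxStates entries, dropping the
// earliest epochs first while never touching protected (finalized/justified)
// checkpoint epochs.
function pruneCheckpointCache(
  cache: Map<number, unknown>, // epoch -> checkpoint state
  maxStates: number,
  protectedEpochs: Set<number> // finalized + justified checkpoint epochs
): void {
  // Prune the earliest epochs first
  const epochs = [...cache.keys()].sort((a, b) => a - b);
  for (const epoch of epochs) {
    if (cache.size <= maxStates) break;
    if (protectedEpochs.has(epoch)) continue; // never drop finalized/justified
    cache.delete(epoch);
  }
}
```

The state cache's first-in-first-out pruning is even simpler, since a JavaScript `Map` iterates in insertion order: delete from the front until the cache is back under its limit.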
Finally, we have been fixing our block production, mainly the eth1 data processing. Previously, our eth1 data fetching was critically bugged: during block production, Lodestar was mutating eth1 data received from the eth1 chain. We've now refactored the whole eth1 processing pipeline, including correctly storing and fetching deposits and eth1 data, along with several performance improvements such as fetching eth1 data in batches.
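Batched fetching of eth1 data can be sketched like this. `fetchRange` here is a hypothetical stand-in for a range-based eth1 JSON-RPC call (e.g., `eth_getLogs` over a block range); the function name and shape are illustrative, not Lodestar's actual pipeline.

```typescript
// Fetch eth1 data over [fromBlock, toBlock] in fixed-size block ranges,
// issuing one request per range instead of one request per block.
async function fetchEth1DataInBatches<T>(
  fromBlock: number,
  toBlock: number,
  batchSize: number,
  fetchRange: (from: number, to: number) => Promise<T[]>
): Promise<T[]> {
  const results: T[] = [];
  for (let start = fromBlock; start <= toBlock; start += batchSize) {
    const end = Math.min(start + batchSize - 1, toBlock);
    // One request covers a whole contiguous range of blocks
    results.push(...(await fetchRange(start, end)));
  }
  return results;
}
```

Batching like this trades a little latency per request for far fewer round trips to the eth1 node, which matters when syncing a long deposit history.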
What these last few months have taught us is that to have a stable client, we needed to do some housecleaning before moving on. So, our goal remains to complete Phase 0 with a functioning, stable client. A significant portion of the infrastructure for the light client/server is there, but without the solid foundation of a reliable, stable beacon node for the light client to request data from, we would be fighting uphill.
How to get involved!
Keep up with our GitHub, Discord, Twitter, and HackMD to stay on top of the latest developments on the project. We're always eager to expand our team and collaborate with other open source projects. Reach out if you're interested in expanding the eth2 TypeScript ecosystem and bringing eth2 to the web!
Gitcoin Grants is currently hosting another round of CLR fund-matching. Any love you can send would be forever appreciated!