Lodestar Medalla Update

Authored by Colin Schwarz

Co-Authored By Cayman Nava

With the launch of the [Medalla](https://github.com/goerli/medalla) testnet last week, the Lodestar team would like to update the community on our experience with Medalla, our current status, and plans for the near future. With over 26,000 active validators, Medalla is the largest eth2 testnet to date, and will likely be the final major testnet launched before Phase 0 goes live later this year. As such, we were eager to connect to the testnet and gain as much information as possible to improve and optimize Lodestar.

Pre-Testnet

Our work in the weeks leading up to the Medalla launch consisted mostly of improving our sync and increasing test coverage, especially end-to-end tests. Until recently, we had been focusing our efforts on our ecosystem of libraries, and still had not participated in a testnet with a full client. Much of the work revolved around syncing with Altona and iterating based on our experiences there. In the week leading up to the Medalla launch, we felt fairly confident that we would be able to connect to the testnet at genesis, provided we could hit several key milestones:

  1. Ensure that we could remain synced to the head of Altona. We determined that if we could stay synced on Altona, we would have no trouble getting and staying synced on Medalla.

  2. Ensure that we could run a stable ephemeral testnet of M validators on N beacon nodes, for small M and N. If we could drive a small chain, we would be exercising our whole stack: not just the state transition, but also committee gossip, validator-to-beacon-node interaction, etc.

  3. Ensure that we could run 256 validators on a single beacon node. In all of our end-to-end tests, we had only run a handful of validators per beacon node. We needed to stress-test what was possible, because it had been agreed that we would run 1024 validators at genesis, split between 4 beacon nodes.

In the week leading up to the launch, we sprinted to tackle the big blockers to our milestones. We also resolved many smaller issues that, while not as urgent, still severely hurt Lodestar usability: peer persistence between boots, simplified CLI initialization and testnet support, a CLI command to import launchpad validator keys, and much more. Ultimately, we were able to confirm the first two items, but not the third.

Testnet Launch

"Lodestar seems to be choking on gossip" - Cayman, Medalla block 12

The morning of launch, the team felt accomplished by the work completed and prepared for the imminent genesis event. All signs were green, save for our stress test. We had confirmed that Lodestar had properly computed the 'genesis state', the same as other clients, and we waited in anticipation. We had confirmed that we were syncing to the Altona testnet and syncing and maintaining small ephemeral testnets. We hadn't confirmed that we could tie 256 validators to a single beacon node, but felt confident that it wouldn't be an issue. Leading up to genesis, we had 4 beacon nodes in the cloud with 256 validators each, and a helpful Grafana dashboard tracking relevant health metrics.

Immediately after genesis, we realized that, in fact, we were not as accomplished and prepared as we had thought. A confluence of issues rendered our nodes unable to perform their tasks. Our dashboard was showing our nodes stuck on slot 0, and memory usage was spiking rapidly. After several minutes, several of the nodes stopped reporting metrics entirely. The logs showed that several of the nodes had exploded with OOM (out of memory) exceptions. The remaining nodes seemed to either not be receiving gossip, or be incorrectly invalidating all block gossip; their peer count was in the low single digits. Meanwhile, the validators were gleefully unaware of the beacon nodes' syncing issues, and continued requesting duties, eventually DOSing the beacon nodes with requests that triggered empty epoch transitions.

After a few epochs, the decision was reluctantly made to pull the plug on the Lodestar nodes for the time being, and run the validators using a more stable client. The network wasn't finalizing due to low participation, and we accounted for 5% of effectively offline validators.

By all tangible accounts, we had failed to successfully join the Medalla testnet at genesis. But we've learned a heck of a lot in the process.

Lessons from Medalla

Once the dust settled from the Medalla launch, the Lodestar team met to discuss the best path forward. During this planning meeting we had to first take a step back and think about the lessons learned.

Community Engagement Matters

By far the most positive thing to come from the launch is increased community engagement. We are really humbled by and grateful for the big increase in community interest and participation in Lodestar, which began leading up to the Medalla launch and continues today! The feedback of users and developers is invaluable for us to continue improving Lodestar. We have received so much interest on our Discord that we just opened a dedicated channel to field questions and assist anyone trying to run or otherwise work with Lodestar.

Moving forward, we're committing to increasing the communication we have with the community.

We Weren't Ready

While we've contributed a large number of high-quality TypeScript libraries to the eth2 ecosystem, our full client simply isn't ready for production use. We're very grateful to be a part of this diverse community of eth2 client teams, and relieved that our Medalla performance has barely had an effect on the testnet. Lodestar, the full client, has come a long way, but it still has a lot of rough edges that we need to work through. For now, Lodestar is alpha software, available mainly to those on the frontier who aren't afraid of getting their hands dirty.

Moving forward, we've identified many areas of improvement, and we will continue to work towards stable testnet usage. We will be very hesitant to signal our participation in coming testnets if there is any shred of doubt over the stability of our code.

Next Steps

Lodestar's future is bright, and we're putting our best foot forward. During our last planning meeting, we identified major areas where Lodestar needs the most work: Networking, Syncing, and Error handling. We also need to consider our grant-defined mandate of exploring the eth2 light client landscape.

Networking

Several of our early Medalla woes were networking-related. Our low peer count meant that we rarely received incoming blocks, and our faulty and poorly logged gossip validation exacerbated the lack of gossip. We also noticed that validating attestations required significant processing in some cases.

We will be examining the causes of our low peer count and giving more love to our gossip validation code. These are likely small to medium-sized changes that don't require significant refactoring.

Syncing

Syncing has been a consistent source of headache for us, both leading into Medalla and now. We have a naive algorithm for syncing that is brittle and overly complex. It's also easily stalled and derailed by unresponsive or buggy peers. We will be tackling this issue in two phases: a short-term and a long-term solution.

Short-term, we will continue finding ways to incrementally enhance our sync, to make it "good enough" for practical use.

Long-term, we need to rethink how we're approaching sync entirely. We would like to track attribution of incoming blocks, and use this to track peer behavior. This will allow us to evict poorly performing and buggy peers, favor responsive peers and share requests across peers without active requests. We will write up a design document, learning from the best aspects of our and other implementations, and eventually migrate our sync code when it proves reliable and robust.
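As a rough illustration of the attribution idea described above, here is a minimal TypeScript sketch. It is not Lodestar's sync code and not the planned design: `PeerTracker`, its method names, the score deltas, and the eviction threshold are all hypothetical, chosen only to show how remembering which peer delivered a block can feed peer selection and eviction.

```typescript
// Hypothetical sketch: none of these names come from the Lodestar codebase.
type PeerId = string;

interface PeerStats {
  score: number; // running behaviour score (assumed scale)
  activeRequests: number; // block requests currently in flight
}

class PeerTracker {
  private readonly peers = new Map<PeerId, PeerStats>();
  // Attribution: which peer delivered each block root.
  private readonly blockSource = new Map<string, PeerId>();
  private readonly evictBelow: number;

  constructor(evictBelow = -3) {
    this.evictBelow = evictBelow; // assumed threshold, not a tuned value
  }

  addPeer(peer: PeerId): void {
    if (!this.peers.has(peer)) {
      this.peers.set(peer, {score: 0, activeRequests: 0});
    }
  }

  hasPeer(peer: PeerId): boolean {
    return this.peers.has(peer);
  }

  // Favor the best-scoring peer among those without an active request.
  selectPeer(): PeerId | undefined {
    let best: PeerId | undefined;
    let bestScore = -Infinity;
    for (const [peer, stats] of this.peers) {
      if (stats.activeRequests === 0 && stats.score > bestScore) {
        best = peer;
        bestScore = stats.score;
      }
    }
    if (best !== undefined) {
      this.peers.get(best)!.activeRequests++;
    }
    return best;
  }

  // Record who sent the block and free that peer for another request.
  onBlockReceived(peer: PeerId, blockRoot: string): void {
    this.blockSource.set(blockRoot, peer);
    const stats = this.peers.get(peer);
    if (stats) stats.activeRequests = Math.max(0, stats.activeRequests - 1);
  }

  // Once the block is processed, credit or penalize the peer that sent it.
  onBlockProcessed(blockRoot: string, valid: boolean): void {
    const peer = this.blockSource.get(blockRoot);
    if (peer === undefined) return;
    this.blockSource.delete(blockRoot);
    this.adjust(peer, valid ? 1 : -2);
  }

  // An unresponsive peer loses its in-flight slot and some score.
  onTimeout(peer: PeerId): void {
    const stats = this.peers.get(peer);
    if (!stats) return;
    stats.activeRequests = Math.max(0, stats.activeRequests - 1);
    this.adjust(peer, -1);
  }

  private adjust(peer: PeerId, delta: number): void {
    const stats = this.peers.get(peer);
    if (!stats) return;
    stats.score += delta;
    if (stats.score <= this.evictBelow) this.peers.delete(peer); // evict
  }
}
```

In this sketch a sync loop would call `selectPeer()` before each request, `onBlockReceived()` when a response arrives, and `onBlockProcessed()` after validation, so that buggy peers drift toward eviction while responsive ones are picked first.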

Error Handling

Our codebase has had several areas where we lack error handling. A stray uncaught Error can cause chaos inside our client. In many other languages, an uncaught error may cause a full-on panic and shut the program down; in JavaScript, an uncaught error may simply cause one section of the program to fail silently. Fixing this involves adjusting our linter rules and our best practices around how we throw and catch errors in the application.
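One way to picture the kind of practice this points at is a small wrapper that never lets an async task fail unobserved. The sketch below is an assumption for illustration only: `runGuarded` and `ErrorReporter` are hypothetical names, not part of Lodestar, and the post does not say which linter rules the team chose (a rule such as `@typescript-eslint/no-floating-promises` would be one candidate for flagging promises that nobody awaits or catches).

```typescript
// Hypothetical helper; the names are illustrative, not Lodestar's API.
type ErrorReporter = (context: string, error: Error) => void;

// Runs an async task and guarantees that any failure is reported,
// instead of being lost in a promise nobody awaits.
// Resolves to true on success and false if the task threw.
async function runGuarded(
  context: string,
  task: () => Promise<void>,
  report: ErrorReporter
): Promise<boolean> {
  try {
    await task();
    return true;
  } catch (e) {
    // Normalize non-Error throwables so the reporter always gets an Error.
    report(context, e instanceof Error ? e : new Error(String(e)));
    return false;
  }
}
```

A caller would pass a label for the subsystem (for example `"gossip"`), the task itself, and a reporter that logs or counts the failure, so that a broken section of the program is at least visible in logs and metrics.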

Light Client Prototyping

In the medium term, we will be prototyping the phase 1 and light client specs. We've been tasked by our grantors (EF, Moloch, Gitcoin) to build out light clients (and eth2 libraries), and we haven't forgotten the call. We have spent considerable effort in building a productionized beacon chain because we truly believe it has helped us, and will continue to help us, toward our goal of building browser-ready light clients. Much of the infrastructure and recent improvements we've made to Lodestar will be usable in the light client realm, and we want to build off a solid foundation: our phase 0 implementation.

Contributing

We welcome any intrepid TypeScript developers or enthusiasts to join us in our Discord. We run our project like the rest of Ethereum, having the vast majority of our communications in public channels and on GitHub. We're always looking for passionate new team members and contributors of all kinds. Come chat with us to see how you can get involved!