74.
this part is even more crazy. they do moe_output = (routed_output + shared_output)/2 ??? wouldn't this be a really bad init for experts? the model would be so incentivized to use shared expert capacity and the routed experts would need to l
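
to make the arithmetic concrete, here's a rough numpy sketch of that combine step. only the `(routed_output + shared_output)/2` formula comes from the post; the function names, the shapes, the all-zero routed branch, and the plain-sum variant used for comparison are assumptions for illustration, not the model's actual code.

```python
import numpy as np

def moe_combine_avg(routed_output, shared_output):
    # the averaging quoted in the post: each branch gets a fixed 0.5 weight
    return (routed_output + shared_output) / 2.0

def moe_combine_sum(routed_output, shared_output):
    # plain sum of the two branches, included only as a point of comparison
    # (assumption: not taken from the post)
    return routed_output + shared_output

rng = np.random.default_rng(0)
shared = rng.standard_normal((4, 8))  # hypothetical (tokens, hidden) shape
routed = np.zeros((4, 8))             # hypothetical case: routed experts contribute nothing

avg = moe_combine_avg(routed, shared)
total = moe_combine_sum(routed, shared)

# with a silent routed branch, averaging passes through half of the shared
# output, while the plain sum passes it through unchanged
print(np.allclose(avg, 0.5 * shared))
print(np.allclose(total, shared))
```

this only shows the scaling of the combine step; it says nothing on its own about what happens during training.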