Alibaba Cloud has revealed the design of an Ethernet-based network it created specifically to carry traffic for training large language models – and which has been in production for eight months.
The Chinese cloud also revealed that its choice of Ethernet was based on a desire to avoid vendor lock-in and leverage "the power of the entire Ethernet Alliance for faster development" – a decision that supports arguments from a group of vendors trying to attack Nvidia's networking business.
Alibaba's plans were revealed in a GitHub post by Ennan Zhai – a senior staff engineer at Alibaba Cloud who focuses on network research. Zhai published a paper [PDF] to be presented at the August SIGCOMM conference – the annual gathering of the Association for Computing Machinery's special interest group on data communications.
Entitled "Alibaba HPN: A Data Center Network for Large Language Model Training", the paper opens with the observation that general cloud traffic "… generates millions of small flows (e.g. below 10 Gbit/sec)", while LLM training "produces a small number of periodic, bursty flows (e.g. 400 Gbit/sec) on each host."