Inference performance beyond the H100! 21-year-old Chinese Harvard dropouts develop the AI accelerator chip "Sohu"; their two-person company is valued at $34 million

Original source: Xinzhi Yuan

Image source: generated by unbounded AI

Is another legendary founding story like Pika's about to play out again?

Two college dropouts want to build an AI accelerator chip dedicated to large language models. It is slated for delivery in the third quarter of 2024, with inference performance claimed to be 10 times that of the H100.

In June of this year, founders Gavin Uberti and Chris Zhu established Etched.ai and raised a $5.36 million seed round whose investors include former eBay CEO Devin Wenig.

The company's valuation is as high as $34 million!

According to figures released on the company's official website, the chip will bake the Transformer architecture directly into the hardware, delivering inference 8-10x faster than Nvidia's H100!

They named their first LLM accelerator chip "Sohu," claiming it can process thousands of words in milliseconds.

The chip also supports writing better code via tree search, comparing hundreds of responses in parallel.
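
As a rough sketch of what such parallel comparison looks like in software today, here is a minimal best-of-N search, the simplest degenerate form of the tree search described above. The `generate` and `score` functions are toy stand-ins for illustration only, not Etched's API:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, seed: int) -> str:
    # Stand-in for a real LLM sampling call; returns one candidate completion.
    rng = random.Random(seed)
    return f"{prompt} -> candidate #{seed} (quality {rng.random():.3f})"

def score(candidate: str) -> float:
    # Stand-in for a verifier, e.g. running unit tests on generated code.
    return float(candidate.rsplit(" ", 1)[-1].rstrip(")"))

def best_of_n(prompt: str, n: int = 256) -> str:
    # Sample n candidate responses in parallel, then keep the best-scoring one.
    with ThreadPoolExecutor(max_workers=32) as pool:
        candidates = list(pool.map(lambda s: generate(prompt, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("write a sort function", n=8))
```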

It also supports multicast speculative decoding, generating new content in real time.
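
For context, plain speculative decoding works by letting a small draft model propose several tokens that the large model then verifies in a single batched forward pass, keeping the longest accepted prefix. Below is a minimal illustrative sketch with toy stand-in models, not Etched's implementation:

```python
def draft_propose(prefix: list[int], k: int) -> list[int]:
    # Stand-in for a small, fast draft model proposing k tokens.
    return [(prefix[-1] + i + 1) % 50257 for i in range(k)]

def target_verify(prefix: list[int], proposal: list[int]) -> list[int]:
    # Stand-in for the large model: it scores all proposed tokens in ONE
    # batched forward pass and returns the accepted prefix of the proposal.
    accepted = []
    for tok in proposal:
        if tok % 3 != 0:        # toy acceptance rule, for illustration only
            accepted.append(tok)
        else:
            break
    return accepted

def speculative_decode(prompt: list[int], max_new: int = 32, k: int = 4) -> list[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposal = draft_propose(out, k)
        accepted = target_verify(out, proposal)
        out += accepted
        if len(accepted) < len(proposal):
            # On rejection the target model supplies one corrected token;
            # a toy stand-in value keeps this sketch self-contained.
            out.append((proposal[len(accepted)] + 1) % 50257)
    return out

print(speculative_decode([1, 2, 3]))
```

When most draft tokens are accepted, the large model produces several tokens per forward pass instead of one, which is where the speedup comes from.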

According to the official details, the chip has only a single core but is equipped with 144GB of HBM3E memory:

- Fully open-source software stack, scaling to 100T-parameter models

- Real-time beam search and MCTS decoding

- Support for MoE and Transformer variants

Two Harvard undergraduate dropouts challenge the chip industry's hottest business

The two originally planned to take a year off from Harvard, and took jobs at a chip company working on the Apache TVM open-source compiler and microkernels.

But on the job, they found that certain inefficient designs in the Arm instruction set were dragging down the performance of their code.

While thinking about how to solve this problem systematically, they realized the same idea could be applied to designing an accelerator chip for today's exploding AI workloads.

In the view of co-founder Uberti, a general-purpose design can never reach the performance gains of the dedicated accelerator chip they are developing:

"You must have a strong energy in a single structure to let the chip processing AI tasks. The goal is too large. We must design the chip for more specific tasks … We think Nvidia will eventually do so."

In their eyes, this market is an opportunity too big to miss.

"If you look back at GPT-2 four years ago, there are only two differences compared to Meta’s recent LLAMA model-size and activation functions. There are differences in training methods, but this is not important for reasoning."

The basic building blocks of the Transformer are fixed. Minor variations aside, they are not worried that a new architecture will displace the Transformer in the short term.

So they decided to build an application-specific integrated circuit (ASIC) for the Transformer architecture, taking on Nvidia and the other chip giants in the future large-model inference market.

They believe Etched.ai's first chip will deliver 140 times the throughput per dollar of the H100!

What kind of background lets two undergraduates who haven't even graduated dare to challenge the hottest track in the chip industry?

Founder and CEO Gavin Uberti had been working part-time off campus ever since entering Harvard in 2020; at the end of 2022, Etched.ai was established.

Before university, he competed in the FIRST Tech Challenge, the best-known youth science and technology competition in the United States; his team won a top-10 award, and its entry ranked second among 600 participating teams.

The other founder, Chris Zhu, likewise stacked up internships off campus while studying at Harvard, and had already become a part-time instructor before graduating.

AMD MI300X takes on Nvidia H100

Nvidia and AMD have recently been fighting in full public view, even trading official blog posts.

Not long ago, AMD released its most powerful AI chip, the MI300X.

AMD's slides showed that a server built from eight MI300X chips runs large-model inference up to 1.6 times faster than an H100 server of the same size.

Such a direct benchmark against Nvidia is rare for AMD.

Nvidia quickly responded with a blog post arguing that AMD's evaluation was not objective.

Nvidia said that if the H100 is benchmarked correctly with optimized software, its performance exceeds the MI300X.

To make its point, Nvidia showed a comparison of the two GPUs on Llama 2 70B using TensorRT-LLM's optimized settings.

With the batch size set to 1, the H100 reaches twice the performance of the MI300X.

And at the same 2.5-second latency target that AMD used, the H100's throughput can reach 14 times that of the MI300X.

Nvidia said AMD's alternative software does not support Hopper's Transformer Engine and ignores key optimizations in TensorRT-LLM, all of which are freely available on GitHub.

AMD refuses to back down

Seeing this, AMD posted a response of its own: if optimizations are fair game, then let everyone use them.

Even then, AMD claims, the MI300X still outperforms the H100 by 30%:

1. Using vLLM with FP16, AMD's latest optimizations widen its advantage from the 1.4x shown at the launch event to 2.1x.

2. Against an H100 optimized with TensorRT-LLM, the MI300X running vLLM achieves a 1.3x latency improvement.

3. Against an H100 running low-precision FP8 with TensorRT-LLM, the MI300X running vLLM at higher-precision FP16 still wins on absolute latency.

AMD pointed out that Nvidia benchmarked the H100 with its own TensorRT-LLM stack rather than the more widely used vLLM.

Moreover, Nvidia's numbers focus on throughput while ignoring the latency seen in real workloads.
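
The throughput-versus-latency distinction matters because batching trades one for the other. A toy calculation with made-up constants (not either vendor's numbers) shows how growing the batch raises total tokens per second while slowing every individual request:

```python
# Toy model of batched LLM serving: per-step time grows with batch size,
# but each step produces one token for every request in the batch.
# All constants below are illustrative assumptions, not vendor data.
STEP_BASE_MS = 20.0      # fixed per-step overhead (weight loads, kernel launch)
STEP_PER_REQ_MS = 2.0    # marginal cost of each extra request in the batch

def serving_stats(batch_size: int, tokens_per_request: int = 250):
    step_ms = STEP_BASE_MS + STEP_PER_REQ_MS * batch_size
    throughput = batch_size * 1000.0 / step_ms          # tokens/sec, whole server
    latency_s = tokens_per_request * step_ms / 1000.0   # time to finish one request
    return throughput, latency_s

for bs in (1, 8, 64):
    tput, lat = serving_stats(bs)
    print(f"batch={bs:3d}  throughput={tput:8.1f} tok/s  latency={lat:6.1f} s")
```

Running this, batch 64 delivers roughly ten times the throughput of batch 1 but makes each request several times slower, which is why a throughput-only comparison can hide a poor interactive experience.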

Finally, AMD said it chose FP16 because it is widely used, and because vLLM does not support FP8.

The GPU war turns white-hot

In the AI accelerator field, some companies already build specialized architectures for specific workloads.

In data centers, specialized architectures have mostly targeted DLRMs (deep learning recommendation models), because GPUs struggle to accelerate such workloads.

Meta recently announced that it has built its own DLRM inference chip and deployed it widely.

For accelerating the Transformer architecture, Nvidia's answer is Transformer Engine, a software capability deployed on the H100 GPU.

Transformer Engine lets LLMs run inference without a separate quantization step, greatly accelerating LLM inference on the GPU.
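
As a rough illustration of this software-level approach, here is a minimal sketch using the publicly documented `transformer_engine.pytorch` API; the layer sizes and recipe settings are arbitrary assumptions, and a real model would wrap many such layers:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A TE layer is a drop-in replacement for torch.nn.Linear with FP8 support.
layer = te.Linear(768, 3072, bias=True)
x = torch.randn(2048, 768, device="cuda")

# The recipe controls FP8 scaling; E4M3 is the format TE documents for
# forward-pass tensors. The values here are illustrative defaults.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Inside the autocast context, the matmul runs through FP8 Tensor Cores
# on Hopper-class GPUs, with scaling handled automatically.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)
```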

What Etched.ai wants to do is go one step further and bake this design into the hardware itself, pushing LLM inference speed and energy efficiency even higher.

More importantly, the reason investors are willing to put serious money behind two undergraduates is the consensus that LLM inference is still far too expensive, so there must be room for innovation.

Beyond star startups like this one, the established giants also have high hopes for the large-model inference market.

Lisa Su has said repeatedly, on multiple occasions, that the future large-model inference market will be far bigger than the model training market, and AMD has stressed that its products are fully prepared for it.

Judging from this first head-on clash between Nvidia and AMD, competition in the GPU field is clearly intensifying.

At present, besides AMD's challenge, Nvidia also has to reckon with the rapid progress of Intel and Cerebras.

Just on December 14, CEO Pat Gelsinger showed off Intel's latest AI chip, Gaudi 3, which moves to a 5nm process and delivers a 1.5x performance improvement.

Compared with the previous-generation Gaudi 2, Gaudi 3's BFloat16 performance increases 4x, its compute power doubles, and its memory capacity grows 50% to 144GB of HBM3 or HBM3E.

Nvidia, for its part, plans to launch its GH200 superchip early next year.

Given the fierce competition, companies that have announced data-center deployments, such as Microsoft, Meta, and Oracle, may come to treat AMD as an alternative supplier.

Gelsinger predicts that by 2027 the GPU market will reach a staggering $400 billion, which will undoubtedly provide a broad stage for fierce competition.

Cerebras Systems CEO Andrew Feldman is even more ambitious: "We are trying to surpass Nvidia. By next year, we will have built up to 36 exaFLOPS of AI compute."
