The AI videos you can find everywhere are not as simple as you might think.
"Suspected use of AI generation technology, please scrutinize carefully."
Has anyone noticed that this little disclaimer, rather like the "images are for reference only, please refer to the actual product" fine print, has become more and more common in daily life?
Especially on today's short-video platforms.
I watched a video of a little cat, only to learn that its mouth movements had been synthesized by AI.
(Source: Douyin user @墩墩吃不饱)
"Watch Journey to the West, you can see the true identity of the AI fairy."
Even watching an animation, the visuals are always synthesized by AI.
There are more and more AI videos now.
Although Sora, which sparked the AI video craze, is still mired in difficulties, the era of AI video may have quietly arrived.
A research report from Soochow Securities predicts that the potential market for AI video generation in China could exceed 580 billion RMB.
But as the market heats up, some of the industry's problems are gradually coming into public view:
video plus AI may well be the future, but issues of cost, quality, collaboration, and performance hang over startups and big companies alike.
Take cost and quality, for example.
Everyone knows that training today's large generative models, especially video models, requires massive amounts of data.
And super-large-scale video training data in turn places tremendous demands on compute, on data processing, and on the data itself, sending the associated costs through the roof.
GPT-4, a model of the previous generation, reportedly had a development cost of "only" 1 billion US dollars, with training costs estimated at around 78 million US dollars.
Video models cost even more to train. Take Sora, the video model unveiled at the beginning of the year: the compute it needs is estimated at 4.5 times that of GPT-4 for training, and nearly 400 times for inference.
Beyond the sky-high training cost, large-model training also involves sample quality, long processing chains, and multiple stages, and it demands collaboration across teams. Whether self-developed or outsourced, it spans heterogeneous computing resources such as GPUs, CPUs, and ARM chips, all of which have to be flexibly scheduled and deployed.
So for the many companies preparing to embrace AI video, the most urgent task is to solve these problems and evolve faster.
And when it comes to experts at video, Douyin and Volcano Engine have plenty of say in the matter.
At the end of last month, ByteDance released PixelDance, a video model that promptly went viral, and the results were very impressive.
We wrote a whole article about it at the time; just look at the videos we generated in a quick test and you can tell it really has something.
At the Volcano Engine video cloud technology conference that wrapped up on the 15th, a custom digital human of Tan Dai addressed the audience at the opening.
It was so convincing that many attendees thought it was a pre-recorded video of the real person.
Behind these high-quality AI outputs sits an intelligent framework called BMF.
Volcano Engine worked with ByteDance's in-house large-model team to preprocess massive amounts of video data.
Then, on top of Volcano Engine's audio/video processing platform and the BMF framework, they produced enough high-quality video material for training in a short time, and that's how PixelDance came about.
So why can BMF, one of the heroes behind the scenes, pull all this off?
An everyday analogy helps explain:
A company developing large-scale models is like your family preparing a big New Year's Eve dinner.
In order to have a hearty Chinese New Year's Eve dinner, your dad is responsible for buying groceries, mom cooks, grandma makes dumplings, and you move tables, chairs, and stools...
Everyone has their own job and works flat out from start to finish; now and then, someone has to phone or WeChat someone else for help. And when you finally do the math, oh my goodness, the dinner cost that much money.
BMF is the "one-click New Year's Eve dinner package" launched by Volcano Engine: a set of tools and services that let you put the whole dinner together quickly and painlessly.
This package zeroes in on the four pain points we discussed earlier, with a fix for each.
For example, to tackle video training data quality, they used multiple algorithms to analyze and screen videos along several dimensions, achieving thorough, fine-grained filtering, roughly in the spirit of the sketch below.
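The article doesn't disclose which algorithms Volcano Engine actually uses, but conceptually this kind of multi-dimensional screening behaves like a chain of per-dimension scorers and thresholds. Here is a minimal hypothetical sketch in Python; the Clip fields, scores, and thresholds are stand-ins of ours, not the real pipeline:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Clip:
    path: str
    # In a real pipeline these scores would come from dedicated models.
    clarity: float      # e.g. sharpness / compression-artifact score in [0, 1]
    aesthetics: float   # e.g. learned aesthetic score in [0, 1]
    motion: float       # e.g. normalized optical-flow magnitude in [0, 1]

# One check per quality dimension; a clip must pass all of them.
FILTERS: list[Callable[[Clip], bool]] = [
    lambda c: c.clarity >= 0.6,        # drop blurry or heavily compressed clips
    lambda c: c.aesthetics >= 0.5,     # drop visually poor clips
    lambda c: 0.1 <= c.motion <= 0.9,  # drop near-static or chaotic footage
]

def screen(clips: list[Clip]) -> list[Clip]:
    """Keep only clips that pass every dimension's filter."""
    return [c for c in clips if all(f(c) for f in FILTERS)]

# b.mp4 fails the clarity check, so only a.mp4 survives.
print(screen([Clip('a.mp4', 0.8, 0.7, 0.4), Clip('b.mp4', 0.3, 0.9, 0.5)]))
```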
In response to performance challenges, they used the BMF framework's flexible scheduling to allocate compute resources ahead of time.
This is like preparing for a big New Year's Eve dinner early on - having a strategy in place, entrusting the grocery shopping to a delivery service, hiring a five-star chef to cook, and arranging for dedicated helpers to set up the tables and chairs...
In short, with the BMF framework it's convenient, worry-free, and cost-effective.
And having taken care of enterprises' needs, Volcano Engine hasn't forgotten ordinary users either.
These days, the computing power of consumer devices keeps climbing, and with it the demand for better video quality on the device itself.
Here Volcano Engine has a unique advantage: every day it processes massive volumes of video and images on apps such as Douyin and Xigua Video, in front of hundreds of millions of users.
Drawing on that experience, it built BMF-Lite on top of BMF: a lighter, more efficient, and more universal version better suited to ordinary users' devices.
For example, unlike the cloud, the client side is highly sensitive to power consumption and memory, and it spans multiple platforms such as Android, iOS, and PC.
So BMF-Lite is designed around cross-platform, resource-reusing algorithm packages.
Simply put, it unifies the interface formats across platforms, making integration and deployment more convenient.
It also reuses the same algorithm instance through an algorithm controller: in Douyin playback, on-demand and live streaming share one instance, since usually only one of them is active at a time, maximizing resource reuse. The sketch below shows the idea.
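In spirit, that controller works like a shared-instance cache: whichever playback scene asks for an algorithm gets the same underlying object. A hypothetical Python sketch (the class and method names are ours, not BMF-Lite's actual API):

```python
class SuperResolution:
    """Stand-in for an on-device enhancement algorithm that is costly to create."""
    def process(self, frame):
        return frame  # a real implementation would upscale the frame

class AlgorithmController:
    """Hands out one shared instance per algorithm class across playback scenes."""
    def __init__(self):
        self._instances = {}

    def acquire(self, algo_cls):
        # On-demand and live playback both come through here; since at most
        # one scene is active at a time, they can safely share one instance.
        if algo_cls not in self._instances:
            self._instances[algo_cls] = algo_cls()
        return self._instances[algo_cls]

controller = AlgorithmController()
vod_sr = controller.acquire(SuperResolution)   # video-on-demand playback
live_sr = controller.acquire(SuperResolution)  # live-stream playback
assert vod_sr is live_sr  # one instance, so the memory cost is paid only once
```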
Besides the intelligent framework represented by BMF, Volcano Engine has also put forward intelligent computing power and intelligent codecs.
On the computing-power side, Volcano Engine went ahead and launched a self-developed video transcoding chip last year, with its in-house video codec technology built in.
The advantage is that this transcoding chip has a higher computing density for specific scenarios such as video on demand and live streaming.
In other words, a single set of these codec-chip servers matches the transcoding capability of hundreds of CPU servers.
And since the transcoding chip went live on Douyin, real-world data shows that at the same video compression efficiency it cuts costs by more than 95%.
At the codec layer, Volcano Engine introduced the BVE1.2 encoder on top of the self-developed transcoding chip.
This encoder boldly brings deep learning into the mix: an intelligent hybrid coding solution that fuses traditional compression techniques and deep-learning-based compression into one whole, greatly improving codec efficiency and performance.
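The article doesn't detail how BVE1.2 combines the two, but one common way to hybridize codecs is a per-block rate-distortion decision between the traditional path and the learned path. The sketch below is purely conceptual, with dummy cost functions standing in for real coding passes:

```python
# Per-block rate-distortion decision: pick whichever coding path minimizes
# the cost J = D + lambda * R (D = distortion, R = bits spent).
LAMBDA = 0.1  # trade-off weight between distortion and bitrate

def traditional_cost(block: dict) -> tuple[float, float]:
    # Stand-in for a classical transform / quantize / entropy-code pass.
    return block['d_trad'], block['r_trad']

def learned_cost(block: dict) -> tuple[float, float]:
    # Stand-in for a neural-network compression pass.
    return block['d_nn'], block['r_nn']

def choose_mode(block: dict) -> str:
    d_t, r_t = traditional_cost(block)
    d_n, r_n = learned_cost(block)
    j_trad = d_t + LAMBDA * r_t
    j_nn = d_n + LAMBDA * r_n
    return 'traditional' if j_trad <= j_nn else 'learned'

# A block where the neural path wins on distortion at a similar bitrate:
print(choose_mode({'d_trad': 2.0, 'r_trad': 100, 'd_nn': 1.2, 'r_nn': 105}))
# -> 'learned' (J = 12.0 for the traditional path vs 11.7 for the learned one)
```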
Its strength speaks for itself: at the recently concluded 6th Challenge on Learned Image Compression (CLIC), the BVE1.2 encoder took home two championships.
After this combo of punches, interested vendors probably have only one thing on their minds: "Where do I scan to pay?"
And sure enough, Volcano Engine has no intention of keeping any of this to itself.
As mentioned earlier, the BMF framework was open-sourced as early as last year, and the updated BMF-Lite has been open-sourced as well.
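For a taste of what the open-sourced BMF looks like in use, here is a minimal transcoding pipeline roughly following the Python quick-start in the BMF repository (github.com/BabitMF/bmf); the file paths and scale parameters are only illustrative:

```python
import bmf

# Build a processing graph: decode -> scale -> re-encode.
graph = bmf.graph()
video = graph.decode({'input_path': 'input.mp4'})

(
    bmf.encode(
        video['video'].scale(1280, 720),  # downscale the video stream
        video['audio'],                   # pass the audio stream through
        {'output_path': 'output.mp4'}
    ).run()
)
```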
Overall, large models are still evolving, and the competition among video AI companies will only get fiercer.
But if everyone just works behind closed doors, competing on results and letting the products do the talking, the field ends up with plenty of rivalry and too little cooperation.
And Volcano Engine, the platform of ByteDance, arguably the company that plays short video best in all of China, keeps opening up its internal technologies and frameworks as open source.
The BMF framework in particular arrives with a complete set of intelligent infrastructure combining intelligent computing power and intelligent codecs, which genuinely saves enterprises time and money and better supports getting AI projects built and shipped.
This coexistence of competition and cooperation is exactly what China's AI scene most wants to see.
After all, a single tree doesn't make a forest; a hundred flowers in bloom are what make a spring.