For a whole year, AI painting has remained a hot topic. People are amazed that large models have become “painters”, which has made AIGC (AI-generated content) a popular track for capital. “The next wave will definitely be video, audio, and 3D content,” Stability AI’s CTO Tom Mason says firmly, and the company is currently working on models for generating video and audio.
The model that brought AI painting to its peak is Stable Diffusion, an unprecedented open source model that is open to everyone and generates images quickly and well. Its path to birth was also unusual: it came from the open source community, not from a large company. Its main enabler is Stability AI. The company regards itself as one contributor to the community and hopes to explore a path on which open source, AI models, and the community move forward together.
Founded in 2019, Stability AI became a unicorn valued at over US$1 billion in less than three years. Stability AI focuses on generative models, which it believes will be an important part of the Internet in the future. If the previous generation of AI algorithms brought about ad recommendation, then “what generative AI is doing is turning consumers into creators, giving them the ability to create the media content they consume themselves.”
At this year’s IF 2023, Geek Park invited Tom Mason, CTO (Chief Technology Officer) of Stability AI, who described from his own perspective how AIGC went from obscurity to a sudden boom and gave his predictions on how AIGC will affect the world. The following is a record of his talk at the conference, compiled and released by Geek Park.
(At the Geek Park Innovation Conference IF 2023 site, Geek Park Innovation Business Director Ashley interpreted Tom Mason’s interview video.)
Open source is the foundation of Stability AI
Geek Park: Could you briefly introduce yourself? What experience did you have before coming to Stability AI?
Tom Mason: I spent the previous 15 years running a technology company in London, developing many different platforms for major car companies and providing technical support to many start-ups. Before I came into contact with AI and Stability AI, I worked in many fields.
About 2 years ago, I started working with an open source community to develop a Python notebook called Disco Diffusion for generating animations and images. It’s a great community, and I worked with them for months building new tools for non-technical users. Through that notebook and a parallel product that became Dream Studio, I discovered Stability AI, which is the company I work for now.
Geek Park: What kind of company is Stability AI?
Tom Mason: At Stability AI we are very committed to open source; in a way, that is where our roots lie. We help support about 8 open source communities totaling over 100,000 members, each specializing in a different modality, from language to images, audio, video, and 3D.
We provide computing support and fund some researchers. We have a very large compute cluster: we now have 4,000 A100 nodes on AWS and 1,500 nodes elsewhere. These clusters are growing very rapidly, and we open the resources to researchers so that they can train models, which are eventually open-sourced. So you could say that Stability AI is a foundational platform whose pillars are these open source communities.
At the same time, our internal teams support them as well: one team builds HPC supercomputing and manages these compute clusters, the data team helps with data, and a cross-functional team assists with work across communities.
In addition, we have a very large infrastructure department, whose work is mainly to develop APIs and products. We publish the APIs and the products we build to the world through our platform website.
Geek Park: Why is open source important?
Tom Mason: I started this journey through open source AI technology. To me, open source AI seems almost too amazing to be true. From my own home, I can log into an open source community, interact with a model that has all this information in it, do cool things with it, and build tools on top of it. It feels like a leap forward. All of this is advancing the course of human history, and we are living in an amazing time. These open source gifts to humanity allow us to build better and greater things. I am truly honored to take part in this process, to be a member of this company, and to witness this moment.
Take the release of Stable Diffusion. Opening up a model of such scale and complexity is not an easy thing to conceive of or to do, but it really happened, and it brought about an explosion of creativity.
Every morning I wake up and see 10 different new projects online. A lot of people are doing amazing things, and every little project has the potential to become a new company or a new open source community.
Geek Park: Will Stability AI become an institution like OpenAI?
Tom Mason: OpenAI is very focused on AGI (Artificial General Intelligence). That is definitely not our goal. We want to build good generative models, because generative AI is likely to have a greater impact. There are already many theories about how it will be implemented, especially through language models, video models, and other models with temporal information.
AGI is not our focus right now. We focus only on building useful generative models for different modalities, supporting the customization of these models with large datasets, and supporting open source. This is the main difference between us and OpenAI. We are absolutely 100% committed to making our models open source and making this technology public so that people around the world can use it without any restrictions. That is a really big deal, because the technology is so revolutionary.
(The latest version of Stable Diffusion is public｜Source: Stability AI official website)
Consumers become creators
Geek Park: AIGC has received unprecedented attention this year. In your view, what were the key moments leading up to this boom?
Tom Mason: I think one important turning point in the AI field was the 2017 publication of the Transformer paper, “Attention Is All You Need”. The paper introduced the concept of the attention mechanism, which made neural networks more popular. Then, building on the Transformer, a lot of research appeared in image generation, and diffusion models emerged from it. What started as Latent Diffusion, and is now Stable Diffusion, was originally developed by the CompVis team.
Geek Park: In this process, how was Stable Diffusion born?
Tom Mason: The next two important turning points are datasets and computing power. One of the projects we support, LAION, focuses on collecting and building massive datasets. They now have a multilingual dataset of 5 billion image-text pairs, of which 2 billion are images captioned in English.
From those 2 billion, we selected a subset of about 1 billion for Stable Diffusion. Work on the dataset started 2 or 3 years ago, and its scale grows every year. The size of the dataset is very important: apart from LAION, no other available dataset has this scale. So when the CompVis team and the LAION team started collaborating, this neural network was born.
The third key element is having enough computing power. Previously, academic and open source researchers had to apply for computing resources through university networks or other companies that provide them. Stability AI now has the tenth or eleventh largest private supercomputer in the world. We are making these resources available to open source researchers who need them, so they now have the ability to train the world’s largest models, rivaling any other company. This is very helpful to the community: it gives them the resources to do research and development, hence the impressive models being released now, a trend I believe will only grow.
As we come to 2023, more modalities will be involved, such as video. Models will keep getting larger and datasets will keep getting larger, so this trend will likely continue.
(Tom Mason’s sharing at the IF 2023 conference.)
Geek Park: This year, text-to-image generation has been very eye-catching. How will content production change next?
Tom Mason: The next wave is definitely video, audio and 3D. The explosion and popularity of language models and image models actually stem from the openness of datasets. We were able to extract large amounts of text from the internet and use it to train image models. This is an important reason for the explosive development of image and language models in the past few years. Video models have begun to emerge, and they also depend on large-scale, labeled, clean datasets so that the models can be trained efficiently.
This is the area we are focusing on now, and audio is similar. We have a team called Harmonai that is working on text and audio. At this stage, the output of the trained model is already very good, and it can generate audio from text input, so this is a very exciting field. My personal passion lies in video and animation. I was already doing work in this area before joining Stability.
There are not enough video and audio datasets on the Internet, and fixing that is our top priority. We should build these datasets through partnerships.
A large amount of video content is copyrighted by large film companies and streaming media companies, so it is very important for us to help those companies use their data sets to develop new video models. This is one of our core strategies. It’s about making data smarter and making better use of large data sets that are often not used properly.
Geek Park: When will a generative model for video content be released?
Tom Mason: No doubt next year. We have a video model in training now, and we have established partnerships with the large dataset owners I mentioned earlier. I think the model architecture still needs to be optimized, but we have some interesting options.
I am very much looking forward to the middle of next year, when we should be able to produce a good video model. Of course it will be for short videos at first, and then slowly develop toward long videos, which may require combining multiple models. At the same time, scene blending and other related techniques will need to be optimized.
One of our tools, Dream Studio, is used to edit and make animations. We are actually working on an animation generation API that lets people generate an animation from a single image, using a 2D-to-3D depth estimation method. It’s a really cool technology, a little different from video diffusion, and we’re going to release it early next year so that users can try it. Video diffusion will be released later next year.
I really look forward to the day when we can create tools for animation and video diffusion models. 3D, too, will be a hot area next year. We’ve seen a lot of pipelines that include NeRF (note: a technique for turning 2D images into 3D models), allowing us to create 3D models and assets. This can go through a pipeline from text to image and then from 2D to 3D, or the environment in a photograph can be converted into a 3D model through NeRF. These pipelines are currently very slow, but their efficiency is improving rapidly.
Geek Park: What kind of brand-new experience will generative models for video and 3D bring to people?
Tom Mason: Users should soon be able to use these generative pipelines to create 3D assets in VR or in game scenes. This is going to be such a big deal that it almost immediately makes you think of the metaverse. In it, you can create your own environment; players only need to say what kind of game assets or environment they want to be immersed in. It’s going to be very exciting.
I think many of us have imagined it: in VR, the entire environment around us is generated automatically. Players have full control over the music, the 3D assets, and the atmosphere of the environment, so you can take full control of your experience. This is very much in line with the progress generative AI has made. What generative AI is doing is turning consumers into creators, giving them the ability to create the media content they consume themselves. It’s going to be a very exciting time.
Geek Park: Currently, what are the challenges in generating 3D content?
Tom Mason: For 3D content generation as it stands today, I think the challenges are mainly generation time and resolution. The two are related: the more accurate the NeRF model is, the slower it runs. And if you ask what the most amazing progress in image models has been, it is the reduction in generation time.
A year ago, it might have taken minutes, 2, 3, or 4 minutes, to generate a high-resolution image. But now, for example, running Stable Diffusion through our API takes only about 2-3 seconds, so performance has improved by an order of magnitude. That is why this model has been so successful: it is small enough and generates fast enough that it can run on a local GPU, and run quickly.
So we need to see similar breakthroughs in 3D content generation. It currently takes about 10 minutes to generate a decent mesh model from a photo. That is too slow for ordinary users who want to build it into a creative experience; people want authoring tools that respond quickly.
So I think we need to focus on solving this problem.
Geek Park: How mature is video generation technology?
Tom Mason: I’m confident it’s going to get a lot faster. We have seen some new sampling techniques and model architectures that can greatly reduce inference time. The image model forms the core of the video model. To some extent, a video model adds temporal information to an image model, so as long as we make the image model smaller, the video model can become more efficient too. That is a relatively clear research direction in the video field.
I think there is a high probability that we will achieve real-time video generation by the end of next year. I expect per-frame inference for video to reach at least 1 frame per second fairly soon next year, and then reach real-time fluency by the end of the year. 3D is further off and depends on how the technology iterates. But there is no doubt that we will continue to invest firmly in 3D content generation, together with many companies including Nvidia.
Stability AI official website
Be part of the community
Geek Park: You mentioned that Stability AI adheres to open source and supports 8 open source communities. How do these communities operate?
Tom Mason: Our open source communities operate much like Linux and other open source projects that everyone is familiar with, and they are meritocratic. Community members’ contributions to the code base are managed through Git; members review each other’s code, and once a review passes, the code can be merged into the trunk.
For the open source communities we support, we fund some researchers who can lead the community, which enables them to work on the project full-time. Many of the people who work on these projects do it in their spare time, or while working on a university degree or Ph.D. Even if many of them want to work on the project full-time, reality does not allow it. We have funded some of the projects’ core researchers so that they can devote their full energy to the project.
Of course, we only do this when we are very sure that this person is vital to the community. These people either played an important role in creating the community, or they were able to bring members together. There are always some people who are indispensable in the organization and play the role of glue. For these people, we will do our best to support them.
Geek Park: What role does Stability AI play in the community?
Tom Mason: I think the point is that we are no different from other members of the community. As a business, we are just one member of the community. We don’t own it; we’re just a contributor.
I think all of us see it that way. Beyond that, we don’t want to play any other role. As a business, we just want to contribute in a positive and open way and help the ecosystem improve. I think everyone agrees with that. And we hope we can make more positive contributions.
Geek Park: You hope your models can reach 1 billion people. How will this happen?
Tom Mason: One exciting fact is that we are training models in a large number of different languages. There is no large-scale multilingual generative model today, but the emergence of multilingual datasets will change that.
Not many people know about this technology yet. The statistics we see on model coverage show that it is still very small globally. So in the next year or two, we will train the model in different languages and make Stable Diffusion compatible with more languages. We want to work with global partners, and it is very important for us to work with institutions in different countries. Together we can train these models in different languages.
This doesn’t require redeveloping the technology; it is really a matter of reapplying an existing approach. Now that we have these architectures, we should roll them out quickly. We hope to share the entire model training process and know-how so that partners and suppliers in various countries can master it. In this way, in the next 12 months, image generation may once again make waves all over the world, and the same is true for video and audio. One billion may not be enough, but it is our current goal.