Meet Paella, a Novel AI Text-to-Image Model With a Speed-Optimized Architecture That Samples a Single Image in Under 500 ms While Using Only 573M Parameters
Research on text-to-image generation has driven recent advances in the quality and diversity of generated images. However, these models' remarkable output quality has come at the cost of inference speeds too slow for end-user applications, because sampling requires many steps. Most recent state-of-the-art systems are either transformer-based or built on diffusion models. Transformers typically compress their spatial representation before learning because self-attention scales quadratically with the dimensions of the latent space. In addition, a transformer flattens the encoded image tokens, treating images as one-dimensional sequences; this is an unnatural projection of images and makes the modeling task substantially more complex.
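To make the quadratic-scaling point concrete, here is a minimal PyTorch sketch, not taken from the article: the latent grid size, embedding dimension, and single-head attention are illustrative assumptions. It shows how flattening a 2D token grid into a 1D sequence of N = h*w tokens leads self-attention to build an N x N score matrix, so cost grows quadratically with the number of latent positions.

```python
# Illustrative sketch (assumed sizes, single attention head) of why
# self-attention over image tokens scales quadratically.
import torch

h, w, d = 32, 32, 64                 # hypothetical latent grid and embedding size
tokens_2d = torch.randn(h, w, d)     # encoded image tokens on a 2D grid

# Transformers flatten the 2D grid into a 1D sequence of N = h*w tokens,
# discarding the image's natural spatial structure.
seq = tokens_2d.reshape(h * w, d)    # shape: (1024, 64)

# Self-attention forms an N x N score matrix, so memory and compute
# grow with N^2, i.e. quadratically in the number of latent positions.
q = k = v = seq
scores = q @ k.T / d ** 0.5          # shape: (1024, 1024)
attn = scores.softmax(dim=-1) @ v    # shape: (1024, 64)
```

Doubling the latent side length (h, w → 2h, 2w) quadruples N and multiplies the attention cost by 16, which is why such models compress the spatial representation before applying attention.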