5 Advancements that made the current wave of generative AI possible

From an AI rendition of Kanye West flawlessly singing "Just the Two of Us"[1], to universities trying to discern whether student assignments are generated by ChatGPT, to Tokyo researchers reproducing images from brain scans that reflect what people are seeing[2] - generative AI has swept the globe. Some argue we aren't prepared for these shifts, while others suggest the capabilities of AI are being overstated. Influential figures, like Elon Musk and Bjarne Stroustrup (creator of the C++ programming language), have urged a halt on AI system development until we're equipped to manage them.

The past few years could indeed be seen as the era of generative AI. Incremental progress, combined with revolutionary advancements, has enabled AI to generate human-like creative texts, images, videos, and music. Every week sees a surge of new AI-related software and research papers, making it challenging to stay up to date.

What's new? And what foundational elements have allowed generative AI to become as effective and contentious as it is? In this blog, we discuss the five major advancements that have made the current state of generative AI possible.

Attention is all you need

In 2017, a group of researchers, mainly from Google, published a groundbreaking paper called "Attention Is All You Need"[3]. Even the authors did not anticipate how much impact the paper would have in the following years. The attention mechanism is the part of the transformer model that handles self-attention within a sequence and cross-attention between the encoder and the decoder. Simply put, the attention mechanism highlights the interdependencies between the elements of an input. For instance, it captures how the words in a sentence depend on one another.

You might wonder how such a seemingly simple concept could have a major impact. That is a fair question. The significance lies in how sequence-to-sequence models were built before. Sequence-to-sequence models take a sequence as input and generate a sequence as output. Translating "I have a cat" to "Ich habe eine Katze" is an example of a sequence-to-sequence task.

Previously, sequence-to-sequence models were built with recurrent neural networks. These mostly followed a sequential approach in which the output of each step is fed back in as input to the next step, which severely limited the possibility of parallelization.
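To make that sequential bottleneck concrete, here is a minimal sketch of a recurrent update in plain Python. The function name `rnn_forward` and the scalar weights `w_x` and `w_h` are illustrative assumptions, not taken from any particular model; real recurrent networks use weight matrices and vectors, but the dependency between steps is the same.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8):
    """Run a toy scalar RNN over a sequence and return all hidden states."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # Each step needs the previous hidden state, so step t cannot
        # start before step t-1 has finished.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, -1.0])
```

Changing the first input changes every later hidden state, which is why the loop cannot simply be split across processors.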

Because these models had to process inputs in order, it was hard to deal with large amounts of data at once. The attention mechanism changed this, allowing AI systems to process all positions of an input at the same time.
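As a rough illustration, the following sketch implements scaled dot-product self-attention, the core operation of the paper[3], in plain Python. It is a simplification under stated assumptions: the toy embeddings are made-up numbers, and the raw embeddings are used directly as queries, keys, and values, whereas a real transformer first passes them through learned projections.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    # Each query is handled independently of the others, so in practice
    # this loop is replaced by one batched matrix multiplication.
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Toy embeddings for a three-word sentence (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output vector is a weighted mix of all value vectors, with the weights expressing how strongly one word attends to every other word.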


Denoising diffusion models

A simple technique contributing to the current wave of image and video generation is the use of diffusion models[4]. Several text-to-image systems such as Stable Diffusion, DALL-E, and Midjourney have popularized generating images from provided text.

The key idea behind diffusion models involves a forward process that destroys a clean image and a backward process that reconstructs an image. Let's break this down:

First, you take a high-quality image - perhaps one a photographer captured or an artist's painting. Then, step by step, you add Gaussian noise to it until all that remains is pure noise that no longer resembles the original image. Along the way, a neural network is trained to estimate the noise that was added. Gaussian noise is comparatively easy to model because it is described by only two parameters, the mean and the standard deviation, both of which can be tuned.
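The forward process can be sketched in a few lines. This is a minimal sketch, assuming the linear noise schedule (variances from 1e-4 to 0.02 over 1000 steps) described in the diffusion paper[4]; the function names and the flat 1000-pixel toy "image" are hypothetical choices for illustration.

```python
import math
import random

def linear_betas(steps, start=1e-4, end=0.02):
    """Noise variances that grow linearly from start to end."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]

def forward_diffusion(pixels, betas, rng):
    """Add Gaussian noise step by step; return the final noisy pixels."""
    x = list(pixels)
    for beta in betas:
        # Shrink the signal slightly and mix in fresh Gaussian noise.
        x = [
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in x
        ]
    return x

rng = random.Random(0)
image = [1.0] * 1000  # a flat toy "image" of 1000 pixels
noisy = forward_diffusion(image, linear_betas(1000), rng)
mean = sum(noisy) / len(noisy)
variance = sum((v - mean) ** 2 for v in noisy) / len(noisy)
```

After the last step the pixel values have a mean near 0 and a variance near 1, so almost nothing of the original image remains.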

Then, during the backward stage, you start with total noise and text conditioning, and you attempt to generate a clean image by reversing the process step by step, estimating the noise parameters that aid in this reverse process. This approach is especially helpful when trying to generate an image from a piece of text.
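The link between "estimating the noise" and "getting an image back" can be shown with a small sketch. It assumes the notation of the diffusion paper[4], where `alpha_bar` is the fraction of the original signal still present at a given step. The function names are hypothetical, text conditioning is left out, and the true noise stands in for the prediction of a trained network. Real samplers also take many small steps rather than one jump.

```python
import math
import random

def add_noise(clean, alpha_bar, noise):
    """Jump straight to a noisy version of `clean` at signal level alpha_bar."""
    return [
        math.sqrt(alpha_bar) * c + math.sqrt(1.0 - alpha_bar) * n
        for c, n in zip(clean, noise)
    ]

def estimate_clean(noisy, alpha_bar, predicted_noise):
    """Invert add_noise using a prediction of the noise that was added."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]

rng = random.Random(1)
clean = [0.2, 0.9, -0.4, 0.7]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
noisy = add_noise(clean, alpha_bar=0.3, noise=noise)

# Hypothetical stand-in: a trained network would predict the noise from
# `noisy` (and the text prompt); here we cheat and pass the true noise.
recovered = estimate_clean(noisy, alpha_bar=0.3, predicted_noise=noise)
```

With a perfect noise estimate the clean pixels are recovered exactly; the quality of a real model depends on how close its prediction gets.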

Compute power and data

Discussing the current capabilities of generative AI would be incomplete without acknowledging the substantial computational power and large quantities of data these systems require. Indeed, it's safe to argue that contemporary generative AI products would not exist without vast amounts of data, very large parameter counts, and extensive computational resources.

In 2020, a paper[5] written by AI ethicist Timnit Gebru, among others, sparked controversy, leading to Gebru's contentious departure from Google under pressure from senior management. The paper focused on the environmental and societal impacts of training large language models, underscoring the escalating scale of training datasets and parameters, and the often-overlooked environmental and energy ramifications.

The situation has arguably intensified since then. GPT-3, the large language model on which the first ChatGPT was based, was trained on about 570GB of text data and has 175 billion parameters. Training such a system on Google's current TPUs would cost more than 1.65 million dollars. To put it differently, an average laptop would take millennia to train such a system.
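A back-of-the-envelope calculation shows where "millennia" comes from. This is only a sketch under stated assumptions: it uses the common rule of thumb of about 6 floating-point operations per parameter per training token, a training run of 300 billion tokens, and a laptop that sustains 1 TFLOP/s. None of these figures come from this article, and real hardware rarely reaches its peak throughput.

```python
PARAMETERS = 175e9          # GPT-3 parameter count
TRAINING_TOKENS = 300e9     # assumption: tokens seen during training
FLOPS_PER_PARAM_TOKEN = 6   # rule of thumb: forward plus backward pass
LAPTOP_FLOPS = 1e12         # assumption: 1 TFLOP/s sustained

total_flops = FLOPS_PER_PARAM_TOKEN * PARAMETERS * TRAINING_TOKENS
seconds = total_flops / LAPTOP_FLOPS
years = seconds / (365.25 * 24 * 3600)
weights_gb = PARAMETERS * 2 / 1e9  # 2 bytes per parameter at 16-bit precision

print(f"training compute: {total_flops:.2e} FLOPs")
print(f"laptop time: {years:,.0f} years")
print(f"weights alone: {weights_gb:.0f} GB")
```

Under these assumptions the laptop would need on the order of ten thousand years, and the weights alone would not fit into its memory.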

The data used by GPT-3 was compiled from a variety of sources. Their paper references about 400 billion tokens from Common Crawl, an internet text-crawling source, tens of billions of tokens from books, and 3 billion tokens from Wikipedia.

Citing competitive concerns, OpenAI has decided not to disclose the training data and parameter count of GPT-4. However, some sources suggest that it uses over 1 trillion parameters in a Mixture of Experts architecture.

As for the hardware currently in use, Google introduced TPUs (Tensor Processing Units) in 2016 as specialized accelerators for machine learning. Having recognized that most neural networks do not need double precision, they built hardware optimized for lower-precision arithmetic.
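The trade-off behind reduced precision can be illustrated with Python's standard `struct` module. This is a generic sketch, not a description of the number formats a TPU actually uses: it only compares how much memory a value takes and how accurately 0.1 survives in 64-bit, 32-bit, and 16-bit IEEE floating point.

```python
import struct

def round_trip(value, fmt):
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

FORMATS = [("float64", "<d"), ("float32", "<f"), ("float16", "<e")]

for name, fmt in FORMATS:
    stored = round_trip(0.1, fmt)
    size = struct.calcsize(fmt)
    print(f"{name}: {size} bytes per value, 0.1 stored as {stored!r}")
```

Halving the width of each number halves the memory traffic per value, at the cost of a rounding error that neural networks usually tolerate well.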


Backpropagation

Backpropagation is far from a new concept. It predates the 1980s and 90s, although it only gained widespread popularity in this century. While it's not a new player in the generative AI realm, it does serve as the backbone of neural network systems.

For most of these systems, optimizing the cost function is the ultimate goal. The cost function represents the distance between the expected outcome and the output provided by the system. The ideal objective is to minimize this difference, aiming for it to be as close to zero as possible.
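One common choice of cost function is the mean squared error. The sketch below is an illustrative example with made-up numbers; the article does not say which cost function any particular system uses.

```python
def mean_squared_error(expected, predicted):
    """Average squared distance between targets and model outputs."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)

expected = [1.0, 0.0, 1.0]
good = mean_squared_error(expected, [0.9, 0.1, 0.8])  # outputs close to targets
bad = mean_squared_error(expected, [0.1, 0.9, 0.2])   # outputs far from targets
```

The closer the outputs are to the expected values, the smaller the cost, and a perfect prediction gives a cost of exactly zero.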

Optimization in this context means seeking a local minimum of the cost function, which is a matter of calculus. However, neural network systems add complexity through their structure: layers upon layers of interconnected neurons. The effects observed in the final output are the result of parameters in the earlier layers of the network.
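For a single parameter, the calculus part is easy to sketch with gradient descent. The toy cost function, the starting point, and the learning rate below are arbitrary assumptions chosen for illustration.

```python
def cost(w):
    """Toy cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for _ in range(50):
    # Step downhill, against the slope of the cost.
    w -= learning_rate * cost_gradient(w)
```

Each step moves `w` a little closer to the minimum. The difficulty in a neural network is computing this slope for millions of parameters spread over many layers.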

The challenge lies in transferring the impact of optimization to the preceding layers and communicating to every neuron its role in the final outcome. This is where backpropagation comes into play – it guides the error back through all the hidden layers of a neural network system.
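The idea can be sketched for the smallest possible network: one input, one hidden neuron, and one output. This is a minimal sketch with hypothetical names and made-up starting values; it applies the chain rule by hand, whereas real frameworks automate the same bookkeeping.

```python
import math

def forward(x, w1, w2):
    """Two-layer toy network: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def backward(x, target, w1, w2):
    """Return the gradients of the squared error with respect to w1 and w2."""
    hidden, output = forward(x, w1, w2)
    d_output = 2.0 * (output - target)            # dLoss/dOutput
    grad_w2 = d_output * hidden                   # error reaches the last layer
    d_hidden = d_output * w2                      # error passed back one layer
    grad_w1 = d_hidden * (1.0 - hidden ** 2) * x  # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2

x, target = 0.5, 0.8
w1, w2 = 0.5, 0.5
for _ in range(1000):
    g1, g2 = backward(x, target, w1, w2)
    w1 -= 0.1 * g1
    w2 -= 0.1 * g2

final_loss = (forward(x, w1, w2)[1] - target) ** 2
```

The gradient for `w1` is built from the error signal that first passed through `w2`, which is exactly the "guiding the error back" described above.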


Reinforcement Learning

Much like backpropagation, reinforcement learning is not a novel concept. It has been a subject of active study for several decades and relies on a reward mechanism to enhance a system's performance.

Reinforcement learning plays a crucial role in ChatGPT and other contemporary AI systems that actively engage with users. As an integral part of the fine-tuning process, reinforcement learning can help refine the system's parameters based on feedback received from active users.
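The reward mechanism itself can be illustrated with a classic toy problem, the multi-armed bandit. To be clear, this is not the procedure used to fine-tune ChatGPT; it is a minimal sketch of learning from rewards, and the function name, reward probabilities, and exploration rate are all assumptions made for illustration.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that learns which action pays off most often."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # running average reward per action
    counts = [0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = estimates.index(max(estimates))   # exploit best guess
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
```

The agent is never told which action is best. It only sees rewards, yet over time it ends up choosing the action with the highest payoff most often.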


Our Conclusion

Machines that think, create, and understand the subtleties of life in meaningful ways have long been a fascinating concept for scientists. If machines could draw from all recorded music and start predicting musical pieces that engage our senses, does this threaten our own capacity for creativity? Or could we view it as an opportunity to use these tools to create incredibly complex things by delegating the basics?

Even though the most well-known tools, like ChatGPT and Midjourney, are impressive in their own right, they are far from replacing humans in executing complex tasks. It's not uncommon for ChatGPT to begin hallucinating after a few paragraphs of text, or to confidently discuss topics it knows nothing about. It often misses basic math problems that even third graders could handle.

Several other challenges - including context length, transfer learning, and general intelligence - are far from being solved, and would require an extensive blog entry to fully discuss. Despite these hurdles, it's exciting to think about what the next 5-10 years might bring. Indeed, these are simultaneously thrilling and troubling times in which we live. We must tread carefully.


  1. https://www.youtube.com/watch?v=t-tx7A7bE4U&ab_channel=Ryuma
  2. https://www.biorxiv.org/content/10.1101/2022.11.18.517004v2.full.pdf
  3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: “Attention Is All You Need”, 2017; http://arxiv.org/abs/1706.03762.
  4. Jonathan Ho, Ajay Jain, Pieter Abbeel: “Denoising Diffusion Probabilistic Models”, 2020; http://arxiv.org/abs/2006.11239
  5. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
