Note: this blog post is a final paper for my UCSB WRIT105SW course. As such, it is a slight deviation from my standard writing and may assume a lot less prerequisite math and computer science knowledge than some other posts on this account. Nonetheless, I attest it's still top notch :)

Part 1: The Turing Test

In 1950, mathematician and computer scientist Alan Turing pondered whether there was a fundamentally philosophical way to answer the question "can machines think?" In light of this question, he proposed the thought experiment of the "Turing Test," a theoretical setup that could discern a machine from a human based on how it responds to questions. These were often simple questions that prompted complex answers:

- Describe yourself using only colors and shapes. (A machine would struggle with the abstraction from complex human characteristics to simple shapes and colors.)
- Do more people go to Russia than me? (This sentence is syntactically correct but semantically nonsense. A human immediately recognizes that, while a machine may try to answer it literally.)
- Describe why time flies like an arrow but fruit flies like a banana. (A machine would struggle to interpret whether the second "flies" is a noun or a verb.)

Back in 1950, machines were simply far too computationally inept to make a dent in any of these questions. After all, Turing himself had mathematically proven in 1936 that no algorithm can decide whether an arbitrary program terminates or loops forever. Computer scientists began to realize that computers couldn't do everything. One could almost believe that a theoretical boundary of computing existed here too: some sort of human-computer interaction limit.

But they never found it. In fact, it may not exist.

I vividly remember having to write a paper on polka tradition for my ethnomusicology course when I first heard whispers of a possibly groundbreaking tool called ChatGPT. I couldn't believe my eyes: the essay was done. It was coherent, well-structured, and it captured every nuance that I gave in the prompt.

The arrival of transformer models like ChatGPT single-handedly shattered my perception of everything I knew about computers. ChatGPT surpasses every previous language model by miles. It passes countless Turing Tests. And through it all, it can masquerade as a human and produce responses that signify an understanding of the meaning of the sentences given to it.

How is this all possible? Behold, the transformer model.

Part 2: What does generative mean?

In ChatGPT, the GPT stands for "generative pre-trained transformer." In the context of machine learning, a generative model is one that outputs new creative content based on a prompt or some input sequence. In the context of transformer models, the prediction is made one word at a time, based on the string of words that came before it.

[Figure: The model predicts that the word "over" is the most likely to follow in the sequence.]

Take the above example. The crux of the prediction process is that each word is generated one at a time. This is actually the reason why ChatGPT slowly types out a response word-by-word instead of giving you a block of text all at once. For predicting a single word (a toy version of this loop is sketched below):

1. All of the previous words (or rather, at least enough to establish context) are fed into the model.
2. The model gives back not one word, but a probability distribution over every single word in the dictionary. The model is trained on a large sample of example text and predicts based on its observations of which words tend to follow other words.
3. Finally, the highest-probability word is output. Then the cycle repeats.

[Figure: Once "over" is predicted, the cycle repeats and the model generates the word "the".]
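To make the loop concrete, here is a minimal sketch in Python. The bigram table and its probabilities are invented for illustration; a real model learns a distribution over its entire vocabulary from training data and conditions on all of the previous words, not just the last one. But the feed-context, get-distribution, pick-the-likeliest-word cycle mirrors the three steps above.

```python
# A toy version of the generate-one-word-at-a-time loop.
# The "model" is a hand-written bigram table with made-up probabilities:
# P(next word | previous word).
BIGRAM_PROBS = {
    "jumps": {"over": 0.7, "up": 0.2, "down": 0.1},
    "over":  {"the": 0.8, "a": 0.2},
    "the":   {"lazy": 0.5, "dog": 0.3, "fence": 0.2},
}

def predict_next(context: list[str]) -> str:
    """Steps 1-2: feed in the context, get back a distribution over words."""
    distribution = BIGRAM_PROBS.get(context[-1], {})
    if not distribution:
        return "<end>"
    # Step 3: output the word with the highest probability.
    return max(distribution, key=distribution.get)

def generate(prompt: list[str], max_words: int = 3) -> list[str]:
    """Repeat the predict-append cycle, one word at a time."""
    words = list(prompt)
    for _ in range(max_words):
        next_word = predict_next(words)
        if next_word == "<end>":
            break
        words.append(next_word)  # the cycle repeats with the new word included
    return words

print(generate(["the", "quick", "brown", "fox", "jumps"]))
# -> ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy']
```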
Up to now, this model is alright, but an issue you immediately run into is that it cannot understand context. For example, take the following sentences:

"I caught a bass in the lake."
"I connected my electric bass to the speaker."

The model literally cannot discern whether the input word bass refers to the fish or the instrument. Fortunately, the key insight of the transformer model is how it utilizes a tool called attention heads to preserve the context of words; this will be explained in a bit.

But to understand that, let's first take a look at how meaning can be encoded at all.

Part 3: Embeddings

To understand how machines make sense of words at all, we first have to take a look at embeddings, or mathematical representations of word meaning.

Words are really hard to assign meaning to in a machine. You can tell a human that the word serene conveys tranquility and peace, but a computer has no inherent understanding of what tranquility means. The mathematical best-effort approach to approximating meaning is a concept called embeddings.

Let's look at an example of the word embeddings for the words elephant and small.

[Figure: Examples of word embeddings, drawn out in two dimensions.]

In this example, our 2-d plane has a dimension representing "living-ness" and one that represents "size."
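To make this picture concrete, here is a minimal sketch of that two-dimensional plane in Python. The vectors and the extra words (mouse, boulder) are invented for illustration; real embeddings have hundreds of dimensions and are learned from data rather than hand-assigned. Cosine similarity, one common way to compare embeddings, scores how closely two vectors point in the same direction.

```python
import math

# Toy 2-d embeddings: dimension 0 is "living-ness", dimension 1 is "size".
# All numbers are made up for illustration.
EMBEDDINGS = {
    "elephant": (0.9, 0.9),   # very alive, very large
    "mouse":    (0.9, 0.1),   # very alive, small
    "boulder":  (0.0, 0.8),   # not alive, large
    "small":    (0.1, 0.05),  # points toward the "tiny" corner of the plane
}

def cosine_similarity(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

for w1, w2 in [("elephant", "mouse"), ("elephant", "boulder"), ("mouse", "small")]:
    print(f"{w1} vs {w2}: {cosine_similarity(EMBEDDINGS[w1], EMBEDDINGS[w2]):.2f}")
```

Running this, mouse vs small scores highest, since both vectors sit near the small end of the size axis. That is exactly the sense in which embeddings approximate meaning: words with similar meanings get nearby vectors.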