Deepening into How AI Understands Different Contexts
Written by Kerem Muldur
As humans, we understand the contextual meanings of concepts by living and making connections. For instance, you know what “president” means because you experience their actions, you see them on television, and so on. However, Large Language Models (LLMs) such as ChatGPT and Gemini can’t experience what is going on in the world. The question you’re probably asking now is “How do they understand?” You’re in the right place to find out, because in this article we’ll pull back the curtain on how modern AI turns words into numbers, understands their relationships, and processes entire sentences in one go. No programming degree required, just curiosity and a passion for big ideas explained with familiar concepts.
How Does AI Learn Words?
When somebody asks whether ducks and lakes, or planes and pilots, are connected, we would probably say yes. But how can AI create this connection the way our brains do? This is where vector spaces come in. Imagine a vast space with hundreds of invisible axes, far beyond our familiar X, Y, Z. In that high-dimensional world, every word lives as a point. At first, those points are scattered randomly, like darts thrown blindfolded. During training, however, the model reads billions of sentences and repositions each word’s point closer to words that appear in similar contexts. So “king” and “queen” end up close together, while “king” and “banana” stay far apart. This continuous adjust-and-correct process is how AI “learns” word meanings.
The list of numbers that marks a word’s coordinates in that high-dimensional space is called a vector, and a word’s vector is its embedding. If our AI uses 768 dimensions, then each of those 768 slots captures one feature: maybe “is it fruit-related?”, “is it positive?”, or “is it common in cooking?” By comparing two vectors, the model measures similarity: if they are close together, the words are related; if not, they are not.
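To make this concrete, here is a minimal Python sketch. The 4-dimensional vectors below are made up purely for illustration (real embeddings have hundreds of dimensions, such as the 768 mentioned above, and come from training, not from hand-picked numbers):

```python
import numpy as np

# Toy embeddings: each word is a point in a (made-up) 4-dimensional space.
embeddings = {
    "king":   np.array([0.80, 0.65, 0.10, 0.05]),
    "queen":  np.array([0.75, 0.70, 0.15, 0.10]),
    "banana": np.array([0.05, 0.10, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point the same way (related words);
    # close to 0 means they point in unrelated directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # high
print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # low
```

The exact similarity measure varies between systems, but comparing direction in the vector space, as cosine similarity does here, is the common idea.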
Tokenization
Before embedding, we must chop sentences into pieces called tokens. A token might be a whole word (“running”), a subword (“run” + “ning”), or even a single character. This is key because new or rare words (think “chatGPT’ing”) would otherwise be out of reach. One popular technique for tokenization is Byte-Pair Encoding (BPE): you start with every character as its own token, then repeatedly merge the most common neighboring pairs, for example “a” + “t” → “at”, until you have a manageable vocabulary of subwords and words. This mix enables the AI to handle both familiar and brand-new terms with ease.
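Here is a toy sketch of that merge loop. The tiny word-frequency table and the five merge steps are illustrative assumptions; production tokenizers learn tens of thousands of merges from huge corpora:

```python
from collections import Counter

# A toy corpus: words split into characters, with how often each word appears.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_common_pair(vocab):
    # Count every adjacent pair of symbols, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    # Rewrite every word with the chosen pair fused into one symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):  # just a handful of merges for the demo
    pair = most_common_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair)
```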
RNN vs. Parallel Attention
Once the words are located in the vector space, the AI compares them with other words to train itself further. Before modern AI could process whole sentences at once, it used Recurrent Neural Networks (RNNs), which read one word at a time, like a whisper chain. Each word passes a hidden “state” to the next: “The → cat → sat → on…” After enough steps, that initial message can be lost entirely.
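A minimal sketch of that whisper chain, with random stand-in weights and vectors (real RNNs learn these weights during training):

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(4, 4))  # input-to-hidden weights (toy)
W_h = rng.normal(scale=0.5, size=(4, 4))  # hidden-to-hidden weights (toy)

# Stand-ins for the embeddings of "The cat sat on ..."
sentence = [rng.normal(size=4) for _ in range(6)]

h = np.zeros(4)  # the hidden "state" that gets whispered along
for x in sentence:
    # Each step folds the new word into the running state; whatever the first
    # word contributed has to survive every one of these squashing updates.
    h = np.tanh(W_x @ x + W_h @ h)
print(h)
```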
This problem triggered the question: What if every word could see every other word at once? Enter self-attention, which works like a group chat. All tokens broadcast questions and answers simultaneously: “How relevant are you to me right now?” Because it happens in parallel, even a “not” at the start of a long sentence still influences “happy” at the end.
Introducing the Transformer: Putting Self‑Attention to Work
The Transformer takes our randomly embedded tokens and, through a series of deliberate steps, relocates related words closer together in that high-dimensional space while pushing unrelated ones farther apart. It does this with a module called an encoder layer, stacked several times. Each layer takes in a set of token vectors (think of them as points on the map), then applies self-attention, normalization, and a small neural network to gradually refine each token’s position so it reflects its true contextual meaning.
Projecting Tokens into Q, K, and V
Inside each encoder layer, every token vector X is multiplied by three learned matrices to produce Query (Q), Key (K), and Value (V) vectors. These matrices are simply tables of numbers that the model has learned during training, optimized so that multiplying a token vector by one of them yields, for example, the perfect “question” to ask about other tokens. This projection prepares each token for similarity comparison.
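In code, the projection is just three matrix multiplications. The sizes and random weights below are toy assumptions standing in for trained parameters:

```python
import numpy as np

d_model = 8                          # toy embedding size (real models use 768+)
rng = np.random.default_rng(1)
X = rng.normal(size=(5, d_model))    # 5 token vectors, one per row

# Learned projection matrices; random stand-ins here for trained weights.
W_q = rng.normal(scale=0.1, size=(d_model, d_model))
W_k = rng.normal(scale=0.1, size=(d_model, d_model))
W_v = rng.normal(scale=0.1, size=(d_model, d_model))

Q = X @ W_q   # each token's "question" about the others
K = X @ W_k   # the "label" each token offers to be matched against
V = X @ W_v   # the content each token contributes if attended to
print(Q.shape, K.shape, V.shape)     # (5, 8) each
```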
Computing Similarity and Attention Outputs
Next, we take each token’s Q and compute a similarity score against every other token’s K by taking a dot product. A higher dot product means two tokens point in similar directions; in other words, they share meaning. We then apply the softmax function to turn those raw scores into positive weights that sum to 1. Finally, each token’s new vector is the weighted sum of all V vectors. This attention output reflects exactly how much context each neighboring token should contribute, effectively repositioning each token closer to those it cares about most.
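Here is that computation as a self-contained sketch with random toy Q, K, and V (note one extra detail not discussed above: standard Transformers also divide the scores by the square root of the vector size before the softmax, to keep them in a reasonable range):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Q, K, V = (rng.normal(size=(5, d)) for _ in range(3))  # toy Q, K, V for 5 tokens

def softmax(scores):
    # Subtract the row max for numerical stability, then normalize to sum to 1.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = Q @ K.T / np.sqrt(d)     # dot-product similarity between every Q and every K
weights = softmax(scores)         # each row becomes positive weights that sum to 1
new_vectors = weights @ V         # each token's new vector: weighted sum of all V vectors

print(weights[0].round(2), weights[0].sum())  # how much token 0 attends to each token
```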
Residuals, Normalization, and Layer Stacking
Once we have the attention output, we add it back to the original input token (a residual connection) so we don’t lose what we started with. We then apply layer normalization to keep the values from exploding or vanishing. That completes one encoder layer. We feed those normalized vectors into the next layer, repeating the projections, similarity scoring, weighting, and normalization. Each pass sharpens token positions further, allowing early layers to capture simple word associations and deeper layers to pick up on complex contextual patterns, until every token sits in just the right spot.
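Putting the pieces together, here is a simplified encoder layer and a small stack of them. It is a sketch under toy assumptions: random stand-in weights, a single-matrix feed-forward step (real Transformers use two matrices with a wider hidden size), and single-head attention rather than multi-head:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Rescale each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(X, W_q, W_k, W_v, W_ff):
    # Self-attention, as in the sketches above.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V
    X = layer_norm(X + attn)                      # residual connection + normalization
    X = layer_norm(X + np.maximum(0, X @ W_ff))   # small feed-forward step + residual
    return X

rng = np.random.default_rng(2)
d = 8
X = rng.normal(size=(5, d))   # 5 toy token vectors entering the stack
layers = [tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(4)) for _ in range(3)]
for W in layers:              # stack three encoder layers, refining X each time
    X = encoder_layer(X, *W)
print(X.shape)
```

Each pass through `encoder_layer` nudges the token vectors, which is the repositioning described above: early layers settle simple associations, deeper layers refine the contextual ones.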