What is an embedding?


An embedding is a numerical representation of text, images, or other data as a vector, which is a list of numbers. These vectors capture the semantic meaning of the content, so items with similar meanings end up close together in the vector space. Embeddings are the foundation of AI-powered search, recommendations, and retrieval-augmented generation.

How Embeddings Capture Meaning

────────────────────────────────────────

When you pass a piece of text through an embedding model, it returns a vector of floating-point numbers, typically between 256 and 3072 dimensions. Each dimension captures some aspect of the text's meaning.

The key insight is that similar concepts get similar vectors. The embedding for "dog" will be closer to "puppy" than to "airplane." The embedding for "How do I reset my password?" will be close to "I forgot my login credentials" even though the two sentences share almost no words.

This works because embedding models are trained on massive amounts of text where they learn that certain words and phrases appear in similar contexts. That contextual similarity gets encoded into the vector representation.
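A toy illustration makes this concrete. The three-dimensional vectors below are invented for the example (real models produce hundreds or thousands of dimensions), but the relationship they show is the one a real model learns:

```python
import math

# Hand-made 3-dimensional "embeddings" -- invented purely for illustration.
# Real embedding models produce vectors with hundreds or thousands of
# dimensions, learned from data rather than written by hand.
vectors = {
    "dog":      [0.90, 0.80, 0.10],
    "puppy":    [0.85, 0.90, 0.15],
    "airplane": [0.10, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["dog"], vectors["puppy"]))     # high: similar meaning
print(cosine(vectors["dog"], vectors["airplane"]))  # low: unrelated meaning
```

With a real model the numbers differ, but the ordering is the same: "dog" and "puppy" score far higher than "dog" and "airplane."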

Distance Metrics

────────────────────────────────────────

To determine how similar two embeddings are, you measure the distance between them:

[Cosine similarity] is the most common metric. It measures the angle between two vectors, returning a value between -1 and 1 (where 1 means identical direction). It is popular because it works regardless of vector magnitude, focusing purely on direction.

[Euclidean distance] measures the straight-line distance between two points. Smaller values mean more similar. This is intuitive but can be affected by vector magnitude.

[Dot product] is a fast similarity measure that combines both direction and magnitude. For unit-normalized vectors it equals cosine similarity, which is why it is often used once vectors have been normalized.

In practice, cosine similarity is the default choice for most text embedding applications. Most vector databases support all three metrics.
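All three metrics fit in a few lines of plain Python. This sketch also demonstrates the relationship noted above: after normalizing the vectors to unit length, the dot product gives the same value as cosine similarity (the vectors here are arbitrary example data):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Angle-based: magnitude cancels out.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a, b = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]

# For unit-length vectors, dot product and cosine similarity coincide.
an, bn = normalize(a), normalize(b)
print(cosine_similarity(a, b))
print(dot(an, bn))              # same value as the line above
print(euclidean_distance(a, b))
```

In production you would rely on your vector database's built-in metric rather than computing these by hand, but the definitions are exactly these.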

Use Cases

────────────────────────────────────────

[Semantic search]: Instead of matching keywords, search by meaning. A user searches "affordable places to stay in Paris" and finds results about "budget hotels in Paris" and "cheap Parisian accommodations" even though the exact words differ. This is the most common use of embeddings.

[Retrieval-augmented generation (RAG)]: Embed your documents, store them in a vector database, and when a user asks a question, find the most relevant documents and feed them to a language model as context. This grounds the model's response in your actual data.

[Recommendations]: Embed items and users, then recommend items whose embeddings are close to what a user has liked before. This works for products, articles, music, and more.

[Clustering]: Group similar documents together automatically. Embed all your support tickets and cluster them to discover common themes without manual labeling.

[Anomaly detection]: When most items cluster together, outliers with distant embeddings might be anomalies worth investigating. Useful for fraud detection and quality control.

[Deduplication]: Find near-duplicate content by comparing embeddings. Two articles that say essentially the same thing in different words will have similar vectors.
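The deduplication case reduces to comparing every pair of embeddings against a similarity threshold. The embeddings and the 0.95 threshold below are invented stand-ins; in practice the vectors come from a model and the threshold is tuned per model and domain:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented stand-in embeddings for three short articles; in practice
# these would come from an embedding model.
articles = {
    "a1": [0.70, 0.70, 0.10],  # "How to brew great coffee"
    "a2": [0.72, 0.68, 0.12],  # "A guide to making excellent coffee"
    "a3": [0.10, 0.20, 0.90],  # "Intro to linear algebra"
}

def find_near_duplicates(embeddings, threshold=0.95):
    """Return pairs of ids whose embeddings exceed the similarity threshold."""
    ids = list(embeddings)
    return [(i, j)
            for idx, i in enumerate(ids) for j in ids[idx + 1:]
            if cosine(embeddings[i], embeddings[j]) >= threshold]

print(find_near_duplicates(articles))  # the two coffee articles pair up
```

Comparing all pairs is O(n²), so at scale you would use a vector index to find candidate pairs instead of brute force.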

Popular Embedding Models

────────────────────────────────────────

[OpenAI] offers text-embedding-3-small and text-embedding-3-large. The large model produces 3072-dimensional vectors with excellent quality. The small model uses 1536 dimensions and is cheaper and faster. Both support shortening, letting you truncate vectors to reduce dimensions while retaining quality.

[Cohere] provides embed-v3 models optimized for different use cases: search, classification, and clustering. They support over 100 languages and offer separate models for English and multilingual use.

[Google] offers embedding models through Vertex AI and Gemini, including text-embedding-004 and multimodal embedding models that can embed both text and images into the same vector space.

[Open source models] are another excellent option. [Sentence-transformers] is a widely used Python library with dozens of models. [E5] (from Microsoft) and [BGE] (from BAAI) are popular open source embedding models that rival commercial offerings. [Nomic Embed] and [GTE] are other strong options. You can run these locally or on your own infrastructure.

[Voyage AI] and [Jina AI] offer specialized embedding models with strong performance on benchmarks, particularly for code and multilingual text.

Vector Databases

────────────────────────────────────────

Once you have embeddings, you need somewhere to store and search them efficiently. Vector databases are purpose-built for this:

[Pinecone] is a managed vector database designed for simplicity. You push vectors in and query by similarity. It handles scaling and indexing automatically.

[Weaviate] is an open source vector database that supports hybrid search (combining vector and keyword search), multiple modalities, and built-in vectorization.

[Chroma] is lightweight and popular for prototyping and smaller applications. It runs in-memory or persisted, with a simple Python API.

[pgvector] is a PostgreSQL extension that adds vector similarity search to your existing Postgres database. Great if you already use Postgres and want to avoid adding another service.

[Qdrant] is an open source vector database written in Rust, known for performance and a rich filtering system.

[Milvus] is built for large-scale vector search and is widely used in production systems.

For many projects, pgvector is the pragmatic choice since it keeps your vectors alongside your relational data. For large-scale applications with billions of vectors, a dedicated vector database is worth the operational overhead.

Dimensionality and Tradeoffs

────────────────────────────────────────

Higher-dimensional embeddings capture more nuance but cost more storage and compute. A 3072-dimensional vector takes twice the storage of a 1536-dimensional one, and similarity searches over it are correspondingly slower.

For most applications, 768 to 1536 dimensions hit the sweet spot. Going higher helps for specialized domains with fine-grained distinctions. Going lower (256-512) works well when you need speed and have simpler similarity requirements.

Some models support [Matryoshka representation learning], which means you can truncate vectors to lower dimensions while retaining most of the quality. OpenAI's text-embedding-3 models support this.
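Mechanically, truncation is just keeping the first n components and re-normalizing, as this sketch shows (the 8-dimensional vector is invented for illustration). Note that this only preserves quality for models trained with Matryoshka representation learning; truncating an ordinary model's output discards information arbitrarily:

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def truncate(v, dims):
    # Keep the first `dims` components, then re-normalize so cosine
    # similarity and dot product remain well-behaved afterwards.
    return normalize(v[:dims])

# An invented 8-dimensional vector standing in for a model's output.
# Matryoshka-trained models front-load the most important information
# into the earliest dimensions, which is what makes truncation safe.
full = normalize([0.90, 0.40, 0.20, 0.10, 0.05, 0.03, 0.02, 0.01])
short = truncate(full, 4)

print(len(short))                 # 4 dimensions remain
print(sum(x * x for x in short))  # ~1.0: unit length restored
```

With OpenAI's text-embedding-3 models you can request fewer dimensions directly via the API's `dimensions` parameter instead of truncating client-side.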

A Practical Semantic Search Example

────────────────────────────────────────

Here is how you would build a basic semantic search system:

  1. [Prepare your documents]: Break your content into chunks, typically 200-500 tokens each.
  2. [Generate embeddings]: Pass each chunk through an embedding model to get a vector.
  3. [Store in a vector database]: Index the vectors with metadata like the source document, section, and URL.
  4. [Query at search time]: When a user searches, embed their query with the same model, then find the nearest vectors in your database.
  5. [Return results]: The chunks with the highest similarity scores are your search results.
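The five steps above can be sketched end to end. To keep the example self-contained and runnable, the `embed` function here is a bag-of-words stand-in over a tiny invented vocabulary, and a plain list stands in for the vector database; in a real system you would call an embedding model and query a proper index:

```python
import math

# Stand-in for a real embedding model: a normalized bag-of-words vector
# over a tiny fixed vocabulary. Invented purely so the pipeline below is
# runnable; a real model captures synonyms, which word counting cannot.
VOCAB = ["error", "exception", "python", "handle", "install", "package"]

def embed(text):
    words = text.lower().split()
    v = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# Steps 1-3: chunk, embed, and index (a list stands in for a vector DB).
chunks = [
    "exception handling in python with try except",
    "how to install a python package with pip",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Steps 4-5: embed the query with the SAME model, rank by dot product
# (equal to cosine similarity here, since the vectors are normalized),
# and return the top-scoring chunks.
def search(query, k=1):
    q = embed(query)
    ranked = sorted(index,
                    key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [chunk for chunk, _ in ranked[:k]]

print(search("handle error in python"))
```

The structure — chunk, embed, index, embed the query, rank by similarity — is exactly what you would build with a real model and a vector database; only the two stand-ins change.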

The beauty of this approach is that it understands meaning, not just keywords. A search for "how to handle errors in Python" will surface content about "exception handling" and "try-except blocks" even though the query contains neither phrase.

Key Takeaways

────────────────────────────────────────

Embeddings are one of the most practical tools in AI development. They turn unstructured data into something you can compute with. Whether you are building search, recommendations, or RAG systems, understanding embeddings and how to work with them is essential knowledge for building with AI.
