Embedding models convert text, images, or other data into numerical vectors that capture semantic meaning. The choice of embedding model directly impacts the quality of semantic search, RAG retrieval, and recommendation systems. With dozens of options available, selecting the right one requires understanding the tradeoffs between quality, speed, cost, and dimensionality.

What Makes a Good Embedding Model

Retrieval quality. The model should place semantically similar content close together in vector space. Measured by benchmarks like MTEB (Massive Text Embedding Benchmark) for text embeddings.

Dimensionality. Higher dimensions capture more nuance but increase storage cost and search latency. Common dimensions: 384, 768, 1024, 1536, 3072.

Speed. How fast the model generates embeddings. Important for real-time applications and large-scale batch processing.

Context window. How much text can the model process at once. Models with larger context windows can embed longer documents without chunking.

Cost. API-based models charge per token. Self-hosted models have compute costs. The cost per million embeddings varies by orders of magnitude.

API-Based Models

OpenAI text-embedding-3-large. 3072 dimensions (configurable). Strong benchmark performance. $0.13 per million tokens. 8191 token context window. Easy to use but creates vendor dependency.

OpenAI text-embedding-3-small. 1536 dimensions. Good quality at lower cost ($0.02 per million tokens). Suitable when budget is constrained or storage is a concern.

Amazon Titan Embeddings V2. 1024 dimensions. Available through Amazon Bedrock. Competitive quality. Good choice for AWS-centric architectures to keep data within AWS.

Cohere embed-v3. 1024 dimensions. Strong multilingual performance. Supports both search and classification use cases. Available through Bedrock and Cohere API.

Voyage AI. Multiple variants optimized for specific domains (code, legal, finance). 1024 dimensions. Competitive with OpenAI on domain-specific tasks.

Open Source / Self-Hosted Models

BGE-large-en-v1.5. 1024 dimensions. Consistently near the top of MTEB benchmarks. Open source (MIT license). Can be self-hosted for zero per-token cost.

E5-large-v2. 1024 dimensions. Strong general-purpose embeddings. From Microsoft Research. Open source.

GTE-large. 1024 dimensions. From Alibaba DAMO Academy. Competitive benchmark performance. Apache 2.0 license.

all-MiniLM-L6-v2. 384 dimensions. Much smaller and faster than the above models. Lower quality but excellent for prototyping and resource-constrained environments.

Nomic-embed-text. 768 dimensions. Open source with strong performance. Supports long context (8192 tokens).

Selection Criteria

By Use Case

RAG retrieval. Prioritize retrieval quality. Use a model that scores well on MTEB retrieval benchmarks. OpenAI text-embedding-3-large or BGE-large are strong choices.

Semantic search. Similar to RAG but may need multilingual support. Cohere embed-v3 for multilingual; OpenAI or BGE for English-focused.

Classification or clustering. Some models are optimized for these tasks separately. Check task-specific benchmarks, not just overall MTEB scores.

Code search. Use models trained on code. Voyage Code or OpenAI with code-specific fine-tuning.

By Constraint

Latency-sensitive. Smaller models (MiniLM, small variants) generate embeddings faster. Consider batch pre-computation.

Cost-sensitive. Self-hosted open source models eliminate per-token costs. The trade-off is infrastructure management. For API models, OpenAI text-embedding-3-small is the best value.

Privacy-sensitive. Self-hosted models keep data on your infrastructure. No data is sent to third-party APIs. BGE, E5, and GTE are excellent self-hosted options.

Multilingual. Cohere embed-v3 and multilingual-e5-large handle multiple languages well. English-only models will produce poor results on non-English text.

Evaluation Methodology

Do not rely solely on public benchmarks. Evaluate on your actual data:

Step 1: Create an Evaluation Dataset

Build a dataset of queries and relevant documents from your domain. 100-200 query-document pairs is sufficient for initial evaluation. Have domain experts label relevance.

Step 2: Generate Embeddings

Embed your evaluation dataset with each candidate model. Record time and cost.

Step 3: Measure Retrieval Quality

For each query, retrieve the top-K documents using vector similarity. Compute:

  • Recall@K: What fraction of relevant documents appear in the top K results?
  • MRR (Mean Reciprocal Rank): How high does the first relevant result appear?
  • nDCG: How well are the relevant results ordered?

Step 4: Compare Tradeoffs

Plot quality vs. cost and quality vs. latency. The best model depends on where your constraints are:

  • If quality is paramount and budget allows: OpenAI text-embedding-3-large or BGE-large self-hosted
  • If cost is the primary constraint: OpenAI text-embedding-3-small or self-hosted MiniLM
  • If latency is critical: smallest model that meets quality requirements

Practical Considerations

Embedding dimension reduction. OpenAI’s text-embedding-3 models support Matryoshka representation learning, allowing you to truncate dimensions without retraining. This lets you trade quality for efficiency.

Batch processing. Pre-compute embeddings for your document corpus and store them. Only compute embeddings at query time for the user’s query. This makes model speed less critical for the document side.

Model updates. When you switch embedding models, you must re-embed your entire corpus. Plan for this migration cost when evaluating models.

Hybrid search. Combining embedding-based search with keyword-based search (BM25) often outperforms either alone. Consider this before investing in a more expensive embedding model.

The embedding model is a foundational choice that affects the entire retrieval pipeline. Invest the time to evaluate properly on your data - public benchmarks predict general performance but not domain-specific results.