Natural Language Processing (NLP) pipelines transform raw text into structured, actionable information. Despite the rise of large language models that can handle many NLP tasks in a single prompt, well-designed pipelines remain essential for production systems that need reliability, efficiency, and maintainability. This guide covers pipeline design for common enterprise NLP tasks.

Pipeline Architecture

An NLP pipeline consists of sequential stages, each transforming the data for the next:

Raw Text -> Preprocessing -> Analysis -> Post-processing -> Output

Preprocessing Stage

Text extraction. Convert source documents into plain text. Handle PDFs (with layout preservation), HTML (strip tags, preserve structure), Word documents, and email formats. Use format-specific extractors (PyMuPDF for PDFs, BeautifulSoup for HTML).

Encoding normalization. Convert all text to UTF-8. Detect and handle encoding issues (mojibake, mixed encodings). This prevents downstream processing errors.

Text cleaning. Remove or normalize boilerplate text, headers, footers, navigation elements, and formatting artifacts. Be conservative - removing too much context can harm downstream tasks.

Language detection. Identify the language of each document. Route non-target-language documents to appropriate processing or flag them for review. Use libraries like langdetect or fastText for reliable detection.

Sentence segmentation. Split text into sentences. This seems trivial but abbreviations (Dr., U.S., etc.), decimal numbers, and domain-specific patterns make it tricky. Use spaCy or NLTK with domain-specific rules.

Tokenization. Split text into tokens (words, subwords, or characters). The tokenization method should match the downstream model’s expectations.

Analysis Stage

The analysis stage applies one or more NLP tasks:

Named Entity Recognition (NER). Extract entities: people, organizations, locations, dates, amounts, product names. Use pre-trained models (spaCy, Hugging Face) for common entities. Fine-tune for domain-specific entities (medical conditions, legal terms, product codes).

Text classification. Categorize documents or passages: sentiment (positive/negative/neutral), topic (billing, technical, sales), intent (complaint, question, request), priority (high/medium/low). Fine-tuned classifiers or prompted LLMs both work, with different cost-accuracy-latency tradeoffs.

Relation extraction. Identify relationships between entities. “Company X acquired Company Y on Date Z.” This is harder than NER and typically requires either fine-tuned models or LLM prompting.

Summarization. Generate concise summaries of longer documents. Extractive summarization selects key sentences from the original text. Abstractive summarization generates new text that captures the main points. LLMs excel at abstractive summarization.

Key phrase extraction. Identify the most important phrases in a document. Useful for tagging, indexing, and quick document overview.

Post-processing Stage

Confidence filtering. Remove or flag results below a confidence threshold. An entity extraction with 40% confidence is more likely to be wrong than useful.

Deduplication. Merge duplicate entities (“IBM” and “International Business Machines” should resolve to the same entity). Use entity linking or simple rule-based matching.

Normalization. Standardize extracted values. Dates to ISO 8601 format, amounts to numeric values, names to canonical forms.

Validation. Check extracted information against business rules and known constraints. A date of “February 30” is clearly wrong regardless of model confidence.

Output formatting. Structure results in the format downstream systems expect (JSON, database records, API responses).

LLM-Based vs Traditional NLP

When to Use LLMs

Complex tasks. Tasks requiring understanding of context, nuance, or world knowledge. Summarization, question answering, and complex classification are well-suited to LLMs.

Low-volume, high-value. When each document is valuable enough to justify higher per-document processing cost. Legal document analysis, medical record review.

Rapid prototyping. LLMs can handle new NLP tasks with prompting alone, without training data or model development.

When to Use Traditional NLP

High-volume processing. When processing millions of documents, traditional models (spaCy, fine-tuned transformers) are orders of magnitude cheaper per document.

Low-latency requirements. Traditional models process in milliseconds; LLM API calls take hundreds of milliseconds to seconds.

Simple, well-defined tasks. Language detection, tokenization, basic NER for common entity types. Using an LLM for these is like using a sledgehammer for a thumbtack.

Hybrid Approach

The most practical approach combines both: traditional NLP for preprocessing and simple tasks, LLMs for complex analysis.

Example pipeline for customer email processing:

  1. Text extraction (traditional) - Extract text from email
  2. Language detection (traditional) - Identify language
  3. Entity extraction (traditional) - Extract names, dates, account numbers
  4. Classification (fine-tuned model or LLM) - Classify intent and priority
  5. Summarization (LLM) - Generate a brief summary for the support agent
  6. Suggested response (LLM) - Draft a response for agent review

Error Handling

NLP pipelines must handle messy real-world text:

Encoding errors. Corrupted characters, mixed encodings, binary content mixed with text. Detect and either fix or skip.

Empty or minimal content. Documents with no meaningful text content. Detect and route to manual review.

Unusual formats. Scanned documents with poor OCR, tables without context, code mixed with text. Have fallback processing for non-standard inputs.

Model errors. NLP models will misclassify text, miss entities, and generate incorrect summaries. Design the pipeline so that individual model errors do not cascade.

Monitoring

Track pipeline health:

  • Processing volume: documents processed per time period
  • Error rate: percentage of documents failing at each stage
  • Latency: processing time per document and per stage
  • Quality metrics: classification accuracy, entity extraction precision/recall (sampled and human-verified)
  • Drift detection: changes in input data distribution (document length, language mix, topic distribution)

Well-designed NLP pipelines are modular, testable, and observable. Each stage has clear inputs and outputs, can be tested independently, and produces metrics that enable monitoring. Build the pipeline in stages, starting with the simplest version that delivers value, and add complexity as needs evolve.