Announcement · 10 min read

AI SMS Classification on GPU: Deploy OpenTextShield in 5 Minutes

Run the OpenTextShield BERT SMS classifier on GPU with one Docker command. 99.7% accuracy, <1s latency, 100 MPS on NVIDIA T4. Open source.


TL;DR: OpenTextShield v2.9 ships as a multi-arch Docker image with GPU acceleration and dynamic batching built in. Pull the image, run it with --gpus all, and you have a production-grade AI SMS classification service answering in milliseconds — 99.7% accuracy, <1s latency, ~100 messages/sec sustained on an AWS g4dn.4xlarge (NVIDIA T4). Source is on GitHub. The full 5-step quickstart is on the /opentextshield page.


Every network engineer running an SMS pipeline in 2026 has the same problem: A2P volumes keep climbing, phishing and smishing templates mutate faster than any rule engine can keep up, and the commercial AI APIs that can classify this traffic reliably are either too slow, too expensive, or send your subscribers' message content to a third-party cloud.

This post is a field guide to the alternative: running a BERT-based AI SMS classifier locally on a GPU, deployed in the time it takes to read this paragraph, with cost, latency, and data-residency numbers that actually work for a telecom.

Why AI/ML Classification Beats Regex for SMS

A rule-based SMS firewall is fundamentally a pattern matcher: keyword lists, URL blocklists, sender-ID regexes, maybe a Bloom filter over known bad hashes. It works for about a week after you ship it. Then the spammers adapt — they swap bit.ly for tinyurl, replace Latin letters with Cyrillic lookalikes (аррlе vs apple), rotate keywords (URGENT!ATTENTION!IMMEDIATE ACTION), and you're back to re-authoring rules at 3am.
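The homoglyph trick is worth seeing concretely. This is an illustrative Python sketch (not OpenTextShield code): a keyword rule that catches the plain template misses the Cyrillic-lookalike variant entirely, and Unicode normalisation doesn't rescue it.

```python
import re
import unicodedata

# A naive firewall rule: flag anything mentioning "apple" plus a link.
RULE = re.compile(r"apple", re.IGNORECASE)

latin = "Your apple account is locked, click http://bit.ly/x"
homoglyph = "Your аррlе account is locked, click http://bit.ly/x"  # Cyrillic а, р, е

print(RULE.search(latin) is not None)       # True  - the plain template matches
print(RULE.search(homoglyph) is not None)   # False - the lookalike slips through

# NFKC normalisation does not fold Cyrillic lookalikes into Latin,
# so "just normalise the text first" does not fix this class of evasion:
print(unicodedata.normalize("NFKC", homoglyph) == latin)   # False
```

Three characters swapped, zero rules matched. A model classifying on meaning is indifferent to which alphabet the attacker typed "apple" in.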

A transformer model classifies on meaning rather than surface patterns. OpenTextShield uses multilingual BERT (mBERT), fine-tuned on a labelled corpus of ham, spam, and phishing messages across 50+ languages. The model has internalised what a "your package is stuck at customs, click to pay the fee" phishing template looks like in English, Arabic, Portuguese, and Mandarin simultaneously — in all the variations it has seen, and crucially, in variations it hasn't. That generalisation is what keeps the numbers steady as traffic rotates.

In production: 99.7% accuracy across a blended multilingual test set, vs. the 85–92% range a well-tuned regex firewall typically starts at (and the long decay curve afterward). The gap matters twice: once for the ham you don't want to accidentally block, and again for the fraud that slips through a stale rule list.

Why GPU, Why Local

The calculus looks different for a network engineer than for an application developer. Three constraints dominate:

Latency

An inline SMS pipeline doesn't have 2–30 seconds to wait for a cloud AI API. Carriers operate in milliseconds — SS7 signalling, SMSC acknowledgements, SMPP submit_sm_resp timing. Any inline classification stage needs to add tens of milliseconds, not tens of seconds. On a co-located NVIDIA T4 running OpenTextShield with FP16 and dynamic batching, p95 inference is under 50ms end-to-end.

Cost at Volume

Commercial AI APIs price between $0.01 and $0.03 per request. Run that against a tier-1 operator SMS firehose:

  • At 100 messages/sec sustained (8.64M/day), commercial APIs cost $86,400 to $259,200 per day.
  • At 1,000 messages/sec (86.4M/day), multiply by ten.
  • A single AWS g4dn.4xlarge with an NVIDIA T4 is ~$1,200/month on-demand, far less on reserved capacity. It handles the same 100 MPS with headroom.

The break-even is measured in hours, not months. And that's before you factor in that you control the model — no surprise pricing changes, no deprecations, no rate limits.
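The arithmetic behind that claim, using only the figures above, fits in a few lines:

```python
# Back-of-envelope break-even, using the article's own numbers.
mps = 100                                   # sustained messages per second
api_cost_low, api_cost_high = 0.01, 0.03    # $ per request, commercial API range
gpu_monthly = 1200.0                        # $ on-demand, g4dn.4xlarge

api_per_day_low = mps * 86_400 * api_cost_low
api_per_day_high = mps * 86_400 * api_cost_high
print(f"API cost/day: ${api_per_day_low:,.0f} to ${api_per_day_high:,.0f}")

# Hours of API spend (at the cheap end) that equal one month of GPU rental:
api_per_hour_low = mps * 3_600 * api_cost_low
break_even_hours = gpu_monthly / api_per_hour_low
print(f"break-even: {break_even_hours:.2f} hours")
```

At the cheap end of API pricing, a full month of the GPU pays for itself in about twenty minutes of firehose traffic.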

Data Residency

Subscriber SMS content is regulated under GDPR in Europe, under operator licence conditions almost everywhere else, and under specific A2P and enterprise-customer contracts on top of both. Shipping message bodies to a commercial AI endpoint outside your network is a conversation with your legal team you don't want to have. A locally-deployed classifier keeps the content inside your perimeter. The only thing that leaves the service is a label and a probability.

Architecture at a Glance

OpenTextShield v2.9 is a single container with three moving parts:

  • Inference server — FastAPI on port 8002, exposing /predict/ for classification and /metrics for Prometheus. Runs the mBERT model through PyTorch with CUDA FP16 on GPU.
  • Dynamic batching layer — collects incoming requests in a short window (default OTS_BATCH_WAIT_MS=10) and runs them as a single GPU pass. This is what unlocks the 100 MPS number — per-request inference is fast, but the batched throughput is what actually scales with hardware.
  • Admin UI / health — port 8080, for log inspection and model info. Not required for production traffic; firewall it off if you don't need it.

The image is multi-arch (linux/amd64 + linux/arm64). Docker picks the right variant at pull time, so the same docker run works on x86 GPU hosts, Graviton CPU fallback, and NVIDIA Jetson edge boxes without changing the command.

Deploy in 5 Minutes — The Full Sequence

These are the same five commands from the deployment quickstart, with the operational notes you actually want at 2am.

1. Install the NVIDIA Driver and Container Toolkit

One-time host setup so Docker can pass the GPU into a container. Ubuntu/Debian:

sudo apt update && sudo apt install -y nvidia-driver-535-server
sudo reboot

# After reboot — install NVIDIA Container Toolkit
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check — should list your GPU
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

If nvidia-smi inside the container doesn't print your GPU details, stop here and fix the driver install — no amount of later tweaking helps if the runtime can't see the card.

2. Pull and Run OpenTextShield

Pin to an exact version in production. Don't use :latest.

docker pull telecomsxchange/opentextshield:v2.9.0

docker run -d \
  --name ots \
  --gpus all \
  --restart unless-stopped \
  -p 8002:8002 \
  -p 8080:8080 \
  telecomsxchange/opentextshield:v2.9.0

--restart unless-stopped is not optional for anything going into production — you want the container back up after a host reboot without manual intervention. Pair it with a systemd service that watches the process if you're paranoid.

3. Verify the GPU Is Actually Being Used

This is the step network engineers skip and regret. Just because the container started doesn't mean the model loaded on the GPU — a missing driver, wrong CUDA version, or permission issue will silently fall back to CPU and your MPS collapses.

# Confirm GPU acceleration
curl -s http://<host>:8002/metrics | grep ots_api_info
# Expected: ots_api_info{version="2.9.0",device="cuda",fp16="true",...} 1

# Watch live GPU utilization while sending traffic
nvidia-smi -l 2

If device="cpu" shows up, the container isn't using the GPU. Check docker inspect ots for "Runtime": "nvidia" and re-run step 1.
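If you want that check automated rather than eyeballed, a small parser over the metrics output works as a readiness gate. The sketch below assumes the ots_api_info line shown above and standard Prometheus exposition format; wire it into your deploy pipeline so a CPU fallback fails fast:

```python
import re

def gpu_ready(metrics_text: str) -> bool:
    """Return True only if ots_api_info reports device="cuda".
    Feed it the body of GET /metrics; a silent CPU fallback fails the gate."""
    m = re.search(r'ots_api_info\{([^}]*)\}', metrics_text)
    if not m:
        return False
    labels = dict(kv.split("=", 1) for kv in m.group(1).split(","))
    return labels.get("device", "").strip('"') == "cuda"

sample = 'ots_api_info{version="2.9.0",device="cuda",fp16="true"} 1'
print(gpu_ready(sample))                               # True
print(gpu_ready(sample.replace('"cuda"', '"cpu"')))    # False
```

Run it against `curl -s http://<host>:8002/metrics` in a post-deploy hook and refuse to route traffic until it returns True.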

4. Test Classification

The API takes a JSON body with text and model, returns a label plus probability.

curl -X POST http://<host>:8002/predict/ \
  -H "Content-Type: application/json" \
  -d '{"text":"Free iPhone! Click http://bit.ly/scam","model":"ots-mbert"}'

# Response: {"label":"spam","probability":0.97,"processing_time":0.04,...}

Forty milliseconds of model time on a cold T4. That's the bit you drop into your SMPP proxy or REST pipeline — the rest of the latency budget is yours to spend on network round-trip and business logic.
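For integration code, the request and response shapes are small enough to handle by hand. This sketch builds the body and parses the two fields the example above guarantees; it deliberately makes no network call so it runs anywhere (swap in urllib or requests against the curl endpoint in production):

```python
import json

def build_request(text: str, model: str = "ots-mbert") -> bytes:
    """JSON body in the shape the /predict/ example above uses."""
    return json.dumps({"text": text, "model": model}).encode()

def parse_response(raw: bytes) -> tuple:
    """Pull out the two fields every response carries: label and probability."""
    data = json.loads(raw)
    return data["label"], float(data["probability"])

body = build_request("Free iPhone! Click http://bit.ly/scam")
print(json.loads(body)["model"])    # ots-mbert

raw = b'{"label":"spam","probability":0.97,"processing_time":0.04}'
print(parse_response(raw))          # ('spam', 0.97)
```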

5. Tune for Your Hardware

Defaults assume a Tesla T4 with 16 GB VRAM. Override via environment variables for other cards.

# Smaller GPUs (under 8 GB VRAM)
docker run -d --gpus all \
  -e OTS_MAX_BATCH_SIZE=32 \
  -e OTS_BATCH_WAIT_MS=50 \
  -p 8002:8002 -p 8080:8080 \
  telecomsxchange/opentextshield:v2.9.0

# Single-request debugging — disable batching
docker run --gpus all \
  -e OTS_BATCHING_ENABLED=false \
  -p 8002:8002 \
  telecomsxchange/opentextshield:v2.9.0

When to tune batch size down: VRAM is tight, or you want lower tail latency at the cost of throughput. When to tune batch wait up: your traffic is bursty and you're not filling a batch before the timeout fires — a longer wait increases per-batch size and throughput.
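A rough throughput model makes the trade-off concrete. Each batching cycle waits up to OTS_BATCH_WAIT_MS, then spends one GPU pass on the accumulated batch; the per-batch inference times below are assumed figures for illustration, not measured OTS numbers:

```python
def max_throughput(batch_size, wait_ms, infer_ms):
    """Rough ceiling for a window-batching server: each cycle waits up to
    wait_ms, then runs one pass of infer_ms over batch_size messages.
    infer_ms is an assumed per-batch figure, not a measured OTS value."""
    return batch_size / ((wait_ms + infer_ms) / 1000.0)

# Default-ish settings: big batch, short wait -> high ceiling.
print(round(max_throughput(64, wait_ms=10, infer_ms=30)))   # 1600 req/s ceiling
# Small-GPU settings from the example above: lower ceiling, gentler on VRAM.
print(round(max_throughput(32, wait_ms=50, infer_ms=30)))   # 400 req/s ceiling
```

The model also shows why an always-size-1 batch is wasteful: the fixed per-pass cost gets paid per message instead of amortised across the batch.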

Validated Performance: The Numbers That Matter

OpenTextShield v2.9 was soak-tested on an AWS g4dn.4xlarge (NVIDIA Tesla T4, 16 GB VRAM, FP16 precision) against a realistic ham/spam/phishing workload:

  • 132,000+ messages processed during the soak.
  • ~100 messages/sec sustained throughput.
  • Zero failures. Zero timeouts.
  • <1s p99 end-to-end latency, dominated by network, not model.
  • 99.7% classification accuracy across the labelled corpus.

Your mileage will vary by GPU class and traffic mix. A T4 is the lower bound of "production-serious" — an A10G or L4 roughly doubles throughput; an H100 is overkill unless you're classifying 1,000+ MPS and bundling multiple models.

SMPP Integration for Carrier Deployments

The pattern above gives you a REST endpoint. For carriers running inline SMS firewalling, the Professional Edition ships with an SMPP-integrated stack: the classifier sits between your SMSC and the next hop, inspects every submit_sm PDU, and forwards / drops / quarantines based on the label before the PDU leaves the box.

For the Community Edition, the standard pattern is to front OpenTextShield with your existing SMPP proxy (Kannel, Jasmin, an in-house ESME) and call /predict/ on each submit. Same model accuracy; you own the SMPP plumbing. Either way, the p95 classifier overhead is <50ms — safely under any realistic SMPP timeout.
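The Community Edition glue reduces to one decision function your proxy calls per submit_sm. The sketch below stubs the classifier (in production that callable wraps the /predict/ request); the thresholds and the forward/drop/quarantine policy are illustrative assumptions, not part of OTS:

```python
def route_submit(short_message: str, classify) -> str:
    """Decide what the SMPP proxy does with one submit_sm.
    `classify` is any callable returning (label, probability) - in production
    it would call /predict/. Thresholds here are illustrative assumptions."""
    label, p = classify(short_message)
    if label == "ham":
        return "forward"
    if p >= 0.90:
        return "drop"          # confident spam/phishing never leaves the box
    return "quarantine"        # low-confidence flags go to manual review

# Stubbed classifier standing in for the REST call:
fake = lambda text: ("spam", 0.97) if "bit.ly" in text else ("ham", 0.99)
print(route_submit("Free iPhone! Click http://bit.ly/scam", fake))  # drop
print(route_submit("Your one-time code is 443211", fake))           # forward
```

Kannel, Jasmin, or an in-house ESME all slot in the same way: call the function, map "drop" to a rejected submit_sm_resp, and log the quarantines.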

Observability: What to Graph Day One

The /metrics endpoint speaks Prometheus natively. Scrape it from the same stack you already run. The dashboard you care about has four panels:

  1. Prediction counts by label (ham / spam / phishing) — so you can see traffic mix drift in real time. A sudden spike in "phishing" is usually a campaign hitting your operator.
  2. Inference latency p50/p95/p99 — set alerts at p99 > 200ms. If you cross this, GPU saturation or model loading issues are your first suspects.
  3. Batch size distribution — a healthy deployment runs near your configured OTS_MAX_BATCH_SIZE at peak. If you're always at batch size 1, your traffic is under-utilising the GPU and you can raise OTS_BATCH_WAIT_MS.
  4. GPU utilisation and VRAM — from the exposed NVML metrics. Sustained >85% is healthy; pinned at 100% means you need a bigger card or a second replica.

For soak testing, run nvidia-smi -l 2 alongside a load generator to correlate GPU utilisation with API throughput — useful for sizing before you commit to production capacity.
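Panel 1 is also scriptable. The sketch below parses per-label counts out of Prometheus exposition text; note that the metric name `ots_predictions_total` is an assumption for illustration (the article only documents `ots_api_info`), so substitute whatever your /metrics endpoint actually exposes:

```python
import re

def label_mix(metrics_text: str) -> dict:
    """Extract per-label prediction counts from Prometheus exposition text.
    `ots_predictions_total` is an assumed metric name for this sketch;
    check /metrics on your build for the real one."""
    mix = {}
    pat = re.compile(r'ots_predictions_total\{[^}]*label="([^"]+)"[^}]*\}\s+(\d+)')
    for label, count in pat.findall(metrics_text):
        mix[label] = mix.get(label, 0) + int(count)
    return mix

sample = """\
ots_predictions_total{label="ham"} 94210
ots_predictions_total{label="spam"} 4810
ots_predictions_total{label="phishing"} 320
"""
print(label_mix(sample))   # {'ham': 94210, 'spam': 4810, 'phishing': 320}
```

Diffing two scrapes a minute apart gives you the traffic-mix drift directly, which is the same signal the dashboard panel graphs.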

Community vs Professional — Which One Do You Want

Both editions use the same model architecture and deliver the same 99.7% accuracy. The split is operational:

  • Community Edition (MIT, free, on GitHub). Best for: engineering teams that want to train their own models on proprietary data, research groups, universities, small operators with in-house ML expertise. You get full source access, custom training pipelines, and zero vendor lock-in. You ship your own SMPP glue.
  • Professional Edition (commercial licence, 24/7 SLA). Best for: carrier-grade production deployments where the calculus is "get this working now, never page someone, call support when it breaks." You get pre-trained and continuously-updated models, the pre-integrated SMPP stack, and a commercial support contract. Zero-configuration install path — deploy and forward traffic.

The decision usually comes down to whether "build-vs-buy" on the SMPP integration and the model retraining cadence is worth it for your team.

Where This Fits in the Wider TCXC Stack

OpenTextShield is one layer of the TelecomsXChange autonomous telecom platform. Other parts of the stack that network engineers often pair it with:

  • MCP Server — expose OpenTextShield predictions through the Model Context Protocol so AI operations agents can query, investigate, and respond to SMS fraud events conversationally.
  • TeleAnalytix AI — the broader ML stack for wholesale voice and messaging analytics. OTS classification labels flow in as features.
  • TCXC SMPP — the wholesale SMS exchange. Pair OTS with TCXC SMPP termination to get classified, firewalled, wholesale routing in one integration.

Getting Started

  1. Clone or pull the image. GitHub repo for the Community Edition, Docker Hub for the image.
  2. Follow the 5-step quickstart above or on the /opentextshield page.
  3. Wire up Prometheus scraping against :8002/metrics. Build the four-panel dashboard the next day.
  4. For SMPP-inline production or a commercial support SLA, contact our team about the Professional Edition.

The model stays current, the container stays small, the GPU stays busy. That's the whole pitch.

