Splunk, SignalFx, and the art of not drowning in your own data
A raw, practitioner take on one of the most powerful - and most misused - observability stacks in enterprise. The good, the bad, the ugly, the MLTK deep dive, and the SPL patterns that actually keep platforms healthy.
An SVP friend asked me to review their Splunk + SignalFx + VictorOps setup with MLTK. These were the exact issues we found, and this write-up is how we fixed them.
Splunk is the observability platform enterprises love to overpay for, under-tune, and blame when things go sideways. SignalFx - acquired by Splunk in 2019 and now the core of Splunk Observability Cloud (Infrastructure Monitoring and APM) - brought real-time streaming metrics and distributed tracing into the fold. Together they're formidable. But formidable tools wielded poorly are just expensive chaos generators. Let's talk about what actually works - and what doesn't.
The good, the bad, the ugly
No sugarcoating. This stack has real strengths and real failure modes. The teams that get value from it understand both clearly.
The good
- SPL is genuinely powerful for ad-hoc log forensics
- SignalFx streaming latency - near real-time at scale
- APM service maps auto-built from OTel traces
- MLTK is best-in-class for enterprise ML on operational data
- SIEM use cases still unmatched in the market
- Dashboard Studio finally doesn't make you want to quit
The bad
- Licensing model that punishes observability hygiene
- SPL learning curve is real for SQL/PromQL people
- Splunk + SignalFx seams still visible 5 years post-acquisition
- MLTK requires actual data science chops - not plug-and-play
- Log-to-trace correlation requires explicit setup and tagging
- Runaway searches eat search head capacity with no mercy
The ugly
- Everything into one main index - the original sin
- Static alert thresholds set in 2021, never revisited
- 43 dashboards, none of which tell you if the platform is healthy
- No retention policy - everything on hot storage forever
- Realtime searches left running by someone who left the team
- Teams suppressing logs to stay under license quotas
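Two of the items above, runaway searches and orphaned realtime searches, are straightforward to audit. A minimal sketch, assuming your role is allowed to query the saved-searches REST endpoint and that realtime searches can be recognized by an rt-prefixed dispatch.earliest_time; the output names owner, app, and is_realtime are illustrative, not part of any standard:

```spl
| rest /servicesNS/-/-/saved/searches splunk_server=local
| search is_scheduled=1 disabled=0
| eval is_realtime=if(like('dispatch.earliest_time', "rt%"), "yes", "no")
| rename "eai:acl.owner" as owner, "eai:acl.app" as app
| table title, owner, app, cron_schedule, is_realtime
| sort owner, title
```

Sorting by owner makes it easier to spot scheduled searches that belong to people who have since left the team.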
Splunk core - where it still earns its price tag
Nothing in enterprise beats Splunk for ad-hoc log investigation. SPL is arcane, but once you know it, you can answer almost any question from your log data fast. Full-text indexing across billions of events, subsearches, lookups, transaction for session stitching - the search engine is legitimately excellent.
index=prod_apps sourcetype=app_logs earliest=-1h
| eval bucket=strftime(_time, "%Y-%m-%d %H")
| fillnull value="-" level error_code
| stats count as events by bucket, service, level, error_code
| eventstats sum(events) as total by bucket, service
| where level="ERROR"
| rename events as errors
| eval error_rate=round((errors/total)*100, 2)
| where error_rate > 1.5
| sort -error_rate
| table bucket, service, error_code, errors, total, error_rate
The SPL discipline that actually matters: always scope index and sourcetype first - unscoped searches fan out across all data. Use tstats for indexed-field aggregations: it reads the tsidx files directly and is often orders of magnitude faster than scanning raw events. Push filters as far left as possible, before the first pipe. And audit your scheduled searches monthly - they accumulate and quietly hammer search heads.
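To make the tstats point concrete, here is a sketch of an event-volume query that never touches raw events. It assumes the prod_apps index and app_logs sourcetype from the example above, and it only uses default indexed fields (index, sourcetype, host, _time). Search-time fields such as level are not visible to tstats unless they are indexed or exposed through an accelerated data model.

```spl
| tstats count where index=prod_apps sourcetype=app_logs earliest=-24h by host, _time span=5m
| timechart span=5m sum(count) as events by host
```

The same question asked with a raw-event search and stats returns the same numbers; the difference is how much data the indexers have to read to get there.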
SignalFx APM - the streaming side
SignalFx's streaming architecture delivers metric ingestion in the 2-5 second range - meaningfully better than polling-based systems. The service map auto-generated from distributed traces gives you real-time dependency awareness that's genuinely useful during active incidents. For high-throughput microservice architectures on Kubernetes, this is top tier.
The catch: auto-instrumentation from the Splunk OTel Distribution covers the skeleton. Getting diagnostic fidelity requires custom span attributes - tenant ID, business identifiers, feature flag state. Without those, trace data tells you something broke but not enough to fix it fast.
MLTK: the capability most teams install and promptly ignore
The Machine Learning Toolkit ships with Splunk. Most teams install it, run through the demo, and go back to threshold-based alerting. That is leaving serious value on the table.
MLTK is a full scikit-learn-backed ML framework inside SPL - over 300 algorithms from the Python for Scientific Computing library, accessible directly from search pipelines via three core ML-SPL commands: fit, apply, and score. The Assistants (guided dashboards built on the Experiment Management Framework) walk you through model training, testing, and deployment with automated versioning and lineage.
The important caveat Splunk calls out prominently in their own docs: MLTK is not a default solution. It requires domain knowledge, SPL fluency, and data science experience. If you don't have that combination, you'll train models on the wrong data, fit to stale baselines, and alert on noise. The tool is excellent. The operator discipline is what makes or breaks it.
DensityFunction. Fits a probability density function to your historical data (Normal, Exponential, Gaussian KDE, or Beta - auto-selected) and flags points in the boundary regions as outliers. Supports a by clause for per-service, per-host models in a single pass. MLTK 5.5 introduced scaled multi-group support - you can train thousands of per-entity models without performance problems. Supports incremental fit for streaming retraining.
LocalOutlierFactor. Density-based local outlier detection. Better than DensityFunction when you're looking at several correlated metrics together - latency + error rate + saturation simultaneously. The anomaly_score parameter gives you a continuous score, not just a binary flag. Good for correlated anomaly detection across signals.
OneClassSVM. Learns a tight boundary around "normal" and flags everything outside. Useful when you can't label anomalies ahead of time (which is almost always the case). Good for detecting novel failure modes that didn't exist in training data - the unknown unknowns that static rules will never catch.
-- training search (scheduled weekly) --
index=prod_apps sourcetype=app_logs earliest=-30d@d latest=@d
| bin _time span=5m
| stats avg(response_ms) as avg_lat by _time, service
| fit DensityFunction avg_lat by "service"
dist=auto threshold=0.01
into latency_baseline_model
-- detection search (scheduled every 5m) --
index=prod_apps sourcetype=app_logs earliest=-5m@m latest=@m
| stats avg(response_ms) as avg_lat by service
| apply latency_baseline_model
| where 'IsOutlier(avg_lat)'=1
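If the weekly model fires too often or too rarely, sensitivity can usually be adjusted at apply time instead of retraining. The sketch below assumes the latency_baseline_model from the training search above, and assumes that your MLTK version accepts the threshold and show_density options on apply for DensityFunction models and emits the ProbabilityDensity(avg_lat) and BoundaryRanges fields; check the algorithm reference for your version before relying on it. The field name density is illustrative.

```spl
index=prod_apps sourcetype=app_logs earliest=-5m@m latest=@m
| stats avg(response_ms) as avg_lat by service
| apply latency_baseline_model threshold=0.005 show_density=true
| where 'IsOutlier(avg_lat)'=1
| rename "ProbabilityDensity(avg_lat)" as density
| table service, avg_lat, density, BoundaryRanges
```

Note the single quotes around 'IsOutlier(avg_lat)': without them, where and eval would parse the field name as a function call.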
StateSpaceForecast. The Smart Forecasting Assistant uses this under the hood. Handles seasonality and trend automatically - no manual p, d, q parameter tuning. Forecasts CPU usage, error rates, queue depths 30-60 minutes out with confidence intervals. When your forecast fires ahead of the user-visible problem, that is the alert that saves the incident.
ARIMA. Available via the Forecast Time Series Assistant. More manual to configure but gives you more control for well-understood periodic signals. Good for capacity forecasting where you have clean weekly seasonality and want to explain the model to stakeholders.
predict command. The built-in SPL predict command uses Kalman-filter-based algorithms (LL, LLT, LLP, LLP5) for short-term trend extrapolation without MLTK overhead. Much simpler to deploy - useful for quick "where is this metric heading in the next 30 minutes" panels on operational dashboards without the full MLTK model pipeline.
index=prod_apps sourcetype=app_logs level=ERROR earliest=-7d
| timechart span=1h count as error_count
| fit StateSpaceForecast error_count
holdback=24 forecast_k=12 conf_interval=95
into error_forecast_model
| rename "predicted(error_count)" as predicted,
"lower95(predicted(error_count))" as lower95,
"upper95(predicted(error_count))" as upper95
| eval status=if(error_count > upper95, "spike",
if(error_count < lower95, "drop", "normal"))
| table _time, error_count, predicted, lower95, upper95, status
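For the lighter-weight option, a dashboard panel built on the native predict command can be sketched as follows. It assumes the same prod_apps error events as above; the output names forecast, upper, and lower are illustrative. With 5-minute buckets, future_timespan=6 extends the series roughly 30 minutes ahead.

```spl
index=prod_apps sourcetype=app_logs level=ERROR earliest=-24h
| timechart span=5m count as error_count
| predict error_count as forecast algorithm=LLT future_timespan=6 upper95=upper lower95=lower
```

There is no saved model here: the forecast is recomputed on every run, which is fine for a panel but not a substitute for a scheduled, versioned MLTK model.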
RandomForestClassifier. Train on labeled historical incidents to classify incoming alerts: "disk issue, network issue, or application bug?" Built-in feature importance scoring tells you which log fields are most predictive - invaluable for mature teams doing postmortem analysis. Best accuracy in the classifier lineup.
LogisticRegression. Faster to train than RandomForest and fully interpretable. Good for binary classification: "will this deployment succeed based on pre-deploy health signals?" Used in the Predict Categorical Fields Assistant. The 70/30 default train/test split in MLTK 4.4+ is sensible - use it.
DecisionTreeClassifier. You can inspect the actual decision tree and explain exactly why the model flagged something. For compliance-heavy or regulated environments where you need to justify model decisions to an audit or a change management board, DecisionTree wins over RandomForest every time.
KMeans. Group log messages by similarity during an active incident. Invaluable for finding the needle - the new error pattern that appeared 10 minutes before things went sideways, buried in 50k events per minute. Set k to 8-15 for most production environments and let it surface pattern clusters you'd never find manually.
DBSCAN. Finds clusters without you specifying k, and handles noise points natively. Better than KMeans when you don't know how many failure patterns to expect. Handles irregular, non-spherical clusters in high-dimensional log feature space. Good for novel incident pattern discovery.
cluster command. Native SPL text-similarity clustering without MLTK overhead. No model management, no saved model objects. Good for quick log pattern triage during active incidents when you need grouping in seconds - not the full MLTK pipeline. Reach for this one first during an active P1.
index=prod_apps sourcetype=app_logs level=ERROR earliest=-30m
| eval msg_len=len(message)
| eval has_timeout=if(match(message,"timeout"),1,0)
| eval has_conn=if(match(message,"connection"),1,0)
| eval has_oom=if(match(message,"OutOfMemory|OOM"),1,0)
| fit KMeans k=8 msg_len has_timeout has_conn has_oom
into incident_cluster_model
| stats count by cluster, service
| sort -count
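For comparison, the native cluster command needs no feature engineering and no saved model. A sketch under the same assumptions as the KMeans example (a message field on prod_apps error events); the similarity threshold t=0.8 is a starting point to tune, not a recommendation:

```spl
index=prod_apps sourcetype=app_logs level=ERROR earliest=-30m
| cluster showcount=t t=0.8 field=message
| table cluster_count, service, message
| sort -cluster_count
```

Each output row is one representative event per cluster. During an incident the interesting rows are often near the bottom: small, recently appeared clusters rather than the dominant ones.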
The fit-apply-score workflow
The core MLTK pipeline is three commands. fit trains a model on historical data and persists it as a saved model file (a lookup-style knowledge object with its own permissions). apply runs that saved model against incoming data - this is what goes in your scheduled detection searches running every 1-5 minutes. score validates model quality: accuracy, F1, AUC for classifiers; R2 and MSE for regressors. The Experiment Management Framework tracks model versions and retraining history automatically via the Experiment History tab.
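The score step is the one most often skipped, so here is a minimal sketch of it for a classifier. Everything named here is hypothetical: a prod_incidents index holding labeled alerts, a category field with the true label, and a previously trained alert_classifier_model.

```spl
index=prod_incidents sourcetype=alert_history earliest=-7d
| apply alert_classifier_model as predicted_category
| score accuracy_score category against predicted_category
```

Swapping accuracy_score for confusion_matrix shows which categories the model confuses with each other, which is usually more actionable than a single accuracy number. Score on data the model was not trained on, otherwise the result is flattering and meaningless.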
Log reliability, better indexes, better performance
The observability is only as good as the data pipeline feeding it.
Index design - the single most impactful architectural decision
Dumping everything into a single main index is the original Splunk sin. It destroys search performance, makes RBAC impossible to implement cleanly, and means every search fans out across all data. Structure by lifecycle and access pattern.
- Separate by retention tier. prod_apps_hot (7d), prod_apps_warm (30d), prod_apps_cold (90d). Use SmartStore to tier cold data to S3/GCS at a fraction of local disk cost. Match retention to actual SLA requirements.
- Separate security from ops. Auth, audit, and firewall logs belong in an RBAC-controlled index. Mixing them with app logs is a security posture problem and a search noise problem simultaneously.
- Use metrics indexes for time-series data. tsidx-backed metrics indexes make mstats queries 5-10x faster than log-based rex extraction for numeric data. If you're storing metrics as log events, stop.
- Materialize heavy recurring searches into summary indexes. Any scheduled search running more than hourly against large volumes should write to a summary index. SLA compliance reports and executive dashboards should never rebuild from raw data on load.
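The summary-index pattern from the last bullet can be sketched as an hourly scheduled search that writes pre-aggregated rows with collect. The summary_sla index name is an assumption and the index must already exist:

```spl
index=prod_apps sourcetype=app_logs earliest=-1h@h latest=@h
| bin _time span=1h
| stats count as total, count(eval(level="ERROR")) as errors by _time, service
| collect index=summary_sla
```

A 90-day report then reads only the small summary rows instead of raw events. The success_pct field name is illustrative:

```spl
index=summary_sla earliest=-90d@d
| stats sum(errors) as errors, sum(total) as total by service
| eval success_pct=round(100*(1-errors/total), 3)
```

Snapping both ends of the time range to the hour (@h) keeps consecutive runs from overlapping or leaving gaps.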
Structured logging and props.conf
Push structured JSON logging from application teams. INDEXED_EXTRACTIONS = json in props.conf extracts fields once at index time instead of running expensive rex at search time, at the cost of somewhat larger index files. Make it a platform onboarding standard. And use transforms.conf to drop noise at the Heavy Forwarder layer before it ever hits an index - health checks, debug logs, chatty heartbeats. Every GB you nullqueue is license cost you don't pay.
# props.conf
[prod_app_json]
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = timestamp
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%z
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRANSFORMS-nullqueue = drop_health_checks, drop_debug_logs
# transforms.conf
[drop_health_checks]
# REGEX values are taken literally, so no surrounding quotes
REGEX = (GET|HEAD) /(health|ready|live)z?( |$)
DEST_KEY = queue
FORMAT = nullQueue
[drop_debug_logs]
# (?i) also matches "DEBUG"; the quotes here are the literal JSON quotes
REGEX = (?i)"level":\s*"debug"
DEST_KEY = queue
FORMAT = nullQueue
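To see whether the nullQueue rules actually move the license number, daily ingest per index can be read from the license usage log. A sketch, assuming the search runs where the license manager's _internal data is searchable:

```spl
index=_internal source=*license_usage.log* type=Usage earliest=-30d@d
| eval GB=round(b/1024/1024/1024, 3)
| timechart span=1d sum(GB) by idx
```

Compare the days before and after the transforms were deployed. Splitting by st (sourcetype) instead of idx shows which sources are worth filtering next.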
Mapping D.U.R.E.S.S. onto the Splunk stack
If you're not covering all six signals, you have blind spots. Here's exactly where each one lives.
Which tool for which question
| Question | Tool | Why |
|---|---|---|
| What errors happened in the last hour? | Splunk logs | Full-text SPL search - nothing beats it for log forensics |
| Which service is slow right now? | SignalFx APM | Streaming latency percentiles with 2-5s refresh on service map |
| Is this error rate unusual for this time of day? | MLTK DensityFunction | Learned baseline split by hour of day (via the by clause), not a static threshold |
| Why did this specific request fail? | SignalFx APM | Trace waterfall + related logs via trace ID correlation |
| Where is error volume heading in 2 hours? | MLTK StateSpaceForecast | Time-series forecast with 95% confidence bounds |
| What patterns appeared right before the incident? | MLTK KMeans / cluster | Log clustering surfaces novel patterns in high-volume streams |
| Is our infrastructure healthy right now? | SignalFx Infra | Real-time streaming host/container metrics with auto-discovery |
| 90-day SLA compliance report? | Splunk summary index | Pre-aggregated materialized data - never rebuild from raw |
| Security: who accessed what and when? | Splunk ES / SIEM | Where Splunk started. Still unmatched for security analytics. |
Splunk and SignalFx together are a genuinely capable observability stack. Capability without discipline is just expensive noise.
The teams that get real value here invest in index architecture, push structured logging at the source, use MLTK to move beyond static thresholds, instrument with OTel properly, and treat their SPL like production code - reviewed, owned, and retired when it's no longer earning its runtime cost.
The teams that struggle treat Splunk as a log vacuum cleaner, pile up dashboards nobody reads, set alerts in 2021 and never tune them, and then wonder why their on-call rotation is miserable and their license renewal is a budget crisis.
The platform is excellent. The discipline is the hard part. It always is.