The SQL Trap - Why You Should Never Use Map or Join in Splunk

Feb 25, 2026

If you come from a relational database background, your first instinct when using Splunk is often to reach for join or map. It makes sense: you have an IP address in your firewall logs and a username in your identity logs, and you want to link them together. In the SQL world, a JOIN is bread and butter. In Splunk, it’s a recipe for performance degradation and silent data loss. In this post, I’ll explore why these commands are often “traps” and how to use the Stats Pattern to create faster, more reliable queries.

The Technical Debt: Why map and join Fail

There are two main challenges we run into while using the map and join commands.

  • Performance Overhead: map launches a brand-new search for every single result of the outer query, putting immense strain on the search head. This is exactly why the map command defaults to a maximum of 10 subsearches.
  • Hidden Limits: Splunk truncates subsearch results without warning. join caps its subsearch at 50,000 results by default, and ordinary subsearches are typically capped even lower (around 10,000), leading to silent data loss.
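If you do push forward with map on a larger result set, the command accepts a maxsearches argument to raise the default cap of 10. A minimal sketch (the index and field names are illustrative, and note that raising the cap only trades truncation for even more search head load):

index=dns_logs query IN ("evilbadguys.com")
| dedup src_ip
| map maxsearches=100 search="search index=identity_logs src_ip=$src_ip$"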

Case Study: Hunting for Malware Phone Home

In this scenario, I needed to link two datasets: DNS logs containing malicious indicators and identity logs that map IP addresses to specific users.

Normally, on a smaller scale, I would manually copy and paste the affected source IPs into a custom Splunk dashboard, set a time range for the incident, and after a minute of waiting I would find the associated user. This is a long and painstaking process for anything more than a few host IPs, and in this case each affected user had a dozen different leased IPs across my 7-day incident window. I needed a better solution.

My first instinct was to use the map command. I had already gathered the malicious DNS queries and associated source IPs. The next logical step was to take those IPs and use a subsearch to find the user – an automated solution that mimicked my previous Splunk dashboard workflow. Below is my original search, which silently limited my results:

index=dns_logs query IN ("evilbadguys.com", "etphonehome.net", "criminals.us")
| dedup src_ip | table _time, src_ip
| eval earliest=_time - 7200 | eval latest=_time + 7200
| map search="search index=identity_logs AND src_ip=$src_ip$ earliest=$earliest$ latest=$latest$"
| table user,_time,src_ip

Since I wasn’t getting the results I’d expected, I tried join. Once again my search results were silently truncated:

index=dns_logs query IN ("evilbadguys.com", "etphonehome.net", "criminals.us")
| dedup src_ip
| rename _time as dns_time
| join src_ip
	[search index=identity_logs | rename _time as auth_time | table src_ip, auth_time, user]
| where auth_time >= (dns_time - 7200) AND auth_time <= (dns_time + 7200)
| table dns_time, auth_time, src_ip, query, user
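One partial workaround is to shrink the join subsearch before it ever approaches the cap, for example by using an inner subsearch as a filter so that only the suspect IPs are pulled from the identity logs at all. This is a sketch, and it still inherits join's result and runtime limits:

index=dns_logs query IN ("evilbadguys.com", "etphonehome.net", "criminals.us")
| dedup src_ip
| rename _time as dns_time
| join src_ip
	[search index=identity_logs
		[search index=dns_logs query IN ("evilbadguys.com", "etphonehome.net", "criminals.us") | dedup src_ip | fields src_ip]
	| rename _time as auth_time | table src_ip, auth_time, user]
| where auth_time >= (dns_time - 7200) AND auth_time <= (dns_time + 7200)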

The Optimal Solution Using the Stats Pattern

After researching the Splunk Community forums, I found a far superior technique. By bringing both datasets into the pipeline at once and using stats, you can recreate the effect of a join without the performance penalties:

(index=dns_logs query IN ("evilbadguys.com", "etphonehome.net", "criminals.us")) OR (index=identity_logs)
| eval dns_time = if(index=="dns_logs", _time, null())
| eval auth_time = if(index=="identity_logs", _time, null())
| stats latest(dns_time) as dns_time latest(query) as query values(auth_time) as auth_time values(user) as user by src_ip
| where isnotnull(dns_time) AND isnotnull(user)
| mvexpand auth_time
| where auth_time >= (dns_time - 7200) AND auth_time <= (dns_time + 7200)
| table dns_time, auth_time, src_ip, query, user

Wrapping It Up

Using the stats pattern is almost always faster and more reliable than using map or join. As a rule of thumb, if you think you need a map, join, or a subsearch of any kind, you probably just need an eval and a stats command.
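If you want a reusable starting point, the pattern distills to three steps: OR both datasets into one search, tag each source's fields with eval, then roll everything up by the shared key. The names below are placeholders, not a drop-in query:

(index=dataset_a <filters>) OR (index=dataset_b <filters>)
| eval field_a=if(index=="dataset_a", some_field, null())
| eval field_b=if(index=="dataset_b", other_field, null())
| stats values(field_a) as field_a values(field_b) as field_b by join_key
| where isnotnull(field_a) AND isnotnull(field_b)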