The secret behind Splunk's speed and awesomeness - the MapReduce algorithm

Splunk is a very powerful tool; we all know that. But what makes it so powerful and fast? The answer is not always simple to convey, due to the advanced nature of the technology behind it. The short answer, without much explanation, is that it uses MapReduce technology and writes data to disk very quickly so it can be searched immediately. Pretty vague, and it doesn't really explain anything. So I put together this document to discuss just what makes Splunk so fast and what MapReduce is, but to convey it all in very basic terms and examples.
Why is Splunk fast?
The simple answer is parallel processing via MapReduce methodologies. For this section, we are going to focus primarily on the parallel processing aspect, which is the first step to MapReduce. Splunk has the ability to take a search and break it up into smaller parts to get you the answer faster. To understand this better, let's start with how Splunk scales.
The typical components of a Splunk deployment are the following (all of which can exist on a single machine):
- Search Head: the web service you log in to through your browser to submit searches, view dashboards, etc.
- Indexer: does the initial parsing of event data and stores it to disk
- Forwarder: gathers the event data and delivers it to an indexer
In a distributed search setup, the search head knows about all the indexers in the environment. When a user submits a search, the search head sends it to each indexer, and the indexers run it in parallel. Each indexer holds a portion of the entire data set, so each returns results for its portion of the data. The search head then aggregates all of the partial results back together and gives the user the final answer. Done this way, you get your results much faster, since search speed scales roughly linearly with each additional indexer you add. This is the first thing that gives Splunk its speed.
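To make that concrete, here is a minimal Python sketch of the scatter/gather pattern just described: the same search runs against each indexer's slice of the data in parallel, and the partial results are merged at the end. This is an illustration of the concept, not Splunk's actual internals, and the indexer slices and events are made up for the example.

    from concurrent.futures import ThreadPoolExecutor
    from collections import Counter

    # Hypothetical data slices: each "indexer" holds a portion of the events.
    indexers = [
        [{"src_ip": "192.168.1.1", "host": "web01"},
         {"src_ip": "10.0.0.5",    "host": "web02"}],
        [{"src_ip": "192.168.1.1", "host": "web02"},
         {"src_ip": "192.168.1.1", "host": "web01"}],
    ]

    def run_on_indexer(events):
        # Each indexer searches only its own slice and returns a partial count.
        return Counter(e["host"] for e in events if e["src_ip"] == "192.168.1.1")

    # The "search head" dispatches the search to all indexers in parallel...
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_on_indexer, indexers))

    # ...then merges the partial results into the final answer.
    final = sum(partials, Counter())
    print(final)  # Counter({'web01': 2, 'web02': 1})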
You may be asking, "Wait, didn't you say you can run all of this on a single machine? How is it still fast with only one machine?" This brings us to the second part of how Splunk scales. In addition to the distributed processing (official term: spatial MapReduce) spread across physical indexers, Splunk also breaks each search up into smaller parts based on chunks of time (official term: temporal MapReduce). In short, every search that reaches an indexer is split into blocks of time, and those blocks are processed in parallel. This simulates, within a single machine, the same thing that happens when you split a search across separate indexers. This is the second thing that gives Splunk its speed.
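The same idea can be sketched for a single indexer: split the search's time range into blocks and process each block in parallel. Again, this is an illustrative Python sketch of the concept rather than Splunk's implementation; the events, timestamps, and number of chunks are all arbitrary choices for the example.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical events with epoch timestamps; one indexer's full data set.
    events = [
        {"_time": 100, "host": "web01"}, {"_time": 250, "host": "web02"},
        {"_time": 420, "host": "web01"}, {"_time": 580, "host": "web01"},
    ]

    def time_chunks(start, end, n):
        # Split the search's time range [start, end) into n equal blocks.
        step = (end - start) // n
        return [(start + i * step, start + (i + 1) * step) for i in range(n)]

    def search_chunk(bounds):
        lo, hi = bounds
        # Each worker scans only the events that fall inside its time block.
        return sum(1 for e in events if lo <= e["_time"] < hi and e["host"] == "web01")

    # Four time blocks searched in parallel, partial counts summed at the end.
    with ThreadPoolExecutor() as pool:
        partial_counts = pool.map(search_chunk, time_chunks(0, 600, 4))

    print(sum(partial_counts))  # 3 events from web01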
What's MapReduce?
Official definition:
"MapReduce is a programming model for processing large data sets with a parallel,distributed algorithm on a cluster.” (Wikipedia, http://en.wikipedia.org/wiki/Mapreduce)
So let's break this down a bit more. We already covered how the parallel processing works from the Splunk perspective, but there is more to it. First, let's understand the two parts individually: "map" and "reduce".
A "map" function is essentially the operation of gathering your data in parallel. In Splunk terms, the map function is the part of your search that actually grabs the data. For instance, take this search as an example:
source src_ip=192.168.1.1 | chart count by host
The map function of this search is "source src_ip=192.168.1.1", as this is the criteria Splunk uses to gather the data; it returns every event from the index(ers) that matches. After the pipe comes the "reduce" function, which takes the results provided by the map function and performs additional processing on them. In this case that's "chart count by host", which produces a table showing an event count organized by the host field.
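As a rough analogy (plain Python with a made-up event list, not actual Splunk code), the same search can be expressed as an explicit map step that filters events and a reduce step that aggregates them:

    from collections import Counter

    # Hypothetical raw events; in Splunk these would live in the index(ers).
    events = [
        {"src_ip": "192.168.1.1", "host": "web01"},
        {"src_ip": "10.0.0.9",    "host": "web02"},
        {"src_ip": "192.168.1.1", "host": "web02"},
        {"src_ip": "192.168.1.1", "host": "web01"},
    ]

    # Map: gather the events matching the search criteria
    # (the part before the pipe: src_ip=192.168.1.1).
    matched = [e for e in events if e["src_ip"] == "192.168.1.1"]

    # Reduce: aggregate the mapped results
    # (the part after the pipe: chart count by host).
    count_by_host = Counter(e["host"] for e in matched)
    print(count_by_host)  # Counter({'web01': 2, 'web02': 1})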
Pretty straightforward, right? The reason MapReduce becomes such a complicated topic is really how flexible it is. Splunk is unique because it built a framework around MapReduce and set up a very handy search language that translates easily and directly into a MapReduce job. The majority of other MapReduce implementations are not so simple, because the data is so unstructured: teams of developers have to code MapReduce jobs manually to answer each specific question, which is why overhead costs are generally high. The power of MapReduce really lies in its implementation and, even more importantly, in the problem it is being used to solve.
MapReduce Implementations
A quick note on the different MapReduce implementations. There is a lot of misinformation out there, and things can get very confusing very quickly when it comes to MapReduce. You will hear Hadoop, Cassandra, Hive, Avro, Ambari, Chukwa, etc. The list goes on and on. When researching different projects, understand that MapReduce is its own thing: a programming methodology. All of the other names you see flying around are typically different implementations of parallel processing frameworks, and MapReduce jobs run against those frameworks. Also keep in mind that each of these projects was designed for a specific purpose. There is no framework that rules them all when it comes to this new era of distributed processing.
In closing, this is a very fun and exciting new way of thinking about data processing, and it crushes the old model of individual servers and localized disk storage. I hope you enjoyed this quick write-up on Splunk and MapReduce!
Courtesy: https://defensepointsecurity.com