Dealing with hundreds of security alerts daily is a challenge, especially when many are false positives that take too much of our precious time to sift through. Let me explain how our security team solved this problem as we built security around JFrog products.
First of all, let me tell you a bit about our team.
Security Engineering at JFrog
JFrog Security – CSO Office is focused on protecting our customer data, ensuring the security of our cloud infrastructure, providing the highest product and application security standards, and responding to emerging security threats. Learn more about security at JFrog>
Performance, scalability, ease of use and flexibility
As part of our efforts to improve our processes and better meet our ever-growing needs, we decided to build our security monitoring system from the ground up and make it top-notch. After studying different approaches, we decided to create a solution that some might call unconventional, comprising the following two main parts:
Containerization of our log shipping components
Using a Message Queue
The purpose of this blog post is to share how we tackled, at a high level, the main issues experienced by every security engineer / architect who has ever worked with traditional security monitoring systems.
So who is this blog post for?
If you are about to decide what should be the architecture of your security monitoring system / solution, this is for you.
If you’re tired of chasing health issues in your log collectors, data spikes that cause indexing issues, or just melting your database with too many requests, this is for you.
This blog post will describe the reasons why we chose these solutions and our final architecture today. We hope that sharing our use case can provide additional guidance for you and your organization. This will be the first part of three blog posts, including: Security monitoring (aka SIEM), Automating (aka SOAR) and Bots (aka chat bots).
Let’s start with our main takeaways and recommendations for the solutions we have chosen.
Note: Our team chose to tackle all issues and start from scratch. However, you can choose to tackle only certain parts of your existing security monitoring, such as technology, alert channels, or processing.
1. Containerization of log shipping components
In most traditional security monitoring systems, the log shipper can be your biggest point of failure. Getting it to work can be a very difficult task, especially in a complex architecture with dozens or hundreds of different services. Using a containerized version of a log shipper can be of great help:
- Saves tons of compute resources by avoiding dedicated instances just to run the shipping services.
- Much more resilient, with the option of defining a restart policy in case the “service” dies.
- Hassle-free recovery without needing to rebuild your service if it fails or is corrupted.
- Adds a layer of security because the services are isolated, which helps separate credentials and other secrets.
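As an illustration of the restart-policy point above, here is what launching a containerized shipper might look like. The image tag, container name, and mounted paths are placeholders, not our actual configuration:

```shell
# Run a containerized Filebeat log shipper with an automatic restart policy,
# so the "service" comes back up on its own if it dies.
# Tag, name, and volume paths below are illustrative placeholders.
docker run -d \
  --name log-shipper \
  --restart=unless-stopped \
  -v /var/log:/var/log:ro \
  -v "$PWD/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro" \
  docker.elastic.co/beats/filebeat:8.12.0
```

Recovering from a corrupted service then amounts to removing the container and running the same command again, rather than rebuilding a host.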
2. Using a Message Queue (MQ)
Next, we chose to use a buffer (aka queue system). In our case, we chose to use Kafka.
- Absorbs peaks and overloads – Every now and then you will receive a large amount of logs which you certainly did not expect or prepare for from a sizing point of view. A message queue can keep you from losing or corrupting data and causing service issues.
- Backups – A message queue is not a backup system; you should still back up your data using snapshots, buckets, etc. But it can buy you time during momentary changes. Upgrading a version and need to shut down your SIEM for an hour? No problem! Changed a configuration on one of your nodes? Bring it on!
- Load balancing – Not all log ingestion solutions know how to balance the traffic load, and an MQ can help by allowing consumers to distribute the load evenly.
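The peak-absorption point can be sketched with Python's stdlib `queue` standing in for Kafka: a burst of events arrives all at once, a bounded buffer holds them, and the consumer drains at its own pace without dropping anything. This is a toy model, not our pipeline code:

```python
import queue
import threading

# A bounded in-memory queue standing in for Kafka: it absorbs a burst of
# log events faster than the downstream "SIEM" can index them.
buffer = queue.Queue(maxsize=10_000)
indexed = []

def consumer():
    while True:
        event = buffer.get()
        if event is None:           # sentinel: shut down
            break
        indexed.append(event)       # stand-in for indexing into the SIEM
        buffer.task_done()

t = threading.Thread(target=consumer)
t.start()

# Simulate an unexpected spike: 5,000 events arrive at once.
for i in range(5_000):
    buffer.put({"seq": i, "msg": "auth failure"})

buffer.put(None)                    # signal shutdown after the burst
t.join()
print(len(indexed))                 # prints 5000 – nothing lost in the spike
```

With a real broker, the retention window plays the role of `maxsize`, giving the consumer hours or days, not seconds, to catch up.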
Example of simple architecture:
(Source: “Just Enough Kafka for the Elastic Stack” elastic.co)
On the left are our log shippers (e.g. DB, Filebeat, Syslog, etc.), which send data into the message queue (Kafka). Then, a consumer retrieves the data from the MQ and sends it to a SIEM (ELK Stack).
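The flow in the diagram can be reduced to a minimal sketch: shippers produce into a queue, and a consumer pulls from it and "indexes" into the SIEM. All names here are illustrative stand-ins:

```python
from collections import deque

# Minimal end-to-end sketch of the diagram: shipper -> queue -> consumer -> SIEM.
kafka_topic = deque()   # the message queue (stand-in for a Kafka topic)
siem_index = []         # stand-in for the ELK index

def ship(source: str, message: str) -> None:
    """A log shipper producing into the queue."""
    kafka_topic.append({"source": source, "message": message})

def consume() -> None:
    """The consumer draining the queue into the SIEM."""
    while kafka_topic:
        siem_index.append(kafka_topic.popleft())

ship("filebeat", "sshd: failed login")
ship("syslog", "kernel: oom-killer invoked")
consume()
print([e["source"] for e in siem_index])   # ['filebeat', 'syslog']
```

The key property is the decoupling: shippers never talk to the SIEM directly, so either side can go down or slow down without affecting the other.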
Each environment has its own requirements, and there are guides available that explain in detail how to size the different components:
Final architecture and infrastructure
Here are the main components and patterns of our final architecture:
Log Sources / Senders
Sticking to Elastic components helps avoid compatibility issues, and it is easier to maintain a smaller number of component types. For example, upgrading all shippers only requires changing the version of the Docker image.
Keeping the number of variants as small as possible simplifies things and saves us time maintaining our infrastructure.
Message queue (aka Kafka, as shown above)
AWS manages the infrastructure and provides a managed service to us.
We only manage the application side (topics, replication, retention, monitoring).
Logstash cannot really run as an HA cluster, so it is not exactly a cluster. But making it work as a Kafka consumer means the queue balances the load for us: more partitions in a Kafka topic, consumed with an equal number of threads, lets us read data faster and in a balanced way, avoiding the spikes that neither Logstash nor Elasticsearch handle well.
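The partition / thread balancing idea can be sketched as follows. The partition and thread counts are illustrative, and `zlib.crc32` stands in for Kafka's key hashing:

```python
import zlib
from collections import Counter

PARTITIONS = 6          # partitions in a hypothetical Kafka topic
THREADS = 6             # matching number of Logstash consumer threads

def partition_for(key: str) -> int:
    # Stand-in for Kafka's key hashing: the same key always lands on the
    # same partition, while distinct keys spread across partitions.
    return zlib.crc32(key.encode()) % PARTITIONS

# With an equal number of threads and partitions, each thread owns
# exactly one partition, so consumption is balanced.
thread_for_partition = {p: p % THREADS for p in range(PARTITIONS)}

events = [f"host-{i % 50}" for i in range(6_000)]
per_thread = Counter(thread_for_partition[partition_for(k)] for k in events)
print(sum(per_thread.values()))   # prints 6000 – every event consumed once
```

If consumers die, the remaining threads pick up the orphaned partitions, which is what makes the queue, rather than Logstash itself, the load balancer.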
Having Logstash consolidate logs into a single pipeline allows easier control of log handling, tagging, and enrichment (e.g. MaxMind Geo information), or even matching your firewall logs with OSINT from MISP (resource: https://owentl.medium.com/elk-enrichment-with-threat-intelligence-4813d3addf78)
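The enrichment step can be sketched like this: tag each event and flag source IPs that appear in a threat-intel feed (such as one exported from MISP). The field names and the indicator set are illustrative, not real data:

```python
# Sketch of a consolidated enrichment step: tag every event and flag
# source IPs found in a threat-intel feed. Illustrative values only;
# the "indicators" are reserved TEST-NET addresses.
THREAT_INTEL_IPS = {"203.0.113.7", "198.51.100.23"}

def enrich(event: dict) -> dict:
    enriched = dict(event)
    enriched.setdefault("tags", []).append("pipeline:security")
    if event.get("src_ip") in THREAT_INTEL_IPS:
        enriched["tags"].append("threat-intel:match")
        enriched["alert"] = True
    return enriched

hit = enrich({"src_ip": "203.0.113.7", "msg": "denied"})
miss = enrich({"src_ip": "192.0.2.1", "msg": "allowed"})
print(hit["alert"], "alert" in miss)   # prints: True False
```

Because every log passes through the same pipeline, adding a new enrichment (GeoIP, asset ownership, new feeds) is a change in one place rather than in every shipper.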
Elasticsearch / Kibana
Some other takeaways:
Outsourcing to a managed service / SaaS can free up your business to play to its strengths, allowing your team to focus on its core tasks and future strategy.
Security monitoring infrastructure is essential. We all want to avoid service outages. SaaS can help, especially if your team is small or doesn’t even exist yet.
To sum up, here is a recap of what we have achieved:
- Increased capacity – We have gained infrastructure stability and resiliency, an easier deployment process, and a simpler troubleshooting process. Fewer people needed to steer the ship = more time for IR. (The next piece of the puzzle, which is automation, will be covered in the next article)
- Time saving per engineer – Having a reliable monitoring solution means spending less time investigating. Eliminating issues like missing logs, ingestion issues, and missing data can really have a big impact on time efficiency.
- Fine tuning – As a SIEM engineer, this is the first time that I can say that I no longer have to live in fear of a service failure. Running our infrastructure on a relatively stable platform (e.g. Kubernetes, Docker) allows various components to self-heal from the faults that used to interrupt the log pipeline.
Hopefully our journey will help other teams rethink their security logging infrastructure.
Contact us, ask questions, share your thoughts, concerns and ideas on LinkedIn.
Finally, we are recruiting! Check out the open positions>