There is a saying in the cybersecurity community: “know normal, find evil.” The phrase embodies a simple question: if you don’t know how things should normally look, how will you ever find anything abnormal? System Administrators use the same concept when maintaining servers and securing websites. In our case, however, instead of only using application logs or extensive integrity monitoring to find malware, we use our knowledge of what is normal to improve the performance, stability, and uptime of the assets in our care.
Atypical behavior from applications or end users can shed light on previously unknown problems or give advance notice of impending instability. In the same way that a seismometer detects earthquakes, abnormal behavior in an IT setting can reveal hidden troubles plaguing an environment.
Yet what does atypical or abnormal behavior look like, and how can one find it? Good logging, in its various forms, is the root of discovering issues. You will often hear talk of adding “better logging” or implementing “centralized logging.” But even if you have the logs, how can you act on something if you cannot interpret and filter the data in them? Problems are truly discovered by analyzing and interpreting the information you already have.
A System Administrator can use math and statistics to compare the “average number” found in the logs against the “current number.” Comparing the current state to what’s normal often reveals a pattern that yields actionable information. For instance, do you see far less website traffic than usual? Has an unusual entry appeared in one of the logs? Did the number of database queries skyrocket? All of these signs could indicate an underlying problem that may not be immediately obvious. The statistics of these log entries have shifted away from their baseline, the historical average. When that shift passes a certain threshold, it can trigger an investigation and, eventually, a solution, leading to a more stable and secure environment down the road.
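The comparison described above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the function name, the z-score threshold, and the hourly request counts are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it deviates from the historical average
    by more than `z_threshold` standard deviations."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > z_threshold

# Hourly request counts from the past shift (hypothetical numbers).
hourly_requests = [980, 1010, 995, 1003, 990, 1012, 1001, 997]
print(is_anomalous(hourly_requests, 1005))  # False: within normal range
print(is_anomalous(hourly_requests, 2400))  # True: traffic spike
```

Expressing the threshold in standard deviations rather than a raw count means the same check adapts to metrics with very different scales.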
Taken a step further, instead of just using math to detect problems, we can use it as the foundation for predicting issues from seemingly unrelated signs. Merely detecting problems is reactive behavior: find a problem, then go fix it. When you are looking only for an issue’s symptoms, rather than its early signals, it can often be too late to prevent catastrophic failure. With traditional methods, like those mentioned previously, dramatic changes in the raw data can be used to find problems. But what about a gradual change? Or a change in semi-unrelated metrics? Or repeating patterns in the data?
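A gradual change is exactly the case a fixed spike threshold misses: no single data point looks alarming, yet the series is creeping. One simple way to surface that creep is to compare a recent window against everything before it. A minimal sketch, with hypothetical latency figures and window size:

```python
def drift_ratio(series, window=4):
    """Ratio of the recent window's mean to the earlier baseline mean.
    A slow creep that never trips a spike threshold still shows up
    as a ratio drifting away from 1.0."""
    baseline = series[:-window]
    recent = series[-window:]
    return (sum(recent) / window) / (sum(baseline) / len(baseline))

# Daily 95th-percentile response times in ms (hypothetical): every
# single day looks "normal," but the series is quietly degrading.
latencies = [120, 122, 125, 124, 128, 131, 133, 136]
print(drift_ratio(latencies))  # ≈ 1.08, an 8% creep in latency
```

Alerting when the ratio leaves a band such as 0.9–1.1 catches slow degradation that a per-sample threshold would never flag.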
Enter Machine Learning (ML). The words “machine learning” are often thrown around without any connection to something a layperson would understand. ML is often likened to the field of statistics, and at its core the two are nearly synonymous: the foundation of ML is statistics. The difference is their purpose.
Statistics can easily be used to detect problems as they appear: count the current event entries and compare them to a threshold. ML, on the other hand, can supercharge this process to an inhuman level and even predict impending issues based on past data. When implemented well, ML can act as a predictive early-warning system, giving valuable insight into when and how problems will occur.
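To make the predictive idea concrete without a full ML model, here is a deliberately simple stand-in: fitting a straight line to past samples and extrapolating to when the metric crosses a limit. The function name and the disk-usage figures are hypothetical, and real workloads are rarely this linear; this only sketches the shift from “detect” to “predict.”

```python
def days_until(values, limit):
    """Fit a least-squares line to daily samples and estimate how many
    days remain before the trend crosses `limit` (None if not rising)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den
    if slope <= 0:
        return None  # flat or falling: no projected crossing
    intercept = y_mean - slope * x_mean
    # Solve intercept + slope * x = limit, relative to the last sample.
    return (limit - intercept) / slope - (n - 1)

# Disk usage percentage sampled daily (hypothetical numbers).
usage = [70, 71.5, 73, 74.5, 76, 77.5]
print(days_until(usage, 95))  # ≈ 11.7 days before the disk fills
```

The answer arrives while there is still time to act, which is the whole point: the alert fires on the projected failure, not the failure itself.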