
Greater Cybersecurity Threat Predictions with a Primer in Machine Learning

Why is it that Big Data services such as Netflix can predict, with reasonable accuracy, the movies and programs I may be interested in when I connect to their service? When I shop on Amazon.com, I see the items I purchased in the past and when I bought them, along with a prediction of whether it is time to repurchase them. Machine Learning (ML) and Artificial Intelligence (AI) are probably the key reasons these predictions work so well. So, why can’t we predict, at least on a small scale, when and what type of cybersecurity threats will hit our staff, organizations, or families? Is it possible to use the same strategies as Netflix and Amazon to protect us? This article provides background on the mechanics of Machine Learning and on how, with more experience, these new modes of analysis can be better implemented and used to predict cybersecurity threats and crimes.


With the increased number of cyberattacks, especially in light of the COVID-19 pandemic, a process in which cyberthreat alerts are received manually or semi-automatically and must then be analyzed by a human is no longer feasible. By the time an organization determines that an attack has happened, possibly with a breach, the organization or end-user may already be under a second or third attack. Being able to predict attacks within a timeframe that permits organizations or end-users to act before negative outcomes ensue is critical to withstanding cyberattacks.

By Samir Souidi, M.S., MBA, Indiana University, Population Council; and Stan Mierzwa, M.S., CISSP, Director, Kean University Center for Cybersecurity

Given the increase in cyberthreats resulting from the COVID-19 pandemic, especially ransomware and phishing attempts, better proactive identification of threats is a must. Integrating Machine Learning (ML) with cybersecurity protection is essential if better threat predictions are to occur. To use Machine Learning, organizations need to either purchase or incorporate tools that already have the technology integrated. They will also need to train staff to create and use ML. However, for ML to be effective or useful for the general consumer, waiting for higher-level tools or knowledge is not practical.

Machine Learning and Artificial Intelligence Clinic

For those who are new to the topic, ML is in essence the science of programming computers so they can learn from data by themselves. In a general definition attributed to computer gaming and Artificial Intelligence expert Arthur Samuel in 1959, “[ML is the] field of study that gives computers the ability to learn without being explicitly programmed.” A more engineering-oriented definition was given by computer scientist and ML pioneer Tom Mitchell in 1997: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” In other words, ML learns to perform a task T from experience E (data), and the accuracy of its predictions improves as it learns more from E.

So, how does an ML algorithm learn? The goal of ML is to learn the weights. In the context of the problem, the weights are numerical values, one associated with each feature, that measure how important that feature is to the prediction. The algorithm makes many passes over the training data, assigning new weights each time, until it finds the weights that give the best accuracy: high accuracy indicates that the current weights fit the data well, while low accuracy (high error) indicates that they need to be adjusted. Take the example of an email spam filter. The algorithm first assigns a mathematically educated guess of the weight for each feature of the email (e.g., features indicating whether capital letters are used, the sender's IP address, or repeated words). It then predicts whether the email is spam or not (the task, T) and compares the prediction with Y, the target or actual status of the email (spam or not spam) in the training data. If they differ, this is an error as measured by P, and the weights are not yet correct. The algorithm tries again with the same input data, but this time it assigns improved weights, because it is starting to know the data and can guess better weight values. It repeats this process until it reaches the best accuracy. If we write this process as a linear mathematical equation, just for simplicity, we get something like this:

f(x)=w_0 x_0+w_1 x_1+⋯+w_m x_m
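As a rough illustration of the loop described above, the sketch below trains the weights of the linear equation on a handful of made-up emails using a simple perceptron-style update. The three features, the toy data, and the choice of fixing x_0 at 1 (so that w_0 acts as a bias) are assumptions made for this example, not details from the article; real spam filters use far more data and features.

```python
# Toy training data: each email is described by three hypothetical features
# [all_caps_subject, sender_ip_on_blocklist, has_repeated_words], each 0 or 1.
X = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
]
y = [1, 1, 1, 0, 0, 0]  # actual labels (Y): 1 = spam, 0 = not spam


def predict(weights, features):
    """Return 1 (spam) if the weighted sum f(x) is positive, else 0."""
    # x_0 is fixed at 1 (an assumption of this sketch), so w_0 is a bias term.
    score = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1 if score > 0 else 0


def train(X, y, rounds=20, learning_rate=1.0):
    """Perceptron-style learning: adjust the weights whenever a prediction is wrong."""
    weights = [0.0] * (len(X[0]) + 1)  # initial guess for w_0 .. w_m
    for _ in range(rounds):
        errors = 0
        for features, target in zip(X, y):
            error = target - predict(weights, features)  # 0 if correct, +1 or -1 if wrong
            if error != 0:
                errors += 1
                weights[0] += learning_rate * error
                for i, x in enumerate(features):
                    weights[i + 1] += learning_rate * error * x
        if errors == 0:  # every training email is classified correctly
            break
    return weights


weights = train(X, y)
accuracy = sum(predict(weights, f) == t for f, t in zip(X, y)) / len(y)
print("learned weights:", weights)
print("training accuracy:", accuracy)
```

Each round plays the role of one "try" in the text: predict, compare with the actual label, and nudge the weights when the two disagree.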

There are many different types of ML models, and they can be classified into three major categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Based on the data and the problem to be solved, you select the category that fits best, which in turn helps you choose the best algorithm and decide how to evaluate its accuracy. In Supervised Learning, the training set you feed to the algorithm includes the desired solutions, called labels. A typical Supervised Learning task is classification.

Although most applications of ML today are based on Supervised Learning, most of the available data is unlabeled, which creates an unmet need in Supervised Learning. Unlabeled data generally requires a human to go through the data and label it manually. Going forward, advances in Semi-Supervised Learning, which requires only a small amount of labeled training data, will be beneficial: we let the ML learn by itself and label the unlabeled data.

Unsupervised Learning in its current state is already very useful, because it allows us to understand and cluster data when we have massive amounts of data that we know little about. For example, Clustering is a great tool for data analysis and for search engines, which group similar instances into clusters. Unsupervised Learning is also widely used in anomaly and fraud detection, where the objective is to learn what “normal” data resembles and then use that to detect abnormal instances.

Reinforcement Learning is one of the most exciting fields of Machine Learning today, and also one of the oldest. It is mainly used in the gaming industry, where many applications use Reinforcement Learning in their games. It received wide attention when the British startup DeepMind used Reinforcement Learning to build a system that learned to play Atari games from scratch and eventually outperformed humans at them. In Reinforcement Learning, the learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return (or penalties, in the form of negative rewards). It must learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
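To make the agent, reward, and policy vocabulary concrete, here is a minimal sketch using tabular Q-learning, one common Reinforcement Learning method (the article does not say which method any particular system uses). The environment is an invented five-cell corridor where the agent earns a reward of 1 for reaching the rightmost cell; the corridor, the parameter values, and the purely random exploration are all assumptions chosen to keep the example short.

```python
import random

N_STATES = 5          # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)  # walls at both ends
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn q[state][action_index], an estimate of long-run reward per action."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            a = rng.randrange(len(ACTIONS))  # explore by acting at random
            next_state, reward = step(state, ACTIONS[a])
            best_next = max(q[next_state])
            # Move the estimate toward (reward now + discounted future reward).
            q[state][a] += alpha * (reward + gamma * best_next - q[state][a])
            state = next_state
    return q


q = q_learning()
# The learned policy: in each non-goal state, choose the action with the highest value.
policy = ["left" if values[0] > values[1] else "right" for values in q[:GOAL]]
print(policy)
```

The final list is the policy in the sense used above: for each situation (cell), it names the action the agent should choose.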

Besides game applications, Reinforcement Learning is heavily used in combination with Deep Learning in self-driving cars, where Deep Learning detects on the fly the physical objects that the car senses, and Reinforcement Learning takes these inputs and tries to learn how to drive without crashing or hitting any object or person (which earns rewards).

Existing Systems and Solutions Utilizing ML Cybersecurity Threat Prevention

Some cyberthreat protection vendors already have ML integrated into their platforms. As reported by Built-In, large firms such as Microsoft have defended against the installation of malicious cryptocurrency miners by using Windows Defender, software that uses ML to identify and block perceived threats. It is just this sort of ML integration that proves effective because of its focus on “learning”. This is even more important as more and more cyberthreats emerge during the global pandemic. In addition, Cloud providers use ML to monitor unusual login activity on their platforms, with anomaly detection algorithms that analyze unusual activities. For example, if logins to the same user account took place from two different geographical locations within a short time range, this would be flagged or an alert created. ML algorithms are also used by cybersecurity providers to analyze massive volumes of DNS records “on the fly” to prevent malware based on Domain Generation Algorithms (DGA).
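The login example above can be sketched as a simple "impossible travel" check. Note that this is a fixed rule, not a trained model: it only illustrates the kind of signal an anomaly detection system might act on. The `Login` record, the 900 km/h speed limit, the 50 km minimum distance, and the sample coordinates are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Login:
    user: str
    when: datetime
    lat: float
    lon: float


def distance_km(a, b):
    """Great-circle (haversine) distance between two logins, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def flag_impossible_travel(logins, max_speed_kmh=900.0, min_km=50.0):
    """Return (previous, current) login pairs whose implied travel speed is too high."""
    alerts = []
    last_seen = {}  # most recent login per user
    for login in sorted(logins, key=lambda entry: entry.when):
        previous = last_seen.get(login.user)
        if previous is not None:
            hours = (login.when - previous.when).total_seconds() / 3600
            km = distance_km(previous, login)
            # Ignore short distances; otherwise compare the implied speed to the limit.
            if km >= min_km and (hours == 0 or km / hours > max_speed_kmh):
                alerts.append((previous, login))
        last_seen[login.user] = login
    return alerts


logins = [
    Login("alice", datetime(2021, 1, 5, 9, 0), 40.71, -74.01),   # New York
    Login("alice", datetime(2021, 1, 5, 9, 30), 51.51, -0.13),   # London, 30 minutes later
    Login("bob", datetime(2021, 1, 5, 9, 0), 40.71, -74.01),     # New York
    Login("bob", datetime(2021, 1, 5, 10, 0), 40.74, -74.17),    # Newark, one hour later
]
alerts = flag_impossible_travel(logins)
for previous, current in alerts:
    print(f"ALERT: {current.user} logged in from two distant locations within a short time")
```

A learned system would replace the fixed thresholds with a per-user model of normal behavior, but the flagging step would look much the same.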

Predicting Where ML will Further Grow in Cybersecurity Tools

Given the increase in cybersecurity threats and threat actors, and the amount of data flowing into cybersecurity operations centers, more intelligence is inevitably needed to help the analysts tasked with protecting organizations and agencies. It is quickly becoming extremely challenging simply to respond to cyber incidents. What will assist cyber operations staff are solutions that can predict and auto-respond to incidents where possible. ML may lead the way, especially Unsupervised Learning, where the ML learns by itself, with less need for humans to check all system events manually, to cluster events, predict which ones will cause harm to the system, and take action. In essence, instead of just raising an alert, the Machine Learning algorithm will take action and stop the events it has identified.
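A very small version of "learn what is normal, then act instead of only alerting" might look like the sketch below. It uses a plain statistical baseline (mean and standard deviation, with a z-score threshold) as a stand-in for a fuller Unsupervised Learning model. The event counts, the threshold of 3, the example addresses, and the "block"/"allow" responses are invented for illustration; a real deployment would call a firewall or endpoint API at the marked line.

```python
from statistics import mean, stdev


def learn_baseline(history):
    """Learn what 'normal' looks like from unlabeled counts (no human labels needed)."""
    return mean(history), stdev(history)


def respond(events, baseline, threshold=3.0):
    """Decide an action for each (source, count) event based on its distance from normal."""
    mu, sigma = baseline
    actions = {}
    for source, count in events:
        z = (count - mu) / sigma  # standard deviations above normal (assumes sigma > 0)
        # Hypothetical auto-response: a real system would invoke a blocking API here.
        actions[source] = "block" if z > threshold else "allow"
    return actions


# Failed logins per hour observed during normal operation (made-up numbers).
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
baseline = learn_baseline(history)

# New events to score: (source address, failed logins in the last hour).
actions = respond([("192.0.2.5", 6), ("192.0.2.9", 48)], baseline)
print(actions)
```

The design choice worth noting is that the response is decided by the same code that scores the event, which is what distinguishes auto-response from alerting a human analyst.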

Think gaming, simulation, and self-driving cars where there is the potential to handle so many different constraints and movements with the use of ML – why can’t we do the same with cyber?


References

Research paper: A Machine Learning Framework for Studying Domain Generation Algorithm (DGA)-Based Malware, by Tommy Chin, Kaiqi Xiong, Chengbin Hu, and Yi Li


About the Authors

Samir Souidi is the Global Enterprise Systems/Software Architect at the Population Council, headquartered in New York City, New York, and owner of the startup Atlas Data Services. Samir is an expert in data analytics and science, having received his MBA from Indiana University with a specialization in data in 2020. Souidi is the recipient of the 2016 Excellence Award in Information Systems from InsideNGO, now known as Humentum. He has a BA in Business Administration from the Supérieure de Commerce in Morocco and an MS in information systems from Pace University, and is a member of IEEE. He is fluent in Arabic and French and has traveled globally implementing mHealth and eHealth solutions.


Stanley Mierzwa is the Director, Center for Cybersecurity at Kean University in the United States. He lectures at Kean University on Cybersecurity Risk Management, Cyber Policy, Digital Crime and Terrorism, and Foundations in Cybersecurity. He is a peer reviewer for the Online Journal of Public Health Informatics; a member of the FBI InfraGard, IEEE, and (ISC)²; and a board member (Chief Technology Officer) of the global pharmacy education non-profit Vennue Foundation. Stan holds an M.S. in Management with specialization in Information Systems from New Jersey Institute of Technology and a B.S. in Electrical Engineering Technology from Fairleigh Dickinson University, and is also a Certified Information Systems Security Professional (CISSP).

 

Disclaimer

Views expressed in this article are personal. The facts, opinions, and language in the article do not reflect the views of CISO MAG and CISO MAG does not assume any responsibility or liability for the same.


A more detailed version of this article will be available in the February issue of CISO MAG.