Artificial Intelligence and Machine Learning in the Insurance Sector


As rapid technological advances reshape the insurance landscape, carriers are encouraged to adopt technologies that enhance customer service, improve operational efficiency, and build ever more accurate underwriting models. Artificial Intelligence (AI), and Machine Learning (ML) in particular, holds promise in this regard, having proven successful in multiple disciplines including e-commerce, predictive maintenance, election forecasting, and drug discovery.


1. Global AI Market

Figure 1 shows the 2019 global AI market share by economic sector. The insurance sector is notably absent from the list of pioneering industries, while other industries have already invested in the technology with varying degrees of success.

Figure 1.  Global AI market share, by end use, 2019 (%) (1)


The increasing adoption of AI/ML on the global scene is a testament to the critical role of this technology in helping industries decrease costs and increase revenues. Figure 2 shows a forecast of the global AI market size through 2025.

Figure 2.  Forecast of global AI market size from 2018 to 2025 (2)

Other sources are even more generous in their forecasts: they estimate that the global AI market will grow at a compound annual growth rate of 42.2% from 2020 to 2027 and reach $733.7B by 2027 (3).
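To put the growth figure in perspective, the short sketch below backs out the 2020 market size implied by the quoted growth rate and 2027 value. It assumes the period 2020 to 2027 corresponds to seven annual compounding steps; the function name is illustrative and does not come from the cited source.

```python
def implied_base(final_value: float, cagr: float, years: int) -> float:
    """Back out the starting value implied by a final value and a compound annual growth rate."""
    return final_value / (1.0 + cagr) ** years


final_2027 = 733.7  # USD billions, as quoted in the text
cagr = 0.422        # 42.2% per year, as quoted in the text
years = 7           # assumption: 2020 -> 2027 treated as seven compounding periods

base_2020 = implied_base(final_2027, cagr, years)
print(f"Implied 2020 market size: ${base_2020:.1f}B")
```

Under these assumptions the implied starting point is roughly one twelfth of the 2027 figure, which illustrates how aggressive a 42.2% annual growth rate is.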

To better appreciate the impact of AI on various business sectors, consider Figure 3, which depicts the results of a poll of 1,872 enterprises worldwide indicating cost decreases and revenue increases from AI by function. Enterprises report that AI drives revenue in sales and marketing while reducing costs in supply chain management and manufacturing functions.


Figure 3.  Results from a poll of 1,872 enterprises worldwide indicating cost decreases and revenue increases from AI by function (4)


Marketing and sales include use cases of customer service analytics, customer segmentation, channel management, prediction of likelihood to buy, pricing and promotion, closed-loop marketing, marketing-budget allocation, churn reduction, and next product to buy. Product and service development includes use cases of product feature optimization, product development-cycle optimization, creation of new AI-based enhancements, and creation of new AI-based products.


2. Potential for Machine Learning in the Insurance Value Chain

It is believed that most insurance companies process only a small percentage of the data they have access to. The data they do process is mostly structured and housed in traditional databases. Analyzing unstructured data to extract valuable insights and trends requires advanced data science techniques focused on AI. ML, as a major class of AI, has proven its efficiency across many industries in extracting useful information from both structured and unstructured data to drive business decisions and discover valuable insights that would otherwise go unnoticed. For instance, ML-based analytics of insurance data can be used across the value chain to understand and evaluate risk, process claims efficiently, and predict customer behavior with high accuracy. Other insurance applications can also benefit from the deployment of ML, including exposure analysis, underwriting risk analysis, intelligent document processing, submission processing, pricing, risk appetite, subrogation, litigation, and fraud identification.

When a group of insurers was asked where they see AI adding value to their businesses in terms of understanding and managing risk, the majority pointed to exposure analysis, underwriting risk analysis, and the submission process. Figure 4 shows the potential value of ML in understanding and managing risk. The survey results classify insurance applications into three classes: i) a high value class, where 52-77% of insurers indicated that the application represents a potential top AI value for them, ii) a medium value class, where 20-47% of insurers indicated so, and iii) a low value class, where only 7-13% of insurers indicated so.

Figure 4. Potential AI value for understanding and managing risk. The percentage shown next to each category is the share of insurers who responded that the category is a top area where AI can add value for them (5)



3. Examples of applications poised to benefit from ML

It is a new reality that machines will take over many human jobs, starting with customer service, where chatbots and similar tools built on ML-based Natural Language Processing (NLP) techniques will handle initial interactions with customers to identify their intent and address their concerns, or call for human help if needed. The case for other applications is even stronger, as ML-based algorithms are well suited to unstructured data, where classical algorithms fail to extract the needed information and insights. The following are a few insurance applications poised to benefit from advances in ML.

3.1. Insurance advice

Customer satisfaction is expected to be higher when ML algorithms are deployed to provide personalized services and recommend the insurance products best suited to a specific customer, based on the customer's profile and on the previous behavior of other consumers with similar profiles and experiences.

ML is capable of scanning the profiles of many thousands of consumers to extract personalized insights and recommendations. Efficient ML techniques such as clustering and classification can be deployed to give advice that is likely to work for a given customer using tailored tools and products. For example, ML-based clustering techniques can learn that a given customer belongs to a specific group of consumers defined by age bracket, gender, geographic location, etc. Such a customer is then most likely to be interested in a new insurance product if other consumers in the same cluster are known to have responded well to it.
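The clustering idea above can be sketched in a few lines. This is a minimal illustration using plain k-means on invented data; the customer tuples, the acceptance flags, and the helper names are all hypothetical, and a real deployment would scale the features and use many more attributes.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Hypothetical customers: (age, number of policies already held).
customers = [(25, 1), (27, 0), (24, 2), (58, 6), (61, 7), (63, 5)]
# Hypothetical responses of those customers to a new product offer.
accepted = [True, True, False, False, False, True]

labels, centroids = kmeans(customers, k=2)


def acceptance_rate(new_customer):
    """Share of the new customer's nearest cluster that accepted the offer."""
    cluster = min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(new_customer, centroids[c])),
    )
    responses = [acc for acc, lab in zip(accepted, labels) if lab == cluster]
    return sum(responses) / len(responses)


print(acceptance_rate((26, 1)))
```

The acceptance rate of the nearest cluster serves here as a crude proxy for how likely the new customer is to be interested in the product.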


3.2. Claims Processing

Automating claims processing is a major goal of the insurance industry that ML can help achieve on many levels. ML enables building efficient predictive models that help insurers better understand claims costs and identify process pain points that need to be addressed in a timely manner. These insights and predictive models can help a carrier save millions of dollars in claim costs while at the same time increasing customer satisfaction through fast settlement, targeted investigations, and more efficient case administration. The same predictive models can support insurers' planning and forecasting by providing accurate figures for the funds to allocate to claim reserves.

Deploying computer vision to scan documents automatically with ML-based OCR (Optical Character Recognition), combined with NLP to interpret document content including handwritten claims, can significantly reduce the document input load. Another potential benefit of using an ML-based system instead of a human to handle claims is better protection of customer privacy.

Using ML to process claims automatically reduces input time, reduces human error, and provides fast and stress-free claims settlements.


3.3. Fraud prevention

The FBI estimates that insurance fraud (excluding health insurance) costs more than $40 billion per year, which costs the average U.S. family between $400 and $700 per year in the form of increased premiums (6). There are many kinds of insurance fraud including premium diversion, fee churning, and asset diversion, among others. Premium diversion is the most common type of insurance fraud; it involves insurance agents i) failing to send premiums to the underwriter and instead keeping the money for themselves and ii) selling insurance without a license. In fee churning, a series of intermediaries take commissions by registering the same customer multiple times, leading to the payment of multiple commissions on the same customer until there is no money left to pay outstanding claims. The company left to pay claims is often a shell company set up to fail. Asset diversion is mainly the theft of insurance company assets, particularly during a merger or an acquisition. False or exaggerated claims by policyholders are another source of insurance fraud that costs the industry billions of dollars.

Insurance companies need to address these and other kinds of fraud. ML is a powerful tool in this regard, as it can identify potentially fraudulent claims faster and more accurately. ML is efficient at analyzing unstructured data and extracting hidden trends and turning points to identify potential fraud and expose its methods.

ML-based algorithms are known for identifying hidden and meaningful trends and patterns in complex data while at the same time minimizing false alarms. In addition, those algorithms improve over time as more data and information become available. Once an ML model has been trained on enough data, it can detect abnormal schemes, including fraudulent behaviors, with high confidence.
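As a simple point of reference for the anomaly detection idea, the sketch below flags unusual claim amounts with a robust statistical rule (the modified z-score based on the median absolute deviation). It is a stand-in for illustration, not the ML models discussed above; the claim amounts are invented and the 3.5 threshold is a commonly used default taken here as an assumption.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return the indices of amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread estimate that outliers barely influence.
    mad = median([abs(x - med) for x in amounts])
    if mad == 0:
        return []
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical claim amounts in dollars; one is far larger than the rest.
claims = [1200, 950, 1100, 1300, 1050, 9800, 1150, 990]
print(flag_outliers(claims))
```

A rule like this only looks at one variable at a time; the appeal of ML-based approaches is that they can combine many signals and learn patterns that no single threshold captures.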


4. Concluding thoughts

This short analysis pointed out the importance of ML for the insurance business. Applications such as insurance advice, claims processing, and fraud detection are poised to benefit tremendously from ML. Because the potential for applying AI across the insurance industry is so broad, insurers need to be thorough and focused in their exploration of the technology. As insurers consider and evaluate ML for their business operations, they should develop proofs of concept, test the resulting ML benefits, and extend deployments as they become more educated and successful in exploring this game-changing technology.

 5. Sources









Saturday, 08 August 2020 12:20

Social Media Analytics for Sales Prediction



1. Introduction

Social data analytics has recently gained recognition for predicting the outcomes of important events like major political elections and box-office movie revenues. Actions such as tweeting, liking, and commenting can provide valuable insights about consumers' attention to a product or service. This presents an interesting opportunity to harness data from various social media outlets and generate specific predictions for public acceptance and valuation of new products and brands. Gauging consumer interest through the analysis of social media content gives the sales team a new and vital tool for predicting sales numbers with a high degree of accuracy.

This use case focuses on forecasting product sales based on social media and time-series analysis. We present a predictive model of product sales using sentiment and consumer reactions gathered from social media over time periods. Our predictive model illustrates how different time scale-based predictors derived from sentiment can improve the prediction of future sales. 

The widespread belief that social media data is simply too noisy and too biased to correlate accurately with sales data has been proven wrong using efficient AI models. We developed a process that collects relevant data from influential social media outlets and uses state-of-the-art machine learning algorithms to predict sales accurately.

The ultimate goal is to develop an accurate estimate of product sales before release, giving the sales team valuable knowledge of the product's potential profit and helping it decide the release quantity in different regions based on customer demand. An interesting case that can be detected via social media is negative feedback that can hinder the business from earning leads. To manage this case and similar situations, companies need access to authentic feedback from potential customers in order to react in a timely manner, either by finding a way to satisfy customers or by improving the product quality.

In addition to predicting future product success or failure, the model can be easily configured to provide a detailed map of consumer satisfaction with an already launched product. Other criteria related to consumer demographics such as geographic location and age group can also be extracted and studied to build better sales strategy and targeted marketing campaigns.




2. Methodology


The adopted approach is to collect customer sentiment data via social media analytics and use it to train a Machine Learning model that predicts the evolution of a commercial product or service. The proposed model predicts the success or failure of commercial products/services and highlights the most important trends based on sentiment analysis of social media feedback. It aims to help the sales team improve existing sales strategies or develop new ones to increase customer loyalty and retention. In addition, the tool can help in detecting false information and protecting the business brand and reputation.


Here are the main steps taken towards building the predictive model:

        Extract data from social media (e.g. posts, comments, reactions, etc.)

        Analyze sentiments of social media feedback

        Generate datasets from Facebook, Instagram, and Twitter

        Predict the impact of those sentiments on future product performance


3.  Technical approach


The first step consists of extracting data including posts, comments, and reactions from social media, namely Twitter, Facebook, and Instagram, through web scraping and relevant APIs.

The second step involves preprocessing the extracted data by applying a proprietary sentiment analysis algorithm and using well-known lexicon and rule-based libraries that are specifically attuned to sentiments expressed in social media. A dictionary of lexical features is used to score sentiments with a set of five heuristics. A lexical feature in this context refers to anything used for textual communication including words, emoticons like “:-)”, acronyms like “LOL”, and slang like “meh”. Each lexical feature is mapped to a numerical intensity value. Lexical features are not the only things in a sentence that affect the sentiment. Other contextual elements, like punctuation, capitalization, modifiers, and conjunctions, also impact the emotion.

All these details are accounted for in the set of five heuristics. The effect of these heuristics has been quantified using human raters in well documented processes that showed strong performance when analyzing the sentiment of movie reviews and opinion articles.
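To make the lexicon-plus-heuristics idea concrete, here is a toy scorer. It is not the proprietary algorithm or the library used in this work: the lexicon entries, intensity values, and adjustment constants are all invented for illustration, and only four contextual effects (capitalization, modifiers, the conjunction "but", and exclamation marks) are sketched.

```python
# Hypothetical intensity values; a real lexicon is far larger and human-rated.
LEXICON = {"good": 1.9, "great": 3.1, "love": 3.2, "bad": -2.5,
           "awful": -3.0, "meh": -0.3, ":-)": 1.3}
# Hypothetical degree modifiers that strengthen or dampen the next word.
BOOSTERS = {"very": 0.3, "really": 0.3, "slightly": -0.3}
PUNCT = ".,!?"


def token_valence(tokens, i):
    """Intensity of one token after capitalization and modifier adjustments."""
    raw = tokens[i]
    core = raw if raw in LEXICON else raw.strip(PUNCT)
    valence = LEXICON.get(core.lower(), 0.0)
    if valence == 0.0:
        return 0.0
    sign = 1.0 if valence > 0 else -1.0
    # Capitalization: an ALL-CAPS word is treated as more intense.
    if len(core) > 1 and core.isupper():
        valence += sign * 0.7
    # Modifier: a preceding booster or dampener shifts the intensity.
    if i > 0:
        valence += sign * BOOSTERS.get(tokens[i - 1].lower(), 0.0)
    return valence


def sentiment(text):
    """Sum token intensities, then apply conjunction and punctuation rules."""
    tokens = text.split()
    scores = [token_valence(tokens, i) for i in range(len(tokens))]
    # Conjunction: after "but", the later clause outweighs the earlier one.
    lowered = [t.strip(PUNCT).lower() for t in tokens]
    if "but" in lowered:
        pivot = lowered.index("but")
        scores = [0.5 * s for s in scores[:pivot]] + [1.5 * s for s in scores[pivot:]]
    total = sum(scores)
    # Punctuation: up to three exclamation marks amplify the overall sentiment.
    emphasis = 0.3 * min(text.count("!"), 3)
    if total > 0:
        total += emphasis
    elif total < 0:
        total -= emphasis
    return total


print(sentiment("The burger was good but the service was AWFUL!"))
```

A positive return value indicates positive sentiment, a negative value negative sentiment, and values near zero can be treated as neutral.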

After extracting data and applying the sentiment analysis algorithm to it, the next step in the methodology is to generate a dataset that includes the percentage of positive, neutral, and negative feedback for each time period, also dubbed timestamp. The developed model is flexible. It can generate a dataset at different time resolutions including months, weeks, days, minutes, seconds, or any arbitrary period for that matter. A dataset is defined by a name, start date, end date, and timestamp.
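The dataset generation step can be sketched as a simple aggregation of labeled feedback into percentages per period. The records below are invented, and the function and parameter names are illustrative assumptions rather than the tool's actual interface.

```python
from collections import Counter, defaultdict
from datetime import date

LABELS = ("positive", "neutral", "negative")


def sentiment_shares(records, period_key):
    """Percentage of positive/neutral/negative feedback per period.

    records: iterable of (date, label) pairs.
    period_key: function mapping a date to its period (e.g. year and month).
    """
    counts = defaultdict(Counter)
    for day, label in records:
        counts[period_key(day)][label] += 1
    dataset = {}
    for period in sorted(counts):
        total = sum(counts[period].values())
        dataset[period] = {lab: 100.0 * counts[period][lab] / total for lab in LABELS}
    return dataset


# Hypothetical labeled feedback, as produced by a sentiment analysis step.
records = [
    (date(2020, 1, 3), "positive"),
    (date(2020, 1, 9), "negative"),
    (date(2020, 1, 20), "positive"),
    (date(2020, 1, 28), "neutral"),
    (date(2020, 2, 2), "negative"),
    (date(2020, 2, 14), "negative"),
]

monthly = sentiment_shares(records, lambda d: (d.year, d.month))
for period, shares in monthly.items():
    print(period, shares)
```

Changing `period_key` (for example to `lambda d: d.isocalendar()[:2]` for weeks) changes the time resolution without touching the rest of the code.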


4.  Application

Recall that the mission of the predictive model built per the steps explained in the previous section is to estimate the evolution of commercial products/services based on the sentiment analysis of feedback from social media.

We developed and tested several ML-based algorithms and trained them using social media data that we collected and curated following the process detailed earlier. To account for seasonal fluctuations in sales, the model uses time series forecasting to ensure steady and accurate predictions. Here are the sequential steps to be followed during the prediction process:


i) Select a dataset of interest.

ii) Train all ML algorithms with the given dataset.

iii) After the training algorithms converge, the model selects the ML-based algorithm that provides the best accuracy.

iv) Using the best algorithm identified in the previous step, predict the total sales revenues of the product of interest.
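The four steps above amount to a train-all-then-select loop, sketched below. The three candidate forecasters are deliberately simple stand-ins, not the tree-based algorithms tested in this work, and the training and holdout pairs are invented (share of positive feedback in percent versus sales in arbitrary units).

```python
from statistics import mean


def fit_mean(xs, ys):
    """Always predict the average of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}


def select_best(train, holdout):
    """Fit every candidate on train; keep the one with the lowest holdout error."""
    xs, ys = zip(*train)
    models = {name: fit(xs, ys) for name, fit in CANDIDATES.items()}
    errors = {
        name: mean([abs(model(x) - y) for x, y in holdout])
        for name, model in models.items()
    }
    best = min(errors, key=errors.get)
    return best, models[best], errors


# Hypothetical data: (share of positive feedback in %, sales).
train = [(40, 4.1), (50, 5.0), (60, 6.1), (70, 6.9)]
holdout = [(55, 5.5), (65, 6.6)]

best_name, best_model, errors = select_best(train, holdout)
print(best_name, errors)
print(best_model(75))  # step iv: forecast with the selected model
```

Selecting on held-out data rather than on the training data keeps the comparison from simply rewarding the model that memorizes best.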


We followed this prediction process to forecast the total revenues generated by sales of the Big Mac meal at the McDonald's chain. The first part of the process is to build a dataset to train the model and to gauge its accuracy on historical data before going live. The dataset is divided into two parts:

·         The first part is based on features derived from people's feedback on social media. These features contain the percentage of positive, negative, and neutral feedback for a time span and timestamp defined by the user.

·         The second part contains the sales or turnover figures, provided by the customer, following the same timestamp as the features of the first part.

Once the dataset is formed, it is used to train the ML-based algorithms. For the McDonald's Big Mac application, we built a small dataset (15 rows) containing McDonald's sales from January 2016 until March 2020. Sales are expressed as averages over three-month periods, and the model predicts the average sales for the three months following March 2020.

The table below shows the performance of each ML algorithm that we tested including the best algorithm with the highest accuracy.




ML algorithm      Decision Tree      Gradient Boosting      Random Forest

Accuracy (%)










Table 1. Performance of various ML algorithms based on the social media dataset. The Bagging algorithm performed best and predicted sales revenues for the McDonald's Big Mac in the amount of $6.023M.



We linked all the studied models for Facebook, Instagram, and Twitter and created a desktop application in which the user selects the parameters of the dataset, including start date, end date, and time period, to get an estimate of the sales revenues for any product the user aims to forecast.


5. Conclusions and recommendations

This use case enumerated the steps needed to build an ML model based on social media content to predict the sales of commercial products. All results presented here are based on sentiment analysis of social media feedback.

We built a desktop app that selects the optimal ML algorithm and predicts the sales of a given product with an accuracy approaching 90%. We are currently studying the effect of adding sales information from the competition to improve the model accuracy.


Machine Learning Study for Predictive Maintenance



Table of Contents

Summary

Data Collection

Vibration data

Temperature data

Feature Extraction

Feature definition

Feature across all data sets

Feature cross correlation

Machine Learning Methods for Classification

Supervised Classification via Neural Networks

Conclusion



Summary

This use case summarizes the findings of a health monitoring study that uses empirical vibration and temperature data to build a predictive maintenance model. Sensors are placed in four different positions on the housing surface of three running motors at different health stages to study the model performance and its robustness with respect to sensor mounting and various operating conditions. Tens of thousands of data segments were processed and used to extract features and build supervised and unsupervised classification algorithms. A feed forward Neural Network was deployed to classify signals from these 3 motors that the network had not seen before. Preliminary results look promising, with 99.2% classification accuracy. The algorithm's robustness with respect to sensor mounting is also worth noting.

Data Collection

Vibration and temperature data are collected from rotating machines with the purpose of classifying those machines into one of three predefined classes, namely “Warning” (scheduled maintenance), “Alarming” (under watch), and “Normal” (no action required). Vibration and temperature sensors are placed on the surface of the machine of interest to generate data that will be used to classify machine health and eventually raise warnings when necessary to avoid shutdowns and unscheduled maintenance.

In the experimental setting of this study, 3 motors (numbered 1, 2, and 3) are used. Motor #1 is deemed by the operating personnel to be in a critical condition and may fail at any moment. It generated a distinct loud noise, a relatively strong vibration profile, and a higher than usual surface temperature. Motor #3 sounded very quiet and smooth, thus exhibiting a “normal” behavior. Motor #2 is in between the other 2 motors in terms of noise and vibration strength. Ideally, those motors should run to failure with data being captured at all stages of the motor health for accurate labeling. However, since this is unrealistic, for the purpose of this study data generated by motors #1, #2, and #3 will be labeled “Warning”, “Alarming”, and “Normal”, respectively.

Vibration data

A high-quality vibration sensor with a sampling rate of up to 48 kHz is attached via a magnet to the housing surface of each one of the 3 running motors. Four different sensor positions are used for data gathering, one of which is shown in Figure 1. Varying the sensor position is useful to study the model sensitivity to sensor mounting and operating conditions. The vibration sensor sampling frequency is set at its maximum value of 48 kHz. Each recording lasted about 90 seconds.



 Figure 1. One of the 4 sensor positions used to collect vibration data. In all experiments the sensor is attached to the motor housing surface via a magnet


Temperature data

This preliminary setup did not include a temperature sensor. For the purpose of this study, a temperature sensor response is simulated to allow building realistic machine learning models for classification. The temperature response is simulated as a constant base value plus a random component drawn from a set of uniformly distributed pseudo-random numbers. Base temperature values for the 3 motors are set at 100, 99, and 98 degrees, while the resulting temperature spans are [96.6, 103.8], [95.6, 102.4], and [94.1, 101.8] respectively. These overlapping temperature spans seem representative of real sensor measurements given noise and the variability of operating conditions.
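The simulation described above can be sketched as follows. The base values come from the text; the half-width of the uniform noise (4 degrees) is an assumption chosen to give spans comparable to those reported, and the function name and seeds are illustrative.

```python
import random


def simulate_temperature(base, n_segments, half_width=4.0, seed=0):
    """Constant base value plus uniformly distributed pseudo-random noise."""
    rng = random.Random(seed)
    return [base + rng.uniform(-half_width, half_width) for _ in range(n_segments)]


# Base temperatures per motor, as given in the text.
BASES = {1: 100.0, 2: 99.0, 3: 98.0}

temps = {
    motor: simulate_temperature(base, n_segments=1000, seed=motor)
    for motor, base in BASES.items()
}

for motor, values in temps.items():
    print(motor, round(min(values), 1), round(max(values), 1))
```

Because the noise is much wider than the one-degree gaps between base values, the three simulated spans overlap heavily, so temperature alone cannot separate the classes.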

Feature Extraction

Each vibration track of “T” samples (e.g. T = 48,000 x 90 = 4,320,000 samples) is divided into non-overlapping segments of equal length (i.e. S = 1024 samples, or 21.3 milliseconds per segment) to generate features in the time and frequency domains.

Feature definition

Preliminary features selected for this study are defined as follows:

1.      Time domain energy measure of the vibration signal. It is estimated as the root mean squared value of the vibration time series in the segment of interest (i.e. $E_T(m, S)$):

$$E_T(m, S) = \sqrt{\frac{1}{S} \sum_{k=mS}^{(m+1)S-1} V(k)^2} \tag{1}$$

Where $V(k)$ is the vibration amplitude at time sample “k”, “S” is the segment length in samples (i.e. 1024), and “m” is the segment rank varying from 0 (first segment) to $\lfloor T/S \rfloor - 1$ (last segment), with T being the total track length (in samples).

Since “S” is a given constant, the feature notation can be simplified as follows:

$$E_T(m) = \mathrm{RMS}(V_m) \tag{2}$$

With $V_m$ being the vibration time series at segment # m, that is:

$$V_m = \{ V(k),\ k = mS, \ldots, (m+1)S - 1 \} \tag{3}$$

The effect of the segment length “S” and of overlap between neighboring segments is an interesting factor that will be addressed in future studies. This feature (i.e. time domain energy) is useful in classification as it provides a general indication of the machine health: the lower the feature value, the healthier the machine. Figure 2 shows an example of this feature's values across the 3 studied motors.

Figure 2. Time domain energy feature for the 3 motors
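The segmentation and the time domain energy feature can be sketched as follows. The sampling rate and segment length come from the text; the 500 Hz test tone stands in for real sensor data, and dropping the trailing partial segment is an assumption.

```python
import math


def segment(track, seg_len=1024):
    """Split a track into non-overlapping segments, dropping any trailing partial one."""
    n_full = len(track) // seg_len
    return [track[m * seg_len:(m + 1) * seg_len] for m in range(n_full)]


def rms(seg):
    """Root mean squared value of one segment (the time domain energy feature)."""
    return math.sqrt(sum(v * v for v in seg) / len(seg))


fs = 48_000    # sampling rate in Hz, as in the text
seg_len = 1024  # segment length in samples, as in the text

# Hypothetical 3-second track: a 500 Hz sine of amplitude 2.
track = [2.0 * math.sin(2 * math.pi * 500 * k / fs) for k in range(3 * fs)]

segments = segment(track, seg_len)
features = [rms(s) for s in segments]

print(len(segments), "segments of", round(1000 * seg_len / fs, 1), "ms each")
print(round(features[0], 3))
```

For a pure sine the feature stays close to the amplitude divided by the square root of 2, which gives a quick sanity check before running the extraction on real recordings.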


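2.     Frequency domain energy measure in the 8 bins corresponding to the frequency band [f(10), f(18)] = [422, 797] Hz. This feature is driven by the frequency responses of the three motors, since most of the energy for motors #1 and #2 was concentrated in the area [400-800] Hz; it is calculated as follows:

$$E_B(m) = \sum_{i \in \mathcal{B}} \left| \mathcal{F}(V_m)(i) \right|^2 \tag{4}$$

Where $\mathcal{F}$ represents the Fourier transform operator, $V_m$ is the vibration time series at segment # m, and $\mathcal{B}$ is the set of frequency bins covering the band [f(10), f(18)]. Figure 3 shows this feature's variation across the 3 studied motors.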

Figure 3. Band-limited frequency domain energy feature for the 3 motors: according to this feature, motor #3 (“Normal”) is separated from the other 2 motors


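3.      Peak energy value in the frequency domain. It is computed as follows:

$$E_P(m) = \max_{i} \left| \mathcal{F}(V_m)(i) \right|^2 \tag{5}$$

where $\mathcal{F}$ is the Fourier transform operator and $V_m$ is the vibration time series at segment # m. This feature exhibited robustness across sensor positions, as will become apparent later. Figure 4 shows the feature variation across the 3 studied motors.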

Figure 4. Peak frequency domain energy feature for the 3 motors
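The two frequency domain features can be sketched with a direct discrete Fourier transform, which keeps the example dependency-free; in practice an FFT routine would be used instead. Several choices here are assumptions: energy is taken as the squared DFT magnitude, the band is mapped to the 0-based bins 9 to 16 (8 bins of 46.875 Hz starting near 422 Hz), and the peak search skips the DC bin.

```python
import cmath
import math


def dft_energy(seg, bins):
    """Squared DFT magnitude of a segment at the requested 0-based bin indices."""
    n = len(seg)
    energies = []
    for b in bins:
        coeff = sum(v * cmath.exp(-2j * math.pi * b * k / n) for k, v in enumerate(seg))
        energies.append(abs(coeff) ** 2)
    return energies


def band_energy(seg, first_bin=9, last_bin=16):
    """Energy summed over the bins covering roughly 422 to 797 Hz at 48 kHz."""
    return sum(dft_energy(seg, range(first_bin, last_bin + 1)))


def peak_energy(seg):
    """Largest single-bin energy over the positive frequencies (DC excluded)."""
    return max(dft_energy(seg, range(1, len(seg) // 2 + 1)))


n = 1024
# Hypothetical segments: pure tones placed exactly on DFT bins 12 and 40,
# i.e. 562.5 Hz (inside the band) and 1875 Hz (outside it) at 48 kHz.
in_band = [math.sin(2 * math.pi * 12 * k / n) for k in range(n)]
out_of_band = [math.sin(2 * math.pi * 40 * k / n) for k in range(n)]

print(round(band_energy(in_band)), round(band_energy(out_of_band)))
print(round(peak_energy(out_of_band)))
```

The out-of-band tone carries essentially no band energy but the same peak energy, which illustrates why the two features can respond differently to the same signal.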


4.      Simulated temperature

Figure 5 shows variation of the simulated temperature response across the 3 studied motors.

Figure 5.  Simulated temperature response across the 3 studied motors

Feature across all data sets

Figure 6 shows variations of the 4 features defined in the previous section based on all data captured from the 3 different motors at the 4 sensor positions, for a total of 51,456 data segments based on a segment length of 1024 samples. Note that data captured with the sensor in position #3 is the weakest; in fact, it coincides with the least audible vibration sound noticed during data gathering. Note also that feature #3 (i.e. peak energy value in the frequency domain) is more effective than the other features in discriminating between close cases such as motors #2 and #3 in sensor position #3.

 Figure 6. Time domain energy (upper left), band-limited frequency domain energy (upper right), peak frequency domain energy (lower left), and simulated temperature (lower right) computed over segments of time for the 3 studied motors across the 4 sensor positions. A data set for each feature is comprised of 4 segments juxtaposed horizontally corresponding to the 4 sensor positions. Each segment is comprised of 3 staircase-like pieces corresponding to the 3 motors.


Feature cross correlation

An important element of feature extraction is to study the correlation between features, since it is a measure of their dependency. If the correlation index associated with two features is relatively high, then those two features are highly correlated and it is more beneficial to carry only one of them in order to reduce overfitting and improve the generalization of models. There are many ways of calculating correlation coefficients depending on the nature of the dependency between the features of interest (e.g. linear versus nonlinear). The Pearson correlation method is typically used as it provides a measure of linear dependency between features and is defined as follows:

$$\rho(A, B) = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{A_i - \mu_A}{\sigma_A} \right) \left( \frac{B_i - \mu_B}{\sigma_B} \right) \tag{6}$$

Where N is the number of scalar observations of each feature, $\mu_A$ and $\sigma_A$ are the mean and standard deviation of feature A, while $\mu_B$ and $\sigma_B$ are the mean and standard deviation of feature B. The correlation coefficient matrix of two features A and B is the matrix of correlation coefficients for each pairwise variable combination:

$$R = \begin{pmatrix} \rho(A, A) & \rho(A, B) \\ \rho(B, A) & \rho(B, B) \end{pmatrix} \tag{7}$$

Using equation (6), the correlation coefficient matrix for the 4 studied features across all gathered data is given by the following 4 by 4 matrix:

1.0000    0.9674    0.4086    0.4383

0.9674    1.0000    0.4011    0.4263

0.4086    0.4011    1.0000    0.4335

0.4383    0.4263    0.4335    1.0000


Note that features #1 and #2 (i.e. signal energy in the time domain and in the frequency band [400-800] Hz) are highly correlated. Features #3 (peak frequency domain energy) and #4 (temperature) are, on the other hand, less correlated with the rest of the features, making them potentially more useful for model generalization. The final set of features is determined by the performance of the classification algorithm across various operating conditions.
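The correlation analysis can be sketched as below. The Pearson coefficient uses sample standard deviations with N - 1 normalization; the 0.9 threshold for calling two features redundant is an assumption for illustration, and the matrix values are the ones reported above.

```python
from statistics import mean, stdev


def pearson(a, b):
    """Pearson coefficient using sample standard deviations (N - 1 normalization)."""
    mu_a, mu_b = mean(a), mean(b)
    sd_a, sd_b = stdev(a), stdev(b)
    n = len(a)
    return sum(
        ((x - mu_a) / sd_a) * ((y - mu_b) / sd_b) for x, y in zip(a, b)
    ) / (n - 1)


def correlation_matrix(features):
    """Matrix of pairwise Pearson coefficients for a list of feature series."""
    return [[pearson(f, g) for g in features] for f in features]


def redundant_pairs(matrix, threshold=0.9):
    """0-based index pairs of features whose absolute correlation exceeds the threshold."""
    size = len(matrix)
    return [
        (i, j)
        for i in range(size)
        for j in range(i + 1, size)
        if abs(matrix[i][j]) > threshold
    ]


# Correlation matrix reported in the text for the 4 studied features.
REPORTED = [
    [1.0000, 0.9674, 0.4086, 0.4383],
    [0.9674, 1.0000, 0.4011, 0.4263],
    [0.4086, 0.4011, 1.0000, 0.4335],
    [0.4383, 0.4263, 0.4335, 1.0000],
]

print(redundant_pairs(REPORTED))
print(correlation_matrix([[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]))
```

With the assumed threshold, only the pair formed by features #1 and #2 (indices 0 and 1) is flagged, in line with the observation above.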


Machine Learning Methods for Classification

The predictive modeling problem at hand is a classic case of machine learning. There are many supervised and unsupervised techniques that can be used to classify the three studied motors into their appropriate classes (i.e. “Warning”, “Alarming”, and “Normal”). At this early stage of the project, with only a few data tracks collected, two classical algorithms will be tested: unsupervised K-means clustering and supervised feed forward neural networks. As more data is collected, more complex algorithms and architectures will be tried and tested for better classification performance.

Supervised Classification via Neural Networks

A feed forward neural network with a 50-neuron hidden layer and 4 inputs (features), using 70% of the whole data for training (36,019 segments), 15% for validation (7,718 segments), and 15% for testing (7,718 segments), achieved a classification accuracy of 99.2%, as shown by the confusion matrix in Figure 7. The confusion matrix, also known as the error matrix, is typically used to visualize the system performance. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.

In this case, 19 data points of class #1 (warning) are mistakenly labeled as class #2 (alarming), 29 points of class #2 (alarming) are mistakenly labeled as class #1 (warning), 5 data points of class #3 (normal) are mistakenly labeled as class #2 (alarming), and similarly 5 data points of class #2 (alarming) are labeled as class #3 (normal). No data point of class #1 (warning) was misclassified as class #3 (normal) and vice versa. In total, 7,660 out of 7,718 data points, or 99.2%, are classified correctly in their appropriate classes.

Figure 7. Confusion matrix of a 50-neuron hidden layer Feed Forward Neural Network model shows 99.2% classification accuracy
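The reported accuracy can be checked directly from the misclassification counts quoted in the text. The short sketch below uses only those counts and the test set size; the per-class totals on the diagonal of the confusion matrix are not given in the text and are not needed.

```python
# Misclassification counts from the text, keyed by (actual class, predicted class).
errors = {(1, 2): 19, (2, 1): 29, (3, 2): 5, (2, 3): 5}
n_test = 7718  # number of test segments

n_wrong = sum(errors.values())
n_correct = n_test - n_wrong
accuracy = 100.0 * n_correct / n_test

print(n_wrong, n_correct, round(accuracy, 1))
```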


Conclusion

This report presented an initial machine learning model for predictive maintenance based on empirical vibration and temperature data. Vibration sensors are placed in four different positions on the housing surface of three running motors at different health stages to study the model performance and its robustness with respect to sensor mounting and various operating conditions. Gathered data was analyzed to extract features and build supervised and unsupervised classification algorithms. Initial results using feed forward Neural Networks look promising both in terms of robustness to feature selection and sensor position and in terms of algorithm performance, with a 99.2% classification accuracy.

In more complex settings, such as manufacturing floors with hundreds of machines and millions of signal segments, a more complex structure such as Deep Learning with recurrent neural networks is more suitable for classification as part of an efficient predictive health monitoring approach.

For non-labeled data with no a priori knowledge about machine health, there are other methods to estimate the machine state (e.g. normal, alarm, warning), including clustering methods and other advanced techniques that estimate the remaining useful life of machines. Such a scenario will be addressed in a future case study.

Monday, 20 July 2020 16:07

Cancer Detection in Pharmaceutical Industry



AI: A Paradigm Shift in the Pharmaceutical Industry: Use Case of Cancer Detection




The current business model of pharmaceutical industry where a new drug may take a decade and Billions of dollars to develop is no longer viable in this digital era of big data and cloud computing.  Giant IT companies such as Amazon and Google are leveraging their deep pockets and strong AI footprints to lower the entry barrier to this vital sector and render classical models of drug discovery and development obsolete.  

AI, and Deep Learning in particular, can empower translational pharma research at each phase of drug discovery and development, from the initial candidate selection phase, which aims at drug and target selection, up to the post-launch phase III, which aims at life-cycle management. Each phase in the drug discovery chart can be accelerated by developing and deploying accurate predictive models trained on relevant historical data.  For example, modeling diseased human cells by varying the levels of sugar and oxygen the cells are exposed to, and then tracking their lipid, metabolite, enzyme and protein profiles, is an area where AI and cloud computing can add value and save both time and money.   Some pharmaceutical companies, including Novartis and AstraZeneca, have demonstrated impressive results in drug discovery and development by embracing AI over the last five years [1].

In the spirit of showing the benefits of AI and data analytics in pharmaceutical research, we present here the results of using a specific class of AI to detect ovarian cancer.

Data collection and formatting

Data used in this study is courtesy of the Food and Drug Administration-National Cancer Institute (FDA-NCI) Clinical Proteomics Program Databank.  The data consists of mass spectrometry signatures of the protein profiles of 216 patients: 121 patients with ovarian cancer and 95 cancer-free persons used as the control group in this study.   Signature extraction and identification is performed using serum proteomic pattern diagnostics, where proteomic signatures from high-dimensional mass spectrometry data are used as a diagnostic classifier [2].  Profile patterns are generated using surface-enhanced laser desorption and ionization (SELDI) protein mass spectrometry [3]. The objective is to build a classifier that assigns each patient to one of two classes (i.e. cancer and cancer-free) based on a limited number of features selected from the SELDI data of the studied samples.

Raw data is pre-processed and put in a 216 by 15,000 matrix.  The 216 rows represent the patients, of which 121 are ovarian cancer patients and 95 are normal (i.e. cancer-free) patients. The 15,000 columns represent the mass-charge values in M/Z, where M stands for mass and Z stands for the charge number of ions. M/Z (or simply |MZ|) represents mass divided by charge number, and the horizontal axis of a mass spectrum is expressed in units of m/z. Each column in the data matrix holds the ion intensity level at one specific mass-charge value (one out of the 15,000) indicated in |MZ|.

Another 2 by 216 index matrix holds the class information that associates each data sample with its class of patients. For instance, the first 121 elements of the first row of this matrix have the index value “1”, indicating association with cancer patients, whereas the remaining 95 elements of this first row are set to zero, indicating association with cancer-free patients.  The reduced dataset of features considered for this study is a 100 by 216 matrix. Each column represents one of the 216 patients and each row represents the ion intensity level at one of the 100 highest mass-charge values for each patient. A 3-D representation of this dataset is shown below in Figure 1.

Figure 1. Ion intensity levels at the 100 highest mass-charge values of the 216 patients
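The class index matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text specifies the first row only, so the second row is assumed here to be its complement (flagging cancer-free patients), and the helper name build_targets is hypothetical.

```python
# Illustrative sketch of the 2-by-216 class index matrix.
N_CANCER = 121  # ovarian cancer patients, as stated in the data description
N_NORMAL = 95   # cancer-free control group, as stated in the data description


def build_targets(n_cancer, n_normal):
    """Return a 2-row index matrix with cancer patients listed first.

    Row 0 is 1 for cancer patients and 0 otherwise (as in the text).
    Row 1 is ASSUMED to be the complement, flagging cancer-free patients.
    """
    cancer_row = [1] * n_cancer + [0] * n_normal
    normal_row = [0] * n_cancer + [1] * n_normal
    return [cancer_row, normal_row]


targets = build_targets(N_CANCER, N_NORMAL)
n_patients = len(targets[0])  # one column per patient
```

Each column then contains exactly one "1", so the two rows can serve directly as the two target outputs of a classifier.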


Classification Using a Feed Forward Neural Network

Various clustering and classification techniques have been tested. We present in this section the results of classification using Feed Forward Neural Networks (FFNN), an important Machine Learning technique widely used in classification problems. The set of features identified in the previous section (i.e. the highest 100 mass-charge values) will be used to classify cancer and normal samples.

A feed forward neural network with one hidden layer, consisting of 100 input neurons, 8 hidden layer neurons, and 2 output neurons, is created and trained to classify data samples. Figure 2 shows the FFNN structure used in this classification study.


Figure 2. Feed Forward Neural Network architecture used for classification
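To make the 100-8-2 structure concrete, the following is a minimal forward-pass sketch in plain Python. The layer sizes come from the text. Everything else is an assumption made for illustration: the tanh hidden activation, the softmax output, the random weight initialization, and the placeholder input vector (which is not real SELDI data).

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 100, 8, 2  # layer sizes from the text


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(params, x):
    """Forward pass: tanh hidden layer (assumed), softmax output (assumed)."""
    (w1, b1), (w2, b2) = params
    hidden = [math.tanh(v) for v in affine(w1, b1, x)]
    return softmax(affine(w2, b2, hidden))


rng = random.Random(0)
params = [init_layer(N_INPUT, N_HIDDEN, rng),
          init_layer(N_HIDDEN, N_OUTPUT, rng)]

# Trainable parameters: (100*8 + 8) + (8*2 + 2)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)

sample = [rng.random() for _ in range(N_INPUT)]  # placeholder input only
probs = forward(params, sample)                  # two class probabilities
```

With only 8 hidden neurons the network has 826 trainable parameters, which is small relative to many image or text models but still larger than the 216 available samples, hence the importance of the validation set during training.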


The input and target samples are automatically divided into training, validation, and test sets. The training set is used to train the FFNN. Training continues as long as the FFNN performance on the validation set keeps improving.
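The stopping rule described above is commonly implemented as early stopping on the validation loss. The sketch below shows one way to do it. The function name best_epoch, the patience parameter, and the loss values are hypothetical and chosen for illustration; the source does not state which stopping criterion or settings were used.

```python
def best_epoch(val_losses, patience=6):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` epochs in a row. The network parameters
    saved at `best_epoch` are the ones that would be kept.
    """
    best_loss = float("inf")
    best_idx = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            return best_idx, epoch
    return best_idx, len(val_losses) - 1


# Made-up validation losses, for illustration only.
example_losses = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]
example_result = best_epoch(example_losses, patience=3)
```

In this made-up run the lowest validation loss occurs at epoch 3, and training stops three epochs later, at epoch 6.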

The data is distributed over the training, validation, and test sets with 152 samples (about 70% of the 216 samples), 32 samples (about 15%), and 32 samples (about 15%), respectively. The network's performance on the test set gives an estimate of how well it will perform on real-world data.  Figure 3 shows how the network's performance improved during training using the well-known Scaled Conjugate Gradient (SCG) algorithm. Training minimizes the cross-entropy loss function, shown on a logarithmic scale, which decreased rapidly as the network was trained.


Figure 3. Training performance of the FFNN of Figure 2. Note that validation error was minimal at training epoch 11; the optimal network parameters are those identified at that epoch
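The split sizes and the loss function mentioned above can be reproduced with a short sketch. The rounding rule (validation and test sets rounded down, with the remainder going to training) is an assumption that happens to yield the reported 152/32/32 split. The random seed and the function names are hypothetical.

```python
import math
import random


def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Randomly split sample indices into training, validation, and test sets."""
    n_val = int(n_samples * val_frac)     # 216 * 0.15 = 32.4, rounded down to 32
    n_test = int(n_samples * test_frac)   # 32
    n_train = n_samples - n_val - n_test  # the remaining 152 samples
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


train_idx, val_idx, test_idx = split_indices(216)
uninformed_loss = cross_entropy([1, 0], [0.5, 0.5])  # equals ln(2)
```

A network that assigns equal probability to both classes has a per-sample loss of ln(2), so training only becomes useful once the loss drops below that level.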



Classification Results

The trained neural network can now be tested with the testing samples that were partitioned from the main dataset. The testing data is excluded from training and hence provides an "unseen" dataset to test the network on.  One measure of how well the FFNN performs is the confusion plot, also known as the error matrix, which visualizes the classification accuracy as shown in Figure 4. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.  The confusion matrix shows the percentages of correct and incorrect classifications. Correct classifications are the green squares on the matrix diagonal. Red squares represent incorrect classifications. Class 1 indicates cancer patients and class 2 indicates cancer-free patients.


Figure 4. Confusion matrix showing the proposed FFNN classification performance on previously unseen data, with an accuracy exceeding 96%
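For readers who want to compute such a matrix themselves, here is a minimal sketch that follows the same convention (rows are predicted classes, columns are actual classes). The labels below are made up purely for illustration and are not the study's test results; the function names are hypothetical.

```python
def confusion_matrix(predicted, actual, classes=(1, 2)):
    """Count outcomes with rows = predicted class and columns = actual class."""
    pos = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[pos[p]][pos[a]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels (1 = cancer, 2 = cancer-free), NOT the study's data.
actual = [1, 1, 1, 2, 2, 2, 1, 2]
predicted = [1, 1, 2, 2, 2, 1, 1, 2]

cm = confusion_matrix(predicted, actual)
acc = accuracy(cm)
```

In a cancer screening setting the off-diagonal cells are not equally costly: a cancer patient predicted as cancer-free (row 2, column 1) is usually the more serious error, so it is worth reading that cell separately rather than relying on overall accuracy alone.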


Figure 5 shows another way of measuring the FFNN performance, using an error histogram across the three datasets (i.e. training, validation, and test). As can be seen, most of the instances in all three datasets fall in the smallest-error bins.


Figure 5.  FFNN performance on the three datasets (i.e. training, validation, and test). Most instances resulted in small errors, indicating accurate classification.




In this study, based on the ion intensity levels of 216 individuals (121 cancer patients and a cancer-free control group of 95), a straightforward Feed Forward Neural Network classifier showed excellent classification results, approaching 97% accuracy.  This use case is just one example of the promise of Artificial Intelligence in pharma R&D, including drug discovery and drug development. Research on chronic diseases such as Alzheimer's, diabetes, and cancer is expected to benefit from this new research paradigm in pharmaceutical companies, built around AI and cloud computing.

YaiGlobal is excited to have its mission set on the promises and challenges of this structural transformation that is touching almost every field of the economy. With its resolute commitment to develop and deploy AI and Cloud computing to address real complex issues, YaiGlobal is looking forward to being an active part of this paradigm shift of digital transformation.



[1] Alex Zhavoronkov, "Deep Dive Into Big Pharma AI Productivity: One Study Shaking The Pharmaceutical Industry", Retrieved from


[2] T.P. Conrads, et al., "High-resolution serum proteomic features for ovarian cancer detection", Endocrine-Related Cancer, 11, 2004, pp. 163-178.


[3] E.F. Petricoin, et al., "Use of proteomic patterns in serum to identify ovarian cancer", Lancet, 359(9306), 2002, pp. 572-577.