A Simple Experiment to Test the Impact of Significant News Events on Stock Trading at the Colombo Stock Exchange in 2017: A Data Mining Approach

The stock market is an integral part of any economy and often reflects that economy's performance. Undoubtedly, external forces also play a major role in how the stock market performs. However, the impact of these factors is not clearly visible in, or directly attributable to, the financial performance of listed companies or industries, which makes it difficult to predict the stock market from financial performance alone. A typical local example is the record for the highest stock trade, set on 9th January 2015 in the aftermath of the 2015 Presidential election. A similar case in the global context is the election of Trump in the US. To understand this effect, we, a group of 3, carried out a small data mining experiment on the question ‘How is the Colombo stock market affected by local news events in Sri Lanka?’.

The variation in stock prices at the CSE in 2015. The highest values were recorded around the election days in January.

The goal of the experiment was to establish whether a correlation exists between significant news events and the volume of capital share trading during the day. To concentrate on this hypothesis, the experiment assumed a many-to-one relation between the set of events that occur on a specific day and that day's share trade volume. It was therefore assumed that only significant news events have an impact on the stock market, disregarding any other causes of fluctuation in market indicators. The implications of this assumption are discussed in depth in the evaluation and discussion section of this article.

Gathering Data

Compiling a comprehensive data set required a substantial amount of work, as no suitable data set exists in the Sri Lankan context. Since our focus was specifically on news events in Sri Lanka, we needed a data set that at least summarizes news events on a daily basis, and none existed. Our requirement to model the events as a custom feature vector, specific to the design of this analysis, produced an extra set of challenges. Similar analyses typically take a Natural Language Processing approach, in which the news events (typically the headlines) are used to develop a feature vector of frequent keywords, which is then used to build a classification model with an approach suited to the objectives of the respective research. In this experiment, however, each event had to be modeled in a manner that is uniquely identifiable (at least to a significant extent) and uniform across all events.

We had already gathered stock market data from the Colombo Stock Market, collected officially for academic purposes. To develop the data set of events, we formulated a very simple mechanism for handpicking significant events on a daily basis in the year 2017. The condition for being considered a ‘significant’ event was simply being featured among the headlines on the front page of the Sunday Observer newspaper (chosen because it was the only freely available e-newspaper). The headlines were collected with their relative importance taken into account. Going through each daily newspaper was a very tiring task even though the time span was only 1 year, so the 2 other members of the group assisted by covering individual time spans of the year. We considered only up to 3 events per day, as it was observed, and later justified, that the impact of events drops exponentially with their relative importance, specifically after the 3rd significant event. Often, only the 2 most significant events had significant relevance for the decisions being made.

Developing the Data set

The structure of the data set in this experiment was simple. A single day is modeled by the date, up to 3 events and a class attribute: the stock trading category. Each of these events was further modeled by its keyword, sentiment, type and relative trending index.

In modeling an event, the key assumption was that the significance of an event is proportional to the search volume of a keyword directly attributable to that event. Simply put, significance was modeled by how much interest internet users in Sri Lanka showed in searching for a particular event in its aftermath.

There are several key aspects to modeling these events. We used Google Trends to retrieve the trending index of the events. Google Trends provides a keyword search service that indicates the search volume of a keyword relative to the total number of searches performed in the period concerned. This was an issue to overcome: it would have been easier to model an event directly from its actual search volume, but since only relative indexes were available, the hype around a particular event had to be modeled using a workaround. Using the comparison service, we searched for all available keywords for the day in order to compare the results, keeping the date range constant in all cases. Yet the absolute indexes of these events would not make a good comparison, since it was observed that some keywords are searched in large volumes throughout the year regardless of whether they relate to a significant event. For instance, according to the Google Trends statistics for 2016, the top 5 keywords were ‘Cricket Scores’, ‘Prema Dadayama’, ‘Sadahatama Oba Mage’ (both names of popular tele-dramas), ‘US Election’ and ‘Sidu’ (another popular tele-drama), most of which are irrelevant and insignificant to decisions regarding the economy or investment. Similarly, keywords such as the airline ‘Sri Lankan’ and political figures such as ‘Mahinda Rajapakse’ and ‘Maithripala Sirisena’ have a steady trending pattern, which would incorrectly suggest a large hype and an incorrect relative significance among the other events of the day.
To overcome this, the average index of each keyword over the period was first calculated (having extracted the entire variation of indexes for the period in CSV form), and only the difference between the index on that particular day and this average was taken. This value is still specific to the individual event, so to derive the relative index, the values for all available events of the day were normalized to a scale of 0-1, so that the event that differs most from its average takes the largest proportion among the events, while the least significant of the 3 takes the smallest.
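
The baseline correction and normalization described above can be sketched in Python as follows. This is a minimal illustration rather than the script used in the experiment: the function names, the sample index values and the choice to treat below-average days as zero hype are all assumptions.

```python
from statistics import mean


def excess_over_baseline(daily_index, day):
    """Return how far a keyword's index on `day` sits above its period average.

    `daily_index` maps a date string to the Google Trends index of one
    keyword, as exported to CSV. Clipping negative values to zero is an
    assumption: a below-average day is treated as carrying no hype.
    """
    baseline = mean(daily_index.values())
    return max(daily_index[day] - baseline, 0.0)


def relative_significance(excesses):
    """Scale the day's excess values to proportions on a 0-1 scale."""
    total = sum(excesses)
    if total == 0:
        # No event stood out from its baseline; assign zero significance.
        return [0.0 for _ in excesses]
    return [e / total for e in excesses]


# Hypothetical index values for two keywords over the same date range.
keyword_a = {"2017-03-01": 10, "2017-03-02": 10, "2017-03-03": 70}
keyword_b = {"2017-03-01": 40, "2017-03-02": 50, "2017-03-03": 60}

day = "2017-03-03"
excesses = [excess_over_baseline(k, day) for k in (keyword_a, keyword_b)]
shares = relative_significance(excesses)
print(shares)
```

A keyword that is popular all year round has a high baseline, so it contributes little to the day's proportions unless its index rises well above that baseline.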

Using a sentiment analysis program developed in Python and trained on a large corpus annotated with positive and negative sentiment, the news events were individually categorized into these 2 sentiments. The headlines were listed in a text file and fed into the program, and the results were recorded in the data set as +1 for positive and -1 for negative. The type of each event was added to the data set manually, classifying it as economic, political or general.

Once complete, the ideal data set would have contained 4745 data values, corresponding to the 365 days (rows) of the year. However, since the stock market is closed on all holidays and weekends, and not all days have significant events, the data set actually collected was limited to only 141 rows (covering 9 months of 2017). This caused another issue when performing classification, as some attributes, such as event_keyword and the absolute value of Share Trade Equity, would be too fine-grained, which creates a significant error. For this reason, the keywords were removed from the analysis and the Share Trade Equity was normalized and discretized into 4 categories. The normalization assumed a normal distribution for the Share Trade Equity values, which were divided into 4 ranges of equal probability (25% each), with each range assigned a class label.

The rationale for using Share Trade Equity as the class attribute was threefold: it gives a holistic perspective on stock market trading; share trading on a given day is somewhat independent of share trading on previous days; and it is rather neutral to the performance of specific industries or organizations, as it hides these differences. Therefore, in line with the assumptions stated, Share Trade Equity (as a category) was used as the class attribute.
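
One way to read the four-category discretization of Share Trade Equity is to fit a normal distribution to the values and cut it at its 25%, 50% and 75% points. The sketch below follows that reading using only the Python standard library. The function names, the class labels `Q1` to `Q4` and the sample values are assumptions made for illustration, not the data or tooling of the experiment.

```python
from statistics import NormalDist


def normal_quartile_cuts(values):
    """Fit a normal distribution and return its 25%, 50% and 75% points."""
    dist = NormalDist.from_samples(values)
    return [dist.inv_cdf(p) for p in (0.25, 0.50, 0.75)]


def to_category(value, cuts, labels=("Q1", "Q2", "Q3", "Q4")):
    """Map a value to the label of the equal-probability range it falls in."""
    for cut, label in zip(cuts, labels):
        if value <= cut:
            return label
    return labels[-1]


# Hypothetical Share Trade Equity values in arbitrary units.
equity = [100.0, 220.0, 310.0, 380.0, 500.0]
cuts = normal_quartile_cuts(equity)
categories = [to_category(v, cuts) for v in equity]
print(categories)
```

Because the cut points come from the fitted distribution rather than from the observed ranks, the four classes hold exactly 25% of the rows each only when the data are close to normal.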

It would have been ideal to keep the keyword in the analysis, considering that typical NLP approaches to classification develop a feature vector from the most significant keywords in the data set to model the event. Instead, in order to model the events comprehensively, the design incorporated the type, sentiment and significance of the event, in both ordinal and ratio forms, so that the analysis is not confined to tokens but covers other attributes as well.
It should be noted that, in order to capture news events that trend for several days, the event was duplicated by keyword in the data set over the period in which it trended in the newspapers. The variation in reach/search is captured intuitively by the variation in the event's relative significance index.


We used both Weka and Rapid Miner to analyze the data, with Weka as the primary tool in order to keep the process simple. The analysis used only dense data sets, considering 1, 2 and all 3 events, thereby applying a pair-wise deletion to handle incomplete data (i.e. when considering 2 events, rows that do not contain both events were deleted). Analyses were conducted on all 3 of these data sets. The best results were achieved with classification using 2 events, and the evaluation of that case is used for all further explanation in this article. The classifications were performed using both a 70-30 holdout and 10-fold cross-validation, and the latter was later found to produce better results. To overcome the ‘curse of dimensionality’, in case it applied here, a Principal Component Analysis (PCA) was performed. It was found that only the significance indexes of the 2 events were highly correlated, at 0.98, which prompted the removal of the significance index of the second event from the analysis.
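
The construction of the dense data sets and the correlation check on the significance indexes can be sketched as below. The experiment itself used Weka and Rapid Miner, so this Python version is only an illustration: the row layout, the sample values, the helper names and the 0.98 cut-off used as a drop rule are assumptions, and the PCA itself is not reproduced.

```python
from math import sqrt


def dense_subset(rows, n_events):
    """Keep only rows whose first `n_events` event slots are all filled."""
    return [
        r for r in rows
        if len(r["significance"]) >= n_events
        and all(s is not None for s in r["significance"][:n_events])
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant, otherwise the denominator is zero.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical rows: one per trading day, up to 3 significance indexes.
rows = [
    {"date": "2017-01-02", "significance": [0.8, 0.2, None]},
    {"date": "2017-01-03", "significance": [1.0, None, None]},
    {"date": "2017-01-04", "significance": [0.5, 0.3, 0.2]},
    {"date": "2017-01-05", "significance": [0.7, 0.3, None]},
]

two_event_rows = dense_subset(rows, 2)
first = [r["significance"][0] for r in two_event_rows]
second = [r["significance"][1] for r in two_event_rows]
r = pearson(first, second)

# Assumed rule: drop the second index if it is almost fully determined by the first.
drop_second = abs(r) >= 0.98
print(len(two_event_rows), drop_second)
```

Two attributes that are almost perfectly correlated carry nearly the same information, so removing one of them shrinks the feature vector with little loss.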

It should also be noted that the attribute event_type was binarized, so that its categorical values are represented numerically and number-intensive algorithms such as k-NN could be applied.
The classifications were performed using the Decision Tree, Support Vector Classification, Naïve Bayes, Bayesian Network, Logistic Regression and k-Nearest Neighbors algorithms.
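
The binarization of event_type mentioned above can be illustrated with a small one-hot encoder. The article does not state the exact encoding that was applied, so the layout below, with one 0/1 indicator per category in a fixed order, is an assumption, as are the function and constant names.

```python
# The three event types used in the data set, in an assumed fixed order.
EVENT_TYPES = ("economic", "political", "general")


def binarize_event_type(event_type):
    """Return one 0/1 indicator per known event type (one-hot encoding)."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return [1 if event_type == known else 0 for known in EVENT_TYPES]


encoded = binarize_event_type("political")
print(encoded)
```

Coding the types as 1, 2 and 3 would instead impose an artificial order and spacing on them, which distance-based algorithms such as k-NN would treat as meaningful.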

Evaluation and Observations

The classification performance of each algorithm was evaluated using basic indicators such as accuracy, true-positive rate, precision, recall and PRC area. The best classification accuracy, 69.281%, was obtained using a Bayesian Network. This is a considerable accuracy given that the PRC area in all cases was in the range 0.45 to 0.6, which indicates that the data points are rather scattered. This is a significant observation, as it suggests that several factors other than the events and their attributes must contribute in order to reach a higher accuracy. Yet it could also be concluded that, rather than the full set of events in a day, decisions are strongly influenced by very few, typically one or two, prominent events in a day. It was therefore observed, and justified through the analysis, that decision makers may be focusing on key events, specifically turning points for society or the economy, as part of their rational decision-making process.

A section of the results obtained with the different algorithms.

An important finding of the PCA was the high correlation between the sentiment of a particular event and the final outcome. The PCA produced an eigen-matrix in which event_sentiment was the attribute selected from the first component, followed by the event_significance index in the second component. This indicates that the final outcome is biased towards the general opinion about the event, resulting in a positive or negative impact on trading. It is therefore highly likely that a future study in this domain would achieve higher prediction accuracy if it incorporated significance, and especially sentiment, as attributes of the feature vector.


After the analysis, in order to understand how to improve the accuracy of the models, we studied the domain further and identified 3 key factors in decision making regarding stock markets: stock market policies, industry/organizational financial performance and government policies. Government policies are incorporated indirectly through the events, in cases where they trigger a significant event that makes the headlines, e.g. gazetting a price ceiling for rice causing an improvised rice shortage. Although the other 2 factors were ignored by assumption in this experiment, the available resources, such as the history of all stock market policy changes and the seasonal performance of industries/organizations, established that they have a significant impact on decisions, even at a higher level of the stock market.

Additionally, the models could be improved primarily by using a larger data set spanning a few years. This would also overcome the issue of the granularity of the keywords and the class attribute, so that a numeric value could be predicted, provided that the other major factors discussed above are also incorporated in the data set.

Looking at the entire collection of stock data, we observed several seasonal variations along with an increase in trading capacity at the stock market. It is therefore recommended that any further study incorporate a few time-series attributes to bring these variations over time into the analysis.

Another strategy observed for improving the accuracy of the models is to incorporate an attribute per event that factors in the impact of similar events. Given a larger data set, the rationale of this approach is to identify and incorporate the different impacts that events of the same category have, beyond the type alone. The typical process would first cluster the data and then statistically calculate the probability of an event being a member of a particular cluster given its type, which could be used as an indicator of the level of impact that a particular event has on the outcome in a given scenario.
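
The second step of this proposal, estimating the probability of cluster membership given an event type, can be sketched as a simple frequency count. The sketch assumes that cluster labels have already been produced by some clustering algorithm, which is not shown. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_given_type(event_types, cluster_ids):
    """Estimate P(cluster | type) from paired type and cluster labels."""
    counts = defaultdict(Counter)
    for event_type, cluster in zip(event_types, cluster_ids):
        counts[event_type][cluster] += 1
    return {
        event_type: {c: n / sum(per_cluster.values()) for c, n in per_cluster.items()}
        for event_type, per_cluster in counts.items()
    }


# Hypothetical labels: one event type and one cluster id per event.
types = ["political", "political", "economic", "political"]
clusters = [0, 1, 0, 0]

probabilities = cluster_given_type(types, clusters)
print(probabilities)
```

Each event could then carry the probability of its own cluster as an extra attribute alongside its type.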

In conclusion, although the experiment did not provide conclusive evidence of a strong relationship between events and stock trading volume, it provided enough evidence to suggest that events could have an impact on decision making related to stock trading. Given that several applications exist for this domain, it is also clear that further research could improve these systems by considering a vast range of factors that affect the market.


