A Simple Experiment to Test the Impact of Significant News Events on Stock Trading at the Colombo Stock Exchange in 2017 – A Data Mining Approach
The stock market is an integral part of any economy and often
reflects the performance of that economy, and undoubtedly, external
forces play a major role in how the stock market performs. However, the
impact of these factors is neither as clearly visible nor as directly
attributable as the financial performance of listed companies or industries,
which suggests a fundamental difficulty in predicting the stock market solely
from financial performance. A typical example in the local domain is the record
for the highest stock trade on 9th January 2015, in the aftermath of that
year's Presidential election. A similar case in the global context is the
election of Trump in the US. Therefore, in order to understand this effect, we
– a group of 3 – carried out a small experiment using data mining on the
question 'How is the Colombo Stock Market affected by local news events in
Sri Lanka?'.
[Figure: The variation in stock prices at the CSE in 2015. The highest was recorded around the election days in January.]
The goal of the experiment was to establish the existence of
a correlation between significant news events and the volume of capital share
trading during the day. To keep the focus on this hypothesis, the experiment
assumed a many-to-one relation between the set of events occurring on a
specific day and that day's share trade volume. Consequently, only significant
news events were considered, and they were treated as the sole influence on the
stock market, disregarding any other causes of fluctuations in market
indicators. The implications of this assumption are discussed in depth in
the evaluation and discussion section of this article.
Gathering Data
Compiling a comprehensive data set for this purpose involved a
substantial workload, as no suitable data set exists in the Sri Lankan
context. Since our focus was specifically on news events in Sri Lanka, data
sets that at least summarize news events on a daily basis were non-existent,
and our requirement to model the events as a custom feature vector, specific to
the design of the analysis, produced an extra set of challenges. Typically,
similar analyses take a Natural Language Processing approach, in which the news
events – typically the headlines – are used to develop a feature vector of
frequent keywords, which is then used to build a classification model with an
approach appropriate to the objectives of the respective research. In this
experiment, however, the problem was characterized by having to model each
specific event in a manner that is uniquely identifiable (at least to a
significant extent) and yet modeled uniformly across all events.
We had already gathered stock market data from the Colombo
Stock Exchange, collected officially for academic purposes. Therefore, to
develop the data set of events, we formulated a very simple mechanism that
handpicked significant events on a daily basis for the year 2017. The
condition for being considered a 'significant' event was simply being featured
among the headlines on the front page of the Sunday Observer newspaper
(selected because it was the only freely available e-newspaper). These
headlines were also recorded in order of relative importance. Going through
each daily newspaper was a tiring task even though the time span was only
1 year, and it required the assistance of the 2 other members of the group,
each covering an individual span of the year. We considered at most 3 events
per day, as it was observed, and later justified, that the impact of events
drops exponentially with their relative importance, specifically after the 3rd
significant event. Put simply, often only the 2 most significant events had
meaningful relevance for the decisions being made.
Developing the Data Set
The structure of the data set in this experiment was simple.
A single day is modeled by the date, up to 3 events, and a class attribute –
the stock trading category. Each event was further modeled by its
keyword, sentiment, type, and relative trending index.
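To make this concrete, a single row could be laid out as follows (a minimal Python sketch; the column names and example values are illustrative, not the exact ones we used):

```python
# Illustrative layout of one row (column names are hypothetical).
# 3 events x 4 attributes + 1 class attribute = 13 data values per day,
# hence the ideal 365 x 13 = 4745 values mentioned later in this article.
COLUMNS = [
    "date",
    "event1_keyword", "event1_sentiment", "event1_type", "event1_significance",
    "event2_keyword", "event2_sentiment", "event2_type", "event2_significance",
    "event3_keyword", "event3_sentiment", "event3_type", "event3_significance",
    "trade_category",   # class attribute: discretized Share Trade Equity
]

# Example row (values invented for illustration): a day with 2 events,
# where the 3rd event's slots stay empty.
example_row = [
    "2017-02-14",
    "budget speech", +1, "political", 1.00,
    "fuel price hike", -1, "economic", 0.35,
    None, None, None, None,
    "Q3",
]
```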
In modeling an event, the key assumption was that the
significance of an event is proportional to the search volume of a keyword
directly attributable to that event. Simply put, significance was modeled by
how much interest internet users in Sri Lanka showed in searching for a
particular event in its aftermath.
There are several key aspects to the modeling of these
events. We used Google Trends to retrieve the trending index of the events.
Google Trends provides a keyword search service that indicates the search
volume of a keyword relative to the total number of searches performed over
the period in question. This posed an issue to overcome: it would have been
easier to model an event by directly considering its actual search volume,
but since only relative indexes were available, the hype around a particular
event had to be modeled using a workaround. Using the comparison service, we
searched for all available keywords for the day in order to compare the
results, keeping the date range constant in all cases. Yet the absolute
indexes of these events would not make a good comparison, since it was
observed that some keywords are searched in large volumes throughout the year
whether or not they relate to a significant event. According to Google Trends
statistics for 2016, the top 5 keywords were 'Cricket Scores', 'Prema
Dadayama', 'Sadahatama Oba Mage' (both names of popular teledramas), 'US
Election' and 'Sidu' (another popular teledrama) – most of which are irrelevant
and insignificant to decisions being made regarding the economy or investment.
Similarly, keywords such as the airline 'Sri Lankan' and political figures such
as 'Mahinda Rajapakse' and 'Maithripala Sirisena' have a steady baseline
trending pattern, which would incorrectly suggest a large hype and an incorrect
relative significance among the other events of the day.
To overcome this, the average index of the particular
keyword over the period was first calculated (having extracted the entire
variation of indexes for the period in CSV form), and only the difference
between that day's index and this average was extracted. This value is still
specific to each event, so to derive the relative index, the indexes of all
the day's available events were normalized to a 0-1 scale, so that the event
deviating most from its average holds the largest proportion by ratio among
the other events, while the least significant of the 3 holds the smallest.
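A minimal Python sketch of this workaround, assuming each keyword's Google Trends indexes were exported to a CSV with 'date' and 'index' columns (the file layout and function names are assumptions, and min-max scaling is one plausible reading of the 0-1 normalization):

```python
import pandas as pd

def diff_from_average(csv_path, day):
    """Difference between the keyword's index on `day` and its average
    over the whole extraction period ('date'/'index' columns assumed)."""
    trend = pd.read_csv(csv_path, parse_dates=["date"])
    day_index = trend.loc[trend["date"] == day, "index"].iloc[0]
    return day_index - trend["index"].mean()

def relative_significance(diffs):
    """Min-max normalize one day's per-event differences to a 0-1 scale."""
    lo, hi = min(diffs), max(diffs)
    if hi == lo:                       # a single event, or identical hype
        return [1.0] * len(diffs)
    return [(d - lo) / (hi - lo) for d in diffs]

# e.g. a day with 3 events, each with its own Trends export:
diffs = [diff_from_average(f"event{i}_trends.csv", "2017-02-14")
         for i in (1, 2, 3)]
print(relative_significance(diffs))
```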
Using a program developed in Python for sentiment analysis,
trained on a large corpus annotated with positive and negative sentiment, the
news events were individually categorized into these 2 sentiments. The
headlines, listed in a text file, were fed into the program, and the results
were recorded in the data set as +1 for positive and -1 for negative. The type
of each event was added to the data set manually, classifying it as economic,
political, or general.
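The corpus and classifier of our sentiment program are not detailed here; as a rough illustration of such a pipeline, the sketch below trains an NLTK Naive Bayes classifier on the annotated movie-review corpus as a stand-in training set and labels headlines read one per line from a text file:

```python
import nltk
from nltk.corpus import movie_reviews

nltk.download("movie_reviews", quiet=True)

# Bag-of-words features over the most frequent corpus words.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
top_words = [w for w, _ in all_words.most_common(2000)]

def features(words):
    present = set(w.lower() for w in words)
    return {w: (w in present) for w in top_words}

train = [(features(movie_reviews.words(fid)), label)
         for label in movie_reviews.categories()
         for fid in movie_reviews.fileids(label)]
classifier = nltk.NaiveBayesClassifier.train(train)

# Headlines listed one per line in a text file, as in the experiment.
with open("headlines.txt") as f:
    for headline in f:
        label = classifier.classify(features(headline.split()))
        print(+1 if label == "pos" else -1, headline.strip())
```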
Having completed the data set, it should ideally have
contained 4745 data values, corresponding to 365 days (rows) of the year.
However, since the stock market is closed on all holidays and weekends, and
since not all days have significant events, the actual collected data set was
limited to only 141 rows (covering 9 months of 2017). This posed yet another
issue when performing classification, as some attributes such as
event_keyword and the absolute value of Share Trade Equity were too
fine-grained, which creates significant error. For this reason the keywords
were removed from the analysis and the Share Trade Equity was discretized into
4 categories. The discretization assumed a normal distribution for the Share
Trade Equity values, partitioning them into bins of equal probability (25%
each), with each bin assigned a class label.
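This equal-probability binning can be reproduced with pandas' quantile-based discretization; a sketch, assuming a hypothetical CSV of daily Share Trade Equity values (pd.qcut uses empirical quartiles rather than fitted normal quantiles, but it yields the same 25% split):

```python
import pandas as pd

# Hypothetical file with one row per trading day and the day's total
# Share Trade Equity.
df = pd.read_csv("cse_daily_2017.csv")

# Four equal-probability (25%) bins over the observed distribution,
# each assigned a class label (Q1 = lowest quartile ... Q4 = highest).
df["trade_category"] = pd.qcut(
    df["share_trade_equity"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)
```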
In developing the data set, the rationale for choosing Share
Trade Equity as the class attribute was threefold: it gives a holistic
perspective on stock market trading; share trading on a given day is largely
independent of share trading on previous days; and it is rather neutral to
specific industry or organizational performances, as it hides these
differences. Therefore, in line with the stated assumptions, it was
established that the Share Trade Equity (as a category) would be used as the
class attribute.
It would have been ideal to keep the keyword in the analysis,
considering that a typical NLP approach to classification would develop a
feature vector from the most significant keywords in the dataset to model the
event. Instead, in order to model the events comprehensively, the design
incorporated the type, the sentiment, and the significance of the event in
both ordinal and ratio forms, so that the analysis was not confined to tokens
alone but drew on other attributes as well.
It should be noted that, in order to capture news events
that trend for several days, such an event was duplicated by keyword in the
data set over the period during which it trended in the newspapers. The
variation in reach/search over those days is then captured naturally by the
variation in the event's relative significance index.
Analysis
We used both Weka and RapidMiner to perform the analysis,
with Weka as the primary tool to keep the process simple. The analysis used
only dense data sets, considering 1, 2, and all 3 events, thereby inducing a
pair-wise deletion for handling incomplete data (e.g., when considering 2
events, rows with the second event missing were deleted), and the analysis was
conducted on all 3 of these data sets.
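A sketch of how such dense subsets could be derived with pandas, assuming the hypothetical column names from the schema sketch above:

```python
import pandas as pd

df = pd.read_csv("events_2017.csv")   # hypothetical 141-row data set

# A row enters the k-event dense subset only if its first k events exist.
dense_1 = df.dropna(subset=["event1_sentiment"])
dense_2 = df.dropna(subset=["event1_sentiment", "event2_sentiment"])
dense_3 = df.dropna(subset=["event1_sentiment", "event2_sentiment",
                            "event3_sentiment"])
```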
However, the best results were achieved when classification
used 2 events, and that evaluation is used for all further explanation in this
article. These analyses were performed using both a 70-30 holdout and 10-fold
cross-validation, and it was later recognized that the latter produced better
results.
In this case, in order to overcome the 'curse of dimensionality', had it
applied here, a Principal Component Analysis (PCA) was performed. It was
identified that the significance indexes of the 2 events had a correlation of
0.98, which prompted the significance index of the second event to be removed
from the analysis.
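Continuing the previous sketch, the correlation check and removal could look like this (the 0.95 threshold is an assumption):

```python
# Continuing with the 2-event dense subset from the previous sketch:
corr = dense_2["event1_significance"].corr(dense_2["event2_significance"])
print(f"correlation: {corr:.2f}")     # ~0.98 in our data

if corr > 0.95:                        # near-collinear: drop one index
    dense_2 = dense_2.drop(columns=["event2_significance"])
```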
It should also be noted that the attribute event_type was
binarized, so that the categorical values are represented numerically and
number-intensive algorithms such as k-NN could be applied.
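With pandas, this binarization amounts to one-hot encoding the type columns (continuing the same hypothetical data frame):

```python
import pandas as pd

# One-hot encode the categorical event types so that distance-based
# algorithms such as k-NN operate on numeric columns only.
dense_2 = pd.get_dummies(dense_2, columns=["event1_type", "event2_type"])
```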
The classifications were performed using Decision Tree,
Support Vector Classification, Naïve Bayes, Bayesian Network, Logistic
Regression, and k-Nearest Neighbors algorithms.
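A sketch of this model comparison using scikit-learn stand-ins and the 10-fold cross-validation noted above (the actual experiment used Weka/RapidMiner; Weka's BayesNet has no direct scikit-learn equivalent, so it is omitted here):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "SVC": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

# Keep only the modeled attributes (keywords were removed; the second
# significance index was dropped after the correlation check above).
X = dense_2[[c for c in dense_2.columns
             if c.startswith(("event1_", "event2_"))
             and "keyword" not in c]]
y = dense_2["trade_category"]

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold CV
    print(f"{name}: {scores.mean():.3f}")
```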
Evaluation and Observations
The classification performance of each algorithm was
evaluated using basic indicators such as accuracy, true-positive rate,
precision, recall, and PRC area. The best classification accuracy was obtained
with a Bayesian Network, at 69.281% – a considerable accuracy given that the
PRC area in all cases was in the range 0.45-0.6, which indicates that the data
points are rather scattered. This is a significant observation, as it suggests
that several factors beyond the events and their attributes would have to
contribute in order to converge on a higher accuracy. It could also be
concluded that, rather than by the full set of a day's events, the decisions
are most heavily impacted by very few – typically one or two prominent events
in a day. It was therefore observed, and justified through the analysis, that
decision makers may be focusing on key events – specifically turning points in
society or the economy – as part of their rational decision-making process.
[Figure: Section of the results obtained for the different algorithms.]
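These indicators can be reproduced along the following lines (a sketch continuing the scikit-learn setup above, with GaussianNB standing in for the Bayesian network):

```python
import numpy as np
from sklearn.metrics import classification_report, average_precision_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import label_binarize

model = GaussianNB()   # stand-in for the Weka Bayesian network

y_pred = cross_val_predict(model, X, y, cv=10)
print(classification_report(y, y_pred))   # accuracy, precision, recall

# Per-class PRC area (average precision) from class probabilities.
classes = np.unique(y)
y_prob = cross_val_predict(model, X, y, cv=10, method="predict_proba")
y_bin = label_binarize(y, classes=classes)
for i, cls in enumerate(classes):
    print(cls, average_precision_score(y_bin[:, i], y_prob[:, i]))
```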
An important finding of the PCA was the high correlation of
the sentiment of an event with the final outcome. The PCA produced an
eigen-matrix in which event_sentiment was the attribute selected from the
first component, followed by event_significance_index in the second component,
which indicates that the final outcome is biased towards the general opinion
about the event, resulting in a positive or negative impact on trading.
Therefore, it is highly likely that a future study in this domain would
achieve higher prediction accuracy if it incorporated significance, and
especially sentiment, as attributes in the feature vector.
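Reading dominant attributes off the PCA components can be sketched as follows (continuing the same feature matrix X):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA on the standardized features and report the attribute with the
# largest absolute loading in each of the first two components.
pca = PCA().fit(StandardScaler().fit_transform(X))
for i, component in enumerate(pca.components_[:2], start=1):
    dominant = X.columns[np.argmax(np.abs(component))]
    print(f"component {i}: dominated by {dominant}")
```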
Conclusion
In the aftermath of the analysis, in order to understand how
to improve the accuracy of the models, we studied the domain further and
identified 3 key factors in decision making regarding stock markets: stock
market policies, industry/organizational financial performance, and government
policies. Government policies are indirectly incorporated in the events
whenever they trigger a significant event that headlines the news, e.g.,
gazetting a price ceiling for rice causing a sudden rice shortage. Although
the other 2 factors were ignored by assumption for the experiment, it was
established, from available resources such as the history of stock market
policy changes and the seasonal performance of industries/organizations, that
they have a significant impact on decisions even at the higher levels of the
stock market.
Additionally, the models could be improved primarily by
using a larger dataset spanning several years. This would also overcome the
granularity issues of the keywords and the class attribute, so that prediction
could be performed for a numeric value, provided that the other major factors
discussed above are also incorporated in the data set.
Observing the entire collection of stock data, several
seasonal variations were evident, along with an increase in trading capacity
at the stock market over time. Therefore, it is recommended that any further
study incorporate a few attributes drawn from time series analysis so as to
adapt to these variations over time.
Another strategy observed for improving the accuracy of the
models was to incorporate, per event, an attribute that factors in the impact
of similar past events. Given a larger data set, the rationale of this
approach is to identify and incorporate the differing impacts of events in the
same category, beyond the type alone. The typical process would first cluster
the data and then statistically calculate the probability of membership in a
particular cluster given a type, which could be used as an indicator of the
level of impact that a particular event, in a given scenario, has on the
outcome.
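A rough sketch of that process (the cluster count and feature choice are assumptions):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("events_2017.csv")    # hypothetical, as before
X = df[["event1_sentiment", "event1_significance"]]  # illustrative features

# Cluster the rows (k = 4 is an assumption), then estimate
# P(cluster | event type) as an indicator of impact level.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
impact = pd.crosstab(df["event1_type"], labels, normalize="index")
print(impact)  # each row sums to 1: cluster membership given the type
```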
In conclusion, although the experiment did not provide
conclusive evidence of a strong relationship between the events and stock
trading volume, it offered sufficient evidence that events can have an impact
on decision making related to stock trading. Hence, given that several
applications exist in this domain, it is also clear that further research
could be conducted to improve such systems, taking into account the vast range
of factors impacting the market.