
Introducing Data Analysis in the Stock Market
Ever feel like traditional financial analysis doesn’t give you the whole picture? The Ibovespa, a major stock market index in Brazil, is a treasure trove of data waiting to be unlocked through time series analysis. This approach leverages historical data to gain insights into future trends. By analyzing data points at specific intervals (time frames, ‘TF’), we can tailor our analysis to different trading styles, from long-term fundamental investing to short-term day trading. Here I will show you a small analysis of the financial market, using a slightly different, data-driven approach.
Financial market data can be analyzed within a defined time interval called a _time frame_ (TF): a fixed block of time that separates the state of an asset before and after it. Since price movement is strongly influenced by what has already happened, we can frame stock market data as a time series dataset, spaced in blocks of time. These blocks can be monthly, for those who make long-term fundamental decisions in the market; daily; or even shorter, such as hourly or 5-minute intervals, widely used by day traders and scalpers. Those who master the power of programming can therefore follow another philosophy of analysis, quantitative trading, which boils down to applying data science to time series data; in this case, the IBOV.
This blog post delves into some basic data science techniques you can use to analyze the Ibovespa. It’s important to remember that the financial market is inherently unpredictable. While these techniques can’t guarantee perfect results, they can equip you with valuable information to make informed decisions. Before presenting them, an important point must be raised: inherent anomalies in the financial market are impossible to measure, which makes any strategy susceptible to errors and distortions. In other words, I will present some techniques to gain insights about the asset, but this is an introductory analysis. All of it is available for free in my github repository, if any of you are interested.
To proceed with the data analysis, we need to define, according to data science principles, how to model stock market data. Since each data point within a fixed time frame depends on the ones that came before it, meaning its future behavior is tied to past behavior, we can define price behavior as:
Forecast = Trend Component + Seasonal Component + Noise
Where:
- Trend Component represents the trend observed in the time series data, indicating the overall direction in which the data is moving.
- Seasonal Component represents the repetitive pattern or seasonality observed at regular intervals within the data.
- Noise represents the random fluctuations or irregularities present in the data that cannot be attributed to the trend or seasonality.
The noise is like unexpected news that shakes countries or specific companies, distorting the nominal price behavior. Such events occur periodically and are impossible to predict with certainty, so for analysis purposes we will disregard the noise. What remains is to identify whether an asset is trending or stationary, over what period (according to the TF) the stationary behavior repeats, and, at the end, a way to make predictions. I will apply these techniques to the chart below, which shows daily IBOV data starting in the initial period of the pandemic in Brazil, from 03/26/2020 to 02/22/2024. Keep in mind that a fair amount of code is required to fully understand how the techniques work; you can visit my github in this link here.
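To make this decomposition concrete, here is a minimal sketch using `seasonal_decompose` from statsmodels, assuming the daily closes are already loaded into a pandas Series called `close` (the series name and the 21-day period are illustrative assumptions, not values from this post):

from statsmodels.tsa.seasonal import seasonal_decompose

# 'close' is assumed to be a pandas Series of daily closing prices indexed by date;
# period=21 (roughly one trading month) is only an example choice.
decomposition = seasonal_decompose(close, model='additive', period=21)

trend = decomposition.trend        # Trend Component
seasonal = decomposition.seasonal  # Seasonal Component
noise = decomposition.resid        # Noise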

Techniques for Trends and Seasonality
The first analysis is to determine which part of the data on the graph is trending and which part has entered a stationary moment. To extract trend information from the dataset I use the Cox-Stuart test; to measure seasonality, the Kruskal-Wallis test.
The Cox-Stuart test is used for **detecting monotonic trends** (either upward or downward) in time series data. It’s a non-parametric statistical test, meaning it doesn’t rely on assuming a specific distribution for the data.
Here’s a breakdown of its functionality:
1. Pairs the Halves: The test splits the series into two halves and looks at the difference between each point in the first half and the corresponding point in the second half.
2. Counts Upward Changes: It counts how many of these pairs increased (for upward trend detection) or decreased (for downward trend detection).
3. Statistical Significance: This count of positive/negative differences is compared against the total number of non-zero differences under a binomial distribution, which yields a p-value.
The Cox-Stuart test isn’t shipped with the usual Python statistics modules, but it is straightforward to implement with SciPy’s binomial distribution, so here is a simple walkthrough for anyone curious about how the magic happens. The function receives 2 parameters: x, the data to be analyzed, and the other argument, the trend direction. The return value is a p-value: if it’s under 0.05, the analyzed data is in a trend (0.05 is the significance level commonly used as the maximum acceptable error for the whole experiment). If you run this on the entire dataset, you will see that the price is in a trend.
import numpy as np
from scipy import stats

def simpleCS(x, trend_type="l"):
    """
    TREND analysis with p-value: Cox-Stuart.
    trend_type = "l" --> decreasing trend
    trend_type = "r" --> increasing trend
    """
    x = np.asarray(x)
    # If the series has an odd length, drop the middle observation
    if len(x) % 2 == 1:
        x = np.delete(x, len(x) // 2)
    half = len(x) // 2
    x1 = x[:half]   # first half of the series
    x2 = x[half:]   # second half of the series
    n = np.sum((x2 - x1) != 0)  # number of non-zero paired differences
    t = np.sum(x1 < x2)         # pairs where the second half is higher
    if trend_type == "l":
        pvalue = stats.binom.cdf(t, n, 0.5)
    else:
        pvalue = 1 - stats.binom.cdf(t - 1, n, 0.5)
    return pvalue
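A minimal usage sketch (the `close` array of closing prices is an assumption for illustration, not part of the original code):

# 'close' is assumed to be a 1-D numpy array of daily closing prices
p_up = simpleCS(close, trend_type="r")    # p-value for an increasing trend
p_down = simpleCS(close, trend_type="l")  # p-value for a decreasing trend
print(p_up, p_down)  # a value under 0.05 points to that trend being present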
For the seasonality test, as mentioned previously, I usually employ the Kruskal-Wallis test. This is a non-parametric statistical test used to assess whether multiple groups (seasons in this case) have statistically significant differences in their medians. It serves as an alternative to the one-way analysis of variance (ANOVA) when the data may not follow a normal distribution. The test compares the medians of several independent groups by analyzing the ranks of the data points instead of the raw values. Here’s a breakdown of an example implementation:
import numpy as np
import pandas as pd
from scipy.stats import chi2

def simpleKW(y, freq=12):
    """
    SEASONALITY analysis with p-value: Kruskal-Wallis.
    """
    # Rank the observations; tied values share their average rank
    Rank = np.array(pd.Series(y).rank(method='average', na_option='keep'))
    # Pad with NaNs so the length becomes a multiple of the frequency
    extra = (-len(Rank)) % freq
    dat = np.concatenate((np.repeat(np.nan, extra), Rank))
    # Each row holds one full cycle; each column collects one season
    yMAT = dat.reshape((int(len(dat) / freq), freq))
    # Observations and squared rank sums per season (per column)
    Nobs = np.apply_along_axis(lambda x: np.count_nonzero(~np.isnan(x)), 0, yMAT)
    R2n = np.power(np.apply_along_axis(np.nansum, 0, yMAT), 2) / Nobs
    H = 12 / (sum(Nobs) * (sum(Nobs) + 1)) * sum(R2n) - 3 * (sum(Nobs) + 1)
    # Correction for tied ranks
    if sum(np.unique(Rank, return_counts=True)[1] > 1) > 0:
        valor = np.unique(Rank, return_counts=True)[1]
        valor = valor[valor > 1]
        sumT = sum(np.power(valor, 3) - valor)
        Correction = 1 - sumT / (np.power(len(y), 3) - len(y))
        H = H / Correction
    return 1 - chi2.cdf(H, freq - 1)
The code begins by computing the rank of each data point in the variable `y`, using `pd.Series(y).rank(method='average', na_option='keep')`. The 'average' method ensures that tied values are assigned the same rank, and `na_option='keep'` keeps any missing values (NaNs) in place. Next, the code pads the rank array with NaNs (`extra` of them, via `np.repeat`) so that its length becomes a multiple of the frequency (`freq`), which denotes the number of seasons; this guarantees that the data can be reshaped into a matrix of complete cycles. The padded ranks are concatenated with `np.concatenate` into the array `dat`, which is then reshaped into a matrix (`yMAT`) via `dat.reshape((int(len(dat)/freq), freq))`: each row holds one full cycle of `freq` consecutive observations, so each column collects the data points belonging to the same season, and the per-season statistics are then computed column by column.
The value returned for Kruskal-Wallis is interpreted the same way as for Cox-Stuart: a p-value which, if under 0.05, indicates that the analyzed data exhibits seasonality. If you run it on the data of Figure 1, the Kruskal-Wallis response will be over 0.05, indicating no significant seasonality, consistent with the trend detected earlier. But if we slice the data between 05/11/2021 and 15/06/2023 (fig. 2), Kruskal-Wallis shows a different value.
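A quick usage sketch, assuming the daily closes live in a pandas Series `close` indexed by date; the exact date interpretation and the weekly `freq=5` are my assumptions for illustration:

# 'close' is assumed to be a pandas Series of daily closes with a DatetimeIndex
start, end = "2021-05-11", "2023-06-15"   # the fig. 2 window (date format assumed)
window = close.loc[start:end]

p_seasonal = simpleKW(window.values, freq=5)  # 5 trading days ~ one week
print(p_seasonal)  # a value under 0.05 suggests a seasonal pattern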

Trying Predictions
As I mentioned earlier, it is difficult to make a data prediction where the presence of noise is unpredictable and of unknown intensity. To then mitigate the error and be able to train a model for trend prediction, I separated the dataset from figure 1 and extracted data for a 30-minute TF. The result is shown in figure 3.
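For reference, this is roughly how a 30-minute series can be built with pandas, assuming the intraday quotes sit in a DataFrame `intraday` with a DatetimeIndex and a 'close' column (both names are assumptions for illustration):

# 'intraday' is assumed to be a DataFrame of intraday quotes with a DatetimeIndex
close_30m = intraday['close'].resample('30min').last().dropna()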

The model used to predict the values here is Exponential Smoothing, from the `statsmodels` module.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
Exponential smoothing is a popular technique for time series forecasting. It assigns exponentially decreasing weights to past observations as you move backward in time. This gives more weight to recent observations, under the assumption that they are more relevant for predicting the future. Here are some key points:
- The `ExponentialSmoothing` class offers parameters to configure the specific type of exponential smoothing (e.g., simple exponential smoothing, Holt’s linear trend model, or Holt-Winters seasonal smoothing).
- This import is the entry point for building a forecasting model on a time series using exponential smoothing.
To generate the predictions, I fit the Exponential Smoothing model on the first 90% of the data (the training set) and use the remaining 10% (the test set) to compare the forecasts against the actual values shown in the graph in figure 3. In figure 4 below, the training data is in blue, the test data is in red, and the prediction result is in yellow.
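As a rough sketch of this setup, assuming the 30-minute closes are in a pandas Series `close_30m` (the series name and the additive-trend configuration are my assumptions, not necessarily the exact parameters behind figure 4):

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# 90% / 10% chronological split
split = int(len(close_30m) * 0.9)
train, test = close_30m[:split], close_30m[split:]

# Holt's linear trend model: additive trend, no seasonal term
model = ExponentialSmoothing(train, trend='add', seasonal=None)
fit = model.fit()

# Forecast as many steps ahead as there are test points
forecast = fit.forecast(len(test))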

It is clear how the forecast is affected by the anomalous behavior of the price: the prediction lags the actual series and has a lower amplitude. The conclusion is that other factors affecting the price need to be considered in the prediction, not just the price of the asset itself. Metrics such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) are ways to measure how far off the obtained results are.
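Continuing the sketch above, these errors can be computed directly with numpy from the assumed `test` and `forecast` series:

import numpy as np

errors = np.asarray(test) - np.asarray(forecast)
mse = np.mean(errors ** 2)     # Mean Squared Error
mae = np.mean(np.abs(errors))  # Mean Absolute Error
print(f"MSE: {mse:.2f}  MAE: {mae:.2f}")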
So that’s it, folks: this is an introduction to data analysis for the financial market. I intend to bring more analyses in the future, but in this blog I covered some points such as prediction, trend analysis and seasonality, and an introduction to time series. For more advanced analyses, you can check my GitHub.