Tuesday, February 11, 2025

Error estimation for Linear Bayesian Regression with PyMC

Following up on the previous posts, here is the estimate of the errors on the three parameters of the linear model, obtained with NUTS (No-U-Turn Sampler), an advanced Markov Chain Monte Carlo (MCMC) algorithm used to sample from the posterior distribution in Bayesian inference.

y = alpha + beta*x + epsilon, with epsilon ~ Normal(0, sigma)

alpha : error distribution for the intercept parameter

beta : error distribution for the slope parameter

sigma : distribution of the standard deviation of the error term
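
In equation form, the model sampled by NUTS is the following (the priors shown are the ones used in the PyMC code further down):

y_i ~ Normal(alpha + beta*x_i, sigma)
alpha ~ Normal(0, 10)
beta ~ Normal(0, 10)
sigma ~ HalfNormal(1)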




       


        mean    sd  hdi_3%  hdi_97%  ...  mcse_sd  ess_bulk  ess_tail  r_hat
alpha   1.63  0.04    1.55     1.71  ...      0.0   3026.81   3754.21    1.0
beta   -0.01  0.00   -0.01    -0.01  ...      0.0   3068.24   3519.73    1.0
sigma   0.09  0.02    0.06     0.12  ...      0.0   3733.28   3659.21    1.0


date;value
2002-08-02 15:10:10; 1.7195121951219514
2002-08-12 06:55:13; 1.4329268292682926
2002-08-21 17:38:00; 1.451219512195122
2002-09-04 09:06:26; 1.274390243902439
2002-09-13 04:42:21; 1.1402439024390243
2002-09-20 02:57:43; 1.207317073170732
2002-10-01 00:56:28; 1.0121951219512195
2002-10-02 07:10:10; 0.8841463414634148
2002-10-08 09:16:24; 0.8902439024390243
2002-10-15 07:31:46; 0.7073170731707319
2002-10-21 04:35:43; 0.5548780487804876
2002-10-30 00:11:38; 0.5548780487804876
2002-11-06 23:38:24; 0.6219512195121952
2002-11-12 05:35:30; 0.5548780487804876
2002-11-21 06:13:42; 0.40853658536585336
2002-11-27 03:17:39; 0.26829268292682906
2002-12-02 04:12:27; 0.21341463414634143
2002-12-14 03:22:38; 0.13414634146341475
2002-12-20 05:28:51; 0.12195121951219523


import pandas as pd
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# Step 1: Read the CSV file containing the time series data
# The CSV is assumed to have columns 'date' and 'value'
data = pd.read_csv('dataset.csv', sep=';')

# Convert the date column to datetime format (if it's not already)
data['date'] = pd.to_datetime(data['date'])

# Convert the date to a numeric format (e.g., number of days since the first date)
data['date_numeric'] = (data['date'] - data['date'].min()).dt.days

# Step 2: Prepare the data for regression
X = data['date_numeric'].values # Independent variable (time)
y = data['value'].values # Dependent variable (value)

# Step 3: Set up the Bayesian Linear Regression model with PyMC
with pm.Model() as model:
    # Define the prior distributions for the coefficients (alpha, beta)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)

    # Define the likelihood function (linear model with Gaussian noise)
    sigma = pm.HalfNormal('sigma', sigma=1)
    mu = alpha + beta * X  # Linear relationship

    # Observed data likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Step 4: Inference with MCMC sampling (NUTS sampler by default)
    trace = pm.sample(2000, return_inferencedata=True)

    # Posterior predictive sampling within the same model context (not used further below)
    posterior_predictive = pm.sample_posterior_predictive(trace, var_names=["alpha", "beta", "sigma"])

# Step 5: Posterior diagnostics and results
az.plot_trace(trace)
plt.show()

# Step 6: Summary of the posterior distribution
print(az.summary(trace, round_to=2))

# Step 7: Extract the posterior predictions
# Access posterior samples
alpha_samples = trace.posterior["alpha"].values
beta_samples = trace.posterior["beta"].values

# Ensure proper shape for broadcasting (flatten the samples)
alpha_samples = alpha_samples.flatten() # Shape: (num_samples,)
beta_samples = beta_samples.flatten() # Shape: (num_samples,)

# Make predictions using the posterior samples
y_pred_samples = alpha_samples[:, None] + beta_samples[:, None] * X[None, :]

# Calculate the mean prediction
y_pred_mean = np.mean(y_pred_samples, axis=0)

# Plot the data and the fitted regression line
plt.plot(data['date'], y, 'o', label='Data')
plt.plot(data['date'], y_pred_mean, label='Bayesian Linear Regression', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Step 1: Calculate the x-axis intercept (numeric form) from posterior samples
intercept_x_numeric_samples = -alpha_samples / beta_samples

# Step 2: Calculate the mean intercept (mean of the samples)
intercept_x_numeric_mean = np.mean(intercept_x_numeric_samples)

# Step 3: Convert the numeric intercept back to a date
intercept_date = data['date'].min() + pd.to_timedelta(intercept_x_numeric_mean, unit='D')

print(f"Date when the regression line crosses the x-axis: {intercept_date}")

# Step 1: Calculate the variance of the intercept (numerically)
alpha_var = np.var(alpha_samples)
beta_var = np.var(beta_samples)
beta_mean = np.mean(beta_samples)
alpha_mean = np.mean(alpha_samples)

# Step 2: Estimate the standard deviation of the intercept (using the delta method)
intercept_sd = np.sqrt((1 / beta_mean) ** 2 * alpha_var + (alpha_mean / beta_mean**2) ** 2 * beta_var)

print(f"Standard deviation of the intercept on the x-axis: {intercept_sd} days")

And, for comparison, the same fit with ordinary least squares linear regression:

Intercept: 1.6299999167194188
Slope: -0.01137953913185518
Intercept with x-axis (datetime): 2002-12-23 20:55:05.998311688
Standard error of intercept on x-axis: 6.648111410435301 days


import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the dataset
df = pd.read_csv('dataset.csv', sep=';')

# Convert 'date' to datetime if it's not already in datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Convert 'date' to the number of days since the earliest date
df['date_numeric'] = (df['date'] - df['date'].min()).dt.days

# Extract the numeric 'date' and 'value' for regression analysis
X = df['date_numeric'].values
Y = df['value'].values

# Add a constant (for intercept) to the X values for the regression
X = sm.add_constant(X) # Adds the intercept term

# Fit the linear regression model using Ordinary Least Squares (OLS)
model = sm.OLS(Y, X)
results = model.fit()

# Get the intercept and slope
intercept = results.params[0]
slope = results.params[1]

# Standard error of the intercept
stderr_intercept = results.bse[0]

# Calculate the intercept of the regression line with the x-axis (y = 0), which is x = -intercept / slope
intercept_x_axis = -intercept / slope

# Standard deviation of the residuals (errors)
std_dev = np.std(results.resid)

# Standard error for the intercept at the point of intercept on the x-axis
stderr_intercept_x = std_dev / abs(slope)

# Print results
print(f'Intercept: {intercept}')
print(f'Slope: {slope}')
intercept_x_axis_datetime = df['date'].min() + pd.to_timedelta(intercept_x_axis, unit='D')
print(f'Intercept with x-axis (datetime): {intercept_x_axis_datetime}')
print(f'Standard error of intercept on x-axis: {stderr_intercept_x} days')
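
The script above approximates the error of the x-axis crossing as the residual standard deviation divided by |slope|; an alternative sketch, propagating the OLS parameter covariance through x0 = -intercept/slope with the delta method (it reuses results, intercept and slope from the script above):

import numpy as np

# Delta method on x0 = -intercept/slope using the covariance of the OLS estimates
cov = np.asarray(results.cov_params())                   # 2x2 covariance of (intercept, slope)
grad = np.array([-1.0 / slope, intercept / slope**2])    # gradient of x0 w.r.t. (intercept, slope)
stderr_x0 = np.sqrt(grad @ cov @ grad)

print(f"Delta-method standard error of the x-axis intercept: {stderr_x0} days")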







Physics informed neural network Fukuzono

Since pure ML does not work for the time series I am dealing with, and regressions based on analytical formulas show their limits...