Tuesday, February 11, 2025

Error estimation for Linear Bayesian Regression with PyMC

Following up on the previous posts, here is the estimate of the errors on the three parameters of the linear model, obtained with NUTS (No-U-Turn Sampler), an advanced Markov Chain Monte Carlo (MCMC) algorithm used to sample from the posterior distribution in Bayesian inference.

y = alpha + beta*x + epsilon, with epsilon ~ Normal(0, sigma)

alpha : error distribution for the intercept parameter

beta : error distribution for the slope parameter

sigma : distribution of the standard deviation of the error term
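
In equation form, the model sampled by NUTS is the following (the priors shown are the ones used in the PyMC code further down):

y_i ~ Normal(alpha + beta*x_i, sigma)
alpha ~ Normal(0, 10)
beta ~ Normal(0, 10)
sigma ~ HalfNormal(1)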




       


        mean    sd  hdi_3%  hdi_97%  ...  mcse_sd  ess_bulk  ess_tail  r_hat
alpha   1.63  0.04    1.55     1.71  ...      0.0   3026.81   3754.21    1.0
beta   -0.01  0.00   -0.01    -0.01  ...      0.0   3068.24   3519.73    1.0
sigma   0.09  0.02    0.06     0.12  ...      0.0   3733.28   3659.21    1.0


date;value
2002-08-02 15:10:10; 1.7195121951219514
2002-08-12 06:55:13; 1.4329268292682926
2002-08-21 17:38:00; 1.451219512195122
2002-09-04 09:06:26; 1.274390243902439
2002-09-13 04:42:21; 1.1402439024390243
2002-09-20 02:57:43; 1.207317073170732
2002-10-01 00:56:28; 1.0121951219512195
2002-10-02 07:10:10; 0.8841463414634148
2002-10-08 09:16:24; 0.8902439024390243
2002-10-15 07:31:46; 0.7073170731707319
2002-10-21 04:35:43; 0.5548780487804876
2002-10-30 00:11:38; 0.5548780487804876
2002-11-06 23:38:24; 0.6219512195121952
2002-11-12 05:35:30; 0.5548780487804876
2002-11-21 06:13:42; 0.40853658536585336
2002-11-27 03:17:39; 0.26829268292682906
2002-12-02 04:12:27; 0.21341463414634143
2002-12-14 03:22:38; 0.13414634146341475
2002-12-20 05:28:51; 0.12195121951219523


import pandas as pd
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
import arviz as az

# Step 1: Read the CSV file containing the time series data
# The CSV is assumed to have columns 'date' and 'value'
data = pd.read_csv('dataset.csv', sep=';')

# Convert the date column to datetime format (if it's not already)
data['date'] = pd.to_datetime(data['date'])

# Convert the date to a numeric format (e.g., number of days since the first date)
data['date_numeric'] = (data['date'] - data['date'].min()).dt.days

# Step 2: Prepare the data for regression
X = data['date_numeric'].values # Independent variable (time)
y = data['value'].values # Dependent variable (value)

# Step 3: Set up the Bayesian Linear Regression model with PyMC
with pm.Model() as model:
    # Define the prior distributions for the coefficients (alpha, beta)
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)

    # Define the likelihood function (linear model with Gaussian noise)
    sigma = pm.HalfNormal('sigma', sigma=1)
    mu = alpha + beta * X  # Linear relationship

    # Observed data likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # Step 4: Inference with MCMC sampling (NUTS sampler by default)
    trace = pm.sample(2000, return_inferencedata=True)

    # Posterior predictive sampling within the same model context (not used further below)
    posterior_predictive = pm.sample_posterior_predictive(trace, var_names=["alpha", "beta", "sigma"])

# Step 5: Posterior diagnostics and results
az.plot_trace(trace)
plt.show()

# Step 6: Summary of the posterior distribution
print(az.summary(trace, round_to=2))

# Step 7: Extract the posterior predictions
# Access posterior samples
alpha_samples = trace.posterior["alpha"].values
beta_samples = trace.posterior["beta"].values

# Ensure proper shape for broadcasting (flatten the samples)
alpha_samples = alpha_samples.flatten() # Shape: (num_samples,)
beta_samples = beta_samples.flatten() # Shape: (num_samples,)

# Make predictions using the posterior samples
y_pred_samples = alpha_samples[:, None] + beta_samples[:, None] * X[None, :]

# Calculate the mean prediction
y_pred_mean = np.mean(y_pred_samples, axis=0)

# Plot the data and the fitted regression line
plt.plot(data['date'], y, 'o', label='Data')
plt.plot(data['date'], y_pred_mean, label='Bayesian Linear Regression', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Step 1: Calculate the x-axis intercept (numeric form) from posterior samples
intercept_x_numeric_samples = -alpha_samples / beta_samples

# Step 2: Calculate the mean intercept (mean of the samples)
intercept_x_numeric_mean = np.mean(intercept_x_numeric_samples)

# Step 3: Convert the numeric intercept back to a date
intercept_date = data['date'].min() + pd.to_timedelta(intercept_x_numeric_mean, unit='D')

print(f"Date when the regression line crosses the x-axis: {intercept_date}")

# Step 1: Calculate the variance of the intercept (numerically)
alpha_var = np.var(alpha_samples)
beta_var = np.var(beta_samples)
beta_mean = np.mean(beta_samples)
alpha_mean = np.mean(alpha_samples)

# Step 2: Estimate the standard deviation of the intercept (using the delta method)
intercept_sd = np.sqrt((1 / beta_mean) ** 2 * alpha_var + (alpha_mean / beta_mean**2) ** 2 * beta_var)

print(f"Standard deviation of the intercept on the x-axis: {intercept_sd} days")

And, for comparison, the same fit with ordinary least squares linear regression:

Intercept: 1.6299999167194188
Slope: -0.01137953913185518
Intercept with x-axis (datetime): 2002-12-23 20:55:05.998311688
Standard error of intercept on x-axis: 6.648111410435301 days


import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the dataset
df = pd.read_csv('dataset.csv', sep=';')

# Convert 'date' to datetime if it's not already in datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Convert 'date' to the number of days since the earliest date
df['date_numeric'] = (df['date'] - df['date'].min()).dt.days

# Extract the numeric 'date' and 'value' for regression analysis
X = df['date_numeric'].values
Y = df['value'].values

# Add a constant (for intercept) to the X values for the regression
X = sm.add_constant(X) # Adds the intercept term

# Fit the linear regression model using Ordinary Least Squares (OLS)
model = sm.OLS(Y, X)
results = model.fit()

# Get the intercept and slope
intercept = results.params[0]
slope = results.params[1]

# Standard error of the intercept
stderr_intercept = results.bse[0]

# Calculate the intercept of the regression line with the x-axis (y = 0), which is x = -intercept / slope
intercept_x_axis = -intercept / slope

# Standard deviation of the residuals (errors)
std_dev = np.std(results.resid)

# Standard error for the intercept at the point of intercept on the x-axis
stderr_intercept_x = std_dev / abs(slope)

# Print results
print(f'Intercept: {intercept}')
print(f'Slope: {slope}')
intercept_x_axis_datetime = df['date'].min() + pd.to_timedelta(intercept_x_axis, unit='D')
print(f'Intercept with x-axis (datetime): {intercept_x_axis_datetime}')
print(f'Standard error of intercept on x-axis: {stderr_intercept_x} days')
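
The script above approximates the error of the x-axis crossing as the residual standard deviation divided by |slope|; an alternative sketch, propagating the OLS parameter covariance through x0 = -intercept/slope with the delta method (it reuses results, intercept and slope from the script above):

import numpy as np

# Delta method on x0 = -intercept/slope using the covariance of the OLS estimates
cov = np.asarray(results.cov_params())                   # 2x2 covariance of (intercept, slope)
grad = np.array([-1.0 / slope, intercept / slope**2])    # gradient of x0 w.r.t. (intercept, slope)
stderr_x0 = np.sqrt(grad @ cov @ grad)

print(f"Delta-method standard error of the x-axis intercept: {stderr_x0} days")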







Physics informed neural network Fukuzono

Since pure ML does not work for the time series I am dealing with, and regressions based on analytical formulas show their limits...