Profiler API
One-call financial data profiling that answers: "Is this data ready for ML?"
The profiler analyzes column statistics, data quality issues specific to financial time series (gaps, splits, outliers), and return distribution properties.
fs.profiler.profile(df, column="close")
Generate a comprehensive profile report for a financial DataFrame.

    import finasys as fs

    df = fs.load("AAPL", start="2024-01-01")
    report = fs.profiler.profile(df)

    print(report.shape)       # (252, 7)
    print(report.date_range)  # ('2024-01-02', '2024-12-31')
    print(report.symbols)     # ['AAPL']

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | `pl.DataFrame` | required | DataFrame with financial data |
| `column` | `str` | `"close"` | Primary price column for distribution analysis |
Returns: ProfileReport dataclass (see below).
fs.profiler.profile_summary(df, column="close")
Generate a plain-text summary designed for LLM consumption. The string can be inserted directly into an agent's system prompt.
Example output:

    DATA PROFILE | 252 rows x 7 columns
    Date range: 2024-01-02 to 2024-12-31
    Symbols: AAPL
    Quality issues: 9 missing dates; 11 price outliers
    Returns distribution: skew=0.501, kurtosis=3.647, non-normal (JB p=0.0000)
    Tail ratio: 0.987
    close: mean=205.65, std=25.58, range=[163.51, 257.61], nulls=0 (0.0%)

Returns: str
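Because the summary is plain text, embedding it in a prompt is ordinary string composition. The sketch below assumes a hypothetical helper, `build_system_prompt`, which is not part of finasys, and uses an abbreviated copy of the example output above as a stand-in for a real `profile_summary` call.

```python
# Hypothetical helper for illustration; not part of finasys.
def build_system_prompt(summary: str, task: str) -> str:
    """Embed a data-profile summary in an agent system prompt."""
    return (
        "You are a quantitative research assistant.\n"
        f"Task: {task}\n\n"
        "Profile of the dataset you will work with:\n"
        f"{summary.strip()}\n\n"
        "Account for any quality issues listed above before recommending a model."
    )


# Stand-in for fs.profiler.profile_summary(df), abbreviated from the example output.
summary = """
DATA PROFILE | 252 rows x 7 columns
Date range: 2024-01-02 to 2024-12-31
Symbols: AAPL
Quality issues: 9 missing dates; 11 price outliers
"""

prompt = build_system_prompt(summary, task="Suggest features for a volatility model.")
print(prompt)
```

Keeping the profile in its own clearly labeled section of the prompt makes it easier for the agent to tell dataset facts apart from instructions.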
Report Dataclasses
ProfileReport
The top-level report containing all analysis results.
| Field | Type | Description |
|---|---|---|
| `shape` | `tuple[int, int]` | (rows, columns) |
| `date_range` | `tuple[str, str]` | (start_date, end_date) |
| `symbols` | `list[str]` | Symbols found in the data |
| `column_stats` | `dict[str, ColumnProfile]` | Per-column statistics |
| `quality` | `DataQualityReport` | Data quality assessment |
| `distribution` | `DistributionReport` | Return distribution analysis |

    import json

    report = fs.profiler.profile(df)

    # Serialize to dict (JSON-compatible)
    data = report.to_dict()
    json.dumps(data, default=str)  # works

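One way to turn the report into a yes/no answer to "Is this data ready for ML?" is a small gate over the documented fields. The sketch below is an assumption-heavy illustration: `readiness_issues` and its thresholds are hypothetical, and a `SimpleNamespace` stand-in replaces a real `ProfileReport` so the code runs without finasys.

```python
from types import SimpleNamespace


# Hypothetical helper and thresholds; tune them for your own pipeline.
def readiness_issues(report, max_null_pct: float = 1.0, max_outliers: int = 20) -> list[str]:
    """Return reasons the data may not be ML-ready (an empty list means no flags)."""
    issues = []
    for name, col in report.column_stats.items():
        if col.null_pct > max_null_pct:
            issues.append(f"{name}: {col.null_pct:.1f}% nulls")
    if report.quality.duplicate_rows:
        issues.append(f"{report.quality.duplicate_rows} duplicate rows")
    if report.quality.suspected_splits:
        issues.append(f"suspected splits on {len(report.quality.suspected_splits)} date(s)")
    for name, count in report.quality.price_outliers.items():
        if count > max_outliers:
            issues.append(f"{name}: {count} price outliers")
    return issues


# Stand-in object exposing only the documented field names used above.
fake_report = SimpleNamespace(
    column_stats={
        "close": SimpleNamespace(null_pct=0.0),
        "volume": SimpleNamespace(null_pct=2.5),
    },
    quality=SimpleNamespace(
        duplicate_rows=0,
        suspected_splits=["2024-06-10"],
        price_outliers={"close": 11},
    ),
)

print(readiness_issues(fake_report))
```

Returning a list of reasons rather than a bare boolean keeps the gate auditable: the caller can log or display exactly which checks failed.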
ColumnProfile
Statistical profile of a single column. Computed for all numeric columns automatically.
| Field | Type | Description |
|---|---|---|
| `name` | `str` | Column name |
| `dtype` | `str` | Polars data type |
| `count` | `int` | Total row count |
| `null_count` | `int` | Number of null values |
| `null_pct` | `float` | Null percentage (0-100) |
| `mean` | `float` | Mean (numeric columns only) |
| `std` | `float` | Standard deviation |
| `min` | `float` | Minimum value |
| `max` | `float` | Maximum value |
| `skewness` | `float` | Skewness |
| `kurtosis` | `float` | Excess kurtosis |
| `quantiles` | `dict[str, float]` | Quantiles at 1%, 5%, 25%, 50%, 75%, 95%, 99% |

    cs = report.column_stats["close"]
    print(f"Mean: {cs.mean:.2f}")
    print(f"Std: {cs.std:.2f}")
    print(f"Median: {cs.quantiles['0.5']:.2f}")
    print(f"Nulls: {cs.null_count} ({cs.null_pct:.1f}%)")

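To make the fields above concrete, here is a rough pure-Python sketch of how such a per-column summary could be computed. It is not the finasys implementation: `profile_column` and `quantile` are hypothetical names, the standard deviation is assumed to be the sample version (ddof=1), quantiles are assumed to use linear interpolation, and the string keys simply mirror the `cs.quantiles['0.5']` access shown above.

```python
import statistics

QUANTILE_LEVELS = (0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99)


def quantile(sorted_vals: list[float], q: float) -> float:
    """Quantile by linear interpolation between the two nearest ranks."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def profile_column(values: list) -> dict:
    """Summarize one numeric column; None is treated as a null."""
    present = sorted(v for v in values if v is not None)
    null_count = len(values) - len(present)
    return {
        "count": len(values),
        "null_count": null_count,
        "null_pct": 100.0 * null_count / len(values),
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present),  # sample std (ddof=1): an assumption
        "min": present[0],
        "max": present[-1],
        "quantiles": {str(q): quantile(present, q) for q in QUANTILE_LEVELS},
    }


stats = profile_column([10.0, 11.0, None, 12.0, 13.0, 14.0])
print(stats["null_count"], stats["mean"], stats["quantiles"]["0.5"])
```

Nulls count toward `count` and `null_pct` but are excluded from the moments and quantiles, which matches the table's distinction between total rows and null values.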
DataQualityReport
Financial-specific data quality checks. Detects issues that generic profilers miss.
| Field | Type | Description |
|---|---|---|
| `missing_dates` | `list[str]` | Business days with no data (holidays, gaps) |
| `duplicate_rows` | `int` | Number of duplicate rows |
| `zero_volume_days` | `int` | Days with zero trading volume |
| `price_outliers` | `dict[str, int]` | Per-column count of >4-sigma daily moves |
| `suspected_splits` | `list[str]` | Dates with >20% overnight price changes |

    q = report.quality

    # Check for data gaps
    if q.missing_dates:
        print(f"Warning: {len(q.missing_dates)} missing trading dates")
        print(f" First 5: {q.missing_dates[:5]}")

    # Check for stock splits (unadjusted data)
    if q.suspected_splits:
        print(f"Warning: {len(q.suspected_splits)} suspected stock splits")
        print(" Consider using adjusted close prices")

    # Outlier check
    for col, count in q.price_outliers.items():
        print(f" {col}: {count} outliers (>4 sigma)")

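The outlier and split checks can be approximated in a few lines. The following is a sketch under stated assumptions, not the library's code: all function names are hypothetical, moves are measured close-to-close rather than truly overnight, and "sigma" is taken to be the sample standard deviation of simple returns.

```python
import statistics
from datetime import date, timedelta


def daily_returns(prices: list[float]) -> list[float]:
    """Simple returns between consecutive prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def count_sigma_outliers(returns: list[float], n_sigma: float = 4.0) -> int:
    """Count returns more than n_sigma sample standard deviations from the mean."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)
    return sum(1 for r in returns if abs(r - mean) > n_sigma * std)


def suspected_split_dates(dates: list[str], prices: list[float], threshold: float = 0.20) -> list[str]:
    """Dates whose price moved by more than `threshold` versus the previous row."""
    rets = daily_returns(prices)
    return [d for d, r in zip(dates[1:], rets) if abs(r) > threshold]


# Synthetic series: +/-1% moves with one unadjusted 2-for-1 split (price halves).
prices = [100.0]
for i in range(36):
    factor = 0.5 if i == 30 else (1.01 if i % 2 == 0 else 0.99)
    prices.append(prices[-1] * factor)

# Consecutive calendar days are used only to label the rows.
dates = [(date(2024, 1, 1) + timedelta(days=i)).isoformat() for i in range(len(prices))]

print(count_sigma_outliers(daily_returns(prices)))
print(suspected_split_dates(dates, prices))
```

Note that a single large move inflates the standard deviation it is measured against, so a sigma-based rule needs a reasonably long history before an extreme day can register as an outlier at all.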
DistributionReport
Return distribution characteristics. Financial returns are famously non-normal -- this tells you how non-normal.
| Field | Type | Description |
|---|---|---|
| `returns_skewness` | `float` | Skewness of daily returns (negative = left tail heavier) |
| `returns_kurtosis` | `float` | Excess kurtosis (>0 = fat tails, typical for equities) |
| `jarque_bera_stat` | `float` | Jarque-Bera test statistic |
| `jarque_bera_pvalue` | `float` | JB p-value (<0.05 = reject normality) |
| `is_normal` | `bool` | True if p-value > 0.05 |
| `tail_ratio` | `float` | Right tail / left tail ratio (>1 = right tail larger than left tail) |

    d = report.distribution
    if not d.is_normal:
        print("Returns are non-normal (typical for financial data)")
        print(f" Kurtosis: {d.returns_kurtosis:.2f} (0 = normal, >0 = fat tails)")
        print(f" Skewness: {d.returns_skewness:.2f}")

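The exact definition behind `tail_ratio` is not spelled out here. A common convention, assumed in the sketch below, is the absolute 95th percentile of returns divided by the absolute 5th percentile; the function names and percentile levels are illustrative only.

```python
def percentile(sorted_vals: list[float], q: float) -> float:
    """Percentile (q in [0, 1]) by linear interpolation between nearest ranks."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def tail_ratio(returns: list[float], upper: float = 0.95, lower: float = 0.05) -> float:
    """Assumed definition: |upper percentile| / |lower percentile| of returns."""
    s = sorted(returns)
    return abs(percentile(s, upper)) / abs(percentile(s, lower))


symmetric = [-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03]
right_heavy = [-0.02, -0.01, 0.0, 0.01, 0.06]

print(tail_ratio(symmetric))    # close to 1: tails are balanced
print(tail_ratio(right_heavy))  # above 1: gains in the tail outweigh losses
```

Under this definition the ratio compares only the tail percentiles, so it can sit slightly below 1 even when overall skewness is positive, as in the example summary earlier on this page. The sketch does not guard against a lower percentile of exactly zero.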
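Jarque-Bera test
The JB test checks whether returns are consistent with a normal distribution by measuring how far their skewness and kurtosis depart from the values a normal distribution would give. Formula: JB = n/6 * (S^2 + K^2/4), where n is the number of observations, S is the sample skewness, and K is the excess kurtosis. Under normality, JB approximately follows a chi-squared distribution with 2 degrees of freedom, which is where the p-value comes from. For financial returns, normality is almost always rejected -- this is expected and important to know when choosing models.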
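The formula can be worked through by hand in plain Python. This is a minimal sketch, assuming biased (population) sample moments and the asymptotic chi-squared approximation with 2 degrees of freedom; `jarque_bera` is a hypothetical name here and may not match how finasys computes its values.

```python
import math


def jarque_bera(returns: list[float]) -> tuple[float, float]:
    """Return (JB statistic, asymptotic p-value) from biased sample moments."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    skew = m3 / m2 ** 1.5               # S in the formula
    excess_kurt = m4 / m2 ** 2 - 3.0    # K in the formula
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Chi-squared with 2 degrees of freedom has survival function exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Small symmetric sample: normality is not rejected.
jb_small, p_small = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])

# Mostly tiny moves plus two extreme days: fat tails, normality rejected.
fat_tailed = [0.001, -0.001] * 25 + [0.1, -0.1]
jb_fat, p_fat = jarque_bera(fat_tailed)

print(f"small sample: JB={jb_small:.4f}, p={p_small:.4f}")
print(f"fat tails:    JB={jb_fat:.1f}, reject normality: {p_fat < 0.05}")
```

The chi-squared approximation is only reliable for large samples, so a p-value from a handful of observations, like the first example, should be read as illustrative.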
Multi-Symbol Profiling
The profiler works with multi-symbol DataFrames:

    df = fs.load(["AAPL", "GOOGL", "MSFT"], start="2024-01-01")
    report = fs.profiler.profile(df)

    print(report.symbols)  # ['AAPL', 'GOOGL', 'MSFT']
    print(fs.profiler.profile_summary(df))

Column statistics are computed across the entire DataFrame. Date gap detection runs per-symbol to avoid false positives from different trading calendars.
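Per-symbol gap detection can be pictured as grouping rows by symbol and checking each symbol against its own date range. The sketch below is illustrative rather than the library's algorithm: the function names are hypothetical, and "business day" is assumed to mean Monday through Friday with no holiday calendar, which is consistent with holidays appearing in `missing_dates`.

```python
from datetime import date, timedelta


def missing_business_days(observed: list[str]) -> list[str]:
    """Weekdays (Mon-Fri) between the first and last observed date with no row."""
    seen = {date.fromisoformat(d) for d in observed}
    day, end = min(seen), max(seen)
    missing = []
    while day <= end:
        if day.weekday() < 5 and day not in seen:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing


def missing_dates_by_symbol(rows: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (symbol, date) rows by symbol, then check each symbol's own range."""
    by_symbol: dict[str, list[str]] = {}
    for symbol, d in rows:
        by_symbol.setdefault(symbol, []).append(d)
    return {s: missing_business_days(ds) for s, ds in by_symbol.items()}


rows = [
    ("AAPL", "2024-01-02"), ("AAPL", "2024-01-03"), ("AAPL", "2024-01-05"),
    ("MSFT", "2024-01-03"), ("MSFT", "2024-01-04"), ("MSFT", "2024-01-05"),
]
print(missing_dates_by_symbol(rows))
```

Because each symbol is compared only with its own first and last date, MSFT starting a day later than AAPL is not reported as a gap, while AAPL's missing Thursday is.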