finlab.ml

Machine learning module that provides feature engineering, label generation, and integrated model training.

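Use cases

  • Build feature sets for ML stock-selection strategies
  • Generate technical indicators (TA-Lib integration)
  • Design training labels (returns, risk measures)
  • Train models through the qlib framework
  • Predict future stock performance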

Quick examples

Feature engineering

from finlab import data
from finlab.ml import feature as mlf

# Combine fundamental features
features = mlf.combine({
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')

# Add technical indicators
features_ta = mlf.combine({
    'fundamental': features,
    'technical': mlf.ta(mlf.ta_names(n=10))
}, resample='W')

Label generation

from finlab.ml import label as mll

# Label: return over the next 1 week
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

Model training with qlib

import finlab.ml.qlib as q

# Split into training and test sets by date
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# Train a LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)

# Predict, then convert the predictions into positions
pred = model.predict(X_test)
position = pred.is_largest(30)  # buy the 30 highest-ranked stocks

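Detailed tutorial

See the machine learning strategy development guide to learn about:

  • The complete ML strategy development workflow
  • Feature engineering best practices
  • Label design techniques
  • Model training and optimization
  • Guarding against overfitting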


API Reference

finlab.ml.feature

Feature engineering module for combining and processing all kinds of features.

combine()

finlab.ml.feature.combine

combine(features, resample=None, sample_filter=None, **kwargs)

The combine function takes a dictionary of features as input and combines them into a single pandas DataFrame.

PARAMETER DESCRIPTION
features

A dictionary whose values are DataFrames (index: datetime, columns: instrument codes) or callables that return such DataFrames.

TYPE: Dict[str, DataFrame | Callable]

resample

Optional argument to resample the data in the features. Default is None.

TYPE: str DEFAULT: None

sample_filter

A boolean DataFrame (index: date, columns: instrument) used as a filter on the features.

TYPE: DataFrame DEFAULT: None

**kwargs

Additional keyword arguments to pass to the resampler function.

DEFAULT: {}

RETURNS DESCRIPTION

A pandas DataFrame containing all the input features combined.

Examples:

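This example uses finlab.ml.feature and finlab.data to combine several features, including RSI and the price-to-book ratio. f.combine does the merging: each dictionary key is the feature name and each value is the corresponding data. The 'rsi' feature comes from data.indicator('RSI'), which computes the relative strength index. The 'pb' feature comes from data.get('price_earning_ratio:股價淨值比'), which fetches the price-to-book ratio. The result is a single DataFrame containing these features.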

from finlab import data
import finlab.ml.feature as f
import finlab.ml.qlib as q

features = f.combine({
    # Fetch a fundamental feature (price-to-book ratio) with data.get
    'pb': data.get('price_earning_ratio:股價淨值比'),

    # Compute a technical indicator feature with data.indicator
    'rsi': data.indicator('RSI'),

    # Enumerate a large set of TA-Lib indicators with f.ta
    'talib': f.ta(f.ta_names()),

    # Generate technical features with qlib Alpha158
    # (run q.init() and q.dump() first)
    'qlib158': q.alpha('Alpha158'),
})

features.head()
datetime    instrument  rsi  pb
2020-01-01  1101          0   2
2020-01-02  1102        100   3
2020-01-03  1108        100   4
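
To make the output layout concrete, here is a minimal sketch in plain pandas of how wide tables (index: date, columns: instrument) can be stacked into one long table indexed by (datetime, instrument). It is not finlab's implementation; the helper stack_features and the sample values are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-02", "2020-01-03"])

# Two hypothetical wide feature tables: index = date, columns = instrument.
pb = pd.DataFrame([[1.5, 2.0], [1.6, 2.1]], index=dates, columns=["1101", "1102"])
rsi = pd.DataFrame([[40.0, 55.0], [45.0, 60.0]], index=dates, columns=["1101", "1102"])


def stack_features(features: dict) -> pd.DataFrame:
    """Stack each wide frame into a (datetime, instrument) Series and join them as columns."""
    long = {name: df.stack() for name, df in features.items()}
    out = pd.concat(long, axis=1)
    out.index.names = ["datetime", "instrument"]
    return out


combined = stack_features({"pb": pb, "rsi": rsi})
print(combined)
```

Each row is one (date, stock) sample and each column is one feature, which is the layout a tabular model expects.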

Usage examples

from finlab import data
from finlab.ml import feature as mlf

# Example 1: combine fundamental features
pb = data.get('price_earning_ratio:股價淨值比')
pe = data.get('price_earning_ratio:本益比')
features = mlf.combine({
    'pb': pb,
    'pe': pe,
    'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')

# Example 2: combine technical indicators
features = mlf.combine({
    'talib': mlf.ta(['talib.RSI__period14__', 'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__'])
}, resample='D')

# Example 3: mix several kinds of features
# custom_feature_df stands for your own DataFrame (index: date, columns: instrument)
features = mlf.combine({
    'fundamental': mlf.combine({'pb': pb, 'pe': pe}),
    'technical': mlf.ta(mlf.ta_names(n=5)),
    'custom': custom_feature_df
}, resample='W')

The resample parameter

  • 'D' - daily
  • 'W' - weekly (Friday)
  • 'M' - monthly (month end)
  • Features and labels must use the same resample value!
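
As a rough illustration of what weekly resampling does to daily data, the sketch below uses plain pandas rather than finlab. In pandas itself a bare 'W' means weeks ending on Sunday, so the sketch spells out 'W-FRI' to get Friday week ends; how finlab maps 'W' internally is not shown here, and the sample series is hypothetical.

```python
import pandas as pd

# Hypothetical daily feature values for one instrument over two weeks of business days.
days = pd.bdate_range("2024-01-01", "2024-01-12")  # Mon 1 Jan through Fri 12 Jan
daily = pd.Series(range(len(days)), index=days, dtype=float)

# Keep the last observation of each week, with weeks ending on Friday.
weekly = daily.resample("W-FRI").last()
print(weekly)
```

Ten daily rows collapse into two weekly rows, which is why features and labels built at different frequencies cannot be paired row by row.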

ta()

finlab.ml.feature.ta

ta(feature_names, factories=None, resample=None, start_time=None, end_time=None, adj=False, cpu=-1, **kwargs)

Calculate technical indicator values for a list of feature names.

PARAMETER DESCRIPTION
feature_names

A list of technical indicator feature names. Defaults to None.

TYPE: Optional[List[str]]

factories

A dictionary of factories to generate technical indicators. Defaults to {"talib": TalibIndicatorFactory()}.

TYPE: Optional[Dict[str, TalibIndicatorFactory]] DEFAULT: None

resample

The frequency to resample the data to. Defaults to None.

TYPE: Optional[str] DEFAULT: None

start_time

The start time of the data. Defaults to None.

TYPE: Optional[str] DEFAULT: None

end_time

The end time of the data. Defaults to None.

TYPE: Optional[str] DEFAULT: None

**kwargs

Additional keyword arguments to pass to the resampler function.

DEFAULT: {}

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: technical indicator feature names and their corresponding values.

Computing technical indicators

from finlab.ml import feature as mlf

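# Compute specific indicators (the names below are illustrative;
# take the exact name format from mlf.ta_names())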
indicators = mlf.ta([
    'talib.RSI__period14__',
    'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__',
    'talib.BBANDS__timeperiod20_nbdevup2_nbdevdn2__upperband__'
], resample='W')

# Automatically generate indicators with random parameter combinations
random_indicators = mlf.ta(mlf.ta_names(n=10), resample='W')

ta_names()

finlab.ml.feature.ta_names

ta_names(lb=1, ub=10, n=1, factory=None)

Generate a list of technical indicator feature names.

PARAMETER DESCRIPTION
lb

The lower bound of the multiplier of the default parameter for the technical indicators.

TYPE: int DEFAULT: 1

ub

The upper bound of the multiplier of the default parameter for the technical indicators.

TYPE: int DEFAULT: 10

n

The number of random samples for each technical indicator.

TYPE: int DEFAULT: 1

factory

A factory object to generate technical indicators. Defaults to TalibIndicatorFactory.

TYPE: IndicatorFactory DEFAULT: None

RETURNS DESCRIPTION
List[str]

List[str]: A list of technical indicator feature names.

Examples:

import finlab.ml.feature as f


# method 1: generate each indicator with random parameters
features = f.ta()

# method 2: generate specific indicator
feature_names = ['talib.MACD__macdhist__fastperiod__52__slowperiod__212__signalperiod__75__']
features = f.ta(feature_names, resample='W')

# method 3: generate some indicator
feature_names = f.ta_names()
features = f.ta(feature_names)

Generating a list of indicator names

from finlab.ml import feature as mlf

# Generate names for all TA-Lib indicators, with 10 random parameter sets each
indicator_names = mlf.ta_names(n=10)
print(f"{len(indicator_names)} indicators in total")

# Show a few examples
for name in indicator_names[:5]:
    print(name)
# Output (illustrative; parameters are sampled at random):
# talib.RSI__period14__
# talib.RSI__period7__
# talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__
# ...

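Choosing n

  • n=1: one randomly sampled parameter set per indicator (~100 indicators)
  • n=5: five randomly sampled parameter sets per indicator (~500 indicators)
  • n=10: more variety but longer computation (~1000 indicators)
  • Start with n=1 to check that the pipeline works, then increase it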

Notes

  • Too many indicators lead to:
    • Long computation time
    • High memory usage
    • Higher risk of overfitting
  • Use feature selection to reduce the number of indicators
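
One simple form of feature selection is to drop indicators that are nearly duplicates of ones already kept. The sketch below is a generic pandas approach, not a finlab API; the helper drop_correlated, the 0.95 threshold, and the sample columns are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Keep a column only if its absolute correlation with every kept column is <= threshold."""
    corr = features.corr().abs()
    keep = []
    for col in features.columns:
        if all(corr.loc[col, kept] <= threshold for kept in keep):
            keep.append(col)
    return features[keep]


rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "rsi_14": base,
    "rsi_15": base + rng.normal(scale=0.01, size=200),  # near-duplicate of rsi_14
    "macd": rng.normal(size=200),                       # unrelated to the others
})

selected = drop_correlated(demo)
print(list(selected.columns))
```

Randomly sampled parameter sets of the same indicator tend to be highly correlated, so a filter like this can shrink the feature set considerably before training.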

finlab.ml.label

Label generation module for designing the target variables used in machine learning.

daytrading_percentage()

finlab.ml.label.daytrading_percentage

daytrading_percentage(index, **kwargs)

Calculate the percentage change of market prices over a given period.

PARAMETER DESCRIPTION
index

A multi-level index of datetime and instrument.

TYPE: Index

resample

The resample frequency for the output data. Defaults to None.

TYPE: Optional[str]

period

The number of periods to calculate the percentage change over. Defaults to 1.

TYPE: int

trade_at_price

The price for execution. Defaults to close.

TYPE: str

**kwargs

Additional arguments to be passed to the resampler function.

DEFAULT: {}

RETURNS DESCRIPTION
Series

pd.Series: A pd.Series containing the percentage change of stock prices.

Predicting the return over the next N periods

from finlab.ml import feature as mlf, label as mll

# Build the features
features = mlf.combine({...}, resample='W')

# Label: return over the next 1 week
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

# Label: return over the next 4 weeks
label_4w = mll.daytrading_percentage(
    features.index,
    period=4,
    resample='W'
)

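Choosing the period parameter

  • period=1: short-term prediction (suits weekly or monthly rebalancing)
  • period=4: medium-term prediction (suits quarterly rebalancing)
  • Larger period: more stable predictions but slower-reacting signals
  • Recommendation: match the strategy's rebalancing frequency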
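
The role of period can be sketched in plain pandas as a forward return: the price period bars ahead divided by the current price, minus one. This is an assumption about the general idea, not finlab's implementation; forward_return and the sample prices are hypothetical.

```python
import pandas as pd


def forward_return(close: pd.DataFrame, period: int = 1) -> pd.DataFrame:
    """Return from each bar's price to the price `period` bars later."""
    return close.shift(-period) / close - 1


# Hypothetical weekly closing prices for one instrument.
weeks = pd.date_range("2024-01-05", periods=4, freq="W-FRI")
close = pd.DataFrame({"1101": [100.0, 110.0, 99.0, 108.9]}, index=weeks)

label_1w = forward_return(close, period=1)
label_2w = forward_return(close, period=2)
print(label_1w)
print(label_2w)
```

The last period rows have no future price and come out as NaN, so longer horizons leave fewer usable training samples at the end of the data.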

maximum_adverse_excursion()

finlab.ml.label.maximum_adverse_excursion

maximum_adverse_excursion(index, period=1, trade_at_price='close')

Calculate the maximum adverse excursion of market prices over a given period.

PARAMETER DESCRIPTION
index

A multi-level index of datetime and instrument.

TYPE: Index

resample

The resample frequency for the output data. Defaults to None.

TYPE: Optional[str]

period

The number of periods over which to measure the maximum adverse excursion. Defaults to 1.

TYPE: int DEFAULT: 1

trade_at_price

The price for execution. Defaults to close.

TYPE: str DEFAULT: 'close'

**kwargs

Additional arguments to be passed to the resampler function.

RETURNS DESCRIPTION
Series

pd.Series: A pd.Series containing the maximum adverse excursion of stock prices.

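Maximum adverse excursion (a risk measure)

from finlab.ml import label as mll

# Largest drop over the next N periods
mae_label = mll.maximum_adverse_excursion(
    features.index,
    period=5
)

# Can be used to train a risk-prediction model
# The more negative the value, the higher the risk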
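
For intuition, the sketch below computes one common definition of maximum adverse excursion in plain pandas: the lowest price within the next period bars, relative to the current price. finlab's exact definition may differ, so treat the helper max_adverse_excursion and the sample prices as assumptions.

```python
import pandas as pd


def max_adverse_excursion(close: pd.Series, period: int) -> pd.Series:
    """Lowest price within the next `period` bars, as a return relative to the current close."""
    # One column per look-ahead step: close[t+1], close[t+2], ..., close[t+period].
    future = pd.concat([close.shift(-k) for k in range(1, period + 1)], axis=1)
    # skipna=False leaves NaN wherever the full look-ahead window is not available.
    future_min = future.min(axis=1, skipna=False)
    return future_min / close - 1


close = pd.Series([100.0, 95.0, 105.0, 90.0, 110.0])
mae = max_adverse_excursion(close, period=2)
print(mae)
```

With these prices the first value is -0.05 (the price falls from 100 to 95 within two bars), and the last two values are NaN because the look-ahead window runs past the end of the data.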

finlab.ml.qlib

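Integration module for the qlib framework, providing several machine learning models for stock prediction.

Model classes

finlab.ml.qlib provides the following models: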

finlab.ml.qlib.LGBModel

LGBModel()

LGBModel is a wrapper model for LightGBM model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.LGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.XGBModel

XGBModel()

XGBModel is a wrapper model for XGBoost model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.XGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.CatBoostModel

CatBoostModel()

CatBoostModel is a wrapper model for CatBoost model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.CatBoostModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.LinearModel

LinearModel()

LinearModel is a wrapper model for a linear model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.LinearModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Basic usage example

from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q

# 1. Prepare the features
features = mlf.combine({
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比')
}, resample='W')

# 2. Prepare the labels
label = mll.return_percentage(features.index, resample='W', period=1)

# 3. Split into training and test sets by date
is_train = features.index.get_level_values('datetime') < '2020-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# 4. Build and train the model
model = q.XGBModel()  # alternatives: q.LGBModel(), q.CatBoostModel(), q.LinearModel()
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

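Supported model types

Model            Description        Pros                                                   Cons
LGBModel()       LightGBM           Fast, strong performance, low memory usage             Requires lightgbm to be installed
XGBModel()       XGBoost            Stable, highly interpretable                           Slower to train
CatBoostModel()  CatBoost           Handles categorical features well, low overfitting risk  High memory usage
LinearModel()    Linear regression  Simple, fast, less prone to overfitting                Usually weaker performance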

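Model selection advice

  • Recommended for beginners: LGBModel() (balances speed and performance)
  • For stability: XGBModel()
  • With categorical features: CatBoostModel()
  • To test an idea quickly: LinearModel()
  • To avoid overfitting: build a baseline with LinearModel() first, then try more complex models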

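Common mistakes

  • Data leakage: make sure training and test dates do not overlap (split by time, never randomly)
  • Overfitting: test-set performance is far worse than training-set performance (IC < 0.02 may indicate overfitting)
  • Look-ahead bias: build labels with .shift(-1) or period=1 so that no future information leaks into the features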
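
The time-based split and the IC check mentioned above can be sketched in plain pandas. IC is taken here to mean the per-date Spearman rank correlation between predictions and realized labels, which is the usual convention but an assumption about what this page intends; information_coefficient and the sample data are hypothetical.

```python
import pandas as pd


def information_coefficient(pred: pd.Series, label: pd.Series) -> pd.Series:
    """Per-date Spearman rank correlation between predictions and realized labels."""
    df = pd.DataFrame({"pred": pred, "label": label}).dropna()
    # Rank within each date, then take the Pearson correlation of the ranks.
    ranked = df.groupby(level="datetime").rank()
    return ranked.groupby(level="datetime").apply(lambda g: g["pred"].corr(g["label"]))


dates = pd.to_datetime(["2022-12-30", "2023-01-06"])
index = pd.MultiIndex.from_product(
    [dates, ["1101", "1102", "1103", "1104"]], names=["datetime", "instrument"]
)
pred = pd.Series([1, 2, 3, 4, 1, 2, 3, 4], index=index, dtype=float)
label = pd.Series([0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1], index=index)

# Split by time: everything before the cutoff is training data, the rest is test data.
is_train = index.get_level_values("datetime") < "2023-01-01"

ic = information_coefficient(pred, label)
print(ic)
```

In this toy data the predictions rank the stocks perfectly on the first date and in exactly the wrong order on the second, so the two IC values are 1 and -1. Comparing the average IC on training dates with the average on test dates gives a quick overfitting check.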

Complete example

Building an ML stock-selection strategy

from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
from finlab.backtest import sim

# Step 1: build the feature set
print("Building features...")
features = mlf.combine({
    # Fundamental features
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'roe': data.get('fundamental_features:股東權益報酬率'),

    # Technical indicators (only a few, to limit overfitting)
    'technical': mlf.ta(mlf.ta_names(n=1)[:20])  # first 20 only
}, resample='W')

# Step 2: generate the labels
print("Generating labels...")
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

# Step 3: split into training and test sets by date
print("Splitting data...")
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# Step 4: train the model
print("Training model...")
model = q.LGBModel()
model.fit(X_train, y_train)

# Step 5: predict
print("Generating predictions...")
pred = model.predict(X_test)

# Step 6: convert predictions into trading signals
print("Generating trading signals...")
position = pred.is_largest(30)  # buy the 30 stocks with the highest predicted return

# Step 7: backtest
print("Running backtest...")
report = sim(position, resample='W')
report.display()

# Step 8: inspect feature importance (tree models only)
if hasattr(model, 'feature_importances_'):
    import pandas as pd
    feature_importance = pd.Series(
        model.feature_importances_,
        index=features.columns.get_level_values(1).unique()
    ).sort_values(ascending=False)

    print("\nTop 10 features:")
    print(feature_importance.head(10))

FAQ

Q: What happens if features and labels use different resample values?

The dates no longer line up, so features and labels cannot be paired correctly:

# ❌ Wrong: resample values differ
features = mlf.combine({...}, resample='W')  # weekly
label = mll.daytrading_percentage(features.index, period=1, resample='D')  # daily
# → raises a shape mismatch error

# ✅ Correct: resample values match
features = mlf.combine({...}, resample='W')
label = mll.daytrading_percentage(features.index, period=1, resample='W')

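Q: How do I avoid overfitting?

# Option 1: use fewer features
features = mlf.ta(mlf.ta_names(n=1)[:20])  # only 20 indicators

# Option 2: use regularization (lower model complexity)
# LightGBM supports L1/L2 regularization, tree depth limits, and so on

# Option 3: use a simple model
model = q.LinearModel()  # linear models are less prone to overfitting

# Option 4: test out of sample
# Make sure test-set performance stays close to training-set performance (a gap under 50%)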

Q: What if training is slow?

# Option 1: shorten the data range
data.truncate_start = '2020-01-01'

# Option 2: use fewer features
features = mlf.ta(mlf.ta_names(n=1)[:10])  # only 10 indicators

# Option 3: use resample='W' or 'M' instead of 'D'
features = mlf.combine({...}, resample='W')  # weekly data is faster

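Q: How do I handle missing values?

# First check how much is missing
print(f"Missing ratio in features: {features.isna().sum().sum() / features.size:.2%}")
print(f"Missing ratio in labels: {label.isna().sum() / len(label):.2%}")

# Option 1: drop samples with missing values
features_clean = features.dropna()
label_clean = label.dropna()

# Option 2: forward fill (suited to time series)
# Fill within each instrument so values never carry over from another stock;
# fillna(method='ffill') is deprecated in recent pandas versions
features_filled = features.groupby(level='instrument').ffill()
label_filled = label.groupby(level='instrument').ffill()

# Option 3: fill with 0 (suited to technical indicators)
features_zero = features.fillna(0)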
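
Dropping missing values from features and labels separately can leave the two with different rows. A minimal pandas sketch of dropping them jointly, so that X and y stay aligned, is shown below; the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2024-01-05", "2024-01-12"]), ["1101", "1102"]],
    names=["datetime", "instrument"],
)
features = pd.DataFrame(
    {"pb": [1.2, np.nan, 1.3, 2.0], "pe": [10.0, 15.0, 11.0, 14.0]}, index=index
)
label = pd.Series([0.01, 0.02, np.nan, -0.01], index=index)

# Keep a sample only when every feature and the label are present.
valid = features.notna().all(axis=1) & label.notna()
X, y = features[valid], label[valid]
print(X)
print(y)
```

Two of the four samples survive here: one is removed for a missing feature and one for a missing label, and X and y end up with identical indexes.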

Q: How do I convert predictions into trading signals?

# Option 1: buy the top N
position = pred.is_largest(30)

# Option 2: apply a threshold
position = pred > pred.quantile(0.8)  # buy the top 20%

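# Option 3: long-short strategy
long_position = pred.is_largest(20)   # go long the top 20
short_position = pred.is_smallest(20)  # go short the bottom 20
# Cast to int first: subtracting boolean frames raises a TypeError in pandas
position = long_position.astype(int) - short_position.astype(int)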

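# Option 4: weight by predicted value
# Assumes pred is a dates × instruments table; normalize each date's row so its weights sum to 1
position = pred.div(pred.sum(axis=1), axis=0)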
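
For readers who want to see the top-N idea without finlab, the sketch below reproduces it in plain pandas by ranking each date's predictions across instruments. It assumes pred is a dates × instruments table and is not the implementation of is_largest; top_n and the sample predictions are hypothetical.

```python
import pandas as pd


def top_n(pred: pd.DataFrame, n: int) -> pd.DataFrame:
    """Mark the n highest predictions on each date (row) as True."""
    ranks = pred.rank(axis=1, ascending=False, method="first")
    return ranks <= n


pred = pd.DataFrame(
    [[0.03, -0.01, 0.05], [0.00, 0.02, -0.02]],
    index=pd.to_datetime(["2024-01-05", "2024-01-12"]),
    columns=["1101", "1102", "1103"],
)

position = top_n(pred, 2)
# Equal weights across the selected stocks on each date.
weights = position.astype(float).div(position.sum(axis=1), axis=0)
print(position)
print(weights)
```

On the first date the sketch selects 1101 and 1103, on the second 1101 and 1102, each with a weight of 0.5.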

References