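Contents
Introduction
Polynomial data fitting with least squares
Using linear regression in finance
Seasonal Fitting
Auto regressive fitting
Three rules of data fitting
Least squares classification
Confusion matrix
Least squares classifier
GitHub repository
References
MyEnigma Supporters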

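Introduction

Stanford University offers a course called Introduction to Matrix Methods (EE103) as a first step toward learning machine learning.

This article is a set of technical notes I took while reading the textbook for that course, Introduction to Applied Linear Algebra.

The PDF of the textbook can be downloaded from the page linked below.

Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares

These notes cover only the part of the textbook on data fitting with least squares.

For the other parts, please see the article below.

myenigma.hatenablog.com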

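Polynomial data fitting with least squares

Take the linear system Ap = b, store the powers of the x coordinates in A (one column per degree, from 0 up to the chosen degree), and store the corresponding y coordinates in b. Solving this system in the least squares sense fits a polynomial of any chosen degree to the data.

p is the parameter vector that holds the weight (coefficient) of each degree in the fit.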
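Written out, the system looks roughly like the following. The symbols N (number of data points), d (polynomial degree), and the indices on x, y, and p are my own notation for illustration, and the closed form assumes the columns of A are linearly independent.

```latex
A = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^d \\
1 & x_2 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^d
\end{bmatrix},
\quad
p = \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_d \end{bmatrix},
\quad
b = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

\hat{p} = \operatorname*{argmin}_{p} \| A p - b \|^2 = (A^\top A)^{-1} A^\top b
```

The fitted value at a new input x is then p_0 + p_1 x + ... + p_d x^d, which is what multiplying a matrix built from the new inputs by the parameter vector computes.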

The figure below shows the result of fitting polynomials of various degrees to one data set.

f:id:meison_amsl:20180727214050p:plain

The code is as follows.

using PyPlot


function get_sin_training()

    x = [i for i in 0:0.5:10]
    y = sin.(x)*3.0 + x

    return x, y
end


function construct_polynomial_matrix(tx, degree)
    nteaching=length(tx)
    A = fill(1.0, (nteaching,1))
    for i in 1:degree
        At = tx.^i
        A = hcat(A, At)
    end

    return A
end


function polynomial_fitting(tx, ty, degree, x)

    A = construct_polynomial_matrix(tx, degree)

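    # calc parameter vector: least squares solution of A*pv ≈ ty
    # (mathematically equal to inv(A'*A)*A'*ty, but numerically more stable)
    pv = A \ ty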

    Ap = construct_polynomial_matrix(x, degree)

    y = Ap*pv 
    
    return y
end


function main()

    tx, ty = get_sin_training()

    x = [i for i in 0:0.1:10]
    plot(tx, ty, "xb")

    for d in 1:5
        y = polynomial_fitting(tx, ty, d, x)
        plot(x, y, label=string(d))
    end

    axis("equal")
    legend()
    show()

end

main()

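As the figure above shows, raising the degree of the polynomial reduces the error between the given data points and the fitted curve, but the fit generalizes worse to other inputs.

This is called overfitting.

One way to deal with it is to split the data into training data and test data, fit on the training data, evaluate that fit on the test data, and choose the degree from the evaluation.

The figure below plots the error against the polynomial degree for one training set and one test set. The error on the training data decreases every time the degree is raised, whereas the error on the test data instead grows at higher degrees.

So the degree to choose is the one that gives the smallest error on the test data.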

f:id:meison_amsl:20180729150255p:plain
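The selection procedure described above can be sketched as follows. This is only a minimal illustration: the odd/even split, the RMS error measure, and the helper names polynomial_matrix, rms_error, and select_degree are my own assumptions, not something taken from the textbook.

```julia
# Minimal sketch: choose the polynomial degree by its error on held-out test data.
# The split and all helper names are illustrative assumptions.

# Matrix whose columns are x.^0, x.^1, ..., x.^degree
polynomial_matrix(x, degree) = hcat([x .^ i for i in 0:degree]...)

# Root mean square error between data and prediction
rms_error(y, yhat) = sqrt(sum(abs2, y .- yhat) / length(y))

function select_degree(train_x, train_y, test_x, test_y, max_degree)
    best_degree, best_error = 0, Inf
    for d in 0:max_degree
        # Fit on the training data only
        pv = polynomial_matrix(train_x, d) \ train_y
        # Evaluate the fit on the test data
        err = rms_error(test_y, polynomial_matrix(test_x, d) * pv)
        if err < best_error
            best_degree, best_error = d, err
        end
    end
    return best_degree, best_error
end

# Same kind of data as in the polynomial fitting example
x = collect(0:0.5:10)
y = sin.(x) .* 3.0 .+ x

train_idx = 1:2:length(x)   # odd-numbered samples are used for fitting
test_idx = 2:2:length(x)    # even-numbered samples are used for evaluation

degree, err = select_degree(x[train_idx], y[train_idx],
                            x[test_idx], y[test_idx], 5)
println("selected degree = ", degree, ", test RMS error = ", err)
```

Splitting by odd and even index is just the simplest deterministic choice here. With real data a random split is more common.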

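Using linear regression in finance

In finance, the future return of each investment is often expressed as a linear regression model on the market price.

The linear regression model expresses two parts: a return that is always obtained as a constant offset, and a part that varies over time.

The series obtained by plotting the difference between the actual values and the linear regression model is called the de-trended time series. It shows whether the actual value came in above or below the prediction at each time.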
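As a rough illustration of the de-trended series, here is a minimal sketch that fits a straight line (offset plus slope) and subtracts it from the data. The series is synthetic and the function name detrend is my own, so treat it as an assumption-laden example rather than an actual financial model.

```julia
# Minimal sketch: straight-line fit and de-trended series.
# The data is synthetic and the function name is an illustrative assumption.

function detrend(t, y)
    # Columns of A: constant offset and linear trend
    A = hcat(ones(length(t)), t)
    theta = A \ y                 # least squares estimate [offset, slope]
    trend = A * theta             # value of the fitted line at each time
    return theta, y .- trend      # residual = de-trended time series
end

t = collect(1.0:10.0)
wiggle = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
y = 2.0 .+ 0.5 .* t .+ wiggle

theta, residual = detrend(t, y)

# residual[i] > 0 means the actual value was above the fitted line at t[i]
println("offset = ", theta[1], ", slope = ", theta[2])
println("times above the fit: ", t[residual .> 0])
```

Because the model includes a constant column, the de-trended series always sums to zero, so it really does show only the deviations above and below the fitted line.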

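Seasonal Fitting

For data that changes periodically, if the period is known in advance, linear least squares can fit that kind of data as well.

Least squares is used to compute a term for the overall linear trend of the data together with one offset for each position within the period.

This appears to be called seasonal fitting.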
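As a formula, the model can be written roughly as below. The symbols are my own notation for illustration: P is the period, N the number of samples, t the 1-based sample index, and e_j the j-th unit vector of length P.

```latex
\hat{y}_t = \theta_1 \, t + \theta_{1 + j(t)},
\qquad j(t) = ((t - 1) \bmod P) + 1,
\qquad t = 1, \dots, N

A = \begin{bmatrix}
1 & e_{j(1)}^\top \\
2 & e_{j(2)}^\top \\
\vdots & \vdots \\
N & e_{j(N)}^\top
\end{bmatrix},
\qquad
\hat{\theta} = \operatorname*{argmin}_{\theta} \| A \theta - y \|^2
```

The first column carries the linear trend, and the one-hot part of each row selects the offset that belongs to that sample's position within the period.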

For example, a fit like the following is possible.

f:id:meison_amsl:20180727202639p:plain

The code is as follows.

using PyPlot

function get_sin_training()

    x = [i for i in 0:18]

    y = sin.(x)*5.0 + x/5

    return x, y
end

function construct_polynomial_matrix(tx, degree)
    nteaching=length(tx)
    A = fill(1.0, (nteaching,1))
    for i in 1:degree
        At = tx.^i
        A = hcat(A, At)
    end

    return A
end


function seasonal_fitting(tx, ty, cycle)

    nteaching = length(tx)

    # Each row is [sample index, one-hot vector of the position within the cycle]
    A = zeros(nteaching, cycle + 1)
    for ind in 1:nteaching
        A[ind, 1] = ind
        # mod1 maps 1, 2, ..., cycle, cycle+1, ... to 1, 2, ..., cycle, 1, ...
        A[ind, 1 + mod1(ind, cycle)] = 1.0
    end

    # calc parameter vector (least squares solution of A*pv ≈ ty)
    pv = A \ ty

    y = A*pv

    return y
end



function main()
    tx, ty = get_sin_training()

    plot(tx, ty, "xb", label="data")

    cycle = 6
    y = seasonal_fitting(tx, ty, cycle)

    plot(tx, y, "-r", label="fitting")

    axis("equal")
    legend()

    show()
end


main()

Auto regressive fitting

Auto regressive fitting (AR fitting) approximates the data at each time t by a weighted sum of the M past data points from t-1 back to t-M.

f:id:meison_amsl:20180801212049p:plain

It is also known as the autoregressive model (AR model).

In an AR model, the weights on the data from t-1 back to t-M can be computed with least squares.
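Stacking the approximation for every time step gives a least squares problem of roughly the following form. The symbols are my own notation for illustration: N is the number of samples and w_1, ..., w_M are the weights, ordered so that w_1 multiplies the oldest value y_{t-M} and w_M multiplies the most recent value y_{t-1}.

```latex
\begin{bmatrix}
y_1 & y_2 & \cdots & y_M \\
y_2 & y_3 & \cdots & y_{M+1} \\
\vdots & \vdots & & \vdots \\
y_{N-M} & y_{N-M+1} & \cdots & y_{N-1}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix}
\approx
\begin{bmatrix} y_{M+1} \\ y_{M+2} \\ \vdots \\ y_N \end{bmatrix}
```

Each row holds the M values that precede the target on the right-hand side, so the first M samples have no prediction of their own.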

The result of actually running auto regressive fitting is shown below.

f:id:meison_amsl:20180731210837p:plain

Here M = 4, so each data point is approximated from the four data points before it.

The code is as follows.

using PyPlot

function get_sin_training()

    x = [i for i in 0:20]

    y = sin.(x)*1.0 + x/5

    for i in 1:length(y)
        y[i]+=rand()
    end

    return x, y
end


function auto_regressive_fitting(tx, ty, M)

    N = length(tx)

    # Row i holds the M values ty[i], ..., ty[i+M-1] that precede ty[i+M]
    A = zeros(N - M, M)
    for i in 1:N-M
        A[i, :] = ty[i:i+M-1]
    end

    # calc parameter vector (least squares solution of A*pv ≈ ty[M+1:end])
    pv = A \ ty[M+1:end]

    y = A*pv
    
    return y
end


function main()

    tx, ty = get_sin_training()
    plot(tx, ty, "xb")

    M = 4
    y = auto_regressive_fitting(tx, ty, M)

    plot(tx[M+1:end], y, "-r")

    axis("equal")

    show()

end


main()