Reducing Noise in Financial Data

Juglans Regia
Apr 2, 2023

Modern financial theory rests on assumptions about return and risk, which makes covariance matrices central to model building: risk assessment, portfolio optimization, Monte Carlo scenario simulation, cluster detection, and reducing the dimensionality of a vector space.

However, estimated covariance matrices usually contain noise, sometimes enough to make our models unusable. Handling that noise and improving the covariance matrix is therefore very important.

Marcenko–Pastur Theorem

The Marcenko–Pastur theorem is a result from random matrix theory that relates the eigenvalues of a large random matrix to the dimensions of that matrix. The eigenvalues follow a distribution known as the Marcenko–Pastur distribution, which depends on the ratio of the number of rows to the number of columns in the matrix.

In finance, the Marcenko–Pastur theorem is typically used to analyze the eigenvalues of the sample covariance matrix of a set of financial assets, in order to understand the underlying structure of asset returns and correlations. The book (Machine Learning for Asset Managers, see the reference at the end) gives the following sample code:

import numpy as np
import pandas as pd

def mpPDF(var, q, pts):
    # Marcenko-Pastur pdf; q = T/N, var = variance of the underlying random process
    eMin, eMax = var*(1-(1./q)**.5)**2, var*(1+(1./q)**.5)**2 # lambda_minus, lambda_plus
    eVal = np.linspace(eMin, eMax, pts) # evenly spaced grid of eigenvalues ('lambda')
    pdf = q/(2*np.pi*var*eVal)*((eMax-eVal)*(eVal-eMin))**.5
    pdf = pd.Series(pdf, index=eVal)
    return pdf

We can test it with the following helpers:

def getPCA(matrix):
    # Get eVal, eVec from a Hermitian matrix
    # eigh is meant for complex Hermitian or real symmetric matrices and returns real eigenvalues
    eVal, eVec = np.linalg.eigh(matrix)
    indices = eVal.argsort()[::-1] # indices that sort the eigenvalues in descending order
    eVal,eVec = eVal[indices],eVec[:,indices]
    eVal = np.diagflat(eVal) # diagonal matrix with the eigenvalues on the diagonal
    return eVal,eVec
#---------------------------------------------------

from sklearn.neighbors import KernelDensity

def fitKDE(obs, bWidth=.15, kernel='gaussian', x=None):
    # Fit a kernel to a series of obs, and derive the prob of obs
    # x is the array of values on which the fitted KDE will be evaluated
    if len(obs.shape) == 1: obs = obs.reshape(-1,1)
    kde = KernelDensity(kernel = kernel, bandwidth = bWidth).fit(obs)
    if x is None: x = np.unique(obs).reshape(-1,1)
    if len(x.shape) == 1: x = x.reshape(-1,1)
    logProb = kde.score_samples(x) # log(density)
    pdf = pd.Series(np.exp(logProb), index=x.flatten())
    return pdf

However, a randomly generated sample can never reproduce the theoretical distribution perfectly.
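As a rough numerical check of this point, the sketch below (not taken from the book) draws a purely random returns matrix, computes the eigenvalues of its correlation matrix, and compares their range with the theoretical bounds used in `mpPDF`. The sizes `T` and `N`, the seed, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, arbitrary choice
T, N = 2000, 200                  # hypothetical: observations x assets
q = T / N

# Pure noise: i.i.d. standard normal "returns"
x = rng.normal(size=(T, N))
corr = np.corrcoef(x, rowvar=False)

# eigvalsh is meant for symmetric matrices and returns real eigenvalues
eigvals = np.linalg.eigvalsh(corr)

# Theoretical Marcenko-Pastur bounds for variance 1 (same formula as in mpPDF)
e_min = (1 - (1 / q) ** 0.5) ** 2
e_max = (1 + (1 / q) ** 0.5) ** 2

print(f"theoretical range: [{e_min:.3f}, {e_max:.3f}]")
print(f"empirical range:   [{eigvals.min():.3f}, {eigvals.max():.3f}]")
```

The empirical range should sit close to the theoretical one without coinciding with it, because `T` and `N` are finite.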

Marcenko–Pastur Distribution

The Marcenko–Pastur distribution is a probability distribution that describes the eigenvalues of a random matrix with a given number of rows and columns. In the context of covariance matrices, it can be used to analyze the eigenvalues of the sample covariance matrix of a set of financial assets.

The Marcenko–Pastur distribution is defined as follows:

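$$f(x) = \frac{q}{2\pi\sigma^2 x}\sqrt{(b-x)(x-a)}, \quad a \le x \le b$$

where $a$ and $b$ are the lower and upper bounds of the distribution, respectively, $\sigma^2$ is the variance of the underlying random process (`var` in `mpPDF`), and $q = \frac{T}{N}$ is the ratio of the number of observations (T) to the number of assets (N). The lower and upper bounds are given by:

$$a = \sigma^2\left(1 - \sqrt{1/q}\right)^2$$
$$b = \sigma^2\left(1 + \sqrt{1/q}\right)^2$$

The shape of the Marcenko–Pastur distribution depends on $q$. For $q > 1$ the density is a single continuous bulk on $[a, b]$; the larger $q$ is, the narrower that interval becomes around $\sigma^2$, and as $q$ approaches 1 the lower bound $a$ approaches 0 and the bulk widens. For $q < 1$ (fewer observations than assets) the sample covariance matrix is rank-deficient, so a fraction $1 - q$ of the eigenvalues is exactly zero and the distribution has a second mass at zero in addition to the continuous bulk.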
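To make the role of $q$ concrete, here is a small sketch that uses the same convention as `mpPDF` ($q = T/N$, variance `var`). It computes the bounds for a hypothetical $q = 10$ and checks numerically that the density integrates to roughly 1. The helper names `mp_bounds` and `mp_density` are made up for this example.

```python
import numpy as np

def mp_bounds(var, q):
    """Lower and upper edge of the Marcenko-Pastur support (q = T/N)."""
    e_min = var * (1 - (1 / q) ** 0.5) ** 2
    e_max = var * (1 + (1 / q) ** 0.5) ** 2
    return e_min, e_max

def mp_density(x, var, q):
    """Density on [e_min, e_max], same expression as in mpPDF."""
    e_min, e_max = mp_bounds(var, q)
    # np.maximum guards against tiny negative values caused by rounding at the edges
    inside = np.maximum((e_max - x) * (x - e_min), 0.0)
    return q / (2 * np.pi * var * x) * np.sqrt(inside)

var, q = 1.0, 10.0   # hypothetical values
e_min, e_max = mp_bounds(var, q)
grid = np.linspace(e_min, e_max, 100_001)
dens = mp_density(grid, var, q)

# Trapezoidal rule written out by hand
area = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid))
print(f"support: [{e_min:.4f}, {e_max:.4f}], area under pdf: {area:.4f}")

# A larger q (more observations per asset) gives a narrower support
w10 = e_max - e_min
lo100, hi100 = mp_bounds(var, 100.0)
print(f"width for q=10: {w10:.4f}, width for q=100: {hi100 - lo100:.4f}")
```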

#snippet 2.3
def getRndCov(nCols, nFacts): #nFacts - contains signal out of nCols
    w = np.random.normal(size=(nCols, nFacts))
    cov = np.dot(w, w.T) #random cov matrix, however not full rank
    cov += np.diag(np.random.uniform(size=nCols)) #full rank cov
    return cov
#---------------------------------------------------

def cov2corr(cov):
    # Derive the correlation matrix from a covariance matrix
    std = np.sqrt(np.diag(cov))
    corr = cov/np.outer(std,std)
    corr[corr<-1], corr[corr>1] = -1,1 #for numerical errors
    return corr
#---------------------------------------------------
def corr2cov(corr, std):
    cov = corr * np.outer(std, std)
    return cov
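A quick way to sanity-check the two conversion helpers is a round trip: covariance to correlation and back. The sketch below repeats compact versions of the two functions so it runs on its own; the 3-asset covariance matrix is made up for illustration.

```python
import numpy as np

def cov2corr(cov):
    # divide each entry by the product of the two standard deviations
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

def corr2cov(corr, std):
    # inverse operation: scale correlations back by the standard deviations
    return corr * np.outer(std, std)

# Hypothetical 3-asset covariance matrix (standard deviations 0.2, 0.3, 0.4)
cov = np.array([[0.04, 0.006, 0.0],
                [0.006, 0.09, 0.012],
                [0.0, 0.012, 0.16]])

std = np.sqrt(np.diag(cov))
corr = cov2corr(cov)
cov_back = corr2cov(corr, std)

print(np.round(corr, 3))
print("round trip recovers cov:", np.allclose(cov_back, cov))
```

Note that `corr2cov` needs the standard deviations as a separate input, because that information is discarded when the correlation matrix is formed.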

Let's try it out by fitting the distribution:

#snippet 2.4 - fitting the marcenko-pastur pdf - find variance
from scipy.optimize import minimize

# Fit error: sum of squared differences between the theoretical and empirical pdf
def errPDFs(var, eVal, q, bWidth, pts=1000):
    var = var[0] # minimize passes the parameter as a 1-D array
    pdf0 = mpPDF(var, q, pts) #theoretical pdf
    pdf1 = fitKDE(eVal, bWidth, x=pdf0.index.values) #empirical pdf
    sse = np.sum((pdf1-pdf0)**2)
    return sse

# find max random eVal by fitting Marcenko's dist
# and return variance
def findMaxEval(eVal, q, bWidth):
    out = minimize(errPDFs, x0=np.array([0.5]), args=(eVal, q, bWidth), bounds=((1E-5, 1-1E-5),))
    if out['success']: var = out['x'][0]
    else: var = 1
    eMax = var*(1+(1./q)**.5)**2
    return eMax, var

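This gives the fitted Marcenko–Pastur distribution, which can be compared against the empirical density of the eigenvalues.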
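Once `eMax` is known, the eigenvalues above it can be counted as signal, which gives the `nFacts` argument used by the denoising step. The sketch below shows one way to do that count; the eigenvalues and the value of `eMax` are made up rather than produced by `findMaxEval`.

```python
import numpy as np

# Made-up eigenvalues in descending order on the diagonal (the layout getPCA returns)
eVal = np.diag([5.0, 3.0, 1.5, 1.0, 0.5])
eMax = 1.7   # hypothetical upper bound of the fitted noise distribution

# searchsorted expects ascending order, so reverse the diagonal first
eigs_asc = np.diag(eVal)[::-1]
nFacts = eVal.shape[0] - eigs_asc.searchsorted(eMax)

print(nFacts)   # 2: only 5.0 and 3.0 exceed eMax
```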

Denoising

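In finance it is common to shrink a numerically ill-conditioned covariance matrix. Shrinkage pulls the covariance matrix closer to a diagonal matrix and reduces noise. The method below instead keeps the eigenvalues attributed to signal and replaces the remaining ones with their average.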

# DENOISING BY CONSTANT RESIDUAL EIGENVALUE

def denoisedCorr(eVal, eVec, nFacts):
    # eVal: diagonal matrix of eigenvalues (descending), eVec: matching eigenvectors
    eVal_ = np.diag(eVal).copy()
    # keep the first nFacts eigenvalues; replace the rest by their average (the trace is preserved)
    eVal_[nFacts:] = eVal_[nFacts:].sum()/float(eVal_.shape[0] - nFacts)
    eVal_ = np.diag(eVal_) # back to a diagonal matrix
    corr1 = np.dot(eVec, eVal_).dot(eVec.T) # eigendecomposition of a symmetric matrix: S = Q Λ Q^T
    corr1 = cov2corr(corr1) # rescale the correlation matrix to have 1s on the main diagonal
    return corr1
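The effect of the constant residual eigenvalue method can be seen on a small synthetic example. The sketch below is self-contained (it inlines the same steps rather than calling `denoisedCorr`) and assumes the number of signal factors is already known. The sizes, the seed, and the factor model used to generate returns are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, n_facts = 500, 50, 3   # hypothetical sizes

# Returns driven by a few common factors plus idiosyncratic noise
factors = rng.normal(size=(T, n_facts))
loadings = rng.normal(size=(n_facts, N))
returns = factors @ loadings + rng.normal(size=(T, N))
corr0 = np.corrcoef(returns, rowvar=False)

# Eigendecomposition, sorted in descending order
evals, evecs = np.linalg.eigh(corr0)
order = evals.argsort()[::-1]
evals, evecs = evals[order], evecs[:, order]

# Replace the eigenvalues beyond the first n_facts by their average
evals_dn = evals.copy()
evals_dn[n_facts:] = evals_dn[n_facts:].mean()

# Rebuild the matrix and rescale it to unit diagonal
corr1 = evecs @ np.diag(evals_dn) @ evecs.T
std = np.sqrt(np.diag(corr1))
corr1 = corr1 / np.outer(std, std)

cond_before = evals[0] / evals[-1]
cond_after = evals_dn[0] / evals_dn[-1]
print(f"eigenvalue ratio before: {cond_before:.1f}, after: {cond_after:.1f}")
```

Averaging the noise eigenvalues leaves their sum unchanged and raises the smallest one, so the ratio of largest to smallest eigenvalue cannot increase at that step.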

Detoning

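Detoning removes the market component from the correlation matrix, that is, the contribution of the leading eigenvector(s) that nearly all assets load on. This is different from a shrinkage estimator, which combines the sample covariance matrix with a "prior" estimate of the covariance matrix, such as the identity matrix or the sample covariance matrix of another set of assets; shrinkage estimators have the advantage of being robust to outliers and may give a more accurate covariance estimate in some cases.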

def detoned_corr(corr, eigenvalues, eigenvectors, market_component=1):
    """
    De-tones the de-noised correlation matrix by removing the market component.
    The inputs are the eigenvalues and eigenvectors of the correlation matrix and the number
    of leading eigenvectors related to a market component.
    :param corr: (np.array) Correlation matrix to detone.
    :param eigenvalues: (np.array) Matrix with eigenvalues on the main diagonal.
    :param eigenvectors: (np.array) Eigenvectors array.
    :param market_component: (int) Number of first eigenvectors related to a market component. (1 by default)
    :return: (np.array) De-toned correlation matrix.
    """

    # Getting the eigenvalues and eigenvectors related to market component
    eigenvalues_mark = eigenvalues[:market_component, :market_component]
    eigenvectors_mark = eigenvectors[:, :market_component]

    # Calculating the market component correlation
    corr_mark = np.dot(eigenvectors_mark, eigenvalues_mark).dot(eigenvectors_mark.T)

    # Removing the market component from the de-noised correlation matrix
    corr = corr - corr_mark

    # Rescaling the correlation matrix to have 1s on the main diagonal
    corr = cov2corr(corr)

    return corr
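The following sketch illustrates detoning on synthetic data with one common "market" factor. It inlines the same steps as `detoned_corr` so that it runs on its own; the data-generating model, the sizes, and the seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 500, 20   # hypothetical sizes

# Every asset loads on one common "market" factor, plus idiosyncratic noise
market = rng.normal(size=(T, 1))
returns = market @ np.ones((1, N)) + rng.normal(size=(T, N))
corr = np.corrcoef(returns, rowvar=False)

# Eigendecomposition, sorted in descending order
evals, evecs = np.linalg.eigh(corr)
order = evals.argsort()[::-1]
evals, evecs = evals[order], evecs[:, order]

# Remove the first (market) component and rescale to unit diagonal
k = 1
corr_mark = evecs[:, :k] @ np.diag(evals[:k]) @ evecs[:, :k].T
detoned = corr - corr_mark
std = np.sqrt(np.diag(detoned))
detoned = detoned / np.outer(std, std)

detoned_evals = np.linalg.eigvalsh(detoned)
print(f"largest eigenvalue before: {evals[0]:.2f}, after: {detoned_evals.max():.2f}")
print(f"smallest eigenvalue after: {detoned_evals.min():.2e}")
```

Because one eigenvector has been removed, the detoned matrix is singular (its smallest eigenvalue is zero up to rounding), so it cannot be inverted directly.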

Closing remarks

Marcenko–Pastur gives the distribution of the eigenvalues associated with a random matrix, so we can distinguish the eigenvalues associated with signal from those associated with noise.

The condition number of a correlation matrix is the ratio between its largest and smallest eigenvalues (in modulus). Denoising reduces the condition number by raising the smallest eigenvalues. We can reduce it further by lowering the largest eigenvalue, which makes sense both mathematically and intuitively. Removing the market component present in

References

https://github.com/emoen/Machine-Learning-for-Asset-Managers/tree/master/Machine_Learning_for_Asset_Managers