Blog: sql相关代码

[TOC]

正则表达式相关

开始和结束符 作为或者出现

^和$写到[]里面就是固定的了.

^放到[]并且在第一位时候eg[ ^abc ]表示非abc的意思. 当出现在中间eg [a^bc]表示a、b、c、^ 这四个字符的其中一个。

要想匹配a、b 或 c或行开头应该怎么写呢?不能写成 [ ^a bc]。但可以写成^[abc], 或者可以写成(^ [abc])

示例程序

###统计PRF指标

SELECT tp, fp, tn, fn, recall, precision, 2*precision*recall / (precision + recall) as f1
FROM 
(
    SELECT tp, fp, tn, fn, tp*1.0/(tp+fn) as recall, tp*1.0/(tp+fp) as precision
    FROM 
    (
        SELECT sum(tp) as tp, sum(fp) as fp, sum(tn) as tn, sum(fn) as fn
        FROM 
        (
            SELECT CAST(SPLIT_PART(flag, ',', 1) AS BIGINT) AS tp, CAST(SPLIT_PART(flag, ',', 2) AS BIGINT) AS fp, 
                   CAST(SPLIT_PART(flag, ',', 3) AS BIGINT) AS tn, CAST(SPLIT_PART(flag, ',', 4) AS BIGINT) AS fn
            FROM 
            (
                SELECT 
                    CASE ---- TP FP TN FN
                        WHEN label_predict = 1 AND label = 1 THEN '1,0,0,0' --tp
                        WHEN label_predict = 1 AND label = 0 THEN '0,1,0,0' --fp
                        WHEN label_predict = 0 AND label = 0 THEN '0,0,1,0' --tn
                        WHEN label_predict = 0 AND label = 1 THEN '0,0,0,1' --fn
                    END AS flag 
                FROM 
                (
                    SELECT label, CASE WHEN score >= 0.5 THEN 1 ELSE 0 END AS label_predict 
                    FROM du_reader_sentence_classify_validation_output_ori
                    WHERE ds = '${bizdate}'
                ) A 
            ) B 
        ) C 
    ) D 
) E 

计算不同threshold下的覆盖情况


SELECT D.threshold, D.avg_confidence_rouge, D.number, E.all_number, D.number * 1.0/E.all_number as ratio
FROM 
(
    SELECT "1" AS key, threshold, AVG(confidence_rouge) as avg_confidence_rouge, COUNT(1) as number 
    FROM 
    ( 
        SELECT threshold, answer_score, confidence_rouge
        FROM 
        (
            SELECT "1" as key, answer_score, confidence_rouge FROM search_du_reader_mrc_predict_result
        ) A 
        JOIN 
        (
            SELECT key, CAST(threshold AS DOUBLE) AS threshold
            FROM 
            ( 
                SELECT TRANS_ARRAY(1, ',', key, threshold) AS (key, threshold)
                FROM 
                ( 
                    SELECT "1" as key, '0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1' as threshold FROM dual
                ) B1
            ) B2
        ) B ON A.key = B.key
        WHERE answer_score >= threshold
    ) C 
    GROUP BY threshold
) D 
JOIN 
(
    SELECT "1" as key, COUNT(1) as all_number  FROM search_du_reader_mrc_predict_result
) E ON D.key = E.key
DISTRIBUTE BY "1" SORT BY threshold
;
  • 高频二阶bert, 覆盖pv: 37%
  • 类目改写: 覆盖pv: 42%
  • 中长尾二阶bert, 覆盖pv: 29%
  • 中长尾替换改写, 覆盖pv: 28%
  • 意图改写, 覆盖pv: 27%
  • session改写, 覆盖pv: 22%
  • 少结果&低质量改写, 覆盖pv: 15%左右
  • 向量召回, 覆盖pv: 8%左右
  • 新增同义词覆盖pv: +35%左右

方案1: 关闭: 高频二阶bert + 类目改写 + session改写 + 向量召回

方案2: 关闭: 高频二阶bert + 类目改写 + session改写 + 向量召回 + 关闭新增同义词 + 意图改写 + 中长尾替换改写

  • ha3缺口比较大, 需要加资源. (缺口3.5w的core)

  • 周四重新压测(或者周三)

  • ha3 70%的时候开始降级,

  • 昨天压测, 改写和phrase各降低了10%, 但是还没抗住.

  • 昨天压测的时候, 无结果率正常的高. 不是超时. 日常1点几, 后面会上升到3-4%. 随着压测加量, 会提升. (别离. 兆逸)

重点在于理解

1: 把问题再升华下, 提升到query理解这方面

2: mall tab 尝试transfer_learning 试试

3: 不要表现的太明显, 自己的想法不要直接表现出来.

4: hb属于和谁都有隔阂的, 而且举棋不定, 没有自己的想法. 但是也不要表现的太明显

5: 点击反馈和意图理解结合起来.

6: 和某个人靠近不要太明显, 最近刷pv有点刷的太积极了.容易引起老大和同事的误会