Chinese text mixed with English produces [UNK]; how can this be fixed?

by sfkuo - opened Dec 2, 2024

Dec 2, 2024

Chinese text mixed with English produces [UNK]. How can this be fixed? Thanks.

Input: 為監測與防治此新興傳染病我國於2020年1月15日起公告COVID19為第五類法定傳染病並於2020年1月21日確診第一起境外移入確診個案另於1月28日確診第1例本土個案為境外移入造成之家庭群聚感染

Output: 為監測與防治此新興傳染病。我國於2020年1月15日起公告[UNK][UNK][UNK][UNK][UNK]19為第五類法定傳染病，並於2020年1月21日確診第一起境外移入確診個案，另於1月28日確診第1例本土個案，為境外移入造成之家庭群聚感染。

p208p2002

Owner Dec 4, 2024

edited Dec 4, 2024

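Hi, my original decoding implementation is fairly simple: it converts the token ids directly back into text, so any character the model does not recognize (that is, one outside the vocabulary) shows up as [UNK].
However, the mapping between the text and its tokens can be used to recover the original characters at the [UNK] positions, which fixes the problem.

Here is a brief demonstration of how to do this; with this adjustment your problem should be solved:
https://colab.research.google.com/drive/1dE5UtjNIlESEpJURF0mEcHt_FmyuySdD?usp=sharing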
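For readers who cannot open the notebook, the mapping idea can be sketched roughly as follows. This is only an illustration and not the notebook's code: the toy vocabulary and the helpers `encode`, `decode_naive`, and `decode_with_mapping` are hypothetical, and the sketch assumes exactly one token per input character.

```python
# Toy character-level vocabulary (hypothetical; a real model's vocab is far larger).
UNK = "[UNK]"
vocab = {UNK: 0}
for ch in "公告為第五類法定傳染病19":
    vocab.setdefault(ch, len(vocab))
id_to_token = {i: t for t, i in vocab.items()}


def encode(text):
    """Map each character to its id; unknown characters map to the [UNK] id."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


def decode_naive(ids):
    """Convert ids straight back to tokens; unknown characters stay as [UNK]."""
    return [id_to_token[i] for i in ids]


def decode_with_mapping(ids, text):
    """Use the position mapping (token i <-> text[i]) to recover [UNK] characters."""
    # Assumption: one token per character, so both sequences have equal length.
    assert len(ids) == len(text)
    return [
        text[i] if id_to_token[tid] == UNK else id_to_token[tid]
        for i, tid in enumerate(ids)
    ]


text = "公告COVID19為第五類法定傳染病"
ids = encode(text)
naive = "".join(decode_naive(ids))
restored = "".join(decode_with_mapping(ids, text))
print(naive)
print(restored)
```

With a real subword tokenizer a token can span several characters, so the position mapping would have to come from the tokenizer's own offsets rather than from a plain index.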

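Bringing this change into zhpr would involve a fairly large refactor, so I currently have no plans to update the original decoding strategy, but PRs are welcome :)

Please leave this issue open here until zhpr is updated.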

yk-xia

Feb 17

edited Feb 17

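I looked at the implementation in https://colab.research.google.com/drive/1dE5UtjNIlESEpJURF0mEcHt_FmyuySdD?usp=sharing but did not fully understand it. My workaround is a small change in post-processing; see the code below. I tried it on a passage I found and it worked: every [UNK] was replaced with the original text.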

    model_pred_out = []
    for batch in tqdm(dataloader):
        batch_out = predict_step(batch,model,tokenizer)
        for out in batch_out:
            model_pred_out.append(out)
    merge_pred_result = merge_stride(model_pred_out,step)
    merge_pred_result_decode = decode_pred(merge_pred_result)

    # Do not join the tokens yet
    # merge_pred_result_decode = ''.join(merge_pred_result_decode)
    # print(merge_pred_result_decode)

    # Walk both sequences with two pointers and replace each [UNK]
    # with the corresponding character of the original input `text`
    ptr_orig = 0
    merge_pred_result_decode_unk_removed = ""
    for ptr_restored in range(len(merge_pred_result_decode)):
        curr = merge_pred_result_decode[ptr_restored]
        if ptr_orig == len(text):
            # The original text is exhausted, so no [UNK] may remain
            assert curr != '[UNK]'
        if curr == '[UNK]':
            # Take the original character and advance the source pointer
            curr = text[ptr_orig]
            ptr_orig += 1
        elif ptr_orig < len(text) and curr == text[ptr_orig]:
            # Known character: advance; inserted punctuation does not advance
            ptr_orig += 1
        merge_pred_result_decode_unk_removed += curr

    print(merge_pred_result_decode_unk_removed)
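The same two-pointer idea can be tried without loading the model. The sketch below is a standalone version under stated assumptions: `restore_unk` is a hypothetical helper name, the token list is hand-written to mimic what the decoding step might return, and each non-punctuation token is assumed to be a single character of the input.

```python
def restore_unk(tokens, text, unk="[UNK]"):
    """Replace each unk token with the matching character of the original text.

    tokens: decoded output, one character per token, with punctuation inserted.
    text:   the original, unpunctuated input.
    """
    restored = []
    ptr_orig = 0  # position of the next unconsumed character in `text`
    for tok in tokens:
        if tok == unk:
            if ptr_orig >= len(text):
                raise ValueError("found an unk token after the original text ended")
            restored.append(text[ptr_orig])
            ptr_orig += 1
        else:
            # A token equal to the next original character consumes it;
            # anything else is treated as inserted punctuation.
            if ptr_orig < len(text) and tok == text[ptr_orig]:
                ptr_orig += 1
            restored.append(tok)
    return "".join(restored)


text = "公告COVID19為第五類法定傳染病並於1月21日確診"
tokens = (
    ["公", "告"]
    + ["[UNK]"] * 5
    + list("19為第五類法定傳染病")
    + ["，"]
    + list("並於1月21日確診")
    + ["。"]
)
result = restore_unk(tokens, text)
print(result)
```

One caveat of this approach: if the input already contains a punctuation mark identical to one the model inserts right before it, the pointer can advance at the wrong token, so it is safest on unpunctuated input.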
