Great Circle Associates

XCIN Mail-list
(April 2001)



Subject: Re: Some ideas about improving libtabe
From: Chih-Hao Tsai <hao520@yahoo.com>
Organization: Taiwan Linux User Group News Server
Date: Thu, 19 Apr 2001 02:47:44 -0500
To: xcin@tlug.sinica.edu.tw
Delivered-To: xcin-gate@tlug.sinica.edu.tw
Delivered-To: xcin-list@tlug.sinica.edu.tw
Reply-To: xcin@tlug.sinica.edu.tw



Pai-Hsiang Hsiao wrote:

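> What I have now is a medium-sized Chinese TreeBank. It is mostly
> Simplified Chinese newswire text and similar material, about 100,000
> words. The corpus has been manually word-segmented and tagged with
> part-of-speech tags, so converting it from Simplified to Traditional
> Chinese and adding zhuyin is simpler and more reliable. The
> Simplified-to-Traditional conversion is basically done: some of the
> one-to-many cases were picked out and checked against a lexicon, then
> proofread by hand. The zhuyin is handled in much the same way: first
> look the word up in the Ministry of Education dictionary, then check
> it against tsi.src, and finally add whatever is left by hand. (Judging
> from the results so far, tsi.src is close to becoming a super set of
> the Ministry of Education clc dict in this respect. Thanks to everyone
> who contributes, great work!)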
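
The zhuyin lookup order described in the quoted message (Ministry of
Education dictionary first, then tsi.src, then manual entry) could be
sketched roughly as below. This is only an illustration under
assumptions: it treats the second step as a plain fallback lookup, the
function name annotate_zhuyin and both small tables are hypothetical,
and the toy entries are not taken from the real dictionaries or from
tsi.src.

```python
# Hypothetical toy tables standing in for the two dictionaries.
# Real data would be loaded from the dictionary files instead.
moe_dict = {"中文": "ㄓㄨㄥ ㄨㄣˊ"}
tsi_dict = {"中文": "ㄓㄨㄥ ㄨㄣˊ", "新聞": "ㄒㄧㄣ ㄨㄣˊ"}


def annotate_zhuyin(word, moe, tsi, ask_human=None):
    """Return (reading, source) for one segmented word.

    Sources are tried in order: the Ministry of Education table,
    then the tsi.src table, then an optional manual callback.
    """
    for source, table in (("moe", moe), ("tsi.src", tsi)):
        reading = table.get(word)
        if reading is not None:
            return reading, source
    if ask_human is not None:
        # Last resort: a person supplies the reading.
        return ask_human(word), "manual"
    # Nothing matched and nobody to ask; leave it for later review.
    return None, "unresolved"


if __name__ == "__main__":
    for w in ["中文", "新聞", "未知"]:
        print(w, annotate_zhuyin(w, moe_dict, tsi_dict))
```

Recording which source supplied each reading makes it easy to list
afterwards the words that still need manual attention.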

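If you do not need part-of-speech tags, you might consider the PH
corpus. It is a GB-encoded, word-segmented news corpus with more than
two million words.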

ftp://ftp.cogsci.ed.ac.uk/pub/chinese/

 
--
Chih-Hao Tsai | ICQ#5734422 | http://www.geocities.com/hao520




