Sentence Chunking

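Chunking, also known as shallow parsing, identifies the constituent parts of a sentence and groups them into phrases (such as noun phrases). Part-of-speech tagging tells you whether a word is a noun, a verb, an adjective, and so on, but it gives you no clue about the clause or phrase structure of the sentence. Some natural language processing tasks need more than the part of speech of each word; in those cases the sentence has to be parsed to obtain a full parse tree.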
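To make the difference between tagging and chunking concrete, here is a minimal sketch. The tokens are tagged by hand (an assumption made for illustration, so that no tagger model is needed), and the single NP rule is an illustrative pattern rather than a complete grammar.

```python
import nltk

# Hand-tagged tokens (assumed tags, for illustration only): a flat list
# of (word, tag) pairs with no phrase structure at all.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Illustrative rule: optional determiner, any number of adjectives, one noun.
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the text of every NP chunk found in the shallow tree.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]

print(tree)
print(noun_phrases)
```

The tagger output is only a sequence of labelled words; the chunker adds one level of grouping on top of it, which is why the result is a tree with NP nodes under a single root.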

PyRATA

Python nltk.RegexpParser() Examples

import nltk

# Note: the tokenizer and POS tagger need their NLTK models (see nltk.download()).


def prepareForNLP(text):
    # Split into sentences, tokenize each one, then POS-tag the tokens.
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


def chunk(sentence):
    # Alternative NP-only grammar, kept for reference; it is not used below.
    chunkToExtract = """
    NP: {<NNP>*}
        {<DT>?<JJ>?<NNS>}
        {<NN><NN>}"""
    # Cascaded grammar: each stage runs on the output of the previous one.
    grammar = r"""
    NP: {<DT|JJ|NN.*>+}            # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>}                 # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+}    # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>}             # Chunk NP, VP
            }<[\.VI].*>+{          # Chink any verbs, prepositions or periods
    """
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(sentence)
    print("result.label():", result.label())
    # subtrees() yields the root first, then every nested chunk.
    for subtree in result.subtrees():
        print(' '.join(word for word, pos in subtree.leaves()))


if __name__ == '__main__':
    example_sent = "A man with a red helmet stands on a small moped on a dirt road .".lower()
    sentences = prepareForNLP(example_sent)
    for sentence in sentences:
        chunk(sentence)

Output:

a man with a red helmet stands on a small moped on a dirt road .
a man
with a red helmet
a red helmet
stands on a small moped on a dirt road
on a small moped
a small moped
on a dirt road
a dirt road
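The listing above prints every subtree, including the root and the nested PP and VP chunks. If only the noun phrases are wanted, the subtrees can be filtered by label, and the tree can be flattened into IOB tags. The sketch below uses hand-assigned tags for the example sentence (an assumption, so that it runs without a tagger model) and only the NP rule from the grammar.

```python
import nltk
from nltk.chunk import tree2conlltags

# Example sentence with hand-assigned POS tags (assumed, for illustration).
tagged = [("a", "DT"), ("man", "NN"), ("with", "IN"), ("a", "DT"),
          ("red", "JJ"), ("helmet", "NN"), ("stands", "VBZ"), ("on", "IN"),
          ("a", "DT"), ("small", "JJ"), ("moped", "NN"), ("on", "IN"),
          ("a", "DT"), ("dirt", "NN"), ("road", "NN"), (".", ".")]

# Only the NP stage of the grammar is used here.
parser = nltk.RegexpParser(r"NP: {<DT|JJ|NN.*>+}")
tree = parser.parse(tagged)

# Keep only the subtrees labelled NP.
np_chunks = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# Flatten the tree into (word, tag, IOB-tag) triples.
iob = tree2conlltags(tree)

print(np_chunks)
for word, tag, chunk_tag in iob:
    print(word, tag, chunk_tag)
```

In the IOB view, B-NP marks the first word of a noun phrase, I-NP a word inside one, and O a word outside any chunk, which is the usual format for training and evaluating chunkers.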

Related links

Noun phrase chunking:

NP Chunking (State of the art)

Part-of-speech tag sets

English tag set
