[요약정리]딥 러닝을 이용한 자연어 처리 입문(Ch10. 영어 Word2Vec 실습 #2)

02 Mar 2020 in Data on NLP

Word2Vec을 배웠으니 실습을 해보도록 한다.

참고 사이트 : 딥 러닝을 이용한 자연어 처리 입문 에 대한 글을 요약하였다.

영어 Word2Vec 만들기

나의 훈련데이터로 임베딩 훈련을 시키는 것을 배우지만,
데이터가 부족할 경우에는 사전에 훈련된 워드 임베딩(pre-trained word embedding vector)를 가지고 와서 작업 할 수 있다.

1) 훈련 데이터 이해하기

<content>와 </content> 사이의 내용가져오기
내용 중에 (Laughing)이나 (Applause)와 같은 배경음도 제거하자
당연히 데이터마다 다를테니까 이해부터 하는게 중요하다

2) 데이터 로드 및 전처리 하기

2-1) 데이터 로드

import re
from lxml import etree
import urllib.request
import zipfile
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk.data
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Statssy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.

True

urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# 데이터 다운로드

with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
  target_text = etree.parse(z.open('ted_en-20160408.xml', 'r'))
  parse_text = '\n'.join(target_text.xpath('//content/text()'))
# xml 파일로부터 <content>와 </content> 사이의 내용만 가져온다.

parse_text[:300]
# 로드한 데이터 중 300개의 글자만 출력

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit"

2-2) 전처리 수행

content_text = re.sub(r'\([^)]*\)', '', parse_text)
# 정규 표현식의 sub 모듈을 통해 content 중간에 등장하는 (Audio), (Laughter) 등의 배경음 부분을 제거.
# 해당 코드는 괄호로 구성된 내용을 제거.

sent_text=sent_tokenize(content_text)
# 입력 코퍼스에 대해서 NLTK를 이용하여 문장 토큰화를 수행.

normalized_text = []
for string in sent_text:
     tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
     normalized_text.append(tokens)
# 각 문장에 대해서 구두점을 제거하고, 대문자를 소문자로 변환.

result = [word_tokenize(sentence) for sentence in normalized_text]
# 각 문장에 대해서 NLTK를 이용하여 단어 토큰화를 수행.

print('총 샘플의 개수 : {}'.format(len(result)))

총 샘플의 개수 : 273424

for line in result[:3]: # 샘플 3개만 출력
    print(line)

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']
['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing']

3) Word2Vec 훈련시키기

from gensim.models import Word2Vec
model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=0)

size = 워드 벡터의 특징 값. 즉, 임베딩 된 벡터의 차원.
window = 컨텍스트 윈도우 크기
min_count = 단어 최소 빈도 수 제한 (빈도가 적은 단어들은 학습하지 않는다.)
workers = 학습을 위한 프로세스 수
sg = 0은 CBOW, 1은 Skip-gram.

# man과 유사한 단어 출력
model_result = model.wv.most_similar("man")
print(model_result)

[('woman', 0.8638994097709656), ('guy', 0.8042402863502502), ('lady', 0.7918710708618164), ('boy', 0.7913432717323303), ('gentleman', 0.7436702847480774), ('girl', 0.7406295537948608), ('soldier', 0.7341784834861755), ('kid', 0.7041806578636169), ('poet', 0.6990624666213989), ('surgeon', 0.6646226048469543)]

4) Word2Vec 모델 저장하고 로드하기

from gensim.models import KeyedVectors
model.wv.save_word2vec_format('eng_w2v') # 모델 저장
loaded_model = KeyedVectors.load_word2vec_format("eng_w2v") # 모델 로드

model_result = loaded_model.most_similar("man")
print(model_result)

[('woman', 0.8638994097709656), ('guy', 0.8042402863502502), ('lady', 0.7918710708618164), ('boy', 0.7913432717323303), ('gentleman', 0.7436702847480774), ('girl', 0.7406295537948608), ('soldier', 0.7341784834861755), ('kid', 0.7041806578636169), ('poet', 0.6990624666213989), ('surgeon', 0.6646226048469543)]