AI_Deep Learning_Language Intelligence

AI_Python_Language Intelligence_textunderstanding

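import os                # for file and folder path handling
import re                # for regular expressions
import pandas as pd      # for tabular data structures
import tensorflow as tf  # TensorFlow, used for the download here and for modeling later
from tensorflow.keras import utils # utilities for downloading external data over the internet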

data_set = tf.keras.utils.get_file(
  fname = 'imdb.tar.gz',
  origin = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
  extract = True
)
Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 ━━━━━━━━━━━━━━━━━━━━ 5s 0us/step

data_set

/root/.keras/datasets/imdb.tar.gz
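
The loading code that follows assumes the extracted `aclImdb` folder sits next to the archive path printed above. As far as I know, the value returned by `get_file` with `extract=True` has not been the same across Keras releases (it may be the archive path or the extraction directory), so it is worth checking in your own environment. The helper below is a minimal sketch of such a check; `find_dataset_dir` is a hypothetical name, not part of Keras, and the demonstration uses a temporary folder instead of the real download.

```python
import os
import tempfile

def find_dataset_dir(returned_path, name="aclImdb"):
    """Return the first existing directory called `name` near `returned_path`.

    `returned_path` is whatever get_file returned: either the archive file
    or, in some versions, the extraction directory (an assumption to verify).
    """
    candidates = [
        os.path.join(returned_path, name),                   # extraction directory was returned
        os.path.join(os.path.dirname(returned_path), name),  # archive path was returned
    ]
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    raise FileNotFoundError(f"{name!r} not found near {returned_path!r}")

# Demonstration on a throwaway directory layout (no download required).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "aclImdb", "train"))
    archive = os.path.join(root, "imdb.tar.gz")  # stands in for the returned archive path
    found = find_dataset_dir(archive)
    found_matches = (found == os.path.join(root, "aclImdb"))
    print(found_matches)
```

With the real download, `find_dataset_dir(data_set)` could replace `os.path.join(os.path.dirname(data_set), "aclImdb")` in the loading step.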

def directory_data(directory):
  # Read every file in `directory` into a one-column DataFrame of review texts.
  data = {}
  data["review"] = []
  for file_path in os.listdir(directory):
    with open(os.path.join(directory, file_path), "r", encoding='utf-8') as file:
      data["review"].append(file.read())
  return pd.DataFrame.from_dict(data)

def data(directory):
  # Load the "pos" and "neg" subfolders and label them 1 and 0 respectively.
  pos_df = directory_data(os.path.join(directory, "pos"))
  neg_df = directory_data(os.path.join(directory, "neg"))
  pos_df["sentiment"] = 1
  neg_df["sentiment"] = 0
  return pd.concat([pos_df, neg_df])


train_df = data(os.path.join(os.path.dirname(data_set), "aclImdb", "train"))
test_df = data(os.path.join(os.path.dirname(data_set), "aclImdb", "test"))
train_df.head()
train_df.shape, test_df.shape
((25000, 2), (25000, 2))
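
Two quick checks are useful once the frames are built: whether the row index is unique, and how many rows carry each label. `pd.concat` without `ignore_index=True` keeps each input frame's own row labels, so the stacked frame has repeated index values. The sketch below illustrates this on tiny stand-in frames; the four reviews are made-up examples, not rows from the IMDB data.

```python
import pandas as pd

# Tiny stand-ins for pos_df and neg_df (hypothetical data, not from the IMDB set).
pos_df = pd.DataFrame({"review": ["great", "loved it"], "sentiment": 1})
neg_df = pd.DataFrame({"review": ["boring", "awful"], "sentiment": 0})

stacked = pd.concat([pos_df, neg_df])                        # index: 0, 1, 0, 1
renumbered = pd.concat([pos_df, neg_df], ignore_index=True)  # index: 0, 1, 2, 3

print(stacked.index.is_unique)     # False
print(renumbered.index.is_unique)  # True

# Class balance: number of rows per sentiment label.
label_counts = renumbered["sentiment"].value_counts().to_dict()
print(label_counts)
```

On the real data, `train_df["sentiment"].value_counts()` gives the same kind of summary, and `train_df.reset_index(drop=True)` renumbers the rows after the fact.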

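reviews = list(train_df['review'])
print(reviews[0])
print(reviews[1])

# Split each review on whitespace; punctuation stays attached to the words.
tokenized_reviews = [r.split() for r in reviews]
print(tokenized_reviews[0])
print(tokenized_reviews[1])

# Number of whitespace-separated words in each review.
len_review_by_words = [len(r) for r in tokenized_reviews]
print(len_review_by_words[:2])

# Number of characters in each review, not counting spaces.
len_review_by_alphabet = [len(s.replace(' ','')) for s in reviews]
print(len_review_by_alphabet[:2])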
Claire Denis's Chocolat is a beautiful but frustrating film. The film presents a very interesting look at the household of a European colonial family living in Cameroon, giving the viewer an informative perspective on the lives of many characters and their interaction. However, the development of these characters is often maddeningly insufficient. For example, a central theme in the story is young France's inability to form strong relationships with others. Although this portrayal is executed flawlessly, notably in the way that Denis frames the story with scenes from France's return to her childhood home, the girl's lack of intimacy with the film's other characters makes it difficult for a viewer to invest much interest in her development (or lack thereof) as a protagonist. The general stagnation of the film's character development makes it difficult to become engaged in the loosely organized plot. The film raises a great deal of tension between characters, particularly between Aimee and the men in her life, but never fully addresses this social friction, leaving the viewer unsatisfied. The final few scenes are powerful but depressing. Denis's work is certainly interesting from an intellectual and historical standpoint, but if you are looking for a film with adventure or drama, Chocolat is definitely not the best choice.
"Thieves and Liars" presents us with a very naturalistic depiction of the levels of corruption that affect many Puerto Ricans and force them to make difficult if not impossible choices about their and their loved ones' lives. The cast is excellent, considering that some are non-professional actors; an excellent choice that augments the level of reality in the film. The photography propels the story without intrusion, as it should be in this type of film. The script captures the idiosyncrasies and attitudes of the "Boricuas" in a very deep way. Sometimes it feels like you're watching a documentary! Watching this film you feel as if you've secretly entered the real Puerto Rican society and stand invisibly watching it implode. I loved it!
['Claire', "Denis's", 'Chocolat', 'is', 'a', 'beautiful', 'but', 'frustrating', 'film.', 'The', 'film', 'presents', 'a', 'very', 'interesting', 'look', 'at', 'the', 'household', 'of', 'a', 'European', 'colonial', 'family', 'living', 'in', 'Cameroon,', 'giving', 'the', 'viewer', 'an', 'informative', 'perspective', 'on', 'the', 'lives', 'of', 'many', 'characters', 'and', 'their', 'interaction.', 'However,', 'the', 'development', 'of', 'these', 'characters', 'is', 'often', 'maddeningly', 'insufficient.', 'For', 'example,', 'a', 'central', 'theme', 'in', 'the', 'story', 'is', 'young', "France's", 'inability', 'to', 'form', 'strong', 'relationships', 'with', 'others.', 'Although', 'this', 'portrayal', 'is', 'executed', 'flawlessly,', 'notably', 'in', 'the', 'way', 'that', 'Denis', 'frames', 'the', 'story', 'with', 'scenes', 'from', "France's", 'return', 'to', 'her', 'childhood', 'home,', 'the', "girl's", 'lack', 'of', 'intimacy', 'with', 'the', "film's", 'other', 'characters', 'makes', 'it', 'difficult', 'for', 'a', 'viewer', 'to', 'invest', 'much', 'interest', 'in', 'her', 'development', '(or', 'lack', 'thereof)', 'as', 'a', 'protagonist.', 'The', 'general', 'stagnation', 'of', 'the', "film's", 'character', 'development', 'makes', 'it', 'difficult', 'to', 'become', 'engaged', 'in', 'the', 'loosely', 'organized', 'plot.', 'The', 'film', 'raises', 'a', 'great', 'deal', 'of', 'tension', 'between', 'characters,', 'particularly', 'between', 'Aimee', 'and', 'the', 'men', 'in', 'her', 'life,', 'but', 'never', 'fully', 'addresses', 'this', 'social', 'friction,', 'leaving', 'the', 'viewer', 'unsatisfied.', 'The', 'final', 'few', 'scenes', 'are', 'powerful', 'but', 'depressing.', "Denis's", 'work', 'is', 'certainly', 'interesting', 'from', 'an', 'intellectual', 'and', 'historical', 'standpoint,', 'but', 'if', 'you', 'are', 'looking', 'for', 'a', 'film', 'with', 'adventure', 'or', 'drama,', 'Chocolat', 'is', 'definitely', 'not', 'the', 'best', 'choice.']
['"Thieves', 'and', 'Liars"', 'presents', 'us', 'with', 'a', 'very', 'naturalistic', 'depiction', 'of', 'the', 'levels', 'of', 'corruption', 'that', 'affect', 'many', 'Puerto', 'Ricans', 'and', 'force', 'them', 'to', 'make', 'difficult', 'if', 'not', 'impossible', 'choices', 'about', 'their', 'and', 'their', 'loved', "ones'", 'lives.', 'The', 'cast', 'is', 'excellent,', 'considering', 'that', 'some', 'are', 'non-professional', 'actors;', 'an', 'excellent', 'choice', 'that', 'augments', 'the', 'level', 'of', 'reality', 'in', 'the', 'film.', 'The', 'photography', 'propels', 'the', 'story', 'without', 'intrusion,', 'as', 'it', 'should', 'be', 'in', 'this', 'type', 'of', 'film.', 'The', 'script', 'captures', 'the', 'idiosyncrasies', 'and', 'attitudes', 'of', 'the', '"Boricuas"', 'in', 'a', 'very', 'deep', 'way.', 'Sometimes', 'it', 'feels', 'like', "you're", 'watching', 'a', 'documentary!', 'Watching', 'this', 'film', 'you', 'feel', 'as', 'if', "you've", 'secretly', 'entered', 'the', 'real', 'Puerto', 'Rican', 'society', 'and', 'stand', 'invisibly', 'watching', 'it', 'implode.', 'I', 'loved', 'it!']
[210, 122]
[1133, 624]
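
The token lists above show that `str.split()` leaves punctuation attached to words ('film.', 'Cameroon,', '(or'). The `re` module imported at the top can be used to clean the text first. The following is a minimal sketch, not the method used in this post: `clean_and_tokenize` is a hypothetical helper, and the sample string, including the `<br />` tags that web-scraped reviews can contain, is made up for illustration.

```python
import re

def clean_and_tokenize(text):
    """Drop HTML tags, lowercase, replace punctuation with spaces, then split."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove tags such as <br />
    text = re.sub(r"[^a-z0-9'\s]", " ", text.lower())  # keep letters, digits, apostrophes
    return text.split()

# Hypothetical sample, not an actual review from the dataset.
sample = 'Chocolat is a beautiful but frustrating film.<br /><br />I loved it!'
tokens = clean_and_tokenize(sample)
print(tokens)
```

Whether to lowercase or keep apostrophes is a design choice; word counts computed after this cleaning can differ slightly from the whitespace-only counts shown above.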

import matplotlib.pyplot as plt

plt.figure(figsize=(12,5))
# Red: review length in words. Blue: review length in characters (spaces excluded).
plt.hist(len_review_by_words, bins = 50, alpha = 0.5, color = 'r')
plt.hist(len_review_by_alphabet, bins = 50, alpha = 0.5, color = 'b')
# Log-scale the y-axis so that the rare, very long reviews remain visible.
plt.yscale('log', nonpositive='clip')
plt.title('Review Length Histogram')
plt.xlabel('Review Length')
plt.ylabel('Number of Reviews')
plt.show()


import numpy as np

print('Maximum number of words in a review:', np.max(len_review_by_words))
print('Minimum number of words in a review:', np.min(len_review_by_words))
print('Mean number of words per review:', np.mean(len_review_by_words))
print('Median number of words per review:', np.median(len_review_by_words))
print('Standard deviation of words per review:', np.std(len_review_by_words))
print('10th percentile of words per review:', np.percentile(len_review_by_words,10))
Maximum number of words in a review: 2470
Minimum number of words in a review: 10
Mean number of words per review: 233.7872
Median number of words per review: 174.0
Standard deviation of words per review: 173.72955740506563
10th percentile of words per review: 91.0
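
The mean is well above the median and the maximum is far out in the tail, so the distribution is right-skewed. One common use of such statistics is picking a cut-off length for padding or truncation. The sketch below shows one way to do that with an upper percentile and then report how many reviews the cut-off covers. The function names and the sample lengths are hypothetical, and the 0.9 fraction is an arbitrary choice for illustration.

```python
import numpy as np

def length_at_fraction(lengths, fraction):
    """Return the `fraction` quantile of `lengths`, rounded up to an integer."""
    return int(np.ceil(np.percentile(lengths, fraction * 100)))

def coverage(lengths, max_len):
    """Return the share of items whose length is at most `max_len`."""
    return float(np.mean(np.asarray(lengths) <= max_len))

# Hypothetical word counts standing in for len_review_by_words.
sample_lengths = [10, 91, 122, 174, 210, 233, 400, 2470]

max_len = length_at_fraction(sample_lengths, 0.9)
print(max_len, coverage(sample_lengths, max_len))
```

Because `np.percentile` interpolates between data points by default, the reported coverage can fall slightly below the requested fraction on small samples, which is why the sketch prints it instead of assuming it.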