
The following is a summary of notes from the Udemy course PyTorch: Deep Learning and Artificial Intelligence.

Embeddings

Text is sequence data, but it is not continuous (words are discrete symbols).

 

One-Hot Encoding?

Each word becomes a sparse vector that is all zeros except for a single 1.

Map each word to an index:

  a => [1, 0] (index 1)

  b => [0, 1] (index 2)

 

T words -> T one-hot encoded vectors of size V (vocab size)

i.e. a T x V matrix

 

One-hot data has no useful geometrical structure: every pair of distinct words is equally far apart.

 

Embeddings instead map each word to a dense D-dimensional vector (not one-hot encoded).
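A minimal sketch of the difference (the sizes here are made up for illustration):

import torch
import torch.nn as nn

V, D = 5, 3                   # vocab size, embedding dimension (illustrative)
word_idx = torch.tensor([2])  # index of some word

# one-hot: sparse vector of size V, no useful geometry
one_hot = torch.zeros(V)
one_hot[word_idx] = 1.0

# embedding: dense D-dimensional vector, learned during training
embed = nn.Embedding(V, D)
dense = embed(word_idx)       # shape (1, D)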


RNN with Embedding

 

D = embedding dimension -> hyperparameter

V = vocab size (# of unique words)

 

Embedding matrix is V x D

Each row is a D-dimensional vector for one word

 

 

Forward Function

out = self.embed(x)
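Putting it together, a minimal sketch of an RNN classifier with an embedding layer (the layer sizes and the LSTM choice are illustrative, not the exact lecture code):

import torch.nn as nn

class RNNClassifier(nn.Module):
  def __init__(self, n_vocab, embed_dim, hidden_dim, n_outputs):
    super().__init__()
    self.embed = nn.Embedding(n_vocab, embed_dim)  # the V x D embedding matrix
    self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
    self.fc = nn.Linear(hidden_dim, n_outputs)

  def forward(self, x):
    out = self.embed(x)           # (N, T) integer indices -> (N, T, D)
    out, _ = self.rnn(out)        # (N, T, hidden_dim)
    out = self.fc(out[:, -1, :])  # use the last hidden state -> (N, n_outputs)
    return out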

 

 

Text Preprocessing

Text Files

each word -> integer -> vector

 

Structured Text Files

CSV (comma-separated values)

document

 

 

Use pandas to read the CSV.

Each document contains multiple words -> split into individual words.

 

Multiple Words to Single Words (Tokenization)

string.split() covers a lot of cases

but it can't handle punctuation (e.g. a trailing '.'), which has to be removed
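For example (illustrative; a simple regex is one way to strip punctuation):

s = "I like eggs."
print(s.split())                          # ['I', 'like', 'eggs.']  <- 'eggs.' != 'eggs'

import re
print(re.findall(r"[a-z']+", s.lower()))  # ['i', 'like', 'eggs']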

 

Tokens to Integers

dataset = a long sequence of words

current_idx = 0
word2idx = {}
for word in dataset:
  if word not in word2idx:
    word2idx[word] = current_idx
    current_idx += 1

 

In practice, start current_idx at 2 instead of 0; the first indices are usually reserved:

1 = padding

0 = unknown (for words we could not learn, i.e. never seen during training)
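A sketch of the same loop with these reserved indices (the token names '<unk>' and '<pad>' are illustrative placeholders):

word2idx = {'<unk>': 0, '<pad>': 1}
current_idx = 2
for word in dataset:
  if word not in word2idx:
    word2idx[word] = current_idx
    current_idx += 1

# an unseen word at test time falls back to the unknown index
idx = word2idx.get('some_unseen_word', 0)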

 

Constant-Length Sequence

Pad the sequences so they all have the same length.

 

Pre-padding vs Post-padding

- Since a text-classification RNN reads the input from left to right (start -> end), pre-padding works better: the real tokens end up closest to the final hidden state (see the sketch below).

- It is challenging for RNNs to learn long-term dependencies!
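A minimal sketch of pre-padding, using the reserved padding index from above:

def pre_pad(seq, T, pad_idx=1):
  # left-pad (or truncate) so every sequence has length T
  return [pad_idx] * (T - len(seq)) + seq[-T:]

print(pre_pad([5, 8, 9], 5))   # [1, 1, 5, 8, 9] -> real tokens sit at the end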

 

Convert to CSV

Tokenization

Map each token to a unique integer

 

The task

Text classification (many-to-one)

Input: a sequence of words; output: a single label (spam or not spam)

 

Field Objects

import torchtext.data as ttd

TEXT = ttd.Field(
    sequential = True,
    batch_first = True,
    lower = True,
    pad_first = True
)

LABEL = ttd.Field(sequential = False, use_vocab = False, is_target = True)

 

TabularDataset Object

 

Split into train and test data

dataset.split()

 

build vocab

TEXT.build_vocab(train_dataset)

vocab = TEXT.vocab

 

vocab object

stoi (string-to-int; C-style naming): string -> integer

itos (int-to-string; the reverse mapping): integer -> string

 

 

stoi and itos

stoi is a dictionary: keys -> values

itos is a list: the index acts as the key -> value
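For example (the exact indices depend on the vocabulary that was built):

vocab.stoi['eggs']   # e.g. 2  (dictionary lookup: string -> integer)
vocab.itos[2]        # e.g. 'eggs'  (list indexing: integer -> string)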

 

PyTorch Text Preprocessing

import torch
import torch.nn as nn
import torchtext.data as ttd   # legacy torchtext API (moved to torchtext.legacy in torchtext >= 0.9)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# build a tiny toy dataset and save it as a CSV
data = {
    "label" : [0, 1, 1],
    "data"  : [
              "I like eggs and ham.",
              "Eggs I like!",
              "Ham and eggs or just ham?"
    ]
}
df = pd.DataFrame(data)
df.head()
df.to_csv("data.csv", index = False)

# Field objects describe how each column is preprocessed
TEXT = ttd.Field(
    sequential = True,
    batch_first = True,
    lower = True,
    tokenize = 'spacy',
    pad_first = True    # pre-padding
)

LABEL = ttd.Field(sequential = False, use_vocab = False, is_target = True)

dataset = ttd.TabularDataset(
    path = 'data.csv',
    format = 'csv',
    skip_header = True,
    fields = [('label', LABEL), ('data', TEXT)]
)
ex = dataset.examples[0]

# 66% train / 34% test
train_dataset, test_dataset = dataset.split(split_ratio = 0.66)

# build the vocabulary from the training set only
TEXT.build_vocab(train_dataset)
vocab = TEXT.vocab
vocab.stoi   # string -> integer (dictionary)
vocab.itos   # integer -> string (list)

# iterators batch, pad, and numericalize the examples
train_iter, test_iter = ttd.Iterator.splits(
    (train_dataset, test_dataset), sort_key = lambda x: len(x.data),
    batch_sizes = (2, 2), device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
)

# is_target = True on LABEL lets batches unpack into (inputs, targets)
for inputs, targets in train_iter:
  print("inputs:", inputs, "shape:", inputs.shape)
  print("targets:", targets, "shape:", targets.shape)
  break

for inputs, targets in test_iter:
  print("inputs:", inputs, "shape:", inputs.shape)
  print("targets:", targets, "shape:", targets.shape)
  break

 

CNNs for Text

sequence

1-D convolution

For images:

  2 spatial dimensions + 1 input feature dimension + 1 output feature dimension = 4 dimensions

For sequences:

  1 time dimension + 1 input feature dimension + 1 output feature dimension = 3 dimensions
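This can be verified directly from the filter shapes in PyTorch (the channel counts here are illustrative):

import torch.nn as nn

conv2d = nn.Conv2d(3, 8, kernel_size=3)
print(conv2d.weight.shape)   # torch.Size([8, 3, 3, 3]) -> 4 dimensions

conv1d = nn.Conv1d(3, 8, kernel_size=3)
print(conv1d.weight.shape)   # torch.Size([8, 3, 3]) -> 3 dimensions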

 

 

Warning: feature first

Recall that for images, a PyTorch CNN expects the input to be N x C x H x W

  "feature first"

whereas in TensorFlow / OpenCV / others, it's N x H x W x C

  "feature last"

The torchvision data generators hide this detail.

In NLP, the output of the embedding is N x T x D ("feature last"),

but nn.Conv1d() expects N x D x T as input! ("feature first")

So we must reshape before and after the convolutions.
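A quick shape check (illustrative sizes):

import torch

x = torch.randn(8, 20, 16)   # (N, T, D): feature last, as the embedding outputs
x = x.permute(0, 2, 1)       # (N, D, T): feature first, as nn.Conv1d expects
print(x.shape)               # torch.Size([8, 16, 20])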

 

 

Text Classification with CNNs

The output of the embedding is always (N, T, D)

nn.Conv1d expects (N, D, T)

 

 

=> out.permute(0,2,1)

 

 

Change it back after the convolutions:

out.permute(0,2,1)

 

CNNs also give good results on text.

import torch.nn.functional as F   # needed for F.relu

class CNN(nn.Module):
  def __init__(self, n_vocab, embed_dim, n_outputs):
    super(CNN, self).__init__()
    self.V = n_vocab      # vocab size
    self.D = embed_dim    # embedding dimension
    self.K = n_outputs    # number of output classes

    self.embed = nn.Embedding(self.V, self.D)

    # the conv stack operates on (N, D, T)
    self.conv1 = nn.Conv1d(self.D, 32, 3, padding = 1)
    self.pool1 = nn.MaxPool1d(2)
    self.conv2 = nn.Conv1d(32, 64, 3, padding = 1)
    self.pool2 = nn.MaxPool1d(2)
    self.conv3 = nn.Conv1d(64, 128, 3, padding = 1)

    self.fc = nn.Linear(128, self.K)

  def forward(self, X):
    out = self.embed(X)          # (N, T) -> (N, T, D)
    out = out.permute(0, 2, 1)   # (N, T, D) -> (N, D, T) for Conv1d
    out = self.conv1(out)
    out = F.relu(out)
    out = self.pool1(out)
    out = self.conv2(out)
    out = F.relu(out)
    out = self.pool2(out)
    out = self.conv3(out)
    out = F.relu(out)

    out = out.permute(0, 2, 1)   # back to (N, T', 128)

    out, _ = torch.max(out, 1)   # global max pooling over time
    out = self.fc(out)
    return out

 

Making Predictions with the Trained NLP Model

single_sentence = 'Our dating service has been asked 2 contact U by someone shy!'
toks = TEXT.preprocess(single_sentence)   # tokenize the raw string
sent_idx = TEXT.numericalize([toks])      # tokens -> integer indices
model(sent_idx.to(device))                # assumes the trained model and device from earlier

 
