
Machine Learning-9

2020. 11. 21. 11:36


Training a Keras model

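import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
np.random.seed(1234) # fix the NumPy seed so repeated runs are more reproducible

# Training data (y = 2x), one feature per row
x_train = np.array([[1], [2], [3], [4]], dtype=float)
y_train = np.array([2, 4, 6, 8], dtype=float)

# Build the model
model = Sequential()
model.add(Dense(1, input_dim=1)) # output layer: 1 input node, 1 output node
# A single Dense layer takes one input value and produces one output value.

# Configure how the model learns
#model.compile(loss='mse', optimizer='adam')
model.compile(loss='mse', optimizer=optimizers.Adam(learning_rate=0.001))
# A learning_rate that is too large can make training diverge; one that is too small makes convergence slow.
# As the loss falls, the predictions move closer to the targets.
# loss: the cost (loss) function to minimise; 'mse' is the mean squared error.
# optimizer: a gradient-descent variant; Adam is used here.
# Adam adapts the step size per parameter; its default learning_rate is 0.001.
# Accuracy is a classification metric, so it is not tracked for this regression model.

# Train the model
model.fit(x_train, y_train, epochs=20000) # 20000 epochs; with too few epochs the error stays large

# Predict with the trained model
y_predict = model.predict(np.array([[1], [2], [3], [4]], dtype=float))
print(y_predict)
# Predict for inputs outside the training range
y_predict = model.predict(np.array([[7], [8], [9], [100]], dtype=float))
print(y_predict)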

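Common Keras loss (cost) functions

The loss measures how far the predictions are from the true values: the larger the error, the larger the loss.

Mean-squared-error family -> regression (prediction); mse is the usual choice.

Cross-entropy family -> classification.

Multi-class classification uses a softmax output layer.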
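The two loss families above can be compared on a toy example. The following is a minimal NumPy sketch, not Keras's own implementation; the helper names `mse` and `binary_crossentropy` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared differences.
    return float(np.mean((y_true - y_pred) ** 2))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# Illustrative labels and two sets of predicted probabilities (made-up values).
y_true = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])
bad = np.array([0.4, 0.6, 0.3, 0.7])

print("mse:", mse(y_true, good), mse(y_true, bad))
print("bce:", binary_crossentropy(y_true, good), binary_crossentropy(y_true, bad))
```

Both losses are smaller for the better predictions; cross-entropy penalises confident wrong probabilities more heavily, which is one reason it is preferred for classification.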

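Defining the model

model = Sequential()

- In Keras, a model is created with the Sequential() class.

Hidden layer

model.add(Dense(30, input_dim=17, activation='relu'))

- A new layer is added to the model with the add() method.

- The structure of each added layer is specified through Dense().

- This layer has 30 output nodes, takes 17 input features, and uses relu as its activation function.

- The first Dense() serves as both the input layer and the first hidden layer.

Output layer

model.add(Dense(1, activation='sigmoid'))

- The output layer uses a sigmoid activation, so its output lies between 0 and 1 and is read as whether the lung-cancer patient survives (1 or 0).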

Configuring and running training

Configuring training

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

- The loss function is the mean squared error (mean_squared_error).

- The optimizer is adam.

- metrics=['accuracy'] makes Keras report the model's accuracy during training and evaluation.

Training the model

model.fit(X, Y, epochs=30, batch_size=10)

- One pass of the training process over every sample is called one epoch.

- epochs=30 means the full set of 470 samples is processed from start to finish 30 times.

- batch_size=10 means the 470 samples are fed in groups of 10, with one weight update per group.
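The relationship between samples, batch size, and epochs can be worked out with a little arithmetic. This is a small sketch using the numbers from the text (470 samples, batch size 10, 30 epochs); the variable names are illustrative.

```python
import math

n_samples = 470
batch_size = 10
epochs = 30

# One weight update happens per batch; the last batch may be smaller,
# which is why the division is rounded up.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 47 1410
```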

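sigmoid: the function best suited to binary (two-class) classification, because its output lies between 0 and 1.

relu function

relu is used in hidden layers because sigmoid gradients can shrink toward zero in deep networks (the vanishing-gradient problem).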
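To make the difference concrete, here is a minimal NumPy sketch of both activations and their derivatives. The input values are arbitrary, and the derivative of relu at exactly 0 is taken as 0 by convention in this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])

sig = sigmoid(z)
sig_grad = sig * (1.0 - sig)          # derivative of sigmoid, never above 0.25
relu_out = relu(z)
relu_grad = (z > 0).astype(float)     # derivative of relu: 1 for positive inputs

print("sigmoid     :", sig)
print("sigmoid grad:", sig_grad)
print("relu        :", relu_out)
print("relu grad   :", relu_grad)
```

The sigmoid gradient is at most 0.25 and close to zero for large inputs, so multiplying many such factors across layers shrinks the gradient; relu keeps a gradient of 1 for every positive input.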

# Predicting survival of lung-cancer surgery patients

# Load the required libraries
import numpy
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Fix the seeds so repeated runs give the same result
seed = 0
numpy.random.seed(seed)
tf.random.set_seed(seed)
# Load the prepared surgery-patient data
dataset = numpy.loadtxt("dataset/ThoraricSurgery.csv", delimiter=",")

X = dataset[:, 0:17] # all rows, columns 0-16 (17 features)
Y = dataset[:,17] # column 17: patient survival (0 or 1)

# Define the model
model = Sequential()
model.add(Dense(30, input_dim=17, activation='relu')) # hidden layer: 30 output nodes, 17 input features
# This first Dense layer acts as the input layer and the hidden layer together.
# With relu, a node passes a value on to the next layer only when its input is positive.
# The number of nodes in a hidden layer (30 here) is a free choice.
model.add(Dense(1, activation='sigmoid')) # output layer: 30 input nodes, 1 output node

# Configure and run training
# The 470 samples are processed in batches of 10, for 30 epochs.
# 'mean_squared_error' and 'mse' name the same loss.
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=30, batch_size=10) # train the model

# Evaluate the model: evaluate() returns [loss, accuracy]
ac = model.evaluate(X, Y)
print(type(ac))
print(ac)
print("\n Accuracy: %.4f" % ac[1])

Hidden layers can be added whenever they are needed.

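Linear Regression

The relationship between the variables is modelled by drawing a line (the regression line); that line is the model (the hypothesis).

Least squares

Root mean squared error (RMSE): the error is the actual value minus the predicted value, and RMSE is the square root of the mean of the squared errors.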
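As a quick check of the definition, RMSE can be computed in a few lines. This is a sketch with made-up values; the helper name `rmse` is an assumption, not a Keras function.

```python
import numpy as np

def rmse(y_true, y_pred):
    errors = y_true - y_pred                  # error = actual - predicted
    return float(np.sqrt(np.mean(errors ** 2)))

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])       # every prediction is off by 0.5

print(rmse(y_true, y_pred))  # 0.5
```

Because of the square root, RMSE is expressed in the same units as the target, which makes it easier to read than the raw mean squared error.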

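Gradient descent works like walking downhill along the slope:

training moves the weights in the direction that makes the loss smallest.

A small loss means the model has fit the training data well.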
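The downhill idea can be sketched directly for a line y = w * x + b. The following is a hand-written gradient descent loop in NumPy, assuming the toy data y = 2x and a learning rate of 0.05; it illustrates the principle and is not how Keras implements its optimizers.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = 0.0, 0.0            # start from an arbitrary guess
learning_rate = 0.05
losses = []

for step in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    losses.append(float(np.mean(error ** 2)))   # mean squared error

    # Gradients of the MSE with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)

    # Step against the gradient, i.e. downhill.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("w =", w, "b =", b)
print("first loss =", losses[0], "last loss =", losses[-1])
```

The loss starts at 30 and falls toward zero as w approaches 2 and b approaches 0.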

Cost function

Hypothesis

Common Keras loss (cost) functions (yt: actual value, yo: predicted value)

For regression, the mean squared error is used most often.

# Predicting Boston house prices
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import tensorflow as tf

# Fix the random seeds so repeated runs give the same result
seed = 0
np.random.seed(seed) # returns None, so the result must not be assigned back to seed
tf.random.set_seed(seed)

# Read the whitespace-separated data file
df = pd.read_csv("dataset/housing.csv", sep=r"\s+", header=None)

print(df.info()) # DataFrame summary: 506 rows, 14 columns
print(df.head()) # first 5 rows

dataset = df.values
X = dataset[:,0:13]
Y = dataset[:,13]
# Split the data into training and test sets
# test_size=0.3: 30% of the data is held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)
# Define the model
model = Sequential()
model.add(Dense(30, input_dim=13, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(1)) # regression: the output layer needs no activation function
# Configure training (accuracy is a classification metric, so it is omitted for regression)
model.compile(loss='mean_squared_error', optimizer='adam')
# Train the model
model.fit(X_train, Y_train, epochs=200, batch_size=10) # 200 epochs

# Compare predicted and actual values
# flatten(): converts the prediction array to 1-D
Y_prediction = model.predict(X_test).flatten()
for i in range(10): # first 10 of the test samples (30% of the 506 rows)
    label = Y_test[i]
    prediction = Y_prediction[i]
    print("Actual price: {:.3f}, predicted price: {:.3f}".format(label, prediction))

Logistic Regression

Logistic Regression is one of the standard classification algorithms.

Example: spam detection, Spam (1) or Ham (0)

The function that draws the S-shaped curve is the sigmoid function.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Pima Indians diabetes dataset
# Assign a name to each column while reading
df = pd.read_csv('dataset/pima-indians-diabetes.csv',
                 names=["pregnant", "plasma", "pressure", "thickness", "insulin", "BMI", "pedigree", "age", "class"])
# Check the first 5 rows
print(df.head(5))
print(df) # [768 rows x 9 columns]
# Check the column data types
print(df.info())
# Check the summary statistics
print(df.describe())

# Print the plasma glucose and class columns
print(df[['plasma', 'class']])
# Figure settings
colormap = plt.cm.gist_heat # colour map for the plot
plt.figure(figsize=(12,12)) # figure size
# Draw the correlations between columns as a heatmap
# vmax=0.5 caps the colour scale, so correlations of 0.5 or more get the brightest colour
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.5, cmap=colormap, linecolor='white', annot=True)
plt.show()
# Histograms of plasma glucose, one panel per class
grid = sns.FacetGrid(df, col='class')
grid.map(plt.hist, 'plasma', bins=10) # plasma: plasma glucose concentration
plt.show()

# Import the Keras functions needed for deep learning
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy
import pandas as pd
import tensorflow as tf
# Fix the seeds so repeated runs give the same result
numpy.random.seed(3)
tf.random.set_seed(3)
# Load the data
#dataset = numpy.loadtxt("/dataset/pima-indians-diabetes.csv", delimiter=",")
# (numpy.loadtxt returns a 2-D NumPy array)
#X = dataset[:,0:8] # 8 feature columns
#Y = dataset[:,8] # class: 0 or 1
# The file has no header row, so header=None keeps the first record as data
dataset = pd.read_csv("/dataset/pima-indians-diabetes.csv", delimiter=",", header=None)
X = dataset.iloc[:,0:8].values # 8 feature columns
Y = dataset.iloc[:,8].values # class: 0 or 1


# Define the model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu')) # 12 output nodes, 8 input nodes
model.add(Dense(8, activation='relu')) # hidden layer
# Extra hidden layers are optional.
# Two hidden layers often score higher than one, but this is not guaranteed.
model.add(Dense(1, activation='sigmoid')) # output layer: binary classification (sigmoid)
# Compile the model
model.compile(loss='binary_crossentropy', # loss for binary classification
              optimizer='adam',
              metrics=['accuracy'])
# Run training
# epochs sets how many passes are made over the data
# batch_size sets how many of the 768 samples go into each weight update
model.fit(X, Y, epochs=200, batch_size=10) # 200 epochs
# Print the result
print("\n Accuracy: %.4f" % (model.evaluate(X, Y)[1]))

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd
import numpy
import tensorflow as tf
import matplotlib.pyplot as plt
# Fix the seeds so repeated runs give the same result
seed = 0
numpy.random.seed(seed)
tf.random.set_seed(seed)


# Load the data
df = pd.read_csv('dataset/wine.csv', header=None)
print(df) # [6497 rows x 13 columns]
dataset = df.values # take only the values of the DataFrame
X = dataset[:,0:12] # the 12 wine feature columns (0-11)
Y = dataset[:,12] # column 12 (1: red wine, 0: white wine)
# Define the model
model = Sequential()
model.add(Dense(30, input_dim=12, activation='relu')) # input layer: 30 output nodes, 12 input nodes
model.add(Dense(12, activation='relu')) # hidden layer: 12 output nodes
model.add(Dense(8, activation='relu')) # hidden layer: 8 output nodes
model.add(Dense(1, activation='sigmoid')) # output layer: 1 output node (binary classification)

# Compile the model
model.compile(loss='binary_crossentropy', # loss for binary classification
              optimizer='adam',
              metrics=['accuracy']) # other metrics can be listed here as well
# Run training
model.fit(X, Y, epochs=200, batch_size=200) # number of epochs: 200
# Print the result
print("\n Accuracy: %.4f" % (model.evaluate(X, Y)[1]))

Softmax Function: multi-class classification (Multinomial Classification)

# The sizes of the input and output layers are fixed by the data:
# 4 input features and 3 output classes.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy
import tensorflow as tf

# Fix the seeds
seed = 0
numpy.random.seed(seed)
tf.random.set_seed(seed)
# Read the data
df = pd.read_csv('dataset/iris.csv', names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"])
print(df)

# Plot pairwise relationships, coloured by species
sns.pairplot(df, hue='species')
plt.show()
# Split into features and labels
dataset = df.values # take only the values of the DataFrame
x = dataset[:,0:4].astype(float) # convert the feature columns to float
y_obj = dataset[:,4] # y_obj holds the species labels
print(y_obj) # ['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' ...
# Convert the strings to integers
# array(['Iris-setosa','Iris-versicolor','Iris-virginica']) becomes array([0,1,2])
e = LabelEncoder() # encodes string labels as integers
e.fit(y_obj)
y = e.transform(y_obj)
print(y) # [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
# 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ]
# One-hot encoding
y_encoded = tf.keras.utils.to_categorical(y)
print(y_encoded) # [[1. 0. 0.] [1. 0. 0.] [1. 0. 0.] ...
# [[0. 1. 0.] [0. 1. 0.] [0. 1. 0.] ...
# [[0. 0. 1.] [0. 0. 1.] [0. 0. 1.] ]

# Define the model
model = Sequential()
model.add(Dense(16, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax')) # classify into 3 classes
# Compile the model
model.compile(loss='categorical_crossentropy', # loss for multi-class classification
              optimizer='adam',
              metrics=['accuracy'])
# Run training
model.fit(x, y_encoded, epochs=50, batch_size=1) # number of epochs: 50
# Print the result
print("\n Accuracy: %.4f" % (model.evaluate(x, y_encoded)[1]))

from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
# Load the MNIST data (training data, test data)
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Size of the training data: 60000 images
print(x_train.shape) # (60000, 28, 28)
print(y_train.shape) # (60000,)
# Size of the test data: 10000 images
print(x_test.shape) # (10000, 28, 28)
print(y_test.shape) # (10000,)
print('Training images: %d' % (x_train.shape[0])) # Training images: 60000
print('Test images: %d' % (x_test.shape[0])) # Test images: 10000

# Print the first image as an array (values 0 to 255)
print(x_train[0])
# Show the first image graphically
plt.imshow(x_train[0])
# plt.imshow(x_train[0], cmap='Greys') # greyscale image
plt.show()
# Print the label of the first image: 5
print(y_train[0])
# Show only the first 10 MNIST images
for i in range(10):
    plt.subplot(2, 5, i+1) # arrange the images in 2 rows and 5 columns
    plt.title("M_%d" % i)
    plt.axis("off")
    plt.imshow(x_train[i], cmap=None)
    # plt.imshow(x_train[i], cmap='Greys')
plt.show()

from tensorflow.keras.models import load_model

model = load_model('mnist_mlp_model.h5')

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
import tensorflow as tf
# Fix the seeds
seed = 0
np.random.seed(seed)
tf.random.set_seed(seed)
# 1. Create the dataset
# Load the training and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing: normalise the images ( 0 ~ 255 --> 0 ~ 1 )
# 60000 images, each flattened to 784 = 28 * 28 values
x_train = x_train.reshape(60000, 784).astype('float32') / 255.0
x_test = x_test.reshape(10000, 784).astype('float32') / 255.0
# One-hot encode the labels
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
print(y_train[0]) # one-hot encoding of the first MNIST image (5): [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
# 2. Build the model
model = Sequential()
model.add(Dense(64, input_dim=28*28, activation='relu')) # 28*28 inputs, 64 output nodes
model.add(Dense(10, activation='softmax')) # 64 input nodes, 10 output nodes (0 ~ 9)
# There are 10 classes, so the output layer has 10 nodes.
# 3. Configure training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# 4. Train the model
hist = model.fit(x_train, y_train, epochs=5, batch_size=32)
# 5. Inspect the training history
# Print the loss and accuracy of each of the 5 epochs
print('# train loss and accuracy #')
print(hist.history['loss'])
print(hist.history['accuracy'])
# 6. Evaluate the model
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=32)
print('# evaluation loss and metrics #')
print(loss_and_metrics)

# 7. Save the model
model.summary()
model.save("mnist_mlp_model.h5") # save the trained model

import numpy as np
import cv2
from tensorflow.keras.models import load_model
print("Loading model...")
model = load_model('mnist_mlp_model.h5') # load the trained model
# The model is already trained, so no further training is needed.
print("Loading complete!")
onDown = False
xprev, yprev = None, None

def onmouse(event, x, y, flags, params): # mouse event handler
    global onDown, img, xprev, yprev
    if event == cv2.EVENT_LBUTTONDOWN: # left button pressed
        # print("DOWN : {0}, {1}".format(x,y))
        onDown = True
    elif event == cv2.EVENT_MOUSEMOVE: # mouse moved
        if onDown == True:
            # print("MOVE : {0}, {1}".format(x,y))
            cv2.line(img, (xprev,yprev), (x,y), (255,255,255), 20)
    elif event == cv2.EVENT_LBUTTONUP: # left button released
        # print("UP : {0}, {1}".format(x,y))
        onDown = False
    xprev, yprev = x,y

cv2.namedWindow("image") # window title
cv2.setMouseCallback("image", onmouse) # register onmouse() as the mouse callback
width, height = 280, 280
img = np.zeros((280,280,3), np.uint8)

while True:
    cv2.imshow("image", img)
    key = cv2.waitKey(1)
    if key == ord('r'): # 'r' key: clear the canvas
        img = np.zeros((280,280,3), np.uint8)
        print("Clear.")
    if key == ord('s'): # 's' key: print the prediction
        x_resize = cv2.resize(img, dsize=(28,28), interpolation=cv2.INTER_AREA)
        x_gray = cv2.cvtColor(x_resize, cv2.COLOR_BGR2GRAY)
        x = x_gray.reshape(1, 28*28).astype('float32') / 255.0 # same scaling as the training data
        y = np.argmax(model.predict(x), axis=-1) # predict_classes() was removed in TensorFlow 2.6
        print(y) # predicted digit
    if key == ord('q'): # 'q' key: quit
        print("Good bye")
        break
cv2.destroyAllWindows() # close the window
# The predicted digit is printed to the console.

pip install opencv-python

Convolutional Neural Network (CNN)

Feature extraction

- Built by repeating a convolution layer followed by a pooling layer.

Classifier

- Dense layer + Dropout layer (a layer that guards against overfitting) + Dense layer (no dropout is applied after the last Dense layer).

Convolution layer

The convolution layer extracts features from the image with filters.

Filter

Max pooling layer

Flatten: converts the multi-dimensional feature maps into a 1-D vector.

Dropout: one way to avoid overfitting. It randomly switches off some of the network's nodes during training.
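Max pooling and flatten are easy to reproduce by hand. The sketch below assumes a 2x2 pool on a small made-up 4x4 feature map with even height and width; `max_pool_2x2` is an illustrative helper, not the Keras layer.

```python
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    # Group the pixels into 2x2 blocks, then keep the maximum of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 8],
])

pooled = max_pool_2x2(feature_map)   # 4x4 -> 2x2
flat = pooled.flatten()              # 2x2 -> vector of length 4

print(pooled)  # [[6 4]
               #  [7 9]]
print(flat)    # [6 4 7 9]
```

Pooling halves each spatial dimension while keeping the strongest response in every window, and flatten turns the result into the 1-D vector that a Dense layer expects.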

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
# Fix the seeds
seed = 0
np.random.seed(seed)
tf.random.set_seed(seed)

# 1. Create the dataset
# Load the training and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Preprocessing: normalise the images ( 0 ~ 255 --> 0 ~ 1 )
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1).astype('float32') / 255
# One-hot encode the labels
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

# 2. Build the model
model = Sequential()
# Convolution layers
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=(28, 28, 1), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
# Max pooling layer
model.add(MaxPooling2D(pool_size=2))
# Fully connected (classifier) part
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# 3. Configure training
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# 4. Run training
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=200, verbose=1)
# 5. Print the test accuracy
print("\n Test Accuracy: %.4f" % (model.evaluate(x_test, y_test)[1]))
# Loss on the test set
y_vloss = history.history['val_loss']
# Loss on the training set
y_loss = history.history['loss']

# 6. Plot the losses
x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, marker='.', c="red", label='Testset_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')
# Add a legend, a grid and axis labels
plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
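It can help to trace how the image size changes through the layers of the CNN above. The sketch below does that arithmetic by hand, assuming the Keras defaults used in the code (no padding, stride 1 for Conv2D); `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, kernel):
    # 'valid' convolution with stride 1 shrinks each side by kernel - 1.
    return size - kernel + 1

size = 28
size = conv_output_size(size, 3)      # after Conv2D(32, 3x3): 26
size = conv_output_size(size, 3)      # after Conv2D(64, 3x3): 24
size = size // 2                      # after MaxPooling2D(pool_size=2): 12

flattened = size * size * 64          # Flatten: 12 * 12 * 64 values
dense_params = flattened * 128 + 128  # weights + biases of Dense(128)

print(size, flattened, dense_params)  # 12 9216 1179776
```

Most of the parameters sit in the first Dense layer, which is one reason dropout is placed around it.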


Machine Learning-8

2020. 11. 21. 11:07


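Keras has been integrated into Google's TensorFlow as tf.keras.

Keras runs on TensorFlow as its backend, but programs written with it are much simpler to develop.

In that respect it is similar to scikit-learn.

TensorFlow 2 uses Keras as its high-level API.

Gradient descent: adjusting the weights so that the error becomes as small as possible.

The weight w is the slope of the line.

optimizer: the gradient-descent method to use; adam is a variant of stochastic gradient descent.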

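Classification

sigmoid - binary (two-class) classification

softmax - multi-class classification

The class with the highest probability is expressed in one-hot form.

Internally the outputs are probabilities, and the class with the highest probability is chosen.

one-hot encoding: the entry with the highest probability is set to 1 and all the others to 0.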
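The chain from raw scores to probabilities to a one-hot vector can be sketched in NumPy. The scores below are made-up values, and `softmax` is a hand-written helper rather than a library call.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

logits = np.array([2.0, 1.0, 0.1])      # illustrative raw scores for 3 classes
probs = softmax(logits)                 # probabilities that sum to 1

predicted_class = int(np.argmax(probs))         # index of the most probable class
one_hot = np.eye(len(probs))[predicted_class]   # 1 at that index, 0 elsewhere

print(probs)
print(predicted_class, one_hot)
```

argmax and one-hot encoding are inverse views of the same information: one gives the class index, the other a vector with a single 1 at that index.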

MNIST -> a dataset of handwritten digit images

- The MNIST dataset was built from digits handwritten by high-school students and Census Bureau employees, collected by NIST (the US National Institute of Standards and Technology).

- It contains 70,000 digit images, each labelled 0 through 9, and is one of the best-known datasets: anyone learning machine learning tends to try it at least once to compare their algorithm against others.

- Each image is 28 * 28 pixels.

- In the TensorFlow 1.x tutorial loader, the 70,000 images are divided into three sets.

- mnist.train is the training data, mnist.test is the test data for testing the trained model, and mnist.validation is the data used to check the model.

- The split is 55,000 training images, 10,000 test images, and 5,000 validation images.

from tensorflow.examples.tutorials.mnist import input_data # this loader exists only in TensorFlow 1.x
import matplotlib.pyplot as plt

# Download the digit image files into the MNIST_data folder.
# The labels are read in one-hot encoded form.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_xs, batch_ys = mnist.train.next_batch(1) # fetch a single sample
# batch_xs holds the image
# batch_ys holds the image's label
# print(batch_xs.shape)
# print(batch_ys.shape)
# print(batch_xs.reshape(28,28))
# print(batch_xs)
print(batch_ys)
plt.imshow(batch_xs.reshape(28,28), cmap='Greys') # one image consists of 28 * 28 pixels
plt.show()

from tensorflow.examples.tutorials.mnist import input_data # this loader exists only in TensorFlow 1.x
import random
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import matplotlib.pyplot as plt

# Without a fixed seed the result changes on every run.
tf.set_random_seed(777) # for reproducibility

# Check out https://www.tensorflow.org/get_started/mnist/beginners for
# more information about the mnist dataset
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) # downloads the images automatically
nb_classes = 10 # number of classes: the digits 0-9
# MNIST data image of shape 28 * 28 = 784
# Declare the placeholders and their types.
X = tf.placeholder(tf.float32, [None, 784]) # 784 columns (one per pixel)
# 0 - 9 digits recognition = 10 classes
Y = tf.placeholder(tf.float32, [None, nb_classes]) # 10 columns (one per class)

W = tf.Variable(tf.random_normal([784, nb_classes]))
b = tf.Variable(tf.random_normal([nb_classes]))
# Hypothesis (using softmax)
# Multi-class classification: softmax outputs a probability for each class.
hypothesis = tf.nn.softmax(tf.matmul(X, W) + b)
cost = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(hypothesis), axis=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)
# Test model
# argmax() returns the index of the largest value, i.e. the most probable class;
# it is the reverse of one-hot encoding.
# tf.argmax(hypothesis, 1): predicted class, tf.argmax(Y, 1): actual class
is_correct = tf.equal(tf.argmax(hypothesis, 1), tf.argmax(Y, 1))
# Calculate accuracy
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32)) # cast to float, then average
# parameters
training_epochs = 15 # number of passes over the training set
batch_size = 100 # read the data in chunks of 100 so that it fits in memory
# 55000 / 100 = 550 batches per epoch
# epoch: one full pass over the training data
print('Number of samples =', mnist.train.num_examples) # 55,000

# Each epoch runs 550 batches of 100 samples,
# so 15 * 550 updates are performed in total.
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        for i in range(total_batch): # 550 batches per epoch
            batch_xs, batch_ys = mnist.train.next_batch(batch_size) # fetch 100 samples at a time
            c, _ = sess.run([cost, optimizer], feed_dict={X: batch_xs, Y: batch_ys})
            avg_cost += c / total_batch
        # Report the average cost once per epoch
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))
    print("Learning finished")

    # Test the model using test sets
    print("Accuracy: ", accuracy.eval(session=sess, feed_dict={X: mnist.test.images, Y: mnist.test.labels}))
    # Get one and predict
    r = random.randint(0, mnist.test.num_examples - 1)
    print("Label: ", sess.run(tf.argmax(mnist.test.labels[r:r + 1], 1)))
    print("Prediction: ", sess.run(tf.argmax(hypothesis, 1), feed_dict={X: mnist.test.images[r:r + 1]}))
    # don't know why this makes Travis Build error.
    plt.imshow(
        mnist.test.images[r:r + 1].reshape(28, 28),
        cmap='Greys',
        interpolation='nearest')
    plt.show()

Visualising the graph with TensorBoard

Monitoring and debugging a machine learning algorithm can be demanding work, especially when you have to wait a long time for training to finish before seeing any result. To help with this, TensorFlow ships with TensorBoard, a tool for visualising the computation graph. With TensorBoard, important values (cost, accuracy, training time, and so on) can be plotted while training is still running.

If TensorBoard does not run properly

Method 1. Update all modules to their latest versions with conda.

c:\> conda update --all

Method 2. Try downgrading tensorboard to version 1.10.

C:\pythonwork\python3\tensorflow\src\ch05\log_dir

mkdir C:\project\python3\tensorflow\src\log_dir

cd C:\project\python3\tensorflow\src

tensorboard --logdir=log_dir <-- starts tensorboard

# There must be no spaces around the = in --logdir=log_dir, or the command fails.

Change into the parent folder of log_dir, not into log_dir itself, before running the command.

TensorBoard in PyCharm

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Declare constants
a = tf.constant(20, name="a")
b = tf.constant(30, name="b")
mul = a * b
# Create a session
sess = tf.Session()
# Use TensorBoard
# The folder (log_dir) that stores the TensorBoard logs is created.
tw = tf.summary.FileWriter("log_dir", graph=sess.graph)
# Run the session
print(sess.run(mul))

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Declare constants and a variable
a = tf.constant(100, name="a")
b = tf.constant(200, name="b")
c = tf.constant(300, name="c")
v = tf.Variable(0, name="v")
# Define a graph that computes a + b * c
calc = a + b * c
assign_op = tf.assign(v, calc) # assign the value of calc to the variable v
# Create a session
sess = tf.Session()
# Use TensorBoard
# The folder (log_dir) that stores the TensorBoard logs is created.
tw = tf.summary.FileWriter("log_dir", graph=sess.graph)
# Run the session
print(sess.run(assign_op))
print(sess.run(v))
tensorboard --logdir=log_dir

import pandas as pd
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Read the CSV file containing height, weight and label
csv = pd.read_csv("bmi.csv")
# Normalise the data
csv["height"] = csv["height"] / 200
csv["weight"] = csv["weight"] / 100
# Convert the labels to arrays
# - thin=(1,0,0) / normal=(0,1,0) / fat=(0,0,1)
bclass = {"thin": [1,0,0], "normal": [0,1,0], "fat": [0,0,1]}
csv["label_pat"] = csv["label"].apply(lambda x : np.array(bclass[x]))
# Set aside rows 15000-19999 as test data
test_csv = csv[15000:20000]
test_pat = test_csv[["weight","height"]]
test_ans = list(test_csv["label_pat"])

# Declare the placeholders
x = tf.placeholder(tf.float32, [None, 2]) # weight and height
y_ = tf.placeholder(tf.float32, [None, 3]) # correct labels
# Declare the variables
W = tf.Variable(tf.zeros([2, 3])) # weights
b = tf.Variable(tf.zeros([3])) # bias
# Define the softmax regression
y = tf.nn.softmax(tf.matmul(x, W) + b)
# Define the training step
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(cross_entropy)
# Compute the accuracy
predict = tf.equal(tf.argmax(y, 1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(predict, tf.float32))

# Start a session
sess = tf.Session()
# Use TensorBoard (added)
# The folder (log_dir) that stores the TensorBoard logs is created.
tw = tf.summary.FileWriter("log_dir", graph=sess.graph)
# Initialise the variables
sess.run(tf.global_variables_initializer())
# Train, 100 rows per step
for step in range(3500):
    i = (step * 100) % 14000
    rows = csv[1 + i : 1 + i + 100]
    x_pat = rows[["weight","height"]]
    y_ans = list(rows["label_pat"])
    fd = {x: x_pat, y_: y_ans}
    sess.run(train, feed_dict=fd)
    if step % 500 == 0:
        cre = sess.run(cross_entropy, feed_dict=fd)
        acc = sess.run(accuracy, feed_dict={x: test_pat, y_: test_ans})
        print("step=", step, "cre=", cre, "acc=", acc)

# Compute the final accuracy
acc = sess.run(accuracy, feed_dict={x: test_pat, y_: test_ans})
print("Accuracy =", acc)

For AND and OR gates, a single straight line can separate the points whose output is 1 from the rest.

For XOR, however, no single line can separate them.

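Multilayer perceptron

The most important basic unit of a neural network is the perceptron.

A perceptron is the smallest unit of a neural network: it combines its input values and passes the result through an activation function (e.g. sigmoid, ReLU) to produce the output handed on to the next unit.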
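XOR is the classic case where one perceptron is not enough but two layers are. The following sketch wires up a two-layer perceptron with hand-picked (hypothetical, not learned) weights and a simple step activation: the hidden layer computes OR and NAND, and the output unit takes the AND of the two.

```python
import numpy as np

def step(z):
    # Step activation: 1 if the weighted sum is positive, else 0.
    return (z > 0).astype(int)

def perceptron(x, w, b):
    return step(x @ w + b)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two perceptrons, each a single straight line in the input plane.
or_out = perceptron(inputs, np.array([1.0, 1.0]), -0.5)
nand_out = perceptron(inputs, np.array([-1.0, -1.0]), 1.5)
hidden = np.stack([or_out, nand_out], axis=1)

# Output layer: AND of the two hidden units gives XOR.
xor_out = perceptron(hidden, np.array([1.0, 1.0]), -1.5)

print(xor_out)  # [0 1 1 0]
```

Each hidden unit draws one line; combining the two lines carves out the region that a single line cannot.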

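How an artificial neural network works

An artificial neural network multiplies the input (x) by a weight (w), adds a bias (b), and passes the result through an activation function (sigmoid, ReLU, etc.) to produce the output y.

The optimisation process of adjusting w and b until the network produces the desired y is called learning or training.

y = sigmoid ( x * w + b )

(y: output, sigmoid: activation function, x: input, w: weight, b: bias)

Activation function

Neurons in the brain pass signals on across synapses; artificial neural networks imitate this with an activation function.

Dropout is one of the techniques used to deal with overfitting.

Accuracy can differ between a single-layer and a multi-layer network.

Node / unit

Using dropout can improve accuracy on unseen data. It is mainly used in multi-layer networks to reduce overfitting, typically with a rate of about 30%.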
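What dropout does to a layer's activations can be sketched in NumPy. This is the common "inverted dropout" formulation with a 30% rate, written by hand for illustration; the function name and the all-ones activations are assumptions, and Keras's Dropout layer should be used in real models.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # At inference time dropout does nothing.
    if not training:
        return activations
    keep_prob = 1.0 - rate
    # Randomly keep each unit with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 10))            # made-up activations, all equal to 1

dropped = dropout(hidden, rate=0.3, rng=rng)

print("fraction switched off:", (dropped == 0).mean())
print("mean activation      :", dropped.mean())
```

Roughly 30% of the units are zeroed on each pass, so the network cannot rely on any single node, while the rescaling keeps the average activation close to its original value.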

A small loss means training has gone well.

The goal is a prediction or classification whose error is as small as possible.

Classification model

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
# Define the training data
# Feature data recording whether an animal has fur and wings (1 if present, 0 if not)
# [fur, wings]
x_data = np.array(
    [[0, 0], [1, 0], [1, 1], [0, 0], [0, 0], [0, 1]])

# 1) Label data giving the class each sample actually belongs to.
# 2) It records whether each sample in the feature data above is a mammal, a bird,
#    or something else.
# 3) Data in the following form is called one-hot data.
# 4) One-hot encoding lists every possible value in an array, sets the element at the
#    index of the value being expressed to 1, and fills every other element with 0.
# 5) The classes to distinguish here are other, mammal and bird, so the array
#    is [other, mammal, bird].
# 6) The indices are other=0, mammal=1, bird=2, which gives the one-hot form below.
# [other, mammal, bird]
y_data = np.array([
    [1, 0, 0], # other
    [0, 1, 0], # mammal
    [0, 0, 1], # bird
    [1, 0, 0], # other
    [1, 0, 0], # other
    [0, 0, 1] # bird
])


####################
# Build the neural network model
####################
# The model learns the relationship between the features X and the labels Y.
# Placeholders for X and Y
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# The weights form a 2-D matrix.
# W has shape [input size (features), output size (labels)] -> [2, 3]
W = tf.Variable(tf.random_uniform([2, 3], -1., 1.)) # uniform distribution, so a range is needed: -1 to 1
# The bias has one element per output of the layer.
# Here that is the number of classes in the final result, 3.
b = tf.Variable(tf.zeros([3]))

# Apply the weights W and the bias b
L = tf.add(tf.matmul(X, W), b) # the network's only layer; there is no hidden layer yet
# Apply ReLU, one of TensorFlow's built-in activation functions,
# to the result computed from the weights and bias.
L = tf.nn.relu(L) # only positive values are passed on; negative values become 0
# A network becomes "deep learning" once hidden layers are added between input and output.
# Finally, softmax makes the output easier to use:
# it turns the values into probabilities that sum to 1.
# e.g. [8.04, 2.76, -6.52] -> [0.995 0.005 0.000]
model = tf.nn.softmax(L)

# Cost function used to optimise the network
# The losses are summed across the classes of each sample (axis=1) and then averaged over the samples.
# Without the axis option, reduce_sum collapses everything into one scalar total (-2.53 in the example below).
#        Y           model            Y * tf.log(model)      reduce_sum(axis=1)
# e.g. [[1 0 0]   [[0.1 0.7 0.2]  ->  [[-2.30  0     0]  ->  [-2.30, -0.22]
#       [0 1 0]]   [0.1 0.8 0.1]]      [ 0    -0.22  0]]
# (tf.log is the natural logarithm.)
# This measures the difference between the predicted and the actual probability distributions
# and is called the cross-entropy.
# Cross-entropy is the usual loss for multi-class classification.
cost = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(model), axis=1))
# Minimise the cost with gradient descent.
# Adam can be used instead for better performance.
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
# AdamOptimizer usually improves training performance
train_op = optimizer.minimize(cost)

####################
# Train the neural network model
####################
# Initialise the TensorFlow session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Train for 100 steps on the feature and label data defined above.
# If the accuracy comes out low, try more training steps.
# The learning rate usually has a large effect as well.
for step in range(100):
    # x_data: the features (fur, wings)
    # y_data: the actual labels, one of 3 classes
    sess.run(train_op, feed_dict={X: x_data, Y: y_data})
    # Print the loss once every 10 steps
    if (step + 1) % 10 == 0:
        print(step + 1, sess.run(cost, feed_dict={X: x_data, Y: y_data}))

##############################
# Check the result
# 0: other 1: mammal, 2: bird
#############################
# Code for checking what the model has learned
# 1) Printing model directly gives probabilities such as [0.2, 0.7, 0.1],
# 2) so argmax, which finds the index of the largest element, is used to get the label.
# 3) tf.argmax is applied to both the predictions and the actual labels.
# 4) It reverses one-hot encoding.
# e.g. [[0 1 0] [1 0 0]] -> [1 0]
# [[0.2 0.7 0.1] [0.9 0.1 0.]] -> [1 0]
# The class with the highest probability is picked.
prediction = tf.argmax(model, 1) # predicted class
target = tf.argmax(Y, 1) # actual class
print('Predicted:', sess.run(prediction, feed_dict={X: x_data}))
print('Actual:', sess.run(target, feed_dict={Y: y_data}))

# Print the accuracy
# 1) Compare the predicted and actual classes for all training data with tf.equal(),
#    convert the resulting true/false values to 1 and 0 with tf.cast(),
#    and take the mean to get the accuracy.
# 2) Running the program shows the loss falling steadily during training.
#    Disappointingly, however, the accuracy does not rise much however long it trains.
#    The reason is that the network has only one layer; adding a hidden layer
#    raises the accuracy.
is_correct = tf.equal(prediction, target)
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
print('Accuracy: %.2f' % sess.run(accuracy * 100, feed_dict={X: x_data, Y: y_data}))
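The argmax, equal, cast, mean recipe described in the comments above does not depend on TensorFlow. Here is the same calculation as a NumPy sketch on three made-up samples; the probability and label values are assumptions for illustration.

```python
import numpy as np

# Made-up model outputs (probabilities) and one-hot labels for 3 samples.
probs = np.array([[0.2, 0.7, 0.1],
                  [0.9, 0.1, 0.0],
                  [0.3, 0.3, 0.4]])
labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0]])

prediction = np.argmax(probs, axis=1)    # most probable class per sample
target = np.argmax(labels, axis=1)       # reverse of one-hot encoding

is_correct = prediction == target        # True / False per sample
accuracy = np.mean(is_correct.astype(np.float32))  # True -> 1, False -> 0, then average

print(prediction, target)   # [1 0 2] [1 0 1]
print(accuracy)             # 2 of 3 correct
```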

Implementing a deep neural network

# Build a neural network that classifies an animal as a mammal or a bird
# according to whether it has fur and wings.
# This time the network has several layers, which is what "deep learning" refers to.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
# Define the training data
# [fur, wings]
x_data = np.array(
    [[0, 0], [1, 0], [1, 1], [0, 0], [0, 0], [0, 1]])
# 1 if present, 0 if not

# One-hot encoding lists every possible value in an array, sets the element at the
# index of the value being expressed to 1, and fills every other element with 0.
# [other, mammal, bird] - one-hot encoding
y_data = np.array([
    [1, 0, 0], # other
    [0, 1, 0], # mammal
    [0, 0, 1], # bird
    [1, 0, 0], # other
    [1, 0, 0], # other
    [0, 0, 1] # bird
])


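#####################
# Build the neural network model
#####################
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# A multi-layer network is built by adding another set of weights and biases
# to the single-layer model above.
# Weights
# W1 = [2, 10] -> [features, hidden-layer neurons]
# W2 = [10, 3] -> [hidden-layer neurons, classes]
# Biases
# b1 = [10] -> hidden-layer neurons
# b2 = [3] -> classes
# The input and output sizes match the number of features and classes, and the
# sizes in between must match the neuron count of the adjoining layer.
# The part in between is the hidden layer; its neuron count is a hyperparameter,
# so the most suitable value is found by experiment.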

# The first weight matrix has shape [features, hidden-layer neurons] -> [2, 10].
W1 = tf.Variable(tf.random_uniform([2, 10], -1., 1.))
# The second weight matrix has shape [hidden-layer neurons, classes] -> [10, 3].
# W2 holds the weights of the output layer.
# The 3 is there because there are 3 classes.
W2 = tf.Variable(tf.random_uniform([10, 3], -1., 1.))
# Each bias has one element per output of its layer:
# b1 matches the hidden-layer neuron count, b2 the number of classes, 3.
# Both are initialised to 0.
b1 = tf.Variable(tf.zeros([10])) # hidden-layer neurons
b2 = tf.Variable(tf.zeros([3])) # classes
# Apply the weights W1 and bias b1 to form the hidden layer:
# the input features pass through the first weights, the bias and the activation function.
L1 = tf.add(tf.matmul(X, W1), b1) # input -> hidden layer
L1 = tf.nn.relu(L1)   # apply relu to the hidden layer's output
# Apply the second weights and bias to form the output layer and the final model.
# W2 [10, 3] and b2 [3] turn the hidden layer into 3 output values.
model = tf.add(tf.matmul(L1, W2), b2)
# Finally, write the loss function.
# TensorFlow's built-in cross-entropy function gives the cost function for
# optimisation without writing out the formula by hand.
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=model))
# AdamOptimizer is used as the optimiser.
# Accuracy and training speed can vary a lot with the optimiser; AdamOptimizer is
# generally reported to perform better than the GradientDescentOptimizer used earlier.
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = optimizer.minimize(cost)

#####################
# Train the neural network model
#####################
# Initialise the TensorFlow session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
# Train for 100 steps on the feature and label data defined above
for step in range(100):
    sess.run(train_op, feed_dict={X: x_data, Y: y_data})
    # Print the loss once every 10 steps
    if (step + 1) % 10 == 0:
        print(step + 1, sess.run(cost, feed_dict={X: x_data, Y: y_data}))


##############################
# Check the result
# 0: other 1: mammal, 2: bird
##############################
# Print the predicted and actual classes
prediction = tf.argmax(model, 1)
target = tf.argmax(Y, 1)
print('Predicted:', sess.run(prediction, feed_dict={X: x_data}))
print('Actual:', sess.run(target, feed_dict={Y: y_data}))
# Print the accuracy
is_correct = tf.equal(prediction, target)
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
print('Accuracy: %.2f' % sess.run(accuracy * 100, feed_dict={X: x_data, Y: y_data}))

#오차가 쭐어든다는 것은 예측 분류 능력이 좋아진다.
# 손실함수 적게 하는 것
# optimizer -adam
# 학습률
# 학습 회수
# hidder layer만 추가했다.

Google's playground

Google provides a web site where you can watch a neural network learn almost as if you could touch it:

http://playground.tensorflow.org

You can choose the learning rate, activation function, regularization, and problem type (classification or regression), change the number of hidden layers and the number of neurons, train the network, and see the output immediately.

The terms neuron, node, and unit are used interchangeably.

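Convolutional Neural Network (CNN)

The convolutional neural network (CNN) has been widely used since Professor Yann LeCun introduced it in 1998, and it performs especially well in image recognition.

It can be applied to various kinds of data, such as images and audio.

A plain neural network tends to train poorly as the number of layers grows; a CNN compensates by placing convolutional layers and pooling layers between the input and output layers.

The image is converted to an array, and a filter is slid across it to extract features: at each position the overlapping values are multiplied element by element and summed.

Convolution in a CNN

A convolution layer applies a mask (also called a filter, window, or kernel) to the input image to extract its features, for example a 3 * 3 filter applied to the input image.

Extracting features from the image with a filter is what convolution means here.

Max pooling -> keeps the largest value in each region.

The convolution layer extracts the image's features, which are then passed to the pooling layer.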
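
The filter and pooling steps described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names convolve2d and max_pool2d, the 5 x 5 image, and the 3 x 3 kernel are made up for this example, and the filter is applied without padding and with stride 1 (the sliding-window form used in deep learning, without flipping the kernel).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    At each position the overlapping values are multiplied
    element by element and summed into one output value.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value per size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Hypothetical 5 x 5 "image" (a checkerboard) and a 3 x 3 filter.
image = [[1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

features = convolve2d(image, kernel)    # 3 x 3 feature map
pooled = max_pool2d(features, size=2)   # 2 x 2 max pooling over the feature map
print(features)
print(pooled)
```

A 5 x 5 input with a 3 x 3 filter gives a 3 x 3 feature map; real CNN layers learn the kernel values during training instead of fixing them by hand.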

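Writing this in low-level TensorFlow is verbose.

With Keras it is simple.

An image is received as an array.

Dropout is one of the most commonly used techniques for preventing overfitting in deep learning.

Keras has two ways to build a model: Sequential and the Functional API.

TensorFlow 2.x has three: Sequential, the Functional API, and Subclassing.

TensorFlow is also available for JavaScript.

TensorFlow Core -> what we mainly use

TensorFlow.js -> runs in the browser

TensorFlow Lite -> for embedded systems such as Android smartphones

TensorFlow Extended (TFX) -> a platform for production ML pipelines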

<!doctype html>
<html>
<head>
<title>TF.js Test</title>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.6.1"></script>
<script type="text/javascript">
// Create a linear regression model
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));
// Prepare for training: set the loss function and the optimizer
model.compile({loss: 'meanSquaredError', optimizer: 'sgd'});
// Create the training data
const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);
// Train on the data
model.fit(xs, ys).then(() => {
// Run inference with the trained model
model.predict(tf.tensor2d([5], [1, 1])).print();
});
</script>
</head>
<body>
Check the console.
</body>
</html>

!pip install tensorflow==2.0

Jupyter Notebook

1. Create a virtual environment: a Python 3.6 environment (env) named tensorflow2

c:\> conda create -n tensorflow2 python=3.6

2. Activate the virtual environment

c:\> activate tensorflow2

3. Install TensorFlow (CPU build)

(tensorflow2) c:\> conda install tensorflow==2.0

4. Check the installed TensorFlow version

c:\> python

>>> import tensorflow as tf

>>> print (tf.__version__)

5. Install Jupyter Notebook

c:\> activate tensorflow2

(tensorflow2)c:\> conda install jupyter notebook

(tensorflow2)c:\> jupyter notebook # start Jupyter Notebook

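1. Enter the config-generation command (jupyter notebook --generate-config) in the Anaconda Prompt. If it succeeds, the location of the generated file is printed.

2. Open the config file in Notepad and, around line 266, set c.NotebookApp.notebook_dir = r'c:\users\<account directory>' (a raw string, so the backslashes are not read as escape sequences).

After saving the setting and restarting Jupyter Notebook, the new start directory is shown.

In Jupyter Notebook, the tensorflow2 environment has to be used separately.

Running Jupyter Notebook from a virtual environment changes where notebooks are saved and can make it slower.

So decide where the data should be stored.

For a simple setup, create a virtualenv environment.

Leave the checkbox unchecked.

Or load an environment that was created earlier.

conda -> use it when you need a different Python version; set the version and the file location.

Leave the box unchecked; then conda can be used from the console window.

conda env list -> prints the list of virtual environments that currently exist.

Installing TensorFlow 2.0 also installs Keras.

Colab: 1.5 series

TensorFlow 2.0 uses Keras as its built-in high-level API.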

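#import tensorflow.compat.v1 as tf
#tf.disable_v2_behavior()
import tensorflow as tf

tf.disable_v2_behavior()  # fails on TensorFlow 2.x: this function lives in tensorflow.compat.v1
# Declare a constant
hello = tf.constant('Hello, Tensorflow')
# Start a session
sess = tf.Session()
# Run the session
print(sess.run(hello))

Session belongs to TensorFlow 1.x. TensorFlow 2.x executes operations eagerly and has no top-level tf.Session, so the code above does not run there; import tensorflow.compat.v1 instead.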

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Declare a constant
hello = tf.constant('Hello, Tensorflow')
# Start a session
sess = tf.Session()
# Run the session
print(sess.run(hello))

import tensorflow as tf

print(tf.__version__)


# Declare a constant
hello = tf.constant('Hello, TensorFlow!')
print(hello) # prints the tensor, including its data type
# tf.Tensor(b'Hello, TensorFlow!', shape=(), dtype=string)
# numpy() converts the tensor's value to a NumPy data type
print(hello.numpy()) # prints the value
# b'Hello, TensorFlow!'
# decode('utf-8') converts the bytes object to str,
# so the leading b disappears
print(hello.numpy().decode('utf-8'))

# ASCII text is readable straight from numpy(), but Korean text shows up as escaped UTF-8 bytes.
# Declare a constant: Korean text
hi = tf.constant('안녕')

print(hi) #tf.Tensor(b'\xec\x95\x88\xeb\x85\x95', shape=(), dtype=string)
print(hi.numpy()) #b'\xec\x95\x88\xeb\x85\x95'

# decode('utf-8') converts the bytes object to str
print(hi.numpy().decode('utf-8')) # decode is required to get readable text
# 안녕

import tensorflow as tf
# Define constants
a = tf.constant(2)
b = tf.constant(3)
c = a+b
# Print the results
print(a.numpy()) # value of constant a
print(b.numpy())
print(c.numpy())
print(a.numpy()+b.numpy())

# Constants
import tensorflow as tf

# Declare constants
a = tf.constant(2)
b = tf.constant(3)
c = tf.constant(4)

# Define the operations
cal1 = a + b * c
cal2 = (a + b) * c
# Print the results
print(cal1.numpy())
print(cal2.numpy())

import tensorflow as tf
node1 = tf.constant(3.0, tf.float32)
node2 = tf.constant(4.0)
# Operations
node3 = tf.add(node1, node2)
node4 = tf.multiply(node1, node2)
# Print the results
print(node1.numpy())
print(node2.numpy())
print(node3.numpy())
print(node4.numpy())

import tensorflow as tf
# Declare constants
x1 = tf.constant([1, 2, 3, 4])
x2 = tf.constant([5, 6, 7, 8])
# Element-wise operations
cal1 = tf.add(x1, x2) # addition
cal2 = tf.subtract(x1, x2) # subtraction
cal3 = tf.multiply(x1, x2) # multiplication
cal4 = tf.divide(x1, x2) # division (true division, so the result is floating point)
# Print the results
print(cal1.numpy())
print(cal2.numpy())
print(cal3.numpy())
print(cal4.numpy())

import tensorflow as tf
# Declare a variable
v1 = tf.Variable(50) # initial value of v1 is 50
# Print the variable
print('v1=', v1) # v1= <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=50>
print('v1=', v1.numpy()) # v1= 50

import tensorflow as tf
# Declare variables
v1 = tf.Variable(50) # initial value of v1 is 50
v2 = tf.Variable([1,2,3]) # rank: 1, shape: (3,)
v3 = tf.Variable([[1],[2]]) # rank: 2, shape: (2,1)
# Print the variables
print('v1=', v1.numpy()) # v1= 50
print('v2=', v2.numpy()) # v2= [1 2 3]
print('v3=', v3.numpy()) # v3= [[1] [2]]

import tensorflow as tf
# Declare variables
x = tf.Variable([[2, 2, 2],[2, 2, 2]]) # rank: 2, shape (2,3): 2 rows, 3 columns
y = tf.Variable([[3, 3, 3],[3, 3, 3]]) # rank: 2, shape (2,3): 2 rows, 3 columns
# Operations
z1 = tf.add(x, y) # addition
z2 = tf.subtract(x, y) # subtraction
z3 = tf.multiply(x, y) # element-wise multiplication
z4 = tf.matmul(x, tf.transpose(y)) # matrix multiplication (tf.transpose() gives the transpose)
z5 = tf.pow(x, 3) # cube
# Get the shapes
print(x.get_shape()) # (2, 3)
print(y.get_shape()) # (2, 3)
# Print the results
print(z1.numpy())
print(z2.numpy())
print(z3.numpy())
print(z4.numpy())
print(z5.numpy())

import tensorflow as tf
x = tf.Variable([[3., 3.]]) # shape: (1, 2), 1 row and 2 columns
y = tf.Variable([[2.],[2.]]) # shape: (2, 1), 2 rows and 1 column
mat = tf.matmul(x, y) # matrix multiplication
# Get the shapes
print(x.get_shape()) # shape: (1, 2)
print(y.get_shape()) # shape: (2, 1)
# Print the results
print(x.numpy())
print(y.numpy())
print(mat.numpy())

import tensorflow as tf
# Declare variables
v1 = tf.Variable(tf.zeros([2,3])) # [[ 0. 0. 0.] [ 0. 0. 0.]]
v2 = tf.Variable(tf.ones([2,3], tf.int32)) # [[1 1 1] [1 1 1]]
v3 = tf.Variable(tf.zeros_like(tf.ones([2,3]))) # [[ 0. 0. 0.] [ 0. 0. 0.]]
v4 = tf.Variable(tf.fill([2,3], 2)) # [[2 2 2] [2 2 2]]
v5 = tf.Variable(tf.fill([2,3], 2.0)) # [[ 2. 2. 2.] [ 2. 2. 2.]]
# Print the variables
print(v1.numpy())
print(v2.numpy())
print(v3.numpy())
print(v4.numpy())
print(v5.numpy())

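Random numbers: drawn from a uniform or normal distribution; used for initial values and similar purposes.

In tf.random.uniform, the first argument is the shape, that is, how many random values to generate.

import tensorflow as tf
# Generate random numbers
a = tf.random.uniform([1], 0, 1) # one random number between 0 and 1
b = tf.random.uniform([1], 0, 10) # one random number between 0 and 10
print(a.numpy())
print(b.numpy())
# More random-number generation
# Normally distributed random numbers: mean -1, standard deviation 4
norm = tf.random.normal([2, 3], mean=-1, stddev=4)
print(norm.numpy())
# Shuffle the given values randomly (along the first dimension) with shuffle()
c = tf.constant([[1, 2], [3, 4], [5, 6]])
shuff = tf.random.shuffle(c)
print(shuff.numpy())
# Uniform random numbers: a 2 x 3 tensor of values between 0 and 3
unif = tf.random.uniform([2,3], minval=0, maxval=3)
print(unif.numpy())

Rank: the number of dimensions of a tensor. One level of brackets is rank 1, two levels rank 2, three levels rank 3.

Shape: the size of each dimension, for example how many rows and columns. -> Important when training on images.

Type: the data type of the tensor's elements.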
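
The bracket-counting rule for rank can be checked without TensorFlow. The sketch below is an illustration only: shape_of and rank_of are hypothetical helper names, and the nested lists are assumed to be regular (every sub-list at a level has the same length) and non-empty.

```python
def shape_of(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]  # assumes every sub-list has the same length
    return tuple(shape)


def rank_of(nested):
    """Rank is the number of dimensions, i.e. the number of bracket levels."""
    return len(shape_of(nested))


scalar = 50                                   # no brackets
vector = [1, 2, 3]                            # one level of brackets
matrix = [[1], [2]]                           # two levels
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # three levels

for value in (scalar, vector, matrix, cube):
    print(rank_of(value), shape_of(value))
```

The shapes printed for vector and matrix correspond to the v2 and v3 variables declared earlier.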

keras

Keras has been integrated into TensorFlow, so many of its features are available directly.

Setting up Keras

On Windows, with python-3.x or Anaconda3-5.x already installed, install tensorflow first and then keras.

CPU build

c:\> conda install keras

GPU build

c:\> conda install keras-gpu

Verify the Keras installation

c:\> python

>>> import keras

Using Tensorflow backend

Setting up Keras: installing Keras in PyCharm

Installing TensorFlow 2.0 also installs Keras.

from keras.models import Sequential

from keras.layers import Dense

How to import in TensorFlow 2.0

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense


Machine Learning-7

2020. 11. 20. 20:23


# Random numbers and variable declaration (TensorFlow 1.x style API)
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

print(tf.__version__)

# Uniformly distributed random values, stored in variables
a = tf.Variable(tf.random_uniform([1])) # one random number between 0 and 1
b = tf.Variable(tf.random_uniform([1], 0, 10)) # one random number between 0 and 10

# Create a session
sess = tf.Session()

# Initialize all variables; variables can only be used after initialization
sess.run(tf.global_variables_initializer())
print('a=', sess.run(a))
print('b=', sess.run(b))

# Normally distributed random numbers
norm = tf.random_normal([2, 3], mean=-1, stddev=4)
print(sess.run(norm))

# Shuffle randomly with random_shuffle()
c = tf.constant([[1,2],[3,4],[5,6]])
shuff = tf.random_shuffle(c)
print(sess.run(shuff))

# Uniform random numbers
unif = tf.random_uniform([2,3], minval=0, maxval=3)
print(sess.run(unif))

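placeholder

Use a placeholder when you want to fix only the data type up front and supply the actual values through a dictionary (feed_dict) at run time. It works like the slot in a format string.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A placeholder fixes only the data type; the values are supplied
# through a dictionary at run time.
# Integer or float types are the usual choice; others are rare.
node1 = tf.placeholder(tf.float32) # float32 placeholder (shape left unspecified)
node2 = tf.placeholder(tf.float32) # float32 placeholder (shape left unspecified)
add = node1 + node2
mul = node1 * node2
sess = tf.Session()

# Run the operations
# Functions such as tf.add and tf.multiply can be used instead of the operators.
print(sess.run(add, feed_dict={node1:3, node2:4.0}))
print(sess.run(mul, feed_dict={node1:3, node2:4.0}))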

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# Define a placeholder
a = tf.placeholder(tf.int32, [3]) # an array of 3 integers
# Define an operation that doubles every value in the array
b = tf.constant(2)
op = a * b
# Start a session
sess = tf.Session()
# Feed values into the placeholder and run
r1 = sess.run(op, feed_dict={ a: [1, 2, 4] })
r2 = sess.run(op, feed_dict={ a: [10, 20, 30] })
print(r1)
print(r2)

Variables must be initialized with tf.global_variables_initializer().

Placeholders do not need to be initialized.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# Define a placeholder
a = tf.placeholder(tf.int32, [None]) # array size given as None
# Define an operation that multiplies every value in the array by 10
# None means the size is not restricted.
b = tf.constant(10)
op = a * b
# Start a session
sess = tf.Session()
# Feed values into the placeholder and run
r1 = sess.run(op, feed_dict={a: [1,2,3,4,5]})
r2 = sess.run(op, feed_dict={a: [10,30]})
print(r1)
print(r2)

Rank: the number of dimensions of a tensor.

Shape: the size of each dimension, for example how many rows and columns.

Type: the data type of the tensor's elements.

A tensor is a multi-dimensional array.

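Linear regression

1. Regression analysis

2. LINEAR REGRESSION

3. HYPOTHESIS: an assumed model used to search for the answer, the lowest point (minimum cost), which is why it is called a hypothesis.

H(x) = Wx + b

This is the hypothesis when there is a single independent variable.

W: slope (in mathematics), weight (in machine learning)

b: intercept, bias

H: dependent variable, hypothesis

4. COST (error) -> minimized with gradient descent

5. Cost Function (error function)

6. GRADIENT DESCENT ALGORITHM

Linear Regression

Least squares -> choose the line that minimizes the squared error

b = mean of y – ( mean of x * slope a ) = mean(y) – ( mean(x) * a ) = 90.5 – ( 5 * 2.3 ) = 79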
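
The numbers in the intercept formula above can be reproduced with a few lines of plain Python. This sketch assumes the study-hours data used later in these notes (x = 2, 4, 6, 8 and y = 81, 93, 91, 97) and the standard least-squares slope formula a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2); the function name least_squares is made up for the example.

```python
import math


def least_squares(xs, ys):
    """Closed-form least-squares fit of y = a * x + b for one variable."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: covariance of x and y divided by the variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: mean(y) - mean(x) * a, as in the formula above.
    b = mean_y - mean_x * a
    return a, b


hours = [2, 4, 6, 8]        # hours studied
scores = [81, 93, 91, 97]   # scores

a, b = least_squares(hours, scores)
# Root mean squared error of the fitted line on the same data.
rmse = math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(hours, scores)) / len(hours))
print(a, b, rmse)
```

The slope comes out close to 2.3 and the intercept close to 79, matching the worked formula; gradient descent on the same data approaches the same values.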

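Root Mean Squared Error (RMSE)

error = actual value – predicted value

The slope at a point on the error curve is its instantaneous rate of change, and it is obtained by differentiation.

As the error approaches its minimum, the slope approaches zero.

Training proceeds by reducing the error.

This approach is called gradient descent.

The ideal error is 0, but in practice training looks for values that bring it close to 0.

Training continues until the error stops decreasing.

Gradient descent multiplies the slope by the learning rate to decide the next point.

Learning rate too large: the updates overshoot and diverge, and never converge to the minimum.

Learning rate too small: training takes very long and may not reach the minimum within the available steps.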
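
The effect of the learning rate can be seen with a small gradient descent loop written in plain Python. This is a sketch under assumptions, not the TensorFlow implementation: it uses the mean squared error as the cost, the study-hours data from these notes, a start at a = 0 and b = 0, and a made-up function name gradient_descent.

```python
def gradient_descent(xs, ys, lr, steps, a=0.0, b=0.0):
    """Fit y = a * x + b by gradient descent on the mean squared error."""
    n = len(xs)
    for _ in range(steps):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean((a*x + b - y)^2) with respect to a and b.
        grad_a = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Next point = current point - learning rate * slope.
        a -= lr * grad_a
        b -= lr * grad_b
    errors = [a * x + b - y for x, y in zip(xs, ys)]
    cost = sum(e * e for e in errors) / n
    return a, b, cost


hours = [2, 4, 6, 8]
scores = [81, 93, 91, 97]

# A moderate learning rate converges towards the least-squares solution.
a, b, cost = gradient_descent(hours, scores, lr=0.01, steps=5000)
print("lr=0.01:", a, b, cost)

# A learning rate that is too large overshoots: the cost grows instead of shrinking.
_, _, start_cost = gradient_descent(hours, scores, lr=0.1, steps=0)
_, _, diverged_cost = gradient_descent(hours, scores, lr=0.1, steps=50)
print("lr=0.1 :", start_cost, "->", diverged_cost)

# A learning rate that is too small is still far from the minimum after the same 5000 steps.
_, _, slow_cost = gradient_descent(hours, scores, lr=0.00001, steps=5000)
print("lr=1e-5:", slow_cost)
```

Only 50 steps are run for the large learning rate so the numbers stay finite; with more steps the values overflow and turn into nan, which is what the TensorFlow run reports for learning_rate = 0.1.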

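In the end, the goal is to minimize the prediction error; in machine learning the function that measures this error is called the cost function.

Gradient descent finds the w (slope, weight) that minimizes the cost function.

The model is a first-degree (linear) function.

Both the weight and the bias have to be found.

They are computed so that the error is minimized.

It is called gradient descent because learning moves in the direction where the slope goes down.

Moving down the y-axis (the cost) to reduce the error is gradient descent.

optimizer.minimize() -> trains so that the cost is minimized

The error depends on the learning rate, the training data, and so on.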

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Hours studied (x_data): 2 4 6 8
# Scores (y_data): 81 93 91 97
x_data = [2, 4, 6, 8] # hours studied
y_data = [81,93,91,97] # scores
# Choose the slope a and the y-intercept b at random.
# The slope starts between 0 and 10 and the y-intercept between 0 and 100.
# tf.random_uniform([1], ...) : generates one random number
a = tf.Variable(tf.random_uniform([1], 0, 10, dtype = tf.float64, seed = 0)) # slope
b = tf.Variable(tf.random_uniform([1], 0, 100, dtype = tf.float64, seed = 0)) # intercept
# The linear equation for y: y = ax + b
y = a * x_data + b
# Cost (mean squared error) in TensorFlow
cost = tf.reduce_mean(tf.square( y - y_data ))

# Learning rate
#learning_rate = 0.1 #step: 2000, cost = nan, slope a = nan, y-intercept b = nan (training diverged)
#learning_rate = 0.01 #step: 2000, cost = 8.3000, slope a = 2.2998, y-intercept b = 79.0011
# A learning rate that makes the error grow has to be reduced.
learning_rate = 0.0001
# A learning rate that is too large diverges; one that is too small also causes problems.
# The values fall short of the target because there is little data
# and training was not continued long enough.

# Find the values that minimize the cost: gradient descent,
# an algorithm that searches for the point of minimum error
gradient_decent = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# The Adam optimizer usually handles this a little better.
# An error (cost) of about 8.3 remains on this data.
# Training with TensorFlow
sess = tf.Session()
# Initialize variables
sess.run(tf.global_variables_initializer())
# Runs 2001 times (step 0 is included)
for step in range(2001):
    sess.run(gradient_decent)
    if step % 100 == 0: # print the result every 100 steps
        print("step: %d, cost = %.4f, slope a = %.4f, y-intercept b = %.4f"
        % (step, sess.run(cost), sess.run(a), sess.run(b)))

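# Training works well or poorly depending on the learning rate.

# The hypothesis (the formula used to predict) differs by task:
# single independent variable (linear regression): y = a * x_data + b
# binary classification: sigmoid function
# multi-class classification: softmax

The cost function also differs depending on whether the task is regression or classification, and whether the classification is binary or multi-class.

The goal is to reduce the slope until it reaches 0.

In practice it only gets close to 0.

The printed values let you follow this progress.

Choosing the learning rate appropriately makes it possible to minimize the cost.

The learning rate is used to drive the error to its minimum.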

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# input data
x_train = [ 1, 2, 3]
y_train = [10, 20, 30]
# tf.random_normal([1]) : generates one random number
W = tf.Variable(tf.random_normal([1]), name='weight') # slope, weight
b = tf.Variable(tf.random_normal([1]), name='bias') # intercept, bias
# Our hypothesis XW+b
hypothesis = x_train * W + b
# cost/loss function
# tf.square squares each value
# tf.reduce_mean takes the mean
cost = tf.reduce_mean(tf.square(hypothesis - y_train))
# Launch the graph in a session.
sess = tf.Session()

# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer())
# Minimize
# GradientDescentOptimizer implements gradient descent
# The gradient is the derivative of the cost with respect to the weights
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
# minimize() returns the training op that performs one update step
train = optimizer.minimize(cost)
# Fit the line
# Training proceeds by reducing the error.
for step in range(8001): # 8000 training steps (plus step 0)
    sess.run(train)
    if step % 100 == 0:
        print(step, 'cost=',sess.run(cost), 'weight=',sess.run(W), 'bias=',sess.run(b))
    # The data follow y = 10x, so the best fit is W: [10.], b: [0.]
# Keep training while the cost is still decreasing.
# Reaching a cost of 0 is the ideal.

(Linear regression implemented with placeholders)

x and y are not fixed values; they are defined with the placeholder function.

The values are supplied through feed_dict when the training op is run in the session.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Try to find values for W and b to compute y_data = W * x_data + b
# We know that W should be 1 and b should be 0
W = tf.Variable(tf.random_normal([1]), name='weight') # slope
b = tf.Variable(tf.random_normal([1]), name='bias') # intercept
# Now we can use X and Y in place of x_data and y_data
X = tf.placeholder(tf.float32, shape=[None]) # define the placeholders
Y = tf.placeholder(tf.float32, shape=[None]) # shape [None] accepts any number of values
# Our hypothesis XW+b
# Simple linear regression
hypothesis = X * W + b
# cost/loss function
cost = tf.reduce_mean(tf.square(hypothesis - Y))
# Minimize
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001) # changed from 0.01 to 0.001
train = optimizer.minimize(cost)

# Launch the graph in a session.
sess = tf.Session()
# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer()) # needed because W and b are variables
# Fit the line
for step in range(2001): # 2000 training steps (plus step 0)
    cost_val, W_val, b_val, _ = sess.run([cost, W, b, train],feed_dict={X: [1, 2, 3], Y: [1, 2, 3]})
    if step % 20 == 0:
        # print(step, cost_val, W_val, b_val)
        print(step, 'cost=',cost_val, 'weight=',W_val, 'bias=',b_val)
# Learns best fit W:[ 1.], b:[ 0]

Multi-Variable Linear Regression

y = a1 x1 + a2 x2 + b

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Data for x1, x2, and y
x1 = [2, 4, 6, 8] # hours studied
x2 = [0, 4, 2, 3] # number of private tutoring sessions
y_data = [81,93,91,97] # scores
# Choose the slopes a1, a2 and the y-intercept b at random.
# The slopes start between 0 and 10 and the y-intercept between 0 and 100.
# seed=0 makes the same random numbers come out on every run.
a1 = tf.Variable(tf.random_uniform([1], 0, 10, dtype=tf.float64, seed=0))
a2 = tf.Variable(tf.random_uniform([1], 0, 10, dtype=tf.float64, seed=0))
b = tf.Variable(tf.random_uniform([1], 0, 100, dtype=tf.float64, seed=0))
# The new equation
y = a1 * x1 + a2 * x2 + b

# RMSE (the cost function) in TensorFlow
# y_data: the actual data
# tf.reduce_mean: mean
# tf.square: square
# tf.sqrt: square root
rmse = tf.sqrt(tf.reduce_mean(tf.square( y - y_data )))
# Learning rate
learning_rate = 0.1
# Use gradient descent to find the values that minimize the RMSE (cost)
gradient_decent = tf.train.GradientDescentOptimizer(learning_rate).minimize(rmse)
# Training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) # initialize all variables
    for step in range(2001): # 2000 training steps (plus step 0)
        sess.run(gradient_decent)
        if step % 100 == 0:
            print("Epoch: %d, RMSE = %.4f, slope a1 = %.4f, slope a2 = %.4f, y-intercept b = %.4f"
            %(step, sess.run(rmse), sess.run(a1), sess.run(a2), sess.run(b)))

Taking the square root (RMSE instead of MSE) is optional: it does not move the location of the minimum, it only makes the reported cost value smaller.

Regression analysis

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Training data
x1_data = [73., 93., 89., 96., 73.] # quiz1
x2_data = [80., 88., 91., 98., 66.] # quiz2
x3_data = [75., 93., 90., 100., 70.] # midterm
y_data = [152., 185., 180., 196., 142.] # final
# placeholders for a tensor that will be always fed.
x1 = tf.placeholder(tf.float32) # define the placeholders
x2 = tf.placeholder(tf.float32)
x3 = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
w1 = tf.Variable(tf.random_normal([1]), name='weight1')
w2 = tf.Variable(tf.random_normal([1]), name='weight2')
w3 = tf.Variable(tf.random_normal([1]), name='weight3')
b = tf.Variable(tf.random_normal([1]), name='bias')
hypothesis = x1 * w1 + x2 * w2 + x3 * w3 + b

# cost/loss function
cost = tf.reduce_mean(tf.square(hypothesis - Y))
# Minimize. Need a very small learning rate for this data set
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.00003)
train = optimizer.minimize(cost)
# Launch the graph in a session.
sess = tf.Session() # create the session
# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer()) # initialize all variables
for step in range(10001): # 10000 training steps (plus step 0)
    cost_val, hy_val, _= sess.run([cost, hypothesis, train],
    feed_dict={x1: x1_data, x2: x2_data, x3: x3_data, Y: y_data}) # feed_dict is a dictionary
    if step % 100 == 0:
        print('step=', step, "Cost: ", cost_val, "Prediction:", hy_val)

There is no fixed rule for choosing the learning rate.

If the cost stays too large, adjust the learning rate.

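Logistic Regression

Logistic regression is one of the representative classification algorithms.

Examples of logistic regression:

Spam detection: Spam(1) or Ham(0) -> binary classification

Facebook feed: show(1) or hide(0)

Pass or fail by hours studied: pass(1) or fail(0) -> binary classification

Binary classification vs. multi-class classification

The label is 1 if the sample belongs to the class and 0 otherwise.

The hypothesis uses the sigmoid, and the cost function is slightly different from linear regression.

In binary classification the output passes through the sigmoid function and becomes a value between 0 and 1.

If it is greater than 0.5 the prediction is true; otherwise it is false.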
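
The sigmoid, the 0.5 threshold, and the cross-entropy cost used for binary classification can be written out in plain Python. This is a minimal sketch; the function names and the sample inputs are made up for the illustration, and the cost is the same binary cross-entropy formula that the TensorFlow code in these notes spells out with tf.log.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(prob, threshold=0.5):
    """Return 1 (true) if the probability exceeds the threshold, else 0 (false)."""
    return 1 if prob > threshold else 0


def binary_cross_entropy(labels, probs):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)


# Hypothetical raw model outputs (w * x + b) for four samples.
scores = [-2.0, -0.5, 0.5, 2.0]
labels = [0, 0, 1, 1]

probs = [sigmoid(z) for z in scores]
predictions = [predict(p) for p in probs]
print(probs)
print(predictions)
print(binary_cross_entropy(labels, probs))
```

Confident correct predictions contribute a small cost and confident wrong ones a large cost, which is why this cost suits classification better than the squared error.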

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Training data
x_data = [[1, 2], # [ hours studied, number of tutoring sessions ]
[2, 3],
[3, 1],
[4, 3],
[5, 3],
[6, 2]]
y_data = [[0], # 1: pass, 0: fail
[0],
[0],
[1],
[1],
[1]]
# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 2])
Y = tf.placeholder(tf.float32, shape=[None, 1])

# Generate normally distributed random numbers
# with shape [2, 1] (2 rows, 1 column)
W = tf.Variable(tf.random_normal([2, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid: tf.div(1., 1. + tf.exp(-(tf.matmul(X, W) + b))) : hypothesis (model)
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)
# sigmoid(x*w + b)
# The result is a value between 0 and 1.

# cost/loss function
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * tf.log(1 - hypothesis))
train = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(cost)
# Accuracy computation
# True if hypothesis>0.5 else False
# tf.cast() converts True to 1 and False to 0
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32) # predicted = 1 or 0
# Greater than 0.5 means pass, otherwise fail
# Y is the actual data, y_data
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32)) # mean

# Launch graph
with tf.Session() as sess:
    # Initialize TensorFlow variables
    sess.run(tf.global_variables_initializer())
    for step in range(10001):  # 10000 training steps (plus step 0)
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:  # print every 200 steps
            print('step =', step, 'cost =', cost_val)
        # Training ends when this loop finishes.

    # Accuracy report -> the final result
    h, c, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
    print('\nHypothesis: ', h, '\nPredicted: ', c, '\nAccuracy: ', a)

Multiple logistic regression

Used when there are two or more independent variables.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
# Set seed values so every run prints the same result
seed = 0
np.random.seed(seed)
tf.set_random_seed(seed)
# Training data
x_data = np.array([[2, 3],[4, 3],[6, 4],[8, 6],[10, 7],[12, 8],[14, 9]]) # 7 rows, 2 columns
y_data = np.array([0, 0, 0, 1, 1, 1,1]).reshape(7, 1) # 7 rows, 1 column
# Define the placeholders
X = tf.placeholder(tf.float64, shape=[None, 2]) # accepts any number of rows
Y = tf.placeholder(tf.float64, shape=[None, 1])

# Choose the slope a and the bias b at random.
a = tf.Variable(tf.random_uniform([2,1], dtype=tf.float64)) # random numbers with shape [2, 1]
b = tf.Variable(tf.random_uniform([1], dtype=tf.float64)) # one random number
# Set up the sigmoid equation for y
y = tf.sigmoid(tf.matmul(X, a) + b)
# Function that computes the error
loss = -tf.reduce_mean(Y * tf.log(y) + (1 - Y) * tf.log(1 - y))
# Learning rate
learning_rate=0.1
# Use gradient descent to find the values that minimize the error (cost)
gradient_decent = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
predicted = tf.cast(y > 0.5, dtype=tf.float64) # tf.cast() converts True to 1 and False to 0
# This threshold is what makes it a binary classification.
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float64))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(3001): # 3000 training steps (plus step 0)
        a_, b_, loss_, _ = sess.run([a, b, loss, gradient_decent], feed_dict={X: x_data, Y: y_data})
        if (i + 1) % 300 == 0:
            # Index into the arrays so plain scalars are passed to the format string
            print("step=%d, a1=%.4f, a2=%.4f, b=%.4f, loss=%.4f" % (i + 1, a_[0, 0], a_[1, 0], b_[0], loss_))

    # Hours studied, number of tutoring sessions, probability of passing
    new_x = np.array([7, 6]).reshape(1, 2) # [7, 6]: hours studied and number of tutoring sessions
    new_y = sess.run(y, feed_dict={X: new_x})
    print("Hours studied: %d, tutoring sessions: %d" % (new_x[0, 0], new_x[0, 1]))
    print("Probability of passing: %6.2f %%" % (new_y[0, 0]*100))

If the learning rate is too large, training diverges instead of converging.

Converging means the error settles at its minimum.

If the learning rate is too small, training takes so long that the minimum may not be reached.

Set the learning rate by adjusting it between these extremes.

Multi-class probabilities -> softmax function

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Set the seed so every run prints the same result
tf.set_random_seed(777)
x_data = [[1, 2, 1, 1], # each sample has 4 features
[2, 1, 3, 2],
[3, 1, 3, 4],
[4, 1, 5, 5],
[1, 7, 5, 5],
[1, 2, 5, 6],
[1, 6, 6, 6],
[1, 7, 7, 7]]
y_data = [[0, 0, 1], # each sample belongs to one of 3 classes (one-hot)
[0, 0, 1],
[0, 0, 1],
[0, 1, 0],
[0, 1, 0],
[0, 1, 0],
[1, 0, 0],
[1, 0, 0]]
# Given certain features, the sample is assigned to a class.

X = tf.placeholder("float", [None, 4]) # 4 columns
Y = tf.placeholder("float", [None, 3]) # 3 columns
nb_classes = 3
# Inputs to the softmax layer: 4, outputs: nb_classes = 3
W = tf.Variable(tf.random_normal([4, nb_classes]), name='weight')
b = tf.Variable(tf.random_normal([nb_classes]), name='bias')
# tf.nn.softmax computes softmax activations
# softmax = exp(logits) / reduce_sum(exp(logits), dim)
hypothesis = tf.nn.softmax(tf.matmul(X, W) + b)

# Cross entropy cost/loss (the error function)
cost = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(hypothesis), axis=1))

# Train with gradient descent so that the cost is minimized.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)
# Launch graph
with tf.Session() as sess: # the with block closes the session automatically
    sess.run(tf.global_variables_initializer()) # assign the initial values
    for step in range(2001): # 2000 training steps (plus step 0)
        sess.run(optimizer, feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:
            print(step, sess.run(cost, feed_dict={X: x_data, Y: y_data}))
    print('--------------')
    # Testing: predict the class of new samples
    a = sess.run(hypothesis, feed_dict={X: [[1, 11, 7, 9]]})
    print(a, sess.run(tf.argmax(a, 1)))
    print('--------------')

    b = sess.run(hypothesis, feed_dict={X: [[1, 3, 4, 3]]})
    print(b, sess.run(tf.argmax(b, 1)))  # argmax() returns the index of the largest probability (the class), not a one-hot vector
    print('--------------')
    c = sess.run(hypothesis, feed_dict={X: [[1, 1, 0, 1]]})
    print(c, sess.run(tf.argmax(c, 1)))
    print('--------------')
    all_probs = sess.run(hypothesis, feed_dict={X: [[1, 11, 7, 9], [1, 3, 4, 3], [1, 1, 0, 1]]})
    # Several samples can be passed at once.
    print(all_probs, sess.run(tf.argmax(all_probs, 1)))
# softmax expresses the result as probabilities; the class with the highest probability is chosen.
# The samples are classified into 3 classes.
# argmax picks the single most probable class.
# There are 4 features but 3 classes.
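
The difference between argmax and one-hot encoding is easy to mix up, so here is a small plain-Python sketch. The helper names argmax and one_hot and the sample probabilities are hypothetical; the point is only that argmax turns a probability vector into a class index, while one-hot encoding turns a class index into a vector.

```python
def argmax(values):
    """Index of the largest value (the predicted class)."""
    return max(range(len(values)), key=values.__getitem__)


def one_hot(index, num_classes):
    """Vector with 1 at the class index and 0 everywhere else."""
    return [1 if i == index else 0 for i in range(num_classes)]


probabilities = [0.1, 0.7, 0.2]      # hypothetical softmax output for 3 classes
predicted_class = argmax(probabilities)
print(predicted_class)               # class index
print(one_hot(predicted_class, 3))   # the same class as a one-hot vector
```

The labels in y_data above are stored in the one-hot form, and applying argmax to a one-hot row recovers the class index.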

Animal Classification

The last column holds a number from 0 to 6 (the animal class).

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


tf.set_random_seed(777) # for reproducibility
# Predicting animal type based on various features
xy = np.loadtxt('data-zoo.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1] # all rows, columns 1 to 16 (animal features)
y_data = xy[:, [-1]] # all rows, column 17 (animal class: 0 ~ 6)
print(x_data.shape, y_data.shape)
nb_classes = 7 # 0 ~ 6
X = tf.placeholder(tf.float32, [None, 16])
Y = tf.placeholder(tf.int32, [None, 1]) # 0 ~ 6
Y_one_hot = tf.one_hot(Y, nb_classes) # one hot; shape becomes [None, 1, 7]
print("one_hot", Y_one_hot)
Y_one_hot = tf.reshape(Y_one_hot, [-1, nb_classes]) # flatten to [None, 7]
print("reshape", Y_one_hot)

W = tf.Variable(tf.random_normal([16, nb_classes]), name='weight')
b = tf.Variable(tf.random_normal([nb_classes]), name='bias')
# tf.nn.softmax computes softmax activations
# softmax = exp(logits) / reduce_sum(exp(logits), dim)
logits = tf.matmul(X, W) + b
hypothesis = tf.nn.softmax(logits)
# Cross entropy cost/loss
cost_i = tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=Y_one_hot)
cost = tf.reduce_mean(cost_i)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(cost)
prediction = tf.argmax(hypothesis, 1) # index of the largest value in hypothesis
correct_prediction = tf.equal(prediction, tf.argmax(Y_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) # cast converts to float

# Launch graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        sess.run(optimizer, feed_dict={X: x_data, Y: y_data})
        if step % 100 == 0:
            loss, acc = sess.run([cost, accuracy], feed_dict={X: x_data, Y: y_data})
            print("Step: {:5}\tLoss: {:.3f}\tAcc: {:.2%}".format(step, loss, acc))
    # Let's see if we can predict
    pred = sess.run(prediction, feed_dict={X: x_data})
    # y_data: (N,1) = flatten => (N, ) matches pred.shape
    for p, y in zip(pred, y_data.flatten()):
        print("[{}] Prediction: {} True Y: {}".format(p == int(y), p, int(y)))

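The output is handled as probabilities.

Probability values are returned.

The softmax result gives one probability per class; with three classes, three probabilities come out.

The probabilities sum to 1.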
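
That the softmax outputs behave like probabilities can be checked with a short plain-Python version of the formula softmax = exp(logits) / sum(exp(logits)). This is a sketch: the function name and the sample logits are made up, and subtracting the largest logit before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow in exp() and leaves the result unchanged.
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs)
print(sum(probs))
print(probs.index(max(probs)))   # the most probable class
```

The largest logit always receives the largest probability, so taking the argmax of the probabilities gives the same class as taking the argmax of the logits.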

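With ReLU, only values above zero are passed on to the next layer.

Error (loss) functions

Mean-squared family

mean_squared_error: mean squared error, for ordinary regression

Cross-entropy family

Binary classification: binary_crossentropy

Multi-class classification: categorical_crossentropy

optimizer

sgd: stochastic gradient descent

adam: a later refinement of gradient descent that usually trains well

relu - the most widely used activation

For x < 0 the output is 0; for x > 0 the output is x itself.

Repeated differentiation through many layers causes the vanishing gradient problem; the ReLU function was introduced to address it.

A value is passed on only when x is greater than 0. In neural-network terms, only certain signals are propagated forward.

Backpropagation multiplies derivatives layer by layer; when those derivatives keep shrinking the gradient is lost, and ReLU addresses this.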
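
The vanishing gradient argument can be made concrete with a few lines of plain Python. This is a simplified sketch under assumptions: it looks only at the activation derivatives along one path through 10 layers (weights are ignored), it evaluates the sigmoid at 0 where its derivative is largest, and the function names are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """0 for negative inputs, the input itself for positive inputs."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


layers = 10
# Backpropagation multiplies one activation derivative per layer.
sigmoid_chain = sigmoid_grad(0.0) ** layers   # best case for sigmoid: 0.25 per layer
relu_chain = relu_grad(1.0) ** layers         # active ReLU units: 1 per layer

print(sigmoid_chain)
print(relu_chain)
```

Even in the best case the sigmoid factor shrinks the gradient by a factor of four per layer, while an active ReLU unit passes it through unchanged.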


Machine Learning-6

2020. 11. 19. 20:42


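Clustering

Supervised learning

Classification

Prediction

sklearn

tensorflow

keras

Supervised learning has answers (labels): it finds regularities for classification and prediction.

Clustering: no answer is given. Algorithms: k-means, DBSCAN, hierarchical.

Pattern classification: finding consistent patterns.

Clustering: there is no predefined answer, that is, no dependent variable for a given independent variable; similar items are grouped together.

Reinforcement learning: games are the typical example.

Learn from data, then predict.

Providing labels makes it supervised learning.

sigmoid: classifies into one of two classes

softmax: multi-class classification

k-means moves cluster centers iteratively; the data preprocessing is the hard part.

Dependent variable: the correct answer for the existing data.

Clustering is unsupervised learning, so no dependent variable is needed.

Normalization: rescale the data into the range 0 to 1 so the values become relative.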
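
Rescaling into the range 0 to 1 is usually done with min-max scaling, (value - min) / (max - min). The sketch below assumes that this is the normalization meant here; the function name min_max_scale is made up, and the sample values are the exam scores used elsewhere in these notes.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


scores = [81, 93, 91, 97]
scaled = min_max_scale(scores)
print(scaled)
```

After scaling, every feature lies on the same 0 to 1 range, so a feature with large raw numbers no longer dominates distance-based methods such as k-means or DBSCAN.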

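Overfitting can be mitigated with regularization.

Remedies for overfitting: regularization, or adding more data and retraining.

This applies to prediction and classification.

For clustering, the number of clusters is given, for example 5.

Nearby points are grouped together and the centers keep moving. This repeats until the centers stop moving, yielding 5 clusters labeled 0 to 4.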

DBSCAN clustering (density-based clustering) algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) separates clusters according to how densely the data are packed in space.

A point that has at least M points within radius R (epsilon) of itself is called a core point. -> The radius R is set when the model is created.

A point that is not a core point but has a core point within radius R is called a border point.

A point that is neither a core point nor a border point is called noise (or an outlier).
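
The three kinds of points can be labeled with a short plain-Python routine. This is only a sketch of the definitions above, not the full DBSCAN algorithm (it does not assign cluster numbers). The function name classify_points and the sample points are made up, and it assumes that a point counts itself among its neighbors, which is how scikit-learn's min_samples is defined.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each 2-D point as 'core', 'border', or 'noise'."""
    def neighbors(i):
        # Indices of all points within radius eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps]

    neighbor_lists = [neighbors(i) for i in range(len(points))]
    core = {i for i, ns in enumerate(neighbor_lists) if len(ns) >= min_pts}

    labels = []
    for i, ns in enumerate(neighbor_lists):
        if i in core:
            labels.append('core')
        elif any(j in core for j in ns):
            labels.append('border')   # not dense itself, but next to a core point
        else:
            labels.append('noise')    # neither core nor border
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),   # a dense square
          (2, 1),                           # hangs on to the square's edge
          (10, 10)]                         # far away from everything
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

Changing eps or min_pts changes which points count as core, which is why the results are sensitive to these two settings.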

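Advantages of DBSCAN

The number of clusters does not have to be fixed in advance.

Clusters of various shapes and sizes can be found.

Even irregularly shaped distributions can be clustered, based on density.

Outlier detection makes it possible to identify unneeded noise data.

Disadvantages of DBSCAN

It is sensitive to the chosen radius (epsilon).

To use DBSCAN it is important to choose an appropriate epsilon value.

By contrast, k-means clustering requires the number of clusters to be specified in advance.

DBSCAN finds clusters according to density.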

# Using the Seoul middle-school graduates' career-path dataset from the public
# School Info (학교알리미) data, cluster middle schools with similar high-school admission rates.

# Step 1. Prepare the data
# Load the basic libraries
import pandas as pd
import folium
# Seoul middle-school graduates' career-path dataset (public School Info data)
file_path = '2016_middle_shcool_graduates_report.xlsx'
df = pd.read_excel(file_path, header=0)
# IPython console display options
pd.set_option('display.width', None) # output width
pd.set_option('display.max_rows', 100) # maximum number of rows to print
pd.set_option('display.max_columns', 30) # maximum number of columns to print
pd.set_option('display.max_colwidth', 20) # maximum column width
pd.set_option('display.unicode.east_asian_width', True) # adjust width for East Asian characters
# Print the DataFrame's column names
#print(df.columns.values)

# Look at the data
#print(df.head())
#print('\n')
# Check the data types
#print(df.info())
#print('\n')
# Check the summary statistics
#print(df.describe())
#print('\n')

# Numeric columns are fine as they are, but the data types must be checked:
# string columns have to be converted to numbers.

# Show the locations on a map
msscool_map = folium.Map(location = [37.55,126.98] , tiles='Stamen Terrain' , zoom_start= 10 ) # location is the map center


# Mark each middle school's location with a CircleMarker
for name, lat, lng in zip(df.학교명, df.위도, df.경도): # school name, latitude, longitude
    folium.CircleMarker([lat,lng],
                        radius = 5,
                        color= 'brown',
                        fill=True,
                        fill_color ='coral',
                        fill_opacity=0.7,
                        popup=name).add_to(msscool_map)

msscool_map.save('seoul_mschool_location.html')


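# Encode the 지역 (district), 코드 (code), 유형 (type), and 주야 (day/night) columns as numbers.
# These are string columns that the model cannot use directly. Although the variables
# are named onehot_*, LabelEncoder performs label encoding (each category becomes
# an integer), not one-hot (dummy variable) encoding.
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder() # create the label encoder
onehot_location = label_encoder.fit_transform(df['지역']) # district name
onehot_code = label_encoder.fit_transform(df['코드']) # 3, 5, 9
onehot_type = label_encoder.fit_transform(df['유형']) # national, public, private
onehot_day = label_encoder.fit_transform(df['주야']) # day, night
# Assign the encoded results to new columns
df['location'] = onehot_location # district
df['code'] = onehot_code # code
df['type'] = onehot_type # type
df['day'] = onehot_day # day/night
#print(df.head())
# The integer codes are assigned according to the number of distinct values.
# No dummy variables are used; the categories become plain numbers.

# Import the clustering module from sklearn
from sklearn import cluster
# Analysis 1. Cluster by admission rates to science, foreign-language/international,
# and autonomous private high schools
# Select the features used in the analysis
#print('Analysis 1. Clustering by science, foreign-language/international, autonomous private admission rates')
columns_list = [10, 11, 14] # column index numbers
# 10 is the index of the 과학고 (science high school) column; three columns are selected and similar schools are clustered.
x = df.iloc[:, columns_list] # select the data

#print(x[:5])
#print('\n')
# Standardize the features
x = preprocessing.StandardScaler().fit(x).transform(x)
# Create the DBSCAN model
# eps=0.2 is the radius R used for the density calculation; min_samples=5 is the minimum number of points M
dbm = cluster.DBSCAN(eps=0.2, min_samples=5)
# Any point can become a core point.
# The result is sensitive to the radius.
# Fit the DBSCAN model
dbm.fit(x)

# Get the clustering result
# labels_ holds five distinct values (-1, 0, 1, 2, 3): noise (-1) plus four clusters
cluster_label = dbm.labels_
#print(cluster_label) # -1, 0, 1, 2, 3
#print('\n')
# Add a Cluster column to the DataFrame to store the result
df['Cluster'] = cluster_label
#print(df.head())
print(df[['과학고','외고_국제고','자사고','Cluster']])
print('\n')
# Group by cluster label and print each group (first 5 rows only)
grouped_cols = [1, 2, 4] + columns_list # 1: district, 2: school name, 4: type
grouped = df.groupby('Cluster')
# -1 marks noise points that fall outside every cluster; they are left out of the interpretation.
# Cluster 0: high admission rates to foreign-language/international and autonomous private high schools, but no science high school admissions.
# Cluster 1: only autonomous private high school admissions.
# Cluster 2: very high autonomous private admission rate, with some science and foreign-language/international admissions.
# Cluster 3: like cluster 0, no science high school admissions but some foreign-language/international and
# autonomous private admissions; the foreign-language/international rate is markedly lower than in cluster 0.

for key, group in grouped:
    print('* key :', key) # cluster label: -1, 0, 1, 2, 3 (five values, four clusters)
    print('* number :', len(group)) # number of schools in the cluster
    print(group.iloc[:, grouped_cols].head()) # print 5 rows
    print('\n')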

# Visualize the clusters on a map
colors = {-1:'gray', 0:'coral', 1:'blue', 2:'green', 3:'red', 4:'purple',
5:'orange', 6:'brown', 7:'firebrick', 8:'yellow', 9:'magenta', 10:'cyan'}
cluster_map = folium.Map(location=[37.55,126.98], tiles='Stamen Terrain',
zoom_start=12)

for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster):
    folium.CircleMarker([lat, lng],
    radius=5, # circle radius
    color=colors[clus], # outline color
    fill=True,
    fill_color=colors[clus], # fill color
    fill_opacity=0.7, # opacity
    popup=name
    ).add_to(cluster_map)
# Save the map as an HTML file
cluster_map.save('seoul_mschool_cluster.html')


# Analysis 2. Cluster by science, foreign-language/international, and autonomous private
# admission rates plus school type (national, public, private)
# Repeat the steps above for the x2 feature set
print('Analysis 2. Clustering by science, foreign-language/international, autonomous private admission rates and school type')
columns_list2 = [10, 11, 14, 23]
x2 = df.iloc[:, columns_list2]
print(x2[:5])
print('\n')
# Standardize the features
x2 = preprocessing.StandardScaler().fit(x2).transform(x2)
# Create the DBSCAN model
# eps=0.2 is the radius R used for the density calculation; min_samples=5 is the minimum number of points M
dbm2 = cluster.DBSCAN(eps=0.2, min_samples=5)
# Fit the DBSCAN model
dbm2.fit(x2)

# Add a Cluster2 column to the DataFrame to store the result
df['Cluster2'] = dbm2.labels_
# Group by cluster label and print each group (first 5 rows only)
grouped2_cols = [1, 2, 4] + columns_list2 # 1: district, 2: school name, 4: type
grouped2 = df.groupby('Cluster2')
for key, group in grouped2:
    print('* key :', key) # cluster labels: -1, 0 to 10
    print('* number :', len(group)) # number of schools in the cluster
    print(group.iloc[:, grouped2_cols].head()) # print 5 rows
    print('\n')

cluster2_map = folium.Map(location=[37.55,126.98], tiles='Stamen Terrain',
zoom_start=12)
for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster2):
    folium.CircleMarker([lat, lng],
    radius=5, # circle radius
    color=colors[clus], # outline color
    fill=True,
    fill_color=colors[clus], # fill color
    fill_opacity=0.7, # opacity
    popup=name
    ).add_to(cluster2_map)
# Save the map as an HTML file
cluster2_map.save('seoul_mschool_cluster2.html')


# Analysis 3. Cluster by science and foreign-language/international admission rates
# Repeat the steps above for the x3 feature set
print('Analysis 3. Clustering by science and foreign-language/international admission rates')
columns_list3 = [10, 11]
x3 = df.iloc[:, columns_list3]
print(x3[:5])
print('\n')

# Standardize the features
x3 = preprocessing.StandardScaler().fit(x3).transform(x3)
# Create the DBSCAN model
# eps=0.2 is the radius R used for the density calculation; min_samples=5 is the minimum number of points M
dbm3 = cluster.DBSCAN(eps=0.2, min_samples=5)
# Fit the DBSCAN model
dbm3.fit(x3)
# Add a Cluster3 column to the DataFrame to store the result
df['Cluster3'] = dbm3.labels_
# Group by cluster label and print each group (first 5 rows only)
grouped3_cols = [1, 2, 4] + columns_list3 # 1: district, 2: school name, 4: type
grouped3 = df.groupby('Cluster3')

for key, group in grouped3:
    print('* key :', key) # cluster labels: -1, 0 to 6
    print('* number :', len(group)) # number of schools in the cluster
    print(group.iloc[:, grouped3_cols].head()) # print 5 rows
    print('\n')


cluster3_map = folium.Map(location=[37.55,126.98], tiles='Stamen Terrain',
zoom_start=12)
for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster3):
    folium.CircleMarker([lat, lng],
    radius=5, # circle radius
    color=colors[clus], # outline color
    fill=True,
    fill_color=colors[clus], # fill color
    fill_opacity=0.7, # opacity
    popup=name
    ).add_to(cluster3_map)
# Save the map as an HTML file
cluster3_map.save('seoul_mschool_cluster3.html')

# 3D scatter plot of the Analysis 1 clusters
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# x was already standardized above, so this second scaling leaves it essentially unchanged
scaler_ss = StandardScaler().fit(x)

x_scaled_ss = scaler_ss.transform(x)
clusters_ss = dbm.fit_predict(x)


# Append the cluster labels as a fourth column to the right of x_scaled_ss.
# The array is named data so that the DataFrame df is not overwritten.
data = np.hstack([x_scaled_ss, clusters_ss.reshape(-1, 1)])

df_ft4 = data[data[:,3]==-1, :] # noise points (-1)
df_ft0 = data[data[:,3]==0, :] # cluster 0
df_ft1 = data[data[:,3]==1, :] # cluster 1
df_ft2 = data[data[:,3]==2, :] # cluster 2
df_ft3 = data[data[:,3]==3, :] # cluster 3

from mpl_toolkits.mplot3d import Axes3D # registers the 3D projection on older matplotlib versions
# scatter plot
fig = plt.figure( figsize=(10,10))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(df_ft4[:, 0],df_ft4[:, 1],df_ft4[:, 2],alpha=0.5, c=colors[-1],marker="x")
ax.scatter(df_ft0[:, 0],df_ft0[:, 1],df_ft0[:, 2],alpha=0.5, c=colors[0])
ax.scatter(df_ft1[:, 0],df_ft1[:, 1],df_ft1[:, 2],alpha=0.5, c=colors[1])
ax.scatter(df_ft2[:, 0],df_ft2[:, 1],df_ft2[:, 2],alpha=0.5, c=colors[2])
ax.scatter(df_ft3[:, 0],df_ft3[:, 1],df_ft3[:, 2],alpha=0.5, c=colors[3])
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()

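DBSCAN works as follows.

1. Pick a data point at random and find the data points within distance eps (epsilon) of it. The default distance is Euclidean.
2. If fewer than min_samples points are found, treat the point as noise; if at least min_samples are found, assign a new cluster label.
3. Find all neighbors within eps of the points assigned to the new cluster, and add any that have no cluster label yet to the current cluster.
4. When no more points can be added, repeat steps 1 to 3 for the data points that still have no cluster label.

Algorithm:

1. Select a data point at random.

2. Find all points within distance eps of that point.

2-1. If the number of points within eps is less than min_samples, label the point as noise, belonging to no cluster (it may later be relabeled if it turns out to lie within eps of a core point).

2-2. If the number of points within eps is at least min_samples, label the point as a core point and assign a new cluster label.

3. Examine all neighbors within eps of the core point from 2-2.

3-1. If a neighbor has not been assigned to any cluster yet, give it the cluster label just created.

3-2. If a neighbor is itself a core point, examine its neighbors in turn.

4. Continue until there are no more core points within eps.

minPts (min_samples) counts the core point itself.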
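
The steps above can be sketched as a short pure-Python function. This is a teaching sketch, not sklearn's implementation: the name dbscan, the NOISE constant, and the sample points are assumptions for the example, and points are visited in index order instead of at random so the output is reproducible.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE        # step 2-1; may become a border point later
            continue
        cluster_id += 1              # step 2-2: i is a core point, start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:                 # steps 3 and 4: expand until no core points remain
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id       # step 3-1
            j_neighbors = region(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # step 3-2: j is also a core point
    return labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (20.0, 20.0)]
labels = dbscan(pts, eps=1.0, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

As with sklearn's labels_ attribute, the isolated last point gets -1 and the two dense groups get labels 0 and 1.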


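TensorFlow

TensorFlow 1.x code can be run on TensorFlow 2.x with almost no changes: add the following two lines at the top of the 1.x file.

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior() # run 1.x-style code on 2.x

Features of TensorFlow

- Rich expressiveness through data flow graphs

- Runs in CPU or GPU mode without code changes

- Usable from idea testing all the way to production service

- Derivatives are computed automatically once the computation structure and the objective function are defined

- Supports Python and C++, and other languages can be supported through SWIG

Adding a virtual environment in PyCharm

Create a new environment with a name of your choice, for example tensorflow.

Leave the checkboxes unticked, including "Make available to all projects"; otherwise the previous environment is reused instead of a new one being created.

The environment should be independent and created separately.

Install the following packages in the new environment:

seaborn

matplotlib

scikit-learn

tensorflow (version 1.15)

The 1.x and 2.x APIs currently coexist; import the package to use it.

From version 2.x, Keras is built into TensorFlow.

TensorFlow 2.x is based on Keras.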

import tensorflow as tf
print(tf.__version__)
hello = tf.constant('Hello, TensorFlow!') # declare a constant
sess = tf.Session() # start a session (TensorFlow 1.x API)
print(sess.run(hello)) # run the session

Output:

b'Hello, TensorFlow!'

The b prefix in b'...' indicates a bytes literal.


# Define constants
a = tf.constant(2)
b = tf.constant(3)

c = a + b

# Printing a tensor shows the Tensor object, not its value
print(a)
print(b)
print(c)

# Create a session
sess = tf.Session()

# Print the results
print(sess.run(a)) # 2
print(sess.run(b)) # 3
print(sess.run(c)) # 5

# Define constants
a = tf.constant(2)
b = tf.constant(3)
c = tf.constant(4)

# Define operations
cal1 = a + b * c
cal2 = (a + b) * c

# Create a session
sess = tf.Session()

# Print the results
print(sess.run(cal1)) # 14
print(sess.run(cal2)) # 20

Operations can also be written with functions instead of operator symbols.

# Declare constants
node1 = tf.constant(3.0 , tf.float32)
node2 = tf.constant(4.0 , tf.float32)

# Define operations
node3 = tf.add(node1, node2)
node4 = tf.multiply(node1, node2)

# Create a session
sess = tf.Session()

# Print the results
# Session is used in TensorFlow 1.x but was removed from the default API in 2.x.
print(sess.run([node1, node2]))
print(sess.run([node3, node4]))

# Declare constants
x1 = tf.constant([1,2,3,4])
x2 = tf.constant([5,6,7,8])

# Element-wise arithmetic on the constants
cal1 = tf.add(x1,x2) # addition
cal2 = tf.subtract(x1,x2) # subtraction
cal3 = tf.multiply(x1, x2) # multiplication
cal4 = tf.divide(x1, x2) # division
# initialize the Session
sess = tf.Session()
# print the result
print(sess.run(cal1))
print(sess.run(cal2))
print(sess.run(cal3))
print(sess.run(cal4))

A constant is created with tf.constant.

A variable is created with tf.Variable.

TensorFlow variables

w = tf.Variable(<initial-value>, name=<optional-name>)

# Declare a variable
v = tf.Variable(50)

print(v)

# Create a session
sess = tf.Session()

#print(sess.run(v)) # raises an error here: the variable has not been initialized yet

# Initialize all variables
sess.run(tf.global_variables_initializer())
print(sess.run(v)) # 50

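# Declare the variables (definitions inferred from the outputs noted below)
v1 = tf.Variable(50)
v2 = tf.Variable([1, 2, 3])
v3 = tf.Variable([[1], [2]])

# Initialize all variables
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # the init operation must be run through the session
print('v1=', sess.run(v1)) # v1= 50
print('v2=', sess.run(v2)) # v2= [1 2 3]
print('v3=', sess.run(v3)) # v3= [[1] [2]]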

import tensorflow as tf
x = tf.Variable([[3., 3.]]) # shape: (1, 2), 1 row and 2 columns
y = tf.Variable([[2.],[2.]]) # shape: (2, 1), 2 rows and 1 column
mat = tf.matmul(x, y) # matrix multiplication
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
print(x.get_shape()) # (1, 2)
print(y.get_shape()) # (2, 1)
print('x=', sess.run(x))
print('y=', sess.run(y))
print('mat=', sess.run(mat)) # mat= [[12.]]
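
To make the shape rule behind tf.matmul concrete, here is a minimal pure-Python sketch of matrix multiplication on lists of rows. The helper name matmul is hypothetical and this is not how TensorFlow computes the product internally; it only shows why a (1, 2) matrix times a (2, 1) matrix gives a (1, 1) result.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError('inner dimensions must match')
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

x = [[3.0, 3.0]]      # shape (1, 2)
y = [[2.0], [2.0]]    # shape (2, 1)
print(matmul(x, y))   # [[12.0]]
print(matmul(y, x))   # [[6.0, 6.0], [6.0, 6.0]]
```

Swapping the operands changes the result shape from (1, 1) to (2, 2), so the order of the arguments matters.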

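tf.zeros -> creates a tensor filled with 0

tf.ones -> creates a tensor filled with 1

tf.zeros_like(tf.ones([2,3])) -> a 2x3 tensor of zeros

To create a tensor with the same shape as an existing one without spelling out the shape, use ones_like or zeros_like.

tf.fill([2,3], 2) creates a tensor with 2 rows and 3 columns; the second argument is the value used to fill it.

v3 = tf.Variable(tf.zeros_like(tf.ones([2,3]))) # the source tensor holds ones, the result holds zeros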


Machine Learning-5

2020. 11. 19. 20:32


# Import the basic libraries
import pandas as pd
import seaborn as sns
# Load the titanic dataset into a DataFrame with load_dataset
df = sns.load_dataset('titanic')
print(df) # [891 rows x 15 columns]
# Look at the data
print(df.head()) # first 5 rows
print('\n')
# IPython display setting: raise the number of printed columns to 15
pd.set_option('display.max_columns', 15)
print(df.head())
print('\n')

# Check the data types and drop the columns with many NaN values
df.info()
print('\n')
# Drop deck, which has many NaN values (only 203 valid values)
# Drop embark_town, which duplicates the information in embarked
# After dropping these 2 of the 15 columns, 13 column names are printed
rdf = df.drop(['deck', 'embark_town'], axis=1)
print(rdf.columns.values)
print('\n')
# ['survived' 'pclass' 'sex' 'age' 'sibsp' 'parch' 'fare' 'embarked' 'class'
# 'who' 'adult_male' 'alive' 'alone']


# The age column has 177 missing values.
# They could be replaced with the mean age, but here every row with a missing age is dropped,
# leaving the 714 passengers whose age is known (177 of the 891 rows are removed).
rdf = rdf.dropna(subset=['age'], how='any', axis=0)
print(len(rdf)) # 714
# embarked holds the first letter of the town where each passenger boarded.
# It has 2 missing values, which are replaced with the most frequent value.
# value_counts() and idxmax() give the most frequent boarding town: S
most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()
print(most_freq) # S : Southampton

# The most frequent value (top) of embarked in describe() is also S
print(rdf.describe(include='all'))
print('\n')
# Replace the missing embarked values with S using fillna()
rdf['embarked'] = rdf['embarked'].fillna(most_freq)

# Check the data types after cleaning
rdf.info()

# Select the columns used in the analysis
ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]
print(ndf.head())
print('\n')
# To apply KNN, convert the categorical sex and embarked columns to numbers.
# This is one-hot encoding: creating dummy variables so that the model
# can use categorical data as numbers.
# sex yields two dummy columns named female and male.
# concat() attaches the dummy columns to the existing DataFrame.
onehot_sex = pd.get_dummies(ndf['sex'])
ndf = pd.concat([ndf, onehot_sex], axis=1)
ndf.info()

# embarked yields three dummy columns; prefix='town' names them
# town_C, town_Q, and town_S
onehot_embarked = pd.get_dummies(ndf['embarked'], prefix='town')
ndf = pd.concat([ndf, onehot_embarked], axis=1)

# Drop the original sex and embarked columns
ndf.drop(['sex','embarked'], axis = 1, inplace = True)
print(ndf.head())

# Independent and dependent variables for training
x=ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male',
'town_C', 'town_Q', 'town_S']] # independent variables (x)
y=ndf['survived'] # dependent variable (y)
# Standardize the independent variables to remove
# the differences in scale between the columns
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
print('train data shape: ', x_train.shape) # (499, 9)
print('test data shape: ', x_test.shape) # (215, 9)

# Import the KNN classifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Create the KNN model (k=5)
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model on the train data
knn.fit(x_train, y_train)
# Predict (classify) y_hat for the test data
y_hat = knn.predict(x_test)
# Compare the first 10 predictions (y_hat) with the actual values (y_test): all 10 match (0: died, 1: survived)
print(y_hat[0:10]) # [0 0 1 0 0 1 1 1 0 0]
print(y_test.values[0:10]) # [0 0 1 0 0 1 1 1 0 0]
# Step 5. Evaluate the KNN model

# Confusion matrix
from sklearn import metrics
knn_matrix = metrics.confusion_matrix(y_test, y_hat)
print(knn_matrix)
# [[109 16]
# [ 25 65]]

# Rows are actual classes and columns are predicted classes.
# Taking died (0) as the positive class:
# TP (True Positive) : 109 of the 215 test passengers died and were classified as died
# FP (False Positive) : 25 survivors were wrongly classified as died
# FN (False Negative) : 16 passengers who died were wrongly classified as survivors
# TN (True Negative) : 65 survivors were classified correctly

# Evaluation metrics
knn_report = metrics.classification_report(y_test, y_hat)
print(knn_report)

# The f1-score summarizes a model's predictive power by combining precision and recall.
# Here the f1-score is 0.84 for died (0) and 0.76 for survived (1), so the model
# predicts the two classes unequally well. The overall accuracy is about 0.81.
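
The figures quoted from the classification report can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed above; the helper name class_metrics is an assumption made for this example.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and f1-score for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# KNN confusion matrix: rows = actual class, columns = predicted class
matrix = [[109, 16],
          [25, 65]]

# Class 0 (died): TP = 109, FP = 25, FN = 16
p0, r0, f1_0 = class_metrics(tp=matrix[0][0], fp=matrix[1][0], fn=matrix[0][1])
# Class 1 (survived): TP = 65, FP = 16, FN = 25
p1, r1, f1_1 = class_metrics(tp=matrix[1][1], fp=matrix[0][1], fn=matrix[1][0])
# Accuracy: correct predictions (the diagonal) over all test samples
accuracy = (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))

print(round(f1_0, 2), round(f1_1, 2), round(accuracy, 2))  # 0.84 0.76 0.81
```

The 0.81 figure is the accuracy (174 correct out of 215), which is a different quantity from the per-class f1-scores.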

Support Vector Machine (SVM)

Loading the titanic dataset provided by Seaborn

# Import the basic libraries
import pandas as pd
import seaborn as sns
# Load the titanic dataset into a DataFrame with load_dataset
df = sns.load_dataset('titanic')
print(df) # [891 rows x 15 columns]
# Look at the data
print(df.head()) # first 5 rows
print('\n')
# IPython display setting: raise the number of printed columns to 15
pd.set_option('display.max_columns', 15)
print(df.head())
print('\n')

# Check the data types and drop the columns with many NaN values
df.info()
print('\n')
# Drop deck, which has many NaN values (only 203 valid values)
# Drop embark_town, which duplicates the information in embarked
# After dropping these 2 of the 15 columns, 13 column names are printed
rdf = df.drop(['deck', 'embark_town'], axis=1)
print(rdf.columns.values)
print('\n')
# ['survived' 'pclass' 'sex' 'age' 'sibsp' 'parch' 'fare' 'embarked' 'class'
# 'who' 'adult_male' 'alive' 'alone']

# The age column has 177 missing values.
# They could be replaced with the mean age, but here every row with a missing age is dropped,
# leaving the 714 passengers whose age is known (177 of the 891 rows are removed).
rdf = rdf.dropna(subset=['age'], how='any', axis=0)
print(len(rdf)) # 714
print('\n')
# embarked holds the first letter of the town where each passenger boarded.
# It has 2 missing values, which are replaced with the most frequent value.
# value_counts() and idxmax() give the most frequent boarding town: S
most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()
print(most_freq) # S : Southampton
print('\n')


# The most frequent value (top) of embarked in describe() is also S
print(rdf.describe(include='all'))
print('\n')
# Replace the missing embarked values with S using fillna()
rdf['embarked'] = rdf['embarked'].fillna(most_freq)

# Check the data types after cleaning
rdf.info()

# Select the columns used in the analysis
ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]
print(ndf.head())
print('\n')
# To apply the SVM model, convert the categorical sex and embarked columns to numbers.
# This is one-hot encoding: creating dummy variables so that the model
# can use categorical data as numbers.
# sex yields two dummy columns named female and male.
# concat() attaches the dummy columns to the existing DataFrame.
onehot_sex = pd.get_dummies(ndf['sex'])
ndf = pd.concat([ndf, onehot_sex], axis=1)

# embarked yields three dummy columns; prefix='town' names them
# town_C, town_Q, and town_S
onehot_embarked = pd.get_dummies(ndf['embarked'], prefix='town')
ndf = pd.concat([ndf, onehot_embarked], axis=1)
# Drop the original sex and embarked columns
ndf.drop(['sex', 'embarked'], axis=1, inplace=True)
print(ndf.head()) # data with the dummy variables
print('\n')

# Define the variables
x=ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male',
'town_C', 'town_Q', 'town_S']] # independent variables X
y=ndf['survived'] # dependent variable Y
# Standardize the independent variables to remove
# the differences in scale between the columns
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
print('train data shape: ', x_train.shape) # (499, 9)
print('test data shape: ', x_test.shape) # (215, 9)

# Import the SVM module from sklearn
from sklearn import svm
# Create the SVC model (kernel='rbf')
svm_model = svm.SVC(kernel='rbf')
# Train the model on the train data
svm_model.fit(x_train, y_train)
# Predict (classify) y_hat for the test data
y_hat = svm_model.predict(x_test)
# Compare the first 10 predictions (y_hat) with the actual values (y_test): 8 match (0: died, 1: survived)
print(y_hat[0:10]) # [0 0 1 0 0 0 1 0 0 0]
print(y_test.values[0:10]) # [0 0 1 0 0 1 1 1 0 0]

# Evaluate the SVM model: confusion matrix
from sklearn import metrics
svm_matrix = metrics.confusion_matrix(y_test, y_hat)
print(svm_matrix)
# [[120 5]
# [ 35 55]]

# Rows are actual classes and columns are predicted classes.
# Taking died (0) as the positive class:
# TP (True Positive) : 120 of the 215 test passengers died and were classified as died
# FP (False Positive) : 35 survivors were wrongly classified as died
# FN (False Negative) : 5 passengers who died were wrongly classified as survivors
# TN (True Negative) : 55 survivors were classified correctly

# Evaluate the SVM model: evaluation metrics
svm_report = metrics.classification_report(y_test, y_hat)
print(svm_report)

The f1-score summarizes a model's predictive power by combining precision and recall.

Decision Tree algorithm

A decision tree classifies data by following a tree of decisions, choosing one branch at each node.

import pandas as pd
import numpy as np
# Load the Breast Cancer diagnosis dataset from the UCI repository
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = pd.read_csv(uci_path, header=None)
print(df) # [699 rows x 11 columns]
# Name the 11 columns
df.columns = ['id','clump','cell_size','cell_shape', 'adhesion','epithlial',
'bare_nuclei','chromatin','normal_nucleoli', 'mitoses', 'class']
# IPython display setting: raise the limit on the number of printed columns
pd.set_option('display.max_columns', 15)
print(df.head()) # look at the first 5 rows

# Check the data types: only bare_nuclei is object (string); the other columns are numeric
df.info()
print('\n')
# Check the summary statistics: bare_nuclei is not shown (only 10 columns are printed)
print(df.describe())
print('\n')
# Check the unique values of bare_nuclei: the column contains '?' entries
print(df['bare_nuclei'].unique())
# ['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
# Convert bare_nuclei from string to number
# First replace '?' with a missing value (NaN)
df['bare_nuclei'] = df['bare_nuclei'].replace('?', np.nan) # '?' becomes np.nan
df.dropna(subset=['bare_nuclei'], axis=0, inplace=True) # drop the rows with missing values
df['bare_nuclei'] = df['bare_nuclei'].astype('int') # convert the strings to integers
print(df.describe()) # summary statistics: all 11 columns, including bare_nuclei
print('\n')

# Check the data types again: bare_nuclei is now an integer column
df.info()

# Select the features used in the analysis
x=df[['clump','cell_size','cell_shape', 'adhesion','epithlial',
'bare_nuclei','chromatin','normal_nucleoli', 'mitoses']] # independent (explanatory) variables X
y=df['class'] # dependent (target) variable Y
# class (2: benign, 4: malignant)
# Standardize the explanatory variables
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)
# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
print('train data shape: ', x_train.shape) # (478, 9)
print('test data shape: ', x_test.shape) # (205, 9)

# Import the Decision Tree module from sklearn
from sklearn import tree
# Create the Decision Tree model (criterion='entropy')
# Entropy is the measure used at each split to find the best feature
# max_depth=5 means the tree can grow branches down to 5 levels
# A deeper tree fits the training data more accurately, at the risk of overfitting
tree_model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5)
# Train the model on the train data
tree_model.fit(x_train, y_train)
# Predict (classify) y_hat for the test data
y_hat = tree_model.predict(x_test) # 2: benign, 4: malignant
# Compare the first 10 predictions (y_hat) with the actual values (y_test): all 10 match
print(y_hat[0:10]) # [4 4 4 4 4 4 2 2 4 4]
print(y_test.values[0:10]) # [4 4 4 4 4 4 2 2 4 4]

# Evaluate the Decision Tree model: confusion matrix
from sklearn import metrics
tree_matrix = metrics.confusion_matrix(y_test, y_hat)
print(tree_matrix)
# [[127 4]
# [ 2 72]]

# The target value is 2 for benign tumors and 4 for malignant tumors.
# Taking benign (2) as the positive class:
# TP (True Positive) : 127 benign tumors classified correctly
# FP (False Positive) : 2 malignant tumors wrongly classified as benign
# FN (False Negative) : 4 benign tumors wrongly classified as malignant
# TN (True Negative) : 72 malignant tumors classified correctly

# Evaluate the Decision Tree model: evaluation metrics
tree_report = metrics.classification_report(y_test, y_hat)
print(tree_report)

The f1-score summarizes a model's predictive power by combining precision and recall.
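
The decision tree above is built with criterion='entropy'. As a rough illustration of what that criterion measures, the sketch below computes the entropy of a set of class labels and the information gain of a split. The function names and the tiny label lists are made up for this example; sklearn's own tree builder is considerably more involved.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical node holding 4 benign (2) and 4 malignant (4) samples
parent = [2, 2, 2, 2, 4, 4, 4, 4]
print(entropy(parent))                                       # 1.0
print(information_gain(parent, [2, 2, 2, 2], [4, 4, 4, 4]))  # 1.0 (perfect split)
print(information_gain(parent, [2, 2, 4, 4], [2, 2, 4, 4]))  # 0.0 (uninformative split)
```

At each node the tree prefers the split with the largest information gain, which is why a split that separates the classes cleanly scores 1.0 here and one that leaves both sides mixed scores 0.0.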

support vector machine 으로 바꾸기

import pandas as pd
import numpy as np
# Load the Breast Cancer diagnosis dataset from the UCI repository
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = pd.read_csv(uci_path, header=None)
print(df) # [699 rows x 11 columns]
# Name the 11 columns
df.columns = ['id','clump','cell_size','cell_shape', 'adhesion','epithlial',
'bare_nuclei','chromatin','normal_nucleoli', 'mitoses', 'class']
# IPython display setting: raise the limit on the number of columns printed
pd.set_option('display.max_columns', 15)
print(df.head()) # look at the data: print the first 5 rows

# Check data types: only bare_nuclei is object (string); the other columns are numeric
print(df.info())
print('\n')
# Summary statistics: bare_nuclei is not shown (only the 10 numeric columns are printed)
print(df.describe())
print('\n')
# Unique values of bare_nuclei: the column contains '?' entries
print(df['bare_nuclei'].unique())
# ['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
# Convert bare_nuclei from string to number
# First turn '?' into missing data (NaN)
df['bare_nuclei'] = df['bare_nuclei'].replace('?', np.nan) # assign back instead of inplace=True on a single column
df.dropna(subset=['bare_nuclei'], axis=0, inplace=True) # drop the rows with missing data
df['bare_nuclei'] = df['bare_nuclei'].astype('int') # string -> integer
print(df.describe()) # summary statistics
print('\n') # all 11 columns are printed now, including bare_nuclei

# Check data types again: bare_nuclei is now an integer column like the rest
print(df.info())

# Select the features (variables) used in the analysis
x=df[['clump','cell_size','cell_shape', 'adhesion','epithlial',
'bare_nuclei','chromatin','normal_nucleoli', 'mitoses']] # independent (explanatory) variables X
y=df['class'] # dependent (target) variable Y
# class (2: benign, 4: malignant)
# Standardize the explanatory variables (zero mean, unit variance)
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)
# Split into train data and test data (7:3 ratio)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
print('train data shape: ', x_train.shape) # train data shape: (478, 9)
print('test data shape: ', x_test.shape) # test data shape: (205, 9)

# This listing replaces the Decision Tree model with a support vector machine.
# The Decision Tree line is kept commented out for comparison:
#tree_model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5)

from sklearn import svm
svm_model = svm.SVC(kernel='rbf') # support vector classifier with an RBF kernel

# Train the model on the train data
svm_model.fit(x_train, y_train)
# Predict (classify) y_hat for the test data
y_hat = svm_model.predict(x_test) # 2: benign, 4: malignant
# Compare the first 10 predictions (y_hat) with the actual values (y_test)
print(y_hat[0:10]) # [4 4 4 4 4 4 2 2 4 4]
print(y_test.values[0:10]) # [4 4 4 4 4 4 2 2 4 4]

# Evaluate the SVM model: confusion matrix
from sklearn import metrics
svm_matrix = metrics.confusion_matrix(y_test, y_hat)
print(svm_matrix)
# The figures below are the ones recorded in the notes; they are identical to the
# Decision Tree listing, so rerun the code to confirm the SVM result.
# [[127 4]
# [ 2 72]]
# Evaluate the SVM model: classification report
svm_report = metrics.classification_report(y_test, y_hat)
print(svm_report)

# The target value is 2 for benign tumors and 4 for malignant tumors
# (rows of the matrix are actual classes, columns are predicted classes; benign is treated as Positive)
# TP (True Positive) : benign tumors correctly classified as benign: 127
# FP (False Positive) : malignant tumors wrongly classified as benign: 2
# FN (False Negative) : benign tumors wrongly classified as malignant: 4
# TN (True Negative) : malignant tumors correctly classified as malignant: 72


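Clustering

Supervised learning

Classification

Prediction

sklearn

tensorflow

keras

Supervised learning has known answers (labels). It learns regularities in the data and is used for classification and prediction.

Clustering: there are no predefined answers. Examples: k-means, DBSCAN, hierarchical clustering.

Pattern classification: grouping by a consistent pattern.

Clustering: the answer is not given. There is no dependent variable that says what the result should be for given independent variables; similar items are simply grouped together.

Reinforcement learning is used for things like games.

Learn from data, then predict.

Providing labels is what makes it supervised learning.

sigmoid: squashes the output to a value between 0 and 1, so it is used for binary classification.

For multi-class classification, use softmax.

k-means works by moving centroids; the hard part is preprocessing the data.

The dependent variable is the correct answer for the existing data.

Clustering is unsupervised learning, so it does not need a dependent variable.

Normalization: rescale the data into the range 0 to 1 so that values become relative rather than absolute. (That is min-max scaling; the StandardScaler used in the code in these notes instead rescales each column to zero mean and unit variance.)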
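The note above describes scaling to the 0 to 1 range, while the listings call StandardScaler. The sketch below contrasts the two on a small made-up array; the data values are invented for illustration and do not come from the post.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative values, not from the post)
data = np.array([[1.0, 1000.0],
                 [2.0, 3000.0],
                 [3.0, 5000.0]])

minmax = MinMaxScaler().fit_transform(data)      # each column rescaled to [0, 1]
standard = StandardScaler().fit_transform(data)  # each column: mean 0, standard deviation 1

print(minmax)
print(standard)
```

Either choice puts the columns on a comparable scale, which matters for distance-based methods such as k-means and KNN.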

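Overfitting can be reduced with regularization, a penalty on model complexity. This is a different technique from the feature scaling above, even though both are often written 정규화 in Korean.

Remedies for overfitting: regularization, or adding more data and retraining.

These concerns apply when doing prediction and classification.

For clustering, the number of clusters is given up front (5 in the example here).

Points that are close to each other are grouped together. The centroids keep moving until they stop moving. With 5 clusters, each point ends up with a label from 0 to 4.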
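The "assign to the nearest centroid, move the centroid to the mean, repeat until nothing moves" loop can be sketched directly in NumPy. This is a minimal illustration of plain k-means (Lloyd's algorithm), not the scikit-learn implementation; the function name `kmeans`, the toy points, and the starting centroids are all made up for the example.

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Plain k-means (Lloyd's algorithm); `centers` holds the initial centroids."""
    centers = centers.astype(float)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # centroids stopped moving
            break
        centers = new_centers
    return labels, centers

# Toy data (illustrative): two obvious groups
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels, centers = kmeans(pts, centers=np.array([[0.0, 0.0], [1.0, 1.0]]))
print(labels)
print(centers)
```

With k clusters the labels run from 0 to k-1, which is why 5 clusters produce the values 0 to 4.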

DBSCAN clustering (density-based)
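The notes mention DBSCAN only by name. As a rough sketch of what "density-based" means, the example below runs scikit-learn's DBSCAN on a few hand-made points; the data and the eps and min_samples values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative): two dense groups and one isolated point
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
                [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
                [10.0, 0.0]])

# eps: neighborhood radius; min_samples: points needed within eps to form a dense region
db = DBSCAN(eps=0.5, min_samples=3).fit(pts)
print(db.labels_)  # points that belong to no dense region are labeled -1 (noise)
```

Unlike k-means, the number of clusters is not given in advance; it follows from the density parameters.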

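Unsupervised learning: clustering

Classification and prediction are used most often; clustering is used only in special cases. It groups similar items together.

Clustering

Clustering is an algorithm that groups similar data points together.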

from sklearn import datasets
# Load the iris data
iris = datasets.load_iris()

# 1. data : the iris measurements
data = iris['data']
#print(data)

# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# 2. DESCR
# Description of Fisher's iris data
print(iris['DESCR'])
# class: species
#     - Iris - Setosa
#     - Iris - Versicolour
#     - Iris - Virginica

# 3. target : species id of each sample
print(iris['target'])

# 4. target_names : the species names
print(iris['target_names'])
# ['setosa' 'versicolor' 'virginica']

# 5. feature_names
print(iris['feature_names'])
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# The samples fall into 3 species.
# Here we ignore the labels and use clustering to group similar samples together.

K-Means Clustering

# Clustering the iris dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn import datasets
# Load the iris data
iris = datasets.load_iris()
data = iris["data"]
# Define the initial centroids: 3 centroids, one row each (4 features)
init_centers=np.array([
[4,2.5,3,0],
[5,3 ,3,1],
[6,4 ,3,2]])

# Choose the two features to plot and extract their values
x_index = 1
y_index = 2
data_x=data[:,x_index]
data_y=data[:,y_index]
# Axis limits and labels for the plot
x_max = 4.5
x_min = 2
y_max = 7
y_min = 1
x_label = iris["feature_names"][x_index]
y_label = iris["feature_names"][y_index]

def show_result(cluster_centers,labels):
    # Draw cluster 0 and its centroid
    plt.scatter(data_x[labels==0], data_y[labels==0],c='black' ,alpha=0.3,s=100,
    marker="o",label="cluster 0")
    plt.scatter(cluster_centers[0][x_index], cluster_centers[0][y_index],facecolors='white',
    edgecolors='black', s=300, marker="o")
    # Draw cluster 1 and its centroid
    plt.scatter(data_x[labels==1], data_y[labels==1],c='black' ,alpha=0.3,s=100,
    marker="^",label="cluster 1")
    plt.scatter(cluster_centers[1][x_index], cluster_centers[1][y_index],facecolors='white', edgecolors='black',
    s=300, marker="^")
    # Draw cluster 2 and its centroid
    plt.scatter(data_x[labels==2], data_y[labels==2],c='black' ,alpha=0.3,s=100, marker="*",label="cluster 2")
    plt.scatter(cluster_centers[2][x_index], cluster_centers[2][y_index],facecolors='white', edgecolors='black',
    s=500, marker="*")
    # Set the axis limits and axis labels, then show the figure from inside the function
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xlabel(x_label,fontsize='large')
    plt.ylabel(y_label,fontsize='large')
    plt.show()


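# Show the initial state (every point assigned to cluster 0)
labels=np.zeros(len(data),dtype=int) # np.int was removed in NumPy 1.24; use the built-in int
show_result(init_centers,labels)
# Similar points get grouped together.
for i in range(5):
    # max_iter=1 runs a single k-means step per fit; n_init=1 because init is given explicitly
    model = cluster.KMeans(n_clusters=3,max_iter=1,init=init_centers,n_init=1).fit(data)
    labels = model.labels_
    init_centers=model.cluster_centers_
    show_result(init_centers,labels)

# The centroids move on each pass; this step is repeated.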

# Import the basic libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the wholesale customers dataset from the UCI repository by URL
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(uci_path, header=0)
#print(df) # [440 rows x 8 columns]
# The region column is the region where the customer is located

# Check data types
#print(df.info())

# Summary statistics
#print(df.describe())
#print('\n')

# Unsupervised learning needs only independent variables.
# Select the columns (variables) to use in the analysis.
# k-means is an unsupervised model, so no target (dependent) variable is specified;
# every column is used as an explanatory (independent) variable.
x = df.iloc[:,:] # all rows and all columns
print(x[:5]) # first 5 rows

# Standardize the explanatory variables.
# Scaling removes the distortion caused by variables having very different magnitudes,
# so that every variable contributes on a comparable scale.
# The columns here have very different distributions, which is why scaling is applied.
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)
print(x[:5])

# Import the cluster module from sklearn
from sklearn import cluster

# Create the k-means model
# The model uses the 8 columns (variables) to split the observations into 5 clusters
# Number of clusters set to 5: n_clusters=5
kmeans = cluster.KMeans(n_clusters=5)

# Train the k-means model
# This is unsupervised learning, so there is no dependent variable.
# Fitting on x partitions the data into as many groups as there are clusters (5).
# The assigned cluster ids (0 to 4) are stored in the model's labels_ attribute.
kmeans.fit(x)

# Get the cluster assignments
# labels_ holds one of the 5 cluster ids (0 to 4) for every row,
# so it shows which cluster each observation was assigned to.
# (No random_state is set, so the assignments can differ from run to run.)
cluster_label = kmeans.labels_ # read from the fitted kmeans model
print(cluster_label) # this output may change on every run

# Add the cluster assignments to the DataFrame as a new column (Cluster)
# to make them easier to work with
df['Cluster'] = cluster_label
print(df.head())
# k-means moves each centroid to the mean of the points assigned to it.

# Visualize the clusters with scatter plots
# Plot all cluster ids, 0 to 4
# Eight variables cannot be shown in one graph, so two variables are chosen at a time
# and the distribution of the observations is drawn as a scatter plot.
# The cluster assignments change from run to run, so the plots change too.
# Scatter plot : x='Grocery', y='Frozen'
# Scatter plot : x='Milk', y='Delicassen'
df.plot(kind ='scatter' , x = 'Grocery' , y ='Frozen' , c = 'Cluster' , cmap ='Set1' , colorbar = False, figsize=(10,10))
df.plot(kind ='scatter' , x = 'Milk' , y ='Delicassen' , c = 'Cluster' , cmap ='Set1' , colorbar = True, figsize=(10,10))
plt.show()

# Zoom in on clusters 1, 2, 3
# Exclude the clusters made up of values far larger than the rest (0 and 4 in the recorded run;
# the ids depend on the run, so check the plots above before choosing which to exclude)
# and look more closely at the region where most of the data sits.
# Keep only the rows whose cluster id is 1, 2 or 3 in ndf
mask = (df['Cluster'] == 0) | (df['Cluster'] == 4)
ndf = df[~mask] # ~ negates the mask
print(ndf.head())

# Check the distribution using only the rows in clusters 1, 2, 3
# Scatter plot : x='Grocery', y='Frozen'
# Scatter plot : x='Milk', y='Delicassen'
ndf.plot(kind='scatter', x='Grocery', y='Frozen', c='Cluster', cmap='Set1',
colorbar=False, figsize=(10, 10)) # without colorbar
ndf.plot(kind='scatter', x='Milk', y='Delicassen', c='Cluster', cmap='Set1',
colorbar=True, figsize=(10, 10)) # with colorbar
plt.show()
plt.close()

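1. Prepare the data: UCI

2. Load the data with pandas into a DataFrame

3. Select the data with df.iloc[:,:]. This is unsupervised learning,

so no dependent variable is needed. Every variable is stored in x.

All columns are already integers, so no type conversion is required.

4. Scaling: puts the values into a common, relative range.

5. Create the model. n_clusters sets how many groups the data is split into. Each point joins the nearest centroid,

and the centroids keep moving.

6. Fit the model on x. The model learns the grouping from the independent variables alone.

Five centroids are created; each one moves to the mean of the points nearest to it, and this repeats.

The result differs from run to run.

7. Fitting the model produces labels_. With 5 clusters the ids run from 0 to 4.

8. Add the labels from step 7 to the DataFrame as a new column.

9. Rows with the same label belong to the same group.

10. Visualize the 8 columns, two at a time.

11. In the plots the points appear grouped into clusters 0 to 4.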
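Because the result changes on every run, comparing plots between runs is awkward. One common way to make the clustering repeatable is to fix random_state. The sketch below uses synthetic data from make_blobs as a stand-in for the customer table; the sample size, the seeds, and the n_init value are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer data (illustrative only): 8 features, 5 groups
X, _ = make_blobs(n_samples=200, centers=5, n_features=8, random_state=0)

# Fixing random_state makes the centroid initialization repeatable
labels_a = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X).labels_

print(np.array_equal(labels_a, labels_b))  # same seed, same assignments
print(np.unique(labels_a))                 # cluster ids present in the result
```

With a fixed seed, a mask such as "exclude clusters 0 and 4" keeps referring to the same groups on every run.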


머신러닝-4

2020. 11. 17. 20:11


Multiple Regression Analysis

# Multiple regression analysis

# Import the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step1. Prepare the data
# Read the CSV file into a DataFrame
# Load the car information file
df = pd.read_csv('auto-mpg.csv', header=None)
# The file contains only ASCII text, so no encoding argument is needed.

# Name the columns
df.columns = ['mpg','cylinders','displacement','horsepower','weight', 'acceleration','model year','origin','name']

#print(df.head())

# Check data types
#print(df.info()) #horsepower      398 non-null object

# Summary statistics
#print(df.describe()) # numeric columns only

# Step2. Explore the data
#print(df['horsepower'].unique()) #'113.0' '200.0' '210.0' '193.0' '?' '100.0' '105.0' '175.0' '153.0'
# Replace '?' in horsepower with np.nan
df['horsepower'] = df['horsepower'].replace('?', np.nan)
# Drop the rows with missing data (np.nan)
# axis=0: rows, axis=1: columns
df.dropna(subset=['horsepower'], axis=0 , inplace=True)
# Convert the string data to float
df['horsepower'] = df['horsepower'].astype('float')

pd.set_option('display.max_columns', 10)
# Check the data: summary statistics
print(df.describe())

# Step3. Select the features (variables) to use in the analysis
# Step4. Split into training data and test data

# Select the columns to analyze (mpg, cylinders, horsepower, weight) as a new DataFrame
ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]
# Choose the independent and dependent variables
x=ndf[['cylinders', 'horsepower', 'weight']] # independent variables: cylinders, horsepower, weight
y=ndf['mpg'] # dependent variable: mpg

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train,x_test, y_train ,y_test = train_test_split(x,       # independent variables
                                                   y,       # dependent variable
                                                   test_size=0.3, # 30% test data
                                                   random_state=10) # random seed

print('train data:' , x_train.shape) #train data: (274, 3)
print('test data:' , x_test.shape) #test data: (118, 3)

# Multiple regression analysis
from sklearn.linear_model import  LinearRegression

# Create the model
model = LinearRegression()

# Train on the train data
model.fit(x_train, y_train)
# The train data is for learning; the test data is for validation.

# Coefficient of determination (R^2)
r_score = model.score(x_test, y_test)
print("R^2:" ,r_score) #0.6939048496695597

# Regression coefficients and intercept
print("coefficients:", model.coef_)
print("intercept:",model.intercept_)

# Predictions from the model
y_hat = model.predict(x_test) # predictions for the test data

# Plot the actual data (y_test) and the predictions (y_hat) as kernel density curves
# (sns.distplot is deprecated; sns.kdeplot draws the same density curve)
plt.figure(figsize=(10 , 5))
ax1 = sns.kdeplot(y_test, label='y_test') # actual values
ax2 = sns.kdeplot(y_hat, label='y_hat' , ax = ax1) # predicted values
plt.legend()
plt.show()

Regression can be seen as a form of prediction.

Classification

Supervised learning: the data has labels, that is, each row comes with a name (its answer).

# The digits dataset consists of images of handwritten digits 0 through 9.
# The images are 8 x 8 pixel grayscale; there are 1797 of them.

from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd

# Load the digits dataset
digits = datasets.load_digits()
pd.set_option('display.max_columns', 15)

#print(digits) # printed as arrays
#print(digits.target) #[0 1 2 ... 8 9 8] image labels
#print(digits.images) # arrays of pixel values

# Show the first 12 images in 2 rows and 6 columns using subplot.
# The labels cycle: 0 1 2 3 4 5 6 7 8 9 0 1 2 ...
# With only the first 10 images, plt.subplot(2, 5, label+1) also works,
# because each label 0 to 9 appears exactly once there.
i = 0
for label, img in zip(digits.target[:12], digits.images[:12]):
    plt.subplot(2, 6, i+1)  # 2 rows x 6 columns; this image goes to position i+1
    plt.xlabel('') # remove the x label
    #plt.imshow(img ) # uses the default colormap
    plt.imshow(img , cmap=plt.cm.gray ) # grayscale image
    plt.title('Digit:{0}'.format(label))
    i += 1

plt.show()

# Use scikit-learn to build a classifier for the 3 and 8 digit images,
# then test the classifier's performance.

import numpy as np
from sklearn import datasets

# Fix the random seed so that the output is the same on every run
np.random.seed(0)

# Load the dataset
digits = datasets.load_digits()

# Find the positions of the 3 and 8 samples (boolean mask)
flag_3_8 = (digits.target == 3) | (digits.target ==8)
print(flag_3_8) #[False False False ...  True False  True]

# Store the 3 and 8 images and labels
labels = digits.target[flag_3_8]
images = digits.images[flag_3_8]
print(labels.shape) #(357,)
print(images.shape) #(357, 8, 8)

# Flatten each 3 and 8 image from 2-D to 1-D
# reshape(images.shape[0],-1) : -1 lets NumPy work out the number of columns
images = images.reshape(images.shape[0], -1)
print(images.shape) #(357, 64)

# Split into train data and test data
n_samples = len(flag_3_8[flag_3_8]) # number of 3 and 8 images
print(n_samples) # 357
train_size = int(n_samples* 3 / 5) # 60% of the data for training
print(train_size)  # 214
                    # training data : images[:214]
                    #                 labels[:214]

# Decision tree classifier
from sklearn import  tree

# Create the model
classifier = tree.DecisionTreeClassifier()

# Train the model
classifier.fit(images[:train_size],labels[:train_size])
# The model learns which label goes with which image.

# Test labels : labels[214:]
test_label = labels[train_size:]
print(test_label)

# Predict the labels of the test data
# images -> x, labels -> y
predict_label = classifier.predict(images[train_size:])  # test data: from index 214 onward
print(predict_label) # predicted labels

# Accuracy: metrics.accuracy_score() compares the actual and predicted labels
from sklearn import metrics

print('Accuracy:', metrics.accuracy_score(test_label ,predict_label))
# Accuracy: 0.8741258741258742

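Classification

1. Load the dataset

2. Find the positions of the samples labeled 3 or 8

(True where the label is 3 or 8, False otherwise)

3. Each image is 2-D, so reshape it to 1-D to keep the processing simple

4. Split into train data and test data

5. Create the decision tree classifier

6. Train on the training data

7. Predict on the test data

The train data is used for learning.

The test data is used for evaluation and prediction.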
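Steps 2 and 3 are the least obvious part of the pipeline, so here is a small sketch of the boolean mask and the reshape on a made-up array. The labels and the fake 2 x 2 "images" are invented for illustration; the real listing applies the same idea to digits.target and digits.images.

```python
import numpy as np

target = np.array([0, 3, 8, 1, 3, 9, 8])      # illustrative labels
images = np.arange(7 * 4).reshape(7, 2, 2)    # seven fake 2x2 "images"

mask = (target == 3) | (target == 8)          # True where the label is 3 or 8
selected = images[mask].reshape(mask.sum(), -1)  # keep those images, flatten each to 1-D

print(mask)
print(target[mask])
print(selected.shape)
```

The original listing combines the two comparisons with +, which on boolean arrays behaves like logical OR; | states the intent more directly.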

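Evaluating classifier performance

1. Accuracy: the proportion of all predictions that are correct (out of everything, how many were predicted correctly).

Accuracy = ( TP + TN ) / ( TP + FP + FN + TN )

2. Precision: of the cases the classifier predicted as Positive, the proportion that are truly Positive.

It indicates how often a Positive prediction turns out to be right

(of the questions I answered, how many I got right).

Precision = TP / ( TP + FP )

3. Recall: of the cases that are truly Positive, the proportion the classifier predicted as Positive

(of all the true answers, how many did I find).

It measures how much of what is actually Positive could be detected.

Recall = TP / ( TP + FN )

4. F-measure: the harmonic mean of precision and recall, used to judge the two together.

The higher the F-measure, the better the predictive power of the classifier.

F-measure = 2 x Precision x Recall / ( Precision + Recall )

Accuracy is the usual headline figure for a classifier, but when it alone is not enough,

the other metrics should be examined as well.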
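The formulas above depend on which class counts as Positive. In scikit-learn that choice is made with the pos_label argument. The sketch below checks the three formulas on a tiny hand-made example; the label lists are invented so that the counts can be verified by eye.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels (not from the post): 3 is treated as the Positive class
y_true = [3, 3, 3, 3, 8, 8, 8, 8]
y_pred = [3, 3, 3, 8, 3, 8, 8, 8]

# TP = 3 (3 predicted as 3), FN = 1 (3 predicted as 8), FP = 1 (8 predicted as 3)
p = precision_score(y_true, y_pred, pos_label=3)   # TP / (TP + FP) = 3 / 4
r = recall_score(y_true, y_pred, pos_label=3)      # TP / (TP + FN) = 3 / 4
f = f1_score(y_true, y_pred, pos_label=3)          # 2PR / (P + R)

print(p, r, f)
```

Passing pos_label=8 instead would report the same metrics from the point of view of the other class.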

# Build a classifier and evaluate it: accuracy, confusion matrix, precision, recall, F-measure

# Use scikit-learn to build a classifier for the 3 and 8 digit images,
# then test the classifier's performance.
# The data preparation is the same as in the previous listing.

import numpy as np
from sklearn import datasets

# Fix the random seed so that the output is the same on every run
np.random.seed(0)

# Load the dataset
digits = datasets.load_digits()

# Find the positions of the 3 and 8 samples (boolean mask)
flag_3_8 = (digits.target == 3) | (digits.target ==8)

# Store the 3 and 8 images and labels
labels = digits.target[flag_3_8]
images = digits.images[flag_3_8]
print(labels.shape) #(357,)

# Flatten each 3 and 8 image from 2-D to 1-D
images = images.reshape(images.shape[0], -1)
print(images.shape) #(357, 64)

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split

x_train , x_test, y_train ,y_test = train_test_split(images, # images
                                                     labels, # labels
                                                     test_size=0.3 , # 30% test data
                                                     random_state= 10
                                                     )

# Decision tree classifier
from sklearn import  tree
classifier = tree.DecisionTreeClassifier()

# Train the model
classifier.fit(x_train, y_train)

print(y_test) # actual labels of the test data

# Predict on the test data
predict_label = classifier.predict(x_test)

from sklearn import metrics
# Compute and print accuracy, confusion matrix, precision, recall and F-measure
# accuracy_score() computes accuracy.
# confusion_matrix() computes the confusion matrix.
# precision_score() computes precision.
# recall_score() computes recall.
# f1_score() computes the F-measure.

print('Accuracy:', metrics.accuracy_score(y_test ,predict_label))
print('Confusion matrix:' ,metrics.confusion_matrix(y_test, predict_label))

# Precision: of the samples predicted as 3, how many really are 3
# pos_label=3 treats label 3 as the Positive class; 8 could be used instead.
print('Precision:' ,metrics.precision_score(y_test, predict_label, pos_label=3))
# Precision: 0.9107142857142857

# Recall: of all the samples that really are 3, how many were found
print('Recall:', metrics.recall_score(y_test, predict_label, pos_label=3)) #0.9272727272727272

print('F-measure:' ,metrics.f1_score(y_test, predict_label,pos_label= 3))

Types of classifiers

- Decision Tree

- Random Forest

- AdaBoost

- Support Vector Machine

Decision Tree

A decision tree is a supervised learning method that sorts data into classes; it is a classification algorithm built on a tree structure.

import matplotlib.pyplot as plt
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score

# Read the handwritten digit data
digits = datasets.load_digits()

# Show the images in 2 rows and 5 columns
for label, img in zip(digits.target[:10], digits.images[:10]):
    plt.subplot(2, 5, label + 1)
    plt.axis('off')
    plt.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Digit: {0}'.format(label))
plt.show() # called once, after the loop, so all ten images share one figure

# Find the positions of the 3 and 8 samples
flag_3_8 = (digits.target == 3) | (digits.target == 8)
# Select the 3 and 8 data
images = digits.images[flag_3_8]
labels = digits.target[flag_3_8]
# Flatten each 3 and 8 image to 1-D
images = images.reshape(images.shape[0], -1)

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split

x_train , x_test, y_train ,y_test = train_test_split(images, # images
                                                     labels, # labels
                                                     test_size=0.3 , # 30% test data
                                                     random_state= 10
                                                     )

# Decision tree classifier
from sklearn import  tree
classifier = tree.DecisionTreeClassifier()

# Train the model
classifier.fit(x_train, y_train)

print(y_test) # actual labels of the test data

# Predict on the test data
predict_label = classifier.predict(x_test)

from sklearn import metrics
# Compute and print accuracy, confusion matrix, precision, recall and F-measure
# accuracy_score() computes accuracy.
# confusion_matrix() computes the confusion matrix.
# precision_score() computes precision.
# recall_score() computes recall.
# f1_score() computes the F-measure.

print('Accuracy:', metrics.accuracy_score(y_test ,predict_label))
print('Confusion matrix:' ,metrics.confusion_matrix(y_test, predict_label))

# Precision: of the samples predicted as 3, how many really are 3
# pos_label=3 treats label 3 as the Positive class; 8 could be used instead.
print('Precision:' ,metrics.precision_score(y_test, predict_label, pos_label=3))
# Precision: 0.9107142857142857

# Recall: of all the samples that really are 3, how many were found
print('Recall:', metrics.recall_score(y_test, predict_label, pos_label=3)) #0.9272727272727272

print('F-measure:' ,metrics.f1_score(y_test, predict_label,pos_label= 3))

Random Forest

1. Random Forest is an algorithm that belongs to the bagging family of ensemble learning.

Bagging is short for Bootstrap Aggregation. It draws samples repeatedly (bootstrap), trains a model on each sample, and aggregates the results (aggregation).
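The notes describe Random Forest without code. As a sketch, the example below swaps the decision tree from the earlier 3-versus-8 digits listing for scikit-learn's RandomForestClassifier. The split parameters mirror that listing; n_estimators=100 and random_state=0 are assumed values, not taken from the post.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data preparation as the 3-versus-8 digits listing
digits = datasets.load_digits()
mask = (digits.target == 3) | (digits.target == 8)
images = digits.images[mask].reshape(mask.sum(), -1)
labels = digits.target[mask]

x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, random_state=10)

# Many trees, each trained on a bootstrap sample; their predictions are combined
# (n_estimators=100 and random_state=0 are assumed values)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
acc = accuracy_score(y_test, forest.predict(x_test))
print('Accuracy:', acc)
```

Only the classifier line differs from the decision tree version, which makes the two easy to compare on the same split.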

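Ensemble learning: a single model may not learn well enough, so several models are combined into one stronger model.

Boosting: works with weights. Hard (misclassified) examples are given more weight, and better-performing models are also given more weight.

AdaBoost

1. AdaBoost is an algorithm that belongs to the boosting family of ensemble learning.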
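AdaBoost is likewise listed without code. The sketch below uses scikit-learn's AdaBoostClassifier on the same 3-versus-8 digits task; n_estimators=50 and random_state=0 are assumptions for the example, and the default weak learner (a depth-1 decision tree) is left unchanged.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data preparation as the 3-versus-8 digits listing
digits = datasets.load_digits()
mask = (digits.target == 3) | (digits.target == 8)
images = digits.images[mask].reshape(mask.sum(), -1)
labels = digits.target[mask]

x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, random_state=10)

# Weak learners are trained one after another; each round puts more weight
# on the samples the previous rounds got wrong (assumed: 50 rounds, seed 0)
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(x_train, y_train)
acc = accuracy_score(y_test, booster.predict(x_test))
print('Accuracy:', acc)
```

Bagging trains its models independently, while boosting trains them in sequence, each one focusing on the earlier mistakes.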

Support Vector Machine

1. A support vector machine is a supervised learning algorithm that can be used for both classification and regression.

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
# Read the handwritten digit data
digits = datasets.load_digits()
# Show the images in 2 rows and 5 columns
for label, img in zip(digits.target[:10], digits.images[:10]):
    plt.subplot(2, 5, label + 1)
    plt.axis('off')
    plt.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Digit: {0}'.format(label))
plt.show()

# Find the positions of the 3 and 8 samples
flag_3_8 = (digits.target == 3) | (digits.target == 8)
# Select the 3 and 8 data
images = digits.images[flag_3_8]
labels = digits.target[flag_3_8]
# Flatten each 3 and 8 image to 1-D
images = images.reshape(images.shape[0], -1)
# Build the classifier: first 60% of the samples for training
n_samples = len(flag_3_8[flag_3_8])
train_size = int(n_samples * 3 / 5)
# SVC (Support Vector Classifier)
# C is the penalty parameter: it controls how much misclassification is tolerated.
# kernel selects the kernel type: 'linear', 'poly', 'rbf' (default), 'sigmoid', 'precomputed'
classifier = svm.SVC( kernel = 'rbf' )
classifier.fit(images[:train_size], labels[:train_size])


# Check the classifier's performance
expected = labels[train_size:]
predicted = classifier.predict(images[train_size:])
print('Accuracy:', accuracy_score(expected, predicted))
print('Confusion matrix:', confusion_matrix(expected, predicted))
print('Precision:', precision_score(expected, predicted, pos_label=3))
print('Recall:', recall_score(expected, predicted, pos_label=3))
print('F-measure:', f1_score(expected, predicted, pos_label=3))

K-Nearest Neighbors (KNN) algorithm

# Import the basic libraries
import pandas as pd
import seaborn as sns
# Load the titanic data with load_dataset into a DataFrame
df = sns.load_dataset('titanic')
print(df) # [891 rows x 15 columns]
# Look at the data
print(df.head()) # first 5 rows
print('\n')
# IPython display setting: raise the number of printed columns to 15
pd.set_option('display.max_columns', 15)
print(df.head())
print('\n')

# Check data types, then drop the columns with many NaN values
print(df.info())
print('\n')
# Drop deck, which is mostly NaN: it has only 203 valid values
# Drop embark_town, which duplicates the information in embarked
# Of the 15 columns, deck and embark_town are removed,
# so 13 column names are printed
rdf = df.drop(['deck', 'embark_town'], axis=1)
print(rdf.columns.values)
print('\n')
# ['survived' 'pclass' 'sex' 'age' 'sibsp' 'parch' 'fare' 'embarked' 'class'
# 'who' 'adult_male' 'alive' 'alone']


# The age column has 177 missing values.
# Filling them with the mean age is possible, but here every row with a missing age is dropped.
# That gives up 177 passengers and keeps the 714 passengers whose age is known.
# Drop all rows without an age (177 NaN values out of 891)
rdf = rdf.dropna(subset=['age'], how='any', axis=0)
print(len(rdf)) # 714 (177 of 891 rows removed)
# embarked holds the first letter of the port where each passenger boarded.
# It has 2 missing values (NaN); replace them with the most frequent value.
# value_counts() and idxmax() show that the most common port is S
most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()
print(most_freq) # S : Southampton

# The mode (top) of embarked in the summary is S
print(rdf.describe(include='all'))
print('\n')
# Replace the missing values (NaN) in embarked with S using fillna()
rdf['embarked'] = rdf['embarked'].fillna(most_freq) # assign back instead of inplace=True on a single column

print(rdf.info()) # check the cleaned DataFrame (rdf, not the original df)

# Select the columns to use in the analysis
ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]
print(ndf.head())
print('\n')
# To apply KNN, convert the categorical columns sex and embarked to numbers.
# This is called creating dummy variables, or one-hot encoding.
# One-hot encoding turns categorical data into numeric columns the model can use.
# For sex, two dummy columns named male and female are created.
# concat() attaches the dummy columns to the existing DataFrame.
onehot_sex = pd.get_dummies(ndf['sex'])
ndf = pd.concat([ndf, onehot_sex], axis=1)
print(ndf.info())

# For embarked, three dummy columns are created; prefix='town' adds
# the prefix town to the column names (town_C, town_Q, town_S)
onehot_embarked = pd.get_dummies(ndf['embarked'], prefix='town')
ndf = pd.concat([ndf, onehot_embarked], axis=1)

# Drop the original sex and embarked columns
ndf.drop(['sex','embarked'], axis = 1, inplace = True)
print(ndf.head())

# Choose the independent and dependent variables for training
x=ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male',
'town_C', 'town_Q', 'town_S']] # independent variables (x)
y=ndf['survived'] # dependent variable (y)
# Standardize the independent variables
# to remove the differences in scale between the columns
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)

# Split into train data and test data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
print('train data shape: ', x_train.shape) # train data shape: (499, 9)
print('test data shape: ', x_test.shape) # test data shape: (215, 9)
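The listing stops after the train/test split, before any KNN model is created. A rough sketch of the remaining steps follows, using scikit-learn's KNeighborsClassifier. To stay self-contained it runs on synthetic two-class data rather than the Titanic features, and n_neighbors=5 is an assumed value rather than the author's setting.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the prepared Titanic features
X, y = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=1.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# n_neighbors=5 is an assumed value: each prediction is a vote of the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_hat = knn.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print('Accuracy:', acc)
```

With the Titanic data, the x_train, x_test, y_train and y_test produced by the listing above would take the place of the synthetic arrays.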


머신러닝-3

2020. 11. 16. 20:52


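Supervised learning: the data has labels

Unsupervised learning: the data has no labels

Implementing least squares in scikit-learn: the easiest way to build the model is linear_model.

Creating the LinearRegression() model comes first.

The coefficient of determination (R^2) can be worked out by hand from its mathematical definition, as data specialists do; here the library computes it.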
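For reference, the coefficient of determination is defined as R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean. The sketch below computes it by hand and compares it with model.score; the data generator and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)            # seed chosen for illustration
x = rng.uniform(-2, 2, size=(100, 1))
y = 3 * x - 2 + rng.normal(size=(100, 1))

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))       # the two values should agree
```

A value near 1 means the fitted line explains most of the variation in y.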

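Simple linear regression example: coefficient of determination

Let's build data of the form y = ax + b and solve the regression problem.

Here we add normally distributed random noise to y = 3x - 2 and estimate the slope and intercept by least squares.

# Coefficient of determination
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Training data
x = np.random.rand(100,1) # 100 random numbers between 0 and 1
x = x * 4 - 2 # x now ranges from -2 to 2

y = 3 * x - 2
# Add standard normal noise (mean 0, standard deviation 1)
y += np.random.randn(100, 1) # randn, not rand: rand would add uniform noise between 0 and 1

# Create the model
model = linear_model.LinearRegression()

# Train the model
model.fit(x, y)

# Regression coefficient, intercept, coefficient of determination
print('coefficient:', model.coef_)
print("intercept:",model.intercept_)

r2 = model.score(x,y )
print("R^2:" , r2) # coefficient of determination

# Scatter plot
plt.scatter(x, y , marker="+")
plt.scatter(x, model.predict(x), marker='o')
plt.show()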

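Simple linear regression example: coefficient of determination

Let's build data of the form y = ax^2 + b and solve the regression problem.

The rand function generates real-valued random numbers drawn uniformly between 0 and 1. Its arguments give the size of the output; passing several arguments produces an array (matrix) of that shape.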
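No listing accompanies the y = ax^2 + b case in these notes. One possible sketch is below: the model stays linear because x^2 is passed in as the single input feature. The values a = 3 and b = -2, the noise level, and the seed are assumptions chosen to parallel the previous example.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)             # seed chosen for illustration
x = rng.random((100, 1)) * 4 - 2           # uniform between -2 and 2
y = 3 * x ** 2 - 2 + rng.standard_normal((100, 1))  # assumed a = 3, b = -2, plus normal noise

model = linear_model.LinearRegression()
model.fit(x ** 2, y)                       # x**2 is the single input feature

a = model.coef_[0, 0]
b = model.intercept_[0]
print('coefficient:', a)
print('intercept:', b)
print('R^2:', model.score(x ** 2, y))
```

Fitting on x itself instead of x ** 2 would give a poor R^2 here, because the relation between x and y is not a straight line.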

Multiple linear regression example

Let's build data of the form y = ax1 + bx2 + c and solve the regression problem.

Here we add normally distributed random noise to y = 3x1 - 2x2 + 1 and estimate the regression coefficients and intercept by least squares.

# Multiple linear regression analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Generate training data for y = 3x_1 - 2x_2 + 1
x1 = np.random.rand(100, 1) # 100 random numbers between 0 and 1
x1 = x1 * 4 - 2 # change the range to -2 to 2

x2 = np.random.rand(100, 1) # same for x2
x2 = x2 * 4 - 2 # change the range to -2 to 2

y =  3 * x1 - 2 * x2 +1
y += np.random.randn(100,1) # standard normal noise

# Combine x1 and x2 column-wise into one array
x1_x2 = np.c_[x1, x2]
print(x1_x2)

# Create the model
model = linear_model.LinearRegression()

# Train the model
model.fit(x1_x2, y)

# Regression coefficients, intercept, coefficient of determination
print('coefficients:', model.coef_)
print("intercept:", model.intercept_)

r2 = model.score(x1_x2 ,y )
print("R^2:" , r2) # R^2: 0.940784796428865, so the fit is good

# Scatter plots
y_ = model.predict(x1_x2) # predict with the fitted regression equation

plt.subplot(1,2,1) # 1 row, 2 columns, first plot
plt.scatter(x1,y,marker="+")
plt.scatter(x1, y_ , marker="o")
plt.xlabel('x1')
plt.ylabel('y')

plt.subplot(1, 2, 2) # 1 row, 2 columns, second plot
plt.scatter(x2, y, marker='+')
plt.scatter(x2, y_, marker='o')
plt.xlabel('x2')
plt.ylabel('y')

plt.tight_layout()  # adjusts the layout automatically
plt.show()

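Boston house price regression analysis

# Load the boston dataset

boston = datasets.load_boston()

Boston house price data:

Index: 506 rows, columns: 14

target: stored under its own key.

The last (14th) column, the price, has no entry in feature_names.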

# Boston house price regression analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Note: load_boston was removed in scikit-learn 1.2, so this listing needs an older version
from sklearn.datasets import load_boston

# Load the boston dataset
boston = load_boston()

# Build a DataFrame from the boston dataset
bostonDF = pd.DataFrame(boston.data , columns= boston.feature_names)
# The price is not part of feature_names, so only the 13 feature columns are loaded here.
print(bostonDF) #[506 rows x 13 columns]

# The target of the boston dataset is the house price.
# Add boston.target as the PRICE column
bostonDF['PRICE'] = boston.target
print('boston data shape: ' ,bostonDF.shape ) # boston data shape: (506, 14)
print(bostonDF.head())
print(bostonDF)

# Save the boston dataset to boston.csv
bostonDF.to_csv('boston.csv', encoding='utf-8')

# For the 8 columns RM, ZN, INDUS, NOX, AGE, PTRATIO, LSTAT, RAD,
# analyze and visualize how PRICE changes as each value increases

# Visualize with subplots in 2 rows and 4 columns; axs holds 2 x 4 axes
fig, axs = plt.subplots(figsize=(16,8) , ncols=4, nrows=2) # 2 rows, 4 columns
Im_features = ['RM', 'ZN', 'INDUS', 'NOX', 'AGE', 'PTRATIO', 'LSTAT', 'RAD']
for i , feature in enumerate(Im_features):
    row = int(i/4)
    col = i % 4

    sns.regplot(x = feature, y ='PRICE', data = bostonDF, ax= axs[row][col]) # draw on the axes at [row][col]

plt.show()

# RM(방개수)와 LSTAT(하위 계층의 비율)이 PRICE에 영향도가 가장 두드러지게 나타남.
# 1. RM(방개수)은 양 방향의 선형성(Positive Linearity)이 가장 크다.
# 방의 개수가 많을수록 가격이 증가하는 모습을 확연히 보여준다.
# 2. LSTAT(하위 계층의 비율)는 음 방향의 선형성(Negative Linearity)이 가장 크다.
# 하위 계층의 비율이 낮을수록 PRICE 가 증가하는 모습을 확연히 보여준다.

#전체 데이트를 나누어서 처리 7:3 정도 나누어서 처리한다.
#random_state= 156 seed와 같은 역할

#overfitting을 줄이기 위한 한가지 방법은 학습데이터와 test데이터를 줄이는 것
# LinearRegression 클래스를 이용해서 보스턴 주택 가격의 회귀 모델을 만들어 보자
# train_test_split()을 이용해 학습과 테스트 데이터셋을 분리해서 학습과 예측을 수행핚다.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#x_Data price제외한 데이터
#x_test price 만 있는 데이터
# train은 학습을 할때 사용하는 것
#test는 예측 할때 사용한다. 얼마나 예측 잘 했는지
#일차의 교차검정
#overftting은 트래인 데이터가 잘 되는 것
#axis = 1 이면 컬럼 0이면 행
#x_data 13개 컬럼 -> train을 하는 것이 다.
#train_test_split -> (13개 정보 , price, 데이터 크기를 30%로 할당 테스트 데이터 뒤쪽에 있는 데이터, )
#학습을 시키기 위해서 학습 데이터 x_train ,  y_train 13개 컬럼이 학습에 얼마나 영향을 주는지
#예측을 할때나 평가를 할 때는 test데이터 가지고 한다.
#모듈을 얼마나 잘했는지 test데이터를 한다.
#overfitting 줄이기 위해서

y_target = bostonDF['PRICE']

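# Drop the PRICE column from the DataFrame.
x_data = bostonDF.drop(['PRICE'], axis=1, inplace=False)  # drop() removes the given column
# With inplace=True, drop() returns None; inplace=False is needed to get the result back.
# 13 feature columns remain.

# Training data: used to fit the model
# Test data: used to predict and to evaluate the model
# random_state works like a seed: the same split is produced on every run.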
x_train, x_test, y_train, y_test = train_test_split(x_data, y_target, test_size=0.3, random_state=156)
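#print('x_train:', x_train)  # training features
#print('x_test:', x_test)    # features used for validation and prediction
#print('y_train:', y_train)
#print('y_test:', y_test)

#from sklearn.linear_model import LinearRegression
# Linear regression: create the model / fit / predict / evaluate
model = LinearRegression()  # create the model

# Fit the model
model.fit(x_train, y_train)  # learn how the 13 columns affect the price

# Predict with the fitted model
y_predict = model.predict(x_test)  # predictions must be made on the test features
# The model has learned how the 13 columns affect the price, so it can now predict.
# Scoring on the training data instead would make the result less trustworthy.
print('y_predict:', y_predict)  # predictions for the 30% test rows (13 columns)

# Evaluate the model
# r2_score comes from sklearn.metrics: r2_score(actual values, predicted values)
print('R^2 score:', r2_score(y_test, y_predict))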
# Variance score : 0.757

# Get the regression coefficients and the intercept of the house-price model
# built with LinearRegression.
# The coefficients are available through the coef_ attribute of the LinearRegression object,
# and the intercept through its intercept_ attribute.
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

# Build a Series so the coefficients can be sorted from largest to smallest.
# A Series is one-dimensional data.
# np.round rounds to the given number of decimals.
coff = pd.Series(data=np.round(model.coef_, 1), index=x_data.columns)
print(coff.sort_values(ascending=False))  # ascending=False sorts in descending order

# RM 3.4 # number of rooms: the largest slope
# CHAS 3.0
# RAD 0.4
# ZN 0.1
# B 0.0
# TAX -0.0
# AGE 0.0
# INDUS 0.0
# CRIM -0.1
# LSTAT -0.6
# PTRATIO -0.9
# DIS -1.7
# NOX -19.8 # nitric oxide concentration

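Positive correlation and negative correlation

In a positive correlation the points rise along the diagonal from lower left to upper right.

In a negative correlation they fall the other way, from upper left to lower right.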

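The enumerate function

· Given a list, it yields each element together with its position.

· "Enumerate" means to list items one by one.

· It accepts any iterable (list, tuple, string, and also set or dictionary, although a set has no defined order) and returns an enumerate object that pairs each item with an index.

· enumerate is most often used together with a for loop.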
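
The loop over the eight Boston features relies on exactly this behavior. The following is a minimal sketch, using only the standard library, of how the index from enumerate can be turned into a (row, column) position in a 2 x 4 grid. The `positions` dictionary is a hypothetical helper introduced for illustration; it is not part of the original script.

```python
# Map a flat loop index onto a 2 x 4 subplot grid (illustrative sketch).
features = ['RM', 'ZN', 'INDUS', 'NOX', 'AGE', 'PTRATIO', 'LSTAT', 'RAD']
n_cols = 4

positions = {}  # hypothetical helper: feature name -> (row, col)
for i, feature in enumerate(features):
    row, col = divmod(i, n_cols)  # same as (i // n_cols, i % n_cols)
    positions[feature] = (row, col)

print(positions['RM'])   # (0, 0)
print(positions['AGE'])  # (1, 0)
print(positions['RAD'])  # (1, 3)

# enumerate works on any iterable and accepts a custom start value.
print(list(enumerate('abc', start=1)))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Using `divmod` keeps the row and column arithmetic in one place, which helps if the number of columns changes later.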

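Even when the slope is small, the points have to be spread evenly along the line before the feature can be said to have an effect.

The slope can be positive or negative.

The slope alone cannot settle whether a feature really affects the price.

When the data is split into training and test sets, evaluate with y_test and the predictions.

The slope and the distribution of the actual data have to be read together.

When the points follow a consistent direction, the regression line and the actual data are related; check the scatter together with the line.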
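
To make the point about slope versus scatter concrete, here is a small sketch that computes the coefficient of determination by hand as 1 - SS_res / SS_tot. The two data sets are made up for illustration: both are scored against the same line y = 2x, but one hugs the line and the other is widely scattered. The function name `r2_score_manual` is an assumption of this sketch, not a library call.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5, 6]
line = [2 * v for v in x]                  # predictions from the same line y = 2x
tight = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]   # made-up data close to the line
loose = [5, 1, 9, 5, 13, 9]                # made-up data far from the line

r2_tight = r2_score_manual(tight, line)
r2_loose = r2_score_manual(loose, line)
print(round(r2_tight, 4))  # close to 1
print(round(r2_loose, 4))  # much lower, although the slope is identical
```

The slope is 2 in both cases, yet the score differs sharply, which is why the slope and the scatter of the actual data are read together.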

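Simple regression example

Run a simple regression on the UCI (University of California, Irvine) auto MPG dataset.

Step 1. Prepare the data

Step 2. Explore the data

Step 3. Select the features (variables) for the analysis and plot them. Independent variable: weight; dependent variable: mpg (fuel efficiency)

Step 4. Split into training and validation data

Step 5. Train and validate the model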

# Import the basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the car data file into a DataFrame
df = pd.read_csv('auto-mpg.csv', header=None)
#print(df)  # columns are numbered 0 to 8: [398 rows x 9 columns]

# Set the column names
df.columns = ['mpg','cylinders','displacement','horsepower','weight','acceleration','model year','origin','name']
# mpg: fuel efficiency (miles per gallon)
print(df.head())  # print the first 5 rows

# Show up to 10 columns on output
pd.set_option('display.max_columns', 10)
print(df.head())  # print the first 5 rows
 

Step 2. Explore the data

# Check the column dtypes
print(df.info())  # the horsepower column has dtype object (string)

# horsepower is stored as strings, so it is left out of the summary statistics
# until it is converted (string -> float).

# Summary statistics
print(df.describe())
# Only columns that can be summarized numerically appear.
# Two columns are strings and are therefore missing from the output: horsepower, name

# Unique values of the horsepower column
print(df['horsepower'].unique())  # the values are strings, shown in quotes
# '?' marks a missing value. It cannot be converted to a number,
# so it has to be turned into a proper missing value first.
# Replace '?' in the horsepower column with np.nan
df['horsepower'] = df['horsepower'].replace('?', np.nan)
# Convert the horsepower column from string to float
df['horsepower'] = df['horsepower'].astype('float')

# Check the change: print the summary statistics again
# Non-numeric columns are not shown; horsepower now appears because it is numeric.
print(df.describe())  # horsepower is now included
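
The '?' handling above can be sketched without pandas to show what the two steps do: the placeholder becomes NaN, every other string becomes a float, and NaN entries are skipped when a statistic is computed. The sample values and the `to_float` helper are hypothetical, chosen only for illustration.

```python
import math

raw = ['130.0', '165.0', '?', '150.0']  # hypothetical horsepower strings

def to_float(value):
    """Return the value as a float, or NaN for the '?' placeholder."""
    return float('nan') if value == '?' else float(value)

horsepower = [to_float(v) for v in raw]

# Statistics skip NaN entries, in the same spirit as describe() ignoring missing values.
valid = [v for v in horsepower if not math.isnan(v)]
mean_hp = sum(valid) / len(valid)

print(horsepower)
print(mean_hp)
```

Calling `float('?')` directly would raise a ValueError, which is why the placeholder has to be replaced before the type conversion.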

Step 3. Select the features (variables) for the analysis and plot them

# Select the columns to analyze (mpg, cylinders, horsepower, weight)
ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]
print(ndf.head())  # print the first 5 rows
# Only the 4 columns needed for the analysis remain.

# Check the linear relationship between the independent variable (x) weight
# and the dependent variable (y) mpg with scatter plots.
# 1. Scatter plot with pandas (drawn through matplotlib)
ndf.plot(kind='scatter', x='weight', y='mpg', c='coral', figsize=(10, 5))

plt.show()

# 2. Scatter plots with seaborn
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
sns.regplot(x='weight', y='mpg', data=ndf, ax=ax1, fit_reg=True)   # with regression line
sns.regplot(x='weight', y='mpg', data=ndf, ax=ax2, fit_reg=False)  # without regression line
plt.show()

# 3. seaborn joint plots: scatter plot plus histograms
sns.jointplot(x='weight', y='mpg', data=ndf)              # no regression line
sns.jointplot(x='weight', y='mpg', data=ndf, kind='reg')  # with regression line
plt.show()

# 4. seaborn pairplot: plot every pairwise combination of the variables
sns.pairplot(ndf)
plt.show()

Step 4. Split into training and validation data

# Select the variables
x = ndf[['weight']]  # independent variable: x
y = ndf['mpg']       # dependent variable: y
# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,                # independent variable
                                                    y,                # dependent variable
                                                    test_size=0.3,    # 30% for validation
                                                    random_state=10)  # random seed
print('Number of train rows:', len(x_train))
print('Number of test rows:', len(x_test))
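
As a rough sketch of what a 7:3 split does, the code below computes the two sizes and performs a seeded shuffle split with the standard library. It assumes the test count is rounded up, which is my understanding of how scikit-learn sizes the split; `split_sizes` and `simple_split` are hypothetical helpers and do not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test count up (assumed behavior)."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def simple_split(items, test_size, seed):
    """Shuffle with a fixed seed, then cut the list into train and test parts."""
    rng = random.Random(seed)  # the seed plays the role of random_state
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train, _ = split_sizes(len(items), test_size)
    return shuffled[:n_train], shuffled[n_train:]

print(split_sizes(398, 0.3))  # (278, 120) for the 398 rows of auto-mpg
train, test = simple_split(list(range(10)), 0.3, seed=10)
print(len(train), len(test))  # 7 3
```

Because the generator is seeded, repeated calls with the same seed return the same split, which is the reason for fixing `random_state` in the script.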

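Step 5. Train and validate the model

# Fit on the train data.
# Import the model class
from sklearn.linear_model import LinearRegression

# Create the model
model = LinearRegression()

# Fit the model on the train data
model.fit(x_train, y_train)

# Coefficient of determination (R^2)
r_square = model.score(x_test, y_test)  # score() evaluates on the test data
print('R^2:', r_square)  # R^2: 0.6893638093152089

# Regression coefficient and intercept
print('Coefficient:', model.coef_)
print('Intercept:', model.intercept_)

# Plot the predicted values against the actual values
y_predict = model.predict(x_test)  # predicted values
# x is the full data; x_test holds the weight values set aside for testing,
# so y_predict is the predicted mpg for that 30%.

# Note: distplot is deprecated since seaborn 0.11; kdeplot draws the same density curve.
plt.figure(figsize=(10, 5))
ax1 = sns.distplot(y_test, hist=False, label='y')         # actual values
ax2 = sns.distplot(y_predict, hist=False, label='y_hat')  # predicted values
plt.show()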

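A regression with a single independent variable is called simple regression.

The plot shows that the error is large.

The higher the weight, the lower the mpg.

The difference is relative: the actual values lean to the left, while the predicted values lean to the right.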

Polynomial Regression

# Polynomial regression

# Step 1. Prepare the data
# Import the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the CSV file into a DataFrame
df = pd.read_csv('auto-mpg.csv', header=None)

# Set the column names
df.columns = ['mpg','cylinders','displacement','horsepower','weight', 'acceleration','model year','origin','name']

# Step 2. Explore the data
# Convert the horsepower column from string to float
df['horsepower'] = df['horsepower'].replace('?', np.nan)  # replace '?' with np.nan

df['horsepower'] = df['horsepower'].astype('float')  # convert string to float

print(df.describe())  # summary statistics
print('\n')

# Step 3. Select the features (variables) for the analysis
# Select the columns to analyze (mpg, cylinders, horsepower, weight)
ndf = df[['mpg', 'cylinders', 'horsepower', 'weight']]
print(ndf.head())

# Step 4. Split into training and validation data
# Split the ndf data into train data and test data (7:3)
x = ndf[['weight']]  # independent variable x: weight
y = ndf['mpg']       # dependent variable y: mpg (fuel efficiency)
# x and y are the two parts the data is separated into.

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

print('Train data:', x_train.shape)  # Train data: (278, 1)
print('Test data:', x_test.shape)    # Test data: (120, 1)
print('\n')

# Step 5. Train and validate the model
# PolynomialFeatures: degree=2 gives a quadratic, degree=3 a cubic

# Import the classes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)         # use terms up to degree 2
x_train_poly = poly.fit_transform(x_train)  # expand x_train into degree-2 features
print(x_train_poly)

print('Original data:', x_train.shape)           # Original data: (278, 1)
print('Degree-2 features:', x_train_poly.shape)  # Degree-2 features: (278, 3)
# The single column of x_train becomes 3 columns in x_train_poly: 1, x, x^2.

# Create the model
model = LinearRegression()

# Fit the model on the train data
model.fit(x_train_poly, y_train)

# Coefficient of determination (R^2)
x_test_poly = poly.transform(x_test)  # expand x_test with the already fitted transformer
r_square = model.score(x_test_poly, y_test)  # y_test holds the actual values
# mpg is the dependent variable, so it is never transformed,
# whether the model is linear or quadratic.

print('R^2:', r_square)

# Predict
y_hat_predict = model.predict(x_test_poly)  # predict from the test data

# Plot the train data as a scatter plot together with the predictions for the test data
fig = plt.figure(figsize=(10, 5))  # figure size
ax = fig.add_subplot(1, 1, 1)
ax.plot(x_train, y_train, 'o', label='Train Data')             # scatter of the train data
ax.plot(x_test, y_hat_predict, 'r+', label='Predicted Value')  # curve learned by the model
ax.legend(loc='best')  # the legend uses the label given to each plot call
plt.xlabel('weight')   # x-axis label
plt.ylabel('mpg')      # y-axis label
plt.show()
plt.close()

The polynomial regression fits the data more closely than the simple regression did.
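
The extra flexibility comes from the feature expansion. For a single input column, a degree-2 expansion with the default bias term produces the three columns 1, x and x^2, and the linear model is then fitted on those columns. The sketch below reproduces that expansion by hand; `poly_features` is a hypothetical helper and the weight values are made up.

```python
def poly_features(x, degree=2):
    """Expand one value into [x^0, x^1, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

weights = [2.0, 3.0]  # made-up inputs
expanded = [poly_features(w) for w in weights]
print(expanded)  # [[1.0, 2.0, 4.0], [1.0, 3.0, 9.0]]

# A linear model on the expanded columns is a quadratic in the original x:
# y_hat = c0 * 1 + c1 * x + c2 * x^2
coefficients = [1.0, -2.0, 0.5]  # hypothetical coefficients for illustration
y_hat = [sum(c * f for c, f in zip(coefficients, row)) for row in expanded]
print(y_hat)  # [-1.0, -0.5]
```

The model stays linear in its coefficients, which is why LinearRegression can still be used after the expansion.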

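# Compare the distributions of the actual values y and the predicted values y_hat
# Feed the full x data to the model and compare the predictions y_hat with the actual y
x_poly = poly.transform(x)     # expand the full x data (weight) into degree-2 features
y_hat = model.predict(x_poly)  # predict

# distplot(): histogram + kernel density estimate (deprecated since seaborn 0.11)
plt.figure(figsize=(10, 5))  # figure size
ax1 = sns.distplot(y, hist=False, label="y")                  # actual values
ax2 = sns.distplot(y_hat, hist=False, label="y_hat", ax=ax1)  # predicted values
# ax=ax1 draws both curves on the same Axes.
plt.show()

Squared term: degree = 2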


Machine Learning-2

2020. 11. 14. 18:44


import numpy as np
import pandas as pd  # pandas can build plots but cannot display them on its own; plt.show() does that
import matplotlib.pyplot as plt

# Generate 5 random numbers
data = np.random.rand(5)
print(data)

# Convert to a pandas Series
s = pd.Series(data, index=['a','b','c','d','e'], name='series')
# index 0, 1, 2, 3, 4 -> 'a', 'b', 'c', 'd', 'e'
print(s)

# Draw a pie chart
s.plot(kind='pie', autopct='%.2f', figsize=(10, 10))
plt.show()

=> pandas_pie02.py
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
# Use the 'Malgun Gothic' font so the Korean labels render correctly
matplotlib.rcParams['font.family'] = 'Malgun Gothic'
fruit = ['사과', '바나나', '딸기', '오렌지', '포도']  # apple, banana, strawberry, orange, grape
result = [7, 6, 3, 2, 2]
df_fruit = pd.Series(result, index=fruit, name='선택한 학생수')  # name: number of students who chose it
print(df_fruit)
df_fruit.plot.pie()
plt.show()

explode_value = (0.1, 0, 0, 0, 0)  # offset of each wedge from the center
fruit_pie = df_fruit.plot.pie(figsize=(5, 5), autopct='%.1f%%',
                              startangle=90, counterclock=False,
                              explode=explode_value, shadow=True, table=True)
fruit_pie.set_ylabel("")  # remove the unneeded y-axis label
fruit_pie.set_title("과일 선호도 조사 결과")  # title: fruit preference survey results
# Save the chart as an image file with dpi set to 200
plt.savefig('saveFruit.png', dpi=200)
plt.show()

The seaborn module

Titanic dataset: load the 'titanic' dataset that ships with the seaborn library and draw a variety of plots from it.

import seaborn as sns
titanic = sns.load_dataset('titanic')
The titanic dataset holds information on 891 passengers.
index : 891 ( 0 ~ 890 )
columns : 15 columns

=> seaborn_dataset.py
# Import the library
import seaborn as sns
# Load the titanic dataset
titanic = sns.load_dataset('titanic')
# Inspect the titanic dataset
print(titanic.head())
print('\n')
print(titanic.info())

Scatter plot with a regression line
Use the regplot() function to draw a scatter plot with seaborn.
sns.regplot(x='age',       # x-axis variable
            y='fare',      # y-axis variable
            data=titanic,  # data
            ax=ax1,        # first subplot
            fit_reg=True)  # show the regression line

# Scatter plots with seaborn

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('darkgrid')

# Create the figure and its subplots
fig = plt.figure(figsize=(10, 10))  # figure size
ax1 = fig.add_subplot(1, 2, 1)  # first plot in a 1-row, 2-column layout
ax2 = fig.add_subplot(1, 2, 2)  # second plot in a 1-row, 2-column layout

# Scatter plot with the linear regression line
sns.regplot(x='age',        # x-axis variable
            y='fare',       # y-axis variable
            data=titanic,   # data
            ax=ax1,         # first subplot
            fit_reg=True)   # show the regression line (default)

# Scatter plot without the linear regression line (fit_reg=False)
sns.regplot(x='age',        # x-axis variable
            y='fare',       # y-axis variable
            data=titanic,   # data
            ax=ax2,         # Axes object: second subplot
            fit_reg=False)  # hide the regression line
# Most passengers are clustered at the low fares.
plt.show()

Scatter plots of categorical data

# Scatter plots with seaborn

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('darkgrid')

# Create the figure and its subplots
fig = plt.figure(figsize=(10, 10))  # figure size
ax1 = fig.add_subplot(1, 2, 1)  # first plot in a 1-row, 2-column layout
ax2 = fig.add_subplot(1, 2, 2)  # second plot in a 1-row, 2-column layout

# Strip plot: overlap between points is not taken into account,
# so the points are not spread out.
sns.stripplot(x='class',     # x-axis variable
              y='age',       # y-axis variable
              data=titanic,  # data
              ax=ax1)        # first subplot

# Swarm plot: overlap is taken into account,
# so dense regions are spread sideways and every point stays visible.
sns.swarmplot(x='class',     # x-axis variable
              y='age',       # y-axis variable
              data=titanic,  # data
              ax=ax2)        # second subplot
# Set the titles
ax1.set_title('stripplot()')
ax2.set_title('swarmplot()')

plt.show()

Histogram and kernel density estimate

# Histogram and kernel density estimate

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('darkgrid')

# Create the figure with 3 subplots
fig = plt.figure(figsize=(15, 5))  # figure size
ax1 = fig.add_subplot(1, 3, 1)  # 1 row, 3 columns: first plot
ax2 = fig.add_subplot(1, 3, 2)  # 1 row, 3 columns: second plot
ax3 = fig.add_subplot(1, 3, 3)  # 1 row, 3 columns: third plot

# Note: distplot is deprecated since seaborn 0.11 in favor of histplot and kdeplot.
# Histogram + kernel density curve
sns.distplot(titanic['fare'], ax=ax1)

# Kernel density curve only
sns.distplot(titanic['fare'], hist=False, ax=ax2)

# Histogram only
sns.distplot(titanic['fare'], kde=False, ax=ax3)


# Set the chart titles
ax1.set_title('titanic fare - hist/kde')  # histogram / kernel density
ax2.set_title('titanic fare - kde')       # kernel density
ax3.set_title('titanic fare - hist')      # histogram

plt.show()

Heatmap

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('darkgrid')

# Use a pivot table to arrange two categorical variables as the rows and columns of a DataFrame.
# aggfunc='size' counts the rows that fall into each group,
# here the number of men and women in each passenger class.
table = titanic.pivot_table(index=['sex'], columns=['class'], aggfunc='size')

# Draw the heatmap
sns.heatmap(table,
            annot=True,      # write the passenger count in each cell
            fmt='d',         # format as integers
            cmap='YlGnBu',   # color map
            linewidths=0.5,  # width of the dividing lines
            cbar=False)      # whether to show the color bar on the right
plt.show()
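
What `aggfunc='size'` computes can be sketched with the standard library: count how many records fall into each (sex, class) pair and arrange the counts as a small table. The five passenger records below are made up for illustration and are not taken from the titanic dataset.

```python
from collections import Counter

# Hypothetical (sex, class) records, not real titanic rows.
passengers = [
    ('male', 'Third'),
    ('female', 'First'),
    ('female', 'Third'),
    ('male', 'Third'),
    ('male', 'First'),
]

counts = Counter(passengers)  # number of records per (sex, class) pair
classes = ['First', 'Second', 'Third']
sexes = sorted({sex for sex, _ in passengers})

# One row per sex, one column per class, like the pivot table fed to the heatmap.
table = {sex: [counts[(sex, cls)] for cls in classes] for sex in sexes}
for sex, row in table.items():
    print(sex, row)
```

Each cell is a plain count, which is why the heatmap annotations are formatted as integers with `fmt='d'`.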

Bar plot

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')
# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')
# Create the figure with 3 subplots
fig = plt.figure(figsize=(15, 5))  # figure size
ax1 = fig.add_subplot(1, 3, 1)  # 1 row, 3 columns: first plot
ax2 = fig.add_subplot(1, 3, 2)  # 1 row, 3 columns: second plot
ax3 = fig.add_subplot(1, 3, 3)  # 1 row, 3 columns: third plot

# Draw a bar plot
sns.barplot(x='sex', y='survived', data=titanic, ax=ax1)

# Assign the x and y variables and add hue to draw one bar per class side by side
sns.barplot(x='sex', y='survived', hue='class', data=titanic, ax=ax2)

# Assign the x and y variables; dodge=False draws the classes on top of each other at one bar position
sns.barplot(x='sex', y='survived', hue='class', dodge=False, data=titanic, ax=ax3)
# Set the chart titles
ax1.set_title('titanic survived - sex')
ax2.set_title('titanic survived - sex/class')
ax3.set_title('titanic survived - sex/class(stacked)')
plt.show()

Count plot

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')
# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')
# Create the figure with 3 subplots
fig = plt.figure(figsize=(15, 5))  # figure size
ax1 = fig.add_subplot(1, 3, 1)  # 1 row, 3 columns: first plot
ax2 = fig.add_subplot(1, 3, 2)  # 1 row, 3 columns: second plot
ax3 = fig.add_subplot(1, 3, 3)  # 1 row, 3 columns: third plot

# Count plot
sns.countplot(x='class', palette='Set1', data=titanic, ax=ax1)

# Count plot with hue='who': separate bars for each value of who (man, woman, child)
sns.countplot(x='class', palette='Set2', hue='who', data=titanic, ax=ax2)

# Count plot with dodge=False: the groups share a single bar position
sns.countplot(x='class', palette='Set2', hue='who', dodge=False, data=titanic, ax=ax3)


# Set the chart titles
ax1.set_title('titanic class')
ax2.set_title('titanic class - who')
ax3.set_title('titanic class - who(stacked)')


plt.show()

Box plot / violin plot

# Box plot / violin plot

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')

# Create the figure with 4 subplots
fig = plt.figure(figsize=(15, 10))  # figure size
ax1 = fig.add_subplot(2, 2, 1)  # 2 rows, 2 columns: first plot
ax2 = fig.add_subplot(2, 2, 2)  # 2 rows, 2 columns: second plot
ax3 = fig.add_subplot(2, 2, 3)  # 2 rows, 2 columns: third plot
ax4 = fig.add_subplot(2, 2, 4)  # 2 rows, 2 columns: fourth plot

# 1. Box plot
sns.boxplot(x='alive', y='age', data=titanic, ax=ax1)

# 2. Box plot with hue='sex': split by sex (male, female)
sns.boxplot(x='alive', y='age', hue='sex', data=titanic, ax=ax2)

# 3. Violin plot
sns.violinplot(x='alive', y='age', data=titanic, ax=ax3)

# 4. Violin plot with hue='sex': split by sex (male, female)
sns.violinplot(x='alive', y='age', hue='sex', data=titanic, ax=ax4)

plt.show()

Joint plot

# Joint plot

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')

# 1. Joint plot: scatter plot + histograms
j1 = sns.jointplot(x='fare', y='age', data=titanic)

# 2. Joint plot with a regression line: kind='reg'
j2 = sns.jointplot(x='fare', y='age', kind='reg', data=titanic)


# 3. Joint plot with hexagonal bins: kind='hex'
j3 = sns.jointplot(x='fare', y='age', kind='hex', data=titanic)


# 4. Joint plot with kernel density contours: kind='kde'
j4 = sns.jointplot(x='fare', y='age', kind='kde', data=titanic)

# Set the chart titles
j1.fig.suptitle('titanic fare - scatter', size=15)
j2.fig.suptitle('titanic fare - reg', size=15)
j3.fig.suptitle('titanic fare - hex', size=15)
j4.fig.suptitle('titanic fare - kde', size=15)

plt.show()

Plots on a grid split by conditions

# Plots on a grid split by conditions

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')

# Split the grid by condition: who (man, woman, child) and survived (0 or 1)
g = sns.FacetGrid(data=titanic, col='who', row='survived')

# Draw a histogram of age in every cell
g = g.map(plt.hist, 'age')

plt.show()

Data distribution plot

# Data distribution plot

# Import the libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset that ships with seaborn
titanic = sns.load_dataset('titanic')

# Set the style theme (5 options: darkgrid, whitegrid, dark, white, ticks)
sns.set_style('whitegrid')

# Select the columns needed for the analysis from the titanic dataset
titanic_pair = titanic[['age', 'pclass', 'fare']]

# Draw every pair of variables in a 3 x 3 grid
g = sns.pairplot(titanic_pair)

plt.show()

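What is artificial intelligence (AI)?

Artificial intelligence is the technology of implementing human abilities such as learning, reasoning, perception, and natural-language understanding in computer programs, so that a machine shows human-like intelligence.

Vision: recognizing images and video.

Machine learning is a branch of AI in which the machine itself finds knowledge or patterns in large amounts of data, learns from them, and makes predictions.

Reinforcement learning is widely used in games.

A perceptron learns by repeatedly updating its linear decision boundary as training proceeds.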
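
As a minimal sketch of that update, the code below trains a two-input perceptron on the logical AND function. Whenever a sample is misclassified, the weights and the bias move toward the correct side, which shifts the decision boundary. The function name, the learning rate of 1, and the epoch count are assumptions chosen for illustration.

```python
def train_perceptron(samples, labels, lr=1, epochs=20):
    """Classic perceptron rule: adjust weights only on misclassified samples."""
    w = [0, 0]  # one weight per input
    b = 0       # bias (shifts the boundary)
    for _ in range(epochs):
        for (x1, x2), target in zip(samples, labels):
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - predicted  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]  # logical AND, which is linearly separable

w, b = train_perceptron(samples, labels)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in samples]
print(w, b)
print(preds)  # matches labels once the boundary separates the classes
```

The rule only converges when the classes can be separated by a straight line; for data such as XOR no linear boundary exists.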

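Overfitting is what happens when a model is fitted so closely to the training data that its ability to generalize drops.

Underfitting: the model has not learned enough, so its performance is poor even on the training data and in real use.

Normal fitting (generalized fitting): the model has learned the right amount and generalizes appropriately. This is what machine learning aims for.

Overfitting: it can also occur when there is too little training data. The model performs well on the training data but worse on real data, so it needs particular care.

Among the many ways to avoid overfitting, this post looks at validation.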
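
The post does not say which validation scheme it has in mind, so the following is only one common option, k-fold cross-validation, sketched with the standard library. Each fold takes a turn as the validation set while the remaining rows are used for training. `k_fold_indices` is a hypothetical helper, not a scikit-learn function.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train, validation in folds:
    print(len(train), validation)
```

A model that scores well on its training folds but poorly on the held-out folds is showing the overfitting pattern described above.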

Installing scikit-learn

1. Install into Python

c:\> pip install scikit-learn

2. Install into Anaconda

c:\> conda install scikit-learn

Installing scikit-learn in PyCharm

Simple linear regression example

Generate data that follows y = ax + b and solve the regression problem.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Generate the training data from random numbers
x = np.random.rand(100, 1)  # 100 random numbers in [0, 1)

x = x * 4 - 2  # rescale to -2 ~ 2
y = 3 * x - 2  # y = 3x - 2

# Create the model
model = linear_model.LinearRegression()  # use the ready-made LinearRegression instead of writing a model by hand

# Fit the model
model.fit(x, y)

# Fitting finds the rule that links x and y:
# the regression coefficient (slope) and the intercept (bias).
print("Coefficient (slope):", model.coef_)
print("Intercept:", model.intercept_)

# Scatter plot of the data
plt.scatter(x, y, marker='+')
plt.show()

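Generate data that follows y = ax + b and solve the regression problem.

Here, normally distributed noise is added to y = 3x - 2, and the slope and intercept are estimated with the least-squares method.

# Simple linear regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Generate the training data: 100 random numbers between 0 and 1
x = np.random.rand(100, 1)

x = x * 4 - 2  # rescale to -2 ~ 2
y = 3 * x - 2  # without noise, y would be fully determined by x

# Add standard normal noise (mean 0, standard deviation 1) so y is no longer exact
y += np.random.randn(100, 1)

# Create the model
model = linear_model.LinearRegression()

# Fit the model
model.fit(x, y)  # learn which y goes with which x

# Print the predictions
print(model.predict(x))
print()
print(y)

# Print the regression coefficient and the intercept
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

# Predictions are made with predict().

# Scatter plot comparing predicted and actual values
plt.scatter(x, y, marker="+")  # actual data: the observed y for each x
plt.scatter(x, model.predict(x), marker='o')  # predictions of the fitted model for each x
plt.show()

# x is the training input in the range -2 ~ 2; using y = 3x - 2 as is would fix y exactly.
# y += np.random.randn(100, 1) adds noise to y, which introduces error.
# The points no longer fall exactly on the line,
# so model.fit(x, y) recovers values close to, but not exactly, 3 and -2.
# Predictions come from the model's predict() method.
# The noise is random, so the numbers differ from run to run.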
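
For a single input variable, the least-squares estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies that formula to the same kind of data as above, using only the standard library. `fit_line` and the fixed seed are assumptions for illustration; this is not how LinearRegression is implemented internally.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(-2, 2) for _ in range(100)]
clean = [3 * x - 2 for x in xs]              # y = 3x - 2 without noise
noisy = [y + rng.gauss(0, 1) for y in clean]  # standard normal noise added

print(fit_line(xs, clean))  # recovers the slope 3 and intercept -2 up to rounding
print(fit_line(xs, noisy))  # close to 3 and -2, but not exact
```

With noise-free data the formula returns the generating line; with noise the estimates land near it and tighten as more points are used.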
