
R-6

2020. 9. 5. 14:14

install.packages("nycflights13")
library(nycflights13)
data(package = "nycflights13")
The package contains five data sets.
Data sets in package 'nycflights13':

airlines                            Airline names.
airports                            Airport metadata
flights                             Flights data
planes                              Plane metadata.
weather                             Hourly weather data

flights
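# Number of rows

dim(flights)
str(flights)
nrow(flights)
flights # a tibble prints only the first few rows and then reports how many more remain
summary(flights) # does not show the number of rows

fly1 <- flights
summary(fly1)
fly1 <- as.data.frame(fly1)
summary(fly1) # still no row count; check class(fly1) and nrow(fly1) instead

# Column (variable) names of the flights data set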
colnames(flights)
names(flights)

# How many flights are recorded for January 1?
library(dplyr)
flights %>% filter(month == 1 & day == 1) %>% nrow

flights

# How many flights arrived with a delay of two hours or less?
flights %>% filter(arr_delay <= 120) %>% nrow

# How many flights were delayed by at least one hour on both departure and arrival?
flights %>% filter(dep_delay >= 60 & arr_delay >= 60) %>% nrow

# From those flights, keep flight, origin, dest, arr_delay, dep_delay, distance, air_time and arr_time as a new data set
flights_new <- flights %>% filter(dep_delay >= 60 & arr_delay >= 60) %>% select(flight,origin,dest,arr_delay,dep_delay,distance,air_time ,arr_time)
flights_new

# Sort by arrival delay ascending, then by departure delay descending
flights_new %>% arrange(arr_delay,desc(dep_delay)) # the second key only breaks ties in the first

# Compute gain and speed
flights_new %>% mutate( gain = arr_delay- dep_delay, speed = distance/air_time )

flights %>% nrow # 336776
flights <- na.omit(flights) # drop every row that contains an NA
flights %>% nrow # 327346
flights %>% summarize(delay= mean(dep_delay), delay_sd = sd(dep_delay))

# Mean departure time (dep_time) for each of the 12 months
flights
flights %>% group_by(month) %>% summarise(dep_mean = mean(dep_time))
# Without the name dep_mean, the result column would be called `mean(dep_time)`.


flights %>% distinct(month)
str(flights)
summary(flights)
summary(airquality)
library(ggplot2)
summary(mpg)

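Crawling -> data preprocessing
What is crawling?
- Also called scraping: fetching a web page as it is and extracting data from it.
Is crawling illegal?
- Keeping crawled data on your own hard disk is described as legal.
Once you distribute it, however, legality depends on the circumstances.
- Legitimate crawling does not go against the wishes of the site operator.
- Unlawful crawling goes against the wishes of the site operator or violates the law.
- Crawled data includes the source of the web page. That source is a copyrighted web-programming work,
and copying it unlawfully is illegal.
Basic concepts for crawling
- Server: provides information according to fixed rules when it receives a request from outside.
- Browser: shows the user what the server sends.
- A web server delivers text (HTML, CSS, JS, etc.) and images, which the browser displays to the user.
What you need

1. Google Chrome 2. SelectorGadget 3. Excel

• Steps

1. Install Google Chrome.

2. Go to https://selectorgadget.com/ and install it as a bookmarklet.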

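3. Open Daum.
4. Search for a keyword.
5. Select the type of content you want (e.g. news, blogs, cafes).
6. Go to page 2 and then back to page 1 (this makes the page parameter appear in the URL).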

install.packages("rvest")
library(rvest)
install.packages("stringr")
library(stringr)

title = c() # start empty and append to it
body = c()

# The page number at the end of the URL has to change automatically.
url_base ="https://search.daum.net/search?w=news&q=%EA%B5%AD%ED%9A%8C&DA=PGD&spacing=0&p="
url_crawl = paste(url_base) # paste() joins strings; with one argument it returns it unchanged
# url_crawl = paste(url_base, i) # join the URL and i
print(url_crawl)
hdoc = read_html(url_crawl)
hdoc

for(i in 1:10){ # i runs from 1 to 10
  url_crawl = paste(url_base, i , sep ="") # append the page number with no separator
  print(url_crawl)
  # ".f_link_b" selects the headline and ".desc" the summary; a leading "." denotes a CSS class.
  # The class names can be found with F12 (developer tools) or with SelectorGadget in Chrome.

  hdoc = read_html(url_crawl) # download and parse the page at that URL

  t_node = html_nodes(hdoc, ".f_link_b")
  b_node = html_nodes(hdoc, ".desc")

  title_part = html_text(t_node) # extract the text of each node
  body_part = html_text(b_node)

  title = c(title,title_part) # title is a vector, so c() appends to it
  body = c(body,body_part)
}
news = cbind(title,body)
news
write.csv(news,"crawltest.csv")
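
The URL construction used in the loop above can be checked without touching the network. This is a minimal sketch, assuming the same trailing `p=` page parameter; the shortened `url_base` and the name `urls` are illustrative only.

```r
# Build all ten page URLs up front instead of inside the loop.
# url_base is shortened for illustration; only the trailing "p=" matters.
url_base <- "https://search.daum.net/search?w=news&q=R&p="
pages <- 1:10
urls <- paste0(url_base, pages)   # paste0() is paste() with sep = ""

head(urls, 2)
length(urls)
```

Each element of `urls` could then be passed to read_html() in turn, exactly as the loop does with url_crawl.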

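Text mining
Frequency analysis • Sentiment analysis • Korean text processing with KoNLP
KoNLP needs rJava for Korean text processing,
so rJava (and therefore Java) must be installed first.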

library(KoNLP)
library(wordcloud2)

useSejongDic()

text = readLines("textMining\\ahn.txt")
text

nouns <- extractNoun(text) # extract only the nouns
nouns
# Help (F1): extractNoun extracts nouns from a Korean sentence using the Hannanum analyzer.
# F2 shows the function body, which uses HannanumObj from the Korean-language package.
#"그렇습니다. 미래는 지금 우리 앞에 있습니다. " ->"미래" "우리" "앞"

nouns <- unlist(nouns) # flatten the list into a one-dimensional vector
nouns

nouns <- nouns[nchar(nouns)>=2] # keep only nouns with two or more characters
nouns # one-character nouns have been dropped

# Count frequencies
wordFreq <- table(nouns)
wordFreq <- sort(wordFreq, decreasing = T) # most frequent first
wordFreq <- head(wordFreq, 20)
wordFreq

wordFreq <- table(nouns) %>% sort(decreasing = T) %>% head(20) # the same steps as one pipeline
wordFreq
wordcloud2(wordFreq,fontFamily = '맑은 고딕')

useSejongDic()
nouns <- readLines("textMining\\leesungman.txt",encoding = "UTF-8") %>% extractNoun() %>% unlist()
nouns <- readLines("textMining\\leesungman.txt") %>% extractNoun() %>% unlist()
nouns <- nouns[nchar(nouns)>= 2]
nouns
wordFreq <- table(nouns) %>% sort(decreasing = T) %>% head(20)
wordcloud2(wordFreq,fontFamily = '맑은 고딕')
# Other files to try: jeonduhwan.txt, kimdaejung.txt
# leesungman.txt gives the warning: incomplete final line found on 'textMining\leesungman.txt'
# roh.txt, parkjunghee1.txt and leesungman.txt can be read without extra options.

# Find the most frequently used words
txt = readLines("news.txt")
head(txt)

library(stringr)
extractNoun("대한민국의 영토는 한반도와 그 부속서로 한다.")
nouns <- extractNoun(txt)

wordcount <- table(unlist(nouns))
df_word <- as.data.frame(wordcount,stringsAsFactors = F)

df_word <- rename(df_word, word= Var1, freq = Freq)
df_word <- filter(df_word,nchar(word)>=2)

top20 <- df_word %>% arrange(desc(freq)) %>% head(20) 
top20

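# useNIADic() can be used instead of useSejongDic()
txt <- str_replace_all(txt,"\\W"," ") # regular expression: replace non-word characters with a space (apply this before extractNoun())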

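Frequency analysis is used very often.
Sentiment analysis
Sentiment analysis is a technique that analyzes the consumer sentiment expressed on websites and social media and reprocesses it into useful information.
A sentence is rated positive or negative according to how often positive and negative words appear in it.
Text written by a person contains a topic, which is the main subject of the text, and the writer's opinion about that topic.
Sentiment analysis identifies the writer's opinion about the topic, so it is also called opinion mining.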
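
The word-counting idea described above can be sketched in a few lines of base R. This is only an illustration: the two word lists are tiny made-up lexicons, and `score_sentence` is a hypothetical helper, not part of any package used in these notes.

```r
# Score a sentence as (number of positive words) - (number of negative words).
# pos_words and neg_words are made-up lexicons, for illustration only.
pos_words <- c("good", "great", "love")
neg_words <- c("bad", "poor", "hate")

score_sentence <- function(sentence) {
  # lower-case the text and split on anything that is not a letter
  words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

sentences <- c("I love this phone, great camera",
               "bad battery and poor screen",
               "good price but bad support")
scores <- sapply(sentences, score_sentence, USE.NAMES = FALSE)
scores   # positive, negative, and neutral in turn
```

A real analysis would use a proper sentiment lexicon and, for Korean text, nouns or morphemes extracted with KoNLP instead of a whitespace split.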

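list.of.packages <- c("","","","") # fill in the package names to check
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])] # those not yet installed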

install.packages("twitteR")
install.packages("ROAuth")
install.packages("plyr")
install.packages("stringr")
library(twitteR)
library(ROAuth)
library(plyr)
library(stringr)

# Keys
API_key <- ""
API_secret <- ""
access_token <- ""
access_secret <- ""

# Objects can be saved to a file and reloaded later: save(df_midterm,file="df_midterm.rda") (the object from the day-1 session)
#load("df_midterm.rda")

setup_twitter_oauth(consumer_key = API_key, consumer_secret = API_secret,access_token = access_token,access_secret = access_secret)


# Busiest airports in the world
install.packages("ggmap")
library(ggmap)
library(ggplot2)

html.airports <- read_html("https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic")
html.airports
df <- html_table(html_nodes(html.airports, "table")[[1]],fill= T) # [[1]] takes the first table from the list
# html_nodes() selects the matching nodes
# html_table() converts a table node into a data frame

head(df)
head(df$Airport)
colnames(df)
colnames(df)[6] <- "total" # rename the sixth column to "total"
df$total
df$total <- gsub(',','',df$total) # gsub() replaces every match (g = global):
# each "," is found and replaced with ''.
df$total <- as.numeric(df$total) # character -> numeric
df$total

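To install the twitteR package from GitHub, first install the devtools package.
rjson, bit64 and httr must be installed as well because twitteR depends on them.
The ROAuth package is needed to obtain authorization from Twitter.

* After installation, load the packages with library(). The twitteR package is installed with install_github(),
and username is set to your own desktop user name.

* Copy and paste the API_key, API_secret, access_token and access_secret described earlier.

* Pass the four stored objects as arguments and run the function.

* Use searchTwitter() to fetch tweets matching @apple. since and until set the search period, and lang sets the language.

* 500 results were requested but only 300 were returned, because only 300 tweets matched the keyword.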

To watch on YouTube
3blue1brown
cs230
cs231n

knitr::kable(anscombe)
anscombe.1 <- data.frame(x = anscombe[["x1"]], y = anscombe[["y1"]], Set = "Anscombe Set 1")
anscombe.2 <- data.frame(x = anscombe[["x2"]], y = anscombe[["y2"]], Set = "Anscombe Set 2")
anscombe.3 <- data.frame(x = anscombe[["x3"]], y = anscombe[["y3"]], Set = "Anscombe Set 3")
anscombe.4 <- data.frame(x = anscombe[["x4"]], y = anscombe[["y4"]], Set = "Anscombe Set 4")
anscombe.data <- rbind(anscombe.1, anscombe.2, anscombe.3, anscombe.4)
aggregate(cbind(x, y) ~ Set, anscombe.data, mean)
aggregate(cbind(x, y) ~ Set, anscombe.data, sd)
model1 <- lm(y ~ x, subset(anscombe.data, Set == "Anscombe Set 1"))

model2 <- lm(y ~ x, subset(anscombe.data, Set == "Anscombe Set 2"))

model3 <- lm(y ~ x, subset(anscombe.data, Set == "Anscombe Set 3"))

model4 <- lm(y ~ x, subset(anscombe.data, Set == "Anscombe Set 4"))
library(plyr)

correlation <- function(data) {
  
  x <- data.frame(r = cor(data$x, data$y))
  
  return(x)
  
}

ddply(.data = anscombe.data, .variables = "Set", .fun = correlation)

summary(model1)

summary(model2)

summary(model3)

summary(model4)

library(ggplot2)
gg <- ggplot(anscombe.data, aes(x = x, y = y))

gg <- gg + geom_point(color = "black")

gg <- gg + facet_wrap(~Set, ncol = 2)

gg <- gg + geom_smooth(formula = y ~ x, method = "lm", se = FALSE, data = anscombe.data)

gg
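
The point of Anscombe's quartet can also be checked numerically. The following is a minimal sketch using only base R and the built-in anscombe data; `fit_set` is an illustrative helper name.

```r
# Fit y ~ x separately for each of the four Anscombe sets.
fit_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
}
coefs <- t(sapply(1:4, fit_set))   # one row per set: intercept and slope
round(coefs, 2)
```

The four rows are practically identical even though the scatter plots look completely different, which is why plotting comes before trusting summary statistics.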

# Check whether a package is available before using it
install.packages("datasauRus")
library(datasauRus)
if(requireNamespace("dplyr")){
  suppressPackageStartupMessages(library(dplyr))
  datasaurus_dozen %>% 
    group_by(dataset) %>% 
    summarize(
      mean_x    = mean(x),
      mean_y    = mean(y),
      std_dev_x = sd(x),
      std_dev_y = sd(y),
      corr_x_y  = cor(x, y)
    )
}
if(requireNamespace("ggplot2")){
  library(ggplot2)
  ggplot(datasaurus_dozen, aes(x=x, y=y, colour=dataset))+
    geom_point()+
    theme_void()+
    theme(legend.position = "none")+
    facet_wrap(~dataset, ncol=3)
}
# Install reshape2 if it is missing
if(!require(reshape2)){
  install.packages("reshape2")
  require(reshape2)
}

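Unsupervised learning: cluster analysis
1. What is cluster analysis?
2. k-means
3. Choosing a suitable value of k
k is the number of clusters.
Classification vs. clustering
Classification: when the number of categories and the category of each object are known in advance, infer the category from the input variables and assign each new object
to the most suitable category (supervised learning).
Clustering: when the number of groups and their properties are not known in advance, search for the best partition of the data (unsupervised learning).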
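
The contrast can be made concrete with the built-in iris data. A minimal sketch under simple assumptions (k = 3 and the variable names `x` and `km` are chosen here for illustration): k-means never sees the Species column, which is used only afterwards for comparison.

```r
# Cluster the four iris measurements without using Species (unsupervised).
set.seed(123)                      # k-means starts from random centers
x <- scale(iris[, 1:4])            # standardize so no variable dominates
km <- kmeans(x, centers = 3, nstart = 10)

# The labels are used only afterwards, to see how clusters line up with species.
table(cluster = km$cluster, species = iris$Species)
```

A classifier, by contrast, would be given Species during training and asked to predict it for new rows.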
# Regression problem

if(!require(caret)){
  install.packages("caret")
  require(caret)
}

data("iris")
View(iris)

set.seed(123)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = F)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
training
# Standardize
training.data <- scale(training[-5]) # drop Species and put the four measurements on a common scale
training.data
summary(training.data)

iris.kmeans <- kmeans(training.data, centers = 3, iter.max = 10000)
iris.kmeans$centers

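Decision trees
Decision trees work on structured (tabular) data and are the basis of the tree family of models.
Example: a decision tree that classifies various shapes
How many splits does it take before the black shapes can be isolated?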

if(!require(NbClust)){
  install.packages("NbClust")
  require(NbClust)
}
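# The number of clusters is not fixed in advance; you are free to choose the range to try.
nc <- NbClust(training.data, min.nc = 2, max.nc = 15, method = "kmeans")

barplot(table(nc$Best.n[1,]),xlab ="Number of Clusters",ylab ="Number of chosen",main="Number of Clusters chosen")

training.data <- as.data.frame(training.data) # convert the scaled matrix back to a data frame
training$cluster <- as.factor(iris.kmeans$cluster) # use the k-means cluster as the label to learn
modFit <- train(x = training.data, y = training$cluster, method="rpart")
# rpart classifies with a decision tree.
# train() refits the model repeatedly on resampled data (25 bootstrap resamples by default).
# Split the data, build a decision-tree model, predict,
# and see how far the predictions are from the true labels.
testing.data <- as.data.frame(scale(testing[,-5]))
testingclusterPred <- predict(modFit, testing.data)
table(testingclusterPred,testing$Species)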

Entropy: split so that purity increases
Sort by another variable, such as income, and repeat the same procedure.
Split at the point where entropy is lowest.
Keep splitting to increase purity.
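
Entropy as a purity measure can be computed directly. A minimal sketch in base R; `entropy` is an illustrative helper written here, not a function from rpart or caret.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  p <- p[p > 0]                         # avoid 0 * log2(0)
  -sum(p * log2(p))
}

entropy(c("a", "a", "b", "b"))   # evenly mixed node: highest entropy for two classes
entropy(c("a", "a", "a", "a"))   # pure node: entropy 0
```

A tree-building algorithm evaluates candidate splits and prefers the one whose child nodes have the lowest combined entropy.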

Hierarchical methods
A dendrogram displays the result of hierarchical clustering; it can be drawn with ggplot.

kNN analysis

ribbon
#ribbon
huron <- data.frame(year = 1875 : 1972 , level = as.vector(LakeHuron))
huron
ggplot(data = huron, aes(x= year))+geom_area(aes(y = level))
p <- ggplot(data = huron, aes(x = year))
p <- p+geom_area(aes(y = level))
p + coord_cartesian(ylim = c(570,590)) # set y limits to zoom in on the range of the data


p <- ggplot(data = huron, aes(x= year))
p <- p+geom_area(aes(y = level))
p +coord_cartesian(ylim = c(min(huron$level)-2,max(huron$level)+2))

 
p <- ggplot(huron,aes(x= year))+geom_ribbon(aes(ymin= min(level)-2,ymax = level+2))


p <- ggplot(huron,aes(x = year,fill="skyblue"))+geom_ribbon(aes(ymin = min(level)-2, ymax = level+2,fill="skyblue"),color="skyblue")

p <- ggplot(huron, aes(x=year,fill="skyblue"))
p + geom_ribbon(aes(ymin=level-2, ymax=level+2 ,fill="skyblue"), colour="blue")

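kNN -> works by computing distances
kNN (k-Nearest Neighbors): a new data point is assigned to the group whose members are closest to it -> a classification algorithm

k-nearest-neighbor classification
Overfitting means fitting the training data too closely; underfitting means not learning enough. Either can happen during training.
To keep the model general, a suitable value of k has to be chosen.
Euclidean distance: the L2 distance, i.e. the straight-line distance
Manhattan distance: the L1 distance, used when the path is not a straight line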
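
The two distances can be written as one-line functions. A minimal sketch in base R; the helper names `euclidean` and `manhattan` are illustrative.

```r
euclidean <- function(a, b) sqrt(sum((a - b)^2))   # L2: straight-line distance
manhattan <- function(a, b) sum(abs(a - b))        # L1: sum of axis-wise steps

p <- c(0, 0)
q <- c(3, 4)
euclidean(p, q)
manhattan(p, q)

# Base R's dist() computes the same distances between rows of a matrix.
dist(rbind(p, q), method = "euclidean")
dist(rbind(p, q), method = "manhattan")
```

kNN computes such a distance from the new point to every training point and lets the k closest ones vote on the class.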


R-5

2020. 9. 5. 14:04

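When outliers exist, first recognize them,
then decide whether to remove them or to change their range.
boxplot
Reading a box plot, from top to bottom:
extreme values (outliers)
upper outlier boundary
upper whisker: the top 25% (75th to 100th percentile)
third quartile (Q3): 75th percentile
second quartile (Q2): 50th percentile, the median
first quartile (Q1): 25th percentile
lower whisker
lower outlier boundary

Values are expected to fall inside the outlier boundaries;
a value outside them is an outlier.
The whiskers are drawn only from data that actually exist: each whisker ends at the most extreme observed value inside the boundary (for the lower whisker, the smallest such value),
because the plot uses only the observed data.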
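
The outlier boundaries can be computed by hand. A minimal sketch, assuming the common rule of 1.5 times the interquartile range beyond the quartiles; the small `score` vector is the same toy data used in the outlier example of these notes.

```r
score <- c(5, 4, 3, 4, 2, 26)

q <- quantile(score, c(0.25, 0.75))   # Q1 and Q3
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr             # lower outlier boundary
upper <- q[2] + 1.5 * iqr             # upper outlier boundary

outliers <- score[score < lower | score > upper]
outliers

# boxplot.stats() applies the same kind of rule (using hinges) and reports the points drawn as outliers.
boxplot.stats(score)$out
```

Only the value 26 falls outside the boundaries, which matches the single point drawn beyond the whisker by boxplot(score).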

library(ggplot2)
boxplot(mpg$hwy)
mpg$hwy <- ifelse(mpg$hwy < 12 | mpg$hwy > 37, NA, mpg$hwy)
table(is.na(mpg$hwy))

library(dplyr)
outlier <- data.frame(sex= c(1,2,1,3,2,1) , score= c(5,4,3,4,2,26))
outlier

table(outlier$sex)
table(outlier$score)

outlier$sex <- ifelse(outlier$sex == 3, NA, outlier$sex) # 3 is a coding error rather than an outlier
outlier

outlier$score <- ifelse(outlier$score>5 , NA, outlier$score) # treat scores above 5 as missing
outlier

outlier <- data.frame(sex= c(1,2,1,3,2,1) , score= c(5,4,3,4,2,26))
boxplot(outlier$score)

outlier %>% filter(!is.na(sex) & !is.na(score)) %>% 
  group_by(sex) %>% 
  summarise(mean_score = mean(score))

mpg

ggplot(mpg,aes(drv, hwy))+geom_boxplot()
ggplot(mpg,aes(y =hwy))+geom_boxplot()
ggplot(mpg,aes(,hwy))+geom_boxplot()
mpg$drv
boxplot(mpg$drv) # fails: drv is not numeric
ggplot(data = mpg, aes(drv))+ geom_bar()

ggplot(mpg,aes(x = manufacturer, y = displ,colour = manufacturer,fill= "fill")) + geom_boxplot()
ggplot(mpg,aes(x = displ, y = manufacturer,colour = displ,fill= "fill")) + geom_boxplot()
# y must be numeric so that the boxes can have a height.

Using na.rm = T is the recommended approach.

tibble -> data.frame (the basic type)

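Day 3: case-study practice / text mining
Analysis of the Korea Welfare Panel Study data
Salary differences by gender
Job frequency by gender
Divorce rate by religion

Salary differences by gender
- "How much does salary differ by gender?"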

install.packages("foreign")
library(foreign) # a recommended package that normally ships with R
library(dplyr)
library(ggplot2)
raw_welfare <- read.spss(file="Koweps_hpc10_2015_beta1.sav")
welfare <- as.data.frame(raw_welfare) # convert to a data frame so it is easier to work with
str(welfare) # variable types and structure
glimpse(welfare)
head(welfare)
tail(welfare)
summary(welfare)

Storing values in a list does not change their data types.
A Python list is used much like an R vector, although an R vector holds only one type.

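# Either of the two approaches below works. Run only one of them: afterwards the original column names no longer exist.
# 1.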
welfare <- welfare %>% 
  rename( gender = h10_g3, birth = h10_g4,marriage = h10_g10, religion = h10_g11,
          income = p1002_8aq1, job = h10_eco9,
          region= h10_reg7) %>% 
  select(gender, birth, marriage, religion, income, job ,region)
welfare

#2.
welfare <- welfare %>% 
  select(gender = h10_g3, birth = h10_g4,marriage = h10_g10, religion = h10_g11,
         income = p1002_8aq1, job = h10_eco9,
         region= h10_reg7)
welfare

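str(welfare) # check the types; the variables are numeric
summary(welfare) # gender looks odd as a number; the NA count of each variable is also shown
# Plot everything as numbers
plot(welfare) # the upper and lower triangles mirror each other, so read the lower one
pairs(job ~ income+gender+region, data = welfare)
# A variable whose points fall on a few lines (like bars) takes only values such as 1 or 2:
# it is categorical, not numeric.
boxplot(welfare)
# income can be examined separately
sum(is.na(welfare)) # 21165
sum(welfare, na.rm = T) # 38498929
colSums(is.na(welfare)) # NA count per column
summary(welfare$income)
mean(welfare$income) # NA, because income contains missing values
mean(welfare$income,na.rm = T) # 241.619
mean(is.na(welfare$income)) # 0.7219155, the share of missing values

range(welfare$income)
range(welfare$income,na.rm = T)
welfare$income <- ifelse(welfare$income == 0 , NA, welfare$income)
# When analyzing income, decide on the valid range:
# many respondents are assumed to have answered 0 even though their income is not 0.
summary(welfare$income)
plot(welfare$income) # the x axis (index) is the row number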
install.packages("psych")
library(psych)
describe(welfare) # describe() shows summary statistics
# from a different angle than summary()

str(welfare) # 16664 rows
ggplot(data = welfare, aes(x= income))+geom_density() # density
# The distribution is hard to characterize with the mean alone.
#Removed 12044 rows containing non-finite values (stat_density). 
ggplot(data = welfare, aes(x = income))+geom_freqpoly()
#Removed 12044 rows containing non-finite values (stat_bin)

summary(welfare$gender)
welfare$gender <- ifelse(welfare$gender == 1, 'M','F')
summary(welfare$gender) # previously coded as 1 and 2, now a character vector
table(welfare$gender) # frequency counts
ggplot(data = welfare, aes(x = gender))+geom_bar()
ggplot(data = welfare, aes(x = gender, colour = gender))+geom_bar() # colour sets the outline
ggplot(data = welfare, aes(x = gender,fill =gender))+geom_bar() # fill sets the inside of the bars
ggplot(data = welfare, aes(x = gender))+geom_bar(aes(fill =gender)) # the same, with fill mapped in the geom
barplot(table(welfare$gender), xlab="gender" , ylab = "count",col=rainbow(3))  # barplot() needs counts, so pass it the result of table()

welfare %>% select(gender, income) %>% group_by(gender) %>% summarise(평균= mean(income, na.rm = T))
welfare %>% filter( !is.na(income)) %>% group_by(gender) %>% summarise(평균= mean(income))
data_gender <- welfare %>% group_by(gender) %>% summarise(평균=mean(income,na.rm = T))
data_gender 
welfare %>% select(gender, income) %>% group_by(gender) %>% summarise(평균= mean(income, na.rm = T)) %>% 
  ggplot(aes(x= gender, y =평균,fill=gender))+geom_bar(stat = 'identity')
ggplot(data=data_gender, aes(x= gender, y =평균,fill = gender))+geom_bar(stat = 'identity')
# stat = 'identity' is required here; without it geom_bar() fails with: stat_count() must not be used with a y aesthetic

welfare %>% select(gender, income) %>% ggplot(aes(x= income, color =gender))+geom_density()

Income differences by age
- "At what age is income highest?"

class(welfare$birth)
summary(welfare$birth)
qplot(welfare$birth)
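# Avoid qplot(); use ggplot() instead.
boxplot(welfare$birth) # a box plot is a quick way to check for outliers
sum(is.na(welfare$birth)) # count the missing values
welfare$age <- 2015 - welfare$birth +1
# Assigning to a column that does not exist creates it.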
summary(welfare$age)
plot(welfare$birth)
plot(table(welfare$birth))
barplot(welfare$birth)

welfare
# Summarize the existing data: mean income for each age.
age_income <- welfare %>% filter(!is.na(income)) %>% group_by(age) %>% summarise(mean_income= mean(income))
head(age_income)
ggplot(data=age_income, aes(x= age, y =mean_income))+geom_line()
ggplot(data = age_income ,aes(x= age,y=mean_income))+geom_point()
ggplot(data = age_income ,aes(x= age,y=mean_income))+geom_point(size=2, color= "blue")
ggplot(data = age_income ,aes(x= age,y=mean_income))+geom_point(size=2, color= "blue")+geom_line()
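# The geoms are layers, so they are drawn on top of each other.

Mean income by age group (generation)
# Order matters: later layers are drawn over earlier ones.

# By generation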
install.packages("KoNLP")
library(KoNLP)#Checking user defined dictionary!

useSejongDic()

library(wordcloud2)
text1 = readLines("ahn.txt")
text1

#3: None
library(ggplot2)
library(dplyr)
welfare <- welfare %>% mutate(age_gen = ifelse(age <30,"young",ifelse(age<= 40,"g3",ifelse(age<=50 ,"g4",ifelse(age<= 60,"g5","old")))))
head(welfare,10)

table(welfare$age_gen)
qplot(welfare$age_gen)

age_gen_income <- welfare %>% group_by(age_gen) %>% summarise(mean_income = mean(income,na.rm = T))
age_gen_income

ggplot(data = age_gen_income,aes(x = age_gen, y = mean_income))+geom_col()
ggplot(data = age_gen_income, aes(x= age_gen, y = mean_income))+geom_bar(stat= "identity")
ggplot(data = age_gen_income, aes(x= age_gen, y = mean_income,fill=age_gen))+geom_bar(stat= "identity")
ggplot(data = age_gen_income, aes(x= age_gen, y = mean_income))+geom_bar(stat= "identity",aes(fill=age_gen))
ggplot(data = age_gen_income,aes(x= age_gen, y= mean_income))+geom_col(aes(fill=age_gen))+scale_x_discrete(limits= c("young","g3","g4","g5","old")) # scale_x_discrete(limits = ...) fixes the order of the bars

Income differences by age and gender
- "How does income differ by age and gender?"


gender_income <- welfare %>% group_by(age_gen,gender) %>% 
  summarise(mean_income = mean(income,na.rm = T))
gender_income

ggplot(data = gender_income , aes(x = age_gen , y = mean_income, fill= gender))+geom_col()+scale_x_discrete(limits= c("young","g3","g4","g5","old"))
ggplot(data = gender_income , aes(x = age_gen , y = mean_income, fill= gender))+geom_col(position= "dodge")+scale_x_discrete(limits= c("young","g3","g4","g5","old")) # bars side by side instead of stacked

Neural networks
require() is a little more convenient: it reports when a package is missing instead of stopping with an error.

data(iris)
str(iris) # iris flower data -> variables
#'data.frame':	150 obs. of  5 variables:
#$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... (sepal)
#$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... (petal)
#$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... (iris species)
temp <- c(sample(1:50, 30),sample(51:100,30),sample(101:150,30)) # 30 random rows from each species
temp
iris.training <- iris[temp,]
iris.testing <- iris[-temp,]
library(nnet)
neuralNetResult <- nnet(Species~., data= iris.training, size = 3, decay =0 ) # predict Species from the length and width variables
neuralNetResult

summary(neuralNetResult)
# Method 1
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
install.packages("reshape")
library(reshape)

# Method 2
library(clusterGeneration)
library(scales)
library(reshape)

plot.nnet(neuralNetResult)

pred <- predict(neuralNetResult, iris.testing, type = "class")
pred
real <- iris.testing$Species
table(real, pred)
summary(neuralNetResult)

You need to know the iris data well.

ggplot2::diamonds # diamonds comes with the ggplot2 package
datasets::iris # part of the default datasets package, usable without installing anything

install.packages("ggplot2movies")
require(ggplot2movies)
movies
dim(movies)

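The first neural network was Rosenblatt's perceptron.
The perceptron is tied to the XOR problem: a single-layer perceptron cannot solve it, while a multilayer one can.
How large is the data that goes in at the input?
The activation is a nonlinear function.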
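
A perceptron is small enough to write out in full. This is a minimal sketch of the classic perceptron learning rule in base R, trained on logical AND; the learning rate, the 20 epochs and all variable names are choices made for this illustration.

```r
# Perceptron with a step activation, trained on logical AND.
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 0, 0, 1)

w <- c(0, 0)    # weights
b <- 0          # bias
lr <- 1         # learning rate
step <- function(z) as.numeric(z > 0)   # nonlinear activation

for (epoch in 1:20) {
  for (i in seq_len(nrow(X))) {
    err <- y[i] - step(sum(w * X[i, ]) + b)   # 0 when the prediction is right
    w <- w + lr * err * X[i, ]                # move the weights toward the target
    b <- b + lr * err
  }
}

pred <- step(X %*% w + b)
pred
```

With XOR targets, c(0, 1, 1, 0), the same loop never settles, because no single straight line separates the two classes.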
# Handwritten digits

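train <- read.csv("mnist_train.csv")
train[10,-1] # row 10 without the first column
train[10,] # the first column holds the label
train[10,1] # the label of row 10: [1] 3
# So row 10 is an example of the digit 3.
train

dim(train)

# Each image is 28 * 28 = 784 pixels; the weights w are adjusted for these 784 inputs.
m = matrix(unlist(train[10, -1]), nrow = 28, byrow =T) # the values are grayscale intensities (0-255); a new row starts every 28 values
# The first column is the label 3; the rest are pixel intensities.
m

image(m, col = grey.colors(255)) # display the matrix as an image
write.csv(m,"mnist3.csv")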

rotate <- function(x) t(apply(x, 2, rev))
par(mfrow = c(2,3))
lapply(1:6, function(x) image(
                      rotate(matrix(unlist(train[x,-1]),nrow = 28, byrow = TRUE)),
                      col= grey.colors(255),
                      xlab=train[x,1]
                      ))


par(mfrow = c(1,1))

# Load the caret library
install.packages("caret")
library (caret)

# createDataPartition() splits the data into training and test sets.
inTrain <- createDataPartition(train$label, p = 0.8, list=F)
#caret::createDataPartition(
#  y,          # class (or label) vector
#  times=1,    # number of partitions to create
#  p=0.5,      # proportion of the data that goes to training
#  list=TRUE,  # return a list? If FALSE, a matrix is returned.
#)
# Here p = 0.8, so 80% of the rows go to training.
training<-train[inTrain,] # training rows
testing<-train[-inTrain,] # remaining rows for testing
training
testing

write.csv(training, file ="train-data.csv", row.names = F)
write.csv(testing, file ="test-data.csv", row.names = F)

install.packages("h2o")
library(h2o)
# Derivative function

local.h2o <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, nthreads=-1)
training <- read.csv("train-data.csv")
testing <- read.csv("test-data.csv")
training
testing

training[,1] <- as.factor(training[,1]) # the label is categorical

trData <- as.h2o(training)
trData[,1] <- as.factor(trData[,1])
tsData <- as.h2o(testing)
tsData[,1] <- as.factor(tsData[,1])

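# Measure the elapsed time
start <- proc.time()
model.dl <- h2o.deeplearning(x = 2:785,
                             y = 1,
                             training_frame = trData,
                             activation = "Tanh", # other activations such as "Rectifier" can be used
                             hidden=rep(160,5),
                             epochs = 20)
end <- proc.time()

diff=end-start
print(diff)

# The prediction step was missing from the notes; the lines below are an assumed reconstruction.
pred.dl <- h2o.predict(object = model.dl, newdata = tsData[, -1])
pred.dl.df <- as.data.frame(pred.dl)
test_labels <- testing[, 1] # assumed: the true labels are the first column of testing

str(pred.dl.df)
pred.dl.df$predict
pred.dl.df[1,1]
table(test_labels, pred.dl.df$predict) # true vs. predicted digits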


R-4

2020. 9. 5. 13:59

library(dplyr)
library(ggplot2)
mpg
data1 <- mpg %>% filter(displ <= 4) %>% summarise(mean(hwy))
data1
data2 <- mpg %>% filter(displ >= 5) %>% summarise(mean(hwy))
data2 

data <- mpg %>% filter( displ <= 4 | displ >= 5) %>% mutate(group1 = ifelse(displ <= 4, 'a', ifelse(displ>=5,'b','c'))) %>% group_by(group1) %>% summarise(mean(hwy))
data

mpg %>% filter()

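# Printing mpg shows only the first 10 rows.
head(airquality, 20)
head(mpg, 30) # still a tibble, so only 10 rows are printed

# Convert between types: as_tibble() makes a tibble, as.data.frame() a plain data frame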
airquality <- as_tibble(airquality)
airquality

mpg

mpg %>% filter(manufacturer=='audi' | manufacturer=='ford') %>% group_by(manufacturer) %>% summarise(test=mean(cty)) %>% arrange(desc(test)) 

mpg_new <-mpg %>% mutate(total = (hwy+cty)/2) %>% arrange(desc(total)) %>% head()        
d2 <-mpg %>% mutate(total = (hwy+cty)/2) %>% arrange(desc(total))  %>% head()  
d2 <- as.data.frame(d2)
arrange(d2,desc(total))

mpg_new <- mpg %>% select('class','cty')
mpg_new

mpg_new %>%  filter(class =='suv' | class =='compact') %>% group_by(class) %>% summarise(sum(cty)) 

mpg %>% select('class','cty') %>%  filter(class =='suv' | class =='compact') %>% group_by(class) %>% summarise(mean(cty))  
mpg %>% filter(class =='suv' | class =='compact')  %>% select('class','cty')  %>% group_by(class) %>% summarise(mean(cty)) 

mpg %>% filter(class =='compact') %>% group_by(manufacturer) %>% summarise(tot=n()) %>% arrange(desc(tot))

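Looking at the mpg data set
ggplot

str -> str(mpg)

tbl_df, tbl and data.frame
tbl is similar to data.frame but exists to make computation more convenient
tbl_df
The columns are typed as chr, num, int and so on.
To see the column names, use names(mpg)
names(mpg)[8] # select a name by position; this shows only cty
str(mpg)
names(mpg)
names(mpg)[8] # select a name by position; this shows only cty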

names(mpg)[8] <- 'city'
names(mpg)
mpg
names(mpg)[8] <- 'cty'
names(mpg)
mpg

dplyr::glimpse(mpg)
mpg
names(mpg)[11] # mpg has 11 columns; the last one is "class"
mpg %>% select(c(names(mpg)[11] ))

mpg <- subset( mpg, select = -11 ) # drop the last column by position
mpg

mpg <- ggplot2::mpg
mpg
ggplot(data= mpg,aes(x= displ,y=hwy))+geom_point()

ggplot(mpg,aes(displ,hwy,colour = class))+geom_point()

ggplot(mpg, aes(displ,hwy))+geom_point(aes(colour ="blue")) # inside aes(), "blue" is treated as data and mapped through the colour scale, so the points do not come out blue
ggplot(mpg, aes(displ,hwy))+geom_point(colour ="blue") # outside aes(), the colour is set directly, so the points are blue

ggplot(mpg, aes(displ,hwy,colour = class))+geom_point() # colour by class
ggplot(mpg, aes(displ,hwy,colour = trans))+geom_point()
ggplot(mpg, aes(displ,hwy,colour = drv))+geom_point()
ggplot(mpg, aes(displ,hwy,colour = cty))+geom_point()

# shape
ggplot(mpg, aes(displ,cty, shape=drv))+geom_point()
ggplot(mpg, aes(displ,cty, shape=class))+geom_point()
ggplot(mpg, aes(displ,cty, shape=trans))+geom_point()
ggplot(mpg, aes(displ,cty, shape=cty))+geom_point() # Error: A continuous variable can not be mapped to shape

# size: larger values give larger points
ggplot(mpg, aes(displ, cty, size= cty)) +geom_point()
ggplot(mpg, aes(displ, cty, size= trans)) +geom_point()

ggplot(mpg, aes(displ, cty, size= cty)) +geom_point(colour ="red")
ggplot(mpg, aes(displ, cty, size= cty)) +geom_point(colour= cty) # Error: object 'cty' not found (a column can only be referenced inside aes())
ggplot(mpg, aes(displ, cty, size= cty)) +geom_point(aes(colour= cty))

# What does the plot look like if size and colour are mapped to different variables?
ggplot(mpg, aes(displ ,cty, size = cty , color= drv))+geom_point()

ggplot(mpg,aes(cty,hwy))+geom_point() # scatter plot of points


str(diamonds)
ggplot(diamonds,aes(carat,price))+geom_point()
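ggplot(economics, aes(date, unemploy))+geom_line() # line chart
ggplot(mpg,aes(cty))
ggplot(mpg,aes(cty))+geom_histogram() # a histogram needs only one variable


ggplot(mpg,aes(cty))+geom_histogram(bins=10) # fewer bins, so wider bars
ggplot(mpg,aes(cty))+geom_histogram(bins=20) # bins sets the number of bars

dia <- diamonds # copy under a shorter name
class(dia)
dia
# ord: an ordered factor, whose levels have a fixed order
# carat: the weight (size) of the diamond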

# Colour by a categorical variable
ggplot(diamonds,aes(carat,price, color=cut))+geom_point()
ggplot(diamonds,aes(carat,price, color=color))+geom_point()
ggplot(diamonds,aes(carat,price, color=clarity))+geom_point()

# R vs. RStudio speed
# Plain R draws this plot much faster and more smoothly.

# Faceting
# shows each group in its own panel
# for categorical variables

ggplot(mpg, aes(displ,hwy))+geom_point()+facet_wrap(~class)

geom_smooth()
ggplot(mpg,aes(displ,hwy))+geom_point()+geom_smooth() # smooth curve with a confidence band
ggplot(mpg,aes(displ,hwy))+geom_point()+geom_smooth(method="loess") # curve (the default here); loess fits local regressions and joins them
ggplot(mpg,aes(displ,hwy))+geom_point()+geom_smooth(method="lm") # straight line with a confidence band; lm = linear model

geom_boxplot()

# geom_jitter() spreads the points out; geom_violin() gives a nicer view of the distribution (a density mirrored around the center)
#geom_violin
ggplot(mpg, aes(drv, hwy)) +geom_violin()
ggplot(mpg,aes(drv,hwy))+geom_jitter()
ggplot(data = mpg, aes(x = drv, y = hwy))+
  geom_point(size = 2, position = "jitter")

geom_freqpoly()
ggplot(mpg, aes(hwy)) +geom_freqpoly() # like a histogram, but drawn as a line
ggplot(mpg, aes(hwy)) +geom_freqpoly(bins = 20) # fewer bins, so each bin covers a wider range

geom_histogram()
ggplot(mpg, aes(displ,color= drv)) +geom_histogram(binwidth= 0.5) # colour only changes the outline, not the fill
ggplot(mpg, aes(displ,fill= drv)) +geom_histogram(binwidth= 0.5)
ggplot(mpg, aes(displ,fill= drv)) +geom_histogram(binwidth= 0.5,position = "dodge") # side by side: each bar has a single colour


# geom_bar
# Used very often.
ggplot(mpg, aes(displ,fill= drv)) +geom_bar(position = "dodge")
ggplot(mpg, aes(displ,fill= drv)) +geom_bar(position = "fill")

ggplot(mpg, aes(manufacturer))+geom_bar() # with a single variable, stat defaults to count
drugs <- data.frame(drug = c("a","b","c"), effect = c( 4, 9, 6))
ggplot(drugs, aes(drug, effect))+geom_bar(stat = "identity") # the y values are given, so stat = "identity" is required
ggplot(drugs, aes(drug, effect))+geom_bar() # Error: stat_count() must not be used with a y aesthetic.

ggplot(economics, aes(date, unemploy / pop))+geom_line() # expressions inside aes() are computed on the fly
ggplot(economics, aes(date, unemploy))+geom_line()

ggplot(mpg, aes(drv, hwy)) +geom_boxplot() # the box spans the 25th to 75th percentiles, with a line at the median (50%)
# The circles beyond the whiskers are outliers.


mpg %>% filter(hwy < 20 & drv == 'f') 
mpg %>% filter(hwy < 25 & drv == 'f') %>% arrange(hwy)

Combining data frames
Missing values: NA (not available)
Cleaning data [missing values]: entries that are empty

# Is it NA or not?
df <- data.frame(sex= c("M","F",NA,"M","F"),score=c(5,4,3,4,NA))
df
is.na(df)
table(is.na(df))

table(is.na((df$sex)))
table(is.na(df$score))

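# NA propagates through calculations, so the result is NA.
mean(df$score) # [1] NA -> the result cannot be computed
sum(df$score) # [1] NA -> the result cannot be computed

df %>% filter(is.na(score)) # keep only the rows where score is NA
df %>% filter(!is.na(score)) # keep only the rows where score is not NA

# Solution: collect the non-missing values and compute on those.
df_nomiss <- df %>% filter(!is.na(score)) 
df_nomiss
# The mean now uses only the non-NA rows, so the result is no longer NA.
mean(df_nomiss$score) 
sum(df_nomiss$score)

df_nomiss <- df %>% filter(!is.na(score) & !is.na(sex))
df_nomiss

# na.omit() drops every row that contains an NA in any column.
# That can throw away a lot of data, so it is risky and best avoided,
# unless the data set is very large and you understand the effect.
df_nomiss2 <- na.omit(df) # keep only rows with no missing value in any variable
df_nomiss2

# The approach below is the most recommended.
# It acknowledges the NAs without modifying the original data, so it is safer.
mean(df$score, na.rm =  T) # mean excluding missing values
sum(df$score, na.rm = T) # sum excluding missing values
# Summary: the tools are is.na(), na.rm and na.omit().
# The results above were computed with the missing values removed.

Cleaning data [outliers]: abnormal values
A typical example of an outlier is winning first prize in the lottery: it sits at the tail of the distribution, but that does not make it bad.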

outlier <- data.frame(sex= c(1,2,1,3,2,1) , score= c(5,4,3,4,2,26))
outlier

table(outlier$sex)
table(outlier$score)

outlier$sex <- ifelse(outlier$sex == 3, NA, outlier$sex) # 3 is a coding error rather than an outlier
outlier

outlier$score <- ifelse(outlier$score>5 , NA, outlier$score) # treat scores above 5 as missing
outlier

outlier <- data.frame(sex= c(1,2,1,3,2,1) , score= c(5,4,3,4,2,26))
boxplot(outlier$score)

outlier %>% filter(!is.na(sex) & !is.na(score)) %>% 
  group_by(sex) %>% 
  summarise(mean_score = mean(score))

Regular expressions
Functions
Functions without arguments

# Functions -> variables used inside need no prior declaration

Minho <- function(){
  x <- 10
  y <- 20
  return (x*y) # declare the value to return
}

ls()
Minho
Minho()

# Declaring a function with arguments
Minho2 <- function(x,y){
  xx <- x
  yy <- y
  return (sum(xx,yy)) # declare the value to return
  # returns the result of a built-in function
}

Minho2(2,3)

kaggle
github

Minho3 <- function(x,y){
  x3 <- x+1
  y3 <- y+1
  x4 <- Minho2(x3, y3) # a function can call another function (a function that calls itself is recursion)
  return(x4)
}
Minho3(2,4)

# Assign the result to a variable without printing it
Minho4 <- function(){
  x <-  10
  y <-  10
  return(invisible(x*y)) # the value is not printed, but it is still returned and can be assigned
}
Minho4()
result <- Minho4()
result

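# Modifying a variable outside the function
rm(x) # remove any existing x (gives a warning if there is none)
x <- 70 # assign 70 to the global variable x
ls()
minho5 <- function(){
  x <- 10 # local variable
  y <- 20 # local variable
  
  x <<- 40 # assign 40 to the global variable x
  return(x+y) # 30: the local x (10) is used here
}
minho5()

minho5 <- function(){
  x <- 20 # local variable
  y <- 20 # local variable
  x+y
}

# return() takes priority; without it, the last evaluated expression is returned.
minho5 <- function(){
  x <- 20 # local variable
  y <- 20 # local variable
  return(x+y)
  x-y # never reached
}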

minho5()
minho5

sum1 <- 0  # initialize the accumulator
for(i in seq(1,10,by =1 )) sum1 <- sum1+i  # add 1 through 10 in steps of 1
sum1

sum1 <- 0
for(i in 1:5){
  for(j in 1:5){
    sum1 <- sum1 + i*j
  }
}
sum1

# while loop
sum2 <- 0
i <- 1
while(i <= 10){
  sum2 <- sum2+i;
  i <-i+1
}
sum2

#repeat
sum3 <- 0
i <- 1
repeat{
  sum3 <- sum3+i
  i <- i+1
  if(i>10) break  # leave the loop once this condition is met
}
sum3 <- 0
i <- 0
repeat{
  if(i> 5) break
  j <- 0
  repeat{
    if(j > 5)break
    sum3 <- sum3+i*j
    j <- j+1
  }
  i <- i+1
}
sum3

sum1 <- 0
len <- 10
for(i in 1:len){
  sum1 <- sum1 + i
}
sum1
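The loops above can usually be replaced by vectorized calls. A short sketch comparing both forms; the variable names are chosen for this illustration only.

```r
# Loop version: add 1..10 one element at a time
total_loop <- 0
for (i in 1:10) total_loop <- total_loop + i

# Vectorized version: sum() works on the whole vector at once
total_vec <- sum(1:10)

# Nested loop over i * j for i, j in 1..5 ...
nested_loop <- 0
for (i in 1:5) for (j in 1:5) nested_loop <- nested_loop + i * j

# ... equals the sum of the outer product
nested_vec <- sum(outer(1:5, 1:5))

c(total_loop, total_vec, nested_loop, nested_vec)  # 55 55 225 225
```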

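Support vector machine (SVM)
Understand the data -> exploratory data analysis (ggplot2, base R graphics, plots, pairs) -> feature engineering (add only what is needed, create derived variables, mutate, add or drop variables)
-> analysis (prediction, classification, CNN, etc.)
Regression: fitted with the least-squares method
  Simple regression: predicts a continuous value
  Logistic regression: binary choice, uses the sigmoid function
  Multiple regression
Expert systems
Decision-making (tree) family

Kaggle algorithms -> analysis with the decision-making (tree) family

Think of a vector as a point.
In the middle lies the decision boundary.
margin
support vectors
Linear classification problem -> a linear decision boundary separates the data.
When the data cannot be separated, transform it so that it becomes separable.
The transformation is done with a function.
Non-linear classification
Soft margin
Tune with hyperparameters.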

playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.47074&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

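RBF (radial basis function): non-linear classification -> the choice when the shape of the data is not well known
RBF (radial basis function) = Gaussian kernel
Pros and cons of support vector machines
Pros
1. Can be used for both classification and regression problems.
2. Easy to use.
3. High prediction accuracy.
Cons
1. The kernel and hyperparameters need a tuning process.
2. Unlike linear regression or logistic regression, confidence intervals and the like ...

R libraries that support SVM
e1071
klaR
kernlab

Neural networks: with the RBF function
kernel: either linear, or lift the data up to a higher dimension as if by magic
Prediction is done by classification.
Data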

# Classification

install.packages("RCurl")
# load the library
library(RCurl)
# specify the URL for the letter-recognition data CSV
urlfile <-'http://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data'
# download the file
downloaded <- getURL(urlfile, ssl.verifypeer=FALSE)
# treat the text data as a stream so we can read from it
connection <- textConnection(downloaded)
# parse the downloaded data as CSV; the target column must be a factor for classification
# note: this object masks the built-in constant `letters`
letters <- read.csv(connection, header=FALSE, stringsAsFactors=TRUE)
# name the columns: the letter label followed by 16 numeric features
colnames(letters)<-c("letter","xbox", "ybox","width","height",
                     "onpix","xbar","ybar","x2bar","y2bar",
                     "xybar","x2ybar","xy2bar","xedge","xedgey",
                     "yedge","yedgex")
View(letters)
str(letters)
head(letters)  # preview the first 6 rows
write.csv(letters,"letters.csv")
# Machine learning: the machine learns from the data.

test_split = 0.2
train_size = round((dim(letters)[1] *(1- test_split)))  # 20000 * 0.8 = 16000 training rows; the remaining 4000 are for testing
set.seed(20180621)  # the seed fixes which rows sample() draws
train_index = sample(1:(dim(letters)[1]),train_size)  # draw 16000 row numbers from 1 to 20000

letters_train <- letters[train_index,]
letters_test  <- letters[-train_index,]  # the rows that were not used for training

install.packages("e1071")
library(e1071)  # provides svm()
# Formula: predict letter from all of the remaining variables
# fitting svm with linear kernel
letters_linear <- svm(letter~. , data = letters_train,kernel ="linear")
summary(letters_linear)

Call:
svm(formula = letter ~ ., data = letters_train, kernel = "linear")

Parameters:
   SVM-Type:  C-classification
SVM-Kernel:  linear
       cost:  1

Number of Support Vectors:  7147

( 225 456 371 133 484 436 213 209 234 183 313 202 239 287 235 298 346 236 240 335 240 196 175 280 458 123 )

Number of Classes:  26

Levels:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

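Because the kernel is linear, there is no gamma value. Gamma controls how far the influence of a single training point reaches (near or far).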
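Once a model is fitted, the held-out rows can be used to check it. The sketch below applies the same split-fit-predict pattern to the built-in `iris` data so that it runs without a download; it assumes the `e1071` package is installed, and swapping in `letters_train` / `letters_test` is left as an exercise.

```r
library(e1071)

set.seed(1)                                   # reproducible split
n <- nrow(iris)
train_index <- sample(n, round(n * 0.8))      # 80% of the rows for training
iris_train <- iris[train_index, ]
iris_test  <- iris[-train_index, ]

# Fit a linear-kernel SVM: Species is predicted from all other columns
fit <- svm(Species ~ ., data = iris_train, kernel = "linear")

# Predict the held-out rows and compare with the true labels
pred <- predict(fit, iris_test)
confusion <- table(predicted = pred, actual = iris_test$Species)
confusion

accuracy <- mean(pred == iris_test$Species)   # share of correct predictions
accuracy
```

The confusion table shows which classes get mixed up, which a single accuracy number hides.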

Ubuntu -> Linux can be used on Windows.

!hostname
# a server created just for me
7f22e97478ea

R-3

2020. 9. 5. 13:50

filter
select
mutate
summarize
arrange

# How many observations and how many variables are there?
summary(airquality)
dim(airquality)
str(airquality)

# Which function shows summary statistics (minimum, maximum, median, mean, etc.) for every variable at once?
summary(airquality)

library(dplyr)
# On how many days is Ozone greater than 32 and Wind less than 9?
airquality %>% filter(Ozone > 32 & Wind <9) %>% summarise(n())

airquality %>% select(Ozone , Wind , Temp , Month) %>% filter(Temp >= 80) %>% arrange(desc(Ozone)) %>%  head()

airquality %>% select(1 , 2 , 3 , 4)

airquality %>% select(1:4)

airquality %>% select(-2)


airquality %>% select(Ozone , Wind , Temp , Month) %>% group_by(Month) %>% summarise(ave= mean(Wind))  # mean wind speed per month
airquality %>% select(Ozone , Wind , Temp , Month) %>% group_by(Month) %>% summarise(max_wind= max(Wind))  # maximum wind speed per month

airquality %>% filter(Wind >= 10) %>% group_by(Month) %>% summarise(avg= mean(Temp))

game <- read.csv("gamedata.csv")   # slow for large files
game


library(data.table)
data<- fread("gamedata.csv")

getwd()

dim(data)

library(readr)

data1 <- read_csv("gamedata.csv")
dim(data1)

head(data1)
summary(data1)

rm(data,data1)
rm(list=ls())

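data <- fread("conveniencestore.csv",encoding = "UTF-8")
dim(data)

head(data)

data1 <- read_csv("conveniencestore.csv")  # Korean text is not garbled; the encoding is handled automatically
head(data1)

# read.csv(): fine when the data is small
# fread(): data.table reader, fast for large files
# read_csv(): readr reader, works well regardless of the file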

summary(data1)
summary(data)

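Frequency: table

data <- sample(4, 29, replace = T)
data
table(data)  # frequency table
hist(data)   # histogram
hist(table(data))     # histogram of the counts: the bars touch each other
barplot(data)
barplot(table(data))  # bar chart of the counts: the bars are separated
pie(table(data))      # build the table first, then draw the pie
table(data) %>% pie()
data %>% table() %>%  pie()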

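abline()
draws the line y = a + bx on the x-y plane

Low-level functions -> add to an existing plot (a line, text, and so on), e.g. abline(); they do not draw a plot on their own.
High-level functions -> can draw a complete plot on their own.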

par(mfrow= c(1,1))
x <- c(2,3,2,3)
barplot(x)
fit <- lm(dist~speed, data= cars)
fit
plot(fit)
par(mfrow= c(2,2))
plot(fit)

abline(a= 40, b = 4, col ='red')

lty -> line type
lwd -> line width
col -> color
v -> vertical
h -> horizontal
legend -> legend

ggplot2 makes plots look good
3. The ggplot2 graphics package
Getting to know the ggplot2 package
gg = Grammar of Graphics
reticulate -> lets you use Python in RStudio much like R
ggplot2
R Graphics Cookbook
R for Data Science

www.r-graph-gallery.com/

www.ggplot2-exts.org/gallery/

More visualization options: https://plot.ly/r/

plotly is a library that draws interactive graphs.
It can be used from Scala, R, Python, JavaScript, MATLAB, and others.

It uses D3.js for rendering.
It is easy to use, and the results look polished.

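mtcars
str(mtcars)
mtcars$cyl
# data.frame is part of base R, so no library() call is needed
mtcars$cyl <- as.factor(mtcars$cyl)
str(mtcars)

# plotting characters: pch 4, 6, 8 for the three cyl levels
plot(mpg ~ hp, data= mtcars, col= cyl, pch=c(4,6,8)[mtcars$cyl], cex=1.2)
legend("topright",legend= levels(mtcars$cyl),pch= c(4,6,8) , col = seq_along(levels(mtcars$cyl)))  # colors 1:3 match the factor codes used by col = cyl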

library(ggplot2)
ggplot(mtcars, aes(x=hp,y=mpg,color= cyl, shape=cyl))+
  geom_point(size=3)

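2+3
Step 2 takes about 80%, step 3 about 30%
1. Set up the plane
2. Choose the geometry
3. Labels
4. Theme
5. Facets
This is what is called ggplot.
1. Set up the plane: ggplot(data=, aes(x=, y=))
* ggplot(data = data set name)
Main function ggplot(data = data set name): loads the data
mapping = aes(x = , y = ): maps variables to the x axis and y axis

geom_function(): decides which kind of graph to draw
mapping = aes(item1 = value1, item2 = value2)
: an option of geom_function(), used to style the layer

position (x, y), color, fill, shape, linetype, size, etc.
Converting a variable to a factor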
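The five steps listed above can be stacked in one call. A minimal sketch using the `mpg` data that ships with ggplot2; the title and axis labels are illustrative choices, not part of the original notes.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +   # 1. plane: data and axes
  geom_point(aes(color = drv), size = 2) +                     # 2. geometry: scatter, colored by drv
  labs(title = "Highway mileage by engine displacement",       # 3. labels
       x = "Displacement", y = "Highway mpg") +
  theme_bw() +                                                 # 4. theme
  facet_wrap(~ class)                                          # 5. facets: one panel per class

p
```

Because every step is just another `+` layer, any of them can be dropped or swapped without touching the rest.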

mpg
str(mpg)
names(mpg)
ggplot(data = mpg ,aes(x = displ , y = hwy))  # step 1: set up the background (axes)
ggplot(data = mpg ,aes(x = displ , y = hwy))+ geom_point()  # add a scatter plot on the background
ggplot(data = mpg ,aes(x = displ , y = hwy))+ geom_point() + xlim(3,6)  # restrict the x-axis range to 3-6
ggplot(data = mpg ,aes(x = displ , y = hwy))+ geom_point() + xlim(3,6) + ylim(10,30)  # also restrict the y-axis range to 10-30
# The points are concentrated on the left side.

# A categorical variable (drv has three levels) mapped to color and shape produces distinct colors
ggplot(data= mpg, aes(x = displ, y = hwy, color= drv,shape = drv))+geom_point(size=2)

ggplot(data= mpg, aes(x = displ, y = hwy, color= cty))+geom_point(size=2)

summary(mpg$cty)
# cty is numeric, so the color is a continuous gradient; wrapping it in factor() would make it categorical
ggplot(data = mpg, aes(x = displ, y = hwy)) +geom_point(aes(color= class))

ggplot(data = mpg, aes(x = displ, y = hwy,color= class)) +geom_point(size = 3)
ggplot(data = mpg, aes(x = displ, y = hwy)) +geom_point(aes(color= class), size = 3)

p <- ggplot(data = mpg, aes( x= displ,y= hwy))
p + geom_point(aes(color=class))

q <- geom_point(aes(color = class))
p + q

geom_point  Scatterplot
geom_bar  Bar plot
geom_histogram  Histogram
geom_density  Probability distribution plot
geom_boxplot  Box and whiskers plot
geom_text  Textual annotations in a plot
geom_errorbar  Error bars

ggplot(data = mpg, aes(x = displ, y = hwy) )+geom_point(size = 2)
ggplot(data = mpg, aes(x = displ, y = hwy , shape= drv) )+geom_point(size = 2)
ggplot(data= mpg, aes(x = displ, y = hwy, color = drv))+geom_point(size = 2)
ggplot(data= mpg, aes(x = displ, y = hwy, color = drv, shape= drv))+geom_point(size = 2)

ggplot(data = mpg, aes(x = displ, y = hwy) )+geom_point(size = 2)+geom_smooth(method = "lm")  # adds a linear regression line with its confidence band
ggplot(data = mpg, aes(x = displ, y = hwy , shape= drv) )+geom_point(size = 2)
ggplot(data= mpg, aes(x = displ, y = hwy, color = drv))+geom_point(size = 3)
ggplot(data= mpg, aes(x = displ, y = hwy, color = drv, shape= drv))+geom_point(size =3)+geom_smooth(method = "lm")

p2 <- ggplot(data= mpg, aes(x= displ, y = hwy, color= drv, shape= drv))+
  geom_point(size = 2)
p2

p2 + geom_smooth(method = "lm")
p2 + geom_smooth(method="lm")+theme_dark()

3. Themes (theme)

p3 <- ggplot(data= mpg, aes(x= displ, y = hwy, color= drv, shape= drv))+
  geom_point(size = 2)+
  geom_smooth(method= "lm")
p3
p3 + theme_dark()     # dark background
p3 + theme_bw()       # white background with grid lines
p3 + theme_classic()  # axes only, nothing else

help(theme_bw)

p3 + theme_gray()      # gray background
p3 + theme_linedraw()  # lines drawn in black
p3 + theme_light()     # light-colored lines
p3 + theme_minimal()   # no border
p3 + theme_void()
p3 + theme_test()

R keeps data in memory, so it can be slow with large data sets.

install.packages("ggthemes")
library(ggthemes)
?ggthemes
p2 + theme_wsj()              # Wall Street Journal style, orange-ish tones
p2 + theme_economist()        # The Economist style, pale green tones
p2 + theme_excel_new()        # looks like Excel
p2 + theme_fivethirtyeight()
p2 + theme_solarized_2()
p2 + theme_stata()

4. Labels

ggplot( data = mpg, aes(x= displ, y = hwy , color = drv , shape = drv))+
  geom_point(size = 2)+
  geom_smooth(method= "lm")+
  labs(title = "<Highway mileage by engine displacement>", x ="Displacement", y ="Mileage" )

5. facet
# How to split the plot into panels
d <- ggplot(mpg, aes(x = displ, y = hwy , color = drv)) +
  geom_point()
d
d + facet_grid(drv ~ .)  # split into 3 rows by drv
d + facet_grid(. ~ cyl)  # split into columns by cyl
d + facet_grid(drv ~ cyl)

d + facet_grid( ~ class)
d + facet_wrap( ~ class)  # wraps the panels into several rows

d + facet_wrap( ~ class, nrow = 2)  # number of rows
d + facet_wrap( ~ class, ncol = 4)  # number of columns

ggplot(data = mpg, aes( x= displ, y = hwy, color = drv))+
  geom_point(size = 2)
ggplot(data = mpg, aes(x = displ, y = hwy, color = drv))+
  geom_point(size = 2, position = "jitter")
dplyr :: glimpse(mpg)
jitter adds a little random noise to each point, so points with nearly identical values are spread apart instead of overlapping.

p1 <- ggplot(data= mpg, aes( x= displ, y = hwy , color = drv))
p1 + geom_point(size =2 )
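p1+ geom_line()  # connect the points with lines
p1 + geom_point(size =2) +geom_line()

Histogram: the bars touch each other; used for continuous variables.
Bar chart: the bars are separated; used for discrete variables.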

ggplot( data = mpg, aes( x= displ)) +geom_bar()  # without y, the bar height is the count
ggplot( data = mpg, aes( x= displ, fill = factor(drv))) + geom_bar()
ggplot( data = mpg, aes( x= displ, fill = factor(drv))) +geom_bar(position = "dodge")

# as proportions
ggplot( data = mpg, aes( x = displ, fill = factor(drv))) + geom_bar(position = "fill")
ggplot( data = mpg, aes ( x = displ, fill = factor(drv))) + geom_bar(position= "fill")+facet_wrap(~class)  # one panel per class

ggplot( data = mpg, aes( x = displ))+ geom_histogram()
ggplot( data = mpg, aes( x= displ))+ geom_histogram(fill= "blue")
ggplot( data = mpg, aes( x= displ))+ geom_histogram(fill = "blue", binwidth = 0.1)  # narrower bins

library(ggplot2)
library(dplyr)
plot(mtcars)
attach(mtcars)  # makes the columns accessible by name
mtcars
wt
disp
plot(wt)  # base plot(): a single vector is plotted against its index
mpg
plot(wt, mpg)
plot(wt, mpg, main="Relationship between wt and mpg")
plot(wt, disp, mpg)  # Error in plot.xy(xy, type, ...): invalid plot type (plot() does not take three variables)

library(scatterplot3d)
scatterplot3d(wt, disp, mpg, main ="3D scatter plot")
scatterplot3d(wt, disp, mpg, pch = 15, highlight.3d = TRUE, type ="h", main = "3D scatter plot" )

library(rgl)
plot3d(wt, disp, mpg)
plot3d(wt, disp, mpg , main = "wt vs mpg vs disp" , col ="red" , size = 10)

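Intermediate visualization

Boxplot
Scatterplot
Densityplot

Box plot: shows the distribution of the data (minimum, maximum, median) and where the values are concentrated
abc <- c(110 , 300, 150, 280, 310)
def <- c(180, 200, 210, 190, 170)
ghi <- c(210, 150, 260, 210, 70)
boxplot(abc,def,ghi)

# col: fill color of the boxes
# names: label for each box
# range: how far the whiskers extend from the box
# width: width of the boxes
# notch: if TRUE, the box is drawn with a narrow waist
# horizontal: if TRUE, the boxes are drawn horizontally

Uses the five-number summary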

boxplot(abc,def,ghi, col= c("yellow","cyan","green"),names =c("BaseBall","SoccerBall","BaseBall"),horizontal=T)
summary(abc)
summary(def)
summary(ghi)
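The five numbers behind a box plot can be read off directly. A small sketch with base R only, reusing the `abc` values from above.

```r
abc <- c(110, 300, 150, 280, 310)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(abc)
five

# boxplot() computes the same five values; plot = FALSE returns them without drawing
box_values <- as.vector(boxplot(abc, plot = FALSE)$stats)
box_values
```

`summary()` reports quartiles computed by a slightly different rule, so for other data its 1st and 3rd quartiles can differ a little from the hinges.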

head(iris)
ggplot(iris, aes(x= Sepal.Length, y = Sepal.Width))+geom_point()

ggplot(iris, aes(x= Sepal.Length, y = Sepal.Width))+geom_point(color="red",fill ="blue",shape = 21, alpha = 0.5, size= 6, stroke = 2)
# alpha: transparency
# stroke: thickness of the border around each point

ggplot(iris, aes( x = Sepal.Length, y = Sepal.Width, color = Species,shape= Species))+geom_point(size = 6, alpha = 0.5)
ggplot(iris, aes( x = Sepal.Length, y = Sepal.Width, color = Species,shape= Species))+geom_point(size = 3, alpha = 0.5)

data = head(mtcars,30)
ggplot(data,aes(x= wt, y = mpg))+geom_point()+geom_text(label= rownames(data),nudge_x = 0.25, nudge_y = 0.25,check_overlap = T)
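# check_overlap: skip labels that would overlap earlier ones
# nudge_x: horizontal offset of the label from the point
# nudge_y: vertical offset of the label from the point

ggplot(data, aes(x = wt, y = mpg)) +geom_label(label = rownames(data),nudge_x = 0.25, nudge_y = 0.2)
# geom_label draws a box around the text

ggplot(data, aes(x = wt, y = mpg,fill= cyl)) +geom_label(label = rownames(data),color="white",size= 5)
# unlike the plain box above, the box is filled according to cyl

ggplot(data= iris, aes(x = Sepal.Length, y = Sepal.Width))+geom_point()+geom_rug(col= "steelblue",alpha = 0.1 , size = 1.5)
# geom_rug draws tick marks along the edges of the plot
# darker areas mean more observations, so the rug shows the distribution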

library(ggplot2)
install.packages("ggExtra")
library(ggExtra)
head(mtcars)
mtcars
# keep wt and mpg numeric so that the marginal histograms and densities can be computed
mtcars$cyl = as.factor(mtcars$cyl)
str(mtcars)
ggplot(mtcars, aes(x = wt, y = mpg, color= cyl, size = cyl))+geom_point()+theme(legend.position = "none")
# legend.position = "none" hides the legend
ggplot(mtcars, aes(x = wt, y = mpg, color= cyl, size = cyl))+geom_point()

p <- ggplot(mtcars, aes(x = wt, y = mpg, color= cyl, size = cyl))+geom_point()+theme(legend.position = "none")
ggMarginal(p, type="histogram")  # marginal histograms
ggMarginal(p, type="density")    # marginal density curves
ggMarginal(p, type="boxplot")    # marginal boxplots
ggMarginal(p, type ="histogram", size = 10)  # size adjusts the main plot relative to the margins
ggMarginal(p, type = "histogram", fill="slateblue", xparams = list(bins= 10),yparams = list(bins = 10))

www.r-graph-gallery.com/

Static plots appear in the Plots pane.
Interactive (moving) plots are viewed in a separate window.

data = data.frame(cond = rep(c("condition_1","condition_2"),each= 10), my_x = 1:100 +rnorm(100, sd= 9),my_y = 1:100 +rnorm(100,sd= 16))
data
# rep(c("condition_1","condition_2"), each = 10) repeats each label 10 times
# sd: standard deviation
# rnorm() draws from the normal distribution
ggplot(data,aes( x= my_x, y = my_y))+geom_point(shape= 1)

# method = lm fits a straight line (as opposed to a wiggly curve that may overfit)
# se = FALSE hides the confidence band around the line; se = TRUE shows it
ggplot(data, aes(x= my_x, y = my_y))+geom_point(shape= 1) +geom_smooth(method = lm, color="red" ,se= F)
ggplot(data, aes(x= my_x, y = my_y))+geom_point(shape= 1) +geom_smooth(method = lm, color="red" ,se= T)

a = seq(1,29)+4 * runif(29,0.4)
# runif(29, 0.4) draws 29 uniform random numbers between 0.4 and 1
b = seq(1,29) ^ 2 +runif(29, 0.98)
library(dplyr)
par(mfrow=c(2,2))  # split the device into 2 x 2 = 4 panels

plot(a,b, pch= 20)
plot(a-b, pch =18)
hist(a, border= F, col = rgb(0.2,0.2,0.8,0.7),main="")
# rgb(red = 0.2, green = 0.2, blue = 0.8, alpha = 0.7); the alpha of 0.7 sets the transparency
boxplot(a, col ="grey", xlab="a")

install.packages("rattle")
library(rattle)
cities <- c("Canberra","Darwin","Melbourne","Sydney")
ds <- subset(weatherAUS,Location %in% cities & !is.na(Temp3pm))  # keep the rows for the four cities with a non-missing Temp3pm
p <- ggplot(ds, aes(Temp3pm, colour = Location, fill= Location))
p <- p + geom_density(alpha = 0.55)
p
View(weatherAUS)
# %in% tests whether a value belongs to a set
# subset() extracts the rows (and optionally columns) that meet the condition

subset(weatherAUS,Location %in% cities & !is.na(Temp3pm))

data(diamonds)
head(diamonds)

ggplot(data = diamonds , aes(x = price, group = cut, fill= cut))+geom_density(adjust = 1.5)
ggplot(data = diamonds , aes(x = price, group = cut, fill= cut))+geom_density(adjust = 5)
# estimated price distribution for each cut; a larger adjust gives a smoother curve

ggplot(data = diamonds, aes(x= price, group = cut, fill= cut))+ geom_density(adjust = 1.5, alpha= 0.2)
ggplot(data = diamonds, aes( x= price, group = cut, fill= cut))+ geom_density(adjust = 1.5, position = "fill")  # stacked and normalized to fill the full height
x1 = rnorm(100)
x2 = rnorm(100, mean = 2)
par(mfrow = c(2,1))

par(mar = c(0,5,3,3))
plot(density(x1),main="",xlab = "", ylim = c(0,1),xaxt = "n", las = 1, col = "slateblue1", lwd = 4)
par(mar= c(5,5,0,3))
plot(density(x2), main ="", xlab ="Value of my variable", ylim=c(1,0), las = 1, col="tomato3", lwd = 4)


diamonds
ggplot(data = diamonds , aes(x = depth, group = cut, fill= cut))+geom_density(adjust = 1.5)


data <- data.frame(name = c("north","south","south-east","north-west","south-west","north-east","west","east"),val=sample(seq(1,10),8))
data
mpg

install.packages("forcats")
library(forcats)
library(dplyr)
data %>% mutate(name = fct_reorder(name,val)) %>% ggplot(aes(x=name, y = val))+
  geom_bar(stat= "identity")+
  coord_flip()  # bars ordered by val, ascending

data %>% mutate(name = fct_reorder(name, desc(val))) %>% ggplot(aes(x= name, y = val))+
  geom_bar(stat= "identity")+
  coord_flip()  # desc(): descending order

data <- data.frame(name = letters[1:5], value= sample(seq(4,15),5), sd = c(1,0.2,3,2,4))
ggplot(data) + geom_bar(aes(x= name, y = value), stat ="identity", fill ="skyblue", alpha= 0.7)+
  geom_errorbar(aes(x = name,ymin = value-sd, ymax = value+sd),width = 0.4 , colour ="orange", alpha = 0.9, size = 1.3)

ggplot(data)+
  geom_bar(aes(x= name, y = value), stat ="identity", fill ="skyblue", alpha = 0.5)+
  geom_crossbar(aes(x = name, y = value, ymin = value-sd , ymax = value+sd ), width = 0.4 , colour="orange", alpha = 0.9, size = 1.3)



ggplot(data)+
  geom_bar(aes(x= name, y = value), stat ="identity", fill ="skyblue", alpha = 0.5)+
  geom_linerange(aes(x = name, ymin = value-sd , ymax = value+sd ), colour="orange", alpha = 0.9, size = 1.3)  # geom_linerange has no width parameter

ggplot(data)+
  geom_bar(aes(x= name, y = value), stat ="identity", fill ="skyblue", alpha = 0.5)+
  geom_errorbar(aes(x = name, ymin = value-sd , ymax = value+sd ), width = 0.4 , colour="orange", alpha = 0.9, size = 1.3)+coord_flip()

R -2

2020. 9. 5. 13:37

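Check "Run as administrator" in the RStudio shortcut properties; otherwise permission errors appear.

Data preprocessing with the dplyr package

Processing data with dplyr

airquality -> autocompletes

dplyr

obs: rows -> observations
variables: columns -> variables (independent variables)
variables object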

dim(airquality)
summary(airquality)
str(airquality)

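# Copy airquality under a new name
air <- airquality
air

summary(air)
str(air)

a = 1
a

If you have overwritten airquality, restart R to get the original back. The built-in data set itself cannot be modified; only the object held in memory changes.

# Install dplyr
install.packages("dplyr")
library(dplyr)

# Dependencies: the packages that dplyr needs are downloaded and installed along with it

# glimpse(): compact overview of a data frame

str(air)
glimpse(air)  # can be more readable than str()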

# Month and Day columns of air (columns 5 and 6)
air1 <- air[,c(5,6)]
air1

air1 <- air[,c(1,3)]
air1

air1 <- air[,c('Ozone','Wind')]
air1

air1 <- air[,c('Ozone','Wind')]
head(air1)

tail(air1)

# rows 1 to 20
air1 <- air[1:20,]
air1

air1 <- air[,1:4]
air1

air1 <- air[,c(-5,-6)]
air1

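colnames(air1)  # column names (the header)
rownames(air1)  # row names
names(air1)     # same as colnames() for a data frame; the columns matter more

CSV is preferred over Excel files because the files are smaller.
rep -> replicate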

x1 <- 1:20
x2 <- rep(c("a","b"),10)
x2
x3 <- sample(1:200,20)  # random data
x3

# 10 random numbers from 1-50
x1 <- sample(1:50,10)
x1
# the leading 1 can be omitted; it is the default
x1 <- sample(50,10)
x1
# 10 random numbers from 30-50
x1 <- sample(30:50,10)
x1
# 10 random numbers from 45-50: only 6 values exist, so replace = TRUE is required
# (with replace = FALSE the sample cannot be larger than the population)
x1 <- sample(45:50,10,replace = TRUE)  # sampling with replacement: a value can be drawn again
x1

set.seed(1234)  # fix the seed so that the same numbers are drawn every time

x1 <- sample(45:50,10,replace = TRUE)  # draw again after set.seed()

# airquality has 153 rows; draw 15 of them at random
air1 <- airquality
index <- sample(153,15)
index
air1 <- air1[index,]
air1

nrow(airquality)  # number of rows
ncol(airquality)  # number of columns

index <- sample(nrow(airquality),15)
index

A vector holds several values in a single set.

air1 <- airquality
air1
a <- sample(nrow(air1),15)
a[3]  # take the 3rd element
dim(air1)[1]

# Alt + - inserts the assignment operator <-

# Sample 70% of the 153 rows
index1 <- sample(nrow(air1),nrow(air1)*0.7)
index1
train <- air1[index1,]
test <- air1[-index1,]
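A reproducible version of the 70/30 split can be sketched as follows. It assumes that rounding the training size down with `floor()` is acceptable, and it fixes the seed so that the same rows are drawn on every run.

```r
air1 <- airquality

set.seed(1234)                              # same split on every run
n_train <- floor(nrow(air1) * 0.7)          # 153 * 0.7 -> 107 rows
index1 <- sample(nrow(air1), n_train)

train <- air1[index1, ]                     # the sampled rows
test  <- air1[-index1, ]                    # everything that was not sampled

c(train = nrow(train), test = nrow(test))   # 107 and 46
```

Using the same index vector for both `train` and `test` is what guarantees that the two sets do not overlap.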

ls()
rm(air)
rm(a)
ls()
rm(list = ls())  # remove everything

R
help(sample)
prob: a vector of probability weights for obtaining the elements of the vector being sampled (proportions)
?sample does the same as help(sample).

In RStudio, press F1

head

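Data preprocessing with the dplyr package
filter(  ) extract rows
select(  ) extract columns (variables)
arrange(  ) sort
mutate(  ) add variables
summarise(  ) compute summary statistics
group_by(  ) split into groups
left_join(  ) combine data (columns)
bind_rows(  ) combine data (rows)

filter: the condition refers to a column, for example class
%>% pipe operator: Ctrl + Shift + M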

library(dplyr)
exam <- read.csv("csv_exam.csv")
exam

exam %>% filter(class == 1)  # rows where class is 1
exam %>% filter(class != 1)
exam %>% filter(math > 50)   # math score greater than 50
exam %>% filter(english >= 80)
exam %>% filter(class == 1 & math >= 50)

exam %>% filter(class == 1 | english >= 90)

exam %>% filter(class == 1 | class == 3 | class == 5)
exam %>% filter(class %in% c(1,3,5))

class1 <- exam %>% filter(class==1)
mean(class1$math)

air <- airquality
air
air %>% filter(Day>20)  # rows where Day is greater than 20
air %>% filter(Day>20) %>% filter(Month == 9)

# students in class 1 or 3 with an English score of 80 or higher
exam %>% filter( (class == 1 | class == 3) & english >= 80 ) 

# Extract columns
exam %>% select(math)
exam$math

exam %>% select(class,math,english)
select(exam, class)  # selects the column by name; same principle as above

exam %>% select(-math)

For readability, break the line after %>% (the pipe operator).
Pressing Enter indents the next line automatically.

# only the english column for class 1
exam %>% filter(class == 1) %>% select(english)
exam %>% select(english) %>% filter(class == 1)
# Error: comparison (1) is possible only for atomic and list types
# The two look alike, but in the second one the class column is already gone when filter() runs.

Processing data with dplyr

exam %>% arrange(math)        # like ORDER BY: ascending
exam %>% arrange(desc(math))  # descending

exam %>% arrange(class,math)  # sort by class first, then by math

4. Adding derived variables & summarizing by group
Adding derived variables -> build a new column from existing columns.
mutate -> transforms what is already there.
exam %>% mutate(total = math+english+science) %>%
head

exam

exam %>% mutate(total = math+english+science,
                mean = (math+english+science)/3) %>%
        head

exam %>% mutate(test = ifelse(science >= 60, "pass", "fail")) %>%
         head

exam %>% mutate(total = math+ english+science) %>%
        arrange(total) %>%
        head

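Summary statistic functions

mean(  ) mean -> available everywhere in R
sd(  ) standard deviation -> available everywhere in R
sum(  ) sum -> available everywhere in R
median(  ) median -> available everywhere in R
min(  ) minimum -> available everywhere in R
max(  ) maximum -> available everywhere in R
n(  ) count -> works only inside dplyr verbs such as summarise()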

summarise( ):

summarise() is typically used on grouped data created by group_by().

The output will have one row for each group

summarise(data.frame, functions...)

Computes and prints summary statistics for numeric values.
Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max(), quantile()
Position: first(), last(), nth(),

exam %>% summarise(mean_math = mean(math))  # the overall mean of math

exam %>% 
   group_by(class) %>% 
   summarise(mean_math = mean(math))

exam %>% 
  group_by(class) %>% 
  summarise(sd_math = sd(math))

# several statistics per class
exam %>% 
  group_by(class) %>% 
  summarise(mean_math = mean(math),
            sum_math = sum(math),
            median_math = median(math),
            n = n())  # number of students

mpg %>% 
  group_by(manufacturer) %>% 
  filter(class == "suv") %>% 
  mutate(tot = (cty+hwy)/2) %>%  # the mpg data names the city mileage column cty
  summarise(mean_tot = mean(tot)) %>% 
  arrange(desc(mean_tot)) %>% 
  head(5)
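The same group-then-summarise pattern can be tried on a data set that ships with R, so no CSV file is needed. A sketch using `mtcars`; the column names `mean_mpg`, `max_hp`, and `n` are chosen for this illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                    # one group per number of cylinders
  summarise(mean_mpg = mean(mpg),      # average mileage in the group
            max_hp   = max(hp),        # strongest engine in the group
            n        = n()) %>%        # number of cars in the group
  arrange(desc(mean_mpg))              # best mileage first

by_cyl
```

The result has one row per group, which matches the note above that `summarise()` returns one row for each group.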

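Data Visualization with R

Day 2: data visualization / preprocessing

Data visualization / preprocessing
1. Why data visualization matters
2. Base graphics: high-level and low-level
R graphics tools
1.  R Base Graphics
2.  Lattice Graphics
3.  ggplot2
Easy    Fast   Beautiful
1.  R Base Graphics
Built in, so no installation is needed
Provides many visualization methods such as bar charts, histograms, and pie charts

No separate installation or loading is required, and it is lightweight
The drawback is that configuration is somewhat complicated and the output is not very attractive

2.  lattice
Can produce many plots at once.
Well suited to examining relationships between variables in multidimensional data
Building up a graph step by step is difficult, so it is less intuitive

3. ggplot2
A package that combines the strengths of the two previous systems (R Base Graphics, Lattice Graphics)

Simple graph grammar + attractive high-quality graphs + built up in layers

The data object and the graphics object can be separated, which makes the code highly reusable.

Anscombe’s quartet

Discrete variable: a variable measured in separate units such as scores; bar charts, dot plots, and pie charts are effective for visualizing it
Continuous variable: time, length, etc.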

High-level graphic functions

plot function
Draws the data on the x-y plane

plot(x, y, type = 'type value', main = 'title', col = color)
# type: the form of the plot (points, lines, etc.)
# main: the title of the graph
# col: the color of the graph
type options
p: points,  l: lines,  b: both points and lines,  c: the lines of option b without the points,
o: overplotted points and lines,  h: vertical lines,  s: steps, horizontal segment first,
S: steps, vertical segment first,  n: draws only the frame and no data (no plotting)

mtcars  # built into R
?mtcars
str(mtcars)
names(mtcars)

plot(mtcars)
attach(mtcars)

wt
plot(wt)
mpg
plot(wt, mpg)
plot(wt, mpg, main = "Relationship between wt and mpg")
plot(wt, disp, mpg)  # error: plot() does not accept three variables

install.packages("scatterplot3d")
library(scatterplot3d)  # a misspelled name gives: package ‘scatterplot3’ is not available (for R version 3.6.1)

scatterplot3d(wt, disp, mpg, pch = 16, main="3D Scatter Plot")
scatterplot3d(wt, disp, mpg, pch = 16, highlight.3d= TRUE, type ="h" , main="3D Scatter Plot")

install.packages("rgl")
library(rgl)
plot3d(wt, disp, mpg)
plot3d(wt, disp, mpg, main="wt Vs mpg Vs disp", col ="red", size=10)

High-level graphic functions

plot() scatter plot
barplot() bar chart
pie() pie chart
matplot() multiple scatter plots

x11() 

par(mfrow= c(2,3))  # mfrow: split the device into 2 x 3 = 6 panels

plot(0:6,0:6, main ="default")
plot(0:6,0:6, type="b" , main="type = \"b\"")
plot(0:6,0:6, type ="c", main="type =  \"c\"")
plot(0:6,0:6, type ="o", main="type = \"o\"")
plot(0:6,0:6, type="s", main="type= \"s\"")
plot(0:6,0:6, type="S", main="type= \"S\"")

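Bar chart: counts for each level of categorical data
Usage: barplot(H, width = 1, beside = FALSE, main = 'title', col = NULL, horiz = ...)

# H (height): a vector or matrix (numeric, of course)
# beside: TRUE draws the bars side by side; FALSE stacks them
# col: the color of the graph
# horiz: TRUE draws the bars horizontally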

par(mfrow=c(1,1))
x <- c(38,52,24,8,3)
barplot(x)  # bar chart

names(x) <- c("Excellent","Very Good","Good", "Fair","Poor")
barplot(x)

# these values were originally typed in interactively with y <- scan()
y <- c(1, 2, 3, 3, 4, 3, 4, 1, 5, 3, 3, 3, 2, 4, 4,
       2, 4, 3, 5, 3, 1, 2, 3, 3, 4, 4, 3, 2, 3, 4)
barplot(y)

par(mar=c(2,4,2,2))  # sets the plot margins
barplot(table(y), xlab= "Beverage", ylab ="Frequency")
barplot(table(y)/length(y), xlab ="Beverage", ylab = "proportion")
# table(y): frequency of each value; length(y): number of values

# Creating an object
sales <- c( 45, 44, 46)
names(sales) <- c("Park", "Kim" ,"Lee")
barplot(sales, main="Sales", ylab ="Thousands")
# sales: the data object; names(sales): sets the names of the data

# Adjusting the range
barplot(sales, main="Sales", ylab ="Thousands" , ylim=c(42,46), xpd=FALSE)
# ylim = c(42,46): y-axis range; xpd = FALSE: bars may not extend outside the plot region

Pie chart
x
pie(x)

names(x) <- c("Excellent","Very Good", "Good","Fair","Poor")
barplot(x)
barplot(x, xlab="Level",ylab="Score")
barplot(x, xlab="Level",ylab="Score", col="blue")
barplot(x, xlab="Level",ylab="Score", col="blue", horiz=TRUE)  # the longest bar extends past the 40 tick
barplot( x, xlab="Level" , ylab = "Score" , col=c("blue","light blue","red","yellow","grey"), horiz=TRUE)

pie function
Draws the data as a pie chart

Usage: pie(x, labels = names(x), radius = 0.8, clockwise = FALSE, init.angle = if(clockwise) 90 else 0, density = NULL, angle = 45, col = NULL, ...)

# x: a numeric vector of non-negative values

# labels: defaults to the names of x; can be set explicitly

# radius: radius of the pie
# init.angle: starting angle of the pie (90 degrees if clockwise is TRUE, otherwise 0)
# density: density of the shading lines inside the slices
# angle: slope of the shading lines inside the slices
# col: fill colors of the slices

score <- read.table("score.txt",header= T ,fileEncoding = "UTF-8")
score

score$"성명"  # name column

score$"국어"  # Korean-language score column

paste(score$"성명","-",score$"국어")

pie(score$"국어", labels = paste(score$"성명","-",score$"국어"),col=rainbow(10),clockwise=TRUE)

pie(score$"국어", labels = paste(score$"성명","\n","(",score$"국어",")"),col= rainbow(10), clockwise=TRUE)

install.packages("googleVis")
library(googleVis)

buildcolors <- function(color_count){
  colors <- rainbow(color_count)
  colors <- substring(colors,1,7)
  colors <- paste(colors,collapse = "','")
  colors <- paste("'",colors,"'",sep="")
  colors <- paste("[",colors,"]",sep ="")
  return(colors)
}

cols <- buildcolors(10)

pie <- gvisPieChart(data.frame(score$'성명',score$'국어'),option = list(width = 600, height = 600, title='국어성적',colors=cols,pieSliceText ="label",pieHole="0.5"),chartid="donut")
header <- pie$html$header
header <- gsub("charset=utf-8","charset=euc-kr",header)
pie$html$header <- header
plot(pie)

pie <- gvisPieChart(data.frame(score$"성명",score$"국어"),option = list(width = 600, height = 600, title="국어성적",colors=cols,pieSliceText ="value",pieHole="0.5"),chartid="donut")
header <- pie$html$header
header <- gsub("charset=utf-8","charset=euc-kr",header)
pie$html$header <- header
plot(pie)

paste(): joins strings together
clockwise = TRUE: sets the starting angle of the pie chart to 90 degrees

\n: line break

Low-level graphic functions
points() plots points at the given coordinates
abline() draws the line y = a + bx
legend() draws a legend
text() writes text at the (x, y) coordinates of the plot region

Ex)
abline(a, b, lty, col, other options)   # y = a + bx
abline(h = a, lty, col, other options)  # y = a
abline(v = b, lty, col, other options)  # x = b
abline(lm object)                       # regression line

# lty: line type (1 = solid line, 2 = dashed line, etc.)
# col: color of the line

cars
cars[1:4,]

# Fit a model in order to predict values
z <- lm( dist ~ speed, data = cars)
summary(z)

x11()
par(mfrow= c(1,1))
plot(cars,main ="abline")


abline(h = 20)
abline(h = 30)

abline(v = 20, col ="blue")

abline(a = 40, b = 4, col='red')

abline(z, lty= 2, lwd= 2, col= 'green')

abline(z$coef, lty= 3, lwd = 2, col='red')
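The fitted line and a prediction from it can be put on the same plot. A short sketch with base R; the speed of 21 is an arbitrary value picked for this illustration.

```r
z <- lm(dist ~ speed, data = cars)   # fit dist = a + b * speed
coef(z)                              # intercept a and slope b

plot(cars, main = "abline")
abline(z, col = "green", lwd = 2)    # abline() accepts the lm object directly

# Predict the stopping distance for a hypothetical speed of 21
new_car <- data.frame(speed = 21)
pred <- predict(z, newdata = new_car)
points(new_car$speed, pred, pch = 19, col = "red")  # the point lies on the fitted line
pred
```

`points()` is a low-level function like `abline()`: both add to the plot that `plot()` has already drawn.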

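legend() function

legend(x, y, legend, pch, lty, fill, col, ...)
x, y: where to draw the legend
  ex) x = a, y = b: draws the legend at coordinates (a, b)
  a position keyword can be used instead, ex) "topright", "bottomleft", "center"

pch: for a legend of points; distinguishes the point symbols

lty: for a legend of lines; distinguishes the line types

fill: for a legend of areas; distinguishes the fill colors

pch and lty together: legend for a graph that uses both points and lines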

x11()
plot(1:10, type="n", xlab="",ylab="",main="legend")

legend("bottomright","(x,y)",pch=1,title="bottomright")
legend("bottom","(x,y)",pch=1,title="bottom")
legend("bottomleft","(x,y)",pch=1,title="bottomleft")
legend("left","(x,y)",pch=1,title="left")
legend("topleft","(x,y)",pch=1,title="topleft")
legend("top","(x,y)",pch=1,title="top")
legend("topright","(x,y)",pch=1,title="topright")
legend("right","(x,y)",pch=1,title="right")
legend("center","(x,y)",pch=1,title="center")
legends <- c("Legend1","Legend2")

legend(3,8, legend= legends,pch = 1:2, col = 1:2)
legend(7,8, legend= legends,pch = 1:2, col = 1:2,lty= 1:2)
legend(3,4, legend= legends,fill = 1:2)
legend(7,4, legend= legends,fill = 1:2, density = 30)

Exercises:

x <- c(1,1,1,2,2,2,2,2,2,3,3,4,5,6)
y <- c(2,1,4,2,3,2,2,2,2,2,1,1,1,1)
zz <- data.frame(x,y)
zz
sunflowerplot(zz)
plot(zz)

data(mtcars)
stars(mtcars[,1:4])
stars(mtcars[1:4], flip.labels= FALSE, key.loc = c(13,1.5))

stars(mtcars[1:4], key.loc = c(13,1.5), draw.segments = TRUE)

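xx <- c(1,2,3,4,5)
yy <- c(2,3,4,5,6)
zz <- c(10,5,100,20,10)
symbols(xx,yy,zz)  # circles at (xx, yy) with radii taken from zz

xx <- c(1,2,3,4,5)
yy <- c(20,13,40,50,60)
zz <- c(10,5,100,20,10)
c <- matrix(c(xx,yy,zz),5,3)
c
pairs(c)  # useful when there are many columns (numeric, and categorical as well) to compare pairwise

persp(volcano)    # 3D perspective plot

contour(volcano)  # contour lines of a 3D surface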
filled.contour(volcano,color.palette = terrain.colors,asp=1)
title(main="volcano data: filled contour map")

plot(0:6,0:6, type="n", main="type= \"n\"")
plot(0:6,0:6, type="b", lty="dashed")

x <- runif(100)
y <- runif(100)
plot(x,y,pch = ifelse(y > 0.5, 1,18))  # pch = point character

plot(x,y ,pch = ifelse( y >0.6, 15, ifelse( y > 0.4, 5, 14)))

plot(x,y ,pch = ifelse( y >= 0.7 , 8, ifelse( y>= 0.5, 5, 12)))

# Read kbo.csv
kbo <- read.csv("kbo.csv")
kbo
# show the first 6 rows
head(kbo)

# sort by team (팀) in descending alphabetical order and print the first 6 rows
kbo %>% group_by(팀) %>% arrange(desc(팀)) %>% head() 

arrange(kbo,desc(팀)) %>% head()

# keep only year (연도) 2017 and print the first 6 rows
filter(kbo,연도 == 2017) %>% head()

kbo

# select hits (안타), doubles (X2루타), triples (X3루타), and home runs (홈런); print the first 6 rows
select(kbo,안타,X2루타,X3루타,홈런) %>% head()

# the same four columns for 2017, first 5 rows
filter(kbo,연도 == 2017) %>%  select(안타, X2루타,X3루타,홈런)%>%   head(5)

kbo

# add batting average (타율 = 안타 / 타수, hits / at-bats) as a new variable and print the first 6 rows
kbo %>% mutate(타율 = 안타/타수) %>% head()
mutate(kbo, 타율 = 안타/ 타수) %>%  head()

R-1

2020. 9. 2. 20:36

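Introduction to R and installation

What kind of program is R?

R is a statistics program

Getting to know R

Why use R

A very wide range of statistical and data mining analysis techniques is available

It is open source with no restrictions on use, and it is easy to extend

It does not depend on the operating system: Windows, Linux, macOS, etc.

It’s free

Data analysis with R: a new trend!

Researchers in many fields use R for data analysis

Kdnuggets & Kaggle poll results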

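Installing and setting up R

Download ( http://www.r-project.org )
Select a CRAN mirror (choose your own region or a nearby country).
Select the version that matches the system of the PC where R will be installed
Select [Subdirectories] -> [base]
Click Download R x.x.x for Windows

- An R editor is provided along with the R Console.
- To run code, press Ctrl + R.

3. Running R for the first time

- R comes with built-in functions.
- Let's try a few calculations. To quit, use q( )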

1+2
1-2
1*2
1/2
1/3

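- Negative numbers are allowed
- Results are displayed with 7 significant digits by default
- Clear the console with Ctrl + L
- Running several lines at once:
- select the text exactly, character by character
- and run it with Ctrl + R

- A window named R Console opens and shows the 『>』 symbol; this is the R prompt.
- Use 『#』 to write comments in R.
- Type a command after the prompt and press ENTER; the command runs and its result appears on the next line.
- R is case-sensitive
- If a command is incomplete, R automatically shows the continuation prompt 『+』.
- To put several commands on one line, separate them with 『;』.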
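
The console rules listed here (comments, case sensitivity, and `;` as a statement separator) can be tried with a short sketch. The variable names `total` and `Total` are made up for illustration.

```r
# Everything after '#' is a comment and is not executed
total <- 1 + 2      # lower-case name
Total <- 10         # a different object, because R is case-sensitive

# Two statements on one line, separated by ';'
a <- 1; b <- 2

print(total)        # 3
print(Total)        # 10
print(a + b)        # 3
```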

3. First things to do after installing R

- RGui font and background color
- getwd( )
- setwd( )
- a <- c(1,2,3), print(a), sum(a), var(a), sd(a), pi
- b <- c(1:10)
- install.packages("dplyr")

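What is a package?

Anyone can create and distribute an R package.
Almost every statistical analysis method has already been developed and distributed as a package.

Representative packages

'ggplot2' : data visualization
'readxl' : used to import Excel xls/xlsx files into R
'dplyr' (pronounced "dee-plier") : helps process data quickly and easily.
'Rcpp' : integrates C++ to make up for R's slow speed
'caret', 'e1071' : implement a variety of machine learning algorithms.
'mxnet', 'nnet' : used for neural networks

• Using one package may install/attach other packages along with it.
• Even for the same goal, choose the package and function that fit the situation;
defining your own additional functions makes data analysis more effective.
• Packages that are not registered on CRAN can be installed by specifying their address separately.

Installing a package: install.packages("")

Loading a package: library()

Using its functions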
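
The install, load, use sequence can be sketched as below. `use_package()` is a hypothetical helper written for this note, not a function that ships with R; it only installs when the package is missing. The built-in `stats` package is used so that the example runs without a network connection.

```r
# Hypothetical helper (not part of R): install a package only when it is missing
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN, needs a network connection
  }
  library(pkg, character.only = TRUE)   # attach it so its functions can be called
}

# 'stats' ships with R, so this call never triggers an installation
use_package("stats")
result <- sd(c(1, 2, 3))   # a function from the attached package
result
```

For a CRAN package such as dplyr the call would be `use_package("dplyr")`.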

Goals of big data analysis

2. What is Big Data?

Installing RStudio

1. Download: www.rstudio.com/

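2. Click [Download RStudio]
3. Click [RStudio Desktop]
4. Click [DOWNLOAD RSTUDIO DESKTOP]
5. Download the program that matches your PC environment

RStudio consists of four panes.

① Write and edit scripts.
② Runs R and prints the results.
You can also type R commands directly.
③ Shows the environment (objects, data, etc.) and the history of executed commands.
④ Shows plots, the packages installed in R, help, and so on.

Adding spaces improves readability.
The result is the same without spaces.

The RStudio Files pane

Shows the working directory.
Shows plots.
Shows the list of installed and not-yet-installed packages.
Shows help when you run the help( ) function.
Shows analysis results rendered as HTML or other web documents.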

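1. Data processing with base R

RStudio coding basics

- R uses <- or = as the assignment operator. (<<- also exists, but it assigns in an enclosing environment and is rarely used)
- To run code, put the cursor on the line you want to run and press Ctrl + Enter.

- After running, check in the Console pane that the code ran correctly.
- Errors are also shown in the Console pane, so always check it after running code.
- Check in the Environment pane that the value you ran was stored.

- To run an object you created, just run the variable name (x) you assigned it to.
- Always check results in the Console pane. The [1] in front of a result is the index of the first value printed on that line.
- The following three steps are the basics of coding in RStudio.
- ① Create (assign) an object, ② run the object, ③ check the result in the Console pane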
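
The three steps can be sketched as follows. The second vector `y` is an added illustration of why the console prints a bracketed index at the start of each line.

```r
x <- c(10, 20, 30)   # 1) create (assign) an object
x                    # 2) run the object by its name
# 3) check the console: it shows  [1] 10 20 30
#    [1] means the line starts with the 1st element

y <- 101:130         # a longer vector may wrap over several console lines,
y                    # and each line begins with the index of its first value
```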

Data types and conversion

Using comments (#)

- Use # to mark a comment.
- Anything after # is not executed.

x <- 9
# x <- 12
x

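Removing objects

ls() # list objects
rm(x) # delete an object

- Use the rm( ) function to delete an object you created.
- Running rm(object) removes the object from the Environment pane.
- To delete everything, use rm(list = ls( ))

Arithmetic

5+10/(2+3)
(1/2 + 1/3)^2 / (1/3^2)
5*7

- Basic arithmetic uses +, - , * , / .
- Powers are written with ^ (shift+6).
- Because no assignment operator ( <- ) is used, the results are not stored.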

Basic value types in R : numeric

num1 <- 3.5
mode(num1)
num2 <- 3
mode(num2)
typeof(num2)

- The default type for numbers in R is "numeric". Integer and real (double) values together are called numeric.
- To check a data type, use mode( ) or typeof( ).
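
A minimal sketch of the difference between mode( ) and typeof( ): both integers and doubles report "numeric" as their mode, while typeof( ) tells them apart. The `L` suffix is standard R syntax for an integer literal; the variable names are made up for illustration.

```r
num_double  <- 3     # plain numbers are stored as double
num_integer <- 3L    # the L suffix creates an integer

mode(num_double)     # "numeric"
mode(num_integer)    # "numeric"
typeof(num_double)   # "double"
typeof(num_integer)  # "integer"
```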

Basic value types in R : character

char1 <- "blue"
char2 <- '1'
char3 <- 1
mode(char1)
mode(char2)
mode(char3)
char4 = "blue"
char4

- Text in R has the type "character".
- Enter character values inside quotation marks ("text" or 'text').
- However, mixing quote styles, as in "abc', is not allowed.
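
The notes show how to inspect a type but not how to convert one. A small sketch using the base functions as.numeric( ), as.character( ), and as.logical( ) follows; the variable names are made up for illustration.

```r
char2 <- '1'
num_from_char <- as.numeric(char2)     # "1" becomes the number 1
char_from_num <- as.character(3.5)     # 3.5 becomes the text "3.5"
log_from_char <- as.logical("TRUE")    # "TRUE" becomes the logical TRUE

mode(num_from_char)   # "numeric"
mode(char_from_num)   # "character"

# Text that is not a number becomes NA (R also raises a warning)
not_a_number <- suppressWarnings(as.numeric("blue"))
is.na(not_a_number)   # TRUE
```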

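Basic value types in R : logical

logic1 <- c(TRUE,FALSE,TRUE) # c() combines the values into one vector
mode(logic1)
logic2 <- c(T,F,T)
mode(logic2)
logic3 <- c(TRUE,false) # must be uppercase; otherwise: Error: object 'false' not found
mode(logic3)

- TRUE and FALSE must be written in uppercase.
- "logical" is the type used in conditions and indexing in R.
- It takes one of two values, TRUE or FALSE, which can be abbreviated as T and F. (T and F are ordinary variables that can be reassigned, so TRUE and FALSE are the safer spelling.)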

.logic1 <- c
mode(.logic1)

R basic data types : vectors and variable names

obj <- c(3, 5, 7, 9, 11)
name_1 <- c(2, 4, 6, 8, 10)
name.2 <- c(1, 3, 5, 7, 9)
.name2 <- c(1, 3, 5, 7, 9) # a leading period is allowed
2name <- c(1, 3, 5, 7, 9) # Error: unexpected symbol in "2name"
_name <- c(1, 3, 5, 7, 9) # Error: unexpected input in "_"

- The most basic object in R is the vector, which is created with the c( ) function.
The c( ) function takes values separated by commas.
- Object (= variable) names can mix letters, digits, . , and _ .
- A name cannot start with a digit or '_'.

R basic data types : vectors (2/2)

if <- c(1, 2, 3) # Error: unexpected assignment in "if <-"
else <- c(1, 2, 3) # Error: unexpected 'else' in "else"
for <- c(1, 2, 3) # Error: unexpected assignment in "for <-"

obj2 <- c(1, 2, "A","B")
obj2
obj3 <- c(T, F, 1, 2)
obj3
obj4 <- c("A", "B", T, F)
obj4

- Reserved words such as if, else, and for cannot be used as object names.
- A vector can hold only one type.
When two or more types are mixed, the values are converted automatically, with priority "character" > "numeric" > "logical".
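
The automatic conversion can be checked with typeof( ). This sketch reuses the three vectors from the notes and only adds the type checks.

```r
obj2 <- c(1, 2, "A", "B")   # numbers become text
obj3 <- c(T, F, 1, 2)       # TRUE/FALSE become 1/0
obj4 <- c("A", "B", T, F)   # logicals become text

typeof(obj2)   # "character"
typeof(obj3)   # "double"
typeof(obj4)   # "character"
obj3           # 1 0 1 2
```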

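R basic operators : logical operators

A <- T
B <- F
C <- c(T,T)
D <- c(F,T)

A & B #[1] FALSE
C & D #[1] FALSE TRUE

A | B #[1] TRUE
C | D #[1] TRUE TRUE

A && B #[1] FALSE
C && D # Error in R 4.3.0 and later (operands longer than 1); older versions returned FALSE using only the first elements

A || B #[1] TRUE
C || D # Error in R 4.3.0 and later; older versions returned TRUE

- R's logical operators include & (and) and | (or).
& compares two values and returns T only when both are T; | returns T if at least one of them is T.
- && and || are meant for single values. Older versions of R compared only the first element of each vector; since R 4.3.0, an operand longer than one element is an error.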
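
A sketch of how the two operator families are typically combined: & and | work element by element, while && and || expect single values, so a vector is first reduced with all( ) or any( ). The result names are made up for illustration.

```r
C <- c(TRUE, TRUE)
D <- c(FALSE, TRUE)

elementwise_and <- C & D       # FALSE TRUE
elementwise_or  <- C | D       # TRUE TRUE

# Reduce each vector to one value before using && or ||
both_all_true <- all(C) && all(D)   # FALSE, because D contains a FALSE
c_all_d_any   <- all(C) && any(D)   # TRUE

# && stops early: the right-hand side is skipped when the left side is FALSE
short_circuit <- FALSE && stop("never evaluated")   # FALSE, no error
```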

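R basic operators : comparison operators

1 < 2 #TRUE
1 = 2 # Error in 1 = 2 : invalid (do_set) left-hand side to assignment
1 == 2 #FALSE
1 != 2 #TRUE

A <- c(3,4)
B <- c(5,4)
C <- c # C now refers to the function c()

A < B # TRUE FALSE
A <= B #TRUE TRUE
A == B #FALSE TRUE
A != B #TRUE FALSE

- R's comparison operators are < (less than), > (greater than), <= (less than or equal), >= (greater than or equal), == (equal), and != (not equal).
- On vectors, comparison operators are applied element by element. <= is valid, but =< is not.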
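
Because a comparison returns one TRUE/FALSE per element, the result is often summarised afterwards. A small sketch with the base functions sum( ), which( ), and identical( ); the result names are made up for illustration.

```r
A <- c(3, 4)
B <- c(5, 4)

equal_flags <- A == B            # FALSE TRUE
n_equal     <- sum(A == B)       # 1, because TRUE counts as 1
differs_at  <- which(A != B)     # 1, the position where the values differ
same_vector <- identical(A, B)   # FALSE, a single answer for the whole vectors
```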

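c <- 7 # a variable named c does not hide the function c()
c

A <- C(3,4) # works only because C was assigned the function c above; R is case-sensitive
A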

R basic data types : matrices

mat1 <- matrix(1:12)
mat1

mat2 <- matrix(1:12, nrow = 3, ncol = 4)
mat2

mat3 <- matrix(1:12,nrow =3, ncol = 4, byrow = T)
mat3

mat4 <- matrix(1:12, 3, 4)
mat4

rownames(mat3) <- c("국어", "영어" ,"수학")
mat3

colnames(mat3) <- c("a1" , "a2" ,"a3" ,"a4")
mat3

- The matrix is another frequently used object in R.
Like a vector, a matrix can hold only one data type.
- The byrow=T option fills the matrix row by row (the default is column by column).
Use rownames() and colnames() to set row and column names.
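
The effect of byrow is easiest to see on a small matrix. This sketch also uses the `dimnames` argument of matrix( ) as an alternative to setting the names afterwards; the object names are made up for illustration.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: filled column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Row and column names can also be given when the matrix is created
scores <- matrix(1:6, nrow = 2, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("a1", "a2", "a3")))
scores["r2", "a3"]   # 6
```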

R coding : printing a specific part with indexes (1/2)

mat3[2,3]#7

mat3[2,]

mat3[,2]

mat3[,-2]

mat3["영어",]

mat3[,2:3]

mat3[c(1,2),]

mat3[c(1,2),c(2,4)]

- Use indexes to print only specific elements of a matrix.
- Indexing a matrix uses square brackets [ ].
- Use a minus sign (-) to exclude a specific row or column.
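
One detail worth knowing when indexing: selecting a single row or column returns a plain vector unless `drop = FALSE` is given. A minimal sketch, rebuilding `mat3` so that it runs on its own:

```r
mat3 <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)

row2     <- mat3[2, ]                  # a plain vector: 5 6 7 8
row2_mat <- mat3[2, , drop = FALSE]    # still a 1 x 4 matrix

is.matrix(row2)       # FALSE
dim(row2_mat)         # 1 4
sub <- mat3[-1, c(1, 4)]   # exclude row 1, keep columns 1 and 4
sub
```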

t(mat3)

Data frames

x1 <- c(100, 80, 60, 40, 30)
x2 <- c("A","B","C","A","B")

df <- data.frame(score=x1, grade = x2)
df
df$score
df$grade

df2 <- data.frame(score= x1, grade = x2, stringsAsFactors = F)
df2
df2$score
df2$grade

mean(df2$score)

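- The data frame is one of R's basic objects.
- Its structure is like a matrix, but each column can hold a different data type.
- Every column must have the same length (number of elements).

- To index into a data frame you can use the dollar sign ($).
- $ selects a column. Note the Levels shown for a factor column.

- Before R 4.0.0, 'character' columns were converted to 'factor' automatically.
- That is, stringsAsFactors = TRUE (convert text to categories) was the default; since R 4.0.0 the default is FALSE.
- As with a matrix, elements can also be accessed with df[ , ].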
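
A sketch that sets stringsAsFactors explicitly, so the result does not depend on the R version, and then uses df[ , ] indexing. The names `df_chr` and `df_fct` are made up for illustration.

```r
x1 <- c(100, 80, 60, 40, 30)
x2 <- c("A", "B", "C", "A", "B")

df_chr <- data.frame(score = x1, grade = x2, stringsAsFactors = FALSE)
df_fct <- data.frame(score = x1, grade = x2, stringsAsFactors = TRUE)

class(df_chr$grade)    # "character"
class(df_fct$grade)    # "factor"
levels(df_fct$grade)   # "A" "B" "C"

df_chr[2, "score"]                        # 80: row 2 of the score column
grade_a <- df_chr[df_chr$grade == "A", ]  # rows whose grade is "A"
mean(grade_a$score)                       # 70
```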

• ls( ) : list objects
• rm( ) : remove objects
rm(list=ls())
• Arithmetic
• Data types
numeric, character, logical
1, "1", TRUE
mode( ) , typeof( )
• Reserved words cannot be used as variable names.
if, else, for; the first character cannot be a digit
• Logical operators &, |, &&, ||
• Comparison operators <, >, <=, >=, !=, ==
• Creating a matrix
matrix(), byrow=T ,
rownames(), colnames()
• Creating a data frame
data.frame( )

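Reading and saving data

# Check the current working folder
getwd()

# Set a new working folder
setwd("c:\\Rlab\\data")

# Set a new working folder (either separator style works)
setwd("c:\\Rdata")
setwd("c:/Rdata")

# List folder and file names
dir()

- getwd( ) prints the path of the current working folder.
- setwd( ) sets a new working folder; separate folders with / (slash) or \\ (two backslashes).
- dir( ) prints the names of the files in the working folder.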
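
A sketch that changes the working folder and restores it afterwards. It uses tempdir( ) so that it runs on any machine, and the base function file.path( ) to build paths without typing separators by hand; the path "Rlab/data/data.txt" is only an illustrative relative path.

```r
old_wd <- getwd()                 # remember the current working folder

new_wd <- tempdir()               # a folder that always exists in an R session
setwd(new_wd)

data_path <- file.path("Rlab", "data", "data.txt")   # hypothetical relative path
data_path                         # "Rlab/data/data.txt"
found <- file.exists(data_path)   # check before trying to read a file

setwd(old_wd)                     # go back to the original folder
```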

ex1 <- read.table("c:/Rlab/data/data.txt")
ex1
View(ex1)
colnames(ex1)

ex2 <- read.table("data.txt",header = TRUE)
ex2
colnames(ex2)

ex1 <- read.csv("C:/Rlab/data/data.csv")
ex1

- read.table( ) loads a txt file created outside R.
- If the file is outside the working directory, give the full path.
- colnames( ) returns the column names. It can also be used to assign variable names.
- Data in txt format must be separated by spaces or tabs.

- When the first row holds the variable names, load the file with header = T.

Saving as txt

x1 <- 1:20
x2 <- rep(c("a","b"),10)
x3 <- sample(1:100, 20)
x1; x2; x3
x1
View(x1)

data1 <- cbind(x1,x2,x3)
data1

dataframe <- as.data.frame(data1)
dataframe

data2 <- data.frame(x1,x2,x3)
data2
    
write.table(data1,file = "matrix.txt")
read.table("matrix.txt")

write.table(data1,file = "matrix.txt",sep = ",")
read.table("matrix.txt",sep = ",")

write.table(data1,file = "matrix.txt",sep = "\t")
read.table("matrix.txt",sep = "\t")

write.table(data1,file = "matrix.txt",sep = "$")
read.table("matrix.txt",sep = "$")

write.table(data2,file = "dataframe.txt")
read.table("dataframe.txt")

For txt files
- Use read.table( ).
- read.table("file name", header = T, sep = "\t")
: the default separator is any whitespace (spaces or tabs); sep = "\t" splits on tabs only
- Save with write.table(object name, file = "file name").
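
A round trip through a tab-separated text file can be sketched as below. It writes to a temporary file instead of the working folder; `row.names = FALSE` and `quote = FALSE` are optional write.table( ) arguments chosen here to keep the file clean, not settings taken from the notes.

```r
data2 <- data.frame(x1 = 1:4,
                    x2 = rep(c("a", "b"), 2),
                    x3 = c(10, 20, 30, 40))

txt_file <- tempfile(fileext = ".txt")   # temporary file, deleted at the end

# Tab-separated, with a header line and without row names
write.table(data2, file = txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

back <- read.table(txt_file, header = TRUE, sep = "\t")
back
unlink(txt_file)
```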

read.csv("data.csv", header =T , sep =",")
read.csv("data.csv", header =T )
read.csv("data.csv")

txt <- read.table("data.txt",header = T)
write.csv(txt,file="data1.csv")

For csv files
- Use read.csv( ).
- read.csv("file name", header = T, sep = ",")
The default separator is a comma (,).
- Save with write.csv( ).
- write.csv(object name, file = "file name")
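
One thing to watch for when saving and reloading a csv: write.csv( ) writes the row names by default, and they come back as an extra column named `X`. A minimal sketch using a temporary file; the names `with_rownames` and `without_rownames` are made up for illustration.

```r
df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math = c(50, 60, 100, 20),
                         class = c(1, 1, 2, 2))

csv_file <- tempfile(fileext = ".csv")

write.csv(df_midterm, file = csv_file)      # default: row names are written too
with_rownames <- read.csv(csv_file)
names(with_rownames)                        # "X" "english" "math" "class"

write.csv(df_midterm, file = csv_file, row.names = FALSE)
without_rownames <- read.csv(csv_file)
names(without_rownames)                     # "english" "math" "class"
unlink(csv_file)
```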

Reading Excel files

install.packages("readxl")
library(readxl) # a package must be loaded into memory before it can be used

df_exam <- read_excel("excel_exam.xlsx")
df_exam

mean(df_exam$math)

- Install the package
- Load the library; require( ) also works

Using the data

• read_xls( )
• read_xlsx( )

df_exam_novar <- read_excel("excel_exam_novar.xlsx", col_names = F)
# col_names = F: the file has no header row, so the first row is read as data
df_exam_novar

df_exam_novar <- read_excel("excel_exam_novar.xlsx") # default col_names = TRUE: the first row is treated as the header
df_exam_novar

df_exam_sheet <- read_excel("excel_exam_sheet.xlsx",sheet = 3) 
df_exam_sheet

df_csv_exam <- read.csv("csv_exam.csv")
df_csv_exam


df_csv_exam <- read.csv("csv_exam.csv",stringsAsFactors = F)
df_csv_exam

df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math = c(50, 60, 100, 20),
                         class = c(1, 1, 2, 2))
df_midterm
write.csv(df_midterm, file= "df_midterm.csv")

save(df_midterm, file= "df_midterm.rda")

Loads an Excel file. The sheet option selects a specific sheet.

Loads a csv file.
<Note> The function is different from the one used for Excel files.

Loads a csv file. If the csv file contains text, stringsAsFactors decides whether the text is converted to factors or kept as character.

Saves a data frame to a csv file.
If the df_midterm object does not exist, use the df_exam_sheet object instead.

To save a data frame as an Rdata file, use the save( ) function.
The file extension is .rda.

rm(df_midterm)
df_midterm # Error: object 'df_midterm' not found

load("df_midterm.rda")
df_midterm

df_exam <- read_excel("excel_exam.xlsx")
df_csv_exam <- read.csv("csv_exam.csv")

Loading an Rdata file recreates the data frame right away. Excel and csv files, however, must be assigned to a variable name to create the data frame.
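
The difference can be sketched with a temporary file: load( ) restores the object under its original name, and its return value is only the vector of restored names. The file location and the name `restored` are assumptions made for this example.

```r
df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math = c(50, 60, 100, 20),
                         class = c(1, 1, 2, 2))

rda_file <- tempfile(fileext = ".rda")
save(df_midterm, file = rda_file)

rm(df_midterm)                  # the object is gone from the environment

restored <- load(rda_file)      # load() returns the names of the objects it restored
restored                        # "df_midterm"
mean(df_midterm$math)           # 57.5: the data frame is back under its old name
unlink(rda_file)
```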

Functions for getting to know your data

Reading Excel files

head(df_exam, 10) # the first 10 rows
tail(df_exam,3) # the last 3 rows
View(df_exam)
dim(df_exam) #[1] 20  5
str(df_exam)
summary(df_exam)
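
The same inspection functions can be tried without the Excel file. This sketch swaps in the built-in `mtcars` data set as a stand-in for `df_exam`.

```r
data(mtcars)       # built-in data set, used here in place of excel_exam.xlsx

head(mtcars, 3)    # first 3 rows
tail(mtcars, 3)    # last 3 rows
dim(mtcars)        # 32 11: number of rows and columns
nrow(mtcars)       # 32
names(mtcars)      # column names
str(mtcars)        # type and first values of every column
summary(mtcars$mpg)   # min, quartiles, mean, and max of one column
```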
