[Kaggle] Titanic visualization and prediction (2)
Today I used the Kaggle Titanic dataset to try out some visualization and prediction.
getwd()
setwd('/Users/dayeong/Desktop/reserch/data')
# Kaggle 2DAY
# https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
# Load packages
library(ggplot2)      # visualization
library(ggthemes)     # extra ggplot themes (theme_few)
library(scales)       # axis label formatting (dollar_format)
library(dplyr)        # data manipulation
library(mice)         # multiple imputation of missing values
library(randomForest) # classification model
##
# Load Data
train <- read.csv('train.csv', stringsAsFactors = F)
test <- read.csv('test.csv', stringsAsFactors = F)
full <- bind_rows(train, test) # combine the train and test sets
# Check the data
str(full) # 12 variables and 1309 observations
##
# Feature Engineering
# Grab the title (Mr, Miss, Mrs, ...) from each passenger's full name
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
table(full$Sex, full$Title) # Show title counts by sex
# Titles with very low cell counts to be combined into a 'Rare Title' level
rare_title <- c('Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don',
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
# Also reassign Mlle, Ms, and Mme accordingly
full$Title[full$Title == 'Mlle'] <- 'Miss'
full$Title[full$Title == 'Ms'] <- 'Miss'
full$Title[full$Title == 'Mme']<- 'Mrs'
full$Title[full$Title %in% rare_title]<- 'Rare Title'
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1
# Create a family variable from surname and family size
full$Surname <- sapply(full$Name, function(x) strsplit(x, split = '[,.]')[[1]][1]) # Surname must be extracted first; it is also used below
full$Family <- paste(full$Surname, full$Fsize, sep = '_')
# Use ggplot2 to visualize the relationship between family size & survival
# Survival counts by family size
ggplot(data=full[1:891,], aes(x=Fsize, fill=factor(Survived))) +
geom_bar(stat='count', position = 'dodge')
# Discretize family size into three levels
# singleton / small / large
full$FsizeD[full$Fsize==1] <- 'singleton'
full$FsizeD[full$Fsize < 5 & full$Fsize >1 ] <- 'small'
full$FsizeD[full$Fsize > 4] <- 'large'
# Mosaic plot
mosaicplot(table(full$FsizeD, full$Survived), main = 'Family Size by Survival', shade = TRUE)
# The Cabin variable appears to have a lot of missing values
full$Cabin[1:28]
# The first character is the deck. For example
strsplit(full$Cabin[2],NULL)[[1]]
# Create a Deck variable. Get passenger deck A-F:
full$Deck <- factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
# Missingness
# Sensible value imputation
# Passengers 62 and 830 are missing their embarkment port
full[c(62,830), 'Embarked']
cat(paste('We will infer their values for **embarkment** based on present data that we can imagine may be relevant: **passenger class** and **fare**. We see that they paid<b> $',
full[c(62, 830), 'Fare'][[1]][1], '</b>and<b> $', full[c(62, 830), 'Fare'][[1]][2],
'</b>respectively and their classes are<b>', full[c(62, 830), 'Pclass'][[1]][1],
'</b>and<b>', full[c(62, 830), 'Pclass'][[1]][2], '</b>. So from where did they embark?'))
# Get rid of our missing passenger IDs
embark_fare <- full %>% filter(PassengerId!=62 & PassengerId!=830)
# Fare by embarkment port and passenger class
# The median fare for first-class passengers departing from Charbourg ('C') lines up with the $80 paid by the two embarkment-missing passengers,
# so we can replace their NA values with 'C'.
ggplot(data = embark_fare, aes(x=Embarked, y=Fare, fill=factor(Pclass)))+
geom_boxplot() +
geom_hline(aes(yintercept=80),colour='red', linetype='dashed', lwd=2) +
scale_y_continuous(labels=dollar_format()) +
theme_few()
# Since their fare was $80 for 1st class, they most likely embarked from 'C'
full$Embarked[c(62,830)] <- 'C'
# Show row 1044, which is missing its Fare value
full[1044,]
ggplot(full[full$Pclass=='3' & full$Embarked == 'S', ], aes(x=Fare)) +
geom_density(fill='#99d6ff') +
geom_vline(aes(xintercept=median(Fare, na.rm = T)), colour='red', lwd=1, linetype='dashed') +
scale_x_continuous(labels=dollar_format()) +
theme_few()
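The density plot above shows where third-class fares from 'S' cluster; following the approach in the referenced kernel, the single missing Fare can be filled with that group's median (a small addition, this line is not in the original post):
# Replace the missing Fare in row 1044 with the median fare of 3rd-class passengers embarking from 'S'
full$Fare[1044] <- median(full[full$Pclass == '3' & full$Embarked == 'S', ]$Fare, na.rm = TRUE)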
# Predictive imputation
# show number of missing age values
sum(is.na(full$Age))
# rpart could be used to predict the missing ages, but here we use mice instead. (? I don't quite get what this part means.)
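For reference, the rpart alternative mentioned in the comment above might look roughly like this; it is only a sketch (the formula and use of rpart are my assumptions, not part of the original post), and the post itself continues with mice:
# Sketch: fit a regression tree on rows with a known Age, then predict the missing ones
library(rpart)
age_fit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Title, data = full[!is.na(full$Age), ])
age_preds <- predict(age_fit, newdata = full[is.na(full$Age), ]) # candidate Age values (not written back to full here)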
# Turn the categorical variables into factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
'Title','Surname','Family','FsizeD')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
# Perform mice imputation, excluding certain less-than-useful variables
# Drop the ID-like and high-cardinality columns before imputation
# The random forest ('rf') method is used to impute the missing values (mainly Age)
mice_mod <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin','Family','Surname','Survived')],
                 method='rf')
# Save the complete output
mice_output <- complete(mice_mod)
# Compare age distributions: original data vs. mice output
par(mfrow = c(1, 2))
hist(full$Age, freq = F, main = 'Age: Original Data', col = 'darkgreen', ylim = c(0, 0.04))
hist(mice_output$Age, freq = F, main = 'Age: Mice Output', col = 'lightgreen', ylim = c(0, 0.04))
# Replace Age variable from the mice model
full$Age <- mice_output$Age
sum(is.na(full$Age)) # confirm there are no remaining NA values
##
# Feature engineering: round 2
# Survival counts by age, split by sex
ggplot(full[1:891,], aes(Age, fill=factor(Survived))) +
geom_histogram() +
facet_grid(.~Sex) +
theme_few()
# Create a Child column indicating whether each passenger is a child or an adult
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >=18] <- 'Adult'
# Survival counts for children vs. adults
table(full$Child, full$Survived)
# Check remaining NA patterns
md.pattern(full)
#
## prediction
train <- full[1:891,]
test <- full[892:1309,]
# Building the model
set.seed(754)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title +
FsizeD + Child,
data = train)
# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col = 1:3, fill=1:3)
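As a quick numeric complement to the error plot, the out-of-bag confusion matrix stored on the fitted model can be printed (a small assumed addition, not in the original post):
# OOB confusion matrix and per-class error on the training data
rf_model$confusion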
# Variable importance
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variable = row.names(importance),
                            Importance = round(importance[, 'MeanDecreaseGini'], 2))
# Create a rank variable based on importance
# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#', dense_rank(desc(Importance)))) # rank the variables by importance
# Visualization
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variable, Importance),
                           y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variable, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variable') +
coord_flip() +
theme_few()
#
# prediction
prediction <- predict(rf_model, test)
table(prediction)
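The referenced kernel ends by writing the predictions to a CSV for Kaggle submission; a minimal sketch along those lines (the file name is just an example):
# Save the solution as a two-column data frame: PassengerId and Survived
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
# Write the solution to file for submission
write.csv(solution, file = 'rf_titanic_solution.csv', row.names = FALSE)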