소품집

[빅분기] Big Data Analysis Engineer (빅데이터분석기사): working through hands-on Type 2 practice problems


sodayeong 2023. 11. 21. 00:16

https://www.datamanim.com/dataset/03_dataq/typetwo.html

 

Type 2 tasks (Python), DataManim

Note: the y_test values for every problem can be loaded from the corresponding URL as y_test. Use them to check the score of the predictions you prepared for submission.

 

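The Type 2 hands-on task is randomly either classification or regression, and because module autocompletion does not work, you have to memorize everything down to the package paths!

Simple data preprocessing followed by model building and evaluation should be enough.

Let's (memorize and) work through them!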
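When autocompletion is unavailable, one fallback is to look names up at runtime with Python's built-in dir() and help(). The following is a minimal sketch of that idea, assuming scikit-learn is installed; it is a general Python tip, not something confirmed about the exam environment.

```python
import sklearn.ensemble
import sklearn.metrics

# List the public names in a module to recall what can be imported from it
ensemble_names = [n for n in dir(sklearn.ensemble) if not n.startswith('_')]
print(ensemble_names)

# Filter by keyword when the list is long
auc_names = [n for n in dir(sklearn.metrics) if 'auc' in n.lower()]
print(auc_names)

# help() prints the signature and docstring of a specific class
# help(sklearn.ensemble.RandomForestClassifier)
```

Scanning the printed list is usually enough to recover a half-remembered name such as RandomForestClassifier or roc_auc_score.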

 

Classification

1. Service churn prediction data

import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

# Load the data

x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv")

display(x_train.head())
display(y_train.head())

# Check for missing values
print(x_train.isna().sum())
print(x_test.isna().sum())
print(y_train.isna().sum())

print(x_train.info())
print(y_train.info())
print(x_test.info())

# 1. Drop nominal identifier columns (id, name)
drop_col = ['CustomerId', 'Surname']

x_train_drop = x_train.drop(columns = drop_col, axis=1)
x_test_drop = x_test.drop(columns = drop_col, axis=1)

# 2. One-hot encoding
x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns] # match the training column order

# Extract y
y = y_train['Exited']

# Split into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=2022)

# RF
rf = RandomForestClassifier()
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)
y_pred_prob = rf.predict_proba(x_valid)[:, 1]

print(classification_report(y_valid, y_pred))
print(roc_auc_score(y_valid, y_pred_prob))

# Predict on the real test set
y_predict = rf.predict(x_test_dummies)
y_predict_proba = rf.predict_proba(x_test_dummies)[:,1]

a = pd.DataFrame({'ID' : x_test['CustomerId'], 'Predict' : y_predict})
a.to_csv('test01.csv', header=True, index=False)

b = pd.DataFrame({'ID' : x_test['CustomerId'], 'Predict' : y_predict_proba})
b.to_csv('test02.csv', header=True, index=False)
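Selecting the test columns with the training column list works only when every one-hot column from training also exists in the test set. As a hedged alternative, the sketch below uses DataFrame.reindex on made-up toy data (the 'Age' and 'Geography' values here are illustrative, not taken from the churn dataset) to add any missing dummy column filled with 0.

```python
import pandas as pd

# Toy data: the test set has no 'Spain' row, so its dummy column is missing
train = pd.DataFrame({'Age': [30, 41, 25], 'Geography': ['France', 'Spain', 'Germany']})
test = pd.DataFrame({'Age': [52, 38], 'Geography': ['Germany', 'France']})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# test_dummies[train_dummies.columns] would raise a KeyError here,
# because 'Geography_Spain' does not exist in test_dummies.
# reindex adds the missing column filled with 0 and drops columns unseen in training.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
print(test_aligned['Geography_Spain'].tolist())
```

The aligned frame has exactly the training columns in the training order, so it can be passed straight to predict.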

 

 

2. Job change prediction data

import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv")

display(x_train)
display(x_test)

# check
print(x_train.info())
print(x_test.info())

print(x_train['company_size'].value_counts())
print(x_train['company_type'].value_counts())

# drop columns
drop_col = ['enrollee_id', 'city']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

# dummies
x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns] # match the training column order

# Define the target
y = y_train['target']

# Split into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=2022)

# RF
rf = RandomForestClassifier()

params = {
    'max_depth' : [100, 200, 300, 400, 500], 
    'n_estimators' : [1,2,3,4,5]
}

# GridSearch
gr = GridSearchCV(rf, param_grid = params, cv=5)
gr.fit(x_train, y_train)
print(gr.best_params_)

rf = RandomForestClassifier(max_depth = 400, n_estimators=4)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)
y_pred_prob = rf.predict_proba(x_valid)[:, 1]

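# Evaluate performance on the validation set only
print(classification_report(y_valid, y_pred))
print(roc_auc_score(y_valid, y_pred_prob)) # AUC is computed from the predicted probabilities

# Predict on the real test set
y_predict = rf.predict(x_test_dummies) # class labels
y_predict_proba = rf.predict_proba(x_test_dummies)[:, 1] # for when the task asks for probabilities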

a = pd.DataFrame({ 'ID' : x_test['enrollee_id'], 'Predict' : y_predict})

b = pd.DataFrame({ 'ID' : x_test['enrollee_id'], 'Predict' : y_predict_proba})
b.to_csv('test02.csv', header=True, index=False)
a.to_csv('test01.csv', header=True, index=False)
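ROC AUC is a ranking metric, so it is normally computed from predict_proba output and not from hard class labels. The sketch below uses made-up numbers (not from the HR dataset) to show how much information is lost when probabilities are thresholded before scoring.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.7, 0.9]

# Hard labels obtained with a 0.5 threshold, as predict() would return
y_label = [1 if p >= 0.5 else 0 for p in y_prob]

auc_from_prob = roc_auc_score(y_true, y_prob)
auc_from_label = roc_auc_score(y_true, y_label)

print(auc_from_prob, auc_from_label)
```

With hard labels every prediction collapses to 0 or 1, so ties replace the finer ordering that the probabilities carry and the score drops.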

 

 

3. On-time delivery prediction (past question from exam round 2)

import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv")

display(x_train.head())
display(y_train.head())

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)

print(x_train.info())
print(x_test.info())
print(y_train.info())

# Drop columns 
drop_col = ['ID']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

# Dummies 
x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns] # match the training column order

# y
y = y_train['Reached.on.Time_Y.N']

# Split into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=42)

# GridSearch
rf = RandomForestClassifier()

params = {'max_depth' : [100, 200, 300, 400, 500], 
         'n_estimators' : [1,2,3,4,5]}

gr = GridSearchCV(rf, param_grid = params, cv=5)
gr.fit(x_train, y_train)
print(gr.best_params_)

# RF
rf = RandomForestClassifier(max_depth = 500, n_estimators = 4)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)
y_pred_prob = rf.predict_proba(x_valid)[:, 1]

# Evaluation
print(classification_report(y_valid, y_pred))
print(roc_auc_score(y_valid, y_pred_prob))

# Predict on the real test set
y_predict = rf.predict(x_test_dummies)
y_predict_proba = rf.predict_proba(x_test_dummies)[:, 1]

a = pd.DataFrame({'ID' : x_test['ID'], 'Predict' : y_predict})
a.to_csv('test01.csv', header=True, index=False)

b = pd.DataFrame({'ID' : x_test['ID'], 'Predict' : y_predict_proba})
b.to_csv('test02.csv', header=True, index=False)
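Retyping the printed best_params_ by hand is easy to get wrong under time pressure. As a hedged alternative, GridSearchCV refits the best configuration by default and exposes it as best_estimator_. The sketch below runs on synthetic stand-in data from make_classification; the grid values and scoring='roc_auc' are illustrative assumptions, not the settings used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the shipping dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=42)

# Illustrative grid; scoring='roc_auc' selects parameters by AUC instead of accuracy
params = {'max_depth': [3, 5], 'n_estimators': [10, 30]}
gr = GridSearchCV(RandomForestClassifier(random_state=42),
                  param_grid=params, cv=3, scoring='roc_auc')
gr.fit(X_tr, y_tr)

print(gr.best_params_)

# best_estimator_ is already refit on all of X_tr with the best parameters
best_rf = gr.best_estimator_
valid_prob = best_rf.predict_proba(X_va)[:, 1]
```

Using best_rf directly avoids a second fit and keeps the final model in sync with whatever the search actually selected.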


4. Adult health checkup data

import pandas as pd

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv")


display(x_train.head())
display(y_train.head())

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, roc_auc_score

print(x_train.info())
print(x_test.info())
print(y_train.info())

x_train_dummies = pd.get_dummies(x_train)
x_test_dummies = pd.get_dummies(x_test)

x_test_dummies = x_test_dummies[x_train_dummies.columns]

y = y_train['흡연상태']

x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=2022)

rf = RandomForestClassifier()

params = {
    'max_depth' : [100, 200, 300, 400, 500], 
    'n_estimators' : [1,2,3,4,5]
}

gr = GridSearchCV(rf, param_grid = params, cv=5)
gr.fit(x_train, y_train)
print(gr.best_params_)

# RF
rf = RandomForestClassifier(max_depth=300, n_estimators=5, random_state=2022)

rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)
y_pred_prob = rf.predict_proba(x_valid)[:, 1]

print(classification_report(y_valid, y_pred))
print(roc_auc_score(y_valid, y_pred_prob))

y_predict = rf.predict(x_test_dummies)
y_predict_proba = rf.predict_proba(x_test_dummies)[:, 1]

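# Keep both result frames and save them, as in the earlier problems
a = pd.DataFrame({'ID': x_test['ID'], 'Predict' : y_predict})
a.to_csv('test01.csv', header=True, index=False)
b = pd.DataFrame({'ID': x_test['ID'], 'Predict' : y_predict_proba})
b.to_csv('test02.csv', header=True, index=False)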

 

5. Car insurance sign-up prediction data

import pandas as pd

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/insurance/x_test.csv")

display(x_train.head())
display(y_train.head())

print(x_train.info())
print(y_train.info())
print(x_test.info())

drop_col = ['ID', 'id']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns]

y = y_train['Response']

# RF

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier()

params = {
    'max_depth' : [100, 200, 300, 400, 500], 
    'n_estimators' : [1,2,3,4,5]
}
gr = GridSearchCV(rf, param_grid = params, cv= 5)
gr.fit(x_train, y_train)

y_pred = gr.predict(x_valid)
print(classification_report(y_valid, y_pred))
print(gr.best_params_)

rf = RandomForestClassifier(max_depth = 100, n_estimators =4)
rf.fit(x_train, y_train)

y_predict = rf.predict(x_test_dummies)
y_predict_proba = rf.predict_proba(x_test_dummies)[:, 1]
y_predict_proba

a = pd.DataFrame({'ID' : x_test['ID'], 'Predict' : y_predict}) # 'ID' was dropped from x_test_dummies, so take it from x_test
a.to_csv('a.csv', index=False, header=True)

 

 

6. Airline passenger satisfaction data

import pandas as pd

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv")

display(x_train.head())
display(y_train.head())

print(x_train.info())
print(x_test.info())
print(y_train.info())

# Check for missing values
print(x_train.isna().sum())
print(x_test.isna().sum())
print(y_train.isna().sum())

# Fill missing values with each set's own mean
x_train['Arrival Delay in Minutes'] = x_train['Arrival Delay in Minutes'].fillna(x_train['Arrival Delay in Minutes'].mean())
x_test['Arrival Delay in Minutes'] = x_test['Arrival Delay in Minutes'].fillna(x_test['Arrival Delay in Minutes'].mean())

print(x_train.isna().sum())
print(x_test.isna().sum())

drop_col = ['ID', 'id']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns]

y = y_train['satisfaction']

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_valid)

print(classification_report(y_valid, y_pred))
y_predict_proba = rf.predict_proba(x_test_dummies)[:, 1]

results = pd.DataFrame({'ID' : x_test['ID'], 'Predict' : y_predict_proba})
results.to_csv('result.csv', header=True, index=False)
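The code above fills each set with its own mean. A common alternative convention, shown here only as a sketch and not as the approach taken in this post, is to compute the mean on the training set and reuse it for the test set, so that the test data is transformed only with statistics learned from training. The toy values below are made up.

```python
import numpy as np
import pandas as pd

col = 'Arrival Delay in Minutes'

# Toy frames with one missing value each
train = pd.DataFrame({col: [10.0, np.nan, 20.0, 30.0]})
test = pd.DataFrame({col: [np.nan, 5.0]})

# mean() skips NaN, so this is (10 + 20 + 30) / 3
train_mean = train[col].mean()

# Apply the same training statistic to both sets
train[col] = train[col].fillna(train_mean)
test[col] = test[col].fillna(train_mean)

print(train_mean)
print(test[col].tolist())
```

For a column with few missing values the two approaches give nearly identical models, so either is workable in a timed setting.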

 

 

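7. Drug classification data

import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# Load the data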
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv")

display(x_train.head())
display(y_train.head())

print(x_train.isna().sum())
print(x_test.isna().sum())
print(y_train.isna().sum())

print(x_train.info())
print(x_test.info())
print(y_train.info())

drop_col = ['ID']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns]

y = y_train['Drug']

x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=42)

# RF
rf = RandomForestClassifier()

params = {
    'max_depth' : [100, 200, 300, 400, 500], 
    'n_estimators' : [1,2,3,4,5]
}

gr = GridSearchCV(rf, param_grid=params, cv=5)
gr.fit(x_train, y_train)

print(gr.best_params_)

rf = RandomForestClassifier(max_depth=100, n_estimators=5)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)

print(classification_report(y_valid, y_pred))

# Predict on the real test set
y_predict = rf.predict(x_test_dummies)
results = pd.DataFrame({'ID' : x_test['ID'], 'Predict' : y_predict})
results.to_csv('results.csv', index=False, header=True)

 

Regression

 

1. Student grade prediction data

import pandas as pd

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/studentscore/X_test.csv")


display(x_train.head())
display(y_train.head())

print(x_train.isna().sum())
print(x_test.isna().sum())
print(y_train.isna().sum())

print(x_train.info())
print(x_test.info())
print(y_train.info())

drop_col = ['StudentID']

x_train_drop = x_train.drop(columns=drop_col, axis=1)
x_test_drop = x_test.drop(columns=drop_col, axis=1)

x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns]

import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score

y = y_train['G3']
x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y,test_size=0.3, random_state=42)

rf = RandomForestRegressor(random_state=42)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)

print('validation mse' ,mean_squared_error(y_valid,y_pred))
print('validation mae' ,mean_absolute_error(y_valid,y_pred))
print('validation mape' ,mean_absolute_percentage_error(y_valid,y_pred))
print('validation rmse' ,np.sqrt(mean_squared_error(y_valid,y_pred)))
print('validation r2 score' ,r2_score(y_valid,y_pred))
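RMSE is the square root of MSE, so it is expressed in the same units as the target. The sketch below checks the relationship on made-up numbers (not from the student score data), using only mean_squared_error and np.sqrt.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up targets and predictions; the errors are -1, 0, 2, 1
y_true = [3, 5, 2, 7]
y_hat = [2, 5, 4, 8]

mse = mean_squared_error(y_true, y_hat)    # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                        # square root of MSE
mae = mean_absolute_error(y_true, y_hat)   # (1 + 0 + 2 + 1) / 4

print(mse, rmse, mae)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE on the same predictions.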

 

2. King County house price prediction data

import pandas as pd

# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/x_test.csv")

display(x_train.head())
display(y_train.head())

print(x_train.isna().sum())
print(x_test.isna().sum())
print(y_train.isna().sum())

print(x_train.info())

col = ['ID', 'id', 'date']

x_train_drop = x_train.drop(columns=col, axis=1)
x_test_drop = x_test.drop(columns=col, axis=1)

x_train_dummies = pd.get_dummies(x_train_drop)
x_test_dummies = pd.get_dummies(x_test_drop)

x_test_dummies = x_test_dummies[x_train_dummies.columns]
y = y_train['price']

print(x_train_dummies.columns)
print(x_test_dummies.columns)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

x_train, x_valid, y_train, y_valid = train_test_split(x_train_dummies, y, test_size=0.3, random_state=42)

rf = RandomForestRegressor(random_state=42)

params = {
    'max_depth' : [100,200,300,400,500], 
    'n_estimators' : [1,2,3,4,5]
}

gr = GridSearchCV(rf, param_grid = params)
gr.fit(x_train, y_train)

print(gr.best_params_)

rf = RandomForestRegressor(random_state=42, max_depth=100, n_estimators=5)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_valid)
print(mean_absolute_error(y_valid, y_pred))

y_predict = rf.predict(x_test_dummies)
true = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/kingcountyprice/y_test.csv')
price = true['price']

print(mean_absolute_error(price, y_predict))

result = pd.DataFrame({'predict':y_predict})
result.to_csv('result.csv', index=False)