[Coding Test] Data Analysis Task
도도o
2024. 12. 14. 13:07
Data preprocessing
1. Drop unneeded columns
2. Handle missing values
3. Handle outliers
4. Scaling
5. Encoding
6. Create derived variables
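Outlier handling is on the checklist but has no snippet in these notes. A minimal sketch, assuming the common 1.5 × IQR rule and clipping rather than dropping rows. The helper `clip_iqr` and the toy values are hypothetical; the bounds come from the training data only, in the same spirit as the train-based imputation used later.

```python
import pandas as pd

def clip_iqr(train: pd.Series, test: pd.Series, k: float = 1.5):
    """Clip both series to [Q1 - k*IQR, Q3 + k*IQR], with bounds computed on train only."""
    q1, q3 = train.quantile(0.25), train.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return train.clip(lower, upper), test.clip(lower, upper)

# Hypothetical toy data: 100.0 is an obvious outlier in the training column
train = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
test = pd.Series([0.0, 2.5, 50.0])

train_clipped, test_clipped = clip_iqr(train, test)
print(train_clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
print(test_clipped.tolist())   # [0.0, 2.5, 7.0]
```

Clipping keeps the row count unchanged, which matters when the test set must produce one prediction per row.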
1. Data loading and exploration
import pandas as pd
data = pd.read_csv()  # pass the path to the CSV file here
# Explore the data
data.info()
data.isna().sum()
data.describe()
data.head()
data[col].value_counts()
data = data.drop(columns=cols)  # cols: list of column names to drop
2. Data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,1:], data['y'], test_size=0.2, random_state=123)
3. Data preprocessing
## 1. Missing values (e.g., mean imputation)
X_train[missing] = X_train[missing].fillna(X_train[missing].mean())
X_test[missing] = X_test[missing].fillna(X_train[missing].mean()) # impute with the training-set mean
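The same train-only imputation can also be done with scikit-learn's `SimpleImputer`, which stores the training mean at `fit` time. A small sketch on hypothetical toy data (the `weight` column is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each set
X_train = pd.DataFrame({'weight': [1.0, np.nan, 3.0, 5.0]})
X_test = pd.DataFrame({'weight': [np.nan, 2.0]})

imputer = SimpleImputer(strategy='mean')
# fit_transform learns the mean on train; transform reuses it on test
X_train['weight'] = imputer.fit_transform(X_train[['weight']])[:, 0]
X_test['weight'] = imputer.transform(X_test[['weight']])[:, 0]

print(X_train['weight'].tolist())  # [1.0, 3.0, 3.0, 5.0]
print(X_test['weight'].tolist())   # [3.0, 2.0]
```

Switching to `strategy='median'` or `strategy='most_frequent'` changes the fill value without changing the fit/transform pattern.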
## 2. Label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X_train[label] = label_encoder.fit_transform(X_train[label])
X_test[label] = label_encoder.transform(X_test[label])
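One thing to watch: `LabelEncoder.transform` raises a `ValueError` if the test set contains a category that never appeared in training. A hedged alternative, assuming a fixed code for unknown categories is acceptable, is `OrdinalEncoder` with `handle_unknown='use_encoded_value'`. The `color` column and the new `color_code` column are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 'purple' appears only in the test set
X_train = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})
X_test = pd.DataFrame({'color': ['blue', 'purple']})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train['color_code'] = encoder.fit_transform(X_train[['color']])[:, 0]
X_test['color_code'] = encoder.transform(X_test[['color']])[:, 0]

print(X_train['color_code'].tolist())  # [2.0, 0.0, 2.0, 1.0]
print(X_test['color_code'].tolist())   # [0.0, -1.0]
```

Categories are sorted alphabetically here, so blue, green and red become 0, 1 and 2.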
## 3. Categorical columns, dummy variables
X_train[cat] = X_train[cat].astype('category')
X_test[cat] = X_test[cat].astype('category')
X_train = pd.get_dummies(X_train, columns=cat)
X_test = pd.get_dummies(X_test, columns=cat)
### Align the test columns with the training columns
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
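To see what the `reindex` step does, here is a small sketch on hypothetical toy data where the test set lacks one training category (`diesel`) and has one extra (`electric`). The frame and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: category sets differ between train and test
X_train = pd.DataFrame({'fuel': ['gas', 'diesel', 'gas'], 'hp': [100, 150, 120]})
X_test = pd.DataFrame({'fuel': ['gas', 'electric'], 'hp': [110, 200]})

X_train_d = pd.get_dummies(X_train, columns=['fuel'])
X_test_d = pd.get_dummies(X_test, columns=['fuel'])

# Missing training columns are added and filled with 0;
# columns unseen in training (fuel_electric) are dropped
X_test_d = X_test_d.reindex(columns=X_train_d.columns, fill_value=0)

print(list(X_test_d.columns))  # ['hp', 'fuel_diesel', 'fuel_gas']
```

After this step both frames have the same columns in the same order, which the fitted models require.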
## 4. Create derived variables (qcut: roughly equal number of rows in each bin)
X_train['horse_qcut'] = pd.qcut(X_train['horsepower'], 5, labels=False)
X_test['horse_qcut'] = pd.qcut(X_test['horsepower'], 5, labels=False)
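`pd.qcut` makes equal-frequency bins, while `pd.cut` makes equal-width bins. A small sketch of the difference on hypothetical toy values. It also reuses the training bin edges on the test set via `retbins=True`, which is an assumption about what is wanted rather than something these notes do.

```python
import pandas as pd

# Hypothetical toy values: 500 is far from the rest
train_hp = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 130, 500])
test_hp = pd.Series([55, 95, 300])

# Equal-width bins: the value 500 stretches the range, so nine rows land in bin 0
width_bins = pd.cut(train_hp, 5, labels=False)

# Equal-frequency bins: each of the five bins gets two of the ten rows
freq_bins, edges = pd.qcut(train_hp, 5, labels=False, retbins=True)

# Reuse the training edges so bin k means the same range in train and test
test_bins = pd.cut(test_hp, bins=edges, labels=False, include_lowest=True)

print(width_bins.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(freq_bins.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(test_bins.tolist())   # [0, 2, 4]
```

Test values outside the training range would come back as NaN with this approach, so they need a fallback such as clipping first.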
## 5. Scaling
from sklearn.preprocessing import MinMaxScaler
scale = ['dis', 'wei', 'hos']
scaler = MinMaxScaler()
X_train[scale] = scaler.fit_transform(X_train[scale])
X_test[scale] = scaler.transform(X_test[scale])
## 6. Split off a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)
4. Regression modeling
from sklearn.linear_model import LinearRegression
model1 = LinearRegression()
model1.fit(X_train, y_train)
pred1 = model1.predict(X_valid)
from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor()
model2.fit(X_train, y_train)
pred2 = model2.predict(X_valid)
print(model1.score(X_valid,y_valid)) # R-squared
print(model2.score(X_valid,y_valid))
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_valid, pred1)) # MSE
print(mean_squared_error(y_valid, pred2))
5. Classification modeling
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model1.fit(X_train,y_train)
pred1 = pd.DataFrame(model1.predict_proba(X_valid))
from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier()
model2.fit(X_train,y_train)
pred2 = pd.DataFrame(model2.predict_proba(X_valid))
from sklearn.metrics import roc_auc_score
score1 = roc_auc_score(y_valid, pred1[1])
score2 = roc_auc_score(y_valid, pred2[1])
print(score1, score2)
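Indexing `pred1[1]` assumes the positive class sits in column 1 of `predict_proba`. The columns follow `model.classes_`, so a slightly more defensive sketch looks the column up explicitly. The synthetic data from `make_classification` and the name `positive_label` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the real features and target
X, y = make_classification(n_samples=200, n_features=5, random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=123
)

model = LogisticRegression().fit(X_train, y_train)

# predict_proba columns are ordered like model.classes_
positive_label = 1
pos_col = list(model.classes_).index(positive_label)
proba = model.predict_proba(X_valid)[:, pos_col]

auc = roc_auc_score(y_valid, proba)
print(model.classes_, auc)
```

With string labels such as 'no'/'yes' the order is alphabetical, so checking `classes_` avoids scoring the wrong column.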
6. Final output
result = pd.DataFrame(model2.predict_proba(X_test))
result = result.iloc[:,1]
pd.DataFrame({'id':X_test.index, 'result':result}).to_csv('submission.csv', index=False)