
[Coding Test] Data Analysis Assignment

도도o 2024. 12. 14. 13:07

 

Data preprocessing

1. Drop unneeded columns

2. Handle missing values

3. Handle outliers

4. Scaling

5. Encoding

6. Create derived variables
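
Outlier handling (item 3) appears in the checklist but has no snippet later in the post. One common option is to clip values to the IQR fences; this is a minimal sketch under that assumption, not the only valid approach. The `weight` column and all values are made up for illustration.

```python
import pandas as pd

# Toy training and test columns; the values are made up for illustration.
train = pd.DataFrame({"weight": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})
test = pd.DataFrame({"weight": [9.0, 12.5, 250.0]})

# Compute the IQR fences on the training data only.
q1 = train["weight"].quantile(0.25)
q3 = train["weight"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip both sets with the training fences so no test information leaks in.
train["weight"] = train["weight"].clip(lower, upper)
test["weight"] = test["weight"].clip(lower, upper)

print(lower, upper)
print(train["weight"].tolist())
print(test["weight"].tolist())
```

Clipping keeps every row, which matters when the test set must keep its original length for submission; dropping outlier rows is only safe on the training set.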

 

 

 

1. Load and explore the data

import pandas as pd

data = pd.read_csv(path)  # path: location of the CSV file

# Explore the data
data.info()
data.isna().sum()
data.describe()
data.head()

data[col].value_counts()

# cols: list of column names to drop
data = data.drop(columns=cols)

 

 

2. Split the data

from sklearn.model_selection import train_test_split

# iloc[:, 1:] drops the first column; make sure 'y' is not left among the features
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,1:], data['y'], test_size=0.2, random_state=123)

 

 

3. Preprocess the data

## 1. Missing values (example: mean imputation)
X_train[missing] = X_train[missing].fillna(X_train[missing].mean())
X_test[missing] = X_test[missing].fillna(X_train[missing].mean())  # fill with the training-set mean

## 2. Label encoding (transform raises an error if the test set holds a category unseen in training)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X_train[label] = label_encoder.fit_transform(X_train[label])
X_test[label] = label_encoder.transform(X_test[label])

## 3. Categorical columns, dummy variables
X_train[cat] = X_train[cat].astype('category')
X_test[cat] = X_test[cat].astype('category')
X_train = pd.get_dummies(X_train, columns=cat)
X_test = pd.get_dummies(X_test, columns=cat)
### Give the test set the same columns as the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

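## 4. Create a derived variable (qcut: each bin holds roughly the same number of rows)
### Learn the bin edges on the training set and reuse them on the test set
### (assumes the column is named 'horsepower' in both sets)
X_train['horse_qcut'], bins = pd.qcut(X_train['horsepower'], 5, labels=False, retbins=True)
bins[0], bins[-1] = float('-inf'), float('inf')  # keep out-of-range test values in the outer bins
X_test['horse_qcut'] = pd.cut(X_test['horsepower'], bins, labels=False)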

## 5. Scaling
from sklearn.preprocessing import MinMaxScaler
scale = ['dis', 'wei', 'hos']
scaler = MinMaxScaler()
X_train[scale] = scaler.fit_transform(X_train[scale])
X_test[scale] = scaler.transform(X_test[scale])

## 6. Split the training set again to create a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)
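
Step 4 above notes that qcut gives each bin about the same number of rows. The difference from cut is easiest to see on skewed data, so here is a small sketch on made-up values (the numbers stand in for something like horsepower and are not from any real dataset). It also shows one way to reuse the training bin edges on new data, which is an assumption about how you might want to treat the test set.

```python
import pandas as pd

# Skewed toy values standing in for a column such as horsepower (made-up data).
values = pd.Series([50, 55, 60, 65, 70, 75, 80, 90, 200, 400])

# pd.cut: 5 bins of equal width across the value range.
by_width = pd.cut(values, 5, labels=False)

# pd.qcut: 5 bins holding roughly the same number of rows.
by_count, edges = pd.qcut(values, 5, labels=False, retbins=True)

print(by_width.value_counts().sort_index().to_dict())
print(by_count.value_counts().sort_index().to_dict())

# Reuse the learned edges on new data; open the outer edges so values
# beyond the original range still land in the first or last bin.
edges[0], edges[-1] = float("-inf"), float("inf")
new_values = pd.Series([40, 72, 500])
print(pd.cut(new_values, edges, labels=False).tolist())
```

With equal-width bins most rows pile into the first bin because two large values stretch the range, while the quantile bins stay balanced.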

 

 

4. Regression modeling

from sklearn.linear_model import LinearRegression
model1 = LinearRegression()
model1.fit(X_train, y_train)
pred1 = model1.predict(X_valid)

from sklearn.ensemble import RandomForestRegressor
model2 = RandomForestRegressor()
model2.fit(X_train, y_train)
pred2 = model2.predict(X_valid)

print(model1.score(X_valid,y_valid)) # R-squared
print(model2.score(X_valid,y_valid))

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_valid, pred1)) # MSE
print(mean_squared_error(y_valid, pred2))
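
For a regressor, `score` returns R-squared and `mean_squared_error` returns MSE. To make clear what those two numbers measure, here is a small hand calculation with NumPy on made-up targets and predictions. RMSE is included as well because graders often ask for it; that part is an addition, not something the snippet above computes.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, expressed in the target's own units.
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

Lower is better for MSE and RMSE, higher is better for R-squared, so be careful not to mix up the direction when comparing model1 and model2.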

 

 

5. Classification modeling

from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model1.fit(X_train,y_train)
pred1 = pd.DataFrame(model1.predict_proba(X_valid))

from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier()
model2.fit(X_train,y_train)
pred2 = pd.DataFrame(model2.predict_proba(X_valid))


from sklearn.metrics import roc_auc_score
score1 = roc_auc_score(y_valid, pred1[1])
score2 = roc_auc_score(y_valid, pred2[1])
print(score1, score2)
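
The code above passes column 1 of `predict_proba` to `roc_auc_score`. The columns follow the order of `model.classes_`, so column 1 is the positive-class probability only when the labels are 0 and 1 (or otherwise sort so that the positive class comes second). A minimal sketch on a tiny made-up dataset, scored on the training rows purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Tiny made-up binary dataset: larger x means class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

# Columns of predict_proba follow model.classes_, so column 1 is P(y == 1) here.
print(model.classes_)
print(proba.shape)

# roc_auc_score expects the positive-class score, i.e. proba[:, 1].
auc = roc_auc_score(y, proba[:, 1])
print(auc)
```

If the target is stored as strings, check `model.classes_` first and pick the column that matches the positive label instead of hard-coding 1.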

 

 

6. Final output

result = pd.DataFrame(model2.predict_proba(X_test))
result = result.iloc[:,1]
pd.DataFrame({'id':X_test.index, 'result':result}).to_csv('submission.csv', index=False)