패스트캠퍼스 파이썬을 활용한 데이터 전처리 Level UP 챌린지 참여 후기

Programming/Python

패스트캠퍼스 파이썬을 활용한 데이터 전처리 Level UP 챌린지 참여 후기

패치#노트 2020. 12. 8. 10:52

728x90

패스트캠퍼스

[파이썬을 활용한 데이터 전처리 Level UP]

챌린지 참여 후기

마지막 실전 프로젝트를 정리하기전 이번 챌린지 이벤트 참여에 대한 소회..

나에게 아직은 데이터분석은 생소한 분야임에는 틀림이 없다. 하지만 한단계 레벨업을 하기 위해서는 반드시 필요하다가 나 스스로가 깨닫고 있다.

수 많은 용어들과 통계학, 라이브러리 등 적응이 잘 안 되고 있지만 반복적인 학습을 통해서 매일매일 해나가도 보면 언젠가는 도달 할 수 있지 않을까?

아직도 늦었다고 생각하는가?

오늘도 한 걸음만 더 나아가보자!!!

할 수 있다!!!!!

해야만 한다!!!!!

01. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 01. (1) 문제 소개

* 대회 소개

- 출처 : 삼성 신용카드 매출 예측 경진대회

. 문제 제공자 : FUNDA (데이콘)

. 소상공인 가맹점 신용카드 빅데이터와 AI로 매출 예측 분석

dacon.io/competitions/official/140472/overview/

상점 신용카드 매출 예측 경진대회

출처 : DACON - Data Science Competition

dacon.io

// 하나의 레코드가 하나로 대응 되는 거라고 생각하면 된다.

02. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 02-1. (2) 학습 데이터 구축 (이론)

* 기본 데이터 구조 설계

- 레코드가 수집된 시간 기준으로 3개월 이후의 총 매출을 예측하도록 구조를 설계해야 한다.

- 시점은 월단위로 정의

* 시점변수 생성

// 시점 변수 생성이 필요하다.

- 시점 변수 생성 : 시점 (t) = (연도 - 2016 ) * 12 + 월

* 변주 변수 탐색

// 부적절하다고 하는 것은 데이터 기반이 아니라고 생각하면 된다.

// 범주형 변수로 상태 공간이 매우 커서, 더미화하기에는 부적절하다고 판단했다.

// 결측은 제거하지 않고 없음이라고 변환한다.

* 학습 데이터 구조 작성

// 정리되지 않은 데이터를 바탕으로 학습 데이터를 생성해야 하는 경우에는 레코드의 단위를 고려하여 학습 데이터의 구조를 먼저 작성하는 것이 바람직하다.

// 중복을 제거

* 평균 할부율 부착

// 할부를 많이 하는 그룹이 있을 것 같아서 groupby 를 이용해서 생성 replace 를 이용해서 다시 만들었고

* 기존 데이터에 부착 테크닉

// concat, merge 를 이용하면 일반적인 것은 붙일 수 있는데

// 다른 데이터에서 붙여야 하므로 case1 t가 유니크한 경우, 각 데이터를 정렬 후, 한 데이터에 대해 shift 를 사용

// case 2. t 가 유니크하지 않은 경우, t_1 변수를 생성

* 기존 데이터 부착 테크닉 Case 1

// shift 를 이용해서 concat 수행

// df2.shift(1)

// 시계열 예측에서 많이 쓰이는 패턴이다.

* 기존 데이터 부착 테크닉 Case 2

// 새로운 컬럼은 생성하서 merge를 수행한다.

* 기존 매출 합계 부착

* 기존 지역별 매출 합계 부착

* 라벨 부착하기

// 라벨의 경우에는 어떤 시점 t 에 상점 id 별로 붙이고 1 2 3 에 대해서 더 해준다.

03. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 02-2. (2) 학습 데이터 구축 (실습)

// region 에 결측이 있지만 대체를 하기에는 힘들고 제거를 하기에는 힘들다. 그래서 없음으로 표시

// 변수 목록 탐색을 통해서 상점 ID 가 일치 하는지 탐색을 해야 한다.

* 학습 데이터 구축

// 일자에서 연, 월, 일을 구분 해야 하므로 split 을 사용해서 구분

// 데이터 병합을 위한 새로운 컬럼 생성 및 기존 시간 변수를 삭제한다.

* 불필요한 변수 제거

// card_id, card_company는 특징으로 사용하기에는 너무 세분화 될수 있어서, 특징으로 유요할 가능성이 없다고 삭제.

// 주관적인 판단으로 처리한 것이다.

* 업종 특성, 지역, 할부 평균 탐색

// 대부분이 일시불이므로, installment_term 변수를 할부인지 아닌지 여부로 변환해야 한다.

// astype(int) 로 해서 할부는 0 , 할부하지 않은 것은 1로 나타난다.

// 상점별 평균 할부 비율

// .mean( ) 함수를 통해서 각 평균을 구해서 store_id 를 본다.

// 지역은 region 제거를 할수 없기 때문에 없음으로 대체 '없음'

// 업종도 비슷하다.

// 피벗 테이블을 생성하고 결측을 바로 앞 값으로 채울 수 있다.

// 값이 없는 경우는 결측으로 출력이 나옴.

// ffill 바로 앞 값과 bfill 그 다음은 바로 뒤값으로 채워준다.

// 변수 실수를 줄여야 한다.

// 모델 학습을 적용하기 위한 일반적인 전처리에 대한 준비를 했다.

04. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 03. (3) 학습 데이터 탐색 및 전처리

* 학습 데이터 기초 탐색 및 전처리

- 특징과 라벨 분리

// 필요없는 라벨들은 drop 으로 제거 해준다. axis = 1

- 학습데이터와 평가 데이터로 데이터 분할

// 특징 대비 샘플이 많다는 것을 알 수 있다. 37673

// 회귀 문제 예측 문제라고 예상하면 된다.

- 이상치 제거

// IQR rule 을 위배하지 않는 bool list 계산 (True : 이상치 없음, False : 이상치 있음)

// Y_condition 이상치가 아닌 값들로 구성된것

- 치우침 제거

// 모두 좌로 치우침을 확인 할수가 있다.

// 왜도의 절대값이 1.5이상인 컬럼만 가져왔다.

// Train_X.skew().abs() > 1.5

// 스케일링 수행

sklearn.preprocessing MinMaxScaler 수행

// 원래는 데이터 프레임을 정의하는 것이 좋기는 한데 메모리 문제로 처음부터 정의하지 않음

// 아무리 꼼꼼하더라도 모든 변수를 제거 할 필요는 없다.

// abs( ) Documentation

docs.python.org/3/library/functions.html#abs

abs(x)

Return the absolute value of a number. The argument may be an integer, a floating point number, or an object implementing __abs__(). If the argument is a complex number, its magnitude is returned.

3. Data model — Python 3.9.1rc1 documentation

A class can implement certain operations that are invoked by special syntax (such as arithmetic operations or subscripting and slicing) by defining methods with special names. This is Python’s approach to operator overloading, allowing classes to define

docs.python.org

object.__abs__(self)

05. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 04. (4) 모델 학습

// 모델 선택

// 샘플 대비 특징이 적고, 특징의 타입이 전부연속형으로 같다.

// kNN, RandomForestRegressor, LightGBM 고려함.

// 신경망을 쓰기에는 변수가 적기 때문에 좋은 결과를 기대하기는 어렵다.

// sklearn.ensemble.RandomForestRegressor Documentation

scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

A random forest regressor.

A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Read more in the User Guide.

Parameters

n_estimatorsint, default=100

The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

criterion{“mse”, “mae”}, default=”mse”

The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.

New in version 0.18: Mean Absolute Error (MAE) criterion.

max_depthint, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint or float, default=2

The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.

min_samples_leafint or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.

min_weight_fraction_leaffloat, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features{“auto”, “sqrt”, “log2”}, int or float, default=”auto”

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

max_leaf_nodesint, default=None

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decreasefloat, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.

min_impurity_splitfloat, default=None

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.

bootstrapbool, default=True

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_scorebool, default=False

whether to use out-of-bag samples to estimate the R^2 on unseen data.

n_jobsint, default=None

The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_stateint or RandomState, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.

verboseint, default=0

Controls the verbosity when fitting and predicting.

warm_startbool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

ccp_alphanon-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.

New in version 0.22.

max_samplesint or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).

New in version 0.22.

Attributes

base_estimator_DecisionTreeRegressor

The child estimator template used to create the collection of fitted sub-estimators.

estimators_list of DecisionTreeRegressor

The collection of fitted sub-estimators.

feature_importances_ndarray of shape (n_features,)

The impurity-based feature importances.

n_features_int

The number of features when fit is performed.

n_outputs_int

The number of outputs when fit is performed.

oob_score_float

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

oob_prediction_ndarray of shape (n_samples,)

Prediction computed with out-of-bag estimate on the training set. This attribute exists only when oob_score is True.

// 파라미터 그리드를 생성하고 하이퍼 마라미터 그리드를 parma_dict 에 추가함.

// sklearn.metrics.mean_absolute_error Documentation

scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

sklearn.metrics.mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')

Mean absolute error regression loss

Read more in the User Guide.

Parameters

y_truearray-like of shape (n_samples,) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_predarray-like of shape (n_samples,) or (n_samples, n_outputs)

Estimated target values.

sample_weightarray-like of shape (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’ :

Returns a full set of errors in case of multioutput input.

‘uniform_average’ :

Errors of all outputs are averaged with uniform weight.

Returns

lossfloat or ndarray of floats

If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.

// 값이 작을 수록 좋기 때문에 초기 값은 매우 큰 값으로 정의함.

// LightGBM 에서 DataFrame 이 잘 처리 되지 않는 것을 방지하기 위해서 .values 를 사용하였다.

06. Ch 23. 진짜 문제를 해결해 보자 (1) - 상점 신용카드 매출 예측 - 05. (5) 모델 적용

// 모델 학습이 다 끝나서 새로 들어온 데이터에서 대해서 예측을 해보는 것이다.

// pipeline 을 이용해서 구축할 수 있다.

07. Ch 24. 진짜 문제를 해결해 보자 (2) - 아파트 실거래가 예측 - 01. (1) 문제 소개

* 아파트 실거래가 예측

// 실제 데이터는 크기가 크기 때문에 샘플 데이터로 주었다.

// 참조 데이터는 대회 문제 해결을 위해, 강사가 직접 수집한 데이터이며, 어떠한 정제도 하지 않았다.

08. Ch 24. 진짜 문제를 해결해 보자 (2) - 아파트 실거래가 예측 - 02. (2) 변수 변환 및 부착

* 변수 부착시에 자주 발생하는 이슈 및 해결 방안

// 데이터 크기가 줄어도는 경우가 존재한다.

// np.isin 함수 사용

// numpy.isin Documentation

numpy.org/doc/stable/reference/generated/numpy.isin.html

numpy.isin(element, test_elements, assume_unique=False, invert=False)

Calculates element in test_elements, broadcasting over element only. Returns a boolean array of the same shape as element that is True where an element of element is in test_elements and False otherwise.

Parameters

elementarray_like

Input array.

test_elementsarray_like

The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.

assume_uniquebool, optional

If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.

invertbool, optional

If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. np.isin(a, b, invert=True) is equivalent to (but faster than) np.invert(np.isin(a, b)).

Returns

isinndarray, bool

Has the same shape as element. The values element[isin] are in test_elements.

* 불필요한 변수 제거

// 사용하지 않는 변수들은 미리 삭제하여 메모리 부담을 줄여준다.

* 범주 변수 구간화 : floor 변수

// floor 변수는 이론적으로는 연속형 변수지만, 범주형 변수로 간주하는 것이 적절하다.

// 너무 정교하게 하면 과최적화가 일어날 수가 있다.

// 박스플롯을 그려서 군집을 세분화한다.

* 시세 변수 추가

// groypby를 이용, 구별 전체 평균 시세 변수 추가

// 구별 작년 평균 시세, 구별 작년 거래량 변수 추가

// 아파트별 평균가격 변수 추가

* 실습

// engine = 'python' 에러를 줄이기 위해

// 일치하지 않는 경우를 확인해야 한다.

// isin 함수를 통해서 확인

// df_loc 가 ref_df_loc에 포함되지 않는 경우가 다수 있음을 확인했다.

// 시도와 법정동이 완전히 똑같은 행이 있어 제거를 하고, dong 에 리가 붙어 있으면 제거 해야 한다.

// apartment_id 를 삭제 할려고 했으나 완전히 유니크하지 않으므로 어느정도 사용이 가능할 것이라 보여서 살펴뒀음.

09. Ch 24. 진짜 문제를 해결해 보자 (2) - 아파트 실거래가 예측 - 03. (3) 외부 데이터 부착

// 공원 데이터를 추가 한다.

// 동별로 유형에 공원수를 계산한 뒤, 데이터를 부착한다.

// 어린이집 데이터를 추가

// 처리에 관한 방법은 비슷하지만 어떤식으로 정리를 해야 할지는 많은 경험치가 필요할 듯 보인다.

10. Ch 24. 진짜 문제를 해결해 보자 (2) - 아파트 실거래가 예측 - 04. (4) 모델 학습

* 데이터 분리

// 라벨 변수 할당

// 불필요한 변수를 제거하여 정의

// 학습 데이터의 크기가(27012, 22) 임을 확인, 특징 22개 데이터의 크기는 27012 이다.

// 더미화

// 샘플 대비 특징이 많지 않고, 범주형 변수의 개수도 많지 않아 더미화를 하더라도 큰 문제가 없다고 판단

* 결측 대체

// 원 데이터에는 결측이 없으나, 과거 거래와 관련돈 변수를 부착하는 과정에서 과거가 없는 데이터에 대한 결측이 생성된다.

* 모델 재학습

// 학습 데이터와 평가 데이터를 부착하고, 재학습을 실시함.

// 시간이 허락되면 해주는 것이 바람직하다.

* 실습

// 샘플 대비 특징이 매우 적다는 것은 더미화 가능하다

// 샘플이 충분히 많이 있으므로 트리 뿐만 아니라 트리 기반의 앙상블도 적절함.

* 더미화시킴

* 변수 부착 과정에서 생성된 결측 대체

// key 가 모델 function

// sklearn.metrics.mean_absolute_error 를 사용함.

// LGB 은 아스키 코드가 포함되면 작동안하는 경우가 있기 때문에 ndarrary 로 작업을 한다.

11. Ch 24. 진짜 문제를 해결해 보자 (2) - 아파트 실거래가 예측 - 05. (5) 모델 적용

* 파이프라인 구축

// 새로들어온 데이터의 아파트 값 예측

// 파이프라인 사용에 필요한 모든 요소를 pickle 을 사용하였음

// pckl 로 저장함.

// pickle.dump

// pickle 의 파일들은 binary 로 수행한다.

// pickle Documentation

docs.python.org/3/library/pickle.html

The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” 1 or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

Comparison with json

There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):

JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
JSON is human-readable, while pickle is not;
JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.

// 솔직히 난 json 타입을 더 선호하기는 하던데..

https://bit.ly/3m7bW22

파이썬을 활용한 데이터 전처리 Level UP 올인원 패키지 Online. | 패스트캠퍼스

데이터 분석에 필요한 기초 전처리부터, 데이터의 품질 및 머신러닝 성능 향상을 위한 고급 스킬까지 완전 정복하는 데이터 전처리 트레이닝 온라인 강의입니다.

www.fastcampus.co.kr

728x90

저작자표시