The 'Apple Car' joint-development talks between Hyundai Motor Group and Apple, reported as blockbuster news at the start of the year, have been suspended. The Apple Car, known to be an autonomous electric vehicle, drew wide attention as the first joint product of the two global companies. Some suggest the two sides could still resume cooperation through behind-the-scenes contact, but with the gap between their positions now officially confirmed, that is hard to guarantee.
On the 8th, Hyundai Motor, Kia and Hyundai Mobis stated in regulatory filings that they "have received requests from multiple companies for cooperation on joint development of autonomous electric vehicles, but these are at an early stage and nothing has been decided," and that they "are not in discussions with Apple about developing an autonomous vehicle."
For now, Hyundai Motor Group is staying silent about why the negotiations stopped. A group official said, "We cannot confirm anything related to Apple beyond what was stated in the disclosure." The prevailing view is that the halt stems from a struggle for the upper hand at the negotiating table. Apple's past entries into overseas smartphone markets show repeated rounds of broken-off and resumed negotiations while selecting local carriers and parts suppliers, and Apple always ended up on the advantageous side.
Besides Hyundai Motor Group, Apple is reportedly holding similar discussions with more than ten automakers, including GM of the US, FCA-PSA, Japan's Toyota, Nissan, Honda, Mazda and Subaru, and China's Foxconn-Geely. Few of them, however, can satisfy all of the conditions Apple wants: mass production, high assembly quality, a dedicated EV platform and the target supply price.
Ko Tae-bong, head of research at Hi Investment & Securities, said, "European makers struggle to meet the low unit cost, Japanese makers lack EV technology, and China faces trade friction, so Korean companies can be seen as more competitive," adding, "With the Apple Car expected to appear not as a simple EV but as a next-generation autonomous 'mobility' product, the fact that Apple treats Hyundai Motor Group as a main negotiating partner can be read as official recognition of its technology."
SK Bioscience was established in 2018 when the vaccine business of SK Chemicals was spun off. It is best known for influenza and shingles vaccines, but has recently drawn attention for COVID-19 vaccines. Last year it signed a contract manufacturing (CMO) deal for AstraZeneca's COVID-19 vaccine and a contract development and manufacturing (CDMO) deal for Novavax's COVID-19 vaccine. Two COVID-19 vaccine candidates it is developing in-house have also recently entered clinical trials.
Last month the company was selected by the Korea Disease Control and Prevention Agency to run the 'COVID-19 vaccine distribution management system.' It will therefore handle distribution and storage of the AstraZeneca, Janssen and Pfizer vaccines as well as doses from the COVAX Facility, the international vaccine-procurement project. And with the government deciding this month to pre-purchase Novavax vaccine for 20 million people, SK Bioscience has secured one more stable source of revenue. Related stock: SK Chemicals.
Short selling is an investment strategy in which, expecting a stock to fall, an investor borrows shares he or she does not own and places a sell order. It is mainly used to chase very short-term trading gains. If the price later falls, the investor buys the shares back cheaply and returns them to the lender (the owner) within the settlement period, pocketing the difference. Short selling supplies liquidity to the stock market, but it can also disturb market order and be abused as a vehicle for unfair trading.
On the 3rd, the financial authorities decided to extend the ban on short selling of all listed stocks once more, until May 2. From May 3, however, short selling will resume for the 350 stocks that make up the KOSPI 200 and KOSDAQ 150 indices. Facing opposition from retail investors and political pressure to extend the ban, the authorities stepped back from their original plan to resume short selling and chose a compromise.
Ministry of Science and ICT, 'Comprehensive Implementation Plan for R&D Programs in Science, Technology and ICT': the basic research budget is 1.8029 trillion won, up 291.7 billion won from the previous year; 72.7 billion won goes to new technologies such as 6G, autonomous driving, PIM semiconductors and blockchain; 159.1 billion won to climate-change responses including carbon neutrality; and 533.6 billion won to bio development.
The Ministry of Science and ICT announced on the 3rd that it had finalized the '2021 Comprehensive Implementation Plan for R&D Programs in Science, Technology and ICT (Information, Communication and Broadcasting),' investing a total of 5.8161 trillion won, and would push the programs ahead in earnest. Out of the ministry's total R&D budget of 8.8682 trillion won, the plan covers 4.6061 trillion won for science and technology and 1.21 trillion won for ICT, excluding items such as the operating costs of the National Research Council of Science & Technology and directly affiliated research institutes, and comprises basic research (1.8029 trillion won), source-technology research (2.8459 trillion won), R&D commercialization (341.5 billion won), human-resources development (250.9 billion won) and R&D infrastructure (574.9 billion won).
Research-center heads at major Korean brokerages expect a short-term correction in the stock market to be unavoidable when short selling resumes on March 16. With the domestic market having rebounded in a V shape after the COVID-19 crash and repeatedly setting record highs, a year's worth of banned short-sale orders could arrive all at once and drag the market down. Biotech stocks in particular are expected to take the biggest hit. Some, however, argue that thanks to the active participation of retail investors, the correction that follows the resumption will not be that deep.
According to Daishin Securities, the KOSPI also went through short-term corrections after earlier short-selling resumptions. After short selling resumed on May 29, 2009, the KOSPI spent June consolidating at around 3% below its high. After the resumption on November 9, 2011, the index swung between 1,770 and 1,920 points (a 7.8% pullback from its high). In both cases the uptrend resumed after the correction.
Just before the short-selling ban, the KOSDAQ stocks with the largest outstanding short positions were mostly pharmaceutical and biotech names such as Sillajen, Kukil Paper, CMG Pharm, HLB and Celltrion Healthcare. These pharma and biotech stocks could therefore be hit particularly hard when short selling resumes.
Return the absolute value of a number. The argument may be an integer, a floating point number, or an object implementing __abs__(). If the argument is a complex number, its magnitude is returned.
object.__abs__(self)
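A minimal sketch of this behavior: abs() on an int, a float, a complex number, and a made-up class (Money is hypothetical) that defines __abs__().
import math  # not strictly needed, shown only for context

print(abs(-7))        # 7
print(abs(-3.5))      # 3.5
print(abs(3 + 4j))    # 5.0, the magnitude of the complex number

class Money:
    def __init__(self, amount):
        self.amount = amount
    def __abs__(self):
        # abs() delegates to this method for custom objects
        return Money(abs(self.amount))

print(abs(Money(-120)).amount)  # 120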
05. Ch 23. Let's solve a real problem (1) - Predicting store credit-card sales - 04. (4) Model training
A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
criterion{“mse”, “mae”}, default=”mse”
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
New in version 0.18:Mean Absolute Error (MAE) criterion.
max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_splitint or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18:Added float values for fractions.
min_samples_leafint or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18:Added float values for fractions.
min_weight_fraction_leaffloat, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features{“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodesint, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decreasefloat, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_splitfloat, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.
bootstrapbool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_scorebool, default=False
whether to use out-of-bag samples to estimate the R^2 on unseen data.
random_stateint or RandomState instance, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
verboseint, default=0
Controls the verbosity when fitting and predicting.
warm_startbool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
New in version 0.22.
max_samplesint or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
New in version 0.22.
Attributes
base_estimator_DecisionTreeRegressor
The child estimator template used to create the collection of fitted sub-estimators.
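A hedged sketch of fitting a RandomForestRegressor with a few of the parameters described above, on synthetic data made up for illustration; note that criterion="mae" matches this 0.2x-era documentation but was later renamed in scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                        # toy feature matrix (made up)
y = X[:, 0] * 3 + rng.normal(0, 0.1, 200)   # toy target

model = RandomForestRegressor(
    n_estimators=100,      # default since 0.22
    criterion="mae",       # mean absolute error split criterion
    max_depth=None,        # grow until leaves are pure or too small to split
    bootstrap=True,
    oob_score=True,        # estimate R^2 on out-of-bag samples
    random_state=0,
)
model.fit(X, y)
print(model.oob_score_)    # out-of-bag R^2 estimate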
y_truearray-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_predarray-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
sample_weightarray-like of shape (n_samples,), optional
Sample weights.
multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ :
Returns a full set of errors in case of multioutput input.
‘uniform_average’ :
Errors of all outputs are averaged with uniform weight.
Returns
lossfloat or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
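A small example of mean_absolute_error with the multioutput options described above; the toy values are made up.
from sklearn.metrics import mean_absolute_error

y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]

print(mean_absolute_error(y_true, y_pred))                            # uniform average over outputs
print(mean_absolute_error(y_true, y_pred, multioutput="raw_values"))  # one MAE per output
print(mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7]))    # weighted average of outputs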
// Since smaller values are better, the initial value is defined as a very large number.
// .values was used to avoid problems with LightGBM not handling DataFrames well.
06. Ch 23. Let's solve a real problem (1) - Predicting store credit-card sales - 05. (5) Applying the model
// Once model training is finished, we make predictions on newly arriving data.
// This can be built using a pipeline.
07. Ch 24. Let's solve a real problem (2) - Predicting apartment sale prices - 01. (1) Problem introduction
* Predicting apartment sale prices
// Because the real data is large, a sample dataset was provided.
// The reference data was collected by the instructor for this competition problem and has not been cleaned in any way.
08. Ch 24. Let's solve a real problem (2) - Predicting apartment sale prices - 02. (2) Variable transformation and attachment
Calculates element in test_elements, broadcasting over element only. Returns a boolean array of the same shape as element that is True where an element of element is in test_elements and False otherwise.
Parameters
elementarray_like
Input array.
test_elementsarray_like
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
assume_uniquebool, optional
If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
invertbool, optional
If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. np.isin(a, b, invert=True) is equivalent to (but faster than) np.invert(np.isin(a, b)).
Returns
isinndarray, bool
Has the same shape as element. The values element[isin] are in test_elements.
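A quick demonstration of numpy.isin and its invert option on made-up arrays.
import numpy as np

element = np.array([[0, 2], [4, 6]])
test_elements = [1, 2, 4, 8]

mask = np.isin(element, test_elements)
print(mask)                  # [[False  True] [ True False]]
print(element[mask])         # values of element that are in test_elements -> [2 4]
print(np.isin(element, test_elements, invert=True))  # complement of the mask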
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
JSON is human-readable, while pickle is not;
JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
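A small sketch contrasting the two formats on a made-up dict: json produces readable text while pickle produces an opaque binary blob (and should never be fed untrusted input).
import json
import pickle

data = {"scores": [1, 2, 3], "name": "example"}

text = json.dumps(data)          # unicode text, readable, limited to basic types
print(text, json.loads(text))

blob = pickle.dumps(data)        # binary protocol, Python-specific
print(pickle.loads(blob))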
sampling_strategyfloat, str, dict or callable, default=’auto’
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as alpha_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning
float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
'minority': resample only the minority class;
'notminority': resample all classes but the minority class;
'notmajority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'notmajority'.
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_stateint, RandomState instance or None, default=None
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
k_neighborsint or object, default=5
If int, number of nearest neighbours to used to construct synthetic samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_jobsint, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as alpha_us = N_m / N_rM, where N_m is the number of samples in the minority class and N_rM is the number of samples in the majority class after resampling.
Warning
float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
'majority': resample only the majority class;
'notminority': resample all classes but the minority class;
'notmajority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'notminority'.
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
versionint, default=1
Version of the NearMiss to use. Possible values are 1, 2 or 3.
n_neighborsint or object, default=3
If int, size of the neighbourhood to consider to compute the average distance to the minority point samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_neighbors_ver3int or object, default=3
If int, the NearMiss-3 algorithm starts with a re-sampling phase. This parameter corresponds to the number of neighbours selected to create the subset in which the selection will be performed. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_jobsint, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
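A hedged sketch of the resampling parameters above using imbalanced-learn: SMOTE for over-sampling with a float sampling_strategy, NearMiss for under-sampling; the dataset is synthetic and only for illustration.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                     # imbalanced classes

# over-sampling: minority / majority ratio of 0.5 after resampling
X_os, y_os = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_os))

# under-sampling with NearMiss version 1
X_us, y_us = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)
print(Counter(y_us))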
[05. Part 5) Ch 20. A biased model is useless - Class imbalance problems - 02-2. Resampling - Oversampling and undersampling (practice)]
// Use kNN to check the degree of class imbalance.
// KNeighborsClassifier
// A recall of 0% shows that the imbalance is severe.
[05. Part 5) Ch 20. A biased model is useless - Class imbalance problems - 03-1. Cost-sensitive models (theory)]
// This is a model whose training procedure is modified, so it is hard to call it preprocessing.
* Definition
// Set the cost higher than the false-positive cost.
* Probabilistic models
* Related syntax: .predict_proba
* Tip. Basic principle for using NumPy and pandas well: operate on whole arrays whenever possible
// Make full use of universal functions, broadcasting and mask operations.
* Non-probabilistic model (1): support vector machine
* Non-probabilistic model (2): decision tree
* Related syntax: class_weight
[05. Part 5) Ch 20. A biased model is useless - Class imbalance problems - 03-2. Cost-sensitive models (practice)]
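A hedged practice-style sketch of the two ideas above: lowering the decision cutoff of a probabilistic model via predict_proba, and weighting the minority class via class_weight in non-probabilistic models; data and the 0.2 cutoff are made up.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# probabilistic model: lower the cutoff so positives are predicted more aggressively
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
y_pred = (proba >= 0.2).astype(int)   # cutoff 0.2 instead of the default 0.5

# non-probabilistic models: weight the minority class more heavily
svm = SVC(class_weight={1: 10}).fit(X, y)
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)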
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.
Cfloat, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
kernel{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
degreeint, default=3
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
gamma{‘scale’, ‘auto’} or float, default=’scale’
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,
if ‘auto’, uses 1 / n_features.
Changed in version 0.22: The default value of gamma changed from ‘auto’ to ‘scale’.
coef0float, default=0.0
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
shrinkingbool, default=True
Whether to use the shrinking heuristic. See the User Guide.
probabilitybool, default=False
Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.
tolfloat, default=1e-3
Tolerance for stopping criterion.
cache_sizefloat, default=200
Specify the size of the kernel cache (in MB).
class_weightdict or ‘balanced’, default=None
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
verbosebool, default=False
Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
max_iterint, default=-1
Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape{‘ovo’, ‘ovr’}, default=’ovr’
Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one (‘ovo’) is always used as multi-class strategy. The parameter is ignored for binary classification.
Changed in version 0.19: decision_function_shape is ‘ovr’ by default.
New in version 0.17: decision_function_shape=’ovr’ is recommended.
Changed in version 0.17: Deprecated decision_function_shape=’ovo’ and None.
break_tiesbool, default=False
If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.
New in version 0.22.
random_stateint or RandomState instance, default=None
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
Attributes
support_ndarray of shape (n_SV,)
Indices of support vectors.
support_vectors_ndarray of shape (n_SV, n_features)
Support vectors.
n_support_ndarray of shape (n_class,), dtype=int32
Number of support vectors for each class.
dual_coef_ndarray of shape (n_class-1, n_SV)
Dual coefficients of the support vector in the decision function (see Mathematical formulation), multiplied by their targets. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the multi-class section of the User Guide for details.
coef_ndarray of shape (n_class * (n_class-1) / 2, n_features)
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ndarray of shape (n_class * (n_class-1) / 2,)
Constants in decision function.
fit_status_int
0 if correctly fitted, 1 otherwise (will raise warning)
classes_ndarray of shape (n_classes,)
The classes labels.
probA_ndarray of shape (n_class * (n_class-1) / 2)
probB_ndarray of shape (n_class * (n_class-1) / 2)
If probability=True, it corresponds to the parameters learned in Platt scaling to produce probability estimates from decision values. If probability=False, it’s an empty array. Platt scaling uses the logistic function 1 / (1 + exp(decision_value * probA_ + probB_)) where probA_ and probB_ are learned from the dataset [2]. For more information on the multiclass case and training procedure see section 8 of [1].
class_weight_ndarray of shape (n_class,)
Multipliers of parameter C for each class. Computed based on the class_weight parameter.
shape_fit_tuple of int of shape (n_dimensions_of_X,)
Array dimensions of training vector X.
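A hedged sketch of fitting an SVC with several of the parameters documented above and reading a few attributes; the iris dataset is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(
    C=1.0,
    kernel="rbf",
    gamma="scale",            # default since 0.22
    class_weight="balanced",
    probability=True,         # enables predict_proba via internal 5-fold CV
    decision_function_shape="ovr",
    random_state=0,
)
clf.fit(X, y)

print(clf.n_support_)             # number of support vectors per class
print(clf.classes_)
print(clf.predict_proba(X[:3]))   # Platt-scaled probability estimates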
One hot encoding consists of replacing a categorical variable with a combination of binary variables that take the value 0 or 1, to indicate whether a certain category is present in an observation.
Each of the binary variables is also known as a dummy variable. For example, from the categorical variable “Gender” with categories ‘female’ and ‘male’, we can generate the boolean variable “female”, which takes 1 if the person is female or 0 otherwise. We can also generate the variable “male”, which takes 1 if the person is male and 0 otherwise.
The encoder has the option to generate one dummy variable per category, or to create dummy variables only for the top n most popular categories, that is, the categories shown by the majority of the observations.
If dummy variables are created for all the categories of a variable, you have the option to drop one category to avoid information redundancy. That is, encoding into k-1 variables, where k is the number of unique categories.
The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode categorical variables (object type).
The encoder first finds the categories to be encoded for each variable (fit).
The encoder then creates one dummy variable per category for each variable (transform).
Note: new categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no binary variable will be created for them).
Parameters
top_categories(int,default=None) – If None, a dummy variable will be created for each category of the variable. Alternatively, top_categories indicates the number of most frequent categories to encode. Dummy variables will be created only for those popular categories and the rest will be ignored. Note that this is equivalent to grouping all the remaining categories in one group.
variables(list) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.
drop_last(boolean,default=False) – Only used if top_categories = None. It indicates whether to create dummy variables for all the categories (k dummies), or if set to True, it will ignore the last variable of the list (k-1 dummies).
Learns the unique categories per variable. If top_categories is indicated, it will learn the most popular categories. Alternatively, it learns all unique categories per variable.
Parameters
X(pandas dataframe of shape =[n_samples,n_features]) – The training input samples. Can be the entire dataframe, not just the selected variables.
y(pandas series,default=None) – Target. It is not needed in this encoder. You can pass y or None.
encoder_dict_
The dictionary containing the categories for which dummy variables will be created.
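The encoder above is from the feature_engine package; as a hedged illustration of the same k versus k-1 dummy idea, here is a sketch with pandas.get_dummies on made-up columns rather than the feature_engine API itself.
import pandas as pd

df = pd.DataFrame({"Gender": ["female", "male", "female", "male"],
                   "City": ["Seoul", "Busan", "Seoul", "Seoul"]})

print(pd.get_dummies(df))                    # k dummies per variable
print(pd.get_dummies(df, drop_first=True))   # k-1 dummies, dropping one category per variable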
numpy.quantile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)
Compute the q-th quantile of the data along the specified axis.
New in version 1.15.0.
Parameters
aarray_like
Input array or object that can be converted to an array.
qarray_like of float
Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
axis{int, tuple of int, None}, optional
Axis or axes along which the quantiles are computed. The default is to compute the quantile(s) along a flattened version of the array.
outndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output, but the type (of the output) will be cast if necessary.
overwrite_inputbool, optional
If True, then allow the input array a to be modified by intermediate calculations, to save memory. In this case, the contents of the input a after this function completes is undefined.
interpolation{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}, optional
This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i < j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.
keepdimsbool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array a.
Returns
quantilescalar or ndarray
If q is a single quantile and axis=None, then the result is a scalar. If multiple quantiles are given, first axis of the result corresponds to the quantiles. The other axes are the axes that remain after the reduction of a. If the input contains integers or floats smaller than float64, the output data-type is float64. Otherwise, the output data-type is the same as that of the input. If out is specified, that array is returned instead.
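A short numpy.quantile example over a flattened array, along an axis, and with several quantiles at once; the array values are made up.
import numpy as np

a = np.array([[10, 7, 4], [3, 2, 1]])

print(np.quantile(a, 0.5))             # median of the flattened array -> 3.5
print(np.quantile(a, 0.5, axis=0))     # per-column medians
print(np.quantile(a, [0.25, 0.75]))    # multiple quantiles; first axis indexes q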
[05. Part 5) Ch 19. Can't we make an ideal distribution? Variable distribution problems - 03. Outlier removal (2) Using density-based clustering]
* Outlier detection method 2: density-based clustering
// Points with enough neighbors within a given radius become core points; points that fall within a core point's radius but are not core themselves are border points; points that belong to neither are called outliers.
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
Perform DBSCAN clustering from vector array or distance matrix.
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samplesint, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
metricstring, or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
New in version 0.17: metric “precomputed” to accept precomputed sparse matrix.
metric_paramsdict, default=None
Additional keyword arguments for the metric function.
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
leaf_sizeint, default=30
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
pfloat, default=None
The power of the Minkowski metric to be used to calculate distance between points.
n_jobsint, default=None
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Attributes
core_sample_indices_ndarray of shape (n_core_samples,)
Indices of core samples.
components_ndarray of shape (n_core_samples, n_features)
Copy of each core sample found by training.
labels_ndarray of shape (n_samples)
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
* Practice
// cdist is a library function brought in only as a reference when examining DBSCAN.
// Check the results while adjusting the parameters.
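A hedged practice-style sketch of the outlier-removal idea above: fit DBSCAN and treat the points labeled -1 as outliers; the point cloud and eps/min_samples values are made up.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),              # one dense cluster
               np.array([[5.0, 5.0], [-4.0, 6.0]])])      # two isolated points

model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

print(len(model.core_sample_indices_), "core samples")
print(X[labels == -1])   # noisy samples (label -1) treated as outliers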
[05. Part 5) Ch 19. Can't we make an ideal distribution? Variable distribution problems - 04. Removing correlations between features]
class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
Principal component analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.
It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.
Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.
Number of components to keep. if n_components is not set all components are kept:
n_components==min(n_samples,n_features)
If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components==min(n_samples,n_features)-1
copybool, default=True
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whitenbool, optional (default False)
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
svd_solverstr {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
If auto :
The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
If full :
run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
If arpack :
run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
If randomized :
run randomized SVD by the method of Halko et al.
New in version 0.18.0.
tolfloat >= 0, optional (default .0)
Tolerance for singular values computed by svd_solver == ‘arpack’.
New in version 0.18.0.
iterated_powerint >= 0, or ‘auto’, (default ‘auto’)
Number of iterations for the power method computed by svd_solver == ‘randomized’.
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
singular_values_array, shape (n_components,)
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
New in version 0.19.
mean_array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
n_components_int
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
n_features_int
Number of features in the training data.
n_samples_int
Number of samples in the training data.
noise_variance_float
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
* Practice
// The correlation between features is very high.
// Compute the VIF: fit a LinearRegression and use its R squared.
// Use PCA.
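A hedged sketch following the notes above: compute each feature's VIF from the R squared of a LinearRegression on the remaining features, then use PCA to obtain mutually uncorrelated components; the data is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
X = np.column_stack([X, X[:, 0] * 0.9 + rng.normal(0, 0.05, 200)])  # add a highly correlated column

# VIF_i = 1 / (1 - R^2_i), regressing feature i on the other features
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    print(f"feature {i}: VIF = {1 / (1 - r2):.2f}")

Z = PCA(n_components=3).fit_transform(X)
print(np.round(np.corrcoef(Z, rowvar=False), 3))  # components are uncorrelated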
[05. Part 5) Ch 19. Can't we make an ideal distribution? Variable distribution problems - 05. Removing variable skewness]
For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution. The function skewtest can be used to determine if the skewness value is close enough to zero, statistically speaking.
Parameters
andarray
Input array.
axisint or None, optional
Axis along which skewness is calculated. Default is 0. If None, compute over the whole array a.
biasbool, optional
If False, then the calculations are corrected for statistical bias.
Compute the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher’s definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution.
If bias is False then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators
Usekurtosistestto see if result is close enough to normal.
Parameters
aarray
Data for which the kurtosis is calculated.
axisint or None, optional
Axis along which the kurtosis is calculated. Default is 0. If None, compute over the whole array a.
fisherbool, optional
If True, Fisher’s definition is used (normal ==> 0.0). If False, Pearson’s definition is used (normal ==> 3.0).
biasbool, optional
If False, then the calculations are corrected for statistical bias.
nan_policy{‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default is ‘propagate’.
Returns
kurtosisarray
The kurtosis of values along an axis. If all values are equal, return -3 for Fisher’s definition and 0 for Pearson’s definition.
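A quick check of skewness and kurtosis with scipy.stats on synthetic samples; the normal and exponential draws are only for illustration.
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.RandomState(0)
normal = rng.normal(size=10_000)
skewed = rng.exponential(size=10_000)

print(skew(normal), kurtosis(normal))                # both near 0 (Fisher definition)
print(skew(skewed), kurtosis(skewed, fisher=False))  # right-skewed; Pearson kurtosis reported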
[05. Part 5) Ch 19. Can't we make an ideal distribution? Variable distribution problems - 06. Scaling]
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger that others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_meanboolean, True by default
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_stdboolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
Attributes
scale_ndarray or None, shape (n_features,)
Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.
New in version 0.17: scale_
mean_ndarray or None, shape (n_features,)
The mean value for each feature in the training set. Equal to None when with_mean=False.
var_ndarray or None, shape (n_features,)
The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
n_samples_seen_int or array, shape (n_features,)
The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.
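A minimal StandardScaler usage sketch: fit on training data, then apply the stored statistics to later data; the arrays are made up.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 25.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_, scaler.scale_)   # per-feature mean and std learned from the training set
print(scaler.transform(X_train))     # z = (x - u) / s
print(scaler.transform(X_new))       # the same statistics are applied to later data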
[Data Preprocessing with Python, Level UP - Comment] - How to handle categorical (string) variables
New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.
Parameters
missing_valuesnumber, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
strategystring, default=’mean’
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
New in version 0.20: strategy=”constant” for fixed value imputation.
fill_valuestring or numerical value, default=None
When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
verboseinteger, default=0
Controls the verbosity of the imputer.
copyboolean, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.
add_indicatorboolean, default=False
If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
Attributes
statistics_array of shape (n_features,)
The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform, features corresponding to np.nan statistics will be discarded.
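A hedged sketch of SimpleImputer with two of the strategies described above on a made-up array.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

mean_imp = SimpleImputer(strategy="mean")
print(mean_imp.fit_transform(X))
print(mean_imp.statistics_)          # the per-column fill values

const_imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
print(const_imp.fit_transform(X))    # extra indicator columns flag where values were missing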
valuescalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis{0 or ‘index’, 1 or ‘columns’}
Axis along which to fill missing values.
inplacebool, default False
If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limitint, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcastdict, default is None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns
DataFrame or None
Object with missing values filled or None if inplace=True.
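A small DataFrame.fillna sketch covering a scalar fill, a per-column dict, and forward fill; the frame is made up and the method keyword follows this documentation's pandas version.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})

print(df.fillna(0))                        # fill every hole with 0
print(df.fillna({"A": df["A"].mean()}))    # per-column fill values via a dict
print(df.fillna(method="ffill", limit=1))  # propagate the last valid observation forward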
[04. Part 4) Ch 17. Why is there no value here? Missing value problems - 03. Solutions (4) Using a missing-value prediction model]
* Definition of a missing-value prediction model
- Train a model that predicts missing values from the columns that have no missing values, and use it to fill them in.
- (Example) Estimate the missing values contained in column V2.
* Using a missing-value prediction model
- A missing-value prediction model can be used in almost any situation, but you must know its conditions of use and drawbacks.
- Conditions of use and drawbacks
. Condition 1. Missing values must not be concentrated in a small number of columns.
. Condition 2. Relationships must exist between the features.
. Drawback: it takes longer than other missing-value handling methods.
* Related syntax: sklearn.impute.KNNImputer
- A missing-value prediction model that finds neighbors using only the non-missing values, then replaces each missing value with a representative value of the neighbors' values.
- Main inputs
. n_neighbors: number of neighbors (note: if this is too small, imputation may not work properly, so around 5 is appropriate)
* Practice
// n_neighbors = 5: instantiate with a moderate value rather than a large one.
class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)
Imputation for completing missing values using k-Nearest Neighbors.
Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
missing_valuesnumber, string, np.nan or None, default=np.nan
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
n_neighborsint, default=5
Number of neighboring samples to use for imputation.
weights{‘uniform’, ‘distance’} or callable, default=’uniform’
Weight function used in prediction. Possible values:
‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
metric{‘nan_euclidean’} or callable, default=’nan_euclidean’
Distance metric for searching neighbors. Possible values:
‘nan_euclidean’
callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.
copybool, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
add_indicatorbool, default=False
If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
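A hedged sketch of sklearn.impute.KNNImputer matching the notes above (n_neighbors kept small because the toy array is tiny).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))   # each NaN replaced by the mean of its 2 nearest neighbors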
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters
bymapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Split along rows (0) or columns (1).
levelint, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sortbool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
group_keysbool, default True
When calling apply, add group keys to index to identify pieces.
squeezebool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
observedbool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
dropnabool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups
New in version 1.1.0.
Returns
DataFrameGroupBy
Returns a groupby object that contains information about the groups.
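A minimal groupby sketch: split by a key column, aggregate, and compare the as_index settings; the frame and column names are made up.
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B", "B"],
                   "sales": [100, 150, 80, 120]})

print(df.groupby("store")["sales"].mean())                  # group labels become the index
print(df.groupby("store", as_index=False)["sales"].sum())   # "SQL-style" flat output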
[04. Part 4) Ch 17. Why is there no value here? Missing value problems - 01. Problem definition]
* Problem definition
- Missing values in the data prevent the model from being trained at all.
- Missing values are broadly divided into NaN and None.
. NaN: a value that should exist but is missing; handled by replacement, estimation or prediction.
. None: the absence of a value is itself the value (e.g., occupation - unemployed); handled by defining a new value for it.
- The handling methods themselves are very simple, but choosing the right method for the situation is very important.
* Terminology
- Missing record: a record that contains missing values
- Missing rate: number of missing records / total number of records
[04. Part 4) Ch 17. Why is there no value here? Missing value problems - 02. Solutions (1) Deletion]
* Row-wise deletion of missing values
- Row-wise deletion simply removes the records that contain missing values, but it should only be used when two conditions are met.
* Column-wise deletion of missing values
- Column-wise deletion removes the columns that contain missing values; it is equally simple, but can only be used when two conditions are met:
. Missing values are concentrated in a small number of variables.
. Those variables are not particularly important (by domain knowledge).
* Related syntax: Series / DataFrame.isnull
- Returns True if a value is missing and False otherwise (works the opposite way to notnull).
- Usually combined with sum to inspect the distribution of missing values.
* Related syntax: DataFrame.dropna
- Used to remove rows or columns that contain missing values.
- Main inputs
. axis: 1 drops columns that contain missing values, 0 drops rows that contain missing values.
. how: 'any' drops when even one value is missing, 'all' drops only when every value is missing (usually set to 'any').
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns Series
Mask of bool values for each element in Series that indicates whether an element is an NA value.
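A sketch of the inspection-then-deletion workflow from the notes above, using isnull().sum() and dropna on a made-up frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"V1": [1.0, np.nan, 3.0],
                   "V2": [np.nan, np.nan, 6.0],
                   "V3": [7.0, 8.0, 9.0]})

print(df.isnull().sum())              # missing-value count per column
print(df.dropna(axis=0, how="any"))   # drop rows with any missing value
print(df.dropna(axis=1, how="any"))   # drop columns with any missing value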
Convert Series to {label -> value} dict or dict-like object.
Parameters
intoclass, default dict
The collections.abc.Mapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
Values of the Series are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
Parameters
to_replacestr, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
numeric, str or regex:
numeric: numeric values equal to to_replace will be replaced with value
str: string exactly matching to_replace will be replaced with value
regex: regexs matching to_replace will be replaced with value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they must be the same length.
Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.
For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The value parameter should be None to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
valuescalar, dict, list, str, regex, default None
Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
inplacebool, default False
If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
limitint, default None
Maximum size gap to forward or backward fill.
regexbool or same types asto_replace, default False
Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
method{‘pad’, ‘ffill’, ‘bfill’, None}
The method to use for replacement when to_replace is a scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DataFrame.
Returns
Series
Object after replacement.
Raises
AssertionError
If regex is not a bool and to_replace is not None.
TypeError
If to_replace is not a scalar, array-like, dict, or None
If to_replace is a dict and value is not a list, dict, ndarray, or Series
If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
When replacing multiple bool or datetime64 objects and the arguments to to_replace does not match the type of the value being replaced
ValueError
If a list or an ndarray is passed to to_replace and value but they are not the same length.
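A small Series/DataFrame.replace sketch covering the scalar, list, and nested-dict forms described above; the data is made up.
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3])
print(s.replace(0, 5))                 # scalar -> scalar
print(s.replace([1, 2], [10, 20]))     # list -> list (same length)

df = pd.DataFrame({"a": [0, 1, 2], "b": ["x", "y", "z"]})
print(df.replace({"a": {0: np.nan}, "b": {"y": "Y"}}))   # per-column nested dict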
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
Parameters
objsa sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
axis{0/’index’, 1/’columns’}, default 0
The axis to concatenate along.
join{‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).
ignore_indexbool, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
keyssequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
levelslist of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
nameslist, default None
Names for the levels in the resulting hierarchical index.
verify_integritybool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
sortbool, default False
Sort non-concatenation axis if it is not already aligned when join is ‘outer’. This has no effect when join='inner', which already preserves the order of the non-concatenation axis.
New in version 0.23.0.
Changed in version 1.0.0: Changed to not sort by default.
copybool, default True
If False, do not copy data unnecessarily.
Returns
object, type of objs
When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.
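A minimal pandas.concat sketch along rows and columns, with ignore_index and keys; the frames are made up.
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

print(pd.concat([df1, df2], ignore_index=True))         # stack rows, relabel 0..n-1
print(pd.concat([df1, df2], axis=1))                     # side by side along columns
print(pd.concat([df1, df2], keys=["first", "second"]))   # hierarchical index on the concat axis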
[04. Part 4) Ch 16. Scattered data, gather round - Data fragmentation problems - 02. Case (2) When an explicit key variable exists]
* Problem definition and solution
- For efficient database management, data is very often stored split across tables by a key variable, even when it is well cleaned.
- In SQL this is solved with JOIN; in Python it is solved with merge.
- The common case is not hard to handle, but many variations exist, so keep the essentials in mind:
(1) Check which columns can act as key variables, and make the keys consistent across tables.
(2) Be clear about what one record represents.
* Related syntax: pandas.merge
- Joins two data frames on a key variable.
- Main inputs
. left: data frame 1 to merge
. right: data frame 2 to merge
. on: key variable or list of key variables to merge on (if omitted, columns with the same name are used as keys)
. left_on: key variable(s) of data frame 1
. right_on: key variable(s) of data frame 2
. left_index: whether to use the index of data frame 1 as the key
. right_index: whether to use the index of data frame 2 as the key
Merge DataFrame or named Series objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
onlabel or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_onlabel or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_onlabel or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_indexbool, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_indexbool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.
sortbool, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixeslist-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names inleftandrightrespectively. Pass a value ofNoneinstead of a string to indicate that the column name fromleftorrightshould be left as-is, with no suffix. At least one of the values must not be None.
copybool, default True
If False, avoid copy if possible.
indicatorbool or str, default False
If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.
validatestr, optional
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
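A short sketch of indicator and validate on toy frames; pandas raises MergeError when a validate expectation is violated (the data here is made up):

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [10, 20, 30]})

# indicator=True adds a '_merge' column telling where each row came from
out = pd.merge(left, right, on="key", how="outer", indicator=True)
print(out["_merge"].value_counts())

# validate="1:1" fails here because 'key' is duplicated on the right
right_dup = pd.DataFrame({"key": [2, 2, 3], "y": [10, 20, 30]})
try:
    pd.merge(left, right_dup, on="key", validate="1:1")
except pd.errors.MergeError as err:
    print(err)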
Returns
DataFrame
A DataFrame of the two merged objects.
[Data Preprocessing with Python, Level UP - Comment] - The complexity-parameter tuning part covered how the various parameters should be tuned. Simpler is not necessarily worse and more complex is not necessarily better; it comes down to experience with the situation at hand!!^^!!
- The data-merging part was a good chance to revisit the functions covered in Chapter 4.
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
estimator : estimator object
This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
param_grid : dict or list of dictionaries
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
scoring : str, callable, list/tuple or dict, default=None
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Changed in version v0.20: n_jobs default changed from 1 to None
pre_dispatch : int, or str, default=n_jobs
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
An int, giving the exact number of total jobs that are spawned
A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
iid : bool, default=False
If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
Deprecated since version 0.22: Parameter iid is deprecated in 0.22 and will be removed in 0.24
cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
refit : bool, str, or callable, default=True
Refit an estimator using the best found parameters on the whole dataset.
For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer.
See the scoring parameter to know more about multiple metric evaluation.
Changed in version 0.20: Support for callable added.
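As a hedged illustration of multi-metric scoring combined with refit (not taken from the course material; the dataset, grid and scorer names are arbitrary), something like the following works:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Two scorers; refit names the one used to pick best_params_ / best_estimator_
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring={"acc": "accuracy", "auc": "roc_auc"},
    refit="auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                  # chosen with respect to the "auc" scorer
print(search.cv_results_["mean_test_acc"])  # per-candidate scores for each metric
print(search.cv_results_["mean_test_auc"])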
verbose : integer
Controls the verbosity: the higher, the more messages.
error_score : ‘raise’ or numeric, default=np.nan
Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
return_train_score : bool, default=False
If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
New in version 0.19.
Changed in version 0.21: Default value was changed from True to False
Attributes
cv_results_ : dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.
The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.
For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer’s name ('_<scorer_name>') instead of '_score' shown above. (‘split0_test_precision’, ‘mean_train_precision’ etc.)
best_estimator_ : estimator
Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
See the refit parameter for more information on allowed values.
best_score_ : float
Mean cross-validated score of the best_estimator
For multi-metric evaluation, this is present only if refit is specified.
This attribute is not available if refit is a function.
best_params_ : dict
Parameter setting that gave the best results on the hold out data.
For multi-metric evaluation, this is present only if refit is specified.
best_index_ : int
The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.
The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).
For multi-metric evaluation, this is present only if refit is specified.
scorer_ : function or a dict
Scorer function used on the held out data to choose the best parameters for the model.
For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.
n_splits_ : int
The number of cross-validation splits (folds/iterations).
refit_time_ : float
Seconds used for refitting the best model on the whole dataset.
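A minimal end-to-end sketch tying the parameters and attributes above together; the estimator, grid and dataset are arbitrary choices for illustration:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)  # default scoring of the estimator (accuracy)
search.fit(X, y)

print(search.best_params_)                               # best parameter setting
print(search.best_score_)                                # its mean cross-validated score
print(search.cv_results_["params"][search.best_index_])  # same setting via best_index_
print(search.refit_time_)                                # seconds spent refitting

# cv_results_ loads straight into a DataFrame for inspection
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])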
Parameters (sklearn.model_selection.ParameterGrid)
param_grid : dict of str to sequence, or sequence of such
The parameter grid to explore, as a dictionary mapping estimator parameters to sequences of allowed values.
An empty dict signifies default parameters.
A sequence of dicts signifies a sequence of grids to search, and is useful to avoid exploring parameter combinations that make no sense or have no effect. See the examples below.
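A short sketch of ParameterGrid with both a single dict and a list of dicts; the parameter values are arbitrary:

from sklearn.model_selection import ParameterGrid

# A single dict expands to the Cartesian product of all value lists
grid = ParameterGrid({"kernel": ["linear", "rbf"], "C": [1, 10]})
print(list(grid))
# -> [{'C': 1, 'kernel': 'linear'}, {'C': 1, 'kernel': 'rbf'},
#     {'C': 10, 'kernel': 'linear'}, {'C': 10, 'kernel': 'rbf'}]

# A list of dicts avoids combinations that make no sense (e.g. gamma with a linear kernel)
grid2 = ParameterGrid([
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 1.0]},
])
print(len(list(grid2)))  # 2 + 4 = 6 candidate settings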
[03. Part 3) Ch 15. Use This Here, Use That There - Supervised Learning Model & Parameter Selection - 02. Criterion (1) Variable Type]
* How to check variable types (a short check is sketched after this list)
- DataFrame.dtypes
. Returns the data types (object, int64, float64, bool, etc.) of the columns in the DataFrame.
- DataFrame.infer_objects().dtypes
. Returns the inferred data types of the columns in the DataFrame.
. (Example) A column holding the values ['1', '2'] is of object type, but it can be judged to be an int-type variable. (Note that infer_objects() itself does not convert string digits; pd.to_numeric is needed for that.)
- Caution: a string-typed variable is not necessarily categorical, and an int- or float-typed variable is not necessarily continuous. Always consider the size of the state space and domain knowledge.
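A quick sketch of the two checks above with a made-up frame; infer_objects() only re-types object columns whose values are already of a single non-string type:

import pandas as pd

# 'b' holds real integers but is stored as object (as often happens after a messy load)
df = pd.DataFrame({"a": [1, 2], "b": pd.Series([1, 2], dtype=object), "c": [0.5, 1.5]})

print(df.dtypes)                  # b -> object
print(df.infer_objects().dtypes)  # b -> int64, the inferred underlying type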
* Appropriate models by variable type
- Caution: variable type is not the only factor that affects model performance, so other factors must also be considered.
* Models ill-suited to mixed-type variables (1): regression models
- With mixed-type variables, differences in variable scale are naturally common.
- Because coefficient values vary greatly with variable scale, prediction stability drops sharply.
. If every feature affects the label independently, the absolute coefficient of a binary feature ends up larger than that of a large-scale continuous feature.
. The prediction then fluctuates heavily with the value of the binary feature.
- Even with scaling, the distribution of a binary feature does not change, so its influence is not greatly reduced.
* Models ill-suited to mixed-type variables (2): naive Bayes
- Naive Bayes assumes a single probability distribution, so it is unsuitable for data with mixed-type variables.
. (Example) A Bernoulli distribution is very poorly suited to modelling continuous values.
- Therefore, with mixed-type variables, naive Bayes should not be considered at all.
* Models ill-suited to mixed-type variables (3): k-nearest neighbors
- Because distance is effectively determined by the large-scale variables, k-NN is not well suited to mixed-type variables.
- However, it can be used without much trouble when cosine similarity is used or when scaling is applied (see the sketch below).
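A hedged sketch of that workaround: putting a scaler in front of k-NN so the binary feature is not drowned out by the large-scale continuous one (toy data, arbitrary values):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One binary feature and one large-scale continuous feature
X = np.array([[0, 1000.0], [1, 1020.0], [0, 5000.0], [1, 5050.0]])
y = np.array([1.0, 2.0, 10.0, 11.0])

# Without scaling, distance is dominated by the second column;
# scaling puts both features on a comparable footing.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
model.fit(X, y)
print(model.predict([[1, 1010.0]]))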
[03. Part 3) Ch 15. Use This Here, Use That There - Supervised Learning Model & Parameter Selection - 03. Criterion (2) Data Size]
* Overfitting as a function of the number of samples and features (reminder)
* Appropriate models by number of samples and features
* Practice (a sketch follows the notes below)
// Set the same value for random_state in every model that has one.
// Compute the MAE for each model using k-fold cross validation (k=5).
// cv is the number of folds, i.e. the value of k.
// With few features, a complex model is unlikely to emerge.
// The case where there are very few samples and relatively many features.
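The practice notes above boil down to something like the following sketch; the dataset and the particular models are stand-ins, not the ones used in the course:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)

models = {
    "ridge": Ridge(),
    "knn": KNeighborsRegressor(),
    "rf": RandomForestRegressor(random_state=0),  # fix random_state for comparability
}

for name, model in models.items():
    # cv=5 -> 5-fold cross validation; scores come back as negated MAE, so flip the sign
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(name, round(-scores.mean(), 2))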
[Data Preprocessing with Python, Level UP - Comment] - I learned about grid search and ParameterGrid for supervised learning. The video order was a bit off, so it felt slightly odd at first~^^;;