Return the absolute value of a number. The argument may be an integer, a floating point number, or an object implementing __abs__(). If the argument is a complex number, its magnitude is returned.
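A quick illustration of each case:

print(abs(-7))       # 7   (integer)
print(abs(-3.5))     # 3.5 (float)
print(abs(3 + 4j))   # 5.0 (magnitude of a complex number: sqrt(3**2 + 4**2))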
A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
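A minimal usage sketch of the estimator described above (the max_samples value is an illustrative choice, not a default):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# 100 trees (the 0.22+ default), each grown on a bootstrap sample
# of half the rows
model = RandomForestRegressor(n_estimators=100, bootstrap=True,
                              max_samples=0.5, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))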
criterion {“mse”, “mae”}, default=”mse”
The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
New in version 0.18: Mean Absolute Error (MAE) criterion.
max_depth int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features {“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
max_leaf_nodes int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then an unlimited number of leaf nodes.
min_impurity_decrease float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
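As a sanity check on the formula above, a small hand computation with made-up node counts and impurities (illustrative numbers only):

# Illustrative numbers, not from any real dataset
N, N_t, N_t_L, N_t_R = 100, 40, 25, 15
impurity, left_impurity, right_impurity = 0.50, 0.30, 0.20

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)  # 0.095 -> the split is made if min_impurity_decrease <= 0.095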
min_impurity_split float, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.
bootstrap bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_score bool, default=False
Whether to use out-of-bag samples to estimate the R^2 on unseen data.
random_state int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
verbose int, default=0
Controls the verbosity when fitting and predicting.
warm_start bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest. See the Glossary. A short warm_start sketch follows this parameter list.
ccp_alpha non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
New in version 0.22.
max_samples int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
New in version 0.22.
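A minimal warm_start sketch, as referenced above: the second fit call reuses the 50 already-fitted trees and only adds 50 more (sizes are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)

model = RandomForestRegressor(n_estimators=50, warm_start=True,
                              random_state=0)
model.fit(X, y)

model.n_estimators = 100   # grow the same forest by 50 more trees
model.fit(X, y)
print(len(model.estimators_))  # 100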
Attributes
base_estimator_ DecisionTreeRegressor
The child estimator template used to create the collection of fitted sub-estimators.
Parameters
y_true array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_pred array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
sample_weight array-like of shape (n_samples,), optional
Sample weights.
multioutput string in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ :
Returns a full set of errors in case of multioutput input.
‘uniform_average’ :
Errors of all outputs are averaged with uniform weight.
Returns
loss float or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
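A short usage example covering both multioutput modes on a two-output target:

from sklearn.metrics import mean_absolute_error

y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]

print(mean_absolute_error(y_true, y_pred))                            # 0.75
print(mean_absolute_error(y_true, y_pred, multioutput='raw_values'))  # [0.5 1. ]
print(mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7]))    # 0.85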
// Since a smaller value is better for this metric, the initial best value is defined as a very large number.
// .values was used to prevent LightGBM from mishandling a DataFrame.
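A minimal sketch of what these two notes describe, under assumptions: the parameter grid, the make_regression stand-in data, and the variable names are illustrative, not the course's actual code:

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data; the course uses the store sales dataset instead
X, y = make_regression(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_score = np.inf   # smaller MAE is better, so start from a huge value
best_model = None
for n in [50, 100, 200]:  # illustrative grid
    model = LGBMRegressor(n_estimators=n, random_state=0)
    # If X_train were a DataFrame, X_train.values would be passed here
    # to sidestep the LightGBM/DataFrame handling issue noted above
    model.fit(X_train, y_train)
    score = mean_absolute_error(y_val, model.predict(X_val))
    if score < best_score:
        best_score, best_model = score, model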
06. Ch 23. Let's Solve a Real Problem (1) - Store Credit Card Sales Prediction - 05. (5) Applying the Model
// Now that model training is complete, we make predictions on newly arriving data.
// This can be built using a pipeline.
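A minimal sketch of the pipeline idea, assuming a hypothetical StandardScaler preprocessing step (and continuing the variable names from the sketch above):

from lightgbm import LGBMRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and model bundled together: predict() on new data
# re-applies the exact transformations learned during training
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LGBMRegressor(random_state=0))])
pipe.fit(X_train, y_train)
y_new_pred = pipe.predict(X_val)  # X_val stands in for newly arriving data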
07. Ch 24. Let's Solve a Real Problem (2) - Apartment Transaction Price Prediction - 01. (1) Problem Introduction
* Apartment transaction price prediction
// Because the actual data is large, sample data was provided instead.
// The reference data was collected by the instructor specifically for this competition problem and has not been cleaned in any way.
08. Ch 24. Let's Solve a Real Problem (2) - Apartment Transaction Price Prediction - 02. (2) Variable Transformation and Merging
Calculates element in test_elements, broadcasting over element only. Returns a boolean array of the same shape as element that is True where an element of element is in test_elements and False otherwise.
Parameters
element array_like
Input array.
test_elements array_like
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
assume_unique bool, optional
If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
invert bool, optional
If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. np.isin(a, b, invert=True) is equivalent to (but faster than) np.invert(np.isin(a, b)).
Returns
isin ndarray, bool
Has the same shape as element. The values element[isin] are in test_elements.
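For example:

import numpy as np

element = np.array([[0, 2], [4, 6]])
test_elements = [1, 2, 4, 8]

mask = np.isin(element, test_elements)
print(mask)           # [[False  True]
                      #  [ True False]]
print(element[mask])  # [2 4]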
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
JSON is human-readable, while pickle is not;
JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
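A pickle round trip illustrating the byte-stream conversion described above:

import pickle

data = {'trees': 100, 'criterion': 'mae', 'weights': [0.3, 0.7]}

blob = pickle.dumps(data)       # object hierarchy -> byte stream
restored = pickle.loads(blob)   # byte stream -> object hierarchy
assert restored == data

# Caveat from the comparison above: never unpickle untrusted input,
# since unpickling can execute arbitrary code; JSON cannot.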