Passing the user-agent in the headers information along with the requests call solves this.
Below we look at the popular chart on the Google Play movies site.
You can see that the information returned differs depending on whether the header information is included.
[With header information]
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/movies/top"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
movie_list = soup.select('#fcxH9b > div.WpDbMd > c-wiz > div > c-wiz > div > div > c-wiz > c-wiz > c-wiz > div > div.ZmHEEd > div:nth-child(1) > c-wiz > div > div > div.RZEgze > div > div > div.bQVA0c > div > div > div.b8cIId.ReQCgd.Q9MA7b > a > div')
movie_list
[<div class="WsMG1c nnK0zc" title="A Quiet Place Part II">A Quiet Place Part II</div>]
[Called without header information]
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/movies/top"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
movie_list = soup.select('#fcxH9b > div.WpDbMd > c-wiz > div > c-wiz > div > div > c-wiz > c-wiz > c-wiz > div > div.ZmHEEd > div:nth-child(1) > c-wiz > div > div > div.RZEgze > div > div > div.bQVA0c > div > div > div.b8cIId.ReQCgd.Q9MA7b > a > div')
movie_list
[<div class="WsMG1c nnK0zc" title="F9: The Fast Saga">F9: The Fast Saga</div>,
<div class="WsMG1c nnK0zc" title="Rick and Morty">Rick and Morty</div>,
<div class="WsMG1c nnK0zc" title="Gotron Jerrysis Rickvangelion">Gotron Jerrysis Rickvangelion</div>]
import schedule
import time
def job():
    print("I'm working...")
# Run job every 3 second/minute/hour/day/week,
# Starting 3 second/minute/hour/day/week from now
schedule.every(3).seconds.do(job)
schedule.every(3).minutes.do(job)
schedule.every(3).hours.do(job)
schedule.every(3).days.do(job)
schedule.every(3).weeks.do(job)
# Run job every minute at the 23rd second
schedule.every().minute.at(":23").do(job)
# Run job every hour at the 42nd minute
schedule.every().hour.at(":42").do(job)
# Run jobs every 5th hour, 20 minutes and 30 seconds in.
# If current time is 02:00, first execution is at 06:20:30
schedule.every(5).hours.at("20:30").do(job)
# Run job every day at specific HH:MM and next HH:MM:SS
schedule.every().day.at("10:30").do(job)
schedule.every().day.at("10:30:42").do(job)
# Run job on a specific day of the week
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)
schedule.every().minute.at(":17").do(job)
while True:
    schedule.run_pending()
    time.sleep(1)
Cancel a job
To remove a job from the scheduler, use the schedule.cancel_job(job) method.
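For example, a minimal sketch (the job reference name is my own): cancelling works by keeping the job object that do() returns.
import schedule

def job():
    print("I'm working...")

j = schedule.every(3).seconds.do(job)  # do() returns the job object
schedule.cancel_job(j)                 # the job will no longer be run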
# Removing special characters - 1
characters = "/'!?|*<>:\\"
str_title = "안녕하세요_!@##$%%^^|/'!?|*<>:\\: Goodmorning"
new_title = ''.join(x for x in str_title if x not in characters)
print(new_title)
# Output
안녕하세요_@##$%%^^ Goodmorning
The range type represents an immutable sequence of numbers and is commonly used for looping a specific number of times in for loops.
class range(stop)
class range(start, stop[, step])
The arguments to the range constructor must be integers (either built-in int or any object that implements the __index__ special method). If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0. If step is zero, ValueError is raised.
For a positive step, the contents of a range r are determined by the formula r[i] = start + step*i where i >= 0 and r[i] < stop.
For a negative step, the contents of the range are still determined by the formula r[i] = start + step*i, but the constraints are i >= 0 and r[i] > stop.
A range object will be empty if r[0] does not meet the value constraint. Ranges do support negative indices, but these are interpreted as indexing from the end of the sequence determined by the positive indices.
Ranges containing absolute values larger than sys.maxsize are permitted but some features (such as len()) may raise OverflowError.
Ranges implement all of the common sequence operations except concatenation and repetition (due to the fact that range objects can only represent sequences that follow a strict pattern and repetition and concatenation will usually violate that pattern).
start
The value of the start parameter (or 0 if the parameter was not supplied)
stop
The value of the stop parameter
step
The value of the step parameter (or 1 if the parameter was not supplied)
The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).
Testing range objects for equality with == and != compares them as sequences. That is, two range objects are considered equal if they represent the same sequence of values. (Note that two range objects that compare equal might have different start, stop and step attributes, for example range(0) == range(2, 1, 3) or range(0, 3, 2) == range(0, 4, 2).)
Changed in version 3.2: Implement the Sequence ABC. Support slicing and negative indices. Test int objects for membership in constant time instead of iterating through all items.
Changed in version 3.3: Define '==' and '!=' to compare range objects based on the sequence of values they define (instead of comparing based on object identity).
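A short sketch of these properties (constant-time membership, negative indexing, slicing, and sequence equality):
r = range(0, 20, 2)
print(11 in r)         # False - membership for ints is tested in constant time
print(r.index(10))     # 5
print(r[-1])           # 18 - negative indices count from the end
print(r[2:5])          # range(4, 10, 2) - slicing returns another range
print(range(0) == range(2, 1, 3))  # True - equal because both are empty sequences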
# If you go to the address above and look at the parameters of the Candlestick class, there are many kinds.
# In the previous walkthrough we saw briefly how to draw one, using the x, close, high, open and low values.
# There are various other parameters too, and I plan to look at them one by one.
* Anything I still can't figure out after searching, I'll boldly~~^^ pass~~^^;;; someone please teach me~^^
* close – Sets the close values.
# Closing price
* closesrc – Sets the source reference on Chart Studio Cloud for close.
# This seems to be a setting used by Chart Studio Cloud; in Jupyter it does nothing noticeable.
* customdata – Assigns extra data each datum. This may be useful when listening to hover, click and selection events. Note that, "scatter" traces also appends customdata items in the markers DOM elements
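For reference, a minimal sketch of the basic usage described above (the dates and prices are made-up sample data):
import plotly.graph_objects as go

fig = go.Figure(data=[go.Candlestick(
    x=['2021-08-02', '2021-08-03', '2021-08-04'],  # hypothetical dates
    open=[100, 103, 101],
    high=[105, 106, 104],
    low=[99, 101, 100],
    close=[103, 101, 103],
)])
fig.show()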
Return the absolute value of a number. The argument may be an integer, a floating point number, or an object implementing __abs__(). If the argument is a complex number, its magnitude is returned.
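For instance:
print(abs(-7))      # 7
print(abs(-3.5))    # 3.5
print(abs(3 + 4j))  # 5.0 - magnitude of a complex number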
A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
criterion : {"mse", "mae"}, default="mse"
The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.
New in version 0.18: Mean Absolute Error (MAE) criterion.
max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If "auto", then max_features = n_features.
If "sqrt", then max_features = sqrt(n_features).
If "log2", then max_features = log2(n_features).
If None, then max_features = n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split : float, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.
bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the R^2 on unseen data.
random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
verbose : int, default=0
Controls the verbosity when fitting and predicting.
warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
New in version 0.22.
max_samples : int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
New in version 0.22.
Attributes
base_estimator_ : DecisionTreeRegressor
The child estimator template used to create the collection of fitted sub-estimators.
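Putting the parameters above together, a minimal sketch on synthetic data (the parameter values are chosen only for illustration):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, max_depth=5,
                              max_features='sqrt', random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))  # predictions averaged over the 100 trees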
y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.
sample_weight : array-like of shape (n_samples,), optional
Sample weights.
multioutput : string in ['raw_values', 'uniform_average'] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ :
Returns a full set of errors in case of multioutput input.
‘uniform_average’ :
Errors of all outputs are averaged with uniform weight.
Returns
loss : float or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
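A quick sketch of mean_absolute_error with the parameters described above:
from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_absolute_error(y_true, y_pred))  # 0.5
# weighting the last observation more heavily changes the average
print(mean_absolute_error(y_true, y_pred, sample_weight=[1, 1, 1, 2]))  # 0.6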
// Since smaller values are better for this metric, the initial best value is defined as a very large number.
// .values was used to avoid issues with LightGBM not handling a DataFrame well.
06. Ch 23. Let's solve a real problem (1) - store credit card sales prediction - 05. (5) Applying the model
// Once model training is finished, we make predictions on newly arriving data.
// This can be built using a pipeline.
07. Ch 24. Let's solve a real problem (2) - apartment sale price prediction - 01. (1) Problem introduction
* Apartment sale price prediction
// Since the real data is large, sample data was provided instead.
// The reference data was collected directly by the instructor for the competition problem and has not been cleaned in any way.
08. Ch 24. Let's solve a real problem (2) - apartment sale price prediction - 02. (2) Variable transformation and merging
Calculates element in test_elements, broadcasting over element only. Returns a boolean array of the same shape as element that is True where an element of element is in test_elements and False otherwise.
Parameters
element : array_like
Input array.
test_elements : array_like
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
assume_unique : bool, optional
If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
invert : bool, optional
If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. np.isin(a, b, invert=True) is equivalent to (but faster than) np.invert(np.isin(a, b)).
Returns
isin : ndarray, bool
Has the same shape as element. The values element[isin] are in test_elements.
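A small check of the behavior described above:
import numpy as np

element = np.array([[0, 2], [4, 6]])
test_elements = [1, 2, 4, 8]
mask = np.isin(element, test_elements)
print(mask)            # [[False  True] [ True False]] - same shape as element
print(element[mask])   # [2 4]
print(element[np.isin(element, test_elements, invert=True)])  # [0 6]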
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream, and "unpickling" is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as "serialization", "marshalling," or "flattening"; however, to avoid confusion, the terms used here are "pickling" and "unpickling".
JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
JSON is human-readable, while pickle is not;
JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python's introspection facilities; complex cases can be tackled by implementing specific object APIs);
Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
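A minimal pickling round-trip (the file name is arbitrary); note the binary 'wb'/'rb' modes, which the definition above requires:
import pickle

data = {'ticker': 'AAPL', 'prices': [1.0, 2.5, 3.2]}
with open('data.pkl', 'wb') as f:   # pickling needs a binary file
    pickle.dump(data, f)
with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)       # only unpickle data you trust
print(restored == data)             # True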
sampling_strategy : float, str, dict or callable, default='auto'
Sampling information to resample the data set.
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_os = N_rm / N_M, where N_rm is the number of samples in the minority class after resampling and N_M is the number of samples in the majority class.
Warning
float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
'minority': resample only the minority class;
'notminority': resample all classes but the minority class;
'notmajority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'notmajority'.
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
random_state : int, RandomState instance or None, default=None
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
k_neighbors : int or object, default=5
If int, number of nearest neighbours to use to construct synthetic samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
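A minimal SMOTE sketch on synthetic imbalanced data (the 9:1 class weights are made up for illustration):
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly 9:1 imbalance
sm = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
print(Counter(y_res))  # classes equalized by synthetic minority samples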
When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as α_us = N_m / N_rM, where N_m is the number of samples in the minority class and N_rM is the number of samples in the majority class after resampling.
Warning
float is only available for binary classification. An error is raised for multi-class classification.
When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:
'majority': resample only the majority class;
'notminority': resample all classes but the minority class;
'notmajority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'notminority'.
When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.
version : int, default=1
Version of the NearMiss to use. Possible values are 1, 2 or 3.
n_neighbors : int or object, default=3
If int, size of the neighbourhood to consider to compute the average distance to the minority point samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_neighbors_ver3 : int or object, default=3
If int, the NearMiss-3 algorithm starts with a re-sampling phase. This parameter corresponds to the number of neighbours selected to create the subset in which the selection will be performed. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
n_jobs : int, default=None
Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
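The undersampling counterpart, sketched the same way:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print(Counter(y_res))  # majority class reduced to match the minority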
[05. Part 5) Ch 20. A biased model is useless - class imbalance problems - 02-2. Resampling - oversampling and undersampling (practice)]
// Use kNN to test the degree of class imbalance.
// KNeighborsClassifier
// A recall of 0% shows that the imbalance is at a severe level.
[05. Part 5) Ch 20. A biased model is useless - class imbalance problems - 03-1 Cost-sensitive models (theory)]
// This is a model whose training has been modified; it is hard to call it preprocessing.
* Definition
// Set the false-negative cost higher than the false-positive cost
* Probability models
* Related syntax: .predict_proba
* Tip. Basic principle for using NumPy and Pandas well: operate on whole arrays whenever possible
// Make the most of universal functions, broadcasting, and mask operations
* Non-probability model (1): support vector machine
* Non-probability model (2): decision tree
* Related syntax: class_weight (see the sketch below)
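A rough sketch of both routes (the 10x weight and the 0.2 cutoff are made-up values for illustration):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# cost-sensitive route: weight minority-class (class 1) errors more heavily
svm = SVC(class_weight={0: 1, 1: 10}).fit(X, y)

# probability-model route: keep the model, lower the cutoff via predict_proba
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
pred = (proba >= 0.2).astype(int)  # a cutoff below 0.5 favors the minority class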
[05. Part 5) Ch 20. A biased model is useless - class imbalance problems - 03-2 Cost-sensitive models (practice)]
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.
C : float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
degree : int, default=3
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale'
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,
if ‘auto’, uses 1 / n_features.
Changed in version 0.22: The default value of gamma changed from ‘auto’ to ‘scale’.
coef0 : float, default=0.0
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
shrinking : bool, default=True
Whether to use the shrinking heuristic. See the User Guide.
probability : bool, default=False
Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.
tol : float, default=1e-3
Tolerance for stopping criterion.
cache_size : float, default=200
Specify the size of the kernel cache (in MB).
class_weight : dict or 'balanced', default=None
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
verbose : bool, default=False
Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
max_iter : int, default=-1
Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape : {'ovo', 'ovr'}, default='ovr'
Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one ('ovo') is always used as multi-class strategy. The parameter is ignored for binary classification.
Changed in version 0.19: decision_function_shape is ‘ovr’ by default.
New in version 0.17: decision_function_shape=’ovr’ is recommended.
Changed in version 0.17: Deprecated decision_function_shape=’ovo’ and None.
break_ties : bool, default=False
If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.
New in version 0.22.
random_state : int or RandomState instance, default=None
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
Attributes
support_ : ndarray of shape (n_SV,)
Indices of support vectors.
support_vectors_ : ndarray of shape (n_SV, n_features)
Support vectors.
n_support_ : ndarray of shape (n_class,), dtype=int32
Number of support vectors for each class.
dual_coef_ : ndarray of shape (n_class-1, n_SV)
Dual coefficients of the support vector in the decision function (see Mathematical formulation), multiplied by their targets. For multiclass, coefficient for all 1-vs-1 classifiers. The layout of the coefficients in the multiclass case is somewhat non-trivial. See the multi-class section of the User Guide for details.
coef_ : ndarray of shape (n_class * (n_class-1) / 2, n_features)
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
intercept_ : ndarray of shape (n_class * (n_class-1) / 2,)
Constants in decision function.
fit_status_ : int
0 if correctly fitted, 1 otherwise (will raise warning)
classes_ : ndarray of shape (n_classes,)
The class labels.
probA_ : ndarray of shape (n_class * (n_class-1) / 2)
probB_ : ndarray of shape (n_class * (n_class-1) / 2)
If probability=True, it corresponds to the parameters learned in Platt scaling to produce probability estimates from decision values. If probability=False, it’s an empty array. Platt scaling uses the logistic function 1 / (1 + exp(decision_value * probA_ + probB_)) where probA_ and probB_ are learned from the dataset [2]. For more information on the multiclass case and training procedure see section 8 of [1].
class_weight_ : ndarray of shape (n_class,)
Multipliers of parameter C for each class. Computed based on the class_weight parameter.
shape_fit_ : tuple of int of shape (n_dimensions_of_X,)
Array dimensions of training vector X.
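Tying several of the parameters and attributes above together in one sketch (the values are chosen only for illustration):
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(C=1.0, kernel='rbf', gamma='scale', probability=True,
          random_state=0).fit(X, y)
print(clf.n_support_)            # support vectors per class
print(clf.predict_proba(X[:2]))  # available because probability=True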
One hot encoding consists of replacing the categorical variable by a combination of binary variables which take value 0 or 1, to indicate if a certain category is present in an observation.
Each one of the binary variables is also known as a dummy variable. For example, from the categorical variable "Gender" with categories 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is female or 0 otherwise. We can also generate the variable "male", which takes 1 if the person is "male" and 0 otherwise.
The encoder has the option to generate one dummy variable per category, or to create dummy variables only for the top n most popular categories, that is, the categories that are shown by the majority of the observations.
If dummy variables are created for all the categories of a variable, you have the option to drop one category so as not to create information redundancy. That is, encoding into k-1 variables, where k is the number of unique categories.
The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode categorical variables (object type).
The encoder first finds the categories to be encoded for each variable (fit).
The encoder then creates one dummy variable per category for each variable (transform).
Note: new categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no binary variable will be created for them).
Parameters
top_categories (int, default=None) – If None, a dummy variable will be created for each category of the variable. Alternatively, top_categories indicates the number of most frequent categories to encode. Dummy variables will be created only for those popular categories and the rest will be ignored. Note that this is equivalent to grouping all the remaining categories in one group.
variables (list) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.
drop_last (boolean, default=False) – Only used if top_categories = None. It indicates whether to create dummy variables for all the categories (k dummies), or if set to True, it will ignore the last variable of the list (k-1 dummies).
Learns the unique categories per variable. If top_categories is indicated, it will learn the most popular categories. Alternatively, it learns all unique categories per variable.
Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the selected variables.
y (pandas series, default=None) – Target. It is not needed in this encoder. You can pass y or None.
encoder_dict_
The dictionary containing the categories for which dummy variables will be created.
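A hypothetical sketch of the encoder described above. The import path below matches the older feature_engine release these notes quote and is an assumption (recent releases moved the class to feature_engine.encoding):
import pandas as pd
from feature_engine.categorical_encoders import OneHotCategoricalEncoder  # assumed path

X = pd.DataFrame({'Gender': ['female', 'male', 'male', 'female'],
                  'City': ['Seoul', 'Busan', 'Seoul', 'Seoul']})
encoder = OneHotCategoricalEncoder(top_categories=None, drop_last=True)
encoder.fit(X)                    # learns the categories per variable
print(encoder.encoder_dict_)      # categories for which dummies are created
print(encoder.transform(X))       # k-1 dummy variables per column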
numpy.quantile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)
Compute the q-th quantile of the data along the specified axis.
New in version 1.15.0.
Parameters
a : array_like
Input array or object that can be converted to an array.
q : array_like of float
Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.
axis : {int, tuple of int, None}, optional
Axis or axes along which the quantiles are computed. The default is to compute the quantile(s) along a flattened version of the array.
out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output, but the type (of the output) will be cast if necessary.
overwrite_input : bool, optional
If True, then allow the input array a to be modified by intermediate calculations, to save memory. In this case, the contents of the input a after this function completes is undefined.
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}, optional
This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i < j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j, whichever is nearest.
midpoint: (i + j) / 2.
keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array a.
Returns
quantile : scalar or ndarray
If q is a single quantile and axis=None, then the result is a scalar. If multiple quantiles are given, first axis of the result corresponds to the quantiles. The other axes are the axes that remain after the reduction of a. If the input contains integers or floats smaller than float64, the output data-type is float64. Otherwise, the output data-type is the same as that of the input. If out is specified, that array is returned instead.
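A quick check of the definitions above:
import numpy as np

a = np.array([[10, 7, 4], [3, 2, 1]])
print(np.quantile(a, 0.5))           # 3.5 - median of the flattened array
print(np.quantile(a, 0.5, axis=0))   # [6.5 4.5 2.5]
print(np.quantile(a, [0.25, 0.75]))  # first axis of the result indexes the quantiles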
[05. Part 5) Ch 19. Can't we make an ideal distribution - variable distribution problems - 03. Outlier removal (2) using density-based clustering]
* Outlier detection method 2: perform density-based clustering
// Points with enough neighbors within a given radius are core points; points that fall within a core point's radius without being core themselves are border points; everything belonging to neither is called an outlier.
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
Perform DBSCAN clustering from vector array or distance matrix.
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
eps : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samples : int, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
metric : string, or callable, default='euclidean'
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors for DBSCAN.
New in version 0.17: metric precomputed to accept precomputed sparse matrix.
metric_params : dict, default=None
Additional keyword arguments for the metric function.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
leaf_size : int, default=30
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p : float, default=None
The power of the Minkowski metric to be used to calculate distance between points.
n_jobs : int, default=None
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Attributes
core_sample_indices_ : ndarray of shape (n_core_samples,)
Indices of core samples.
components_ : ndarray of shape (n_core_samples, n_features)
Copy of each core sample found by training.
labels_ : ndarray of shape (n_samples)
Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
* Practice
// cdist is a library imported for reference while examining DBSCAN.
// Check the resulting values while adjusting the parameters (a short sketch follows).
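A minimal sketch of the outlier-removal idea (eps and min_samples are illustrative values that must be tuned per dataset):
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
model = DBSCAN(eps=3, min_samples=2).fit(X)
print(model.labels_)           # [0 0 0 1 1 -1]; label -1 marks noise
print(X[model.labels_ == -1])  # treat the noise points as outliers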
[05. Part 5) Ch 19. Can't we make an ideal distribution - variable distribution problems - 04. Removing correlation between features]
class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
Principal component analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.
It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.
Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.
n_components : int, float, None or str
Number of components to keep. If n_components is not set all components are kept:
n_components==min(n_samples,n_features)
If n_components == 'mle' and svd_solver == 'full', Minka's MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components==min(n_samples,n_features)-1
copy : bool, default=True
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten : bool, optional (default False)
When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
svd_solver : str {'auto', 'full', 'arpack', 'randomized'}
If auto :
The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient 'randomized' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
If full :
run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
If arpack :
run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
If randomized :
run randomized SVD by the method of Halko et al.
New in version 0.18.0.
tol : float >= 0, optional (default .0)
Tolerance for singular values computed by svd_solver == ‘arpack’.
New in version 0.18.0.
iterated_power : int >= 0, or 'auto', (default 'auto')
Number of iterations for the power method computed by svd_solver == ‘randomized’.
Attributes
explained_variance_ratio_ : array, shape (n_components,)
Percentage of variance explained by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
singular_values_ : array, shape (n_components,)
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
New in version 0.19.
mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
n_components_ : int
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
n_features_ : int
Number of features in the training data.
n_samples_ : int
Number of samples in the training data.
noise_variance_ : float
The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See "Pattern Recognition and Machine Learning" by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.
Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
* Practice
// The correlation between features is too high.
// Compute the VIF: regress each feature on the others with LinearRegression and use the R squared.
// Use PCA (see the sketch below).
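A minimal sketch of removing correlation with PCA (the third synthetic feature is a deliberate linear mix of the other two):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X = np.c_[X, 2 * X[:, 0] + X[:, 1]]  # third feature strongly correlated with the others
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
print(pca.explained_variance_ratio_)
print(np.corrcoef(Z.T).round(2))     # off-diagonal ~ 0: components are uncorrelated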
[05. Part 5) Ch 19. Can't we make an ideal distribution - variable distribution problems - 05. Removing variable skew]
For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution. The function skewtest can be used to determine if the skewness value is close enough to zero, statistically speaking.
Parameters
a : ndarray
Input array.
axis : int or None, optional
Axis along which skewness is calculated. Default is 0. If None, compute over the whole array a.
bias : bool, optional
If False, then the calculations are corrected for statistical bias.
Compute the kurtosis (Fisher or Pearson) of a dataset.
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher’s definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution.
If bias is False then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators
Use kurtosistest to see if result is close enough to normal.
Parameters
a : array
Data for which the kurtosis is calculated.
axis : int or None, optional
Axis along which the kurtosis is calculated. Default is 0. If None, compute over the whole array a.
fisher : bool, optional
If True, Fisher’s definition is used (normal ==> 0.0). If False, Pearson’s definition is used (normal ==> 3.0).
bias : bool, optional
If False, then the calculations are corrected for statistical bias.
nan_policy : {'propagate', 'raise', 'omit'}, optional
Defines how to handle when input contains nan. 'propagate' returns nan, 'raise' throws an error, 'omit' performs the calculations ignoring nan values. Default is 'propagate'.
Returns
kurtosis : array
The kurtosis of values along an axis. If all values are equal, return -3 for Fisher’s definition and 0 for Pearson’s definition.
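A small sketch tying skew and kurtosis to the skew-removal theme (exponential data is deliberately right-skewed):
import numpy as np
from scipy.stats import kurtosis, skew

x = np.random.RandomState(0).exponential(size=1000)  # right-skewed sample
print(skew(x))            # clearly greater than zero
print(kurtosis(x))        # Fisher definition: ~0 for a normal distribution
print(skew(np.log1p(x)))  # much closer to zero after a log transform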
[05. Part 5) Ch 19. Can't we make an ideal distribution - variable distribution problems - 06. Scaling]
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean : boolean, True by default
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
Attributes
scale_ : ndarray or None, shape (n_features,)
Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.
New in version 0.17: scale_
mean_ : ndarray or None, shape (n_features,)
The mean value for each feature in the training set. Equal to None when with_mean=False.
var_ : ndarray or None, shape (n_features,)
The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
n_samples_seen_ : int or array, shape (n_features,)
The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.
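The standard usage pattern (the sample values are kept small so the arithmetic is easy to follow):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
scaler = StandardScaler().fit(X_train)  # statistics come from the training set only
print(scaler.mean_)                     # [0.5 0.5]
print(scaler.transform(X_train))
print(scaler.transform([[2., 2.]]))     # [[3. 3.]] - same u and s reused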
[Data Preprocessing with Python, Level UP Comment] - how to handle categorical variables stored as strings
New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.
Parameters
missing_values : number, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas' dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
strategy : string, default='mean'
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
New in version 0.20: strategy="constant" for fixed value imputation.
fill_value : string or numerical value, default=None
When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
verbose : integer, default=0
Controls the verbosity of the imputer.
copy : boolean, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.
add_indicator : boolean, default=False
If True, a MissingIndicator transform will stack onto output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.
Attributes
statistics_ : array of shape (n_features,)
The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform, features corresponding to np.nan statistics will be discarded.
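A minimal sketch (the array values are arbitrary):
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]
print(imp.fit_transform(X))  # the NaN is replaced by the column mean 3.5
print(imp.statistics_)       # per-column fill values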
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
Axis along which to fill missing values.
inplace : bool, default False
If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
Returns
DataFrame or None
Object with missing values filled or None if inplace=True.
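A few of the fillna options above in one sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, 4],
                   'B': [np.nan, 2, 3, np.nan]})
print(df.fillna(0))                        # constant fill
print(df.fillna(method='ffill', limit=1))  # forward-fill at most one consecutive NaN
print(df.fillna({'A': df['A'].mean()}))    # per-column fill values via a dict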
[04. Part 4) Ch 17. Why are there no values here - missing value problems - 03. Solution (4) using a missing-value prediction model]
* Definition of a missing-value prediction model
- Train a model that predicts missing values from the columns without missing values, then use it to fill them in
- (Example) estimating the missing values contained in column V2
* Using a missing-value prediction model
- A missing-value prediction model can be used reasonably well in most situations, but you must understand its conditions of use and drawbacks.
- Conditions of use and drawbacks
. Condition 1: missing values must not be concentrated in a small number of columns.
. Condition 2: relationships must exist between the features.
. Drawback: it takes longer than other missing-value handling methods.
* Related syntax: sklearn.impute.KNNImputer
- A missing-value prediction model that finds neighbors using only the non-missing values, then replaces each missing value with a representative value of the neighbors' values
- Main inputs
. n_neighbors: number of neighbors (caution: if it is too small, imputation may not work properly, so around 5 is appropriate)
* Practice
// For n_neighbors = 5, instantiate with a moderate value rather than a large one.
class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)
Imputation for completing missing values using k-Nearest Neighbors.
Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
missing_values : number, string, np.nan or None, default=np.nan
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas' dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
n_neighbors : int, default=5
Number of neighboring samples to use for imputation.
weights : {'uniform', 'distance'} or callable, default='uniform'
Weight function used in prediction. Possible values:
‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
'distance' : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
metric : {'nan_euclidean'} or callable, default='nan_euclidean'
Distance metric for searching neighbors. Possible values:
‘nan_euclidean’
callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.
copy : bool, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
add_indicator : bool, default=False
If True, a MissingIndicator transform will stack onto the output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.
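Minimal usage, matching the practice note above (n_neighbors=2 here only to keep the toy data readable):
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2, weights='uniform')
print(imputer.fit_transform(X))  # each NaN becomes the mean of its 2 nearest neighbors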
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters
by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis : {0 or 'index', 1 or 'columns'}, default 0
Split along rows (0) or columns (1).
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
group_keys : bool, default True
When calling apply, add group keys to index to identify pieces.
squeeze : bool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
dropna : bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
New in version 1.1.0.
Returns
DataFrameGroupBy
Returns a groupby object that contains information about the groups.
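A small sketch of the split-apply-combine idea:
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
print(df.groupby('Animal').mean())                             # group labels become the index
print(df.groupby('Animal', as_index=False)['Max Speed'].max()) # SQL-style flat output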
[04. Part 4) Ch 17. Why are there no values here - missing value problems - 01. Problem definition]
* Problem definition
- Missing values in the data prevent the model from being trained at all
- Missing values are broadly divided into NaN and None.
. NaN: a value that should exist but is missing; handle by replacement, estimation, prediction, etc.
. None: the absence of a value is itself the value (e.g., occupation - unemployed); handle by defining a new value
- The missing-value handling methods themselves are very simple, but choosing the right method for the situation is very important
* Definition of terms
- Missing record: a record that contains missing values
- Missing rate: number of missing records / total number of records
[04. Part 4) Ch 17. Why are there no values here - missing value problems - 02. Solution (1) deletion]
* Row-wise deletion
- Row-wise deletion removes the missing records; it is very simple, but it can be performed only when two conditions are met.
* Column-wise deletion
- Column-wise deletion removes the columns that contain missing values; it is very simple, but it can be used only when two conditions are met.
. Missing values are concentrated in a small number of variables.
. Those variables are not very important (by domain knowledge)
* Related syntax: Series / DataFrame.isnull
- Returns True if a value is missing and False otherwise (works the opposite of the notnull function)
- Mainly used together with the sum function to check the distribution of missing values
* Related syntax: DataFrame.dropna
- Used to remove rows or columns that contain missing values
- Main inputs
. axis: 1 deletes columns that contain missing values; 0 deletes rows that contain missing values
. how: 'any' deletes when even one value is missing; 'all' deletes only when all values are missing (usually set to 'any'; see the sketch below)
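The sketch referenced above (values are arbitrary):
import numpy as np
import pandas as pd

df = pd.DataFrame({'V1': [1, np.nan, 3], 'V2': [np.nan, np.nan, 6]})
print(df.isnull().sum())             # missing-value count per column
print(df.dropna(axis=0, how='any'))  # drop rows containing any missing value
print(df.dropna(axis=1, how='all'))  # drop columns where every value is missing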
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
Series
Mask of bool values for each element in Series that indicates whether an element is an NA value.