In the past I’ve posted about the various encoders one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, or anywhere in between) and get consistently encoded categorical variables out. Note that base 1 and one-hot aren’t really the same thing, but in this case it’s convenient to consider them as such.
Practically, this adds very little new functionality; people rarely use base 3 or base 8, or any base other than ordinal or binary, in real problems. Where it becomes useful, however, is when this encoder is coupled with a grid search.
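To make the idea concrete, here is a minimal, hypothetical sketch of what base-N encoding does conceptually, not the library’s actual implementation: each category is first mapped to an ordinal integer, and that integer is then written out as base-N digits, with one output column per digit.

```python
def base_n_digits(index, base, width):
    """Write a category's ordinal index as `width` base-`base` digits,
    most significant digit first. Each digit becomes one output column."""
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]

# The category with ordinal index 5, for a variable with up to 8 levels:
print(base_n_digits(5, 2, 3))  # binary encoding  -> [1, 0, 1]
print(base_n_digits(5, 8, 1))  # ordinal encoding -> [5]
```

Binary encoding is just this scheme with base 2, and ordinal encoding is the degenerate case where the base is at least the number of levels, so one digit suffices.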
```python
from __future__ import print_function
from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data
from sklearn.linear_model import LogisticRegression

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here
n_samples = X.shape[0]

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])

# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)

    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    # grid_scores_ yields (parameters, mean_validation_score, cv_validation_scores)
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (params, mean_score * 2, cv_scores))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
```
This code uses a normal scikit-learn grid search to find the optimal base for encoding the categorical variables. The trade-off between how well pairwise distances between categories are preserved and the final dataset’s dimensionality is no longer a difficult parameter to tune.
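The dimensionality side of that trade-off is easy to quantify. Under the convention used above (base 1 behaves like one-hot, one column per level), a variable with k levels needs roughly ceil(log base-N of k) columns; a small hypothetical helper illustrates why higher bases are more compact:

```python
def n_columns(n_categories, base):
    """Columns needed to represent `n_categories` levels in the given base
    (base 1 is treated as one-hot: one column per category)."""
    if base == 1:
        return n_categories
    cols = 1
    while base ** cols < n_categories:
        cols += 1
    return cols

# A 16-level categorical variable under different bases:
for base in (1, 2, 3, 4, 16):
    print(base, n_columns(16, base))  # 16, 4, 3, 2, 1 columns respectively
```

Lower bases spread the information across more columns, preserving more distance structure; higher bases compress it, at the cost of making unrelated categories look numerically close.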
By running the above script we get:
```
# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99905151856) for [ 1.          1.          1.          1.          0.9976247 ]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492   0.99763033  0.99621212  0.9964455   0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765  0.98387419  0.9651717   0.96970966  0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905  1.          1.          1.          0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035  0.98854962  0.99745547  1.          0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332  0.8547619   0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534  0.98657761  0.89642857  0.9800385   0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062
```
This shows us that for this relatively simple problem, with a small dataset, using the dimension-inefficient one-hot encoding (base=1) is the best option available. We’ve got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, the first release since the package’s inclusion, so if you’re interested in this kind of work, head over to the project or reach out to get involved.