BaseN Encoding and Grid Search in category_encoders
Published: 2019-05-11


In the past I’ve posted about the various encodings one can use for categorical variables in machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, or anywhere in between) and get consistently encoded categorical variables out. Note that base 1 and one-hot aren’t really the same thing, but in this case it’s convenient to consider them as such.

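To make the idea of "picking a base" concrete, here is a minimal pure-Python sketch. It assumes the scheme is: map each category to an integer, then write that integer as digits in the chosen base, one column per digit, with base 1 special-cased as one-hot. The function names are hypothetical, and the library's actual implementation may differ in details such as how it numbers categories or handles unseen values.

```python
def to_base_digits(value, base, width):
    """Return `value` as a list of `width` digits in `base`, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    return digits[::-1]


def digits_needed(n_values, base):
    """Smallest number of base-`base` digits that can represent 0..n_values-1."""
    width = 1
    while base ** width < n_values:
        width += 1
    return width


def base_n_encode(categories, base):
    """Encode a list of category labels as rows of base-`base` digits.

    base=1 is special-cased as one-hot, mirroring the convention in the text.
    """
    levels = sorted(set(categories))
    ordinal = {label: i for i, label in enumerate(levels)}  # category -> integer
    if base == 1:
        return [[1 if ordinal[c] == i else 0 for i in range(len(levels))]
                for c in categories]
    width = digits_needed(len(levels), base)
    return [to_base_digits(ordinal[c], base, width) for c in categories]


colors = ["red", "green", "blue", "green", "yellow", "purple"]
for base in (1, 2, 3, 5):
    rows = base_n_encode(colors, base)
    print("base=%d -> %d column(s), 'red' encoded as %s" % (base, len(rows[0]), rows[0]))
```

With five distinct categories, base 1 yields five columns, base 2 yields three, and base 5 collapses to a single ordinal column.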

Practically, this adds very little new functionality; people rarely use base 3, base 8, or any base other than ordinal or binary in real problems. Where it becomes useful, however, is when this encoder is coupled with a grid search.


```python
from __future__ import print_function

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data

# NOTE: these imports target scikit-learn releases before 0.20, where the
# grid_search and cross_validation modules and grid_scores_ still existed.

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here
n_samples = X.shape[0]

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])

# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)
    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    # NOTE: grid_scores_ yields (parameters, mean score, per-fold scores), so the
    # names below are misleading: the line prints the parameters first, then
    # twice the mean score after "+/-", then the array of per-fold scores.
    for mean, std, params in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (mean, std * 2, params))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
```
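The script above is written against the scikit-learn of its day. In scikit-learn 0.20 and later, `GridSearchCV` and `train_test_split` are imported from `sklearn.model_selection`, and the `grid_scores_` attribute has been replaced by the `cv_results_` dictionary. As a rough sketch of how the reporting loop might look against that dictionary, the hypothetical helper below takes any mapping shaped like `cv_results_`; the sample values are made up purely to show the output format.

```python
def report_grid_scores(cv_results):
    """Return one 'mean (+/- 2*std) for params' line per parameter setting.

    `cv_results` is expected to look like GridSearchCV.cv_results_: a mapping
    with 'mean_test_score', 'std_test_score' and 'params' entries of equal length.
    """
    lines = []
    for mean, std, params in zip(cv_results["mean_test_score"],
                                 cv_results["std_test_score"],
                                 cv_results["params"]):
        lines.append("%0.3f (+/-%0.3f) for %r" % (mean, std * 2, params))
    return lines


# Hand-written stand-in for clf.cv_results_; the numbers are invented for illustration.
fake_results = {
    "mean_test_score": [0.90, 0.80],
    "std_test_score": [0.01, 0.02],
    "params": [{"enc__base": 1}, {"enc__base": 2}],
}
for line in report_grid_scores(fake_results):
    print(line)
```

After fitting, one would pass `clf.cv_results_` in place of `fake_results`.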

This code uses a normal scikit-learn grid search to find the optimal base for encoding categorical variables. The trade-off between how well pairwise distances between categories are preserved and the final dataset dimensionality is no longer a difficult-to-tune parameter.

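The dimensionality side of that trade-off is easy to tabulate. The sketch below assumes a category with k levels needs k columns at base 1 (one-hot) and otherwise the smallest number of base-b digits that can represent k values. The list of cardinalities is hypothetical, not taken from the mushroom dataset.

```python
def encoded_width(n_categories, base):
    """Columns needed to encode one feature with `n_categories` levels."""
    if base == 1:
        return n_categories  # one-hot: one column per level
    width = 1
    while base ** width < n_categories:
        width += 1
    return width


# Hypothetical per-feature cardinalities, chosen only for illustration.
cardinalities = [2, 4, 6, 9, 12]

total_columns = {
    base: sum(encoded_width(n, base) for n in cardinalities)
    for base in range(1, 7)
}
for base, total in sorted(total_columns.items()):
    print("base=%d -> %d columns in total" % (base, total))
```

Lower bases keep categories further apart but produce wider datasets; higher bases compress the columns at the cost of imposing more ordinal structure. The grid search lets the cross-validated score pick the point along that scale.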

Running the above script gives the following output, first for precision and then for recall:


```
Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99905151856) for [ 1.         1.         1.         1.         0.9976247]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492   0.99763033  0.99621212  0.9964455   0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765  0.98387419  0.9651717   0.96970966  0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905  1.          1.          1.          0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035  0.98854962  0.99745547  1.          0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332  0.8547619   0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534  0.98657761  0.89642857  0.9800385   0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062
```

This shows us that for this relatively simple problem, with a small dataset, the dimension-inefficient one-hot encoding (base=1) is the best option available.

We’ve got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, so if you’re interested in this kind of work, reach out to get involved.

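The conclusion that base 1 wins can be checked by hand from the per-fold precision scores in the output. The sketch below copies those scores and takes a plain unweighted mean per base, which is an assumption: the grid search may weight folds differently, so the means agree with the reported figures only approximately.

```python
# Per-fold precision scores copied from the grid-search output in this post.
fold_scores = {
    1: [1.0, 1.0, 1.0, 1.0, 0.9976247],
    2: [0.9805492, 0.99763033, 0.99621212, 0.9964455, 0.9976247],
    3: [0.99411765, 0.98387419, 0.9651717, 0.96970966, 0.98633155],
    4: [0.99500636, 0.96541172, 0.98387419, 0.99013767, 0.97892831],
    5: [0.97773263, 0.97556628, 0.98636545, 0.97058734, 0.99063232],
    6: [0.96788716, 0.95480882, 0.97648608, 0.97769848, 0.96790524],
}

# Unweighted mean across the five folds for each base.
mean_scores = {base: sum(s) / len(s) for base, s in fold_scores.items()}

# The base with the highest mean precision.
best_base = max(mean_scores, key=mean_scores.get)

for base in sorted(mean_scores):
    print("base=%d  mean precision=%.4f" % (base, mean_scores[base]))
print("best base:", best_base)
```

Doubling these means also approximately reproduces the figures printed after `+/-` in the output, so those figures are best read as twice the mean score, not as a spread.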



Reposted from: http://reqwd.baihongyu.com/
