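One-hot encoding with pandas

import pandas as pd

data = pd.Series(["java", "python", "c", "python", "html", "java", "linux"])
# get_dummies creates one indicator column per distinct value, sorted by label.
# Note: pandas 2.0 and later returns boolean columns by default;
# pass dtype=int to get the 0/1 values shown in the output.
data1 = pd.get_dummies(data)
print(data1)

Output: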
   c  html  java  linux  python
0  0     0     1      0       0
1  0     0     0      0       1
2  1     0     0      0       0
3  0     0     0      0       1
4  0     1     0      0       0
5  0     0     1      0       0
6  0     0     0      1       0

One-hot encoding with sklearn
import pandas as pd
from sklearn.preprocessing import label_binarize

data = pd.DataFrame([["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]], columns=["name"])
# Collect the distinct labels; a set has no guaranteed order,
# so the column order can differ between runs.
classes = list(set(data["name"].values.tolist()))
# One column per entry in classes, in the order given.
data2 = label_binarize(data["name"], classes=classes)
data = pd.DataFrame(data2, columns=classes)
print(data)
Output:

   java  linux  html  c  python
0     1      0     0  0       0
1     0      0     0  0       1
2     0      0     0  1       0
3     0      0     0  0       1
4     0      0     1  0       0
5     1      0     0  0       0
6     0      1     0  0       0

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame([["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]], columns=["name"])
# Map each label to an integer code; l.classes_ holds the sorted labels.
l = LabelEncoder()
d = l.fit_transform(data["name"])
# OneHotEncoder expects a 2-D input, hence reshape(-1, 1).
o = OneHotEncoder()
data3 = pd.DataFrame((o.fit_transform(d.reshape(-1, 1))).toarray(), columns=l.classes_)
print(data3)
Output:

     c  html  java  linux  python
0  0.0   0.0   1.0    0.0     0.0
1  0.0   0.0   0.0    0.0     1.0
2  1.0   0.0   0.0    0.0     0.0
3  0.0   0.0   0.0    0.0     1.0
4  0.0   1.0   0.0    0.0     0.0
5  0.0   0.0   1.0    0.0     0.0
6  0.0   0.0   0.0    1.0     0.0

label_binarize returns a numpy.ndarray.
OneHotEncoder returns a scipy.sparse.csr.csr_matrix; call toarray() on it to get a numpy.ndarray.
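The return types described above can be checked directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn in which OneHotEncoder accepts string columns without a prior LabelEncoder step; the variable names (names, dense, encoded, as_array) are illustrative and not from the original code.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, label_binarize

names = pd.Series(
    ["java", "python", "c", "python", "html", "java", "linux"], name="name"
)
# Sort the distinct labels so the column order is deterministic.
classes = sorted(names.unique())

# label_binarize gives back a dense numpy.ndarray by default.
dense = label_binarize(names, classes=classes)
print(type(dense))

# OneHotEncoder gives back a SciPy sparse matrix by default.
encoded = OneHotEncoder().fit_transform(names.to_frame())
print(type(encoded), sparse.issparse(encoded))

# toarray() converts the sparse result to a dense numpy.ndarray.
as_array = encoded.toarray()
print(type(as_array), as_array.shape)
```

Because both encoders see the labels in sorted order here, the two dense arrays hold the same 0/1 pattern; only the dtype differs (integer versus float).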
For ndarray data, use tolist() to convert it to a list, and use list(set(list_data)) to remove duplicates.
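The deduplication step can be sketched as follows. The labels array is illustrative. Since list(set(...)) returns the values in arbitrary order, the sketch also shows two ways to get a stable order: sorted(set(...)) and np.unique, which is swapped in here as an alternative and is not part of the original approach.

```python
import numpy as np

labels = np.array(["java", "python", "c", "python", "html", "java", "linux"])

# ndarray -> plain Python list
as_list = labels.tolist()

# Remove duplicates with a set; the resulting order is arbitrary.
unique_unordered = list(set(as_list))

# Sorting the set gives a reproducible order.
unique_sorted = sorted(set(as_list))

# np.unique deduplicates and sorts in one step.
unique_np = np.unique(labels).tolist()

print(unique_sorted)
```

A fixed order matters when the unique labels are later used as column names, as in the label_binarize example, because it keeps the columns the same from run to run.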