The loaded dataset is a dictionary-like object. The features of all instances are stored in mat['X'], and the ground truth class labels are stored in mat['Y'].
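The dataset itself can be loaded with scipy.io.loadmat. A minimal sketch, assuming COIL20.mat has been downloaded into the current working directory:
>>>import scipy.io
>>>mat = scipy.io.loadmat('COIL20.mat')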
For instance, in the COIL20 dataset, mat['X'] is the feature matrix:
>>>X = mat['X']
>>>print X
[[ 0.01568627 0.01568627 0.01568627 ..., 0.01568627 0.01568627 0.01568627]
 [ 0.01960784 0.01960784 0.01960784 ..., 0.01960784 0.01960784 0.01960784]
 [ 0.01568627 0.01568627 0.01568627 ..., 0.01568627 0.01568627 0.01568627]
 ...,
 [ 0. 0. 0. ..., 0. 0. 0. ]
 [ 0. 0. 0. ..., 0. 0. 0. ]
 [ 0. 0. 0. ..., 0. 0. 0. ]]
And mat['Y'] stores the ground truth class labels as a column vector; taking its first column gives a 1D label vector:
>>>y = mat['Y'][:, 0]
>>>print y
[ 1. 1. 1. ..., 20. 20. 20.]
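Note that the labels load as floating point numbers. If you prefer integer class labels (as shown in the later outputs of this guide), they can be cast explicitly; a small optional step:
>>>y = y.astype(int)
>>>print y
[ 1  1  1 ..., 20 20 20]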
Shape of the Data Arrays
The feature matrix is always represented by a 2D array with shape (n_samples, n_features). For example, for the COIL20.mat dataset, the function numpy.shape outputs the shape of the feature matrix:
>>>import numpy as np
>>>n_samples, n_features = np.shape(X)
>>>print n_samples, n_features
1440 1024
The label vector is always represented by a 1D array with shape (n_samples,):
>>>n_labels = np.shape(y)
>>>print n_labels
(1440L,)
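Each row of the feature matrix is one flattened image. If the 1024 features correspond to a 32x32 pixel grid (32*32 = 1024), a single instance can be reshaped and displayed with matplotlib; a rough sketch under that assumption:
>>>import matplotlib.pyplot as plt
>>>img = X[0].reshape(32, 32)  # assumes a row-major 32x32 pixel layout
>>>plt.imshow(img, cmap='gray')
>>>plt.show()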
For Supervised Learning Problems
Split Data into Training and Testing Set
The function sklearn.cross_validation.train_test_split (moved to sklearn.model_selection in newer scikit-learn releases) splits the data into training and test sets. Here we set the size of the test set to 20% of the data:
>>>from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.2, random_state=40)
>>>X_train
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.02352941, 0.02352941, 0.02352941, ..., 0.02352941, 0.02352941, 0.02352941],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
...,
[ 0.01568627, 0.01568627, 0.01568627, ..., 0.01568627, 0.01568627, 0.01568627],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ]])
>>>X_test
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.01568627, 0.01568627, 0.01568627, ..., 0.01568627, 0.01568627, 0.01568627],
...,
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ]])
>>>print y_train
[17 2 12 ..., 1 4 19]
>>>print y_test
[8 15 1 ..., 15 17 7]
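As a quick sanity check of the 80/20 split, the shapes of the resulting arrays can be inspected; with 1440 samples and test_size=0.2, this should report 1152 training and 288 test instances:
>>>print X_train.shape, y_train.shape
>>>print X_test.shape, y_test.shape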
Perform Feature Selection on the Training Set
We take the Fisher Score algorithm as an example to explain how to perform feature selection on the training set. First, we compute the Fisher score of every feature using the training set.
Compute the Fisher scores and output the score of each feature:
>>>from skfeature.function.similarity_based import fisher_score
>>>score = fisher_score.fisher_score(X_train, y_train)
>>>print score
[ 13.96904931 0.5376816 0.19923194 ..., 3.71944606 14.01720752 14.05075518]
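For intuition, the classical Fisher score of a feature is the ratio of its between-class scatter to its within-class scatter, so features that separate the classes well score highly. Below is a minimal NumPy sketch of that textbook definition; skfeature's implementation may differ in numerical details, so treat this only as an illustration:
>>>def fisher_score_naive(X, y):
...     # ratio of between-class variance to within-class variance, per feature
...     classes = np.unique(y)
...     overall_mean = X.mean(axis=0)
...     between = np.zeros(X.shape[1])
...     within = np.zeros(X.shape[1])
...     for c in classes:
...         Xc = X[y == c]
...         between += Xc.shape[0] * (Xc.mean(axis=0) - overall_mean) ** 2
...         within += Xc.shape[0] * Xc.var(axis=0)
...     # guard against features that are constant within every class
...     return between / np.maximum(within, 1e-12)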
Rank the features in descending order of their Fisher scores and output the ranking indices:
>>>idx = fisher_score.feature_ranking(score)
>>>print idx
[1023 1022 31 ..., 34 97 897]
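The ranking step is essentially a descending argsort of the score vector; a rough NumPy equivalent, assuming feature_ranking orders purely by score as described above:
>>>idx_manual = np.argsort(score)[::-1]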
Specify the number of selected features (e.g., 5) for evaluation purposes:
>>>num_fea = 5
>>>selected_features_train = X_train[:, idx[0:num_fea]]
>>>selected_features_test = X_test[:, idx[0:num_fea]]
>>>print selected_features_train
[[ 0. 0. 0. 0. 0. ]
[ 0.02352941 0.02352941 0.02352941 0.02352941 0.02352941]
[ 0. 0. 0. 0. 0. ]
...,
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
>>>print selected_features_test
[[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
...,
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
Training a Classification Model with Selected Features
Here we choose the linear SVM as an example:
>>>from sklearn import svm
>>>clf = svm.LinearSVC()
Then we train a classification model with the selected features on the training set:
>>>clf.fit(selected_features_train, y_train)
Prediction Phase
Predict the class labels of the test data with the trained model:
>>>y_predict = clf.predict(selected_features_test)
>>>print y_predict
[19 19 2 ..., 19 19 19]
Performance Evaluation
Here, we use classification accuracy to measure the performance of the supervised feature selection algorithm Fisher Score:
>>>from sklearn.metrics import accuracy_score
>>>acc = accuracy_score(y_test, y_predict)
>>>print acc
0.09375
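The accuracy above is obtained with only 5 selected features. In practice, one usually repeats the evaluation for a range of values of num_fea; a minimal sketch of such a sweep (the list of sizes is arbitrary):
>>>for num_fea in [5, 10, 50, 100, 200]:
...     clf = svm.LinearSVC()
...     clf.fit(X_train[:, idx[0:num_fea]], y_train)
...     y_predict = clf.predict(X_test[:, idx[0:num_fea]])
...     print num_fea, accuracy_score(y_test, y_predict)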
For Unsupervised Learning Problems
Feature Selection
For unsupervised learning problems, we do not need to split the data into training and test sets; the whole dataset is used for feature selection. Here, we use the Laplacian Score as an example
to explain how to perform unsupervised feature selection.
First, we construct the affinity matrix required by the Laplacian Score. Here a k-nearest-neighbor graph (k = 5) with heat kernel weights (t = 1) on Euclidean distances is used:
>>>from skfeature.utility import construct_W
>>>kwargs_W = {"metric":"euclidean","neighbor_mode":"knn","weight_mode":"heat_kernel","k":5,'t':1}
>>>W = construct_W.construct_W(X, **kwargs_W)
Compute and output the Laplacian Score of each feature:
>>>from skfeature.function.similarity_based import lap_score
>>>score = lap_score.lap_score(X, W=W)
>>>print score
[ 0.01269462 0.00637613 0.00333286 ..., 0.0123851 0.01271441 0.01269681]
Rank the features in ascending order of their Laplacian scores (a smaller score indicates a more important feature) and output the ranking indices:
>>>idx = lap_score.feature_ranking(score)
>>>print idx
[ 34 65 966 ..., 28 963 996]
Specify the number of selected features (e.g., 5) for evaluation purposes:
>>>num_fea = 5
>>>selected_features = X[:, idx[0:num_fea]]
>>>print selected_features
[[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
[ 0.01960784 0.01960784 0.01960784 0.01960784 0.01960784]
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
...,
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
Performance Evaluation
Here, we use normalized mutual information (NMI) and accuracy (ACC) to measure the performance of the unsupervised feature selection algorithm Laplacian Score. Usually, the parameter n_clusters is set to the number of classes in the ground truth.
>>>from skfeature.utility import unsupervised_evaluation
>>>import numpy as np
>>>num_cluster = len(np.unique(y))
>>>print num_cluster
20
>>>nmi, acc = unsupervised_evaluation.evaluation(X_selected=selected_features, n_clusters=num_cluster, y=y)
>>>print nmi
0.415270585545
>>>print acc
0.197222222222
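Roughly speaking, the evaluation helper clusters the selected features with k-means and compares the cluster assignment to the ground truth labels. A minimal NMI-only sketch using scikit-learn (the skfeature helper may differ in details such as the number of k-means restarts and how ACC resolves the cluster-to-class matching):
>>>from sklearn.cluster import KMeans
>>>from sklearn.metrics import normalized_mutual_info_score
>>>cluster_labels = KMeans(n_clusters=num_cluster).fit_predict(selected_features)
>>>print normalized_mutual_info_score(y, cluster_labels)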