The loaded dataset is a dictionary-like object. The features of all instances are stored in mat['X'], and the ground truth class labels are stored in mat['Y'].
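The dataset itself can be loaded with scipy.io.loadmat. A minimal sketch, assuming COIL20.mat has been downloaded into the current working directory:
>>>import scipy.io
>>>mat = scipy.io.loadmat('COIL20.mat')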
For instance, in the COIL20 dataset, mat['X'] is the feature matrix:
>>>X = mat['X']
>>>print X
[[ 0.01568627 0.01568627 0.01568627 ..., 0.01568627 0.01568627 0.01568627]
 [ 0.01960784 0.01960784 0.01960784 ..., 0.01960784 0.01960784 0.01960784]
 [ 0.01568627 0.01568627 0.01568627 ..., 0.01568627 0.01568627 0.01568627]
 ...,
 [ 0. 0. 0. ..., 0. 0. 0. ]
 [ 0. 0. 0. ..., 0. 0. 0. ]
 [ 0. 0. 0. ..., 0. 0. 0. ]]
And mat['Y'] stores the ground truth class labels as a column vector; taking its first column gives a 1D label vector:
>>>y = mat['Y'][:, 0]
>>>print y
[ 1. 1. 1. ..., 20. 20. 20.]
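Note that the labels load as floating point numbers. If you prefer integer class labels (as shown in the later outputs of this guide), they can be cast explicitly; a small optional step:
>>>y = y.astype(int)
>>>print y
[ 1  1  1 ..., 20 20 20]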
Shape of the Data Arrays
The feature matrix is always represented by a 2D array with shape (n_samples, n_features). For example, for the COIL20.mat dataset, the function numpy.shape outputs the shape of the feature matrix:
>>>import numpy as np
>>>n_samples, n_features = np.shape(X)
>>>print n_samples, n_features
1440 1024
The label vector is always represented by a 1D array with shape (n_samples,):
>>>n_labels = np.shape(y)
>>>print n_labels
(1440L,)
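Each row of the feature matrix is one flattened image. If the 1024 features correspond to a 32x32 pixel grid (32*32 = 1024), a single instance can be reshaped and displayed with matplotlib; a rough sketch under that assumption:
>>>import matplotlib.pyplot as plt
>>>img = X[0].reshape(32, 32)  # assumes a row-major 32x32 pixel layout
>>>plt.imshow(img, cmap='gray')
>>>plt.show()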
For Supervised Learning Problems
Split Data into Training and Testing Set
The function sklearn.cross_validation.train_test_split (moved to sklearn.model_selection in newer scikit-learn releases) splits the data into training and test sets. Here we set the size of the test set to 20% of the data:
>>>from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.2, random_state=40)
>>>X_train
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.02352941, 0.02352941, 0.02352941, ..., 0.02352941, 0.02352941, 0.02352941],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
...,
[ 0.01568627, 0.01568627, 0.01568627, ..., 0.01568627, 0.01568627, 0.01568627],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ]])
>>>X_test
array([[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0.01568627, 0.01568627, 0.01568627, ..., 0.01568627, 0.01568627, 0.01568627],
...,
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ],
[ 0. , 0. , 0. , ..., 0. , 0. , 0. ]])
>>>print y_train
[17 2 12 ..., 1 4 19]
>>>print y_test
[8 15 1 ..., 15 17 7]
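As a quick sanity check of the 80/20 split, the shapes of the resulting arrays can be inspected; with 1440 samples and test_size=0.2, this should report 1152 training and 288 test instances:
>>>print X_train.shape, y_train.shape
>>>print X_test.shape, y_test.shape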
Perform Feature Selection on the Training Set
We take the Fisher Score algorithm as an example to explain how to perform feature selection on the training set. First, we compute the Fisher score of every feature using the training set.
Compute the Fisher scores and output the score of each feature:
>>>from skfeature.function.similarity_based import fisher_score
>>>score = fisher_score.fisher_score(X_train, y_train)
>>>print score
[ 13.96904931 0.5376816 0.19923194 ..., 3.71944606 14.01720752 14.05075518]
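For intuition, the classical Fisher score of a feature is the ratio of its between-class scatter to its within-class scatter, so features that separate the classes well score highly. Below is a minimal NumPy sketch of that textbook definition; skfeature's implementation may differ in numerical details, so treat this only as an illustration:
>>>def fisher_score_naive(X, y):
...     # ratio of between-class variance to within-class variance, per feature
...     classes = np.unique(y)
...     overall_mean = X.mean(axis=0)
...     between = np.zeros(X.shape[1])
...     within = np.zeros(X.shape[1])
...     for c in classes:
...         Xc = X[y == c]
...         between += Xc.shape[0] * (Xc.mean(axis=0) - overall_mean) ** 2
...         within += Xc.shape[0] * Xc.var(axis=0)
...     # guard against features that are constant within every class
...     return between / np.maximum(within, 1e-12)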
Rank the features in descending order of their Fisher scores and output the ranking indices:
>>>idx = fisher_score.feature_ranking(score)
>>>print idx
[1023 1022 31 ..., 34 97 897]
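The ranking step is essentially a descending argsort of the score vector; a rough NumPy equivalent, assuming feature_ranking orders purely by score as described above:
>>>idx_manual = np.argsort(score)[::-1]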
Specify the number of selected features (e.g., 5) for evaluation purposes:
>>>num_fea = 5
>>>selected_features_train = X_train[:, idx[0:num_fea]]
>>>selected_features_test = X_test[:, idx[0:num_fea]]
>>>print selected_features_train
[[ 0. 0. 0. 0. 0. ]
[ 0.02352941 0.02352941 0.02352941 0.02352941 0.02352941]
[ 0. 0. 0. 0. 0. ]
...,
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
>>>print selected_features_test
[[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
...,
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
Training a Classification Model with Selected Features
Here we choose the linear SVM as an example:
>>>from sklearn import svm
>>>clf = svm.LinearSVC()
Then we train a classification model with the selected features on the training set:
>>>clf.fit(selected_features_train, y_train)
Prediction Phase
Predict the class labels of the test data with the trained model:
>>>y_predict = clf.predict(selected_features_test)
>>>print y_predict
[19 19 2 ..., 19 19 19]
Performance Evaluation
Here, we use classification accuracy to measure the performance of the supervised feature selection algorithm Fisher Score:
>>>from sklearn.metrics import accuracy_score
>>>acc = accuracy_score(y_test, y_predict)
>>>print acc
0.09375
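The accuracy above is obtained with only 5 selected features. In practice, one usually repeats the evaluation for a range of values of num_fea; a minimal sketch of such a sweep (the list of sizes is arbitrary):
>>>for num_fea in [5, 10, 50, 100, 200]:
...     clf = svm.LinearSVC()
...     clf.fit(X_train[:, idx[0:num_fea]], y_train)
...     y_predict = clf.predict(X_test[:, idx[0:num_fea]])
...     print num_fea, accuracy_score(y_test, y_predict)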
For Unsupervised Learning Problems
Feature Selection
For unsupervised learning problems, we do not need to split the data into training and test sets; the whole dataset is used for feature selection. Here, we use the Laplacian Score as an example
to explain how to perform unsupervised feature selection.
First, we construct the affinity matrix required by the Laplacian Score. Here a k-nearest-neighbor graph (k = 5) with heat kernel weights (t = 1) on Euclidean distances is used:
>>>from skfeature.utility import construct_W
>>>kwargs_W = {"metric":"euclidean","neighbor_mode":"knn","weight_mode":"heat_kernel","k":5,'t':1}
>>>W = construct_W.construct_W(X, **kwargs_W)
Compute and output the Laplacian Score of each feature:
>>>from skfeature.function.similarity_based import lap_score
>>>score = lap_score.lap_score(X, W=W)
>>>print score
[ 0.01269462 0.00637613 0.00333286 ..., 0.0123851 0.01271441 0.01269681]
Rank the features in ascending order of their Laplacian scores (a smaller score indicates a more important feature) and output the ranking indices:
>>>idx = lap_score.feature_ranking(score)
>>>print idx
[ 34 65 966 ..., 28 963 996]
Specify the number of selected features (e.g., 5) for evaluation purposes:
>>>num_fea = 5
>>>selected_features = X[:, idx[0:num_fea]]
>>>print selected_features
[[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
[ 0.01960784 0.01960784 0.01960784 0.01960784 0.01960784]
[ 0.01568627 0.01568627 0.01568627 0.01568627 0.01568627]
...,
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
Performance Evaluation
Here, we use normalized mutual information (NMI) and accuracy (ACC) to measure the performance of the unsupervised feature selection algorithm Laplacian Score. Usually, the parameter n_clusters is set to the number of classes in the ground truth.
>>>from skfeature.utility import unsupervised_evaluation
>>>import numpy as np
>>>num_cluster = len(np.unique(y))
>>>print num_cluster
20
>>>nmi, acc = unsupervised_evaluation.evaluation(X_selected=selected_features, n_clusters=num_cluster, y=y)
>>>print nmi
0.415270585545
>>>print acc
0.197222222222
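Roughly speaking, the evaluation helper clusters the selected features with k-means and compares the cluster assignment to the ground truth labels. A minimal NMI-only sketch using scikit-learn (the skfeature helper may differ in details such as the number of k-means restarts and how ACC resolves the cluster-to-class matching):
>>>from sklearn.cluster import KMeans
>>>from sklearn.metrics import normalized_mutual_info_score
>>>cluster_labels = KMeans(n_clusters=num_cluster).fit_predict(selected_features)
>>>print normalized_mutual_info_score(y, cluster_labels)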