Gist agramfort/557193: Gaussian kernel density estimation with scipy
from scipy import stats


class GaussianKernelDensityEstimation(object):
    """Gaussian kernel density estimation based on scipy.stats.gaussian_kde."""

    def __init__(self):
        self.gkde = None

    def fit(self, X):
        """Fit the estimator on the training samples X of shape (n_samples, n_features)."""
        self.gkde = stats.gaussian_kde(X.T)
        return self

    def predict(self, X):
        """Evaluate the estimated density at the points X."""
        return self.gkde.evaluate(X.T)


if __name__ == '__main__':
    import numpy as np

    # Create some dummy data: two Gaussian blobs
    X = np.random.randn(400, 2)
    X[200:, :] += 3.5

    # Regular grid to evaluate the KDE upon
    n_grid_points = 128
    xmin, ymin = X.min(axis=0)
    xmax, ymax = X.max(axis=0)
    xx = np.linspace(xmin - 0.5, xmax + 0.5, n_grid_points)
    yy = np.linspace(ymin - 0.5, ymax + 0.5, n_grid_points)
    xg, yg = np.meshgrid(xx, yy)
    grid_coords = np.c_[xg.ravel(), yg.ravel()]

    # Compute the density estimation
    kde = GaussianKernelDensityEstimation()
    kde.fit(X)  # Fit
    zz = kde.predict(grid_coords)  # Evaluate density on grid points
    zz = zz.reshape(xg.shape)

    import matplotlib.pyplot as pl
    pl.close('all')
    pl.set_cmap(pl.cm.Paired)
    pl.figure()
    # Note: contourf entries do not show up in a legend, so no label here
    pl.contourf(xx, yy, zz)
    pl.scatter(X[:, 0], X[:, 1], color='k', label='samples')
    pl.axis('tight')
    pl.legend()
    pl.title('Kernel Density Estimation')
    pl.show()
Yes we can (TM). We can use our ball tree to select only the training points that contribute significantly to a given test point. The problem is that with this strategy it is harder to vectorize 'predict' over the test data. Whether scipy's default behavior or the one I am suggesting is best depends on the 'shape' of your data: the number of training points, the number of test points, and the number of dimensions.
That means we need a heuristic to decide which strategy to apply.
Also, grouping test points that are close together so they share a common 'active set', and then vectorizing 'predict' over that bunch, would help.
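The ball-tree 'active set' idea can be sketched like this (a minimal sketch using scikit-learn's BallTree; the function name, `bandwidth` and `radius_factor` are illustrative choices, not an existing API):

```python
import numpy as np
from sklearn.neighbors import BallTree


def kde_predict_active_set(X_train, X_test, bandwidth, radius_factor=4.0):
    """Evaluate an isotropic Gaussian KDE keeping, for each test point,
    only the training points within a few bandwidths (the 'active set').
    """
    n_train, n_dim = X_train.shape
    tree = BallTree(X_train)
    # Beyond a few bandwidths the Gaussian kernel is negligible
    radius = radius_factor * bandwidth
    # Normalization constant of an isotropic Gaussian kernel in n_dim dimensions
    norm = n_train * (2 * np.pi * bandwidth ** 2) ** (n_dim / 2.0)
    density = np.zeros(len(X_test))
    # For each test point, indices of the training points in its active set
    ind = tree.query_radius(X_test, r=radius)
    for i, neighbors in enumerate(ind):
        if len(neighbors) == 0:
            continue  # no training point nearby: density is (numerically) zero
        diff = X_train[neighbors] - X_test[i]
        sq_dist = np.sum(diff ** 2, axis=1)
        density[i] = np.exp(-0.5 * sq_dist / bandwidth ** 2).sum() / norm
    return density
```

The loop over test points is what the comment above proposes to avoid by grouping nearby test points into batches that share one active set, trading a slightly larger neighbor list for vectorized kernel evaluations.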
There is some work to do to get a decent KDE :)
OK, so we'll say that this is work in progress... At least the discussion was constructive :)
Argh! I didn't know. Any chance we can do better?