cs189-hw01
Prerequisites
First, let's look at what each of the modules we will use actually does.
scikit-learn
Scikit-learn is an open-source machine learning library supporting supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
In hw1 we only use its SVM functionality, which is introduced below.
datasets
There is a set of .npz files that can be loaded as Python dictionaries. Each dictionary contains three fields (see the loading sketch after this list):
- training data, the training set features. Rows are sample points and columns are features.
- training labels, the training set labels. Rows are sample points. There is one column: the labels corresponding to rows of training data above.
- test data, the test set features. Rows are sample points and columns are features. You will fit a model to predict the labels for this test set, and submit those predictions to Kaggle.
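A minimal sketch of loading one of these files; the path and the exact key names below are assumptions based on the field descriptions above, so check `data.files` on your copy:

```python
import numpy as np

# np.load on an .npz archive returns a dict-like NpzFile object.
# Path and key names are assumptions -- inspect data.files to confirm.
data = np.load("data/mnist-data.npz")
print(data.files)

X_train = data["training_data"]    # rows are sample points, columns are features
y_train = data["training_labels"]  # one label per row of training_data
X_test = data["test_data"]         # unlabeled test set for Kaggle predictions
print(X_train.shape, y_train.shape, X_test.shape)
```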
The three datasets are:
- toy-data.npz is a synthetic dataset with two features (2-dimensional) and two classes. The training set has 1,000 examples, and no test set is provided. This dataset is only used in Section 2 of this homework.
- mnist-data.npz contains data from the MNIST dataset. There are 60,000 labeled images of handwritten digits for training, and 10,000 for testing. The images are flattened grayscale, 28 × 28 pixels. There are 10 possible labels for each image, namely, the digits 0–9.
- spam-data.npz contains featurized spam data. The labels are 1 for spam and 0 for ham. The data folder includes the script featurize.py and the folders spam, ham (not spam), and test (unlabeled test data); you may modify featurize.py to generate new features for the spam data.
That covers the preliminaries; next, let's introduce SVMs.
SVM
A support vector machine (SVM, also called a support vector network) is a supervised learning model, with associated learning algorithms, for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new examples to one class or the other, making it a non-probabilistic binary linear classifier. An SVM model represents the examples as points in space, mapped so that the examples of the separate classes are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a class based on which side of the gap they fall on.
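To make this concrete, here is a minimal sketch of training a linear SVM with scikit-learn on synthetic two-class data; the data below is made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic Gaussian blobs, one per class (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Linear-kernel SVM; C controls how soft the margin is.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# New points are classified by which side of the gap they fall on.
print(clf.predict([[-1.5, -1.0], [2.5, 1.0]]))
```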
The written part of this homework will be posted here in LaTeX format (to be added later).
Data Partitioning and Evaluation Metrics
Data Partitioning
The task is essentially to split each dataset into "training" data and "validation" data: first organize the data, then shuffle it before splitting. For the MNIST dataset, write code that sets aside 10,000 training images as a validation set. For the spam dataset, write code that sets aside 20% of the training data as a validation set.
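A minimal sketch of one way to do the shuffle-then-split, assuming the arrays have already been loaded as above:

```python
import numpy as np

def train_val_split(X, y, n_val, seed=42):
    """Shuffle the sample points, then hold out the last n_val for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # random reordering of the row indices
    X, y = X[idx], y[idx]
    return X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]

# MNIST: set aside 10,000 of the 60,000 training images.
# X_tr, y_tr, X_val, y_val = train_val_split(X_train, y_train, 10000)
# Spam: set aside 20% of the training data.
# X_tr, y_tr, X_val, y_val = train_val_split(X_spam, y_spam, int(0.2 * len(X_spam)))
```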
Here are a few functions we will use, mostly from NumPy.
numpy.reshape
numpy.reshape(a, /, shape=None, *, newshape=None, order='C', copy=None)
Gives a new shape to an array without changing its data.
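For instance, a stack of 28 × 28 MNIST images can be flattened into one row per image (the exact array shape here is an assumption):

```python
import numpy as np

images = np.zeros((60000, 28, 28))      # stand-in for the MNIST training images
flat = images.reshape(len(images), -1)  # -1 lets NumPy infer 28 * 28 = 784
print(flat.shape)                       # (60000, 784)
```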
numpy.concatenate
numpy.concatenate((a1, a2, ...), axis=0, out=None, dtype=None, casting="same_kind")
Join a sequence of arrays along an existing axis. The axis argument controls which axis the arrays are joined along.
Unlike numpy.append, which joins exactly two arrays at a time (and flattens them when no axis is given), concatenate accepts a whole sequence of multi-dimensional arrays in one call.
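A quick illustration of joining along each axis:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

print(np.concatenate((a, b), axis=0))    # stack as new rows    -> shape (3, 2)
print(np.concatenate((a, b.T), axis=1))  # stack as new columns -> shape (2, 3)
```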
And one more tool that is essential for us: Jupyter Notebook.
Evaluation Metrics
This is simply for evaluating the model (computing its accuracy), which we will need in the next problem.
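Accuracy is just the fraction of predictions that match the true labels; a short sketch:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of predictions that equal the true labels."""
    return np.mean(np.asarray(y_pred) == np.asarray(y_true))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```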
In addition, we also need to normalize arrays, using the following formula:
\[ \text{normalized data} = \frac{\text{data} - \min(\text{data})}{\max(\text{data}) - \min(\text{data})} \]
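In NumPy this min-max normalization takes a couple of lines; scaling per feature column is an assumption about how the data should be treated:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column into the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)  # assumes no constant column (mx != mn)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(min_max_normalize(X))
```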