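Feature

A feature (Japanese: 特徴量) is a numerical value obtained by transforming data; it expresses a characteristic of that data and is used in subsequent processing^[1]. It is also called a representation.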

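Overview

Raw data is not always in a convenient form. Two attributes may carry the same meaning and be redundant, or conversely two meanings may be entangled in a single value. If raw data can be transformed into a better form, it can be used more effectively. A feature is such a transformed, well-formed value that is consumed by a downstream task.

Features are extracted from raw data. The extraction method is either designed using expert knowledge or learned from data by machine learning.

Features exist to be used. For example, they are used for classification (photo → feature → object category), generation (text → feature → image), and compression (audio → feature → audio). The properties a feature should have differ by use: the size of the feature matters greatly for compression, whereas size is a lower priority for other uses.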

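Extraction

Features are produced by transforming data. This transformation is called feature extraction. Manually searching for transformation rules using expert knowledge is called feature engineering^[2]; when the transformation is obtained by machine learning, it is called representation learning.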
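As an illustration of hand-designed feature extraction, the sketch below computes TF-IDF, one of the document statistics named in the cited survey. The function name `tfidf_features`, the whitespace tokenizer, and the unsmoothed `log(N / df)` weighting are assumptions made for this example; libraries use several different variants.

```python
import math
from collections import Counter


def tfidf_features(docs):
    """Map each document (a string) to a TF-IDF vector over a shared vocabulary.

    Illustrative assumptions: whitespace tokenization, term frequency
    normalized by document length, and idf = log(N / df) without smoothing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tfidf_features(docs)
```

A term that occurs in every document, such as "the" here, receives weight zero, which is the kind of prior knowledge a hand-designed rule encodes.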

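Because feature extraction can be viewed as projecting observations (raw data) into a feature space, it is also called embedding. In natural language processing, feature extraction for words is called word embedding.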
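A word embedding can be pictured as a lookup table from discrete words to vectors. The following is a minimal sketch under that assumption; the class `EmbeddingTable` is hypothetical, and the random values stand in for weights that a real system would learn from data.

```python
import random


class EmbeddingTable:
    """Hypothetical lookup table mapping each word to a dense vector."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.index = {word: i for i, word in enumerate(vocab)}
        # Random values stand in for weights that training would normally learn.
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

    def embed(self, word):
        """Return the feature vector (embedding) of one word."""
        return self.weights[self.index[word]]


table = EmbeddingTable(["cat", "dog", "car"], dim=4)
cat_vec = table.embed("cat")
```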

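Representation learning

Representation learning (also feature learning) is feature extraction performed by machine learning.

There are many representation learning methods^[3]. Examples include:

Principal component analysis (PCA)
Linear discriminant analysis (LDA)
BERT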
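Of the methods listed, PCA is the simplest to sketch. The example below assumes NumPy and computes the principal directions through a singular value decomposition of the centered data; the function name `pca_fit_transform` and the toy data are assumptions for illustration.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the projected features and the component directions.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T, components


rng = np.random.default_rng(0)
# Toy data: two raw attributes that carry almost the same information.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200)])
features, components = pca_fit_transform(X, n_components=1)
```

In this toy case the two redundant raw attributes collapse into a single feature whose direction is close to (1, 2) up to sign and scale.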

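Criteria for classifying transformation methods include linear versus nonlinear, supervised versus self-supervised versus unsupervised, and shallow versus deep. These distinctions are also related to how the features are used.

When representation learning is carried out before task learning, it can be regarded as pretraining. Because pretraining can be separated from task learning, it is possible to perform unsupervised pretraining on a large amount of data for representation learning and then perform supervised learning with labels for the task. In addition, metric learning learns to embed data into a metric space, so it can be used as representation learning^[4].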
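The cited source relates metric learning to triplet-based losses. As a minimal sketch, the function below evaluates one common form of triplet loss on already-embedded points; the use of unsquared Euclidean distance and the default `margin` are assumptions, and published variants differ in both.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedded points.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


well_separated = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])
poorly_separated = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

Minimizing such a loss over the parameters of an embedding function pulls same-class points together and pushes different-class points apart, which is why the learned embedding can serve as a feature.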

Selecting only the useful features from a given feature set is called feature selection.
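One simple way to make "useful" concrete is to drop features that barely vary across samples. The sketch below assumes that criterion; the function name `select_by_variance` and the threshold are illustrative, and many other selection criteria exist.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Return indices of feature columns whose variance exceeds `threshold`."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]


rows = [
    [1.0, 0.0, 5.0],
    [1.0, 1.0, 3.0],
    [1.0, 0.0, 4.0],
    [1.0, 1.0, 6.0],
]
kept = select_by_variance(rows, threshold=0.1)
# Keep only the selected columns of every sample.
selected_rows = [[row[i] for i in kept] for row in rows]
```

The first column is constant and therefore carries no information for distinguishing samples, so it is removed.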

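Properties

Features are required to have various properties depending on their use. Relevant considerations include extraction cost, human interpretability, and downstream task performance. Features are also classified as discrete or continuous: a discrete feature takes values in a finite set, whereas a continuous feature has a set number of dimensions and varies continuously within them. They are further classified by whether a distance is defined on the feature space. Entanglement of attributes is another important property.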

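Evaluation

Various evaluation metrics exist for features, and the appropriate one depends on the downstream task. The following is an example of an evaluation method.

Linear evaluation: a linear classifier is trained on the features, and its test accuracy is taken as the evaluation of the features^[5]. Suited to downstream classification tasks.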
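The linear evaluation protocol can be sketched as follows, assuming NumPy. The encoder is kept fixed, a logistic regression is fitted on its outputs, and test accuracy is reported. `frozen_encoder` is a hypothetical stand-in for a pretrained network, and the toy task, learning rate, and step count are assumptions.

```python
import numpy as np


def frozen_encoder(x):
    """Stand-in for a pretrained, frozen feature extractor (hypothetical)."""
    # Fixed nonlinear map; nothing below ever updates it.
    return np.column_stack([x[:, 0] ** 2 + x[:, 1] ** 2, x[:, 0], x[:, 1]])


def train_linear_classifier(feats, labels, lr=0.1, steps=5000):
    """Fit logistic regression on fixed features with plain gradient descent."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = p - labels
        w -= lr * (feats.T @ grad) / len(labels)
        b -= lr * grad.mean()
    return w, b


def linear_eval_accuracy(encoder, x_train, y_train, x_test, y_test):
    """Train a linear classifier on encoder outputs and return test accuracy."""
    w, b = train_linear_classifier(encoder(x_train), y_train)
    pred = (encoder(x_test) @ w + b) > 0
    return float((pred == y_test.astype(bool)).mean())


rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(600, 2))
y = ((x ** 2).sum(axis=1) < 1.5).astype(float)  # label 1 inside a circle

accuracy = linear_eval_accuracy(frozen_encoder, x[:400], y[:400], x[400:], y[400:])
# Baseline: the same protocol applied to the raw inputs.
raw_accuracy = linear_eval_accuracy(lambda v: v, x[:400], y[:400], x[400:], y[400:])
```

Because only a linear model sits on top, a higher score indicates that the features themselves make the classes easier to separate, which is what the protocol is meant to measure.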

Usage

Methods

Usage can be broadly divided into two approaches according to whether the features are separated from the task^[6].

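Task input

Features can be used as input to a task and to its training (feature-based approach)^[7]. This is possible because feature extraction and the task can be separated.

An advantage is that the features and the task can be learned from different datasets. For example, an object recognition task requires labeled data (supervised learning), which is laborious to collect. In contrast, some methods for learning image representations use unlabeled photos (unsupervised or self-supervised learning), for which large amounts of data can be collected with little effort. Hence a good classifier can be obtained by first learning good features from a large amount of data and then learning the task from those features and a small amount of labeled data.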

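Fine-tuning

Although representation learning and task learning are separable, they can also be carried out in stages without being fully separated. That is, the representation model is trained first (pretraining), and then the representation model and the task model are connected into a single model on which task learning is performed (fine-tuning approach)^[8]. Because different data can be used for pretraining and task learning, this retains the advantage of the separated approach. In addition, the representation model itself is also optimized for the task. BERT, a language model, is a prominent example^[9].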
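The difference between the feature-based and fine-tuning approaches comes down to which parameters the task loss is allowed to update. The toy sketch below makes that explicit; the representation model and the task model are each reduced to a single scalar weight, and all names and values are assumptions for illustration.

```python
def train(xs, ys, w_enc, w_head, freeze_encoder, lr=0.01, steps=200):
    """Gradient descent on squared error for y_hat = w_head * (w_enc * x).

    `w_enc` plays the role of a pretrained representation model and `w_head`
    the task model; both are single scalars purely for illustration.
    """
    n = len(xs)
    for _ in range(steps):
        grad_enc = grad_head = 0.0
        for x, y in zip(xs, ys):
            feature = w_enc * x
            err = w_head * feature - y
            grad_head += 2 * err * feature / n
            grad_enc += 2 * err * w_head * x / n
        w_head -= lr * grad_head
        if not freeze_encoder:
            # Fine-tuning: the task loss also updates the pretrained encoder.
            w_enc -= lr * grad_enc
    return w_enc, w_head


xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]      # target relation: y = 3x
pretrained_w_enc = 1.5    # hypothetical result of an earlier pretraining stage

feature_based = train(xs, ys, pretrained_w_enc, 0.5, freeze_encoder=True)
fine_tuned = train(xs, ys, pretrained_w_enc, 0.5, freeze_encoder=False)
```

Both runs fit the task, but only the fine-tuned run moves the encoder weight away from its pretrained value.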

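Applications

Features are used for a variety of purposes.

Generation

Features are used in generation tasks. Generation tasks often require control over the attributes of the generated output; for example, face photo generation may require specifying the hair color. This becomes possible if hair color can be supplied as a feature. In doing so, the hair color feature must not disturb the other attributes. For this reason, features for generation are often required to be disentangled.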
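The requirement that editing one feature must not disturb other attributes can be shown with a toy linear generator. In the sketch below, the two output attributes (hair color and face shape), the mixing matrices, and the feature values are all hypothetical; real generators are nonlinear and far larger.

```python
def decode(features, mixing):
    """Toy linear generator: each output attribute is a weighted sum of features."""
    return [sum(w * f for w, f in zip(row, features)) for row in mixing]


# Output attributes: [hair_color, face_shape] (hypothetical toy attributes).
DISENTANGLED = [[1.0, 0.0],
                [0.0, 1.0]]
ENTANGLED = [[1.0, 0.5],
             [0.5, 1.0]]

base = [0.2, 0.7]
edited = [0.9, 0.7]  # only the first feature (intended: hair color) is changed


def attribute_change(mixing):
    """Per-attribute change in the output caused by editing the first feature."""
    before = decode(base, mixing)
    after = decode(edited, mixing)
    return [a - b for a, b in zip(after, before)]


clean_edit = attribute_change(DISENTANGLED)
leaky_edit = attribute_change(ENTANGLED)
```

With the disentangled mapping the face shape change is exactly zero, whereas with the entangled mapping the same edit also shifts the face shape.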

The latent representation of an autoencoder is a feature.

Footnotes

[1] "data representation learning is a critical step to facilitate the subsequent ... tasks. ... how to learn the intrinsic structure of data and discover valuable information from data" Zhong, et al. (2017). An overview on data representation learning: From traditional feature learning to recent deep learning.
[2] "manual feature engineering methods, such as image descriptors (... SIFT ... LBP ... HOG ...) and document statistics ( ... TF-IDF ...)" Zhong, et al. (2017). An overview on data representation learning: From traditional feature learning to recent deep learning.
[3] "Since about 100 years ago, many data representation learning methods have been proposed." Zhong, et al. (2017). An overview on data representation learning: From traditional feature learning to recent deep learning.
[4] "Closely related to contrastive learning is the family of losses based on metric distance learning or triplets" Khosla, et al. (2020). Supervised Contrastive Learning.
[5] "we follow the widely used linear evaluation protocol ... where a linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality" Chen, et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations.
[6] "There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning." Devlin, et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[7] "The feature-based approach ... uses task-specific architectures that include the pre-trained representations as additional features." Devlin, et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[8] "The fine-tuning approach ... introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters." Devlin, et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[9] "We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures." Devlin, et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
