# LSA

LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), is a commonly used topic model.

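## Idea

Documents and terms are related to one another, and a collection of documents together with its vocabulary forms a semantic structure through these relations. LSA refines that structure by removing its redundant, minor factors.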

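## Approach

Apply **singular value decomposition** (SVD) to the high-dimensional **term-document co-occurrence matrix** to map the original document vectors into a low-dimensional latent semantic space, i.e. the topic space, whose dimensionality equals the number of topics.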

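## Significance

* Reduces dimensionality, which shrinks the size of the problem
* Reveals deeper connections between words that appear unrelated on the surface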

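## Mathematical formulation

Given $$d$$ documents and $$t$$ terms, build a **term-document co-occurrence matrix** $$X$$ of size $$(t, d)$$. Each element $$X\_{ij}$$ can be:

* the number of times the $$i$$-th term occurs in the $$j$$-th document
* a **tf-idf** value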
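
As a rough illustration of the two weighting options above, the following sketch builds a $$(t, d)$$ matrix from a made-up three-document corpus. The function names `build_term_document_matrix` and `tfidf`, the whitespace tokenizer, and the specific `log(d / df)` idf formula are assumptions chosen for brevity; tf-idf has several common variants and the original text does not say which one is intended.

```python
import math
from collections import Counter

def build_term_document_matrix(docs):
    """Return (vocab, X) where X[i][j] is the count of term i in document j."""
    # Naive whitespace tokenization; a real pipeline would do more.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Rows are terms, columns are documents, so X has shape (t, d).
    X = [[c[term] for c in counts] for term in vocab]
    return vocab, X

def tfidf(X):
    """Weight raw counts by idf = log(d / df), one of several tf-idf variants."""
    d = len(X[0])
    weighted = []
    for row in X:
        df = sum(1 for v in row if v > 0)  # number of documents containing the term
        idf = math.log(d / df)
        weighted.append([v * idf for v in row])
    return weighted

# Hypothetical toy corpus, for illustration only.
docs = ["human machine interface", "machine learning system", "graph tree graph"]
vocab, X = build_term_document_matrix(docs)
W = tfidf(X)
```

Either `X` or `W` can serve as the input matrix for the decomposition; rows index terms and columns index documents in both.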

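The **LSA steps** are as follows:

* **SVD** decomposes $$X$$ as $$X=T\_0S\_0D\_0^T$$, where $$T\_0$$ has size $$(t, r)$$, $$S\_0$$ is an $$(r,r)$$ diagonal matrix whose diagonal entries are the **singular values**, and $$D\_0^T$$ has size $$(r,d)$$. Here $$r$$ is the rank of $$X$$
* Keep the $$k$$ largest entries of $$S\_0$$, with $$k\lt{r}$$; $$k$$ is both the reduced dimensionality and the number of topics. Form a $$(k,k)$$ diagonal matrix from these $$k$$ values, and take the corresponding $$k$$ columns of $$T\_0$$ and $$k$$ rows of $$D\_0^T$$. This gives $$X\approx{\hat{X}}=TSD^T$$, where $$T$$ has size $$(t,k)$$, $$S$$ has size $$(k,k)$$, and $$D^T$$ has size $$(k,d)$$. $$\hat{X}$$ is the refined semantic structure
* For a new document, first convert it into a term-frequency or tf-idf vector $$X\_q$$, a column vector of size $$(t,1)$$. Then project it to obtain $$D\_q=X\_q^TTS^{-1}$$, of size $$(1,k)$$
* $$D$$, of size $$(d,k)$$, holds the topic-space vectors of all $$d$$ training documents after dimensionality reduction. Comparing $$D\_q$$ with the rows of $$D$$ yields a similarity measure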
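
The steps above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `lsa_fit`, `lsa_fold_in`, and `cosine_similarities` are hypothetical, the small matrix is made-up toy data, and cosine similarity is only one assumed choice for comparing $$D\_q$$ with $$D$$ (the text does not prescribe a measure).

```python
import numpy as np

def lsa_fit(X, k):
    """Truncated SVD of a (t, d) matrix; returns T (t, k), S (k,), D (d, k)."""
    # np.linalg.svd returns singular values sorted in descending order,
    # so the first k components are the k largest.
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], s0[:k], D0t[:k, :].T

def lsa_fold_in(x_q, T, S):
    """Project a (t,) term vector into topic space: D_q = x_q^T T S^{-1}."""
    # S holds the diagonal of the (k, k) matrix, so dividing applies S^{-1}.
    return (x_q @ T) / S

def cosine_similarities(D_q, D):
    """Cosine similarity between D_q (k,) and each row of D (d, k)."""
    norms = np.linalg.norm(D, axis=1) * np.linalg.norm(D_q)
    return (D @ D_q) / np.where(norms == 0, 1.0, norms)

# Toy (t=5, d=6) count matrix, for illustration only.
X = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

T, S, D = lsa_fit(X, k=2)
X_hat = T @ np.diag(S) @ D.T          # rank-2 approximation of X
D_q = lsa_fold_in(X[:, 0], T, S)      # treat the first document as a "new" query
sims = cosine_similarities(D_q, D)    # similarity to every training document
```

Folding in a column of $$X$$ itself reproduces the matching row of $$D$$, which is a convenient sanity check on the projection formula.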

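## Advantages of LSA

* It no longer reflects only the raw frequency and distribution of terms, but a strengthened semantic relationship expressed through topics
* The representation is low-dimensional, so large text corpora can be handled efficiently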

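## Disadvantages of LSA

* SVD is sensitive to changes in the data and uses no prior knowledge, which makes it rather mechanical
* It is a **bag-of-words** model, so it ignores syntax, word order, and similar information
* Hyperparameter: the number of topics strongly affects the result, and model performance fluctuates widely and irregularly as it changes, which makes tuning difficult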
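
To get a feel for how the number of topics $$k$$ affects the approximation, one can inspect how much of $$X$$ is lost at each truncation level. The sketch below is an assumed diagnostic, not something the text prescribes: it relies on the fact that the squared Frobenius error of the rank-$$k$$ truncation equals the sum of the discarded squared singular values. The name `truncation_errors` and the toy matrix are hypothetical. Low reconstruction error does not by itself guarantee good downstream behavior, so this only narrows the search.

```python
import numpy as np

def truncation_errors(X):
    """Relative Frobenius error of the rank-k approximation for k = 1..min(t, d)."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    energy = np.cumsum(s ** 2)                # energy kept by the first k components
    # Clip guards against tiny negative values from floating-point rounding.
    return np.sqrt(np.clip(1.0 - energy / energy[-1], 0.0, None))

# Toy (t=5, d=6) count matrix, for illustration only.
X = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

errors = truncation_errors(X)
for k, err in enumerate(errors, start=1):
    print(f"k={k}: relative error {err:.3f}")
```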

