冷启动推荐

冷启动

冷启动就需要user或者item的特征来配合了, 仅使用协同过滤的FM模型是无法做到冷启动的.

我们选择StackExchange dataset作为数据集, 这个数据集除了行为交互数据, 还有item feature, 因此可以做关于item的冷启动.

import numpy as np
from lightfm.datasets import fetch_stackexchange

data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)
data

{'train': <3221x72360 sparse matrix of type '<class 'numpy.float32'>'
     with 57830 stored elements in COOrdinate format>,
 'test': <3221x72360 sparse matrix of type '<class 'numpy.float32'>'
     with 4307 stored elements in COOrdinate format>,
 'item_features': <72360x1246 sparse matrix of type '<class 'numpy.float32'>'
     with 198963 stored elements in Compressed Sparse Row format>,
 'item_feature_labels': array(['bayesian', 'prior', 'elicitation', ..., 'events', 'mutlivariate',
        'sample-variance'], dtype='<U50')}

除了划分好的train和test, 还有一个shape为(72360, 1246)的item_feature矩阵, 共72360个item, item对应着1246个特征.

train, test = data['train'], data['test']
item_features = data["item_features"]
tag_labels = data['item_feature_labels']
tag_labels

array(['bayesian', 'prior', 'elicitation', ..., 'events', 'mutlivariate',
       'sample-variance'], dtype='<U50')

NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

model = LightFM(loss='bpr', item_alpha=ITEM_ALPHA, no_components=NUM_COMPONENTS)
model = model.fit(train, item_features=item_features, epochs=NUM_EPOCHS, num_threads=NUM_THREADS)

train_auc = auc_score(model, train, item_features=item_features, num_threads=NUM_THREADS).mean()
test_auc = auc_score(model, test, item_features=item_features, num_threads=NUM_THREADS).mean()
train_auc, test_auc

(0.8141281, 0.72254986)

user_count = np.array(train.sum(axis=0)).ravel()
np.sum(user_count == 0)

统计得到共有33827个item是没有跟用户发生过交互行为的, 可以认为是新的物品上线, 为现有的用户进行推荐.

取其中一个没有发生过交互的物品计算出要被推荐的用户.

t = model.predict(item_ids=np.ones(train.shape[0]) * (train.shape[1] - 1), user_ids=np.arange(train.shape[0]), item_features=item_features)
t.argsort()

array([ 145,  159,  128, ..., 1981,  375, 2349], dtype=int64)

上面就是应该把这个新的item推荐给的用户列表.

特征Embedding空间

训练之后也得到了每个item feature对应的向量, 由于在同一个空间中, 就可以通过向量之间的各种距离, 衡量两个特征的近似性.

def get_similar_tags(model, tag_id):
    # Define similarity as the cosine of the angle
    # between the tag latent vectors

    # Normalize the vectors to unit length
    tag_embeddings = (model.item_embeddings.T
                      / np.linalg.norm(model.item_embeddings, axis=1)).T

    query_embedding = tag_embeddings[tag_id]
    similarity = np.dot(tag_embeddings, query_embedding)
    most_similar = np.argsort(-similarity)[1:4]

    return most_similar


for tag in (u'bayesian', u'regression', u'survival'):
    tag_id = tag_labels.tolist().index(tag)
    print('Most similar tags for %s: %s' % (tag_labels[tag_id],
                                            tag_labels[get_similar_tags(model, tag_id)]))

Most similar tags for bayesian: ['metropolis-hastings' 'gibbs' 'prior']
Most similar tags for regression: ['multiple-regression' 'goodness-of-fit' 'poisson-regression']
Most similar tags for survival: ['odds-ratio' 'kaplan-meier' 'epidemiology']

最后更新于5年前

hashtag冷启动

hashtag特征Embedding空间

冷启动

特征Embedding空间