July 12 Talk by Hongxia Yang (Alibaba)

Posted by: Xiaoying Zhou    Posted on: 2018-07-09



School of Statistics and Data Science



  -------------------------------------------------

             Academic Lecture



Title: Extremely Large Scale Graphical Model in Practice


Speaker: Hongxia Yang, Alibaba


Time: July 12, 2018, 9:30


Venue: Room 432, School of Statistics and Data Science


Abstract: Extremely large scale graphical models have been playing an increasingly important role in big data companies. In particular, graph inference combined with deep learning has achieved promising interim results in many of Alibaba's business scenarios. The data of the Alibaba ecosystem is extremely rich and varied, covering shopping, travel, entertainment, and payment. We are developing a new generation of graph learning platform that can efficiently perform inference on billions of nodes and billions of edges. In this talk, I will share two related works that have been accepted by IJCAI and KDD 2018, respectively:


1. Network representation learning (RL) aims to map the nodes of a network into a low-dimensional vector space while preserving the inherent properties of the network. Though network RL has been intensively studied, most existing works focus on either network structure or node attribute information. In this paper, we propose a novel framework, named ANRL, to incorporate both the network structure and node attribute information in a principled way. Specifically, we propose a neighbor enhancement autoencoder to model the node attribute information, which reconstructs a node's target neighbors instead of the node itself. To capture the network structure, an attribute-aware skip-gram model is designed based on the attribute encoder to formulate the correlations between each node and its direct or indirect neighbors. We conduct extensive experiments on six real-world networks, including two social networks, two citation networks and two user behavior networks. The results empirically show that ANRL can achieve relatively significant gains in node classification and link prediction tasks.


2. The e-commerce era is witnessing a rapid increase in mobile Internet users. Major e-commerce companies nowadays see billions of mobile accesses every day. Hidden in these records are valuable user behavioral characteristics such as shopping preferences and browsing patterns. To extract this knowledge from the huge dataset, we first need to link records to the corresponding mobile devices. This Mobile Access Records Resolution (MARR) problem faces two major challenges: (1) device identifiers and other attributes in access records might be missing or unreliable; (2) the dataset contains billions of access records from millions of devices. To the best of our knowledge, as this is a novel and challenging industrial problem of the mobile Internet, no existing method has been developed to resolve entities using mobile device identifiers at such a massive scale. To address these issues, we propose a SParse Identifier-linkage Graph (SPI-Graph), accompanied by abundant mobile device profiling data, to accurately match mobile access records to devices. Furthermore, two versions (unsupervised and semi-supervised) of the Parallel Graph-based Record Resolution (PGRR) algorithm are developed to effectively exploit the advantages of large-scale server clusters comprising more than 1,000 computing nodes. We empirically show the superior performance of the PGRR algorithms on a very challenging and sparse real data set containing 5.28 million nodes and 31.06 million edges.
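As a rough illustration of the record-linkage idea in the second abstract, the sketch below groups access records that share an identifier value by taking connected components with a union-find structure. It is a single-machine toy under stated assumptions, not the SPI-Graph or PGRR method from the paper: the function name `resolve_records` and the identifier fields (`imei`, `idfa`, `mac`) are hypothetical, and missing identifiers are simply skipped rather than modeled.

```python
from collections import defaultdict


def resolve_records(records):
    """Group access records that share at least one identifier value.

    `records` is a list of dicts mapping identifier type -> value
    (hypothetical field names). Missing or empty values are skipped.
    Returns a sorted list of groups of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the representative.
            parent[max(ri, rj)] = min(ri, rj)

    # Each (identifier type, value) pair links all records that carry it.
    first_seen = {}
    for idx, rec in enumerate(records):
        for key, value in rec.items():
            if not value:
                continue  # missing identifier: contributes no link
            token = (key, value)
            if token in first_seen:
                union(idx, first_seen[token])
            else:
                first_seen[token] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Toy input: records 0, 1 and 2 are chained through shared identifiers.
example_records = [
    {"imei": "A1", "idfa": None},
    {"imei": "A1", "mac": "M9"},
    {"mac": "M9"},
    {"imei": "B2"},
    {},
]
example_groups = resolve_records(example_records)
```

Treating every shared identifier as a certain link is the simplest possible choice; it would over-merge devices when identifiers are unreliable, which is one of the challenges the abstract names.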



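Speaker bio: Dr. Yang holds a Ph.D. from Duke University in the United States. Formerly a Watson researcher at the IBM global research and development center and a principal data scientist at Yahoo!, Dr. Yang is now a senior algorithm expert at Alibaba, leading a team that develops intelligent algorithms built on the data middle platform; these algorithms reliably support more than 30 of Alibaba's core business units, including search and advertising, and their business scenarios. Dr. Yang has published over 30 papers in top international statistics and machine learning journals and conferences (including JASA, ICML, AISTATS, KDD, ICDM, CIKM, and WWW), holds 9 US patents, serves as an associate editor of Applied Stochastic Models in Business and Industry and as a council member of the International Statistical Institute, and was selected for the Zhejiang Province Thousand Talents Program in 2017.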


Host: Dr. Ximing Xu


All faculty and students are welcome to attend!

School of Statistics and Data Science

June 28, 2018