As core applications of the internet, recommendation, search, and advertising have always been the main battlefields of development and innovation in this era of personalization. They are also the technical moats built by giant companies such as Google, Amazon, and Alibaba. These fields deal with internet-scale problems marked by high nonlinear complexity and vast amounts of data, which makes them naturally suited to data-driven methods. Around 2015, the wave of deep learning swept in and immediately set off a sweeping technological revolution across the entire field.
For companies that pursue commercial value, the impact of deep learning goes far beyond the contribution of a new algorithm. Looking back at the history of internet technology, machine learning was introduced very early as a new production tool and has long been applied in search, advertising, and other fields. However, a large number of complex models researched and published by early academia remained at the laboratory stage and were difficult for industry to apply at scale. There are two main reasons: first, the assumptions behind those models were too strict and too far removed from real applications to produce effective results; second, the scale of computation in industry is enormous, and training and solving the models poses complex engineering challenges. Successfully developing a new model, from initial design all the way to application, often took a professional large-scale parallel computing team months or even years. At that time, there was very little intersection between industry and academia.
However, the emergence of deep learning has completely changed the game of technology R&D, driving exponential growth in productivity. Unlike traditional machine learning, deep learning breaks the closed loop of industrial-grade algorithm development, which used to require specialized modeling and optimization skills as well as specialized programming skills for distributed computing. It provides a new paradigm of algorithm development that is modular and easily adaptable, like building blocks:
(1) A large number of excellent open-source deep learning frameworks provide prepackaged basic modules, so designing a new model algorithm has largely become a matter of assembling these modules.
(2) Optimizing a deep model can be done easily with a set of standard optimizers, without manual gradient derivation or bespoke algorithm design; most of these optimizers are already built into the deep learning frameworks and require no additional programming.
(3) Machine learning engineers or scientists can focus on understanding the domain problem and designing the model, building a deep model architecture much as a civil engineer drafts a blueprint. The remaining work is handed over to software engineers: the model is trained by the deep learning framework, optimized for the best computing efficiency and performance. In other words, the design and the implementation of the model are decoupled.
This innovative R&D approach has enabled the proliferation of new models and algorithms in the era of deep learning, greatly raised the technological level of industrial applications, and exerted a huge commercial influence on the information industry. In previous decades, only a handful of machine learning models could be applied in industry. Today, a technical intern can easily run several experimental deep-model variants in a single day. This is a huge improvement in productivity. In a way, deep learning has dispelled the fear of applying complex machine learning models in industry and liberated practitioners' imagination.
The transformation of the R&D approach has further reshaped the internal driving force of technological innovation. Evidently, since the boom in deep learning, innovation in core model algorithms has gradually come to be dominated by industry and driven by industrial practice and domain applications in fields such as recommendation, search, and advertising. The most advanced algorithms now often come from the top teams of leading companies, rather than from academic machine learning laboratories working on hypothetical problems or theoretical developments.
These algorithms developed in industry often have a strong domain dependency. They tend to set aside theoretical elegance and pursue simplicity and pragmatism. This is exactly what young practitioners tend to overlook, and it is also the most precious highlight of this book. The greatest value of this book is not that it lists a large number of well-known industrial model algorithms for detailed dissection, but that it tries to guide readers toward mastering the ideas behind industrial model design from the perspective of technology creation, using the specific scenarios in which each technology was invented as blueprints. It presents the real "silver bullet": understanding what problem each technique was invented to solve.
Nowadays, there are generally two approaches to industrial technology R&D:
(1) Finding nails with a hammer: Track the latest top conference papers or technical blogs of large companies, look for innovations, try them in one’s own scenario, and rely on luck to get results.
(2) Problem-driven: Define the problem clearly, think about the technical requirements, and then find or conceive the corresponding technical tools.
Sadly, many technical teams and machine learning engineers in the industry are still accustomed to the first approach, not due to lack of ability, but due to cognitive inertia and a lack of technical confidence. As the leader of Alibaba's Targeted Advertising Model team, I will illustrate the second approach using our team's R&D path as an example:
What we have been contemplating and looking for is a model algorithm that "can really make use of the large amount of personalized internet behavior data accumulated within Alibaba's e-commerce system." With this goal in mind, over the past three to four years we have creatively proposed, developed, and deployed a series of personalized behavior prediction models such as DIN, DIEN, MIMN, and ESMM, which have brought tens of billions in incremental revenue to Alibaba's advertising business. There are two main considerations behind these models:
(1) For a personalized internet behavior model in e-commerce, what kind of deep model structure should be used to capture the inherent patterns? For this consideration, we chose the attention-style structure (reverse activation of user interest expression, see the DIN model), the GRU-style structure (the evolution of interest over time, see the DIEN model), and the memory-style structure (interest memory and induction, see the MIMN model) in our model designs; a minimal sketch of the attention-style idea appears after this list.
(2) The depiction of user interests generally becomes more accurate with more user behavior data, so what kind of technical architecture can accommodate more data? For this consideration, we moved from single-point behavioral modeling (DIN, DIEN, and MIMN) to joint modeling of multiple behavioral paths (ESMM), and from modeling short behavior sequences (DIN and DIEN) to modeling ultra-long behavior sequences (MIMN).
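To make the first consideration concrete, here is a minimal, hypothetical sketch of the attention-style idea: the candidate item's embedding "activates" the parts of a user's behavior sequence that are relevant to it, producing a candidate-specific interest representation. The layer sizes, interaction features, and softmax normalization below are illustrative assumptions, not the exact published DIN formulation.

```python
import torch
import torch.nn as nn

class InterestActivation(nn.Module):
    """Sketch of attention-style 'reverse activation' of user interest:
    the candidate item decides which historical behaviors matter."""
    def __init__(self, emb_dim: int, hidden_dim: int = 36):
        super().__init__()
        # Small MLP that scores each (behavior, candidate) pair.
        self.att_mlp = nn.Sequential(
            nn.Linear(emb_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, behaviors: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        # behaviors: (batch, seq_len, emb_dim) embeddings of historical behaviors
        # candidate: (batch, emb_dim) embedding of the candidate ad/item
        seq_len = behaviors.size(1)
        cand = candidate.unsqueeze(1).expand(-1, seq_len, -1)
        # Pairwise interaction features between each behavior and the candidate.
        pair = torch.cat([behaviors, cand, behaviors - cand, behaviors * cand], dim=-1)
        scores = self.att_mlp(pair)                    # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)         # attention over the sequence
        # Weighted sum = candidate-specific user interest representation,
        # which is then fed into the downstream prediction network.
        return (weights * behaviors).sum(dim=1)        # (batch, emb_dim)
```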
In fact, the initial technical dividends of deep learning, which has become the industry standard at most companies since its emergence, have been mostly depleted. As far as I know, most of the top teams in the industry that have fully adopted deep learning have entered a stage of stagnation. I refer to this phase of technological progress as the first stage (1.0) of industrial-grade deep learning. The signs that this 1.0 stage has reached its peak are:
(1) The marginal benefits of further evolving the modular model architecture are diminishing over time.
(2) Deep models have entered a data-starved stage: filling existing model capacity and further improving accuracy would require 10× or 100× more data.
(3) Most new large-scale algorithm optimizations and improvements require significant upgrades and modifications to the engineering systems architecture.
A bottleneck has emerged. Where can we find the next technological breakthrough?
Since 2018, I have been advocating for and implementing a new systems architecture for recommendation, search, and advertising fields, in order to adapt to the explosive development of leading machine learning capabilities enabled by deep learning. I believe that the next evolution in technology will enter a new phase, which I refer to as the industrial-grade deep learning 2.0 stage. In this 2.0 stage, deep learning will no longer be a novel weapon, but rather new infrastructure and tools; computational power will no longer be the driving force of deep learning, but rather a new constraint due to the explosion of model complexity; technological advancements will move from relying on single-point breakthroughs in deep learning algorithms to more complex and systemic technology systems, further creating technological dividends. The key breakthrough point lies in the collaborative design of algorithms and systems architecture (algo-systems codesign).
To give a specific example, candidate generation has always been an important part of recommendation and advertising technology systems. Because its computational scale far exceeds that of re-ranking, candidate generation models have historically gone through several iterations: from the simplest statistical feedback models, to lightweight LR or FM models with feature pruning, to today's mainstream two-tower deep learning models. However, the two-tower structure limits feature crossing between users and items, and the final target fitting can only take the form of a vector inner product and its variants, which greatly limits the model's expressive power. In 2019, our team made a bold new attempt: we redefined the candidate generation architecture around fully real-time computation and introduced network quantization, compression, distillation, and other technologies to balance computational cost and model accuracy. Under certain computational constraints, this architecture can support candidate retrieval models that use arbitrarily complex deep network structures for online inference. It allows our latest candidate generation models to approach the complexity of re-ranking models while supporting efficient online iteration, achieving double-digit percentage performance improvements in most scenarios after going online (this was the biggest single-point model technology breakthrough we made in 2018).
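As a rough illustration of the constraint described above, here is a hypothetical sketch contrasting the two patterns: a two-tower scorer, where user and item can only interact through an inner product of independently computed embeddings, and a fully crossed deep scorer, which mixes user and item features freely but must run a full forward pass per candidate online. The class names and layer sizes are illustrative assumptions, not our team's actual architecture.

```python
import torch
import torch.nn as nn

class TwoTowerScorer(nn.Module):
    """Two-tower retrieval pattern: user and item are encoded independently,
    so their only interaction is the final inner product."""
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, user_feats: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
        u = self.user_tower(user_feats)   # computed online, once per request
        v = self.item_tower(item_feats)   # typically precomputed and indexed for ANN search
        return (u * v).sum(dim=-1)        # expressiveness limited to an inner product


class FullCrossScorer(nn.Module):
    """Fully crossed alternative: user and item features pass through one deep
    network together, enabling arbitrary feature crossing, at the cost of a
    full forward pass for every candidate at serving time."""
    def __init__(self, user_dim: int, item_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(user_dim + item_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, user_feats: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([user_feats, item_feats], dim=-1)).squeeze(-1)
```

Techniques such as quantization and distillation, mentioned above, are ways of shrinking the second kind of model so that its online inference cost becomes acceptable at candidate-generation scale.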
In the era of industrial-grade deep learning 2.0, we can anticipate that the pattern of technological evolution will be further upgraded: from algorithm-centric, problem-driven practice to a more holistic view of the larger technical system, in which domain-specific problem characteristics, data, computing power, algorithms, architectures, and engineering systems are all integrated into a unified framework for technical innovation. For professionals working in recommendation, search, and advertising, this means machine learning engineers must adopt the mindset of system engineers, while system engineers must keep up with algorithmic trends and try to lead the architectural direction of algorithms. This phase of technology appears "unfavorable": it is complex, costly, hard to replicate, and demanding on individuals (requiring both machine learning and engineering abilities). However, I believe the deeper law of technological development is to evolve from the simple to the complex, and then onward to a more simplified technical system; this is also the way of thinking that this book hopes to convey to its readers.
This is the best of times and also the worst of times. I hope this foreword brings inspiration and motivates you to explore the fascinating world of deep learning!