TikTok is a video-sharing app that lets users create and share short videos. It stands out for its personalized “just for you” recommendations, which make it highly addictive and very popular among Gen Z. Behind the scenes, it is powered by artificial intelligence technologies.
The architecture of the TikTok recommendation system includes three components: big data frameworks, machine learning, and microservices architecture.
1. Big data frameworks are the starting point of the system. They provide real-time data stream processing, data computing, and data storage.
2. Machine learning is the brain of the recommendation system. A range of machine learning and deep learning algorithms and techniques is applied to build models and generate recommendations that suit individual preferences.
3. Microservices architecture is the underlying infrastructure that makes the whole system fast and efficient.
Big Data Frameworks
No data, no intelligence.
Most of the data comes from users’ smartphones, including the operating system, installed apps, and so on. More importantly, TikTok pays special attention to users’ activity logs, such as watch time, swipes, likes, shares, and comments.
The log data are collected and aggregated through Flume and Scribe, then piped into Kafka queues. Apache Storm then processes the data streams in real time, alongside other components in the Apache Hadoop ecosystem.
The Apache Hadoop ecosystem is a distributed system for data processing and storage. It includes MapReduce, the first-generation distributed data processing system, which processes data in parallel as batch jobs. YARN is a framework for job scheduling and cluster resource management. HDFS is a distributed file system. HBase is a scalable, distributed database that supports structured data storage for large tables. Hive is a data warehouse infrastructure that provides data summarization and querying. ZooKeeper is a high-performance coordination service.
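The MapReduce batch model described above can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the log record layout and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `count_likes`) are assumptions made for illustration, and a real job would run the phases in parallel across a cluster.

```python
from collections import defaultdict

# Hypothetical activity-log records: (user_id, video_id, action).
LOGS = [
    ("u1", "v1", "like"),
    ("u2", "v1", "like"),
    ("u1", "v2", "share"),
    ("u3", "v2", "like"),
    ("u3", "v1", "comment"),
]


def map_phase(records):
    """Emit a (video_id, 1) pair for every 'like' record."""
    for _user, video, action in records:
        if action == "like":
            yield video, 1


def shuffle_phase(pairs):
    """Group mapped values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_likes(records):
    """Run the three phases in sequence over one batch of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_likes(LOGS))
```

The whole batch has to be available before the reduce phase can finish, which is why this model suits offline jobs rather than low-latency processing.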
As data volumes grow fast, real-time data processing frameworks come into the picture. Apache Spark is the third-generation framework, which supports near-real-time distributed processing for big data workloads. Spark improves on the performance of MapReduce by processing data in memory. In the last couple of years, TikTok has adopted the fourth-generation framework, Apache Flink, which is designed to do real-time stream processing natively.
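To contrast with batch processing, here is a rough sketch of the kind of per-event work a stream processor does: a tumbling (fixed, non-overlapping) window that sums watch time per video as each event arrives. It does not use the Flink or Spark APIs; the class name `TumblingWindowAggregator`, the event layout, and the 60-second window are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_seconds, video_id, watch_seconds).
EVENTS = [
    (5, "v1", 12.0),
    (42, "v1", 3.5),
    (59, "v2", 20.0),
    (61, "v1", 8.0),
    (130, "v2", 15.0),
]


class TumblingWindowAggregator:
    """Sums watch time per video inside fixed, non-overlapping time windows."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.totals = defaultdict(float)

    def process(self, timestamp, video_id, watch_seconds):
        """Fold one event into its window as soon as it arrives."""
        window_start = int(timestamp // self.window_seconds) * self.window_seconds
        self.totals[(window_start, video_id)] += watch_seconds
        return window_start

    def snapshot(self):
        """Return the current totals keyed by (window_start, video_id)."""
        return dict(self.totals)


if __name__ == "__main__":
    aggregator = TumblingWindowAggregator(window_seconds=60)
    for event in EVENTS:
        aggregator.process(*event)
    print(aggregator.snapshot())
```

Unlike the batch model, the totals are usable after every event. A production stream processor would also handle late or out-of-order events and distribute state across machines, which this sketch omits.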
The database systems include MySQL, MongoDB, and many others.
Machine Learning
This is the core of how TikTok earned its reputation for a hyper-personalized, addictive algorithm.
After vast datasets pour in, the next steps are content analysis, user profiling, and context analysis. Neural-network deep learning frameworks such as TensorFlow are used to perform computer vision and natural language processing (NLP). Computer vision deciphers the images in photos and videos. NLP includes classification, labeling, and evaluation.
Classic machine learning algorithms are used, including logistic regression (LR), convolutional neural network (CNN), recurrent neural network (RNN), and gradient boosting decision trees (GBDT). Common recommendation approaches are also applied, such as content-based filtering (CBF), collaborative filtering (CF), and the more advanced matrix factorization (MF).
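As one example of the approaches listed above, the following is a minimal sketch of item-based collaborative filtering with cosine similarity. It is a textbook version, not TikTok's model: the `RATINGS` data, the function names, and the use of a flat engagement score of 1.0 are assumptions for illustration.

```python
import math

# Hypothetical implicit feedback: user -> {video_id: engagement score}.
RATINGS = {
    "alice": {"v1": 1.0, "v2": 1.0},
    "bob": {"v1": 1.0, "v2": 1.0, "v3": 1.0},
    "carol": {"v2": 1.0, "v3": 1.0},
    "dave": {"v4": 1.0},
}


def item_vectors(ratings):
    """Invert the user -> items mapping into item -> {user: score}."""
    vectors = {}
    for user, items in ratings.items():
        for item, score in items.items():
            vectors.setdefault(item, {})[user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, k=2):
    """Rank unseen videos by their similarity to the videos the user engaged with."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, candidate_vec in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(candidate_vec, vectors[item]) * weight
            for item, weight in seen.items()
        )
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:k] if score > 0]


if __name__ == "__main__":
    print(recommend("alice", RATINGS))
```

Note that a user such as `dave`, whose only video shares no audience with the others, gets no recommendations here. That gap is the cold-start problem.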
The secret weapons that TikTok uses to read your mind are:
1. Algorithm experimentation platform: Engineers experiment with mixes of multiple machine learning algorithms, such as LR and DNN (deep neural network), then run tests (A/B tests) and make adjustments.
2. Extensive classification and labeling: The models are based on users’ engagement, such as watch time, swipes, and the commonly used likes or shares (what you do, as a reflection of your subconscious, says more about you than what you say). The number of user features, vectors, and categories is greater than in most recommendation systems in the world, and they keep adding more.
3. User feedback engine: It updates the models after retrieving feedback from users over multiple iterations. The experience management platform is built on this engine and ultimately improves the predictions and recommendations.
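Items 1 and 3 in the list above can be sketched together under some loud assumptions. The first function shows a common way to split users into A/B test arms by hashing; the class shows how a logistic regression model could be nudged after each piece of feedback with one stochastic gradient step. The arm names, the two-feature input, and the learning rate are all hypothetical, and nothing here reflects TikTok's actual implementation.

```python
import hashlib
import math


def assign_variant(user_id, experiment, variants=("lr_only", "lr_plus_dnn")):
    """Deterministically assign a user to one experiment arm (hypothetical arm names)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


class OnlineLogisticRegression:
    """Logistic regression updated one feedback event at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Return the predicted probability of a positive engagement."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """Apply one SGD step on log loss for a single (features, label) pair."""
        error = self.predict(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


if __name__ == "__main__":
    print(assign_variant("user-42", "ranking-mix"))

    model = OnlineLogisticRegression(n_features=2)
    # Hypothetical feedback: feature 0 marks videos the user finished (label 1),
    # feature 1 marks videos the user swiped away from (label 0).
    for _ in range(200):
        model.update([1.0, 0.0], 1)
        model.update([0.0, 1.0], 0)
    print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))
```

Hashing the experiment name together with the user ID keeps each user in the same arm across sessions while letting different experiments split the population independently.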
To solve the cold-start problem in recommendations, a recall strategy is used: from tens of millions of videos, it selects thousands of candidates that have proven to be popular and high quality.
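The recall step can be pictured as a cheap filter that runs before any personalized ranking. The sketch below is only an illustration of that idea: the `popularity` and `quality` fields, the quality floor of 0.5, and the function name `recall_candidates` are assumptions, not details from TikTok.

```python
import heapq

# Hypothetical catalog entries with precomputed popularity and quality signals.
VIDEOS = [
    {"id": "v1", "popularity": 900, "quality": 0.9},
    {"id": "v2", "popularity": 1500, "quality": 0.3},
    {"id": "v3", "popularity": 700, "quality": 0.8},
    {"id": "v4", "popularity": 1200, "quality": 0.6},
]


def recall_candidates(videos, k, min_quality=0.5):
    """Return the ids of the k most popular videos that pass a quality floor."""
    eligible = (v for v in videos if v["quality"] >= min_quality)
    top = heapq.nlargest(k, eligible, key=lambda v: v["popularity"])
    return [v["id"] for v in top]


if __name__ == "__main__":
    print(recall_candidates(VIDEOS, k=2))
```

Because the filter needs no history for the viewer, it can serve a brand-new user, and `heapq.nlargest` avoids sorting the whole catalog when only a small candidate set is needed.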
Meanwhile, some of the AI work has been moved to the client side for super-fast response. That includes real-time training, modeling, and inference done on the device. Machine learning frameworks such as TensorFlow Lite or ByteNN are used on the client side.
Microservices Architecture
TikTok has embraced cloud-native infrastructure. The recommendation components, such as user profiling, prediction, cold start, recall, and the user feedback engine, are served as APIs. The services are hosted in clouds such as Amazon Web Services (AWS) and Microsoft Azure. As the output of the system, the curated videos are pushed to users through the cloud.
TikTok employs Kubernetes-based containerization technology. Kubernetes is a container orchestrator, a toolset that automates the application life cycle. Kubeflow is dedicated to deploying machine learning workflows on Kubernetes.
As part of the cloud-native stack, a service mesh is another tool, one that handles service-to-service communication. It controls how different parts of an application share data with one another, and it inserts features or services at the platform layer rather than the application layer.
Because of the high-concurrency requirements, the services are built with the Go language and gRPC. At TikTok, Go has become the dominant language in service development because of its good built-in network and concurrency support. gRPC is a Remote Procedure Call framework for building and connecting services efficiently.
TikTok’s success comes from going the extra mile to provide the best user experience. It builds in-house tools to maximize performance at a low level (the system level). For example, ByteMesh is an improved version of a service mesh, Kitex is a high-performance Go RPC framework, and Sonic is an enhanced Go JSON library. Other in-house tools and systems include parameter servers, ByteNN, and Abase, to name a few.
As Xiang Liang, a machine learning principal at TikTok, put it, sometimes the infrastructure beneath is more important than the (machine learning) algorithms above.