User Behavior Tracking System: Overview
The overall user behavior tracking system consists of four main components:
- 01 Data Collection
- 02 Data Processing
- 03 Data Storage
- 04 Data Analysis
01 Data Collection: This component is responsible for capturing user interactions and events.
Common Sampling Methods:
Common sampling methods include fixed data-point sampling (Khan et al., 2024), e.g., 5 data points per window regardless of time duration; fixed time-interval sampling (Khan et al., 2024; Mazza et al., 2020), e.g., sampling at 60 fps, i.e., 60 data points per second; and event-based sampling (Weinmann et al., 2022), e.g., recording a data point whenever the user's mouse position changes.
Our method & rationale:
The collected data can be represented as rows of raw data with the columns x, y, timestamp, event name, and element (the HTML element the user is interacting with).
This design implements fixed pixel-size sampling: we record a sample whenever the change in the mouse's x or y position exceeds 1 pixel.
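The sampling rule above can be sketched as a small filter; this is an illustrative sketch, not our production collection code, and the function and field names (`make_sampler`, `sample`) are hypothetical:

```python
import time

def make_sampler(threshold_px=1.0):
    """Return a closure that keeps a sample only when the mouse has moved
    more than `threshold_px` pixels (on either axis) since the last kept sample."""
    last = {"x": None, "y": None}

    def sample(x, y, event_name, element):
        if last["x"] is not None:
            # Trigger only when the change on some axis exceeds the threshold
            if max(abs(x - last["x"]), abs(y - last["y"])) <= threshold_px:
                return None  # movement too small, drop the point
        last["x"], last["y"] = x, y
        # One row of raw data: x, y, timestamp, event name, element
        return {"x": x, "y": y, "timestamp": time.time(),
                "event": event_name, "element": element}

    return sample
```

For example, after recording a point at (10, 10), a move to (10.5, 10.2) is dropped, while a move to (13, 10) is recorded.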
For detailed description of which event we are listening to and how we implement the data collection, please refer to this article: User Behavior Tracking Data Collection.
02 Data Processing: This component is responsible for transforming raw data into a format suitable for uploading.
Common Processing Methods:
Common data processing methods include data cleaning (e.g., handling missing values, removing outliers), data transformation (e.g., normalizing data, encoding categorical variables), data aggregation (e.g., summarizing data at different levels of granularity), and choosing a data format for storage (e.g., JSON, CSV, or binary formats like MessagePack).
Our method & rationale:
For our case, data cleaning is not necessary since we are collecting data at the pixel level, which is less likely to have missing values or outliers. However, we do need to transform the data into a format suitable for uploading. We choose to use MessagePack (msgpack) for data serialization because it is a binary format that is more compact and faster to serialize/deserialize compared to JSON. This choice allows us to efficiently handle the large volume of data generated by tracking user behavior at the pixel level.
Protobuf v.s. Msgpack:
Protobuf is a language-neutral, platform-neutral, and extensible mechanism for serializing structured data. It offers excellent efficiency in both size and speed compared to JSON or XML. However, it requires a predefined schema and code generation for serialization/deserialization. This adds significant complexity whenever we need to add or modify columns in our nested event-based data structure.
In contrast, MessagePack provides simple, schema-less serialization functions that work directly with native data structures (such as nested dictionaries). This makes it much easier to implement for our current use case and allows other developers to fork the project and make their own adjustments with minimal friction.
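A minimal sketch of this schema-less workflow, assuming the `msgpack` Python package (`pip install msgpack`) is available; the session id and field values are illustrative:

```python
import json

import msgpack  # pip install msgpack

# A nested, event-based record like the rows we collect; no schema
# definition or code generation is needed before serializing it.
record = {
    "session": "abc123",  # illustrative session id
    "events": [
        {"x": 120, "y": 340, "timestamp": 1700000000.123,
         "event": "mousemove", "element": "BUTTON"},
        {"x": 121, "y": 342, "timestamp": 1700000000.140,
         "event": "click", "element": "BUTTON"},
    ],
}

packed = msgpack.packb(record)      # bytes, ready to upload
restored = msgpack.unpackb(packed)  # round-trips to native dicts/lists
assert restored == record

# The binary encoding is more compact than the JSON equivalent.
print(len(packed), len(json.dumps(record).encode()))
```

Adding or renaming a column is just a change to the dictionary; with Protobuf the same change would require editing the `.proto` schema and regenerating code.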
Therefore, we have chosen MessagePack for its ease of use and superior flexibility when handling our specific hybrid (time-interval + event-based) data format. If, in the future, collaboration with other teams requires a more standardized and efficient serialization format, we can consider switching to Protobuf. For now, MessagePack strikes the right balance between performance and customization for our user behavior tracking system.
For detailed comparison between Protobuf and MessagePack, please refer to this article: Protobuf vs MessagePack.
03 Data Storage: Once the data is collected, it needs to be stored efficiently for later retrieval and analysis.
Common Storage Methods:
It is common to use self-owned relational databases (e.g., MySQL, PostgreSQL) or NoSQL databases (e.g., MongoDB, Cassandra) for storing user behavior data. The choice of database depends on factors such as the volume of data, the need for scalability, and the specific use cases for data retrieval and analysis.
Another option is to use third-party data storage services (e.g., AWS S3, Google Cloud Storage) for storing large volumes of user behavior data. These services offer scalability and durability, but may introduce additional costs and latency compared to self-managed databases.
Our method & rationale:
We choose Amazon S3 for data storage for several reasons:
- Scalability: Amazon S3 can handle large volumes of data, which is essential for our case given that we monitor mouse movement at the 1-pixel level over a 5-minute survey session.
- Streaming: Amazon S3 supports streaming uploads, which allows us to upload data while users are still interacting with the application, reducing latency when they submit the survey.
- Cost-effectiveness: Amazon S3 offers a pay-as-you-go pricing model, which can be more cost-effective for storing large volumes of data compared to self-managed databases.
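The streaming point can be sketched as a buffer that hands off fixed-size parts while the session is still running, mirroring how S3 multipart uploads accept parts incrementally (parts other than the last must be at least 5 MiB). The `ChunkedUploader` class and its callback are illustrative; in production the callback would wrap a real S3 client call such as boto3's `upload_part`:

```python
class ChunkedUploader:
    """Buffer serialized event batches and hand off fixed-size parts to an
    upload callback during the session, instead of one big upload at submit time."""

    def __init__(self, upload_part, part_size=5 * 1024 * 1024):
        self.upload_part = upload_part  # called as upload_part(part_number, data)
        self.part_size = part_size      # S3 multipart parts must be >= 5 MiB (except the last)
        self.buffer = bytearray()
        self.part_number = 0

    def feed(self, packed_events: bytes):
        """Append a serialized batch; ship full parts as they accumulate."""
        self.buffer.extend(packed_events)
        while len(self.buffer) >= self.part_size:
            self._flush(self.part_size)

    def close(self):
        """Flush whatever remains when the user submits the survey."""
        if self.buffer:
            self._flush(len(self.buffer))

    def _flush(self, n):
        self.part_number += 1
        self.upload_part(self.part_number, bytes(self.buffer[:n]))
        del self.buffer[:n]
```

Because most of the data has already been shipped in parts during the session, only the final small part remains to upload at submission.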
04 Data Analysis: This component is responsible for analyzing the stored data to extract insights.
References
- Khan, S., Devlen, C., Manno, M., & Hou, D. (2024). Mouse dynamics behavioral biometrics: A survey. ACM Computing Surveys, 56(6), 1-33.
- Mazza, C., Monaro, M., Burla, F., Colasanti, M., Orrù, G., Ferracuti, S., & Roma, P. (2020). Use of mouse-tracking software to detect faking-good behavior on personality questionnaires: An explorative study. Scientific Reports, 10(1), 4835.
- Weinmann, M., Valacich, J. S., Schneider, C., Jenkins, J. L., & Hibbeln, M. (2022). The path of the righteous: Using trace data to understand fraud decisions in real time. MIS Quarterly, 46(4), 2317-2336.