The rapid expansion of location-based services has triggered the acquisition and analysis of various types of individual trajectory recordings, giving rise to a multitude of tasks involving the processing of mobility traces within recommendation systems, smart-city solutions, human-environment interactions, and public health matters.
However, the sensitive nature of individual movements leads to privacy constraints and regulations on their use and sharing, limiting the possibilities for free dataset exchange or public disclosure. Geoprivacy protection has gained increasing public awareness and has become a primary concern among the social, ethical, and legal implications of users’ personal data, reinforcing the view that preventing the disclosure of sensitive visited locations is part of an individual’s rights. For this reason, acquiring motion information from a large number of users requires adherence to strict regulations and informed consent, and its permanent storage and sharing are often prohibited by national laws. Therefore, in many cases, the use of good-quality mobility traces is severely restricted, and specific ad hoc solutions must be applied before the data can be used for their intended purpose.
The trivial practice of removing user identifiers is generally not considered an adequate solution, as “de-identified” location sequences remain very strong indicators of the identity of the users who generated them, and therefore still pose serious privacy threats. In practice, a very common solution consists of downgrading the original spatial resolution, blurring the initial location units to some extent by aggregating track points into larger territorial divisions, such as administrative areas or geographic grids. However, spatial aggregation does not always succeed in preserving user privacy, and it also reduces the effectiveness of spatial analysis, negatively affecting processing steps, final outcomes, and the corresponding findings. Consequently, achieving privacy protection without any change to the original data resolution is considered the preferred route.
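The resolution downgrade described above can be illustrated with a minimal sketch, assuming a regular latitude/longitude grid. The function names and the `cell_deg` parameter are hypothetical and serve only as an illustration; they are not taken from any specific tool.

```python
import math

def to_grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to the (row, col) index of a regular lat/lon grid.

    cell_deg is the cell size in degrees (an illustrative assumption).
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def aggregate_track(points, cell_deg=0.01):
    """Replace each track point by its grid cell, merging consecutive repeats."""
    cells = []
    for lat, lon in points:
        cell = to_grid_cell(lat, lon, cell_deg)
        # Consecutive points falling in the same cell collapse into one visit.
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

track = [(47.81, 13.04), (47.83, 13.06), (48.21, 16.37)]
coarse = aggregate_track(track, cell_deg=0.5)      # first two points merge
fine = aggregate_track(track, cell_deg=1 / 128)    # all three stay distinct
```

With the coarse grid the first two points become indistinguishable, which is the privacy gain, but any analysis that depends on the movement between them is lost, which is the utility cost.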
The general trend in the literature is to define “privacy-protected” data formats as “spatially altered” versions of real acquired individual traces, mainly addressing the problem by manually or statistically modifying the spatial coordinates of the location visits in the original trajectories. The main downside is the inevitable difficulty of properly balancing geoprivacy protection against the potential for relevant use in meaningful downstream spatial–temporal analyses. If spatial uncertainty is increased, the quality of the dataset for downstream tasks decreases, as data characteristics may be excessively altered (heavy resolution downgrade, intense random spatial perturbation). Conversely, fewer location alterations lead to a higher risk that a user can be easily re-identified.
Rather than obfuscating location visits to add uncertainty, we address the problem from a different perspective. The underlying idea is to generate a completely synthetic dataset, whose samples are individually different from the original data, but which collectively shares similar global characteristics and performance on downstream tasks. Moreover, while previous approaches mainly rely on manually designed procedures (which, if disclosed, may allow reverse-engineering solutions for recovering the original trajectory data), we intend to leverage a “black-box” approach to transform input data into synthetic samples.
We therefore propose a generative deep learning solution for handling location-based trajectory formats, with the goal of producing realistic synthetic location sequences. In particular, the process relies on a generative adversarial network (GAN) framework, which is intended to automatically learn high-level features of real trajectories so that it can subsequently generate realistic synthetic traces.
Original location-based trajectories, in the form of discrete location sequences, are first transformed into sequences of embeddings, whereby each location is associated with a pre-trained embedding vector, a dense representation of motion relatedness that reflects the way people collectively move over the territory. This new trajectory format is then fed into the GAN framework, made of a generator network and a discriminator network, both leveraging LSTM layers to mine the underlying dependencies of sequential trajectory data. While the generator is trained to transform random input noise into plausible synthetic traces, the discriminator is trained to distinguish the generated fake sequences from the real trajectories in the dataset. The two networks compete against each other during training, and this competition is expected to push each of them to outperform the other. Such a deep neural network model automatically learns patterns directly from motion traces, without any manual feature extraction, leveraging the collective mobility of users over the territory to grasp the underlying indicators of human motion activity. Once properly trained, the generator is intended to produce realistic “fake traces” (conceived as “collected” from “people that do not exist”) with global characteristics and downstream performance similar to those of the original trajectory dataset.
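For reference, the competition between the two networks is commonly formalised by the standard GAN minimax objective shown below. This is the textbook formulation, given here under the assumption that a comparable adversarial loss is used; the exact loss adopted in this work is not stated in the text above. Here $x$ denotes a real trajectory in its embedded form, $z$ the random input noise, $G$ the generator, and $D$ the discriminator.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximises $V$ by assigning high scores to real sequences and low scores to generated ones, while the generator minimises it by producing sequences that the discriminator scores as real.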
The synthetic generation of discrete location sequences employs a continuous latent space on two levels: a location vector space that represents place-to-place mobility relationships, and a trajectory vector space that models trace-to-trace semantic consistency. By generating an encoded feature representation of a trajectory rather than directly generating the trajectory itself, we aim to achieve a satisfactory trade-off between data utility and privacy preservation, ensuring enhanced individual data diversity (limited overlap between a synthetic trace and a real trace taken individually) but similar collective data properties (comparable summary indices for the whole synthetic and real datasets).
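The location-level encoding can be sketched as a simple table lookup. The snippet below is a minimal illustration under stated assumptions: the embedding table is a hypothetical toy with two dimensions (real vectors would be pre-trained on collective mobility data), and the nearest-neighbour decoding step is one plausible way to map generated vectors back to discrete locations, not necessarily the procedure used in this work.

```python
import math

# Hypothetical 2-D embedding table; actual vectors would be pre-trained.
EMBEDDINGS = {
    "home": (0.9, 0.1),
    "office": (0.1, 0.9),
    "gym": (0.8, 0.3),
}

def encode(trajectory, table=EMBEDDINGS):
    """Turn a discrete location sequence into a sequence of embedding vectors."""
    return [table[loc] for loc in trajectory]

def nearest_location(vector, table=EMBEDDINGS):
    """Return the location whose embedding is closest (Euclidean) to vector."""
    return min(table, key=lambda loc: math.dist(vector, table[loc]))

def decode(vectors, table=EMBEDDINGS):
    """Map a sequence of (possibly generated) vectors back to discrete locations.

    Nearest-neighbour decoding is an assumption made for illustration.
    """
    return [nearest_location(v, table) for v in vectors]

real = ["home", "office", "gym"]
roundtrip = decode(encode(real))                 # recovers the original sequence
generated = decode([(0.85, 0.05), (0.2, 0.8)])   # vectors near "home", "office"
```

Because the generator operates in the continuous space, its outputs rarely coincide exactly with a known location vector, so some decoding rule of this kind is needed to obtain a discrete location sequence.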
From this perspective, the quality of the generated synthetic data should be investigated according to three aspects: privacy must be guaranteed to some extent (i.e. synthetic trajectories should not be the same as, or too similar to, real existing trajectories); the generated synthetic dataset should be “reasonable” (i.e. its global characteristics should be consistent with those of the real dataset); and its use should lead to similar performance on downstream tasks (i.e. performance should not drop when the dataset is processed by external analytic tasks, such as next-place prediction). In short, the focus is on evaluating the trade-off between the degree of privacy protection and the effectiveness of spatial–temporal analyses, providing novel insights into the integration of artificial intelligence within geospatial disciplines and actively contributing to the expansion of geoAI solutions for human mobility analysis.
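The first two evaluation aspects can be made concrete with a small sketch. The metrics chosen here (Jaccard overlap of visited locations for individual similarity, total variation distance between visit-frequency distributions for collective similarity) are illustrative assumptions, not necessarily the measures used in this work, and the toy trajectories are invented for the example.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between the sets of locations visited in two traces."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def max_overlap(synthetic_trace, real_dataset):
    """Highest similarity between one synthetic trace and any real trace."""
    return max(jaccard(synthetic_trace, real) for real in real_dataset)

def visit_distribution(dataset):
    """Relative visit frequency of each location over a whole dataset."""
    counts = Counter(loc for trace in dataset for loc in trace)
    total = sum(counts.values())
    return {loc: n / total for loc, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy data, invented for illustration only.
real = [["A", "B", "C"], ["C", "D", "E"]]
synthetic = [["A", "B", "D"], ["C", "C", "E"]]

privacy_risk = max(max_overlap(s, real) for s in synthetic)
global_gap = total_variation(visit_distribution(real),
                             visit_distribution(synthetic))
```

In this toy case no synthetic trace reproduces a real one, yet the two datasets have identical visit frequencies, which is the kind of outcome the proposed approach aims for: individual diversity with collective similarity.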