TY - GEN
T1 - Anonymizing NYC taxi data
T2 - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016
AU - Douriez, Marie
AU - Doraiswamy, Harish
AU - Freire, Juliana
AU - Silva, Claudio T.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/12/22
Y1 - 2016/12/22
N2 - The widespread use of location-based services has led to an increasing availability of trajectory data from urban environments. These data carry rich information that is useful for improving cities through traffic management and city planning. Yet they also contain information about individuals that can jeopardize their privacy. In this study, we work with the New York City (NYC) taxi trips data set publicly released by the Taxi and Limousine Commission (TLC). This data set contains information about every taxi cab ride that happened in NYC. A flawed hashing of the medallion numbers (the ID corresponding to a taxi) allowed all the medallion numbers to be recovered and led to a privacy breach for the drivers, whose income could be easily extracted. In this work, we initiate a study to evaluate whether 'perfect' anonymity is possible and whether such identity disclosure can be avoided given the availability of diverse external data sets through which the hidden information can be recovered. This is accomplished through a spatio-temporal join-based attack that matches the taxi data with external medallion data that an adversary can easily gather. Using a simulation of the medallion data, we show that our attack can re-identify over 91% of the taxis that operate in NYC even when a perfect pseudonymization of medallion numbers is used. We also explore the effectiveness of trajectory anonymization strategies and demonstrate that our attack can still identify a significant fraction of the taxis in NYC. Given the restrictions on the publication of the taxi data by the TLC, our results indicate that unless the utility of the data set is significantly compromised, it will not be possible to maintain the privacy of taxi medallion owners and drivers.
KW - Privacy attacks
KW - Spatio-Temporal data
KW - Taxi data
KW - Trajectory privacy
UR - http://www.scopus.com/inward/record.url?scp=85011277402&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85011277402&partnerID=8YFLogxK
U2 - 10.1109/DSAA.2016.21
DO - 10.1109/DSAA.2016.21
M3 - Conference contribution
AN - SCOPUS:85011277402
T3 - Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016
SP - 140
EP - 148
BT - Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 October 2016 through 19 October 2016
ER -