TY - GEN

T1 - Adversarial Combinatorial Bandits with General Non-linear Reward Functions

AU - Chen, Xi

AU - Han, Yanjun

AU - Wang, Yining

N1 - Publisher Copyright:
Copyright © 2021 by the author(s)

PY - 2021

Y1 - 2021

N2 - In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with a general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with N arms and subsets of K arms being chosen at each of T time periods, the minimax optimal regret is Θ̃_d(√(N^d T)) if the reward function is a d-degree polynomial with d < K, and Θ_K(√(N^K T)) if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound O(√(poly(N, K) T)) for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the (N choose K) assortments as independent.

AB - In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with a general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with N arms and subsets of K arms being chosen at each of T time periods, the minimax optimal regret is Θ̃_d(√(N^d T)) if the reward function is a d-degree polynomial with d < K, and Θ_K(√(N^K T)) if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound O(√(poly(N, K) T)) for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the (N choose K) assortments as independent.

UR - http://www.scopus.com/inward/record.url?scp=85140357996&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85140357996&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85140357996

T3 - Proceedings of Machine Learning Research

SP - 4030

EP - 4039

BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021

PB - ML Research Press

T2 - 38th International Conference on Machine Learning, ICML 2021

Y2 - 18 July 2021 through 24 July 2021

ER -