TY - JOUR
T1 - Automatic document classification via transformers for regulations compliance management in large utility companies
AU - Dimlioglu, Tolga
AU - Wang, Jing
AU - Bisla, Devansh
AU - Choromanska, Anna
AU - Odie, Simon
AU - Bukhman, Leon
AU - Olomola, Afolabi
AU - Wong, James D.
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2023/8
Y1 - 2023/8
AB - The operation of large utility companies such as Consolidated Edison Company of New York, Inc. (Con Edison) typically relies on large quantities of regulation documents from external institutions, which inform the company of upcoming or ongoing policy changes or new requirements that the company might need to comply with if deemed applicable. As a concrete example, if a recent regulatory publication states that the timeframe for the Company to respond to a reported system emergency in its service territory changes from within X time to within Y time, then the affected operating groups will be notified, and internal Company operating procedures may need to be reviewed and updated accordingly to comply with the new regulatory requirement. Each such regulation document needs to be reviewed manually by an expert to determine whether the document is relevant to the company and, if so, which department it is relevant to. To help enterprises improve the efficiency of their operations, we propose an automatic document classification pipeline that determines whether a document is important for the company and, if so, forwards it to the relevant departments within the company for further review. The binary classification task of determining the importance of a document is performed by ensembling Naive Bayes (NB), support vector machine (SVM), random forest (RF), and artificial neural network (ANN) classifiers for the final prediction, whereas the multi-label classification problem of identifying the relevant departments for a document is handled by the transformer-based DocBERT model. We apply our pipeline to a large corpus of tens of thousands of texts provided by Con Edison and achieve an accuracy score over 80%. Compared with existing solutions for document classification, which rely on a single classifier, our paper (i) ensembles multiple classifiers to improve accuracy and mitigate overfitting, (ii) utilizes the pretrained transformer-based DocBERT model to achieve strong performance on the multi-label classification task, and (iii) introduces a bi-level structure to improve the performance of the whole pipeline, in which the binary classification module works as a rough filter before the multi-label classification module distributes the text to the corresponding departments.
KW - BERT
KW - Document classification
KW - Machine learning
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85153728227&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85153728227&partnerID=8YFLogxK
U2 - 10.1007/s00521-023-08555-4
DO - 10.1007/s00521-023-08555-4
M3 - Article
AN - SCOPUS:85153728227
SN - 0941-0643
VL - 35
SP - 17167
EP - 17185
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 23
ER -