Developing Data Handling Guidelines for Open-Source LLM Training in Compliance with Section 37 under Thailand’s PDPA and Related Legal Provisions
Keywords:
Open-source LLM, PDPA Section 37, data handling, data security
Abstract
This study examines how Section 37 of Thailand’s Personal Data Protection Act (PDPA) applies to the training of open-source large language models (LLMs), a context often characterized by decentralization and limited institutional oversight. Beyond doctrinal and comparative legal analysis, the research incorporates semi-structured interviews with open-source AI developers in Thailand to ground the legal findings in real-world practice. Drawing on structural risk analysis and practitioner feedback, the study proposes a conceptual framework for data handling guidelines in LLM training. The framework takes a modular, role-sensitive approach and is accompanied by practical tables that help data controllers operationalize Section 37. Designed for resource-constrained environments, the proposed guidelines aim to enable legally compliant, scalable, and responsible development of open-source LLMs in Thailand.