Classification of rbcL Gene in Plants Based on Protein K-mer Representation and Random Forest
DOI:
https://doi.org/10.55537/cosie.v5i1.1401
Keywords:
rbcL, Viridiplantae, protein k-mer, Random Forest, gene classification, Bioinformatics
Abstract
The ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit gene (rbcL) is one of the most widely used molecular markers in plant phylogenetics, DNA barcoding, and taxonomy. Identifying this gene automatically in large protein sequence collections is a practical bioinformatics challenge, especially given the rapid growth of publicly available genomic and proteomic data. This study develops a machine-learning-based classification model that distinguishes rbcL from other plastid genes such as ndhF, psbA, atpB, and rbcS. Protein sequence data were obtained from the NCBI Protein database and labeled automatically using FASTA metadata. Each sequence was represented numerically by its protein k-mers (k = 2), which yields 400 features per sequence (20 amino acids × 20 amino acids). A Random Forest model with 300 estimators was trained on an 80% training and 20% testing split. Evaluation with a confusion matrix, precision, recall, F1-score, and ROC-AUC showed high classification performance, with accuracy ≥ 0.95 and an AUC value approaching 1.0. These findings indicate that an alignment-free k-mer approach can effectively identify rbcL sequences and has strong potential for use in AI-driven gene annotation pipelines for plant plastid genomes.
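The feature extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`parse_fasta`, `label_from_header`, `kmer_vector`), the substring rule used for labeling, the frequency normalization, and the two toy FASTA records are all assumptions made for illustration. Only k = 2 and the resulting 400 features come from the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
K = 2                                  # k-mer length stated in the abstract

# All 20**2 = 400 possible 2-mers, in a fixed order, with a lookup index.
KMERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def label_from_header(header, target="rbcl"):
    """Assumed labeling rule: 1 if the header mentions the target gene, else 0."""
    return 1 if target in header.lower() else 0


def kmer_vector(seq, normalize=True):
    """Count overlapping k-mers; windows with non-standard residues are skipped."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1.0
            total += 1
    if normalize and total:
        counts = [c / total for c in counts]  # relative frequencies (assumption)
    return counts


# Toy records with made-up identifiers and sequences, for illustration only.
toy_fasta = """>toy_1 rbcL [Example plant]
MSPQTETKAS
>toy_2 psbA [Example plant]
MTAILERRES
"""

records = parse_fasta(toy_fasta)
X = [kmer_vector(seq) for _, seq in records]
y = [label_from_header(header) for header, _ in records]
```

Under these assumptions, `X` and `y` could then be passed to scikit-learn, for example `train_test_split(X, y, test_size=0.2)` followed by `RandomForestClassifier(n_estimators=300)`, to mirror the 80/20 split and 300 estimators reported above. A real labeling step would likely need more than a substring match, since some headers spell out the protein name instead of the gene symbol.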
License
Copyright (c) 2025 Armansyah Armansyah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.



