The El-Sozduk Kyrgyz–Russian Lexical Corpus, based on the Yudakhin Kyrgyz–Russian Dictionary — segmented, normalized, and enriched with linguistic metadata.
Choose the level of access that fits your evaluation or project needs.
A small introductory sample for a first look at the corpus structure.
A 1,000-segment dataset for qualified buyer review. Available under both the Standard and Premium plans.
Licensed access to the full corpus of approximately 85K structured segments, under either the Standard or the Premium plan.
Tailored datasets built to buyer specifications.
We offer two plans for the corpus, suited to different review and licensing needs.
Standard: Core schema with ready-to-use buyer-facing fields for practical review.
Premium: Expanded schema with the full tag set for deeper linguistic evaluation.
Based on the Yudakhin Kyrgyz–Russian Dictionary.
El-Sozduk is a long-running Kyrgyz digital language initiative focused on practical language resources, structured lexical data, and language technology.
The dataset initiative is led by Chorobek Saadanbekov, founder of the Kyrgyz Translate Community, which led the effort to bring Kyrgyz into Google Translate. He is a long-term builder of Kyrgyz digital language infrastructure through El-Sozduk, Kyrgyz Wikipedia, and related language technology projects.
We work with Kyrgyz lexical materials as a domain-specific team with deep familiarity with the language, its structure, and its digital use cases — not as a generic data vendor.
The El-Sozduk Kyrgyz–Russian Lexical Corpus is segmented into distinct bilingual units with linguistic metadata and reviewed through a structured pipeline.
Structured segmentation and review pipeline with versioned releases and full metadata coverage.
Fine-tuning, evaluation sets, and grounding data for models covering Kyrgyz.
Kyrgyz–Russian bilingual indexing and retrieval for search engines and RAG systems.
Parallel corpus data for low-resource MT pipelines involving Kyrgyz.
Structured dictionary data for terminology databases and lexicographic research.
Annotated Turkic language data for academic NLP and computational linguistics.
Structured data for usage pattern analysis, frequency studies, and corpus linguistics.
Data is delivered in standard formats ready for integration into your pipeline.
Corpus access and usage are provided under defined El-Sozduk terms. Broader commercial rights are discussed separately during qualified inquiries.
Tell us about your organization and how you plan to use Kyrgyz language data. We review every request and respond within 1–2 business days.
All fields marked with * are required.
We will review your inquiry and respond within 1–2 business days using the contact details you provided.