El-Sozduk develops structured Kyrgyz language datasets for professional evaluation and commercial use. Each dataset is built from authoritative lexical sources and processed through segmentation, normalization, metadata enrichment, and review.
Choose the level of access that fits your evaluation or project needs.
A 100-segment introductory dataset for initial review and technical evaluation.
A 1,000-segment dataset for qualified organizations and research teams to evaluate fit.
Broader licensed access by agreement.
Tailored datasets built to buyer specifications.
Our first corpus is a structured Kyrgyz–Russian dataset based on the Yudakhin Kyrgyz–Russian dictionary, one of the foundational lexical resources for the Kyrgyz language.
The commercial product is the El-Sozduk transformation layer: segmentation, normalization, metadata design, canonical-form derivation, derivational analysis, and reviewed buyer-ready delivery.
El-Sozduk is a long-running Kyrgyz digital language initiative focused on practical language resources, structured lexical data, and language technology.
The dataset initiative is led by Chorobek Saadanbekov, founder of the Kyrgyz Translate Community, which led the effort to bring Kyrgyz into Google Translate. He is a long-term builder of Kyrgyz digital language infrastructure through El-Sozduk, Kyrgyz Wikipedia, and related language technology projects.
We work with Kyrgyz lexical materials as a domain-specific team, not as a generic data vendor, and bring deep familiarity with the language, its structure, and its digital use cases.
Each dictionary entry is segmented into distinct bilingual units with linguistic metadata and reviewed through a structured pipeline.
Individual word meanings with Kyrgyz headword and Russian translation, preserving dictionary structure.
Real-world usage examples from the Yudakhin dictionary, paired with translations.
Phraseological units identified and classified as standalone bilingual segments.
Kyrgyz proverbs and sayings extracted with translations and cultural context.
Multi-word terms and compound expressions with correct lemma attribution.
Fixed multi-word units and collocations segmented as distinct entries.
Words with identical spelling but different meanings, separated into distinct groups.
Fine-grained meaning separation with numbered senses, metadata, and POS tags per sense.
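The segment types and metadata described above can be pictured as a single record type. The following Python sketch is illustrative only: the field names, the `SegmentType` values, and the `group_by_homonym` helper are assumptions made for this example, not the published El-Sozduk schema, and the two sample records are hand-written rather than taken from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SegmentType(Enum):
    """Hypothetical labels mirroring the segment types listed above."""
    WORD_SENSE = "word_sense"
    USAGE_EXAMPLE = "usage_example"
    IDIOM = "idiom"
    PROVERB = "proverb"
    COMPOUND_TERM = "compound_term"
    COLLOCATION = "collocation"


@dataclass(frozen=True)
class Segment:
    """One bilingual unit (assumed layout, not the actual delivery format)."""
    headword: str                # Kyrgyz lemma the segment is attributed to
    homonym_index: int           # separates identically spelled headwords
    sense_number: Optional[int]  # numbered sense, if the segment has one
    segment_type: SegmentType
    kyrgyz: str                  # Kyrgyz side of the pair
    russian: str                 # Russian side of the pair
    pos: Optional[str] = None    # part-of-speech tag for the sense


def group_by_homonym(segments):
    """Group segments by (headword, homonym_index) so homonyms stay apart."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg.headword, seg.homonym_index)].append(seg)
    return dict(groups)


# Hand-written illustration: Kyrgyz "ат" is both "horse" and "name".
sample = [
    Segment("ат", 1, 1, SegmentType.WORD_SENSE, "ат", "лошадь", pos="noun"),
    Segment("ат", 2, 1, SegmentType.WORD_SENSE, "ат", "имя", pos="noun"),
]
groups = group_by_homonym(sample)
```

Keying on the pair of headword and homonym index, rather than on spelling alone, keeps identically spelled words in separate groups while still letting each group carry its own numbered senses.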
Structured segmentation and review pipeline with versioned releases and full metadata coverage.
Fine-tuning, evaluation sets, and grounding data for models covering Kyrgyz.
Kyrgyz–Russian bilingual indexing and retrieval for search engines and RAG systems.
Parallel corpus data for low-resource MT pipelines involving Kyrgyz.
Structured dictionary data for terminology databases and lexicographic research.
Annotated Turkic language data for academic NLP and computational linguistics.
Structured data for usage pattern analysis, frequency studies, and corpus linguistics.
Qualified organizations can request a controlled evaluation sample. The Evaluation Pack is intended for technical review, internal testing, and licensing assessment. Commercial use and redistribution are not included in evaluation access.
Data is delivered in standard formats ready for integration into your pipeline.
Commercial access is provided under El-Sozduk licensing terms. Detailed rights, restrictions, and delivery scope are shared during qualified discussions.
Tell us about your organization and how you plan to use Kyrgyz language data. We review every request and respond within 1–2 business days.