Now accepting early access requests

Structured Kyrgyz language data for AI, search, machine translation, lexicography, and research

El-Sozduk develops structured Kyrgyz language datasets from authoritative lexical sources. Each source is segmented, normalized, enriched with metadata, and reviewed before delivery for professional evaluation and commercial use.

~85,000 structured bilingual segments
40K+ dictionary entries processed
8 segment types
JSONL structured delivery
Built by El-Sozduk · The largest online Kyrgyz dictionary platform · Licensing and custom dataset development available

Access Options

Choose the level of access that fits your evaluation or project needs.

Demo Sample

A 100-segment introductory dataset for initial review and technical evaluation.

  • Sample segment file (JSONL)
  • Schema documentation
  • Quick overview of data structure
  • Available on request

Commercial License

Broader licensed access by agreement.

  • Full or scoped corpus delivery
  • Licensed for commercial use
  • Versioned releases
  • Dedicated support

Custom Corpus Development

Tailored datasets built to buyer specifications.

  • Custom scope, structure, and formats
  • Additional source integration by agreement
  • Priority delivery timeline
  • Direct collaboration with the El-Sozduk team

Structured Kyrgyz–Russian Language Data

Our first corpus is a structured Kyrgyz–Russian dataset based on the Yudakhin Kyrgyz–Russian dictionary, one of the foundational lexical resources for the Kyrgyz language.

The commercial product is the El-Sozduk transformation layer: segmentation, normalization, metadata design, canonical-form derivation, derivational analysis, and reviewed buyer-ready delivery.

Why El-Sozduk

El-Sozduk is a long-running Kyrgyz digital language initiative focused on practical language resources, structured lexical data, and language technology.

The dataset initiative is led by Chorobek Saadanbekov, founder of the Kyrgyz Translate Community, which led the effort to bring Kyrgyz into Google Translate. He is a long-term builder of Kyrgyz digital language infrastructure through El-Sozduk, Kyrgyz Wikipedia, and related language technology projects.

We work with Kyrgyz lexical materials as a domain-specific team, not as a generic data vendor. We know the language, its structure, and its digital use cases in depth.

What You Get

Each dictionary entry is segmented into distinct bilingual units with linguistic metadata and reviewed through a structured pipeline.

📖

Senses

Individual word meanings with Kyrgyz headword and Russian translation, preserving dictionary structure.

💬

Usage Examples

Real-world usage examples from the Yudakhin dictionary, paired with translations.

💡

Idioms

Phraseological units identified and classified as standalone bilingual segments.

Proverbs

Kyrgyz proverbs and sayings extracted with translations and cultural context.

🔗

Compounds

Multi-word terms and compound expressions with correct lemma attribution.

📎

Stable Expressions

Fixed multi-word units and collocations segmented as distinct entries.

🔄

Homonym Groups

Words with identical spelling but different meanings, separated into distinct groups.

🎯

Sense-Level Distinctions

Fine-grained meaning separation with numbered senses, metadata, and POS tags per sense.
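
For teams planning an integration, the eight segment types above could be modeled roughly as follows. This is a minimal sketch, not the El-Sozduk schema: every field name (`headword`, `segment_type`, `kyrgyz`, `russian`, `pos`, `sense_number`), the enum string values, and the example record are assumptions made for illustration. The actual structure is described in the schema documentation supplied with the sample.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SegmentType(Enum):
    # The eight segment types named on this page; string values are assumed.
    SENSE = "sense"
    USAGE_EXAMPLE = "usage_example"
    IDIOM = "idiom"
    PROVERB = "proverb"
    COMPOUND = "compound"
    STABLE_EXPRESSION = "stable_expression"
    HOMONYM_GROUP = "homonym_group"
    SENSE_DISTINCTION = "sense_distinction"


@dataclass(frozen=True)
class Segment:
    # Hypothetical record shape for one bilingual segment.
    headword: str
    segment_type: SegmentType
    kyrgyz: str
    russian: str
    pos: Optional[str] = None
    sense_number: Optional[int] = None


def parse_segment(line: str) -> Segment:
    """Parse one JSON line into a Segment; unknown segment types raise ValueError."""
    raw = json.loads(line)
    return Segment(
        headword=raw["headword"],
        segment_type=SegmentType(raw["segment_type"]),
        kyrgyz=raw["kyrgyz"],
        russian=raw["russian"],
        pos=raw.get("pos"),
        sense_number=raw.get("sense_number"),
    )


# Illustrative record written for this sketch, not taken from the dataset.
example_line = (
    '{"headword": "китеп", "segment_type": "sense", "kyrgyz": "китеп", '
    '"russian": "книга", "pos": "noun", "sense_number": 1}'
)
segment = parse_segment(example_line)
print(segment.segment_type.name, segment.kyrgyz, "->", segment.russian)
```

Mapping the type field onto a closed enum means an unexpected segment type fails at load time rather than passing silently into downstream training or indexing jobs.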

Production-Grade Pipeline

Structured segmentation and review pipeline with versioned releases and full metadata coverage.

40K+ dictionary entries processed
~85,000 structured bilingual segments
Full metadata coverage
Human-in-the-loop (HITL) review

Who Uses Structured Kyrgyz Data?

🤖

AI & LLM Training

Fine-tuning, evaluation sets, and grounding data for models covering Kyrgyz.

🔍

Multilingual Search

Kyrgyz–Russian bilingual indexing and retrieval for search engines and RAG systems.

🌐

Machine Translation

Parallel corpus data for low-resource MT pipelines involving Kyrgyz.

📚

Lexicography & Terminology

Structured dictionary data for terminology databases and lexicographic research.

🎓

NLP Research

Annotated Turkic language data for academic NLP and computational linguistics.

📊

Language Analytics

Structured data for usage pattern analysis, frequency studies, and corpus linguistics.

Evaluation Access

Qualified organizations can request a controlled evaluation sample. The Evaluation Pack is intended for technical review, internal testing, and licensing assessment. Commercial use and redistribution are not included in evaluation access.

Delivery Formats

Data is delivered as JSONL, ready for integration into your pipeline. Other formats are available by agreement for custom corpus projects.
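
Because JSONL stores one JSON object per line, a delivery can be streamed without loading the whole file into memory. The sketch below tallies segment types and extracts Kyrgyz–Russian pairs for a parallel corpus. The field names (`segment_type`, `kyrgyz`, `russian`) and the in-memory sample records are assumptions for illustration only; the real field names come from the schema documentation.

```python
import io
import json
from collections import Counter


def iter_segments(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def parallel_pairs(segments):
    """Yield (kyrgyz, russian) tuples, skipping records missing either side."""
    for seg in segments:
        kyrgyz = seg.get("kyrgyz", "").strip()
        russian = seg.get("russian", "").strip()
        if kyrgyz and russian:
            yield kyrgyz, russian


# Stand-in for a delivered file; records are illustrative, not from the dataset.
sample = io.StringIO(
    '{"segment_type": "sense", "kyrgyz": "суу", "russian": "вода"}\n'
    '{"segment_type": "sense", "kyrgyz": "китеп", "russian": "книга"}\n'
    '\n'
    '{"segment_type": "sense", "kyrgyz": "ат", "russian": ""}\n'
)

segments = list(iter_segments(sample))
type_counts = Counter(seg["segment_type"] for seg in segments)
pairs = list(parallel_pairs(segments))

print(type_counts)
print(pairs)
```

With a real delivery, the `io.StringIO` stand-in would be replaced by a file handle opened with `encoding="utf-8"`, so that Cyrillic text is read correctly on every platform.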

Licensing

Commercial access is provided under El-Sozduk licensing terms. Detailed rights, restrictions, and delivery scope are shared during qualified discussions.

Request access or discuss a custom corpus

Tell us about your organization and how you plan to use Kyrgyz language data. We review every request and respond within 1–2 business days.

Fill in the form to request access.
  • 📧 Quick Response: We reply within 1–2 business days
  • 📦 Demo Samples Ready: Introductory data available for immediate review
  • 🤝 Flexible Terms: Licensing adapted to your use case and scale

Business Inquiry

Your information is handled confidentially and used only to process your inquiry.

📧 Email: elsozduk.kg@gmail.com
👤 Contact: Chorobek Saadanbekov
📞 Phone: +996 771 704 222