Now accepting early access requests

Structured Kyrgyz language data for AI, search, machine translation, lexicography, and research

The El-Sozduk Kyrgyz–Russian Lexical Corpus, based on the Yudakhin Kyrgyz–Russian Dictionary — segmented, normalized, and enriched with linguistic metadata.

Request the Evaluation Pack

Request sample access

~85,000

Structured bilingual segments

40K+

Dictionary entries processed

Segment types

JSONL

Structured delivery

Built by El-Sozduk · The largest online Kyrgyz dictionary platform · Licensing and custom dataset development available

Get Started

Access Options

Choose the level of access that fits your evaluation or project needs.

Free Sample

A small introductory sample for a first look at the corpus structure.

Mini segment file (JSONL)
Schema preview
For initial familiarization
Available on request

Request sample access

Recommended

Evaluation Pack

A 1,000-segment dataset for qualified buyer review. Available under the Standard or Premium plan.

1,000 structured segments
Standard or Premium plan
Full metadata coverage
For qualified organizations

Request the Evaluation Pack

Commercial License

Licensed access to the full corpus of approximately 85,000 structured segments, under the Standard or Premium plan.

~85,000 structured segments
Standard or Premium plan
Licensed for commercial use
Versioned releases
Dedicated support

Contact about corpus access

Custom Corpus Development

Tailored datasets built to buyer specifications.

Custom scope, structure, and formats
Additional source integration by agreement
Priority delivery timeline
Direct collaboration with the El-Sozduk team

Start Discussion

Corpus Plans

Standard or Premium

We offer two plans for the corpus, suited to different review and licensing needs.

Standard

Core schema with ready-to-use buyer-facing fields for practical review.

15+ structured fields (core schema)
Ready-to-use buyer-facing fields
Simplified schema for practical evaluation and ingestion
Suitable for first technical and commercial review

Request Standard details

Recommended

Premium

Expanded schema with the full tag set for deeper linguistic evaluation.

Full tag set (expanded schema)
All Standard fields plus deeper annotation layer
Expanded evaluation visibility
For deeper linguistic, lexicographic, and research-oriented evaluation

Request Premium details

The Corpus

El-Sozduk Kyrgyz–Russian Lexical Corpus

Based on the Yudakhin Kyrgyz–Russian Dictionary.

About

Why El-Sozduk

El-Sozduk is a long-running Kyrgyz digital language initiative focused on practical language resources, structured lexical data, and language technology.

The dataset initiative is led by Chorobek Saadanbekov, founder of the Kyrgyz Translate Community, which led the effort to bring Kyrgyz into Google Translate. He is a long-term builder of Kyrgyz digital language infrastructure through El-Sozduk, Kyrgyz Wikipedia, and related language technology projects.

We work with Kyrgyz lexical materials as a domain-specific team with deep familiarity with the language, its structure, and its digital use cases — not as a generic data vendor.

The Corpus

What’s in the Corpus

The El-Sozduk Kyrgyz–Russian Lexical Corpus is segmented into distinct bilingual units with linguistic metadata and reviewed through a structured pipeline.

Approximately 85,000 structured bilingual segments
Sense-level and segment-level structure
Senses, examples, compounds, phraseological units, proverbs, stable expressions
Linguistic metadata: domain, grammar, style, dialect, etymology, usage labels
Buyer-ready flat delivery with optional hierarchical mirror
Documentation pack included
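
As a rough illustration of how flat segment records with metadata might be inspected, the sketch below parses three invented JSONL lines and tallies them by segment type. The field names (segment_id, entry_id, segment_type, kg, ru, labels) and the sample values are assumptions made for this example, not the published El-Sozduk schema; the documentation pack defines the actual fields.

```python
import json
from collections import Counter

# Hypothetical records: field names and values are illustrative only,
# not taken from the El-Sozduk corpus or its schema.
SAMPLE_JSONL = """\
{"segment_id": "s1", "entry_id": "e1", "segment_type": "sense", "kg": "суу", "ru": "вода", "labels": {}}
{"segment_id": "s2", "entry_id": "e1", "segment_type": "example", "kg": "суу ичүү", "ru": "пить воду", "labels": {}}
{"segment_id": "s3", "entry_id": "e2", "segment_type": "sense", "kg": "китеп", "ru": "книга", "labels": {"style": "neutral"}}
"""


def parse_segments(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_type(segments):
    """Count how many segments fall under each (assumed) segment_type value."""
    return Counter(seg["segment_type"] for seg in segments)


segments = parse_segments(SAMPLE_JSONL)
type_counts = count_by_type(segments)
print(dict(type_counts))
```

The same tally run over a sample file gives a quick view of how senses, examples, and other unit types are distributed before deeper evaluation.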

Current Status

Production-Grade Pipeline

Structured segmentation and review pipeline with versioned releases and full metadata coverage.

40K+

Dictionary entries processed

~85,000

Structured bilingual segments

Full

Metadata coverage

HITL

Human-in-the-loop review

Applications

Who Uses Structured Kyrgyz Data?

🤖

AI & LLM Training

Fine-tuning, evaluation sets, and grounding data for models covering Kyrgyz.

🔍

Multilingual Search

Kyrgyz–Russian bilingual indexing and retrieval for search engines and RAG systems.

🌐

Machine Translation

Parallel corpus data for low-resource MT pipelines involving Kyrgyz.

📚

Lexicography & Terminology

Structured dictionary data for terminology databases and lexicographic research.

🎓

NLP Research

Annotated Turkic language data for academic NLP and computational linguistics.

📊

Language Analytics

Structured data for usage pattern analysis, frequency studies, and corpus linguistics.

Delivery

Delivery Formats

Data is delivered in standard formats ready for integration into your pipeline.

JSONL — primary flat delivery format
Hierarchical JSON — optional companion mirror
Documentation pack — README, schema, data dictionary, statistics, sample queries
Versioned releases — stable release identifiers
Private API — by agreement
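
To show how the flat JSONL delivery could relate to a hierarchical view, here is a minimal sketch that reads a JSONL file line by line and groups segments under their parent entry. It assumes each record carries an entry_id field, which is a hypothetical name chosen for this example; the file contents are invented, and the grouping is not a description of how the official hierarchical JSON mirror is produced.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a UTF-8 JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def to_hierarchy(segments, key="entry_id"):
    """Group flat segments under their parent entry (the key name is assumed)."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg[key]].append(seg)
    return [{key: entry_id, "segments": segs} for entry_id, segs in grouped.items()]


# Invented rows, written to a temporary file so the sketch runs on its own.
rows = [
    {"entry_id": "e1", "segment_type": "sense", "kg": "суу", "ru": "вода"},
    {"entry_id": "e1", "segment_type": "example", "kg": "суу ичүү", "ru": "пить воду"},
    {"entry_id": "e2", "segment_type": "sense", "kg": "китеп", "ru": "книга"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in rows),
        encoding="utf-8",
    )
    entries = to_hierarchy(iter_jsonl(path))

print(json.dumps(entries, ensure_ascii=False, indent=2))
```

Reading one line at a time keeps memory use flat, which matters little for a sample file but helps when streaming the full corpus into an indexing or training pipeline.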

Licensing

Corpus access and usage are provided under defined El-Sozduk terms. Broader commercial rights are discussed separately during qualified inquiries.

Request access or discuss a custom corpus

Tell us about your organization and how you plan to use Kyrgyz language data. We review every request and respond within 1–2 business days.

→ Fill in the form to request access.

📧 Quick Response: We reply within 1–2 business days

📦 Free Sample Ready: Introductory data available for immediate review

🤝 Flexible Terms: Licensing adapted to your use case and scale

Business Inquiry

All fields marked with * are required.

Full Name *

Company / Organization *

Role / Title

Work Email *

Country *

Access Type *

Use Case *

Message

Your information is handled confidentially and used only to process your inquiry.


📧 Email: elsozduk.kg@gmail.com

👤 Contact: Chorobek Saadanbekov