Sanskrit Travelogue Corpus

The Sanskrit Travelogue corpus unifies texts from 8 major Sanskrit digital libraries into a single, deduplicated dataset with standardised formatting. It is the largest openly available aggregated Sanskrit text corpus.

At a Glance

Unique texts: 12,395
Segments: 9,092,023
Words: 73,089,921
Transliteration: IAST (International Alphabet of Sanskrit Transliteration)
Format: TSV (tab-separated values) + Parquet
Duplicates removed: 693 texts (5.3%), 1.9M segments (17.4%)
License: CC-NC (Creative Commons Non-Commercial)

Source Collections

The corpus aggregates texts from 8 independent digitisation projects. Each collection specialises in a different subset of the Sanskrit literary tradition.

| Collection | Texts | Share | Description |
|---|---|---|---|
| SanskritDocuments | 8,614 | 69.5% | Community-contributed texts from sanskritdocuments.org. Broad coverage of stotras, puranas, upanishads, and miscellaneous works |
| Dharmanexus | 1,398 | 11.3% | Buddhist and Hindu texts from the Dharmamitra project. Richest categorisation system (77 genre codes) |
| GRETIL | 800 | 6.5% | Göttingen Register of Electronic Texts in Indian Languages. Scholarly TEI XML editions from the University of Göttingen |
| DSBC | 751 | 6.1% | Digital Sanskrit Buddhist Canon. Canonical Buddhist texts from the University of the West |
| Muktabodha | 495 | 4.0% | Muktabodha Indological Research Institute. Specialised Tantric and Shaiva texts |
| DCS | 231 | 1.9% | Digital Corpus of Sanskrit (Oliver Hellwig). Linguistically annotated texts with morphological analysis |
| SARIT | 85 | 0.7% | Search and Retrieval of Indic Texts. Scholarly critical editions of the highest editorial quality |
| YogaVaisaradi | 21 | 0.2% | Foundation dedicated to Krishnamacarya. Yoga and philosophy texts with commentaries |

Two collections excluded

CTS (Classical Text Server) is excluded due to copyright restrictions. UOH (University of Hyderabad) is excluded because it contains modern texts not suitable for a classical Sanskrit corpus.

Data Format

The corpus is distributed as two paired TSV files (or Parquet equivalents):

Metadata (metadata.tsv)

One row per text. Identifies each work and its provenance.

| Column | Type | Description | Fill Rate |
|---|---|---|---|
| text_id | int | Unique identifier | 100% |
| collection | string | Source collection name | 100% |
| title | string | Book/text title in IAST | 100% |
| author | string | Author name | 12.0% |
| category | string | Genre/category label | 38.6% |
| word_count | int | Total words in the text | 100% |
| segment_count | int | Number of segments | 100% |
| avg_segment_length | float | Words per segment | 100% |
| source | string | Source identifier | 100% |
| notes | string | Free-text notes | 3.6% |

On metadata sparsity

Author and category fill rates are limited by what the source collections provide. SanskritDocuments (69.5% of the corpus) has no author metadata. Category coverage is strongest for Dharmanexus (77-code system) and DCS (subject field). See the Metadata Enrichment section for details.

Segments (segments.tsv)

One row per text segment (verse, prose paragraph, or structural unit).

| Column | Type | Description |
|---|---|---|
| segment_id | string | {text_id}_{segment_number} |
| text_id | int | Foreign key to metadata |
| segment_number | int | Position within the text |
| text | string | Sanskrit text content in IAST |
| type | string | Segment type: verse, prose, note, text |
| chapter | string | Chapter identifier (if available) |
| section | string | Section identifier (if available) |
| verse_number | string | Verse number (if available) |
| page_number | string | Page number (if available) |
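
The relationship between the two files can be sketched with the standard library alone. The snippet below is a minimal illustration, not part of the corpus tooling: the sample rows are invented placeholders, the helper names are hypothetical, and it assumes the TSV files are unquoted and carry a header row.

```python
import csv
import io

# Invented stand-ins for metadata.tsv and segments.tsv (placeholder rows only).
METADATA_TSV = "text_id\tcollection\ttitle\n1\tGRETIL\texample-title\n"
SEGMENTS_TSV = (
    "segment_id\ttext_id\tsegment_number\ttext\ttype\n"
    "1_2\t1\t2\tsecond segment\tverse\n"
    "1_1\t1\t1\tfirst segment\tverse\n"
)


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    # Assumption: fields are not quoted, so quote characters are kept literally.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def split_segment_id(segment_id):
    """Split '{text_id}_{segment_number}' into its two integer parts."""
    text_id, segment_number = segment_id.rsplit("_", 1)
    return int(text_id), int(segment_number)


def segments_by_text(segment_rows):
    """Group segment rows by text_id, ordered by position within the text."""
    grouped = {}
    for row in segment_rows:
        grouped.setdefault(int(row["text_id"]), []).append(row)
    for rows in grouped.values():
        rows.sort(key=lambda r: int(r["segment_number"]))
    return grouped


metadata = read_tsv(io.StringIO(METADATA_TSV))
segments = read_tsv(io.StringIO(SEGMENTS_TSV))
grouped = segments_by_text(segments)

for text in metadata:
    ordered = grouped.get(int(text["text_id"]), [])
    print(text["title"], [s["segment_id"] for s in ordered])
```

For the real files, replace the `io.StringIO` objects with `open("metadata.tsv", encoding="utf-8", newline="")` and the equivalent for `segments.tsv`.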

Processing Pipeline

The corpus is built through a multi-stage pipeline:

Raw sources           Cleaned TSVs              Unified corpus
(HTML, XML,    --->   (per-collection    --->   (metadata.tsv
 TEI, ITX)             metadata.tsv              + segments.tsv)
                       + segments.tsv)                 |
                                            +----------+----------+
                                            |                     |
                                      Deduplication      Metadata Enrichment
                                      (693 texts         (author, category,
                                       removed)           word counts)
                                            |                     |
                                            +----------+----------+
                                                       |
                                                  Final corpus
                                                  (12,395 texts,
                                                   9.09M segments)

Stage 1: Download

Four automated scrapers download texts from DSBC, SanskritDocuments, UOH, and YogaVaisaradi. The remaining collections (DCS, GRETIL, Dharmanexus, Muktabodha, SARIT) are obtained from pre-existing datasets.

Stage 2: Convert

Each collection has a dedicated converter that transforms its native format (HTML, XML, TEI, ITX, plain text, CSV) into standardised TSV. All converters inherit from a shared BaseConverter class and are configured via a central config.yaml. All text is normalised to IAST transliteration.
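
One way to picture the shared-base-class design is sketched below. Only the name BaseConverter comes from the description above; the method names, the constructor, the `PlainTextConverter` subclass, and the no-op `to_iast` hook are all hypothetical, and the real class may look quite different.

```python
import csv
import io
from abc import ABC, abstractmethod


class BaseConverter(ABC):
    """Hypothetical sketch of a shared converter base class."""

    SEGMENT_COLUMNS = ["segment_id", "text_id", "segment_number", "text", "type"]

    def __init__(self, collection):
        self.collection = collection

    @abstractmethod
    def parse(self, raw):
        """Yield (text, segment_type) pairs from one raw source document."""

    def to_iast(self, text):
        # Placeholder: a real converter would transliterate to IAST here.
        return text

    def convert(self, text_id, raw, out):
        """Write one document's segments to `out` as standardised TSV."""
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(self.SEGMENT_COLUMNS)
        for number, (text, seg_type) in enumerate(self.parse(raw), start=1):
            writer.writerow(
                [f"{text_id}_{number}", text_id, number, self.to_iast(text), seg_type]
            )


class PlainTextConverter(BaseConverter):
    """Illustrative subclass: one segment per non-empty line."""

    def parse(self, raw):
        for line in raw.splitlines():
            if line.strip():
                yield line.strip(), "text"


buffer = io.StringIO()
PlainTextConverter("demo").convert(7, "first line\n\nsecond line\n", buffer)
print(buffer.getvalue())
```

Each collection-specific subclass only has to implement `parse`; the TSV layout and the transliteration step stay in one place.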

Stage 3: Merge

The CorpusMerger class incrementally merges 9 per-collection TSV files into a single unified corpus. The merger uses streaming writes and chunked reads to handle the full dataset without memory issues.
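
The streaming-write, chunked-read idea can be sketched as follows. This is not the CorpusMerger implementation: the function names, the chunk size, and the assumption that each collection's local `text_id` is shifted by a per-collection offset to make it globally unique are all illustrative.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` rows from a TSV file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def merge_segments(sources, out, chunk_size=50_000):
    """Append every source's segments to `out`, one chunk at a time.

    `sources` is an iterable of (file handle, text_id offset) pairs.
    Returns the number of segment rows written.
    """
    columns = ["segment_id", "text_id", "segment_number", "text"]
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    written = 0
    for handle, offset in sources:
        for chunk in read_chunks(handle, chunk_size):
            for row in chunk:
                new_id = int(row["text_id"]) + offset  # assumed re-numbering scheme
                writer.writerow(
                    {
                        "segment_id": f"{new_id}_{row['segment_number']}",
                        "text_id": new_id,
                        "segment_number": row["segment_number"],
                        "text": row["text"],
                    }
                )
            written += len(chunk)
    return written


first = io.StringIO("text_id\tsegment_number\ttext\n1\t1\tx\n1\t2\ty\n")
second = io.StringIO("text_id\tsegment_number\ttext\n1\t1\tz\n")
merged = io.StringIO()
total = merge_segments([(first, 0), (second, 100)], merged, chunk_size=1)
print(total)
print(merged.getvalue())
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the corpus.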

Stage 4: Deduplicate

Because the 8 source projects often digitised the same canonical works, cross-source overlap is significant. Deduplication uses three complementary pipelines:

  1. Title-based matching -- Fuzzy title comparison (SequenceMatcher at 90% threshold) with cross-transliteration normalisation
  2. LLM-verified content comparison -- Agent-based segment comparison for candidates found by lowered title thresholds (70%) and content fingerprinting
  3. GRETIL HTML cross-check -- Specialised pipeline for 682 HTML-only GRETIL texts that needed verification against the merged corpus

When two texts are confirmed duplicates, the copy from the higher-quality collection is kept, following a priority ranking: SARIT > GRETIL > Muktabodha > YogaVaisaradi > DCS > DSBC > Dharmanexus > SanskritDocuments.
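
The title comparison and the priority rule can be sketched together. The threshold and the ranking are taken from the text above; the function names and the dict-based record shape are assumptions, and titles are assumed to be already normalised before comparison.

```python
from difflib import SequenceMatcher

# Priority ranking from the text above, highest quality first.
PRIORITY = [
    "SARIT",
    "GRETIL",
    "Muktabodha",
    "YogaVaisaradi",
    "DCS",
    "DSBC",
    "Dharmanexus",
    "SanskritDocuments",
]
RANK = {name: position for position, name in enumerate(PRIORITY)}


def titles_match(title_a, title_b, threshold=0.90):
    """Return True when two normalised titles are at least `threshold` similar."""
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold


def pick_survivor(text_a, text_b):
    """Of two confirmed duplicates, keep the one from the higher-ranked collection."""
    return min((text_a, text_b), key=lambda text: RANK[text["collection"]])


a = {"title": "visnupurana", "collection": "DSBC"}
b = {"title": "visnupuranam", "collection": "GRETIL"}
if titles_match(a["title"], b["title"]):
    print("keep:", pick_survivor(a, b)["collection"])
```

A title match alone only nominates a candidate pair; per the pipeline description, content comparison still decides whether the pair is a true duplicate.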

Result: 693 texts (5.3%) and 1.9M segments (17.4%) removed.

Stage 5: Enrich Metadata

The enrichment pipeline pulls author, category, and other metadata from per-collection source files that were discarded during the merge step. It also computes text-level statistics (word count, segment count, average segment length) and exports to Parquet format.
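
The text-level statistics can be derived from the segments alone. The sketch below assumes that a "word" is a whitespace-delimited token, which the source does not specify (Sanskrit sandhi makes word boundaries a modelling choice); the function name and record shape are hypothetical.

```python
from collections import defaultdict


def text_statistics(segments):
    """Compute word_count, segment_count and avg_segment_length per text_id.

    `segments` is an iterable of dicts with at least 'text_id' and 'text'.
    Assumption: words are whitespace-delimited tokens.
    """
    word_counts = defaultdict(int)
    segment_counts = defaultdict(int)
    for segment in segments:
        word_counts[segment["text_id"]] += len(segment["text"].split())
        segment_counts[segment["text_id"]] += 1
    return {
        text_id: {
            "word_count": word_counts[text_id],
            "segment_count": segment_counts[text_id],
            "avg_segment_length": word_counts[text_id] / segment_counts[text_id],
        }
        for text_id in segment_counts
    }


sample = [
    {"text_id": 1, "text": "a b c"},
    {"text_id": 1, "text": "d"},
    {"text_id": 2, "text": "e f"},
]
stats = text_statistics(sample)
print(stats[1])
```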

Transliteration

All texts in the corpus are normalised to IAST (International Alphabet of Sanskrit Transliteration), the standard romanisation scheme for Sanskrit. The source collections use a variety of schemes:

| Scheme | Example | Used by |
|---|---|---|
| IAST | Viṣṇupurāṇa | Target format |
| Devanagari | विष्णुपुराण | Muktabodha titles, some DSBC |
| SLP1 | vizRupurARa | DCS |
| Harvard-Kyoto | ViSNupurANa | GRETIL HTML |
| ITRANS | vishhNupurANa | SanskritDocuments |

The pipeline auto-detects input transliteration schemes and converts to IAST using the indic-transliteration and aksharamukha libraries, with custom converters for non-standard schemes (Harvard-Kyoto, SV).
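
To show what a scheme conversion involves, here is a small hand-rolled Harvard-Kyoto to IAST converter using only the standard library. It is a sketch, not the pipeline's converter, and it does not use the libraries named above. The mapping lists only the Harvard-Kyoto symbols that differ from IAST, and matching is greedy longest-first, so a sequence such as `lR` is always read as vocalic ḷ.

```python
# Harvard-Kyoto symbols that differ from IAST; all other characters pass through.
HK_TO_IAST = {
    "lRR": "ḹ",
    "lR": "ḷ",
    "RR": "ṝ",
    "R": "ṛ",
    "A": "ā",
    "I": "ī",
    "U": "ū",
    "M": "ṃ",
    "H": "ḥ",
    "G": "ṅ",
    "J": "ñ",
    "T": "ṭ",
    "D": "ḍ",
    "N": "ṇ",
    "z": "ś",
    "S": "ṣ",
}
# Try longer tokens first so that 'lRR' wins over 'lR', and 'RR' over 'R'.
_TOKENS = sorted(HK_TO_IAST, key=len, reverse=True)


def hk_to_iast(text):
    """Convert a Harvard-Kyoto string to IAST by greedy longest-match."""
    out = []
    i = 0
    while i < len(text):
        for token in _TOKENS:
            if text.startswith(token, i):
                out.append(HK_TO_IAST[token])
                i += len(token)
                break
        else:
            # No mapping applies: the character is identical in both schemes.
            out.append(text[i])
            i += 1
    return "".join(out)


print(hk_to_iast("viSNupurANa"))
```

Aspirated retroflexes need no separate entries: `Th` becomes ṭ followed by an unchanged `h`.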

Deduplication Methodology

Sanskrit duplicate detection faces unique challenges:

  • Transliteration diversity -- The same title can appear as Viṣṇupurāṇa (IAST), ViSNupurANa (HK), or विष्णुपुराण (Devanagari)
  • Collection-specific naming -- Muktabodha appends IDs (ānandatantra-M00066), GRETIL prefixes authors (nāgārjuna-ratnāvalī), and Dharmanexus prefixes titles with "Unknown:"
  • Segmentation differences -- The same text may be segmented by verse, chapter, or paragraph across collections
  • Near-duplicates -- Variant editions, commentaries, and abridgements share content without being identical

Title normalisation strips collection-specific prefixes and suffixes, then removes diacritics and lowercases for cross-scheme comparison. Content fingerprinting hashes head/middle/tail segment blocks to catch duplicates with entirely different titles but identical content.
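
Both steps can be sketched with the standard library. The regular expressions below cover only the Muktabodha suffix and the "Unknown:" prefix shown in the examples above; stripping GRETIL author prefixes is omitted because it needs a list of author names. The block size, the hash function, and the function names are assumptions, not the pipeline's actual choices.

```python
import hashlib
import re
import unicodedata


def normalise_title(title):
    """Strip collection-specific affixes, remove diacritics, and lowercase."""
    title = re.sub(r"^Unknown:\s*", "", title)  # Dharmanexus-style prefix
    title = re.sub(r"-M\d+$", "", title)        # Muktabodha-style ID suffix
    decomposed = unicodedata.normalize("NFD", title)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def content_fingerprint(segments, block=3):
    """Hash the head, middle and tail blocks of a text's segment list."""
    middle_start = max(0, len(segments) // 2 - block // 2)
    sample = (
        segments[:block]
        + segments[middle_start:middle_start + block]
        + segments[-block:]
    )
    # Collapse whitespace so trivial spacing differences do not change the hash.
    joined = "\n".join(" ".join(segment.split()) for segment in sample)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


print(normalise_title("ānandatantra-M00066"))
print(normalise_title("Viṣṇupurāṇa") == normalise_title("ViSNupurANa"))
```

Removing diacritics and lowercasing makes the IAST and Harvard-Kyoto spellings of this title compare equal. A segment-level fingerprint like this one would not catch copies that are segmented differently, which is one reason the pipelines are complementary.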

Metadata Enrichment

The raw merged corpus has very sparse metadata (5.5% author fill rate, 11.3% category fill rate). The enrichment pipeline recovers metadata from per-collection source files, raising these to the 12.0% and 38.6% reported in the metadata table:

| Collection | Key contribution |
|---|---|
| DCS | Author (128 texts), dates (208), subject (227) |
| Dharmanexus | Category codes (1,398 texts) with 77-code classification system |
| GRETIL | Author (437 texts) from TEI metadata |
| DSBC | Limited (sparse source metadata) |
| Muktabodha | Author (176) via catalog number matching |
| SARIT | Author and title from nested TEI/JSON metadata |
| SanskritDocuments | Genre (2,620) inferred from folder structure |
| YogaVaisaradi | Static category ("Yoga") |

Dharmanexus Category System

The richest categorisation comes from Dharmanexus, which uses a hierarchical 77-code system:

| Prefix | Domain | Examples |
|---|---|---|
| K | Buddhist Scripture | K01 Vinaya, K10 Sutra, K14 Tantra |
| T | Buddhist Treatise | T01 Stotra, T04 Madhyamaka |
| GV | Veda | GV00 Samhita, GV04 Upanishad |
| GE | Epic | GE07 Mahabharata, GE09 Ramayana |
| GP | Purana | GP10 Bhagavata, GP12 Other |
| GR | Religion | GR12 Jaina, GR13 Shaiva, GR14 Vaishnava |
| GK | Literature | GK16 Poetry, GK19 Drama |
| GS | Shastra | GS24 Grammar, GSP Philosophy |
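
Because the prefix identifies the domain, a category code can be mapped to its domain with a longest-prefix lookup. The table contents below come from the list above; the function name is hypothetical, and codes outside the listed prefixes simply return None.

```python
# Prefix-to-domain mapping taken from the table above.
DOMAINS = {
    "K": "Buddhist Scripture",
    "T": "Buddhist Treatise",
    "GV": "Veda",
    "GE": "Epic",
    "GP": "Purana",
    "GR": "Religion",
    "GK": "Literature",
    "GS": "Shastra",
}
# Check two-letter prefixes before one-letter ones.
_PREFIXES = sorted(DOMAINS, key=len, reverse=True)


def domain_for(code):
    """Return the domain for a category code such as 'K10', or None if unknown."""
    for prefix in _PREFIXES:
        if code.startswith(prefix):
            return DOMAINS[prefix]
    return None


for code in ("K10", "GE07", "GSP"):
    print(code, "->", domain_for(code))
```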

Corpus Statistics

Collection distribution

SanskritDocuments  ████████████████████████████████████  69.5%
Dharmanexus        ██████  11.3%
GRETIL             ███  6.5%
DSBC               ███  6.1%
Muktabodha         ██  4.0%
DCS                █  1.9%
SARIT              0.7%
YogaVaisaradi      0.2%

Text sizes

The corpus spans a wide range of text sizes, from short stotras (devotional hymns) with a handful of verses to massive epics with tens of thousands of segments.

Acknowledgements

This corpus aggregates texts from multiple digitisation projects. All credit for the original texts and their digital editions goes to:

  • GRETIL (Göttingen Register of Electronic Texts in Indian Languages) -- University of Göttingen
  • SARIT (Search and Retrieval of Indic Texts) -- Collaborative scholarly project
  • Muktabodha Indological Research Institute -- Digital library of Sanskrit texts
  • Digital Sanskrit Buddhist Canon (DSBC) -- University of the West
  • Digital Corpus of Sanskrit (DCS) -- Oliver Hellwig
  • Dharmanexus -- Dharmamitra project
  • sanskritdocuments.org -- Community-contributed texts
  • University of Hyderabad -- Sanskrit text collection
  • Yoga Vaisharadi -- Foundation dedicated to Krishnamacarya