Economic historians use market integration as a key measure of economic development (Shiue and Keller 2007; Studer 2008). Although language barriers have been stressed in the macroeconomic literature as inhibiting trade and the diffusion of technology (Spolaore and Wacziarg 2009; Guiso, Sapienza, and Zingales 2009), the role of these barriers in market integration within countries, particularly in the developing world, has received comparatively little attention, despite the sizable economic impacts that they can have in other contexts (Spolaore and Wacziarg 2018; Ashraf and Galor 2013). In this article, we consider the economy of colonial India, in which a large number of dissimilar languages prevail. In particular, we ask: Do market pairs that are more linguistically distant display less market integration, conditional on physical distance and other measures of dissimilarity?
We collect data from Wages and Prices in India on grain and salt prices for 206 South Asian markets between 1861 and 1921. These markets span the territories of modern-day Bangladesh, Burma, India, and Pakistan. We merge these markets to populations by language collected from the 1901 colonial census of India. We map these languages into 257 ISO language codes from Ethnologue, which also provides us with language trees. Taking the correlation coefficient between the price series at a pair of markets i and j, we show that, conditional on physical distance, religious distance, dissimilarities in geography, and fixed effects for markets i and j, prices at i and j are less correlated if i and j are more linguistically distant. Our estimates suggest that two markets with unrelated languages will, compared to two markets sharing a common tongue, have correlation coefficients that are 0.067 less in the case of wheat, 0.189 less in the case of salt, and 0.035 less in the case of rice, relative to means of 0.81 (wheat), 0.54 (salt), and 0.81 (rice) across all market pairs in the data. These are large relative to the coefficients we estimate for physical distance, and suggest a possible role for cultural distance in raising trade costs, even for relatively low-value, homogeneous goods.
In assessing the mechanisms that link linguistic distance to market integration, we turn to both the economic literature and to the history of colonial India. Linguistic distances need not matter exclusively for market integration through language; that is, language itself is one of many imperfect measures of broader ancestral distance. This concept may include shared history, institutions, culture, and norms, among other characteristics (Spolaore and Wacziarg 2016). Language barriers may represent more general barriers to the transmission of vertical traits (Spolaore and Wacziarg 2009, 2018). They may capture differences in tastes, and hence the presence or absence of certain markets (Atkin 2013, 2016). They may affect the costs of information transmission and coordination (Gomes 2014). They may otherwise affect trade costs through interaction, migration, business connections, conflict, or xenophobia (Bai and Kung 2020; Laval, Patin, and Rueda 2016; Rauch and Trindade 2002). They may work through costs of language or education acquisition (Isphording and Otten 2014; Jain 2017; Laitin and Ramachandran 2016; Shastry 2012). They may correlate with common preferences for public goods, redistribution, and infrastructure (Desmet, Gomes, and Ortuño-Ortín 2020; Desmet, Ortuño-Ortín, and Wacziarg 2012, 2017).
To assess which of these explanations may account for our results, we assemble data from a wide range of primary and secondary sources. We show that market pairs that are more linguistically distant from each other are also more genetically distant, but that this summary measure of barriers to the diffusion of technological and institutional innovations is not itself a sufficient statistic for the coefficient on linguistic distance. We find little evidence that linguistic distance predicts missing markets or fewer shared trading communities. Historical differences in literacy across market pairs do correlate with linguistic distance, but do not fully account for its correlation with price integration. Although more linguistically similar market pairs evidence longer periods of time connected to the colonial railway system, this fails to explain away the correlation. Thus, while linguistic distance may have operated in part as a marker of other population differences, as a barrier to the acquisition of similar levels of human capital, and as a barrier to the co-acquisition of public goods that facilitated trade, not one of these mechanisms can fully account for the barriers of linguistic cleavages.
Our article contributes principally to two literatures. The first investigates the role of linguistic distance, in particular, and cultural distances, more broadly, in shaping economic outcomes. Linguistic similarity predicts greater trade between countries (Melitz and Toubal 2014; Hutchinson 2005; Egger and Lassmann 2012; Anderson and Van Wincoop 2004). More generally, linguistic, religious, and cultural distances across societies correlate with ancestral distance and predict a wide range of economic outcomes (Spolaore and Wacziarg 2018). Within Indian economic history, social divisions of language, caste, and religion have been particularly salient. Industrial segregation was driven by information sharing within ethnolinguistic communities (Gupta 2014). Caste and religious divisions, as well as the preferences of caste, ethnic, and religious elites, contributed to reduced spending on schooling, which had effects that persisted until the 1970s (Chaudhary 2009; Chaudhary et al. 2012; Chaudhary and Garg 2015).
Second, we contribute to a literature on market integration and trade. Building on works such as Persson (1999) and Shiue and Keller (2007), several contributions in economic history have measured price integration across markets to compare levels of economic development across regions (Studer 2008; O’Rourke and Williamson 2002; Federico 2011). Footnote 1 In the study of Indian economic history, Persaud (2019) has shown that price volatility mattered by spurring international migration. More generally, our work is related to a broader literature on the evolution of trade and market integration throughout history (Pascali 2017; Jacks, Meissner, and Novy 2008; Estevadeordal et al. 2003).
We also make a substantial data contribution, digitizing both detailed language data from the colonial census and price data spanning a wider set of markets and commodities (68,181 observations) than addressed by the work of Allen (2007), Andrabi and Kuehlwein (2010), or Studer (2008).
The most similar studies to ours, Falck et al. (2012) and Lameli et al. (2015), use dialect similarity within Germany to predict intra-regional trade and migration. Our work differs from these in several respects. Notably, the linguistic cleavages existing in India are greater than those among the often mutually intelligible dialects of German. We consider possible roles of genetic distance Footnote 2 and transport investment. Finally, we provide evidence from a large and multilingual developing country, cover a longer time period, examine price integration as an outcome, and use a more spatially disaggregated unit of analysis.
HISTORICAL BACKGROUND
Language in South Asia
There are four language families prominently represented in South Asia: Indo-European, Dravidian, Sino-Tibetan, and Austro-Asiatic (Asher 2008). Prior to the arrival of Indo-European languages roughly 3,500 years ago, the sub-continent was predominantly Dravidian-speaking (Asher 2008).
Almost half the world’s population speaks an Indo-European language descended from the protolanguage that originated at least 6,000 years ago in eastern Anatolia (Gamkrelidze and Ivanov 1990). These spread throughout Europe and South Asia through both population movement and replacement of languages used by existing populations (Renfrew 1989; Haak et al. 2015). Most speakers of Indo-European languages in South Asia speak Indo-Aryan languages such as Hindi and Bengali. Indo-Aryan languages date back at least as far as 100 BCE (Asher 2008; Emeneau 1956). The principal Dravidian languages became separated no later than 1000 CE, the main literary languages being Telugu, Kannada, Tamil, and Malayalam (Asher 2008). Tamil cave inscriptions date to the second century BCE, Malayalam inscriptions to the ninth century CE, Kannada inscriptions to 450 CE, and Telugu place names to the second century CE (Krishnamurti 2003). Austro-Asiatic languages, divided primarily into the Mon-Khmer and Munda branches, predate the Indo-European languages in South Asia, and may have been present as long as the Dravidian languages (Asher 2008). The small number of Sino-Tibetan speakers in South Asia speak primarily Tibeto-Burman languages (Asher 2008).
Within India, the presence of multiple languages has been shaped by population movements and divergence of relatively isolated speakers (Asher 2008). The rapid adoption of Indo-European languages suggests these had been adopted by the broader Dravidian-speaking community as a lingua franca (Krishnamurti 2003), although the Dravidian boundary has been shifting southwards for a very long time, and Dravidian languages were largely absent from the Gangetic valley by 0 CE (Emeneau 1956). Languages in close proximity to each other have influenced each other (Montaut 2005, p. 91). Malayalam uses several Sanskrit words, inflected words, and phrases (Krishnamurti 2003). Indian languages borrow from each other through extensive bilingualism, and Indo-European and Dravidian languages have had grammatical impacts on each other (Krishnamurti 2003; Emeneau 1956). A particular feature of India is the durability of migrant languages, for example, the continued use of Gujarati by communities that have lived in Tamil Nadu for several centuries (Montaut 2005, p. 94).
Markets in Colonial India
The secondary literature on Indian history provides some information on how local prices of foodgrains were determined. Andrabi and Kuehlwein (2010) cite figures demonstrating that production was regionally concentrated, and that most food grains were largely consumed within India. For example, in 1919, the Punjab and the United Provinces accounted for 70 percent of the acreage devoted to growing wheat, while Bengal, Bihar, Orissa, and Madras accounted for 70 percent of the acreage devoted to growing rice. Only 5 percent of wheat and 7 percent of rice was exported beyond India in 1895. Exchange even within India was limited. The non-monetary sector of the economy was large (Kumar 1983), even in 1950 (Chandavarkar 1983).
At the start of our period, 1861, trade costs were high. Land transport was expensive and slow, with food grains largely hauled by oxen walking along dilapidated roads and carrying loads on their backs or in carts (Bhattacharya 1983). In Western India, for example, where few roads existed, trade relied on donkeys, camels, and bullocks (Divekar 1983). Intraregional trade in low-value commodities was possible along rivers, but access to this trade was spatially limited (Derbyshire 1987). Bullocks required a year to travel the distance that a railway would later cover in a week (McAlpin 1974). Where a lack of roads made wheeled transportation difficult, caravans carried cotton and grain (Roy 2012). Large-scale, long-distance shipments of grain were generally unprofitable (Hurd 1975). The costs of overland transport limited market integration (Kessinger 1983). Migration rates were low and wage convergence among districts over the nineteenth century was slow (Collins 1999). Speed, cost, and seasonality constrained the geographical scope of the commercial orbit of the United Provinces (Derbyshire 1987).
These costs fell during the 60-year time period of our analysis. The telegraph network spread through India in the 1850s and 1870s (Collins 1999). Increasing commercialization benefitted from the replacement of the fragile military occupation with settled governance, a growing market for raw materials in Europe, and infrastructural improvements such as canal irrigation, metalled roads, and railway construction (Derbyshire 1987; Kumar 1983). The railways, in particular, reduced price dispersion across markets (Hurd 1975), increased incomes (Donaldson 2018), and reduced famines (Burgess and Donaldson 2010); they are likely to have also increased price co-movement across districts. Price dispersion fell more rapidly for cash crops such as cotton than for food grains (McAlpin 1974). Andrabi and Kuehlwein (2010) find evidence of trade in grain from districts that lacked railroads to neighboring districts with rail connections.
How did markets themselves work? Bhattacharya (1983) describes prototypical local market places in Eastern India in which farmers sold directly to consumers and middlemen in small quantities, and itinerant traders made small profits exploiting price differences within limited areas. Large farmers served as links among village markets and larger towns by buying grain from smaller farmers through credit contracts, holding stock while waiting for a favorable market, and taking grain to the mart or river mart offering the best price. Merchants’ agents played a similar role. Larger towns gave rise to a stratified system of retail sellers, wholesale merchants, and those who bought from wholesalers and sold to retailers. Divekar (1983), Kumar (1983), and Kessinger (1983) provide similar descriptions for other regions of India in the first half of the nineteenth century.
Later in the century, commission agents and buyers’ agents operated in towns that contained railway stations and banks (Roy 2014). They owned capital such as carts, grain pits, and warehouses. Commission agency and auction-type sales were prevalent. Company agents contracted with farmers in the villages, while landlords and others lent money to these farmers and were repaid in grain that they also sold to the commission and buyers’ agents. In more remote areas, itinerant traders, including peasants, brought crops to bazaars. At this time, forward trade seldom occurred. Europeans were largely absent from this trade, particularly from local transactions, although they were occasionally company agents and commission agents in railway towns. This helps explain why Europeans, sharing a common language, did not do more to drive market integration and may help explain our results.
Generally, prices in local markets correlated with fluctuations in the overall Indian money supply (Adams and West 1979). Prices were typically lower in producing regions (Andrabi and Kuehlwein 2010). On average, prices rose slowly through the nineteenth century and rapidly during WWI (McAlpin 1983).
Language in Markets in Colonial India
The languages used in trade varied from market to market, depending on which trading castes were dominant in each location. These are often described in the Imperial Gazetteers for each province. Footnote 3 In the Punjab, for example, the multilingual Banias, Khatris, and Aroras who spoke local languages such as Punjabi and Gujarati were dominant in different parts of the province. Predominantly Urdu-speaking Shaikhs and largely Gujarati-speaking Khojas were also important (p. 49). In wheat markets, cultivators themselves traded directly with exporters (p. 87). In Bengal, much of the trade was in the hands of Marwari Agarwals and Oswals, who might often speak local languages. Hindi-speaking Rauniars and Kalwars were more prominent in Bihar (p. 91). In Madras, the Tamil-speaking Chettis and Telugu-speaking Komatis controlled trade in the districts where these languages dominated. Traders themselves were, however, often multilingual, and changed the language used depending on the market. As Montaut (2005, p. 94), drawing on Pandit (1977), puts it:
The classic example is of the Gujarati merchant one century ago, who uses Kacchi (a dialect of Gujarati) in the local market, Marathi for wider transactions in the region, standard Gujarati for readings, Hindustani when he travels (railway station), Urdu in the mosque, with some Persian and Arabic, but also sant bhasha in devotional songs, his variety of Gujarati for family interaction, English when dealing with officials.
EMPIRICAL STRATEGY AND DATA
Empirical Strategy
In this article, we use price data covering M South Asian markets. Each observation is a market-pair, indexed ij. For product p, traded between markets i and j, we estimate:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn1.png?pub-status=live)
In Equation (1), \(\rho_{ij}^p\) is the correlation coefficient for the price of p between markets i and j. \(LinguisticDistance_{ij}\), described later, captures linguistic distance between the two markets.
\(x_{ij}^p\) is a vector of controls. We use this to account for a wide set of dissimilarities between i and j that may correlate with linguistic distance and with the degree of price integration. In our baseline estimations, \(x_{ij}^p\) includes a constant, as well as controls for proximity (log distance in kilometers between the markets, whether both markets are coastal, and whether both markets are connected by the same river), geographic similarity (the correlations in precipitation and temperature between the markets, and their absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, and terrain slope), agricultural similarity (absolute differences in suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, or tomato), other measures of similarity (whether the markets are in the same province, and their religious distance), and characteristics of the data (first year, last year, and the number of years in which the price is available for both markets).
One limitation of our empirical strategy is the possibility that our control variables are measured with greater error than our principal right-hand-side variable of interest, that is, \(LinguisticDistance_{ij}\). This could lead to our estimates of \(\beta^p\) being overstated. We note, then, that linguistic distance may be interpreted more broadly, for example, as a measure of greater ancestral distance. \(\delta_i^p\) and \(\eta_j^p\) are fixed effects for market i and market j. The sample is all market pairs ij such that i ≠ j, i > j, and there are sufficient observations to compute \(\rho_{ij}^p\). That is, we have at most \(\frac{M^2 - M}{2}\) observations in any one regression. We cluster standard errors by both market i and market j in the baseline (Cameron, Gelbach, and Miller 2011). Because of the possible spatial dependence induced by forming every pairwise combination of markets, we show results in the Online Appendix in which we cluster at alternative levels and compute Conley (1999) standard errors.
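Because Equation (1) survives here only as an image, it may help to write out the specification that the definitions above imply. The following is a reconstruction under stated assumptions, not a transcription: the coefficient vector \(\gamma^p\) and the error term \(\varepsilon_{ij}^p\) are labels introduced for illustration, since the text names only \(\rho_{ij}^p\), \(LinguisticDistance_{ij}\), \(x_{ij}^p\), \(\beta^p\), \(\delta_i^p\), and \(\eta_j^p\).

\[
\rho_{ij}^p = \beta^p \, LinguisticDistance_{ij} + {x_{ij}^p}' \gamma^p + \delta_i^p + \eta_j^p + \varepsilon_{ij}^p
\]

Under this form, a negative \(\beta^p\) means that prices of product p co-move less in market pairs that are more linguistically distant, holding the controls and both market fixed effects constant.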
Data
We use several sources of data. We discuss our sources for prices in colonial India, for linguistic distance across markets, and for our additional controls.
PRICES
Our data on prices are taken from three editions (1921, 1907, and 1885) of Wages and Prices in India. These are initially reported in sers per rupee: we invert this measure to obtain nominal prices. For 206 markets in modern-day Pakistan, India, Bangladesh, and Burma, these data provide prices for more than a dozen crops: Arhar Dal, Bajra, Barley, Gram, Jawar, Kangni, Maize, Marua, Rice, Salt, Wheat, Bulrush Millet and Similar, Great Millet and Similar, and Lesser Millets. The data cover both British India and the Princely States. These do not represent all markets in India, since almost every populated place would have a market of some sort. Rather, these are markets in which the colonial government collected price data. More populous districts and districts in British India are more likely to appear in the data, and, in provinces such as Coorg that have few districts, at least one district is likely to be present.
In most of our results, we focus on the three most commonly reported prices: rice, wheat, and salt. The data do not allow us to consider differences between different varieties of wheat or salt. However, we also show that estimates of Equation (1) with several other crops produce similar results. The price data cover the period 1861 through 1921, with many markets entering our data for the first time in 1869. While the data-collection methods differed across markets in early years, from 1872 onwards uniform fortnightly returns of retail prices were used. Footnote 4
So long as there are at least three years in which a price is reported in both markets i and j, we can compute a correlation coefficient for that product for the ij pair. This quantity, \(\rho_{ij}^p\), is our principal dependent variable.
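The construction of the dependent variable can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used: the function names, the dictionary layout, and the choice of a Pearson correlation over raw annual price levels are all assumptions, and the only details taken from the text are the inversion of sers per rupee and the three-common-year threshold.

```python
from itertools import combinations
from math import sqrt

MIN_COMMON_YEARS = 3  # threshold stated in the text


def to_price(sers_per_rupee):
    """Invert a quantity-per-rupee quote into a nominal price (rupees per ser)."""
    return 1.0 / sers_per_rupee


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def pair_correlations(quotes):
    """Map {market: {year: sers_per_rupee}} to {(i, j): rho} for one product.

    Each unordered pair appears once; pairs with too few common years are dropped.
    """
    result = {}
    for i, j in combinations(sorted(quotes), 2):
        years = sorted(set(quotes[i]) & set(quotes[j]))
        if len(years) < MIN_COMMON_YEARS:
            continue
        rho = pearson([to_price(quotes[i][y]) for y in years],
                      [to_price(quotes[j][y]) for y in years])
        if rho is not None:
            result[(i, j)] = rho
    return result


# Hypothetical quotes for three made-up markets (not taken from the data).
example = {
    "A": {1870: 20.0, 1871: 10.0, 1872: 5.0},
    "B": {1870: 40.0, 1871: 20.0, 1872: 10.0},
    "C": {1870: 10.0, 1871: 12.0},
}
correlations = pair_correlations(example)
```

In this made-up example, markets A and B have perfectly proportional prices, while every pair involving C is dropped because C shares only two years with the others.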
In Figure 1, we provide intuition for our results by mapping the correlation between the price of rice in a single market, the largely Punjabi-speaking city of Ludhiana, with the price of rice in all other markets in our data. It is clear from the figure that rice prices track those in Ludhiana more closely in regions that speak more closely-related languages such as Hindi and Gujarati and less closely in regions that speak more distantly-related languages such as Burmese and Telugu. These regions are, however, also closer in physical proximity to Ludhiana, and many of the markets that most closely track prices in Ludhiana lie on the Indo-Gangetic Plain. Thus, our analysis relies on estimation of Equation (1) to demonstrate that the correlation between linguistic distance and price integration cannot be explained away by other observable differences in proximity or geography.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig1.png?pub-status=live)
Figure 1 LUDHIANA: RICE PRICE CORRELATIONS
Source: Wages and Prices in India.
LINGUISTIC DISTANCE
To compute linguistic distances among the markets in our data, we use two additional data sources. These are the 1901 Census of India and version 19 of the Ethnologue Global Dataset. For each district that existed in 1901, the census data report the number of speakers of each language. For example, the three most commonly spoken languages reported for Ludhiana District are “Punjabi” (665,476), “Hindostani” (2,970), and “Kashmiri” (1,224). We assign each market to the language composition of the district that contained it in 1901. For consistency with the Ethnologue data on distances, we aggregate these to the level of ISO language codes. For Ludhiana, the three most commonly spoken languages become pan, hin, and kas. The data do not, unfortunately, mention second languages.
To compute the distances among these languages, we turn to Ethnologue. Every language in this source is categorized using a language tree with a maximum number of 15 branches. These classifications are based on several sources, the most important of which is Frawley (2003). Such “cladistic” measures have become widely used in economics (Desmet, Ortuño-Ortín, and Wacziarg 2012; Gomes 2014). Footnote 5
Following Esteban, Mayoral, and Ray (2012), we take the distance \(d_{mn}\) between any two languages m and n as:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn2.png?pub-status=live)
Similarly following Esteban, Mayoral, and Ray (2012), we choose δ = 0.05 as a baseline and use δ = 0.5 for robustness. To aggregate these to distances among markets, given population shares \(s_{mi}\) and \(s_{nj}\) of languages m and n in districts i and j, we follow Spolaore and Wacziarg (2009) and compute linguistic distance among districts as:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn3.png?pub-status=live)
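Equations (2) and (3) are available here only as images, so the sketch below rests on assumptions about their form. It assumes that the language-pair distance is one minus the ratio of shared branches to the maximum number of branches (15), raised to the power δ, which is the functional form used in Esteban, Mayoral, and Ray (2012), and that the market-pair distance is the share-weighted sum of language-pair distances. The toy trees are hypothetical: only the three branches shared by Punjabi and Bengali are taken from the text, and the final element of each tuple is a placeholder for the lower branches.

```python
MAX_BRANCHES = 15  # maximum tree depth stated in the text

# Hypothetical, truncated classifications keyed by ISO code.
TREES = {
    "pan": ("Indo-European", "Indo-Iranian", "Indo-Aryan", "pan"),
    "ben": ("Indo-European", "Indo-Iranian", "Indo-Aryan", "ben"),
    "tam": ("Dravidian", "tam"),
}


def shared_branches(tree_m, tree_n):
    """Count branches shared from the root down to the first point of divergence."""
    count = 0
    for a, b in zip(tree_m, tree_n):
        if a != b:
            break
        count += 1
    return count


def language_distance(tree_m, tree_n, delta=0.05):
    """Assumed form: 1 - (shared / MAX_BRANCHES) ** delta; identical trees get 0."""
    if tree_m == tree_n:
        return 0.0  # assumption: a language is at distance zero from itself
    return 1.0 - (shared_branches(tree_m, tree_n) / MAX_BRANCHES) ** delta


def market_distance(shares_i, shares_j, trees, delta=0.05):
    """Share-weighted sum of language distances between two districts."""
    return sum(
        s_mi * s_nj * language_distance(trees[m], trees[n], delta)
        for m, s_mi in shares_i.items()
        for n, s_nj in shares_j.items()
    )


# Two made-up districts: one half Punjabi and half Tamil, one entirely Punjabi.
mixed_vs_punjabi = market_distance({"pan": 0.5, "tam": 0.5}, {"pan": 1.0}, TREES)
```

With a small δ such as 0.05, most of the variation in distance comes from whether two languages share any branch at all, so unrelated languages sit at exactly 1 while languages that share even a few branches are already far closer to 0.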
In Figure 2, we map the linguistic distances among every district in our data and Ludhiana. While it is evident that the markets at which languages more closely related to Punjabi are spoken are geographically close to Ludhiana, it is also clear that this correlation of linguistic and physical distance is not perfect. Distances change relatively rapidly over space when the linguistic composition of the population similarly changes rapidly. Further, regions that are relatively similar in physical distance can be quite dissimilar in their linguistic distance. Punjabi and Bengali, for example, share the branches Indo-European, Indo-Iranian, and Indo-Aryan. Punjabi and Tamil, by contrast, share no branches, as Tamil is a Dravidian language. And yet the distance between the Punjab and Bangladesh is not markedly different from the distance between the Punjab and Tamil Nadu. The log distance in kilometers between Ludhiana and Dacca is 7.40, whereas it is 7.76 between Ludhiana and Madurai.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig2.png?pub-status=live)
Figure 2 LUDHIANA: LINGUISTIC DISTANCES
Source: Census of India 1901.
ADDITIONAL CONTROLS
Some of our control variables are computed directly. Distance in kilometers is computed using the latitude and longitude of the market. “Both coastal” and “both connected by the same river” indicators are computed in ArcMap using a shapefile of district boundaries. “Minimum year,” “maximum year,” and “number of common observations” are computed directly from the price data.
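As an illustration of how a distance in kilometers can be obtained from latitude and longitude, the sketch below uses the haversine great-circle formula. The text does not say which formula was used, so this choice is an assumption, as are the mean Earth radius and the approximate city coordinates.

```python
from math import asin, cos, log, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates (latitude, longitude), supplied for illustration only.
LUDHIANA = (30.90, 75.85)
DHAKA = (23.81, 90.41)
MADURAI = (9.93, 78.12)

log_km_dhaka = log(haversine_km(*LUDHIANA, *DHAKA))
log_km_madurai = log(haversine_km(*LUDHIANA, *MADURAI))
```

With these approximate coordinates, the two log distances come out close to the 7.40 and 7.76 reported for Ludhiana to Dacca and Ludhiana to Madurai.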
The “same province” indicator is based on the provinces that contained each market in 1901. The “religious distance” variable is computed using the same equation as Equation (3), taking the religious composition of each district as reported in Table 8 of the 1921 Census (Literacy By Religion). We assume that the distance d qr between any religion q and r is 1 if q ≠ r and 0 if q = r. Footnote 6
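If Equation (3) takes the share-weighted form used by Spolaore and Wacziarg (2009), which is an assumption here since the equation is available only as an image, the 0/1 distance between religions gives religious distance a simple closed form. Writing \(s_{qi}\) for the population share of religion q in district i (notation introduced for illustration), and using the fact that shares sum to one within each district:

\[
ReligiousDistance_{ij} = \sum_{q} \sum_{r} s_{qi} \, s_{rj} \, \mathbf{1}[q \neq r] = 1 - \sum_{q} s_{qi} \, s_{qj}
\]

In words, the measure is the probability that a randomly drawn person from district i and a randomly drawn person from district j report different religions.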
Data on land quality are taken from Ramankutty et al. (2002) and have been used in several economic studies, such as Michalopoulos (2012) and Ashraf and Galor (2011). Footnote 7 It is an index based on soil and climate characteristics and is not particular to any one type of agriculture. “Ruggedness” is the measure of terrain ruggedness initially introduced by Nunn and Puga (2012). Footnote 8 Our measure of “malaria prevalence” was originally created by Kiszewski et al. (2004). Footnote 9 Altitude data are taken from the Consultative Group for International Agricultural Research’s Shuttle Radar Topography Mission 30 dataset. Footnote 10 Means of precipitation, temperature, and suitabilities for specific crops are taken from the Food and Agriculture Organization’s Global Agro-Ecological Zones data portal. Footnote 11 Similar suitability measures have been used by Alesina, Giuliano, and Nunn (2013) and Alsan (2015). Correlations in rainfall are computed using the Matsuura and Willmott (2007) gridded series. Footnote 12 We join each market to the nearest point in these data and compute correlations in annual rainfall over the period 1900–2000. Humidity data are taken from the Climatic Research Unit at the University of East Anglia. Footnote 13
Like many studies that control for geographic confounders with historical outcome variables, we are compelled to use present-day raster data (e.g., Alsan 2015; Nunn and Puga 2012). We expect that this will add measurement error to our right-hand-side variables, but that it is unlikely this measurement error will induce spurious correlation between linguistic distance and market integration. For the variables that require geographic data (i.e., the coastal and river indicators, as well as those using raster data), we begin with a district map for modern India. Footnote 14 We compute the coastal and river indicators at this level, and compute other geographic variables by averaging over raster points within a district. If a market in our data shares the name of a modern-day district (or an updated name, as in the case of Benares and Varanasi), we have a unique match between the market and the modern district polygon. Otherwise, we match all districts that split from the erstwhile district that previously shared the name of the market to that market.
Summary Statistics
Summary statistics are presented in Table 1. Some general patterns are apparent from this table. First, relative to a maximum number of observations of \(\frac{206^2 - 206}{2} = 21{,}115\), we typically have fewer pairwise correlation coefficients. This is because not all products are traded in all markets. Second, while the degree of price integration is relatively high (>0.8 for both wheat and rice), there is variation in price integration both across space and across markets. Some market pairs exhibit negative price correlations. Market integration is more limited for salt than for rice and wheat; the average price correlation for salt (0.54) is lower, and more than a quarter of these correlations are negative. One possible explanation of this lower correlation is the limited number of inland production sites for salt; this limits arbitrage opportunities in response to shocks, causing lower average salt price correlations across markets. Linguistic distances range from close to 0 (i.e., market pairs in which both markets are dominated by the same language) to 1 (i.e., market pairs in which the dominant languages spoken are unrelated).
Table 1 SUMMARY STATISTICS
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab1.png?pub-status=live)
Source: See the text.
RESULTS
Results by Market
Before presenting estimates of Equation (1), we present preliminary descriptive evidence. Footnote 15 For each market i in our data, we estimate:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn4.png?pub-status=live)
In Equation (4), \[{\rho _{ij}}^p\] and \[{x_{ij}}^p\] are defined as in Equation (1). For each market i, we obtain a coefficient \[{\beta _i}^p\] that captures the degree to which its prices more closely track prices at other markets that are more linguistically similar, conditional on other measures of distance and dissimilarity.
To present these results, we order markets from those with the most negative estimates of \[{\beta _i}^p\] to those with the most positive estimates and present the point estimates and 95 percent confidence intervals in Figures 3, 4, and 5. For each of the three major crops, the majority of coefficients are negative and significant. This demonstrates two points. First, our main results pooling together all market pairs are not driven by a small number of markets. Second, Equation (1) yields estimates of \[{\beta ^p}\] that capture a central tendency in the sample.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig3.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig4.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig5.png?pub-status=live)
Main Results
In Table 2, we present our main estimates of Equation (1). Across the three major crops, linguistic distance predicts reduced market integration. This is statistically significant in all specifications save one: wheat with controls but without fixed effects. There are several ways to consider the magnitudes involved. First, taking the estimates from Column (4), a one standard deviation increase in linguistic distance, conditional on controls and fixed effects, predicts a reduction in the price correlation between markets i and j by 0.121 standard deviations for wheat, 0.181 standard deviations for salt, and 0.088 standard deviations for rice.
Table 2 MAIN RESULTS
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab2.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
It is striking that the coefficients and standardized magnitudes are largest for salt. Not only are salt markets less integrated in the data, in that they have lower mean correlation coefficients, there is also more dispersion in integration for salt, in that the standard deviation of the correlation coefficients across market pairs is larger. Salt was a differentiated good that could only be produced in a small number of locations (Donaldson Reference Donaldson2018). Further, to facilitate the taxation of salt and prevent smuggling, the British constructed an Inland Customs Line, which incorporated the Great Hedge of India (Moxham Reference Moxham2001).
An alternative approach to magnitudes is to divide \[\widehat{{\beta ^p}}\] by the coefficient estimated on ln(Distance) in Column (4). This suggests that moving one unit in linguistic distance (i.e., from a closely-related language to an unrelated one) predicts a reduction in the price correlation comparable to a distance change of 789 percent for wheat, 1,328 percent for salt, and 210 percent for rice. At the mean distance across pairs within our sample (1,154 kilometers), this would correspond to distance increases of 9,101, 15,326, and 2,418 kilometers, respectively, all of which would be out of sample. These large numbers are driven in part by the small coefficients estimated on distance once additional controls are included.
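The arithmetic behind these distance equivalents can be sketched as follows. The function names are hypothetical, and the sketch follows the reading in the text, in which the ratio of the two coefficients is treated as a percentage change in distance. Because the percentages and the mean distance are rounded as published, the kilometer figures are matched only approximately.

```python
def percent_from_ratio(beta_linguistic, beta_ln_distance):
    """Distance change, in percent, implied by the ratio of coefficients.

    Follows the reading in the text: the coefficient on linguistic
    distance divided by the coefficient on ln(Distance), times 100.
    """
    return 100.0 * beta_linguistic / beta_ln_distance


def distance_equivalent_km(percent_change, base_km):
    """Kilometer increase implied by a percent change at a base distance."""
    return base_km * percent_change / 100.0


MEAN_DISTANCE_KM = 1154  # mean pairwise distance reported in the text

# Percentages as reported in the text (rounded).
reported_percent = {"wheat": 789, "salt": 1328, "rice": 210}

equivalents = {
    good: distance_equivalent_km(pct, MEAN_DISTANCE_KM)
    for good, pct in reported_percent.items()
}
```

With the rounded inputs, the implied increases land within a few kilometers of the 9,101, 15,326, and 2,418 kilometers quoted above.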
In Online Appendix Table A4, we compare the pairwise correlations between our outcome variables and the measures of physical and linguistic distance. Both distance measures enter significantly and negatively on their own and, if both are put on the right-hand side at once, both continue to enter negatively and significantly, while the coefficient on each is reduced slightly. Both have similar R-squared values when included as right-hand-side variables alone, and including both on the right-hand side increases the R-squared.
MECHANISMS
In this section, we outline the mechanisms suggested in both the economic and historical literatures that provide plausible links between linguistic distance and market integration. We then assess these empirically to the extent our data allow.
Mechanisms in the Literature
A recent economic literature has emphasized several possible channels that might link linguistic distance to market outcomes, and several of these mechanisms are reflected in observations made about colonial Indian markets in the secondary historical literature. One branch of this economic literature has focused on the importance of barriers to the transmission of the traits that are imparted across generations in driving dissimilarities in economic outcomes across populations (Spolaore and Wacziarg Reference Spolaore and Romain2009, Reference Spolaore2018). Alternatively, differences in language may proxy for differences in tastes, which, in turn, shape prices and the volume of trade (Atkin Reference Atkin2013, Reference Atkin2016). Where these taste-based differences lead to a thin local market for a given good, we might anticipate prices that do not track those in other South Asian markets. Similarly, if there are fixed costs of arbitrage between two markets, the limited size of the market for an unpopular product will reduce the returns to arbitrage.
Another branch of the economic literature suggests mechanisms by which language barriers may inhibit market integration by raising trade costs. For example, linguistic distance may affect the costs of acquiring information (Gomes Reference Gomes2014; Allen Reference Allen2014). Alternatively, linguistic distance may act as a barrier to flows of people, who are likely to be put off by migration costs, the difficulty of establishing business connections, or by xenophobia (Bai and Kung Reference Bai and James2020; Falck et al. Reference Falck, Stephan, Alfred and Jens2012; Lameli et al. Reference Lameli, Volker, Jens and Nikolaus2015; Rauch and Trindade Reference Rauch2002; Iwanowsky Reference Iwanowsky2018). These mechanisms would lead to missing or costly links in the network connecting any two markets.
This branch of the economics literature aligns most closely with descriptions of trade in the secondary literature on Indian history. Collins (Reference Collins1999) cites linguistic barriers as an explanation of the low migration rates in India and hence as a limiting factor on price integration. Several writers have highlighted the importance of trade networks that corresponded with linguistic divisions. In colonial India, trading networks were often caste or kinship networks (Bhattacharya Reference Bhattacharya1983; Kessinger Reference Kessinger1983). Markovits (Reference Markovits2008, pp. 188–96) mentions several such “middlemen minorities.” Footnote 16 These groups, Divekar (Reference Divekar1983) argues, contributed to the “unification of markets in India.” They adopted new forms of business partnership and circulated information over wide regions. If linguistic dissimilarity raises the cost to such a group of maintaining a presence in a given market, this would be expected to increase transaction costs between that market and the other markets in which the group is present.
Linguistic distance may also make it more difficult to acquire a language in which trade is conducted or to acquire common levels of education; Isphording and Otten (Reference Isphording2014), Jain (Reference Jain2017), Laitin and Ramachandran (Reference Laitin and Rajesh2016), and Shastry (Reference Shastry2012) all find evidence that the costs of acquiring a new language—or education provided in that new language—are higher for those whose mother tongue is more dissimilar to the new language. Finally, linguistic distance may proxy for differences in preferences over public goods, redistribution, and the provision of infrastructure (Desmet, Gomes, and Ortuño-Ortín Reference Desmet, Joseph and Ignacio2020; Desmet, Ortuño-Ortín, and Wacziarg Reference Desmet, Ignacio and Romain2012, Reference Desmet2017). If these public goods and infrastructure investments affect trade costs, they may help explain our main result.
Mechanisms: Evidence
GENETIC DISTANCE
To evaluate whether linguistic distance operates as a proxy for a broader set of barriers to the transmission of information, technology, and culture, we compute a measure of the genetic distance among the markets in our data. We show that, while linguistic distance and genetic distance are correlated, neither one is a “sufficient statistic” that fully accounts for the coefficient of the other.
We obtain data on genetic distance from Pemberton, DeGiorgio, and Rosenberg (Reference Pemberton, Michael and Rosenberg2013). Similar to the data used by Spolaore and Wacziarg (Reference Spolaore and Romain2009), these data contain pairwise Weir and Cockerham (Reference Weir and Cockerham1984) \[{F_{ST}}\] coefficients based on differences in allele frequencies from microsatellites. While the raw data report coefficients based on 5,795 individuals from 267 human populations, we restrict ourselves to the data on ethnic groups indigenous to South Asia. These are the Balochi, Brahui, Burusho, Hazara, Kalash, Makrani, Pathan, Sindhi, Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Marathi, Marwari, Miso, Oriya, Parsi, Punjabi, Tamil, and Telugu. While these groups cover the majority of the population in our sample, there are some major missing groups, of which Urdu is the largest.
Following Spolaore and Wacziarg (Reference Spolaore and Romain2009), given population shares of groups m and n in districts i and j of \[{s_{mi}}\] and \[{s_{nj}}\], with genetic distance \[{F_{ST}}^{mn}\], we compute genetic distance among districts as:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn5.png?pub-status=live)
Note that we re-scale \[{s_{mi}}\] and \[{s_{nj}}\] as fractions of the population matched to the genetic data, rather than as fractions of the full district population. We present a map of genetic distances from Ludhiana in Online Appendix Figure A1. This has many similarities to Figure 2. Other regions of South Asia that are proximate to the Punjab are more genetically similar, although it is clear that South Indian groups in Dravidian-speaking regions are more genetically dissimilar, conditional on physical distance. The apparent proximity with Burma is overstated due to the lack of coverage of major Burmese populations in the genetic data.
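A minimal sketch of a share-weighted genetic distance with re-scaled shares is given below. It assumes the district-pair measure is the sum over groups m in district i and n in district j of the product of their shares and their pairwise coefficient, in the spirit of Spolaore and Wacziarg. The function names, group shares, and coefficient values are all hypothetical and are not taken from the genetic data.

```python
def rescale_shares(shares, matched_groups):
    """Keep groups with genetic data and renormalize shares to sum to 1."""
    kept = {g: s for g, s in shares.items() if g in matched_groups and s > 0}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {g: s / total for g, s in kept.items()}


def weighted_genetic_distance(shares_i, shares_j, fst):
    """Sum over m in i and n in j of s_mi * s_nj * F_ST^{mn}.

    `fst` maps unordered group pairs, stored as tuples in either order,
    to their pairwise coefficient. A group's distance to itself is zero.
    """
    total = 0.0
    for m, s_mi in shares_i.items():
        for n, s_nj in shares_j.items():
            if m == n:
                continue
            key = (m, n) if (m, n) in fst else (n, m)
            total += s_mi * s_nj * fst[key]
    return total


# Hypothetical coefficients and shares, for illustration only.
fst = {
    ("Punjabi", "Tamil"): 0.010,
    ("Punjabi", "Hindi"): 0.004,
    ("Hindi", "Tamil"): 0.008,
}
matched = {"Punjabi", "Hindi", "Tamil"}

# Urdu has no genetic data, so it is dropped before renormalizing.
district_i = rescale_shares({"Punjabi": 0.6, "Hindi": 0.2, "Urdu": 0.2}, matched)
district_j = rescale_shares({"Tamil": 0.9, "Urdu": 0.1}, matched)

distance_ij = weighted_genetic_distance(district_i, district_j, fst)
```

Dropping unmatched groups before renormalizing is what makes the measure depend only on the matched population, which is also why sparse coverage (as for Burma) can distort the computed distance.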
Our aim is to assess whether linguistic distance proxies for broader (and possibly deeper) barriers to the diffusion of information, culture, and technology. We re-estimate Equation (1), first with genetic distance as an outcome, and second with genetic distance as an additional control. We report the results in Table 3. Linguistic and genetic distance are correlated, even conditional on our baseline fixed effects and controls. Footnote 17 Genetic distance itself predicts less market integration and diminishes the coefficient on linguistic distance, but does not fully eliminate it in any specifications where linguistic distance was significant in Table 2. With fixed effects and controls, the change in coefficient on linguistic distance is slight when compared with Table 2. These results imply that, while linguistic distance may indeed proxy for other differences across populations, its relationship with market integration cannot be fully accounted for by the additional transaction costs imposed by barriers to the diffusion of beliefs, traditions, and practices stemming from ancestral distance.
Table 3 GENETIC DISTANCE
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab3.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
COARSE AND FINE DISTINCTIONS
We show that it is the highest-level distinctions in our data, such as those between Indo-European and Dravidian languages, that drive our results. This is, however, a crude proxy, and we cannot rule out the possibility that languages here proxy for past patterns of migration and state formation that themselves shaped markets and trade routes.
Recall that, in our baseline analyses, we computed the distance between any two languages m and n as:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqnu1.png?pub-status=live)
While this follows the convention in the literature, it does not allow us to distinguish whether coarser distinctions (e.g., those between Indo-European and Dravidian languages) or lesser divisions (e.g., those between Bengali and Punjabi) drive our results. We replace \[{d_{mn}}\] with a dummy for having ≤N shared branches, for N = {1, …, 15}. We re-estimate Equation (1), and present our results in Figures 6, 7, and 8. These correspond to Column (4) with fixed effects and controls. In all three figures, it is clear that coarser distinctions matter more than finer ones. Indeed, we show in Online Appendix Table A5 that limiting our sample only to district pairs in which the dominant language in both districts is Indo-European leads to coefficient estimates on linguistic distance that, while still negative, are generally insignificant and less robust across specifications. That is, our results are driven by coarser language distinctions, particularly those that separate major language families.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig6.png?pub-status=live)
Figure 6 RESULTS BY LEVEL: WHEAT
Source: Authors’ estimates.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig7.png?pub-status=live)
Figure 7 RESULTS BY LEVEL: SALT
Source: Authors’ estimates.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_fig8.png?pub-status=live)
Figure 8 RESULTS BY LEVEL: RICE
Source: Authors’ estimates.
Consider a language such as Gujarati (Indo-European, Indo-Iranian, Indo-Aryan, Intermediate Divisions, Gujarati, Gujarati). It has no branches in common with a Dravidian language such as Tamil. It shares one branch with languages such as Yiddish that are Indo-European but not Indo-Iranian. It shares two branches with languages such as Balochi that are Indo-Iranian but not Indo-Aryan. It shares three branches with an Indo-Aryan language such as Hindi that is classified under “Western Hindi” rather than “Intermediate Divisions.” It shares four branches with a language such as Nepali that is within these “Intermediate Divisions,” but is not within the Gujarati sub-class. It shares five branches with other Gujarati languages (such as Jandavra). In all three figures, language divisions with two common branches or fewer yield visibly greater differences than finer distinctions. These results suggest that our main results derive from divisions on the scale of Gujarati-Tamil, Gujarati-Yiddish, and Gujarati-Balochi, rather than from finer distinctions such as those between Gujarati and Hindi, Nepali, or Jandavra. These coarser distinctions are those that have been shown before to correlate with conflict, redistribution, and public goods provision (suggesting they are correlated with deeper differences in preferences), as opposed to finer distinctions that inhibit coordination and integration (Desmet, Ortuño-Ortín, and Wacziarg Reference Desmet, Ignacio and Romain2012). This is suggestive evidence that our results are driven not simply by ease of communication, but also by more fundamental differences in preferences.
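The branch counting in this example can be sketched as the length of the common prefix of two classification paths, with the dummy for ≤N shared branches built on top of it. The Gujarati path is the one given in the text. The other paths are truncated at the point of divergence, and the label used for Nepali's sub-class is a placeholder, not an Ethnologue node name. The function names are hypothetical.

```python
def shared_branches(path_m, path_n):
    """Count the leading classification nodes that two languages share."""
    count = 0
    for a, b in zip(path_m, path_n):
        if a != b:
            break
        count += 1
    return count


def at_most_n_shared(path_m, path_n, n):
    """Dummy equal to 1 when the pair has n or fewer shared branches."""
    return int(shared_branches(path_m, path_n) <= n)


GUJARATI = ("Indo-European", "Indo-Iranian", "Indo-Aryan",
            "Intermediate Divisions", "Gujarati", "Gujarati")

# Paths truncated where they diverge from Gujarati; the Nepali sub-class
# label is a placeholder for illustration.
OTHERS = {
    "Tamil": ("Dravidian",),
    "Yiddish": ("Indo-European", "Germanic"),
    "Balochi": ("Indo-European", "Indo-Iranian", "Iranian"),
    "Hindi": ("Indo-European", "Indo-Iranian", "Indo-Aryan", "Western Hindi"),
    "Nepali": ("Indo-European", "Indo-Iranian", "Indo-Aryan",
               "Intermediate Divisions", "non-Gujarati sub-class"),
    "Jandavra": ("Indo-European", "Indo-Iranian", "Indo-Aryan",
                 "Intermediate Divisions", "Gujarati", "Jandavra"),
}

counts = {name: shared_branches(GUJARATI, path) for name, path in OTHERS.items()}
coarse = {name: at_most_n_shared(GUJARATI, path, 2) for name, path in OTHERS.items()}
```

Under the N = 2 dummy, the Tamil, Yiddish, and Balochi pairings are flagged as coarse divisions, while the Hindi, Nepali, and Jandavra pairings are not.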
MISSING MARKETS
To test whether missing markets (due, for example, to differences in tastes) drive the correlation between linguistic distance and market integration, we evaluate whether linguistic distance predicts whether two given markets report a certain good’s price in the same year, and whether markets that are more linguistically distant from their neighbors experience more volatile prices. For the major crops, we find little evidence of missing markets increasing with linguistic distance. Only limited evidence suggests that prices are more variable at markets that are more linguistically different from those around them.
We take two approaches. First, we test whether linguistic distance predicts how frequently prices are available for two markets in the same year. Taking \[{N_{ij}}^p\] as the number of common price observations at markets i and j for product p, we estimate Equation (1), except that we now take \[{N_{ij}}^p\] as the dependent variable, and no longer control for minimum year, maximum year, or the number of common observations. Results are presented in Table 4. There is only weak evidence of missing markets correlating with linguistic distance; while we find a negative correlation between linguistic distance and \[{N_{ij}}^p\] for wheat, no such correlation is apparent for salt or rice. We find similar failures of linguistic distance to predict \[{N_{ij}}^p\] when using lesser crops from the data such as barley and maize, although we do not report these here. One explanation of the different result for wheat is the greater variability of the outcome variable: the standard deviation of the number of common years for wheat is 22.6, versus 8.8 for salt and 9.7 for rice. That is, as wheat is reported less often in many markets, there is more variation to be explained.
Table 4 MISSING MARKETS: NUMBER OF COMMON YEARS
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab4.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
As a second approach, we evaluate whether markets that are more linguistically distant from those within a set radius experience prices that are more volatile. Our logic here is that linguistic distance from neighbors may lead to more volatile prices because of reduced trade and arbitrage. For each market i, we keep the other markets within 500 kilometers and take the average of their linguistic distance from i (denoted \[\overline {{\text{LinguisticDistance}}_{ij}} \]) as well as the average of the controls (denoted \[\overline {{x_{ij}}^p} \]). We estimate:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn6.png?pub-status=live)
In Equation (6), \[C{V_i}^p\] is the coefficient of variation of the price of product p at market i. We estimate Equation (6) by ordinary least squares (OLS) and report robust standard errors. Results are presented in Table 5. While we find evidence that wheat prices are more volatile at markets that are more linguistically distant from others in their neighborhood, we find no similar evidence for rice or salt. The differences by crop here are somewhat puzzling, as it is rice prices that are most volatile in our data, as measured by the coefficient of variation.
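The two ingredients of this exercise, the coefficient of variation at a market and the average of a pairwise variable over neighbors within 500 kilometers, can be sketched as follows. The function names and toy inputs are hypothetical, and using the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(prices):
    """Standard deviation of a price series divided by its mean.

    Uses the population standard deviation; the choice between the
    population and sample versions is an assumption of this sketch.
    """
    return pstdev(prices) / mean(prices)


def neighborhood_average(market, km, values, radius_km=500.0):
    """Average values[(market, j)] over other markets j within the radius.

    `km` and `values` map ordered pairs (i, j) to the physical distance
    and to the pairwise variable being averaged. Returns None when the
    market has no neighbors inside the radius.
    """
    neighbors = [
        j for (i, j), d in km.items()
        if i == market and j != market and d <= radius_km
    ]
    if not neighbors:
        return None
    return mean(values[(market, j)] for j in neighbors)


# Hypothetical toy inputs: market A and three other markets.
km = {("A", "B"): 300.0, ("A", "C"): 450.0, ("A", "D"): 900.0}
linguistic = {("A", "B"): 0.2, ("A", "C"): 1.0, ("A", "D"): 1.0}

avg_distance_a = neighborhood_average("A", km, linguistic)
cv_a = coefficient_of_variation([8.0, 12.0])
```

Market D lies beyond the radius and is excluded, so the neighborhood average for A uses only B and C. The same helper can be applied to each control to build the averaged controls.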
Table 5 MISSING MARKETS: VOLATILITY
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab5.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Robust standard errors in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations and averages of ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato.
Source: See the text.
TRADING COMMUNITIES
To evaluate whether the presence of trading networks sharing a common tongue drives our results (e.g., as might be the case if small communities of traders have lower costs of establishing themselves in regions where the dominant language resembles their own), we correlate linguistic distance with the common presence of communities such as the Marwaris or Parsis. We find little evidence that the co-presence of these communities correlates with linguistic distance.
We focus on one group that has received particular attention in the literature: the Marwaris. By 1920, between 200,000 and 400,000 Marwaris, most of them working as traders, lived outside of the Rajputana Agency (Markovits Reference Markovits2008). These traders drew on capital and personnel from throughout the subcontinent. They gained dominant positions in regional trade, importing, exporting, and moneylending. These communities held assets jointly in patrilineal extended families, sharing information and personnel (Roy Reference Roy2014).
For each pair of markets i and j, we compute the absolute difference in Marwari share, \[A{D_{ij}}^{{\text{Marwari}}} = |{S_i}^{{\text{Marwari}}} - {S_j}^{{\text{Marwari}}}|\]. We then estimate Equation (1) with \[A{D_{ij}}^{{\text{Marwari}}}\] as both an outcome and as a control. That is, we test whether linguistic distance predicts the colocation of Marwaris across district pairs, and the degree to which the co-presence of this trading community can account for the conditional correlation between linguistic distance and market integration. Results are presented in Table 6. There is little evidence of linguistic distance driving differences in the presence of this trading community, and little evidence that it explains price integration.
Table 6 TRADING COMMUNITIES
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab6.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
Results are similar if we perform the same exercise for the other communities listed, although we do not report these for space. While we cannot observe all these communities in our data, several are recorded in the census either as linguistic or religious groups. In particular, we are able to observe the Parsis, Afghanis, Gujaratis, Khatris, Memons, Multanis, and Sindhis. We also observe the Vanis, but they are not present in the markets in our data. Since the English could also be potentially thought of as another migrant mercantile community, we also consider their presence. Results are again similar, and again not reported, using the English. Our results are particularly unlikely to be explained by the spread of the English language: less than one-tenth of 1 percent of the population in the 1901 census is recorded as “English” by language.
Alternatively, if we replace the absolute difference in the population share of a minority group with the maximum for a market pair, results are very similar. Because a group is often present in one market and not another, the maximum across a pair is highly correlated with the absolute difference in shares. Similarly, we find little correlation between linguistic distance and the minimum presence of a trading community across a market pair, and our results are not generally sensitive to controlling for this minimum. Again, we omit these results for space.
LITERACY
In a related test for the costs of information, we examine whether linguistic distance correlates with differences in literacy rates. While linguistically distant markets have more dissimilar literacy rates, this does not diminish the correlation of linguistic distance with market integration.
For data on literacy, we use the 1921 Census of India. These data report literacy at the district level, and we match each market to the district that contains it. As with the presence of trading communities, for each market pair, we take the absolute difference in literacy rates as both an outcome and as a control. We present results in Table 7. More linguistically distant markets have more dissimilar literacy rates, but this does little to predict price correlations, or to explain away their correlation with linguistic distance.
Table 7 LITERACY RATE
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab7.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
INFRASTRUCTURE
Finally, we examine whether linguistic distance proxies for differences in preferences over public goods, in particular, those that facilitate trade. We show that more linguistically distant market pairs spend less time with both markets connected to the railway network but that, nonetheless, this does not fully account for our main result.
Following a procedure similar to Donaldson (Reference Donaldson2018), we use the 1934 edition of History of Indian Railways Constructed and In Progress to identify the year each market became connected to the colonial railway. This source divides the Indian railway system into segments (e.g., “Karimganj to Badarpur”) with a date of opening (in this example, 4-12-96) and length in miles (in this example, 12.00). We use these data to code the first date at which the district containing each market was connected to the Indian Railway system. For each market pair ij, we can then identify the number of years up to 1921 that both markets were connected to the railway system. We then estimate Equation (1) with this variable as both an outcome and as a control. We present results in Table 8. More linguistically distant markets spend less time both connected to the railroad; however, this does little to predict price correlations or explain away their correlation with linguistic distance. One possible contributing factor to these results is the nature of the Indian railways, which were often built to track pre-existing trade routes (Andrabi and Kuehlwein Reference Andrabi and Michael2010).
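The pair-level railway variable can be sketched as below. Both function names are hypothetical. Reading the opening date assumes that the final field is a two-digit year and that no segment in a 1934 source opened after 1934. Counting years as the gap between the later connection year and 1921 is likewise an assumption about how the years are counted.

```python
def opening_year(date_text, last_year=1934):
    """Read the year from a date written like '4-12-96'.

    Assumes the final field is a two-digit year and that no segment
    opened after `last_year`, so '96' is read as 1896 and '05' as 1905.
    """
    yy = int(date_text.split("-")[-1])
    year = 1900 + yy
    return year if year <= last_year else year - 100


def years_both_connected(first_year_i, first_year_j, end_year=1921):
    """Years up to `end_year` in which both markets were on the railway.

    Each argument is the first year the district containing the market
    was connected, or None if it was never connected in the period.
    """
    if first_year_i is None or first_year_j is None:
        return 0
    later = max(first_year_i, first_year_j)
    return max(0, end_year - later)


# The example segment from the text, paired with a hypothetical market
# whose district was first connected in 1870.
karimganj_year = opening_year("4-12-96")
pair_years = years_both_connected(karimganj_year, 1870)
```

The pair is only jointly connected once the later of the two markets joins the network, so the variable is driven by the slower market in each pair.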
Table 8 RAILWAY CONNECTIONS
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab8.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
ROBUSTNESS
Selection on Unobservables
In this section, we demonstrate the robustness of our results to selection on unobservables. We present a number of additional exercises in the Online Appendix.
To demonstrate robustness to selection on unobservables, we use the approach of Altonji, Elder, and Taber (Reference Altonji, Elder and Taber2005) as implemented by Bellows and Miguel (Reference Bellows and Edward2009) and Nunn and Wantchekon (Reference Nunn and Leonard2011). We estimate Equation (1) with either a limited set of controls or with a full set of controls, and compute:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_eqn7.png?pub-status=live)
We report results where the restricted set of controls is either empty or contains only ln(Distance). Larger values of this statistic imply that selection on unobservables would need to have a larger effect on β, relative to that of observables, in order to be consistent with a true β of 0. Results are presented in Table 9. The coefficient estimates for wheat are sensitive to controls regardless of what is in the base set of controls, but are less sensitive to the addition of fixed effects. Results for salt and rice appear sensitive to adding fixed effects and controls together, but this is driven by ln(Distance). Once this is included as a baseline control, the Altonji-Elder-Taber (AET) statistic is negative (i.e., controls push β away from zero) or greater than one. That is, the estimate of β is sensitive to controls for wheat, while for salt and rice it is no longer sensitive to controls once ln(Distance) has been included.
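For illustration, the ratio used by Bellows and Miguel (2009) and Nunn and Wantchekon (2011) is β_F / (β_R − β_F), where β_R is the estimate with the restricted set of controls and β_F is the estimate with the full set. The sketch below assumes that form; the function name and coefficient values are hypothetical and are not taken from Table 9:

```python
def aet_ratio(beta_restricted, beta_full):
    """Return beta_F / (beta_R - beta_F).

    beta_restricted: coefficient estimated with the restricted controls.
    beta_full: coefficient estimated with the full set of controls.
    """
    diff = beta_restricted - beta_full
    if diff == 0:
        # Adding controls did not move the estimate; the ratio is undefined.
        raise ValueError("restricted and full estimates coincide")
    return beta_full / diff


# Hypothetical case 1: controls shrink the estimate toward zero.
shrinks = aet_ratio(beta_restricted=-0.75, beta_full=-0.5)

# Hypothetical case 2: controls push the estimate away from zero,
# so the ratio turns negative.
grows = aet_ratio(beta_restricted=-0.25, beta_full=-0.5)
```

In the first case the ratio is positive and above one: selection on unobservables would need to be more than as strong as selection on observables to drive the estimate to zero. In the second case the ratio is negative, matching the reading above that controls move β away from zero.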
Table 9 ALTONJI-ELDER-TABER STATISTICS
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210407073348538-0319:S0022050720000650:S0022050720000650_tab9.png?pub-status=live)
* = Significant at the 10 percent level.
** = Significant at the 5 percent level.
*** = Significant at the 1 percent level.
Notes: Standard errors clustered by market i and market j in parentheses. All regressions are OLS and include a constant. Controls are minimum year, maximum year, number of observations, ln(distance) in km, both coastal, connected to river, rainfall correlation, temperature correlation, and absolute differences in: altitude, latitude, longitude, rainfall, temperature, land quality, ruggedness, malaria, humidity, precipitation, slope, religion, and suitabilities for growing banana, chickpea, cocoa, cotton, groundnut, dryland rice, oil palm, onion, soybean, sugar, tea, wetland rice, white potato, wheat, and tomato. Fixed effects are for markets i and j.
Source: See the text.
CONCLUSION
In this article, we have shown that markets in colonial South Asia that were more linguistically distant from each other displayed less market integration, conditional on many other measures, including distance, literacy gaps, transportation links, and measures of dissimilarity. This finding holds across multiple products and markets, and survives several sensitivity checks. Genetic distance and lack of railway connections may help explain these results, but on their own, these factors do not explain the lack of market integration. There is less evidence for missing markets and the presence of trading communities as mechanisms. The results show that cultural and linguistic barriers are salient to the functioning of markets, and that their importance is not limited to political economy or to post-colonial, modern economies. Furthermore, the contribution of these cultural factors to enhancing or impeding market integration is substantial relative to other factors such as physical distance. More linguistically similar markets are more likely to have been connected earlier via transport infrastructure (the colonial railway system), but this connection alone does not explain away the coefficient. These results indicate the importance and persistence of cultural differences in market integration, trade, and price volatility. Whether markets with greater gains from trade learn the languages necessary for trade over time, and whether newer information and communication technologies reduce the importance of linguistic distance, remain important questions for future work.