1 Introduction
If an image is worth a thousand words, what is a video worth? Although previous scholars have used image data (e.g., Torres 2018; Casas and Williams 2019; Xi et al. 2019) to answer important political science questions, less attention has been paid to video-based measures. Using the largest collection of Cable-Satellite Public Affairs Network (C-SPAN) videos ever compiled, this paper introduces motion detection as a way to begin to use video-as-data. In this paper, I use this measure to understand the extent to which members of Congress (MCs) literally cross the aisle, but it could be used to study a wide range of political phenomena, like nonverbal displays during political speech (Koppensteiner and Grammer 2010) or the intensity of police–citizen encounters (Makin et al. 2019). In this way, the present study is not the end, but the beginning of an important new line of research in which videos are viewed as more than collections of images.
Motion detection is a useful starting place since it underlies other computer vision techniques, like gait analysis (Aggarwal and Cai 1999), object tracking (Prajapati and Galiyawala 2015), pedestrian tracking (Li et al. 2014), and scene detection (Koprinska and Carrato 2001). Although there are a variety of ways to detect motion (Manchanda and Sharma 2016), this study uses frame differencing because it is less computationally intensive and can be applied to low-resolution videos, like many of those found on C-SPAN’s website or on video sharing sites like YouTube. Optical flow techniques, for example, have also been used extensively in the literature (Shafie, Hafiz, and Ali 2009), but these methods typically require graphics processing units (GPUs) that are unavailable to many social scientists. For this reason, frame differencing is offered as a useful way to begin video-based research.
Although scholars in other fields have used frame differencing to understand interpersonal dynamics (e.g., Paxton and Dale 2013; Ramseyer and Tschacher 2014), this technique has not yet been used in political science. In this paper, frame differencing is used to measure the extent to which MCs speak with members of the opposing party after floor votes. Not only is this a behavior of interest to a number of legislative scholars (Masket 2008; Cohen and Malloy 2014), but, as I will show below, my video-based measure also yields reasonable results, which suggests it could be a useful alternative to the more time- and labor-intensive approaches used by previous scholars (e.g., Caldeira and Patterson 1987). Ultimately, I find that Democrats and Republicans are increasingly less willing to talk to one another after floor votes and that this behavior is predictive of future party votes, but this is only one of the many ways motion detection can be used by social scientists.
2 The Congressional Workplace
Congress is a unique workplace in that (a) interacting with people with different political views is often necessary to achieve personal and institutional objectives, yet (b) people are physically sorted on the basis of their political views. One of the few formal settings where Democrats and Republicans informally talk to one another (Hughes et al. 1976) and “mill around” is the well of the U.S. House of Representatives (Green and Hogan 1982). These “water cooler conversations” have been found to have a moderating effect in other contexts (Mutz and Mondak 2006), but the mingling shown in Figure 1 has never been studied in political science, despite previous scholars emphasizing the importance of spatial proximity in legislative settings (Masket 2008; Cohen and Malloy 2014).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_fig1.png?pub-status=live)
Figure 1 Overhead shot of members of Congress mingling after a roll-call vote. Not only does this shot show all of the social interactions that take place after a floor vote, but it is a quintessential part of C-SPAN coverage. All the analyses presented below consider videos similar to the frame shown here.
Kirkland’s (2011) distinction between strong and weak legislative ties helps explain why such informal conversations may be important to legislative outcomes. When ties are strong, new information is rarely exchanged, preventing the size of the legislative coalition from expanding. Since weak legislative ties originate from less frequent interactions, when those interactions do take place, information is more likely to be novel and ultimately shared, which helps lay the foundation for future cooperation. Although previous scholars have demonstrated the importance of social interactions in other legislative settings (e.g., Caldeira and Patterson 1987), this application is the first to suggest such relationships may also be cultivated on the House floor.
In Section S2.3 of the SI, an agent-based model is used to show that video motion increases when people walk from one side of the room to the other in order to talk. “At present Democrats and Republicans literally sit ‘across the aisle’ from one another,” with Democrats sitting on the east side of the Chamber to the right of the Speaker of the House and Republicans sitting on the Speaker’s left (Gibson 2010, 21). Thus, MCs have to walk a greater distance to speak with members of the opposition, meaning videos should contain more motion when these interactions occur. This leads to the following expectation: on average, when the videos of social interactions immediately after floor votes display less motion, party votes are more likely to occur later that day.
I offer two potential mechanisms, but do not make any strong causal claims. First, the socialization that occurs between votes (or lack thereof) could serve as an important signal to the rank-and-file. To use Kirkland’s (2011) example, just as Ted Kennedy (D-MA) working with Orrin Hatch (R-UT) in the 108th Senate could help convince some Republicans to take a closer look at Kennedy’s legislation, when Democrats and Republicans refuse to speak with one another after a floor vote, there is even less incentive for MCs to work with one another, which could increase party-line voting later that day.
Second, the conversations that occur between votes can be thought of as an example of “weak ties,” in that they are novel, timely, and provide information sharing opportunities. When these conversations occur between Democrats and Republicans they offer an opportunity for both sides to work out potential differences on later votes. Conversely, when such conversations are polarized—meaning they happen on opposite sides of the room—then there are comparatively fewer opportunities to find common ground, ultimately producing more party votes.
However, this research letter does not aim to resolve a debate that has existed in some fashion since perhaps Matthews (1959). Rather than saying definitively how the “social fabric” of Congress influences legislative behavior (Cho and Fowler 2010, 125), the degree to which Democrats and Republicans literally cross the aisle is offered as one way motion detection can be used for social science research. Scholars should use this measure to answer their own research questions, and this application is meant to do nothing more than motivate this future work.
3 Data and Measure
3.1 C-SPAN Videos
This study employs the largest collection of video data (6,526 videos, or 1,413 hours of C-SPAN coverage) ever used in political science research. Each video was approximately 16 minutes long, with the first video occurring on January 7, 1997 and the last video occurring on December 13, 2012. Segments similar to the frame shown in Figure 1 are the focus of my analysis. Not only are these overhead shots quintessential examples of C-SPAN coverage, but they also show all of the interactions that occur after floor votes, while other camera angles only show a small fraction of these discussions.
Figure 2 shows how these shots were identified in the 6,411,694 frames extracted from the C-SPAN videos. A research assistant first manually identified 17,700 “good” frames using a random sample of videos. I then used a video hashing (or fingerprinting) algorithm developed by Zauner (2010) to compare each frame to every other frame. A high-performance computing cluster then calculated 113,935,802,380 pairwise comparisons. Frames that shared at least 10 of 16 hexadecimal characters (after hashing) with at least one of the “good” frames were said to include the overhead shot. A review of the output and an initial validation exercise show that my approach yields reasonable results.
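The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the replication code: the function names `shared_hex_chars` and `is_overhead_shot` are hypothetical, the example hashes are invented, and "shared" characters are assumed to mean characters that agree at the same position in two 16-character hexadecimal strings.

```python
def shared_hex_chars(hash_a: str, hash_b: str) -> int:
    """Count the positions at which two equal-length hex strings agree.

    Assumption: 'shared characters' is read as position-wise agreement.
    """
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a == b for a, b in zip(hash_a.lower(), hash_b.lower()))


def is_overhead_shot(frame_hash, good_hashes, min_shared=10):
    """Flag a frame that matches at least one reference ('good') hash."""
    return any(shared_hex_chars(frame_hash, g) >= min_shared for g in good_hashes)


# Hypothetical 16-character hashes, invented purely for illustration.
good = ["a3f09c1b7e5d2246", "0f0f0f0f0f0f0f0f"]

print(is_overhead_shot("a3f09c1b7e5d99ff", good))  # agrees with the first hash at 12 positions
print(is_overhead_shot("f" * 16, good))            # agrees with neither hash at 10 or more positions
```

The threshold of 10 out of 16 characters is taken from the text; how the hashes themselves are produced is left to the hashing algorithm and is not reproduced here.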
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_fig2.png?pub-status=live)
Figure 2 Figure explaining motion detection technique and how the overhead shots were extracted from the C-SPAN videos. Please see Section S1 in the Supplemental Information for more details about how the overhead shots were extracted and video motion was detected.
3.2 Motion Detection
In total, 70,717 clips were created with each video containing around 17 relevant clips, meaning within each video there were approximately 17 uninterrupted sequences of frames like Figure 1. Similar to Seshadrinathan and Bovik (2007), frames were differenced using the structural similarity (SSIM) between each frame. As explained in Section S1.4 of the SI, this measure, developed by Wang et al. (2004), has many desirable properties, which is why it was used for this study. Ultimately, the overall motion of the video is the average SSIM across all sequential pairs of frames, which was calculated for the 2,935 C-SPAN videos in which at least one clip with two or more overhead shots could be identified.
Generally speaking, if $\textbf{x}$ and $\textbf{y}$ are the pixel matrices associated with two images, then the SSIM can be defined as:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_eqn1.png?pub-status=live)
where $\mu_x$ and $\mu_y$ are the means of $\textbf{x}$ and $\textbf{y}$, respectively. Similarly, $\sigma_x$ and $\sigma_y$ are the SDs of $\textbf{x}$ and $\textbf{y}$, leaving $\sigma_{xy}$ as the cross correlation of $\textbf{x}$ and $\textbf{y}$. $C_1$ and $C_2$ are small stabilizing constants and are defined in Python as follows:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_eqnu1.png?pub-status=live)
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_eqnu2.png?pub-status=live)
where $\max_{\textbf{x},\textbf{y}}$ and $\min_{\textbf{x},\textbf{y}}$ are the maximum and minimum pixel values across both matrices. If either matrix has any white, then $\max_{\textbf{x},\textbf{y}}$ will equal 1. Similarly, any black will set $\min_{\textbf{x},\textbf{y}}$ to zero, making $C_1$ and $C_2$ 0.0001 and 0.0003, respectively.
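To make the measure concrete, the following is a minimal sketch of frame differencing with SSIM. It uses only the Python standard library and rests on several assumptions: the formula is the standard single-window form from Wang et al. (2004), with $\sigma_{xy}$ computed as the sample covariance; SSIM is computed once over the whole frame rather than over local windows; frames are flat lists of grayscale pixel values in [0, 1]; and the defaults for `c1` and `c2` are simply the two values quoted above. The function names are hypothetical.

```python
from statistics import fmean


def global_ssim(x, y, c1=0.0001, c2=0.0003):
    """Whole-frame SSIM for two flat lists of pixel values in [0, 1]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("frames must have equal length and at least two pixels")
    n = len(x)
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_sequential_ssim(frames):
    """Average SSIM over consecutive frame pairs (higher means less motion)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    scores = [global_ssim(a, b) for a, b in zip(frames, frames[1:])]
    return fmean(scores)


# Toy 2x2 'frames' flattened to lists; values are invented for illustration.
still = [0.0, 0.0, 1.0, 1.0]
moved = [1.0, 1.0, 0.0, 0.0]

print(mean_sequential_ssim([still, still, still]))  # no change between frames
print(mean_sequential_ssim([still, moved, still]))  # large change between frames
```

In practice an image library would be used to read frames and compute a windowed SSIM; the sketch only shows how the terms in the definition combine and how a per-video average is formed.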
3.3 Validation
To help interpret the measure, I scaled the average SSIM to SDs above and below the mean, with positive values implying less motion. In order to validate the measure, 500 clips were randomly sampled. The extracted frames were then coded for whether someone was (1) or was not (0) walking in the well of the House. Figure S8 in the SI reports the results. Regardless of the Congress, frames with MCs walking in the well of the House are less similar, implying there is significantly more motion ($p < 0.001$).
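The scaling step can be illustrated with a short sketch. It assumes the sample standard deviation is used and that standardization is applied across all videos at once; the paper does not state either detail here, and the input values below are invented.

```python
from statistics import fmean, stdev


def standardize(values):
    """Express each value in SDs above or below the mean of the list."""
    mu = fmean(values)
    sd = stdev(values)  # sample SD; an assumption, not confirmed by the source
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mu) / sd for v in values]


# Hypothetical per-video average SSIM values.
avg_ssim = [0.90, 0.92, 0.95, 0.97]
print(standardize(avg_ssim))  # positive entries correspond to less motion
```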
Next, 100 clips were randomly sampled from those used above and an undergraduate research assistant manually tracked each MC using the Fiji distribution of ImageJ. Abstract representations of each clip were then created using the tracking information. Next, I sequentially removed each MC and created new videos. The motion of these videos was then compared, allowing me to determine the degree to which each MC’s walking patterns influenced the original video’s overall motion. Again, MCs were coded for whether they did (1) or did not (0) cross the aisle. Figure S10 in the SI shows MCs produce significantly ($p < 0.001$) more motion when crossing the aisle, further validating my measure.
Finally, I plotted average video motion against the absolute difference between the mean Democratic and Republican DW-NOMINATE scores for the 105th–112th Congresses. The results are shown in Figure 3. Ultimately, as time progresses, (1) the absolute difference between Democratic and Republican DW-NOMINATE scores increases and (2) the videos of MCs mingling after floor votes have higher SSIM, implying less motion. Not only are these two variables highly correlated ($\rho = 0.79$), but Pearson’s correlation coefficient is also statistically significant at the .02-level ($t = 3.18$, df $= 6$, and $p < .02$). In the SI, this measure is also validated using simulated bipartisan social interactions (see Section S2.3) and by predicting polarized legislative speech (see Section S3.5).
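The reported test statistic follows from the usual relationship between a Pearson correlation and a $t$ statistic, $t = r\sqrt{\text{df}/(1 - r^2)}$ with $\text{df} = n - 2$. The sketch below assumes $n = 8$ (one observation per Congress, 105th through 112th) and uses the rounded correlation of 0.79, so it gives roughly 3.16 rather than exactly the reported 3.18; the small gap is presumably due to rounding of the correlation.

```python
import math


def t_from_r(r, n):
    """Return the t statistic and degrees of freedom for a Pearson correlation."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r ** 2))
    return t, df


t, df = t_from_r(0.79, 8)  # rounded correlation; eight Congresses assumed
print(round(t, 2), df)
```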
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_fig3.png?pub-status=live)
Figure 3 Plot of the average structural similarity for the 105th–112th Congresses. Positive values imply the frames are more similar to one another, meaning there is less motion. The correlation between video motion and the absolute difference in Democratic and Republican DW-NOMINATE scores is reported in the top left. Both variables were standardized to standard deviations above/below the mean.
This latter result helps underline why video motion is a uniquely important measure. For example, Table S8 in the SI shows the ideology of the bill sponsor, as measured by DW-NOMINATE, is not predictive of polarized legislative speech. This is not to say that ideology plays no role in legislative speech, but rather to suggest that instead of being a replacement for these more traditional measures, video motion provides an important complement. Indeed, interpersonal interactions are one of the many facets of legislative life (Patterson 1959; Bogue and Marlaire 1975; Caldeira and Patterson 1987). What I provide is a way to objectively measure such behavior using C-SPAN video data, which is less time- and labor-intensive than previous approaches.
3.4 Modeling Strategy
In the models below, my independent variable, called Structural Similarity, is the standardized SSIM described above. The dependent variable, called Future Party Votes, is the number of party votes that occur after the current video divided by the total number of remaining votes. All control variables are described and justified in Table S1 of the SI. Perhaps most importantly, a control is included for the number of party votes that occurred before the current video divided by the total number of prior votes. In Table 1, this variable is labeled Previous Party Votes, the unit of analysis is the floor vote, and only votes on House bills and resolutions are included. Finally, all estimates are from Tobit regressions with standard errors clustered around the bill/resolution, since the dependent variable ranges from 0 to 1 and the same bill/resolution can appear multiple times.
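The construction of the two vote-share variables can be sketched as follows. This is an illustration under assumptions rather than the replication code: votes are taken to be ordered within a single day, the current vote is excluded from both shares, and how a vote is classified as a party vote is left outside the sketch. The function name and the example data are hypothetical.

```python
def party_vote_shares(is_party_vote, i):
    """Return (Previous Party Votes, Future Party Votes) for the vote at index i.

    is_party_vote: list of booleans, one per floor vote, in chronological order.
    A share is None when there are no earlier (or later) votes to divide by.
    """
    before = is_party_vote[:i]
    after = is_party_vote[i + 1:]
    previous_share = sum(before) / len(before) if before else None
    future_share = sum(after) / len(after) if after else None
    return previous_share, future_share


# Hypothetical day with five floor votes; True marks a party vote.
day = [True, False, False, True, True]
print(party_vote_shares(day, 1))  # one prior vote, three remaining votes
```

Both shares are bounded between 0 and 1, which is the property the text cites when motivating the Tobit specification.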
Table 1 When MCs cross the aisle, future party votes are less likely to occur.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20210427160133581-0221:S104719872000025X:S104719872000025X_tab1.png?pub-status=live)
Note: Dependent variable is the number of party votes that occur after the current video divided by the total number of remaining votes. Structural Similarity is described in Section 3.4. Positive values imply less video motion. Unit of analysis is a given floor vote. Since some House bills and resolutions have several votes, standard errors are clustered around each piece of legislation (e.g., HR 820). 95% confidence intervals are also reported. All models include Congress fixed-effects and were estimated using the tobit function in Stata (v15). See Dietrich (2020) for data and replication code.
4 Results
Beginning with Model 1, Structural Similarity is positive and statistically significant ($p < 0.001$), suggesting party votes are more likely after videos of floor votes in which motion declines. Model 2 shows this result holds even when controlling for a number of factors, including Previous Party Votes. Here, when additional controls are included, the positive coefficient associated with Structural Similarity (0.054) is close to three times the standard error (0.020), meaning that as video motion declines (implied by the greater similarity between frames), party votes are significantly more likely to occur. These results are replicated a number of times in the SI using a variety of model specifications and robustness checks, all of which lead to the same substantive conclusion.
To help interpret these results, predicted values were computed using the coefficients from Model 1. In the 112th Congress, when Structural Similarity is allowed to vary from $-\frac{1}{2}$ standard deviation (SD) (more motion) to $+\frac{1}{2}$ SD (less motion), the predicted proportion of party votes increases by 7.31 percentage points (0.34 to 0.41). Allowing the same variable to vary from $-1$ SD (more motion) to $+1$ SD (less motion) increases the predicted proportion of party votes by 14.61 percentage points (0.30 to 0.45). Finally, in the 112th Congress, when Structural Similarity is allowed to vary from $-2$ SD (more motion) to $+2$ SD (less motion), the predicted proportion of party votes increases by 29.22 percentage points (0.23 to 0.52), suggesting that as video motion decreases, party votes are more likely to occur later that day.
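As a quick arithmetic check on the three comparisons, the reported changes can be divided by the width of each interval. The sketch below uses only the figures quoted in the text; the function name is hypothetical. The near-constant result, about 7.3 percentage points per SD, is what one would expect if the predictions are linear in Structural Similarity over this range.

```python
def per_sd_changes(comparisons):
    """Divide each reported change by the width of its SD interval."""
    return [change / (high - low) for low, high, change in comparisons]


# (lower SD, upper SD, reported change in percentage points), from the text.
reported = [(-0.5, 0.5, 7.31), (-1.0, 1.0, 14.61), (-2.0, 2.0, 29.22)]

for (low, high, _), slope in zip(reported, per_sd_changes(reported)):
    print(f"{low:+.1f} to {high:+.1f} SD: {slope:.3f} points per SD")
```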
Although this application is meant to demonstrate one way frame differencing could be used for social science research, many may be interested in the causal ordering. Does the lack of inter-party dialogue lead to party-line voting? Or does party-line voting create bad blood between the political parties which precludes inter-party dialogue between floor votes? Both could also be influenced by polarization in the electorate which leads to party-line voting in Congress while also making opposing MCs less likely to speak to each other. It is undoubtedly difficult, if not impossible, to answer such questions using the observational design employed in this study, nor is that the goal of this research letter. However, in the SI, several additional tests were conducted to try to provide some initial answers, even though additional work must be done.
First, Table S12 shows that Structural Similarity is not a statistically significant predictor of Previous Party Votes, meaning the motion of a video at time $t$ does not predict party votes at time $t-1$, $t-2$, …, $t-n$. Similarly, Table S13 shows Previous Party Votes is not a statistically significant predictor of Structural Similarity. If the relationship between party-line voting and inter-party discussions is due to polarization in the electorate, then Previous Party Votes, Future Party Votes, and Structural Similarity should consistently predict one another. Not only is this not what I found, but Table S14 shows my main results also hold when an additional control is included for the percent of voters who identify as Independent. Collectively, when these results are compared to those reported in Table 1 and Tables S1–S10, more evidence suggests Structural Similarity is predictive of party voting and not the other way around. Finally, Table S11 shows there is no significant interaction between Structural Similarity and Sponsor Party Leader, which is inconsistent with the partisan signaling mechanism outlined in Section 2, but these tests are only suggestive and should be taken with the appropriate amount of skepticism.
5 Discussion and Conclusion
Although previous scholars have studied image data (e.g., Torres 2018; Casas and Williams 2019; Xi et al. 2019), the methodology I introduce is unique to video data, which includes television broadcasts, social media clips, and recorded deliberations, among other things. Rather than thinking of these videos as collections of images, this study offers motion detection as one of the many ways the temporal dynamics of videos can be used to answer important social science questions. In this way, this research letter should be thought of as an important first step towards better harnessing the use of video-as-data.
Frame differencing is less computationally intensive than alternatives such as optical flow and can be successfully used with both low- and high-resolution videos, which is why it is offered as a useful starting place for this line of inquiry. Although my application is compelling, it is only used to highlight the methodology, which can be used by a wide range of scholars. For example, the “energy” of campaign events or party conventions could be quantified by the degree to which videos change from one frame to the next (Zhong et al. 2007). Changes in motion could also be used to detect crowd panic (de Almeida et al. 2016), like what often occurs when protesters and counter-protesters clash.
More localized motion has also been shown to be associated with nonverbal displays (Murthy and Jadon 2009), which would be of interest to scholars of elite speech, especially those who emphasize the importance of the way words are spoken in the U.S. House of Representatives (Dietrich, Hayes, and O’Brien 2019) and on the Supreme Court (Dietrich et al. 2019). Finally, emotional contagion can also be measured using frame differencing (Paxton and Dale 2013; Ramseyer and Tschacher 2014). This relationship would be of interest to various political psychologists who have studied emotion using both experimental (Huddy, Mason, and Aarøe 2015) and observational designs (Valentino et al. 2011).
My results are also substantively important since they suggest the social environment on Capitol Hill (or lack thereof) may be more important than previously thought. Although these results are far from definitive, the fact that they are suggestive lays an important foundation for future research. The approaches used by previous scholars to understand the “social fabric” of Congress are more labor- and time-intensive than the methodology introduced in this paper (e.g., Patterson 1959, 1972; Caldeira and Patterson 1988), meaning not only will my results motivate a new line of congressional research, but the measure introduced in this study should also help future scholars more easily measure complex social interactions, both on and off Capitol Hill.
Acknowledgments
I am grateful to Robert X. Browning, Alan Cloutier, and the rest of the staff at the C-SPAN Archives for their help and support. I would also like to thank the anonymous reviewers and editor at Political Analysis as well as Arthur Spirling, Molly Roberts, Yiqing Xu, Jeff Mondak, and conference and seminar participants at UC-San Diego and PolMeth for all their helpful comments and suggestions. Finally, this paper would not have been possible without my fantastic research assistants: Jielu Yao and Logan Drake.
Data Availability Statement
The replication materials for this paper can be found at Harvard Dataverse at https://doi.org/10.7910/DVN/YQPEVQ.
Supplementary material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2020.25.