Statistics and Big Data
I am late, I know… I came back last week from the NTTS (New Techniques and Technologies for Statistics) 2013 conference in Brussels and have not yet had a minute to stop and write down my impressions. Fortunately I live-tweeted during the conference, so I haven’t completely lost trace of my thoughts while there… let me try to put them together a bit more coherently now.
NTTS is a biennial conference of official statistics – those institutions and people who provide governments with essential quantitative information about society and the economy. Key actors include Eurostat and National Statistical Offices of countries, sometimes statistical agencies of particular ministries or government departments, as well as international institutions that also produce data such as the OECD, and academics. Official statistics used to work primarily with surveys – asking citizens and firms to report on their habits, activities, situations. Examples are the Labour Force Survey, or the Survey on Income and Living Conditions, both conducted throughout the whole of Europe. Northern European countries initiated the tradition, now adopted by an increasing number of European countries, of supplementing surveys with administrative records: official registers of vital events such as births, deaths and marriages; tax returns filed with the fiscal administration; registers of attendance in public schools; admissions to, and discharges from, public hospitals; and so on. Administrative data have the advantage of being already available, with no need to bother people with forms to fill in; of being abundant and sometimes even exhaustive; and of being very detailed and often much more accurate than any recollection by individuals. Traditionally, economists in particular have always been suspicious of surveys, and much more inclined to trust administrative data, which suffer much less from any respondent-related cognitive or memory bias.
But Big Data are now shaking up this long-established world of surveys and administrative records. Very different, very promising, and typically produced by other actors (notably private companies rather than public-sector services), Big Data seem to question the existence and usefulness of more traditional data sources. Big Data are digital traces of our activities, collected by the electronic devices we now all use: our Internet usage, Facebook or Twitter accounts, credit card purchases, loyalty schemes with companies (like frequent flyer programmes with airlines, or customer rewards cards with supermarkets), use of public transport as recorded by our cards (such as Oyster in London or Navigo in Paris), CCTV footage etc. The sheer amount of these data, the level of detail and the accuracy of the information collected are appealing: my credit card statement certainly provides a more precise account of my expenses of last month (when, where, how much, for what…) than what I may recollect even with my best effort. Companies see a lot of potential in the use of Big Data (see the 2010 The Economist report on Data Deluge, or the 2011 McKinsey report, presenting Big Data as the next frontier for innovation), researchers are enthusiastic, and even public administrations are now starting to realise that this is a sea change.
Filed under: Data, Internet and social media, Research, Social science methodology
Just as an additional part of my interview on data has come out on SFR Player, I am preparing to attend a major conference on data and statistics next week in Brussels, with participation of national statistical institutes from European and other countries. Official statisticians, who mainly used to do surveys, now fully realize the potential of administrative data – large amounts of information already available in the state’s records, cheaper to obtain than surveys or (horribly expensive!) censuses, often more comprehensive and sometimes almost exhaustive, while also avoiding all types of respondent-induced bias. The registers of schools, hospitals and tax offices are likely to be richer and more accurate than whatever an individual may bother to recall and/or declare about their education, health or income. Social science researchers also demand more and more access to administrative data – especially economists, who never really liked surveys and always mistrusted respondents (I’ll develop this in a future post). Northern European countries pioneered the use of these data for official statistics and research, and other governments are now following suit.
Meanwhile, the private sector is also discovering a wealth of non-survey data, notably through the Internet and data mined from social networking and other online services – the so-called “big data”. Like administrative data, these are traces of human activity that are recorded by some automated computer system, and do not depend on an individual’s memory, cognitive bias or psychological state. In this sense they can be more accurate than surveys, interviews and questionnaires. Even at the individual level, I can retrace my monthly expenditures more easily and precisely by looking at my credit card records online than by painstakingly trying to recall the date, time, amount and place of each transaction. So, from the viewpoint of a researcher, mining large amounts of these data may give a very precise and accurate picture of a human activity, with a level of detail that surveys could never hope to achieve. Combined with the computational capacities of today, which allow handling larger amounts of data than was ever possible in the past, these data give the social and economic sciences a new tool that may bring a decisive improvement in their capacity to understand society and advise policy makers.
Particularly interesting is the vision of relational activities provided by big data from social networking services: relationships between people recorded as they access and use the service. When I’m on Twitter, it means that I’m talking to someone. When I’m on Facebook, it means that I have friends (or at least contacts if we don’t want to take the “friendship” metaphor too strictly). For a social networks scholar, this is great to know.
Continue reading ‘Again on data and big data in social research’
Filed under: Data, Research
Tags: Big data, Network Analysis, Quantitative methods, Social science data, social theory, Statistical modeling, Web-based social networks
I was interviewed by SFR PLAYER (an online magazine published by SFR, a major telecom provider in France) on the changes induced by the use of big data in my work as a social science researcher. The video interview (in French) is available here. The same issue features an interview with danah boyd and with various specialists in open data, data journalism and Internet data.
Interestingly, the video was shot at the École Normale Supérieure Campus Jourdan in Paris, a place that hosts part of Réseau Quetelet, the French national data service for the social sciences and humanities. The Jourdan unit ADISP handles primarily statistical data – data from surveys conducted by INSEE, the French national statistical agency, and administrative data such as those of the ministries of education and labour – for use in the social sciences.
In fact, research in the social sciences has always used data as a basic ingredient. Data from official and public-sector statistics have long set the standard, and access to these data is ever more in demand today. European initiatives like the Data without Boundaries project, in which I am taking part, aim precisely to bring improvements in this area.
We are now in the midst of a major upgrade with the availability of big data, data from the Internet, the digital traces of our activities. They have the advantage that they can be retrieved, saved, coded and processed much faster, much more easily and in much larger amounts than more classical records such as registers of students in schools or of patients in hospitals. McKinsey has already pointed to potential economic benefits of big data for business, and research has taken notice too.
Continue reading ‘Data and big data: digital traces of social phenomena to nourish research’
Filed under: Data, Internet and social media, Research, Social science methodology
Tags: Big data, Quantitative methods, Small data, Social science data, Social simulation, Statistical data, Web-based social networks
The one-day workshop on “Introduction to Social Network Analysis” that I gave two weeks ago (wow, time flies…) at the University of Greenwich was a great satisfaction! A good audience of about 15 people (not too few, not too many), all very bright and nice. We had interesting and stimulating questions, and it was quite an inspiring event – I take this opportunity to thank all those who attended!
I will soon give another such workshop. It will take place on Tuesday, 21 May in the afternoon and on Wednesday, 22 May in the morning at the University of Hamburg, Germany, and it will be one of the many workshops preceding the 2013 Sunbelt conference (for those who aren’t familiar with it, it is the main international venue for experts in social network analysis). It will follow pretty much the same structure as at Greenwich, but based on past experience, I will shorten the theoretical introduction and dedicate more time to network metrics and measures, and their practical calculation and visualisation on the computer (with Gephi). This will make the workshop more interactive, while allowing enough time for participants to become familiar with formal concepts they may never have heard of before. The other novelty is that the workshop will be taught jointly with Yasaman Sarabi, a PhD student at Greenwich who specializes in organisational network analysis. Together, we will be better able to support participants and help them with the software.
For those who have a particularly strong interest in social network analysis and cannot be content with just a one-day “taster”, I will also offer an intensive, two-week course on “Doing research with SNA”. It will be part of a Summer School organised at the University of Greenwich on 17 – 25 June 2013, just before the annual conference of the UK Social Networks Association. This course, also taught with Yasaman Sarabi, will review theories and measures, with computer applications (using Gephi again, but also UCINET and Netdraw); in addition, it will offer insight into, and hands-on experience in, research design, organising working groups in which participants set up and conduct a mini-research project, and then present their results. The objective is to help participants identify how they can integrate social network analysis into their own research, and how to reframe their research questions so that network concepts can play a role. The summer course targets PhD students and junior researchers as a priority, and (like all workshops I give) presupposes no preliminary knowledge of social network analysis, statistics, or computer programming.
More information on the Sunbelt workshop is available here.
More information on the Greenwich summer course is here.
Filed under: Business networks, Social networks, Social science methodology
Tags: Inter-organisational Networks, Intra-organisational networks, Network Analysis, Networks and Markets, Quantitative methods, Social science data, Web-based social networks
I am going to give another one-day workshop on Introduction to Social Network Analysis in a couple of weeks’ time, more precisely on Monday, 14th January, at the University of Greenwich, London, as part of a Winter School for researchers and PhD students in social science, management and economics, dedicated to Analytical Software.
The rationale is pretty much the same as usual. I have stressed many times how the recent rise of online social networking services (Facebook, LinkedIn, Twitter etc.) has drawn massive attention to the field of social network analysis (SNA). Yet social networks have always existed and are in fact a constant of human experience – whether in the family, with friends, at school or in the workplace, to name but a few examples. Likewise, SNA already has a respectable history and has been successfully applied to study a wide variety of social contexts.
The workshop is aimed at those who are new to the field, and would like to better understand whether and how they can use it to enhance their own scholarly practice (whether it is research, teaching or consultancy). All social science backgrounds are welcome, and participants are assumed not to have any previous knowledge of SNA (or statistics or software use, programming etc.). The goal of the workshop is to provide attendees with basic insight into what social network analysis is, and how it can be used in social science research, together with some hands-on experience of how to use network data and how to graphically represent networks, calculate key metrics, and perform some elementary analyses with Gephi, a powerful yet user-friendly open-source software package for visualizing and analyzing network graphs.
Continue reading ‘A new “Introduction to SNA” short course soon!’
Filed under: Business networks, Social networks, Social science methodology
Tags: Inter-organisational Networks, Intra-organisational networks, Mixed methods, Personal networks, Quantitative methods, Social science data, Web-based social networks
2012 in review
WordPress prepared a 2012 annual report for this blog… here it is!
Here’s an excerpt:
This blog got about 11,000 views in 2012.
Filed under: Social networks
I have already mentioned our study ANAMIA, undertaken in collaboration with an interdisciplinary team of sociologists, social psychologists, philosophers, economists, and computer scientists in France and the UK. We look at the so-called “pro-ana” and “pro-mia” websites, blogs and forums (where “ana” and “mia” stand for anorexia nervosa and bulimia nervosa), which have raised lively and recurrent controversies in recent years, motivated by the concern that they may contribute to maintaining and even spreading eating disorders.
Over three years, we have studied the structure, function and influence of the online social networks of persons with eating disorders, and potential consequences on their nutrition and health. Previous research on the pro-ana phenomenon simply looked at the contents of websites (typically only English-language ones), or sometimes at the effects of viewership on random samples of audiences (typically not actual users). We decided to move forward and let users themselves speak; besides, for the very first time, we asked them not just about their health and their internet practices, but also about their sociability. Indeed, through a web-based survey with a special graphical application (see figure below for an example, and see this post for more detail), we reconstituted their entire personal networks of friends, family members, schoolmates, colleagues and acquaintances, both online and offline. In order not to limit the study to the English-speaking webosphere, we distributed the survey in two languages, English and French, and obtained responses from 284 persons.
We had three questions in mind when we started the study.
- First, is the Internet a refuge for people who feel uncomfortable with their social surroundings, who find it difficult to share their eating disorder experience?
- Second, are these persons trying to hurt themselves by visiting these sites? For example, are they aiming to learn (or to teach) tricks to limit their calorie intake, exercise more intensely, or hide their weight loss from others?
- Third, does their use of websites and social media involve a rejection of established health norms, and perhaps of the entire health system?
The project results have brought us a few “surprises”. Well, to be sure, we as social scientists didn’t expect things to be as simply clear-cut as the media would present them. But what we found went well beyond what we could imagine at first.
Filed under: Internet and social media, Social networks, Socioeconomic studies of health, Sociology
Tags: Eating behaviors, Eating disorders, Food choices, Network Analysis, Pro-ana and pro-mia websites, Sociology, Trans-disciplinarity, Web-based social networks, Well-being
Yesterday I was at the Just-in-Time-Sociology (JITSO) workshop in Lausanne (oh, how I still like this town, after such a long time!). JITSO was a small-scale, nice and friendly event for like-minded social researchers, who feel the urge to use their baggage of theories and techniques to provide science-informed responses to today’s fast-paced social, political and economic events. JITSO may mean attempting rapid, but still research-based, reactions to events that require immediate policy decisions, and for which the social sciences have been traditionally ill-prepared. This is, for example, what Antonio Casilli and I endeavoured to do with our study of the UK riots in August last year. JITSO may also mean documenting events as they unfold, before their digital traces disappear. This is the case for all those studies of the Internet and social media whose empirical basis shifts daily, not only because people change and evolve (this would be common to all types of social science data), but also because the few big firms that dominate the IT market change their rules, terms of use, and practices so frequently. An example is the analysis of the French blogosphere of persons with eating disorders, which I am undertaking with Antonio Casilli and Fred Pailler in the ANAMIA research project.
Two major issues appeared yesterday. One is the sustainability and durability of research projects originated as “just-in-time” ones. Rapid response papers are necessarily imperfect or incomplete, precisely because they need to be put together in such a short time; so they require integration in a longer term perspective, aiming at theoretical refinement and empirical validation. But over time, media enthusiasm may fade away, expected sources of funding may not materialize, data may be difficult or expensive to collect, and ethical committees may become more conservative in granting authorizations to projects such as these, which typically have strong political orientation (and implications).
The other issue is the availability of data. Internet firms and social networking services are understanding more and more clearly the economic value of data, and are more and more reluctant to give them away for free. This is creating increasing obstacles for research programmes that need these data as digital traces with which to keep track of, and reconstitute, ongoing social change. There needs to be some strong political action to prevent this from happening.
The difficulties are clear, but the good thing is that the JITSO programme of research has a lot of freshness and willingness to go ahead, not least by exploring new modes of peer-reviewing and publication. Not all the methods or topics are new, of course, but the approach has finally become explicit, and as such, potentially recognised. I look forward to future editions of JITSO, though it’s not yet clear when, or where.
The slides of my keynote speech are available here.
Filed under: Internet and social media, Research, Social science methodology, Sociology
Tags: 2011 UK riots, Agent-based models, Mixed methods, Public policy analysis, Quantitative methods, Social science data, Social simulation, social theory, Sociology, Web
On 14th December 2012, the French National Library (BNF, Bibliothèque Nationale de France) in Paris will host the ANR ANAMIA symposium “Understanding Pro-Ana: Body, Networks and Nutrition” (Comprendre le phénomène pro-ana : corps, réseaux, alimentation). Presentations will be in French (see program here). An English summary is available here. Attendance is free of charge but for organisational reasons, participants are invited to register in advance here.
Filed under: Internet and social media, Social networks, Socioeconomic studies of health
Tags: Eating behaviors, Eating disorders, Food choices, Mixed methods, Network Analysis, Pro-ana and pro-mia websites, Web-based social networks, Well-being
[SAVE THE DATE: on 14th December 2012, we will hold a symposium on “Understanding Pro-Ana: Body, Networks and Nutrition” (Comprendre le phénomène pro-ana : corps, réseaux, alimentation) at Bibliothèque Nationale de France, Paris. It is an output of the research project ANAMIA of which the study presented here is part].
With Antonio Casilli and Lise Mounier, two colleagues in our ANAMIA research project, we have a new peer-reviewed article on “Eliciting personal network data in web surveys through participant-generated sociograms”, which has been accepted for publication in the Field Methods journal, and is expected to come out in Vol. 26, issue 2, 2014.
We present an innovative method to collect personal network data in a web survey. Via a user-friendly Flash applet, respondents can draw their own social networks of acquaintances, whether offline or online. These “ego-centered” networks are displayed as targets whose centre represents the survey participant (ego).
We ask participants to draw around them their contacts. For example, asking them to report their online contacts, we prompt them to think of:
My online contacts… are people whom I have talked to and/or interacted with in the last six months, and whom I meet for example on discussion forums, blogs, email, MSN, social networks (Facebook, Last.fm etc.). At the centre of this target, it’s me. I shall place the others around me, the closest towards the centre and the others further away.
By clicking on a “+” button, participants can add new acquaintances (alters) and drag and drop them around the target. They are then asked to indicate who these alters are (name, type of relationship, and gender) and to draw ties between them if they exist, or to group them together if they belong to a common social circle.
The data retrieved in this way enable us to calculate classical network metrics (size, density, tie strength etc.) and, if more than one target has been filled, to compare different networks: for example, establish similarities and differences between the online and offline personal networks of participants.
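To make these metrics a little more concrete, here is a minimal sketch in Python of how size, density and an average tie strength might be computed for one target, and then compared across an online and an offline network. It is only an illustration under stated assumptions, not the procedure used in the article: the function name, the data layout, the example names and the 1 to 5 closeness score (standing in for how near the centre an alter was placed) are all hypothetical. Density is computed among alters only, leaving ego out, which is a common convention for personal networks.

```python
def ego_network_metrics(alters, ties):
    """Compute basic metrics for one ego-centred network.

    alters: dict mapping alter name -> closeness score (hypothetical 1..5
            scale, used here as a stand-in for tie strength with ego)
    ties:   iterable of (alter_a, alter_b) pairs drawn between alters
    """
    size = len(alters)

    # Keep undirected alter-alter ties once each; skip self-loops and unknown names
    unique_ties = {
        frozenset((a, b))
        for a, b in ties
        if a != b and a in alters and b in alters
    }

    # Density = observed alter-alter ties / possible alter-alter ties (ego excluded)
    possible = size * (size - 1) // 2
    density = len(unique_ties) / possible if possible else 0.0

    # Average closeness to ego across all alters
    mean_strength = sum(alters.values()) / size if size else 0.0

    return {"size": size, "density": density, "mean_tie_strength": mean_strength}


# Made-up example data for one participant (not survey data)
offline = ego_network_metrics(
    {"Anna": 5, "Ben": 4, "Chloe": 2, "Dan": 1},
    [("Anna", "Ben"), ("Anna", "Chloe"), ("Ben", "Chloe")],
)
online = ego_network_metrics(
    {"user1": 3, "user2": 2, "user3": 1},
    [("user1", "user2")],
)

# Simple side-by-side comparison of the two networks
for metric in ("size", "density", "mean_tie_strength"):
    print(f"{metric}: offline={offline[metric]:.2f} online={online[metric]:.2f}")
```

In this made-up example the offline network is both larger and denser than the online one; with real data, the same comparison would be repeated for every participant who filled in both targets.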
We have used this method to conduct a survey of users of websites, blogs and forums dedicated to eating disorders, in France and the UK.
A preliminary version of the article is available here. More information on the research project ANAMIA, for which the tool has been conceived, is available here.
Filed under: Internet and social media, Social networks, Social science methodology, Sociology
Tags: Eating behaviors, Eating disorders, Network Analysis, Pro-ana and pro-mia websites, Quantitative methods, Social science data, Sociology, Web, Well-being




