Data Capture and Analysis Tools

1. Data Capture

1.1. Web Scraping

1.2. API and Google Developers

1.3. Getting Data From a Website

1.3.1. Capturing Structured Data

1.3.2. Capturing Content Data

1.3.3. Capturing Contextual Data

1.4. Screen Capture

1.4.1. Screen Capture Tools

1.4.2. Screen Recording (Screencasting) Tools

2. Data Organisation

2.1. Note Taking Tools

2.2. Reference Manager Tools

2.3. Data Management Tools

2.4. Project Management Tools

3. Data Analysis

3.1. Collective Data Analysis

3.1.1. Analyzing (General) Statistical Data

3.1.2. Analyzing Websites Traffic Data (Web Analytics)

3.1.3. Analyzing Web Search History

3.1.4. Analyzing Web Browsers and Operating Systems Trends

3.1.5. Analyzing Word Density

3.1.6. Analyzing Urban Movement (through Social Media Data)

3.1.7. Analyzing Geographical Data

3.1.8. Analyzing Relational Data (Network Analysis)

3.1.9. Analyzing Data Visually

3.2. Social Media Search and Sentiment Analysis Tools

3.2.1. Multi-platform

3.2.2. Facebook

3.2.3. Twitter

3.3. Analyzing Traffic and Ranking of the Main Social Media Platforms

3.3.1. Facebook

3.3.2. YouTube

3.3.3. Google+

3.3.4. Twitter

3.4. Individual Data Analysis

3.4.1. Analyzing Web Browser History Trends

3.4.2. Analyzing Web Navigation Time

3.4.3. Analyzing Personal Daily Activities (Personal Analytics)

3.4.4. Social Media Management Tools

1. Data Capture

In computer science, data capture is any process of converting information into a form that can be handled by a computer. Data capture technology is needed when information and data exist in scanned images and electronic files of various formats. In the information era, getting information from web pages is essential for any company, corporation or organization.

1.1. Web Scraping

Web scraping is the practice of extracting large amounts of information from a website using web scraping software.

Web scraping software interacts with websites in the same way web browsers do. Instead of displaying the data served by the website on screen, however, it saves the required data from the web page to a local file or database.
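
The parse-and-save step described above can be sketched with Python's standard library. This is only a minimal illustration, not one of the tools covered in this wiki: the `HeadingScraper` class, the choice of headings as the "required data" and the `SAMPLE_HTML` page are all assumptions made for the example. A real scraper would first download the page (for instance with `urllib.request.urlopen`) instead of using an inline string.

```python
import csv
import io
from html.parser import HTMLParser


class HeadingScraper(HTMLParser):
    """Collect the text of every <h1>, <h2> and <h3> heading in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (tag, text) tuples
        self._current = None  # heading tag we are currently inside, if any
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current = None


def scrape_headings(html):
    """Parse an HTML string and return the headings found in it."""
    parser = HeadingScraper()
    parser.feed(html)
    return parser.headings


def to_csv(rows):
    """Serialise the scraped rows as CSV text, ready to be written to a file."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["tag", "text"])
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = (
    "<html><body><h1>News</h1><p>intro</p>"
    "<h2>Sport <em>today</em></h2></body></html>"
)

rows = scrape_headings(SAMPLE_HTML)
print(to_csv(rows))
```

The CSV text could equally be written to a local file or inserted into a database, which is the storage step the paragraph above refers to.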

Google Chrome comes with built-in developer tools, and users can also install further extensions from the Chrome Web Store. Firefox can be easily configured for web scraping with Firebug, an add-on which enables users to edit, debug, and monitor CSS, HTML, and JavaScript live in any web page.

A selection of digital tools is covered in this wiki; however, it is recommended to go to the Digital Methods’ library to see a bigger selection of tools available.


1.2. API and Google Developers

In computer programming, an application programming interface (API) specifies a software component in terms of its operations, their inputs and outputs and underlying types. Its main purpose is to define a set of functionalities that are independent of their respective implementation, allowing both definition and implementation to vary without compromising each other. The use of APIs provides access to non-public Internet environments, such as those requiring authentication through login and password, because the data collection runs directly through the back-end of the social media service to which the data belongs.
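
As a rough illustration of how a program addresses such an interface, the sketch below builds (but does not send) an authenticated HTTP request. The endpoint `https://api.example.com/v1/posts`, the parameter names `q` and `limit`, and the bearer-token scheme are all hypothetical; every real service documents its own URLs, parameters and authentication method.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with the ones documented by the actual service.
API_BASE = "https://api.example.com/v1/posts"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"


def build_request(query, limit=10):
    """Build an authenticated GET request for the hypothetical search endpoint."""
    params = urlencode({"q": query, "limit": limit})
    return Request(
        f"{API_BASE}?{params}",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )


request = build_request("data capture")
print(request.full_url)
print(request.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` and decoding the body with `json.loads` would then return the data as Python objects, provided the service answers in JSON.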

Google Developers is a set of websites where one can easily search for APIs, SDKs, guides, samples and documentation. It can be built into Chrome by following these steps.


1.3. Getting Data From a Website

There are three different categories of web data: structured, content and contextual. Each category requires different tools for retrieving information.

1.3.1. Capturing Structured Data

Structured data is a general name for all markup that abides by a predetermined set of rules. These rules define types of data as well as the relationships between them, so that the data can later be read by different programs, such as browsers and search bots. Structured data includes:

* schema.org

* Microformats

* Microdata

* RDFa

The structured data of a website can be viewed by looking at its source code. The source code may be unintelligible to users without any knowledge of programming languages. The aforementioned developer tools in both Firefox and Chrome can help in understanding the code.
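
To give an idea of what such markup looks like in the source code, the following sketch reads Microdata attributes (`itemtype`, `itemprop`) from a small HTML fragment. The fragment and the `MicrodataParser` class are made up for illustration, and the parser is deliberately simplified: it assumes each property holds plain text or a `content` attribute and ignores nested items.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect itemtype URLs and itemprop values from Microdata markup.

    Simplified: a property value ends at the first closing tag, so
    properties containing nested elements are not handled.
    """

    def __init__(self):
        super().__init__()
        self.item_types = []   # values of itemtype attributes
        self.properties = {}   # itemprop name -> value
        self._open_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # e.g. <meta itemprop="birthDate" content="1815-12-10">
            self.properties[prop] = attrs["content"]
        else:
            self._open_prop = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._open_prop is not None:
            self.properties[self._open_prop] += data

    def handle_endtag(self, tag):
        self._open_prop = None


# Hypothetical fragment using the schema.org Person vocabulary.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Ada Lovelace</span>
  <meta itemprop="birthDate" content="1815-12-10">
  <span itemprop="jobTitle">Mathematician</span>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
print(parser.item_types)
print(parser.properties)
```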

Introductory Tutorials can be found here:

* Chrome

* Firefox

Further tools of interest:

* Source Code Search: This tool can be found on the digital methods website, and can be useful when looking for a certain type of source code element. While there are similar functions in Chrome and Firebug, it can also be convenient for a simple search of code elements within a website.

1.3.2. Capturing Content Data

Content data refers to the text and/or images users see on websites. A simple method of capturing content is “copy-pasting”. However, the software listed below lets users capture data in a more sophisticated way, since it offers more useful features (e.g. storing a bibliography).

* Zotero - While more of a reference organization tool, Zotero has a very useful function of making screenshots of a web page. These screenshots capture all of the data on a page (including source code) and not just a jpg.

* Text Ripper - This tool allows for an extraction of pure text from any website.

* Link Ripper - With this tool one can capture all internal and outgoing links of a web page.

* Harvester - Getting full link names from websites can be a difficult task, as the source code (especially minified source code) provides lengthy links which are interweaved with other content. Using a link ripper to extract all links on a web page can therefore save plenty of time.

* Image Scraper - An easy tool to get all the image files from a web page.

* Tag Cloud Generator - A simple tool to count word re-occurrences on a web page. While this can already be seen as a data analysis tool, it can be useful to extract certain bits of text from a website as well. For instance, URLs can be analyzed for the most used words on a website, which spares one from copying a lot of text. This can be useful when analyzing multiple websites.

Data obtained through one of the aforementioned methods can be rather messy, but users can organize the information using various tools.
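
The kind of extraction performed by link and image tools like those listed above can be approximated in a few lines. The sketch below is not the code of any of these tools; the class name, the base URL and the sample page are assumptions. It resolves relative addresses against the page URL and separates internal from outgoing links by comparing host names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndImageParser(HTMLParser):
    """Collect internal links, outgoing links and image URLs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.internal = []
        self.outgoing = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])  # resolve relative links
            same_host = urlparse(url).netloc == urlparse(self.base_url).netloc
            (self.internal if same_host else self.outgoing).append(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


# Hypothetical page and address, standing in for a downloaded document.
PAGE_URL = "https://example.org/blog/"
PAGE_HTML = """
<a href="/about">About</a>
<a href="https://other.example.com/x">Partner</a>
<img src="img/logo.png" alt="logo">
"""

ripper = LinkAndImageParser(PAGE_URL)
ripper.feed(PAGE_HTML)
print(ripper.internal)
print(ripper.outgoing)
print(ripper.images)
```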

1.3.3. Capturing Contextual Data

Contextual data encompasses different statistical information about various aspects of a website (number of visitors, page views, etc.) and individuals (name, age, address, phone number, and other demographical information).

* Time Stamp Ripper - This tool can display the last time a web page was modified.

* Meta Data Comparison - A tool which captures the metadata of websites and allows side-by-side comparison.

* GeoExtraction and GeoIP are tools offered by the Digital Methods Initiative, which can extract a location from a URL or IP address.

* Wolfram Alpha - A great tool for looking up certain statistical data about web pages.

* Alexa - A website for examining how much traffic a site has and what demographics are interested in it.
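
One common source for a "last modified" time is the `Last-Modified` HTTP response header. Whether the Time Stamp Ripper relies on it is not stated here, so the sketch below is only an assumption about how such a lookup could work. The header value is a made-up sample; in practice the headers would come from a real response, for example `urllib.request.urlopen(url).headers`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if it is absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value)


def age_in_days(modified, now):
    """Number of whole days between the modification time and `now`."""
    return (now - modified).days


# Sample headers (assumed values) in place of a live HTTP response.
sample_headers = {"Last-Modified": "Mon, 22 Sep 2014 10:00:00 GMT"}

modified = last_modified(sample_headers)
print(modified.isoformat())
print(age_in_days(modified, datetime(2014, 10, 2, 10, 0, tzinfo=timezone.utc)))
```

Many servers omit this header for dynamically generated pages, in which case the function returns `None`.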


1.4. Screen Capture

Screen capture is the process of gathering visual material from a user’s computer, either as a screenshot or as a screencast. A screenshot is an image file which shows the contents of a computer screen at the moment of capture. A screencast (video screen capture) is a digital recording of computer screen output; it usually also contains an audio track captured by the built-in or a connected microphone.

1.4.1. Screen Capture Tools

* Grab Them All - Firefox add-on that takes screenshots (JPG or PNG) of sites listed in a .txt file.

* Screen Capture (by Google) - A Chrome extension that allows one to take a screenshot (JPG or PNG) of any part of the visible screen or an entire tab or webpage. It can capture large web pages that require the user to scroll vertically and/or horizontally.

* Print Screen - The screen can also be captured by pressing the PrtSc key on the keyboard (Windows) or the three-key combination Command + Shift + 3 (Mac).

1.4.2. Screen Recording (Screencasting) Tools

* Screenr - An easy-to-use online screen recording tool. It records a selected portion of one’s screen, which one can accompany with a voice recording using a microphone. The video can then be published on Screenr.com or YouTube.

* Screencast-O-Matic - A similar online screen recording tool. It offers more functions (e.g. hiding or highlighting mouse movements) and a longer recording time (15 min. instead of 5). Its downside is that it displays a logo on the video.

* Jing - Free screencast software for Mac and Windows, which requires online registration. The program is easy to use but the offered functions are limited. It can also be used to take screenshots.

* CamStudio Recorder - Open source screencast software for Windows. Remember to change the preferences before recording, as videos are stored in c:windows\temp by default. The file sizes are large unless one significantly lowers the video quality.



2. Data Organisation

Organization of data is important when capturing and analysing data. The main objective is a systematic organization of data, regardless of whether it is held in a physical or a virtual database. Data systems are arranged in simple and logical structures, which allows users and applications to access and use the data quickly and easily.

2.1. Note Taking Tools

* Google Drive - This online storage service helps to manage and save notes, tables, pictures, recordings, videos and more. It enables the user to share notes across multiple devices via the cloud.

* Basket Note Pads - A free note-taking program, powered by KDE. It collects research results in an easy way and organises them in separate “baskets”. It can collect various types of data (images, links, e-mails, etc.).

* Evernote - A web-based service that is free if one uploads up to 40MB/month. Apart from taking notes, one can scan post-its and import pictures, PDFs, SMS messages and tweets. The software is also compatible with smartphones and tablets.

* KeepNote - Works with Windows, Linux, and Mac OS X, and is considered to be one of the most popular free note-taking programs. With KeepNote, one can store class notes, create to-do lists and much more, with rich-text formatting and images. Using full-text search, one can retrieve any note for later reference.

* Freeplane - Free and open source software to support thinking, sharing information and getting things done at work, in school and at home. The core of the software consists of functions for mind mapping (also called concept mapping or information mapping) and tools for using mapped information. Freeplane runs on any operating system on which a current version of Java is installed, and can also be run from a USB drive.

* FreeMind - A free mind-mapping tool written in Java.

2.2. Reference Manager Tools

The following software helps users collect source information and automatically generates bibliographies and citations.

* Zotero - Free, easy-to-use software which helps gather, organize and cite research sources. Available with 100MB free online storage. Zotero works best with Firefox but is also supported by Chrome.

* Mendeley - Free desktop software (Windows, Mac, Linux) that enables organization and sharing of research papers, trend discovery and the creation of bibliographies, with 1GB free online storage, automatic back-up and cross-platform synchronization.

* Qiqqa - A primarily Windows-based software tool aimed at research and PDF management.

* Docear - An integrated academic-aid tool with a digital library for PDF documents, a reference manager, note taking and mind map capabilities.

* CiteULike - An online social bookmarking-based reference manager tool.

* JabRef - An open-source bibliography reference manager with BibTeX as its native format.

* Google Scholar Citations - A Google Scholar option that authors can use to compute citation metrics and track them over time. The same caveats that apply to citation searching in Google Scholar apply to Google Scholar Citations.

* Google Alerts - This service sends emails to the user when it finds new results (such as web pages, newspaper articles, or blogs) that match the user's search term. The step-by-step version to create one’s own alerts can be found at the following link.

2.3. Data Management Tools

The software listed below enables users to clean and organize their data. The programs help with reading and understanding the information and with data visualization.

* Data Wrangler - A free data transformation and analysis app.

* Google Refine - A data sorting and transformation desktop tool.

* Google Fusion Tables - A free web application to gather, visualize and share data tables.

2.4. Project Management Tools

This software eases work on projects that involve multiple people.

* Freedcamp - A free project management tool.

* Redbooth - An online, paid collaboration and project management tool.

* Collabtive - A cloud-based project management tool.



3. Data Analysis

Data analysis is the process of systematically applying statistical techniques to illustrate, evaluate, condense and recap data. Logical techniques are also often used. The process of evaluating data examines each component of the data provided. Data from various sources is gathered, reviewed, and then analyzed in order to arrive at findings or conclusions. There is a variety of specific data analysis methods, including data mining, text analytics, business intelligence, and data visualization. Before analyzing and visualizing data, it is advisable to clean it first. This process is called data cleaning.

Data Cleaning, sometimes also referred to as data cleansing or data scrubbing, is the process of removing invalid data from databases. Poor data quality can affect the results and cause the data analysis to be invalid. Typical data cleaning tasks include records matching, deduplication, and column segmentation. This leads to an alteration of data, after which it is considered cleaned.
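
The cleaning tasks named above can be illustrated on a tiny, made-up data set. In this sketch invalid rows are those without an e-mail address, duplicates are matched on the lower-cased e-mail address, and column segmentation splits a `name` column into first and last name. These rules and field names are assumptions chosen for the example; real data sets need rules suited to their content.

```python
def normalise(record):
    """Collapse repeated whitespace and trim every field."""
    return {key: " ".join(value.split()) for key, value in record.items()}


def clean(records):
    """Remove invalid rows, deduplicate on e-mail and split the name column."""
    seen = set()
    cleaned = []
    for record in map(normalise, records):
        email = record["email"].lower()
        if not email:       # invalid: no e-mail address
            continue
        if email in seen:   # duplicate of an earlier record
            continue
        seen.add(email)
        # Column segmentation: split at the first space only.
        first, _, last = record["name"].partition(" ")
        cleaned.append({"first_name": first, "last_name": last, "email": email})
    return cleaned


# Hypothetical raw data with a duplicate and an empty row.
raw = [
    {"name": "Ada  Lovelace", "email": "ada@example.org"},
    {"name": "ada lovelace ", "email": "ADA@example.org"},
    {"name": "Alan Turing", "email": "alan@example.org"},
    {"name": "", "email": ""},
]

for row in clean(raw):
    print(row)
```

Tools such as those listed under Data Management Tools offer these operations interactively, without programming.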


3.1. Collective Data Analysis

Collective data analysis is a process that uses data assembled from many sources for its basic operation. The data can be sourced from the number of clicks, views (articles, videos, songs, etc.), number of users, time spent on a web page, geographic location and other information. Using analysis tools, collective data analysis shows an average value of the data.

3.1.1. Analyzing (General) Statistical Data

* PSPP - A free (cross-platform) software replacement for SPSS (Statistical Package for the Social Sciences) which executes the most common statistical operations.

* SOFA Statistics - Free (cross-platform), user-friendly software which executes statistical analysis and reporting.

3.1.2. Analyzing Websites Traffic Data (Web Analytics)

* Google Analytics - A user-friendly application which provides basic and in-depth analysis of website traffic and user behavior.

* Piwik - A free software alternative to Google Analytics.

3.1.3. Analyzing Web Search History

* Google Trends - An application which provides a comparative analysis of two to five search terms (comparing them historically and also geographically and linguistically) based on a sample of all Google web searches.

* Wikimedia Statistics - General statistics on Wikipedia using criteria such as page views, page creation, country and language.

* Wikipedia Article Traffic - A tool which retrieves the page views for any given Wikipedia article.

3.1.4. Analyzing Web Browsers and Operating Systems Trends

* Web Browsers - Historical data on web browser usage gathered from five different sources.

* Operating Systems - Historical data on operating system usage gathered from different sources.

3.1.5. Analyzing Word Density

* Ranks - A tool which analyzes the keyword density of a single webpage or a website, and also provides a word tag cloud alongside other useful web statistics.

* Live Keywords - A tool which analyzes the keyword density of a text.

* Infomous - Another user-friendly tool that can be used to quickly create topic clouds.
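
Keyword density is usually understood as the share of all words in a text taken up by a given word. The sketch below computes it under two assumptions that the tools above may handle differently: a short, hypothetical stop-word list is excluded from the ranking, and the percentage is calculated against the total word count including stop words.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def keyword_density(text, top=2):
    """Return (word, count, percentage of all words) for the most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [
        (word, count, round(100 * count / total, 1))
        for word, count in counts.most_common(top)
    ]


sample = "Data capture is the first step. Data analysis follows data capture."
for word, count, density in keyword_density(sample):
    print(f"{word}: {count} occurrences, {density}% of all words")
```

The same counts are what a tag cloud visualizes, with font size standing in for frequency.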

3.1.6. Analyzing Urban Movement (through Social Media Data)

* Livehoods - The Livehoods Project presents a methodology for studying the dynamics, structure, and character of a city on a large scale by using social media (data such as tweets and check-ins) and machine learning. The study is currently available for 7 cities in the U.S. and Canada.

3.1.7. Analyzing Geographical Data

* Quantum GIS - Free (open source) GIS software for advanced users interested in creating maps.

* Google Fusion Tables - A tool that offers an easy way of creating maps to visualize data in CSV files or spreadsheets (to create intensity maps and maps with data markers).

* ManyEyes - A similar tool that also makes it easy to create maps, while providing access to numerous data sets.

* GeoExtraction - A useful tool to extract geographic locations from text.

* GeoIP - Enter a URL and this tool will find its primary IP address along with some additional information.

3.1.8. Analyzing Relational Data (Network Analysis)

* TouchGraph SEO - A Java application that allows users to visually explore connections between clusters of websites.

* Issue Crawler - A tool for hyperlink analysis developed by the Govcom.org foundation that locates and visualizes networks. Registration with Issuecrawler.net is required. See instructions and scenarios of use.

3.1.9. Analyzing Data Visually

* Google Fusion Tables - An accessible tool that allows users to integrate and visualize data in CSV files or spreadsheets. It also provides access to a host of data sets created by a community.

* ManyEyes - A similar user-friendly tool that makes it possible to quickly create maps and different kinds of graphs (pie charts, etc.), and likewise boasts a large number of community resources. A potential downside is that all work is automatically published online when saved.

3.2. Social Media Search and Sentiment Analysis Tools

3.2.1. Multi-platform

* Social Mention - A social media search and analysis platform that aggregates user-generated content into a single stream of information.

* Social Seek - Get all the latest tweets, news, videos, photos, and more on any topic in one place.

* Topsy - A real-time search engine. Topsy indexes and ranks search results based upon the most influential conversations millions of people are having every day about each specific term, topic, page or domain.

* Icerocket - A trend tool; enter items to see mentions trended over time. Enter up to five queries under Trend Terms.

* TipTop - A search on any topic reveals people’s emotions and experiences about it, as well as other concepts that they are discussing in connection with the original search.

* SharedCount - Track the shares, likes, tweets, and more for a given URL. Enter the web address of a page and find out how much it has been shared on different social networking and bookmarking sites.

* WolframAlpha - More than a search engine. It gives access to the world of facts and data and calculates answers across a range of topics.

* Wildfire Social Media Monitor - Glean insights about the growth of a social media fanbase on the leading social networks.

3.2.2. Facebook

* Facebook Search - Facebook search is notoriously bad and will not find much. It relies heavily on one’s social graph, so the results may not be meaningful for one’s target audience.

3.2.3. Twitter

* WhatHashtag? - A service that finds the most used Twitter hashtags for the keywords entered.

* Twitrratr - Twitrratr has a list of positive keywords and a list of negative keywords. It searches Twitter for a keyword and the results are cross-referenced against these lists, then displayed accordingly.

* Twazzup - Useful for understanding who the influencers on a given topic are and what the trending sources are.

* Twitter Grader - Twitter Grader lets one check the power of a Twitter profile compared to millions of other users that have been graded.
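
The keyword-list approach to sentiment described for Twitrratr above can be sketched as follows. The word lists and sample messages are invented for illustration and are not the lists used by that service; the rule for ties (classify as neutral) is likewise an assumption.

```python
import re

# Hypothetical keyword lists; a real service would use far larger ones.
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"bad", "hate", "awful", "terrible"}


def classify(text):
    """Label a message by comparing its words against the two keyword lists."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    positive_hits = len(words & POSITIVE)
    negative_hits = len(words & NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


messages = [
    "I love this phone, great camera",
    "Awful battery, I hate it",
    "Phone arrived today",
]
for message in messages:
    print(classify(message), "-", message)
```

Such a lookup is easy to understand but cannot detect negation or irony ("not good" counts as positive), which is a known limitation of keyword-based sentiment analysis.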

3.3. Analyzing Traffic and Ranking of the Main Social Media Platforms

3.3.1. Facebook (Page Insight)

Since Pages are public spaces, one’s engagement with Pages is also public. That means most of the data is available for one to create one’s own General Page Metrics. All the FAQs about the whole process are reachable at the following link.

3.3.2. YouTube Analytics

The API's support for YouTube Insight data was officially deprecated as of September 30, 2013, and the YouTube Data API (v2) was officially deprecated as of March 4, 2014. Even though one can still use the YouTube Data API (v3), it is easier to go directly to YouTube's analytics site.

YouTube Analytics monitors the performance of channels and videos with up-to-date metrics and reports. There is a large amount of data available in different reports (e.g. Views, Traffic sources, Demographics). One can create different types of charts and also make a snapshot with the YouTube Creator Studio.

3.3.3. Google+

Google+ does not have its own analytics dashboard. One can see statistics about the users (not about the pages) by using Google Webmaster Tools and navigating to Author Stats. It only shows data from the last 15 days.

3.3.4. Twitter

In order to collect data sets from Twitter, users can reach Twitter Analytics at the link, where they only have to sign in to examine the dashboards (Twitter Maps, Twitter activity, Followers).

3.4. Individual Data Analysis

Individual data analysis is a process of gathering and analyzing data for each individual. With this process one can track personal details related to one's average usage of the web. Just like collective data analysis, individual data analysis draws on the number of clicks, views, time spent on a web page, and similar information in order to show an average value of the data for the individual user.

3.4.1. Analyzing Web Browser History Trends

In order to analyze web browser history trends, one can use tools which gather the web browsing history of a user and display data according to several criteria (most visited websites, frequency and time distribution of visits, etc.).

* History Trends - Chrome add-on (interactive charts)

* Visual History - Chrome add-on (interactive graph)

* About Me - Mozilla add-on (charts)
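
The criteria mentioned above (most visited websites, time distribution of visits) boil down to simple counting once the history is available as a list of visits. The sketch below assumes the history has already been exported as pairs of URL and ISO timestamp; the export format and the sample entries are hypothetical, and the add-ons listed read the browser's history directly instead.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse


def visits_per_domain(history):
    """Count visits per website (host name)."""
    return Counter(urlparse(url).netloc for url, _ in history)


def visits_per_hour(history):
    """Count visits per hour of the day (0-23)."""
    return Counter(datetime.fromisoformat(stamp).hour for _, stamp in history)


# Hypothetical exported history: (URL, ISO 8601 timestamp).
history = [
    ("https://example.org/news", "2014-09-22T09:15:00"),
    ("https://example.org/sport", "2014-09-22T09:40:00"),
    ("https://wiki.example.net/DataTools", "2014-09-22T21:05:00"),
]

print(visits_per_domain(history).most_common())
print(sorted(visits_per_hour(history).items()))
```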

3.4.2. Analyzing Web Navigation Time

For analyzing web navigation time, one can use tools which gather data based on the time a user spends online in relation to the browser history.

* TimeStats - Chrome add-on (charts and graphs)

* Mind The Time - Mozilla add-on (graph)

3.4.3. Analyzing Personal Daily Activities (Personal Analytics)

There are web-based applications which facilitate the recording of individual daily activities and subsequently provide the user with an analysis for personal analytics purposes. These include:

* Daytum - An application which allows the collection, categorization and communication of everyday data. All the basic functionalities can be used for free.

* Google Account Activity - An application for users with a Google account, which provides statistics on their email usage and web search activities.

3.4.4. Social Media Management Tools

* Klout - Klout’s mission is to help people understand and leverage their influence.

* Kred - A social-media scoring system that seeks to measure a person’s online influence. Kred, created by the San Francisco-based social analytics company PeopleBrowsr, also attempts to measure the engagement (or, as they call it, the outreach) of a person or a company.

* Hootsuite - Allows one to monitor and post to multiple social networks, including Facebook and Twitter.

* WhoUnfollowedMe - Who.unfollowed.me is a service that helps one track one's unfollowers in real time, without waiting for a DM or email.

* Twitter Counter - A site for tracking one’s Twitter stats.

* Pinerly - Users can schedule pins, receive real-time analytics on the pins that are most effective, use multiple accounts, and also bookmark.

* TweetPsych - TweetPsych uses linguistic analysis algorithms (RID and LIWC) to build a psychological profile of a person based on the content of their tweets. The service analyzes a user’s last 1000 tweets.

* Social Buzz - A real-time search engine for Facebook, Twitter and Google+. It was designed to provide a different kind of user experience for curious users, and deep analytics for marketing professionals, including categories such as post types, top links/domains, keywords, sentiment, top users and posts.


Topic revision: 22 Sep 2014, JosipBatinic