Seeing big data through the cloud

The International Data Corporation (IDC) predicts worldwide revenues for public cloud services will nudge $73 billion (£47 billion) by 2015. In Gartner’s view, the market will be worth $150 billion (£98 billion) by 2014. Forrester Research estimates it will hit $241 billion (£157 billion) six years after that.

But while the numbers differ, analysts are in widespread agreement about the direction of travel. Not only is the market for public cloud computing services booming, it is doing so at precisely the moment when big data is growing exponentially – resulting in a compelling collision of platforms and infrastructures.

However, beyond the out-of-the-ballpark predictions (after all, this is an industry in which hype is hardly unknown), what are the payoffs and risks for businesses considering cloud-based data storage?

According to Andrew Greenway, Accenture’s global cloud computing programme lead, a characteristic of big data is that business requirements for information from the data tend to change quickly. “Therefore, when dealing with big data projects, it can be a costly and risky task to build the infrastructure required yourself,” he says.

Data storage is simply running out of control and the cloud offers a solution

“Flexible service provision, through the cloud, allows the business to pay for what they use, when they want to use it. By using cloud services, organisations do not have to build clusters of storage, which risk becoming an under-utilised investment as projects end and data requirements shift.

“Many organisations are taking a hybrid approach including cloud and on-premises technology. Bearing in mind the complexity and required integration between data sources and the new technologies coming onto the market, it’s important to choose the right solutions carefully, otherwise it is easy to spend a lot on data warehouses that don’t then deliver the flexibility the business demands.”

Donald Farmer, vice president of product management at QlikView, says the challenges with big data and cloud storage are threefold. “First, there isn’t really one thing called ‘the cloud’. If you have data from many sources, they are typically spread through many different clouds and the challenge is how you manage the complexity of multiple clouds,” he says.

“Second, we often talk about ‘the three Vs of big data’ – velocity, variety and volume. There’s a fourth V, too – vagueness. People really don’t know what they want to do with the data or how they go about finding the slice of data that’s relevant to their particular business problem.

“Third is the distinction between data that is born in the cloud and data that is moved there for storage. That’s a shift from keeping your data on premises to keeping it ‘on promises’. And that, psychologically, feels more dangerous.

“The opportunities are really considerable. The biggest one is the freeing up of resources. Data storage is simply running out of control and it’s becoming a tremendous challenge to business. Cloud storage offers a solution to that. The opportunities around providing applications in the cloud, which are simpler to administer, are also very attractive.”

Tim Moreton, chief executive of Acunu, says the very low cost of scalability is really important for fast-growing businesses. Streaming video service Netflix is a great example, he says. “They have stated that they would not have been able to hit the subscriber numbers they have built in the last three years without using the cloud for all of their storage, processing and distribution of online videos,” says Mr Moreton.

“The reason for that is they would not have had the money to build data-centre capacity fast enough. So for organisations which are growing very quickly, with data at the heart of their business models, the cloud is really important.

“The biggest challenge is that it is very expensive to get your data out, once you’ve got it in. So for big organisations that means you are tying yourself to a future in which you process your data in the cloud.

“While there are benefits, it also means that some of those protections, such as security and regulatory requirements, can get circumvented. But if you look at the availability of a service, like Amazon’s and the other big cloud-hosting providers, they are probably far more reliable than most organisations’ data centres.”
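Mr Moreton’s lock-in point can be made concrete with a rough, back-of-the-envelope calculation. The per-gigabyte rates and dataset size below are purely hypothetical placeholders, not any provider’s actual pricing – a sketch of the economics only.

```python
# Back-of-the-envelope sketch of cloud "data gravity": once a large
# dataset is stored in the cloud, egress fees make pulling it back out
# expensive relative to leaving it where it is. All rates below are
# hypothetical placeholders, not real provider pricing.

def egress_cost(dataset_gb, rate_per_gb):
    """One-off cost of moving the full dataset out of the cloud."""
    return dataset_gb * rate_per_gb

def storage_cost(dataset_gb, rate_per_gb):
    """Monthly cost of keeping the dataset in place."""
    return dataset_gb * rate_per_gb

dataset_gb = 500_000  # 500 TB, a mid-sized "big data" estate

print(f"One-off egress:  ${egress_cost(dataset_gb, 0.09):,.0f}")   # $45,000
print(f"Monthly storage: ${storage_cost(dataset_gb, 0.02):,.0f}")  # $10,000
```

Under these invented rates, a single full export costs several times a month’s storage bill, which is the economic pull towards processing data where it already lives.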

Jim Dietz, product marketing manager at Teradata, says the challenges with big data, in the areas of volume and velocity, are availability and security. “When you want to do important things with data, to be able to analyse it in detail and get real business value out of it, you need to do it quickly,” he says. “That’s especially true of web-click data.

“Achieving consistent speed in the public cloud is often hard. We’re seeing people handling big data for business intelligence and analytics more in the private cloud, either on premises or in controlled premises, where they know they can guarantee the availability, speed of access and the security of the data.

“Of the opportunities, one is that business analysis becomes much more affordable in a cloud environment, because now you are able to take the resources of that infrastructure and share them among a large number of business functions and processes so that the cost per analysis goes down.”
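Mr Dietz’s cost-sharing argument can be sketched numerically; all the figures below are invented for illustration.

```python
# Toy illustration of the cost-sharing argument: a fixed infrastructure
# cost spread over more business functions drives down the cost per
# analysis. All figures are invented.

def cost_per_analysis(monthly_infra_cost, analyses_per_month):
    """Share of the fixed platform cost borne by each analysis."""
    return monthly_infra_cost / analyses_per_month

infra = 50_000  # hypothetical monthly cost of a shared analytics platform

# One department running 100 analyses a month bears the whole cost...
print(cost_per_analysis(infra, 100))    # 500.0 per analysis

# ...but ten departments sharing the same platform dilute it tenfold.
print(cost_per_analysis(infra, 1_000))  # 50.0 per analysis
```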

FOCUS

What is dirty data? And why it matters

When analysing data, you’re trying to identify patterns and relationships. Any “noise”, anything that obscures the information you want, can loosely be described as “dirty data”.

Angela Eager, TechMarketView research director, explains: “Dirty data is becoming more of a problem because not only do we have a greater volume of data, but we are also drawing in more data from unstructured sources, which are inherently dirty, such as data from social media, sentiment analysis, GPS, RFID and smart metering.

“Traditionally, dirty data referred to things like duplicated records, but that’s not appropriate in today’s information world. Take duplicate data into a website-visits environment, for example, and one person making multiple visits is valuable information.

“Combating dirty data is all about understanding the context using art backed by science. There are a lot of tools emerging, such as sentiment and text-analysis tools, that can look out for keywords, perhaps used in the same phrase, and which use background algorithms to determine whether the relationship between those words is significant.”
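The two ideas in this section – that blind de-duplication can destroy signal, and that keyword co-occurrence within a phrase can surface significant relationships – can be sketched in a few lines of Python. The visit log, keyword pairs and sample text are all invented for illustration; real sentiment and text-analysis tools use far richer models than this.

```python
import re
from collections import Counter

# 1. Naive cleansing can destroy signal: in a website-visits log,
#    "duplicate" rows are repeat visits, which are valuable.
visits = [
    ("alice", "/pricing"),
    ("alice", "/pricing"),   # a repeat visit, not a dirty duplicate
    ("alice", "/pricing"),
    ("bob",   "/pricing"),
]
deduplicated = set(visits)      # traditional duplicate removal
visit_counts = Counter(visits)  # context-aware: keep the repetition
print(len(deduplicated))                    # 2 -- interest signal lost
print(visit_counts[("alice", "/pricing")])  # 3 -- signal preserved

# 2. A minimal keyword co-occurrence check: flag phrases in which two
#    keywords of interest appear together.
KEYWORD_PAIRS = [("outage", "refund"), ("slow", "cancel")]

def significant_phrases(text, pairs=KEYWORD_PAIRS):
    """Return (phrase, pair) for each phrase containing both keywords."""
    phrases = re.split(r"[.!?]\s*", text.lower())
    hits = []
    for phrase in phrases:
        words = set(re.findall(r"[a-z]+", phrase))
        hits += [(phrase, pair) for pair in pairs if set(pair) <= words]
    return hits

sample = "The outage lasted hours and I want a refund. Great service otherwise!"
print(significant_phrases(sample))
```

A production tool would replace the literal keyword pairs with learned associations, but the shape of the computation – split into phrases, then test relationships within each phrase – is the same.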