Old 12-31-2010, 09:58 PM
Atif CEYLAN
 
Default PostgreSQL+ZFS

Hi all,
I have a large PostgreSQL database system and I want to migrate it to a new
and fast storage system (10 Gb/s FC network). I have 150x3 SSD disks in my
db server, and I want to either use an ext4 filesystem (RAID 5) on the SSDs
as xlog storage, or use ZFS (RAID-Z) as a disk buffer cache.

What is your idea?

--
/**
* @author Atıf CEYLAN
* Software Developer & System Admin
* http://www.atifceylan.com
*/


 
Old 01-01-2011, 03:24 PM
Stan Hoeppner
 
Default PostgreSQL+ZFS

Atif CEYLAN put forth on 12/31/2010 4:58 PM:
> Hi all,
> I have a large PostgreSQL database system and I want to migrate it to a new
> and fast storage system (10 Gb/s FC network). I have 150x3 SSD disks in my
> db server, and I want to either use an ext4 filesystem (RAID 5) on the SSDs
> as xlog storage, or use ZFS (RAID-Z) as a disk buffer cache.
> What is your idea?

In as few words as possible? You're a nut job. The mere mention of
using FUSE ZFS in any production context on Linux proves it. As does
mentioning running RAID 5 on 3 SSDs, or on any SSDs, or running RAID 5
in a db context, for that matter.

What exactly are you really asking for advice on? You've left out the
most important detail: what your database application is, and what you
actually do with it.

What type of data does your db house?
How much data? Total GB?
Are you currently short of space?
Are you currently short of IOPS capacity?
How many concurrent transactions?
What types of transactions?
Is it read only/heavy or transactional like point of sale?

What is a "large postgresql database system"? What exactly do you mean
by this? Does large mean heavy transaction load? Or does it simply
mean lots of data housed? Or is it simply BS?

Giving any relevant recommendations WRT SAN, SSD, or filesystem
performance is meaningless at this point when you've given exactly zero
details about your database workload. The workload drives all other
aspects of the system design. For all we know your db could simply
contain your pirated music collection and would be completely
comfortable on a single 1TB SATA disk.

Give us the pertinent details and we can probably give you some decent
"ideas", as you requested.

--
Stan


 
Old 01-01-2011, 07:16 PM
"Boyd Stephen Smith Jr."
 
Default PostgreSQL+ZFS

In <4D1F5543.3010108@hardwarefreak.com>, Stan Hoeppner wrote:
> Atif CEYLAN put forth on 12/31/2010 4:58 PM:
>> I have a large PostgreSQL database system and I want to migrate it to a new
>> and fast storage system (10 Gb/s FC network). I have 150x3 SSD disks in my
>> db server, and I want to either use an ext4 filesystem (RAID 5) on the SSDs
>> as xlog storage, or use ZFS (RAID-Z) as a disk buffer cache.
>> What is your idea?
>
> In as few words as possible? You're a nut job. The mere mention of
> using FUSE ZFS in any production context on Linux proves it.

Agreed. I wouldn't consider btrfs or ZFS for production work on Linux right
now.

>As does
>mentioning running RAID 5 on 3 SSDs,

Is your problem with RAID5 or the SSDs?

Sudden disk failure can occur with SSDs, just like with magnetic media. If
you are going to use them in a production environment they should be RAIDed
like any disk.

RAID 5 on SSDs is sort of odd though. RAID 5 is really a poor man's RAID;
yet, SSDs cost quite a bit more than magnetic media for the same amount of
storage.

>or any SSDs for that matter, or

SSDs intended as HD replacements support more read/write cycles per block than
you will use in many decades, even if you saturated the drive's I/O the
entire time.

SSDs intended as HD replacements are generally faster than magnetic media,
though it varies based on manufacturer and workload.

I see little to no problem using SSDs in a production environment.

>running RAID 5 in a db context, for that matter.

Some people just hate on RAID 5. It is fine for its intended purpose, which
is LOTS of storage with some redundancy on identical (or near-identical)
drives. I've run (and recovered) it on 3-6 drives.

However, RAID 1/0 is vastly superior in terms of reliability and speed. It
costs a bit more for the same amount of usable space, but it is worth it.

In a DB context in particular, you are probably going to be doing many small
reads. RAID 5 does not speed up those operations significantly, whereas a
good RAID 1/0 will reduce seek time by nearly 50%.

I suggest you use RAID 1/0 on your SSDs, quite a few RAID 1/0 implementations
will work with 3 drives. RAID 1/0 should be a little more performant and a
little less CPU intensive than RAID 5 for transaction logs. As far as file
system, I think ext3 would be fine for this workload, although it would
probably be worth it to benchmark against ext4 to see if it gives any
improvement.
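
For illustration, a minimal sketch of the 3-drive RAID 1/0 idea, assuming
Linux mdadm as the implementation (device names and the md device number
below are placeholders, not from the original message):

# Minimal sketch, assuming Linux mdadm as the 3-drive RAID 1/0 implementation.
# md's raid10 personality accepts an odd device count by rotating the two
# "near" copies of each block across the drives. Paths are placeholders.
import subprocess

devices = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # hypothetical: the three SSDs

subprocess.run(
    ["mdadm", "--create", "/dev/md0",
     "--level=10",           # md raid10 personality
     "--layout=n2",          # two "near" copies of every block
     "--raid-devices=3",     # an odd device count is allowed
     *devices],
    check=True,
)

# Then put the transaction-log filesystem on it, e.g. ext3 as suggested above:
# subprocess.run(["mkfs.ext3", "/dev/md0"], check=True)
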
--
Boyd Stephen Smith Jr. ,= ,-_-. =.
bss@iguanasuicide.net ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-'
http://iguanasuicide.net/ \_/
 
Old 01-01-2011, 08:58 PM
Atif CEYLAN
 
Default PostgreSQL+ZFS

On 01/01/2011 06:24 PM, Stan Hoeppner wrote:

> Atif CEYLAN put forth on 12/31/2010 4:58 PM:
>> Hi all,
>> I have a large PostgreSQL database system and I want to migrate it to a new
>> and fast storage system (10 Gb/s FC network). I have 150x3 SSD disks in my
>> db server, and I want to either use an ext4 filesystem (RAID 5) on the SSDs
>> as xlog storage, or use ZFS (RAID-Z) as a disk buffer cache.
>> What is your idea?
>
> In as few words as possible? You're a nut job. The mere mention of
> using FUSE ZFS in any production context on Linux proves it. As does
> mentioning running RAID 5 on 3 SSDs, or on any SSDs, or running RAID 5
> in a db context, for that matter.
>
> What exactly are you really asking for advice on? You've left out the
> most important detail: what your database application is, and what you
> actually do with it.
>
> What type of data does your db house?

95% text type, 5% blob type records.

> How much data? Total GB?

~300 GB

> Are you currently short of space?

No, I don't need more space.

> Are you currently short of IOPS capacity?

Yes.

> How many concurrent transactions?

Minimum 100-200, maximum 800-1000 concurrent transactions.

> What types of transactions?

Usually update and insert.

> Is it read only/heavy or transactional like point of sale?

I didn't understand this.

> What is a "large postgresql database system"? What exactly do you mean
> by this? Does large mean heavy transaction load? Or does it simply
> mean lots of data housed? Or is it simply BS?

Heavy transaction load.

> Giving any relevant recommendations WRT SAN, SSD, or filesystem
> performance is meaningless at this point when you've given exactly zero
> details about your database workload. The workload drives all other
> aspects of the system design. For all we know your db could simply
> contain your pirated music collection and would be completely
> comfortable on a single 1TB SATA disk.
>
> Give us the pertinent details and we can probably give you some decent
> "ideas", as you requested.





--
/**
* @author Atıf CEYLAN
* Software Developer & System Admin
* http://www.atifceylan.com
*/


 
Old 01-02-2011, 01:30 AM
Stan Hoeppner
 
Default PostgreSQL+ZFS

Boyd Stephen Smith Jr. put forth on 1/1/2011 2:16 PM:

> Is your problem with RAID5 or the SSDs?

RAID 5

> Sudden disk failure can occur with SSDs, just like with magnetic media. If

This is not true. The failure modes and rates for SSDs are the same as
other solid state components, such as system boards, HBAs, and PCI RAID
cards, even CPUs (although SSDs are far more reliable than CPUs due to
the lack of heat generation). SSDs only have two basic things in common
with mechanical disk drives: permanent data storage and a block device
interface. SSDs, as the first two letters of the acronym tell us, have
more in common with other integrated circuit components in a system. Can
an SSD fail? Sure. So can a system board. But how often do your
system boards fail? *That* is the comparison you should be making WRT
SSD failure rates and modes, *not* comparing SSDs with HDDs.

> you are going to use them in a production environment they should be RAIDed
> like any disk.

I totally disagree. See above. However, if one is that concerned about
SSD failure, instead of spending the money required to RAID (verb) one's
db storage SSDs simply for fault recovery, I would recommend freezing
and snapshotting the filesystem to a sufficiently large SATA drive, and
then running differential backups of the snapshot to the tape silo.
Remember, you don't _need_ RAID with SSDs to get performance. Mirroring
one's boot/system device is about the only RAID scenario I'd ever
recommend for SSDs, and even here I don't feel it's necessary.

> RAID 5 on SSDs is sort of odd though. RAID 5 is really a poor man's RAID;
> yet, SSDs cost quite a bit more than magnetic media for the same amount of
> storage.

Any serious IT professional needs to throw out his old storage cost
equation. Size doesn't matter and hasn't for quite some time. Everyone
has more storage than they can possibly ever use. Look how many
free providers (Gmail) are offering _unlimited_ storage.

The storage cost equation should no longer be based on capacity (it should
never have been, IMO), but on capability. The disk drive manufacturers have
falsely convinced buyers over the last decade that size is _the_
criterion on which to base purchasing decisions. This couldn't be further
from the truth. Mechanical drives have become so cavernous that most users
never come close to using the available capacity, not even 25% of it.
SSDs actually cost *less* than HDDs under the equation people should be
using, which is based on _capability_. It is not measured in dollars but
as an absolute number, where a higher score is better:

storage_value=((IOPS+throughput)/unit_cost) + (MTBF/1M) - power_per_year

Power_per_year depends on local utility rates which can vary wildly
depending on locale. For this comparison I'll use kWh pricing of $0.12,
which is the PG&E average in the Los Angeles area.

For a Seagate 146GB 15k rpm SAS drive ($170):
http://www.newegg.com/Product/Product.aspx?Item=N82E16822148558
storage_value = ((274 + 142) / 170) + (1.6) - 110
storage_value = -106

For an OCZ Vertex II 160GB SSD SATA II device ($330):
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227686
storage_value = ((50000 + 250) / 330) + (2.0) - 18
storage_value = 136

Notice the mechanical drive ended up with a substantial negative score,
and that the SSD is 242 points ahead due to massively superior IOPS.
This is because in today's high energy cost world, performance is much
more costly when using mechanical drives. The Seagate drive above
represents the highest performance mechanical drive available. It cost
$170 (bare drive) to acquire but costs $110 per year to operate in a
24x7 enterprise environment. Two years energy consumption will be
greater than the acquisition cost. By contrast, running the SSD costs a
much more reasonable $18 per year, and it will take 18 years of energy
consumption to surpass the acquisition cost. As the published MTBF
ratings on the devices are so similar, 1.6 vs 2 million hours, this has
zero impact on the final ratings.

Ironically, the SSD is actually slightly _larger_ in capacity than the
mechanical drive in this case, as the SSDs fall between 120GB and 160GB,
and I chose the larger pricier option to give the mechanical drive more
of a chance. It doesn't matter. The SSD could cost $2000 and it will
still win by a margin of 115, for two reasons: 182 times the IOPS
performance and 1/6th the power consumption.

For the vast majority of enterprise/business workloads, IOPS and power
consumption are far more relevant than total storage space,
especially for transactional database systems. The above equation bears
this out.
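
To make the arithmetic easy to check, here is a small sketch that simply
restates the equation and plugs in the figures quoted above (nothing below
is measured independently):

# Sketch of the storage_value equation, using the numbers quoted above:
# IOPS, MB/s throughput, purchase price, MTBF in hours, and estimated
# yearly power cost at $0.12/kWh.

def storage_value(iops, throughput_mbs, unit_cost, mtbf_hours, power_per_year):
    return (iops + throughput_mbs) / unit_cost + mtbf_hours / 1_000_000 - power_per_year

# Seagate 146GB 15k SAS drive: $170, ~274 IOPS, ~142 MB/s, 1.6M hours MTBF, ~$110/yr power
print(round(storage_value(274, 142, 170, 1_600_000, 110)))    # -106

# OCZ Vertex 2 160GB SATA II SSD: $330, ~50,000 IOPS, ~250 MB/s, 2.0M hours MTBF, ~$18/yr power
print(round(storage_value(50_000, 250, 330, 2_000_000, 18)))  # 136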

> SSDs intended as HD replacements support more read/write cycles per block than
> you will use for many decades, even if you were using all the disk I/O the
> entire time.

Yep. Most SSDs will, regardless of price.

> SSDs intended as HD replacements are generally faster than magnetic media,
> though it varies based on manufacturer and workload.

All of the currently shipping decent quality SSDs outrun a 15k SAS drive
in every performance category. You'd have to buy a really low end
consumer model, such as the cheap A-Datas and Kingstons, to get less
streaming throughput than a SAS drive. And, obviously, every SSD, even
the el cheapos, runs IOPS circles around the fastest mechanicals.

But if we're talking strictly about a business environment, one is going
to be buying higher end SSD models. And you don't have to go all that far
up the price scale either. Now that there are so many great controller
chips available, the major price factor in SSDs is no longer performance
but size: the more flash chips in the device, the higher the cost. The
high performance controller chips (SandForce et al.) no longer have that
much bearing on price.

> I see little to no problem using SSDs in a production environment.

Me neither.

> Some people just hate on RAID 5. It is fine for its intended purpose, which
> is LOTS of storage with some redundancy on identical (or near-identical)
> drives. I've run (and recovered) it on 3-6 drives.

It's fine in two categories:

1. You never suffer power failure or a system crash
2. Your performance needs are meager

Most SOHO setups do fine with RAID 5. For any application that stores
large volumes of rarely or never-changing data, it's fine. For any
application that performs constant random IO, such as a busy mail server
or db server, you should use RAID 10.

> However, RAID 1/0 is vastly superior in terms of reliability and speed. It
> costs a bit more for the same amount of usable space, but it is worth it.

Absolutely agree on both counts, except in one particular case: with the
same drive count, RAID 5 can usually outperform RAID 10 in streaming
read performance, but not by much. RAID 5 reads require no parity
calculations so you get almost the entire spindle stripe worth of
performance. Where RAID 10 really shines is in mixed workloads. Throw
a few random writes into the streaming RAID 5 workload mentioned above
and it will slow things down quite dramatically. RAID 10 doesn't suffer
from this. Its performance is pretty consistent even with simultaneous
streaming and random workloads.

> I suggest you use RAID 1/0 on your SSDs, quite a few RAID 1/0 implementations
> will work with 3 drives. RAID 1/0 should be a little more performant and a
> little less CPU intensive than RAID 5 for transaction logs. As far as file
> system, I think ext3 would be fine for this workload, although it would
> probably be worth it to benchmark against ext4 to see if it gives any
> improvement.

Again, RAID isn't necessary for SSDs.

Also, I really, really wish people would stop repeating this crap about
mdraid's various extra "RAID 10" *layouts* being RAID 10! They are NOT
RAID 10!

There is only one RAID 10, and the name and description have been with
us for over 15 years, LONG before Linux had a software RAID layer.
Also, it's not called "RAID 1+0" or "RAID 1/0". It is simply called
"RAID 10", again, for 15+ years now. It requires 4, or more, even
number of disks. RAID 10 is a stripe across multiple mirrored pairs.
Period. There is no other definition of RAID 10. All of Neil's
"layouts" that do not meet the above description _are not RAID 10_ no
matter what he, or anyone else, decided to call them!!
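
For concreteness, a toy sketch of that definition, purely illustrative:
with four disks there are two mirrored pairs, every block is written to
both members of one pair, and consecutive blocks are striped across the
pairs.

# Toy model of classic 4-disk RAID 10: two mirrored pairs, blocks striped
# across the pairs. Purely illustrative; not a real driver.

NUM_PAIRS = 2  # pair 0 = disks 0 and 1, pair 1 = disks 2 and 3

def raid10_placement(logical_block):
    pair = logical_block % NUM_PAIRS       # stripe across the mirrored pairs
    offset = logical_block // NUM_PAIRS    # position within that pair
    disks = (2 * pair, 2 * pair + 1)       # both mirror members get a copy
    return disks, offset

for block in range(6):
    disks, offset = raid10_placement(block)
    print("logical block %d -> disks %s, offset %d" % (block, disks, offset))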

Travel through your time machine back to 1995-2000 and go into the BIOS
firmware menu of a Mylex, AMI, Adaptec, or DPT PCI RAID controller.
They all say RAID 10, and they all used the same "layout", which is
hardware sector mirroring of two disks and striping filesystem blocks
across those mirrored pairs.

/end RAID 10 nomenclature rant

--
Stan


 
Old 01-02-2011, 03:50 AM
Stan Hoeppner
 
Default PostgreSQL+ZFS

Atif CEYLAN put forth on 1/1/2011 3:58 PM:
> On 01/01/2011 06:24 PM, Stan Hoeppner wrote:

>> How much data? Total GB?

> ~300 GB

>> Are you currently short of space?

> no, don't need more space.

Perfect.

>> Are you currently short of IOPS capacity?

> yes

Got it.

>> How many concurrent transactions?

> minimum 100-200 transactions, maximum 800-1000 concurrent transactions.

>> What types of transactions?

> usually update and insert

Write heavy.

>> What is a "large postgresql database system"? What exactly do you mean
>> by this? Does large mean heavy transaction load? Or does it simply
>> mean lots of data housed? Or is it simply BS?

> heavy transaction load.


Cool. If you don't need more than 300GB of space the answer is easy.
Get one of these 120,000 random write IOPS 360GB RevoDrive PCIe x4 cards
and put everything on it, db files, transaction logs, all of it. For
less than $1200 USD you'll get the IOPS performance of an 800-disk,
15k rpm RAID 10 Fibre Channel SAN array from EMC, costing about $2
million USD. Your latency will be an order of magnitude lower, though,
because the flash is connected directly to your PCIe bus. The only
thing such a SAN setup would have that you won't is dozens of terabytes
of space and more link throughput, neither of which you need.

You only need the additional IOPS, not the space, so you save $2 million
and get superior performance to boot. This is the true power and
economy of SSD technology, and how its price should be evaluated, not
dollars per gigabyte, but dollars per IOPS and dollars per watt. The $$
spent on the electric bill for a year of running that EMC array with its
many racks of disk trays would buy you dozens of these RevoDrive cards.

http://www.newegg.com/Product/Product.aspx?Item=N82E16820227662
http://www.ocztechnology.com/products/solid-state-drives/pci-express/revodrive/ocz-revodrive-x2-pci-express-ssd-.html

120,000 4k random write IOPS (overkill)
400 MB/s sustained write throughput (overkill)
PCI Express x4 interface
This is not a drive, but a PCB solution. Supreme reliability, just like
a motherboard. No mirroring or RAID required. Simply snapshot the
filesystem and dump it to tape or D2D using differential backup.

This card works fine with Linux if you have a recent kernel, even though
OCZ targets the desktop with this model. The 512GB Z-Drive card they
target at "servers and workstations" has only 1/10th the write IOPS
capability of the RevoDrive 380, and is $600 more expensive. As far as
I can tell the Z-Drive has no advantage, except possibly official technical
support.

Also, I recommend using the XFS filesystem due to its superior direct IO
performance with databases. Configure PGSQL to use direct IO. When you
make the XFS filesystem, consume the entire drive, creating 36
allocation groups. Refer to "man mkfs.xfs". This will maximize
parallel IOPS throughput to the SSD. Buy this card and do these things,
and you will be absolutely stunned by the performance you get out of it.
This storage card with XFS on top should easily handle 100,000 inserts
_per second_ if you have enough CPU horsepower to drive that load.
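
As a rough sketch of that mkfs step (the device node and mount point below
are placeholders; 36 allocation groups is simply the figure suggested
above, set via mkfs.xfs's agcount option):

# Rough sketch of the mkfs.xfs step described above. Device node and mount
# point are placeholders; agcount=36 is the allocation-group count suggested
# in the post.
import subprocess

device = "/dev/ssdcard"              # hypothetical block device for the PCIe SSD
mountpoint = "/var/lib/postgresql"   # hypothetical database location

subprocess.run(["mkfs.xfs", "-f", "-d", "agcount=36", device], check=True)
subprocess.run(["mount", device, mountpoint], check=True)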

If you go this route, please let us know how well it works for you. I'm
sure many here would be eager to know. Well, others besides myself.

--
Stan


 
Old 01-02-2011, 05:30 AM
"Boyd Stephen Smith Jr."
 
Default PostgreSQL+ZFS

On Saturday 01 January 2011 20:30:31 Stan Hoeppner wrote:
> Boyd Stephen Smith Jr. put forth on 1/1/2011 2:16 PM:
> > Is your problem with RAID5 or the SSDs?
>
> RAID 5
>
> > Sudden disk failure can occur with SSDs, just like with magnetic media.
> > If
>
> This is not true.

This is true. While single-block failures are most likely, controller
failures will cause a whole disk to fail. This is similar to a daughter card
failing. While rare, I've certainly seen it happen, and some NAS setups use
multipath across two HBAs to avoid the downtime associated with an HBA failure.
This is very similar to RAID 1 across 2 SSDs.

> The failure modes and rates for SSDs are the same as
> other solid state components, such as system boards, HBAs, and PCI RAID
> cards, even CPUs (although SSDs are far more reliable than CPUs due to
> the lack of heat generation).

Agreed.

> > you are going to use them in a production environment they should be
> > RAIDed like any disk.
>
> I totally disagree.

Respectfully disagree. However, I do see your point that RAIDing SSDs is not
*as* critical as RAIDing magnetic media.

>
> > RAID 5 on SSDs is sort of odd though. RAID 5 is really a poor man's
> > RAID; yet, SSDs cost quite a bit more than magnetic media for the same
> > amount of storage.
>
> Any serious IT professional needs to throw out his old storage cost
> equation. Size doesn't matter and hasn't for quite some time. Everyone
> has more storage than they can possibly ever use. Look how many
> free*providers (Gmail) are offering _unlimited_ storage.

I know I don't have all the local storage I need, and I have 6TB attached to
my desktop. It's currently full to the point where I can't archive data that
I acquire on less reliable media.

I think the old equations are still valuable. If capacity is not a priority,
or is easily satisfied, your observations are particularly valuable.

> Also, I really, really, wish people would stop repeating this crap about
> mdraid's various extra "RAID 10" *layouts* being RAID 10! They are NOT
> RAID 10!
>
> There is only one RAID 10, and the name and description have been with
> us for over 15 years, LONG before Linux had a software RAID layer.

> Also, it's not called "RAID 1+0" or "RAID 1/0". It is simply called
> "RAID 10", again, for 15+ years now.

Simply not true. The correct naming for layered RAID has never been
standardized. I frown on the "RAID 10" naming because it looks like it should
be pronounced "RAID Ten".

> It requires 4, or more, even
> number of disks. RAID 10 is a stripe across multiple mirrored pairs.
> Period. There is no other definition of RAID 10. All of Neil's
> "layouts" that do not meet the above description _are not RAID 10_ no
> matter what he, or anyone else, decided to call them!!

While this is pedantically true, it is a rather silly distinction to make.
With all the layouts, the disks are divided into a number of blocks, then
pairs of these blocks are mirrored and the data is striped across all the
mirrors. This builds a RAID 1/0, where the "disks" are just parts of the
physical disks. The "D" in RAID refers to physical disks, but for quite a
while RAID has been put into practice on top of various abstraction layers,
so the mdadm blocks certainly qualify.
--
Boyd Stephen Smith Jr. ,= ,-_-. =.
bss@iguanasuicide.net ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-'
http://iguanasuicide.net/ \_/
 
